<a href="https://colab.research.google.com/github/max36067/chinese-news/blob/master/bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import glob
fn = glob.glob("/content/drive/My Drive/chinese/*")
fn

['/content/drive/My Drive/chinese/bert_model.ckpt.index',
 '/content/drive/My Drive/chinese/bert_config.json',
 '/content/drive/My Drive/chinese/bert_model.ckpt.meta',
 '/content/drive/My Drive/chinese/vocab.txt',
 '/content/drive/My Drive/chinese/bert_model.ckpt.data-00000-of-00001',
 '/content/drive/My Drive/chinese/content.csv',
 '/content/drive/My Drive/chinese/news_content.csv']

In [0]:
import pandas as pd
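# glob does not guarantee file order, so name the file explicitly instead of using fn[6]
train_df = pd.read_csv("/content/drive/My Drive/chinese/news_content.csv", encoding="utf-8")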
train_df = train_df.dropna().drop([0])
train_df.columns = ["content", "label"]
train_df

Unnamed: 0,content,label
1,鴻海 (2317-TW) 與日本系統晶片 (SoC) 業者索思未來 (Socionext)、...,2
2,秋冬旅遊補助政策下激勵飯店業者業績表現，業者公布的去年 12 月合併營收來看，陸續繳出亮眼成...,2
3,材料 - KY(4763-TW) 擴產效益顯現，去年第 4 季營收 6.07 億元創新高， ...,2
4,美國消費性電子展 (CES) 北美時間 7 日正式開展，台廠除聯發科 (2454-TW) 發...,2
5,美系外資指出，LED 廠晶電 (2448-TW) 今年 Mini LED 專案持續增加，負責...,2
...,...,...
199,水處理環保設備及技術服務商基士德 - KY (6641-TW) 2019 年 11 月營收以...,2
200,軸承廠兆利 (3548-TW) 今年積極搶灘折疊手機、雙螢幕筆電等新應用領域，全年業績成長估...,2
201,藥華藥 (6446-TW) 旗下創新生物藥百斯瑞明 (Besremi) 去年初取得歐盟核准後...,2
203,看好伺服器產業回溫，加上市場傳微軟取得美國國防部雲端合約，有助台廠供應鏈營運，外資看好緯創 ...,2


In [0]:
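# Build the token -> id mapping from the BERT vocabulary (one token per line).
# The path is given explicitly because glob does not guarantee file order.
vocab_path = "/content/drive/My Drive/chinese/vocab.txt"
token_dict = {}
with open(vocab_path, encoding="utf-8") as reader:
  for line in reader:
    token = line.strip()
    token_dict[token] = len(token_dict)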

In [0]:
# Map label ids to sentiment names
trans = {"2": '正面',  # positive
         '1': '中立',  # neutral
         '0': '負面'}  # negative

In [0]:
from bert4keras.tokenizer import Tokenizer
from bert4keras.snippets import sequence_padding
import numpy as np
# fit_generator expects a generator, so batches are produced lazily with yield
tokenize = Tokenizer(token_dict, do_lower_case=True)
def data_generator(x, y, batch=32):
  while True:
    count = 0
    idx = np.arange(len(x))
    # Shuffle so the model does not see the samples in the same order every pass
    np.random.shuffle(idx)
    indices, labels = [], []
    for l in idx:
      # encode returns the token ids plus segment ids that tell the first and second text apart
      ids, segments = tokenize.encode(first_text=x[l], max_length=256)
      indices.append(ids)
      labels.append(y[l])
      count += 1
      # fit_generator takes no separate x and y, so each yield must return both together
      if count == batch or l == idx[-1]:
        indices = sequence_padding(indices)
        # Labels are scalars, so only the token ids need padding
        yield [np.array(indices), np.zeros_like(indices)], np.array(labels)
        indices, labels = [], []
        count = 0
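
The batching and padding logic above can be checked without BERT or Drive. The following is a minimal sketch using only NumPy; `pad_batch`, `iter_batches`, and the toy ids are hypothetical names and values made up for illustration, not part of bert4keras, and the sketch does a single pass rather than looping forever.

```python
import numpy as np

def pad_batch(sequences, padding=0):
    """Right-pad variable-length id lists to the longest list in the batch."""
    length = max(len(seq) for seq in sequences)
    out = np.full((len(sequences), length), padding, dtype="int64")
    for row, seq in enumerate(sequences):
        out[row, :len(seq)] = seq
    return out

def iter_batches(token_ids, labels, batch_size):
    """Yield ([ids, segment_ids], labels) for one shuffled pass over the data."""
    order = np.random.permutation(len(token_ids))
    for start in range(0, len(order), batch_size):
        chosen = order[start:start + batch_size]
        ids = pad_batch([token_ids[i] for i in chosen])
        # Single-text input: every segment id is 0
        yield [ids, np.zeros_like(ids)], np.array([labels[i] for i in chosen])

# Made-up token ids and labels, only to exercise the shapes
toy_ids = [[101, 7, 8, 102], [101, 9, 102], [101, 5, 6, 4, 3, 102]]
toy_labels = [2, 0, 1]
batches = list(iter_batches(toy_ids, toy_labels, batch_size=2))
for (ids, segments), y in batches:
    print(ids.shape, segments.shape, y.shape)
```

Three samples with a batch size of 2 give one full batch and one partial batch, which is the case the `l == idx[-1]` check in the generator above is there to handle.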

In [0]:
from sklearn.model_selection import train_test_split
batch = 16
x_train, x_test, y_train, y_test = train_test_split(train_df['content'], train_df['label'], test_size=0.1)
x_train = x_train.values
x_test = x_test.values
y_train = y_train.values
y_test = y_test.values
train = data_generator(x_train, y_train, batch)
test = data_generator(x_test, y_test, batch=4)
print(test)  # a generator object; batches are only produced when it is iterated

<generator object data_generator at 0x7fcd8ac019e8>


In [0]:
from bert4keras.bert import build_bert_model
from bert4keras.backend import set_gelu
from bert4keras.optimizers import Adam
from keras.layers import Dropout, Dense
from keras.models import Model

set_gelu("tanh")
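# Name the files explicitly rather than relying on glob order
config_path = "/content/drive/My Drive/chinese/bert_config.json"
checkpoint_path = "/content/drive/My Drive/chinese/bert_model.ckpt"
bert_model = build_bert_model(
    config_path,
    checkpoint_path,
    return_keras_model=False,
    with_pool=True
)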
# No need to stack many layers after BERT; on its own it already predicts well
x = Dropout(0.1)(bert_model.model.output)
x = Dense(3, activation="softmax", kernel_initializer=bert_model.initializer)(x)
model = Model(bert_model.model.input, x)
model.compile(
    loss="sparse_categorical_crossentropy",
    optimizer=Adam(2e-5),
    metrics=["accuracy"],
)


==> searching: bert/embeddings/word_embeddings, found name: bert/embeddings/word_embeddings
==> searching: bert/embeddings/token_type_embeddings, found name: bert/embeddings/token_type_embeddings
==> searching: bert/embeddings/position_embeddings, found name: bert/embeddings/position_embeddings
==> searching: bert/embeddings/LayerNorm/gamma, found name: bert/embeddings/LayerNorm/gamma
==> searching: bert/embeddings/LayerNorm/beta, found name: bert/embeddings/LayerNorm/beta
==> searching: bert/encoder/layer_0/attention/self/query/kernel, found name: bert/encoder/layer_0/attention/self/query/kernel
==> searching: bert/encoder/layer_0/attention/self/query/bias, found name: bert/encoder/layer_0/attention/self/query/bias
==> searching: bert/encoder/layer_0/attention/self/key/kernel, found name: bert/encoder/layer_0/attention/self/key/kernel
==> searching: bert/encoder/layer_0/attention/self/key/bias, found name: bert/encoder/layer_0/attention/self/key/bias
==> searching: bert/encoder/layer_

In [0]:
model.summary()

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
Input-Token (InputLayer)        (None, None)         0                                            
__________________________________________________________________________________________________
Sequence-Mask (ZeroMasking)     (None, None)         0           Input-Token[0][0]                
__________________________________________________________________________________________________
Input-Segment (InputLayer)      (None, None)         0                                            
__________________________________________________________________________________________________
Embedding-Token (Embedding)     (None, None, 768)    16226304    Sequence-Mask[0][0]              
____________________________________________________________________________________________

In [0]:
from keras.callbacks import ModelCheckpoint, EarlyStopping

filepath = "/content/drive/My Drive/model/weights-improvement-{epoch:02d}-{val_loss:.2f}-{val_acc:.2f}.h5"
cp_callback = ModelCheckpoint(
    filepath=filepath, verbose=1, save_best_only=True,
)
early_stopping = EarlyStopping(patience=3, restore_best_weights=True)
callbacks_list = [cp_callback, early_stopping]
model.fit_generator(
    train,
    validation_data=test,
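    # Step counts should be integers; round up so the last partial batch is included
    validation_steps=int(np.ceil(len(x_test) / 4)),
    steps_per_epoch=int(np.ceil(len(x_train) / batch)),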
    verbose=1,
    epochs=10,
    callbacks=callbacks_list,
)

Epoch 1/10

Epoch 00001: val_loss improved from inf to 0.94736, saving model to /content/drive/My Drive/model/weights-improvement-01-0.95-0.53.h5
Epoch 2/10

Epoch 00002: val_loss improved from 0.94736 to 0.40766, saving model to /content/drive/My Drive/model/weights-improvement-02-0.41-0.95.h5
Epoch 3/10

Epoch 00003: val_loss did not improve from 0.40766
Epoch 4/10

Epoch 00004: val_loss did not improve from 0.40766
Epoch 5/10

Epoch 00005: val_loss improved from 0.40766 to 0.38243, saving model to /content/drive/My Drive/model/weights-improvement-05-0.38-0.95.h5
Epoch 6/10

Epoch 00006: val_loss did not improve from 0.38243
Epoch 7/10

Epoch 00007: val_loss improved from 0.38243 to 0.21064, saving model to /content/drive/My Drive/model/weights-improvement-07-0.21-0.89.h5
Epoch 8/10

Epoch 00008: val_loss did not improve from 0.21064
Epoch 9/10

Epoch 00009: val_loss did not improve from 0.21064
Epoch 10/10

Epoch 00010: val_loss did not improve from 0.21064


<keras.callbacks.History at 0x7fcd883664a8>
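
The step counts passed to `fit_generator` depend on how many batches the generator emits per pass over the data. A minimal sketch of that arithmetic, assuming the generator emits a final partial batch as the one above does; `steps_for` is a hypothetical helper and the sample counts are made-up examples, not the sizes of this dataset.

```python
import math

def steps_for(n_samples, batch_size):
    """Number of batches needed to cover n_samples, counting a partial last batch."""
    return math.ceil(n_samples / batch_size)

# Made-up sizes for illustration
print(steps_for(182, 16))  # 182 / 16 = 11.375, so 12 batches
print(steps_for(21, 4))    # 21 / 4 = 5.25, so 6 batches
print(steps_for(20, 4))    # divides evenly, so 5 batches
```

Plain division would give a fractional step count, and flooring it would skip the last partial batch on every pass.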

In [0]:
from keras.models import load_model
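# Note: this checkpoint name does not match any file saved in the training log above;
# point it at the checkpoint you actually want to load
new_model = load_model('/content/drive/My Drive/model/weights-improvement-03-0.14-1.0.h5')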

In [0]:
# Evaluate on 4 batches from the test generator; returns [loss, accuracy]
new_model.evaluate_generator(test, steps=4, verbose=1)



[0.054578081766764326, 1.0]

In [0]:
text = '''美陸雙方15日正式簽署第一階段貿易協議，雙方貿易戰暫時休兵。但分析師認為，這項協議僅結束全球經濟的部分不確定性，但美國與其貿易夥伴的緊張關係恐將持續，預計川普政府下個目標將是對歐洲提高關稅。

外媒報導，針對美陸簽署第一階段貿易協議，Strategas政策研究主管Daniel Clifton認為，美國與大陸未來仍將維持緊張局勢，包括網路、國家安全和人權方面，「這些問題並不會消失，但對標普500指數成分股的獲利影響沒有關稅升級來的大」。

花旗集團經濟學家Cesar Rojas也預期，美陸貿易緊張局勢今年會持續上演。他表示，美方在第一階段協議中利用關稅向大陸施壓，但現在可能會改為對陸投資進行更嚴格的監管審查。

報導指出，世界貿易組織先前裁決空巴長期獲得政府非法補貼，讓美國揚言開徵報復性關稅；加上法國要求谷歌、蘋果等美企繳交數位稅，美方也考慮對法國葡萄酒等商品徵收關稅，預計川普對歐洲加徵關稅成為新風險。

美東時間15日，美國總統川普與大陸國務院副總理劉鶴正式簽署首階段貿易協議，本次協議主要內容包含陸方將在未來兩年採購超過2000億美元的美國產品，同時陸方將開放金融市場，停止強迫美方向陸方轉讓技術。'''
# x1: token ids, x2: segment ids
x1, x2 = tokenize.encode(first_text=text)

In [0]:
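# x1 and x2 have no batch dimension here, so each token is treated as a separate sample
# and the result has one row per token instead of one row per text
new_model.predict([x1,x2], verbose=1).shape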



(472, 3)
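
The one-row-per-token result above comes from the missing batch dimension. A minimal NumPy sketch of the shape difference; the token ids are made up and no model is involved.

```python
import numpy as np

# Made-up token ids for one short text
token_ids = [101, 5401, 7380, 102]

flat = np.array(token_ids)     # shape (4,): read as 4 samples of length 1
batched = flat.reshape(1, -1)  # shape (1, 4): 1 sample of 4 tokens

print(flat.shape, batched.shape)
```

A model whose input is declared as `(None, None)` expects the second form: the first axis counts texts and the second counts tokens.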

In [0]:
np.array([np.array(x1).reshape(1, -1), np.array(x2).reshape(1, -1)]).shape

(2, 1, 399)

In [0]:
# Add a batch dimension so the model sees one text: shape (1, sequence_length)
a, b = np.array(x1).reshape(1, -1), np.array(x2).reshape(1, -1)

In [0]:
# Index of the most probable class for the single input text
np.argmax(new_model.predict([a,b])[0])

0
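
The class index can be turned back into a readable label with the `trans` mapping defined earlier. A minimal sketch, assuming the model outputs one row of three class probabilities per text; `decode_prediction` is a hypothetical helper and the probabilities below are made up rather than taken from the model.

```python
import numpy as np

# Same mapping as the notebook: label id (as a string) -> sentiment name
trans = {"2": "正面",  # positive
         "1": "中立",  # neutral
         "0": "負面"}  # negative

def decode_prediction(probabilities, label_names=trans):
    """Return the sentiment name for each row of class probabilities."""
    probabilities = np.asarray(probabilities)
    if probabilities.ndim == 1:
        # Treat a single probability vector as a batch of one
        probabilities = probabilities[None, :]
    return [label_names[str(i)] for i in probabilities.argmax(axis=-1)]

# Made-up probabilities standing in for new_model.predict([a, b])
fake_output = np.array([[0.7, 0.2, 0.1]])
print(decode_prediction(fake_output))
```

With this mapping, a predicted index of 0 corresponds to the negative class.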