# Project (2): Training a BERT News Stance Classifier and Improving Its Accuracy

## Project Goal
- TL;DR: Train a BERT classifier (BertForSequenceClassification) on a training dataset of sentence pairs, then verify the model's accuracy on the test dataset.
- Dataset (in archive.zip):
    - Contents: train.csv, test.csv, solution.csv
    - Source: https://www.kaggle.com/wsdmcup/wsdm-fake-news-classification
    - Each row contains two news titles, title1_zh and title2_zh, together with a label describing how the two titles relate: agreed, unrelated, or disagreed

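## Implementation Hints
- STEP1 - STEP4: Data processing
- STEP5: Write the train_batch function
- STEP6: Write the evaluate function
- STEP7: Combine the pieces above and start training; if everything is correct, validation accuracy should exceed 85%
- STEP8: Run the model on the testing dataset and compute the accuracy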

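## Key Takeaways: What You Will Learn from This Project
- How BERT's two-sequence classification task works
- How to use TRAIN / VALID data to monitor how a deep learning model is training
- How powerful pretrained models are for NLP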
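
To make the two-sequence input format concrete, here is a minimal sketch in plain Python. It is only an illustration: the helper name `encode_pair` is hypothetical, and each character stands in for a WordPiece token (the real `BertTokenizer` uses a vocabulary). It follows the standard BERT layout `[CLS] A [SEP] B [SEP]`; the `NewsPairDataset` class later in this notebook builds the same layout but without the trailing `[SEP]`.

```python
def encode_pair(text1, text2, max_len=512):
    """Build BERT-style pair inputs; characters stand in for WordPiece tokens."""
    tokens1, tokens2 = list(text1), list(text2)
    # Reserve 3 positions for [CLS] and the two [SEP] markers.
    budget = max_len - 3
    if len(tokens1) + len(tokens2) > budget:
        half = budget // 2
        tokens1, tokens2 = tokens1[:half], tokens2[:half]
    pieces = ['[CLS]'] + tokens1 + ['[SEP]'] + tokens2 + ['[SEP]']
    first_len = len(tokens1) + 2  # [CLS] + first title + first [SEP]
    # Segment ids tell the model which title each position belongs to.
    token_type_ids = [0] * first_len + [1] * (len(pieces) - first_len)
    attention_mask = [1] * len(pieces)
    return pieces, token_type_ids, attention_mask


pieces, token_type_ids, attention_mask = encode_pair('ab', 'cde')
print(pieces)
print(token_type_ids)
```

The classifier head reads the hidden state at the `[CLS]` position, so both titles must share one sequence for the model to compare them.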

In [1]:
# Mount your own Google Drive so the notebook can read the training data
# and save the trained model
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os

# Current directory
print(os.getcwd())

# change directory
os.chdir('/content/drive/MyDrive/python_training/NLP100Days-part2/project_1_5/')
print(os.getcwd())

/content
/content/drive/MyDrive/python_training/NLP100Days-part2/project_1_5


In [3]:
# from: https://www.kaggle.com/wsdmcup/wsdm-fake-news-classification
# !unzip archive.zip

In [4]:
!python --version

Python 3.7.11


In [5]:
!pip install torch
!pip install transformers
#!pip install -q transformers
# Pin the torchtext version; the runtime must be restarted after installing
!pip install torchtext==0.6.0



In [6]:
import pandas as pd
import numpy as np

import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import random_split
from tqdm.notebook import tqdm

from transformers import BertTokenizer, BertForSequenceClassification

In [7]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [8]:
len(df_train)

320552

In [9]:
df_train.sample(3)

Unnamed: 0,id,tid1,tid2,title1_zh,title2_zh,title1_en,title2_en,label
111742,112046,84468,84469,原来宁泽涛家底丰厚 宁爸喊话转型不游泳 所以最终决定退役？,宁泽涛为什么办理转业手续 转业和退役有什么区别？,"Originally Zhentao Zhaodong is rich, the fathe...",Ning Zhaotao why to transfer procedures for tr...,unrelated
152965,153315,106997,85591,娱乐：天价离婚！赵薇哥哥离婚支付5亿分手费！,曾被朋友排挤，被天价分手费谣言傍身，但她却不吭不声捐4亿元,Entertainment: Divorce extremely expensive! Zh...,"She was outclassed by her friends, and she was...",unrelated
247349,247903,15662,102329,涂磊现场自爆自己离婚的真正原因，全场观众感动落泪！,央视周涛爆离开央视原因，前夫自爆离婚内幕疑有小三，女儿自闭症,The whole audience moved to tears as the real ...,"CCTV Zhoutao's reason why he left CCTV, her hu...",unrelated


In [10]:
df_test.sample(3)

Unnamed: 0,id,tid1,tid2,title1_zh,title2_zh,title1_en,title2_en
19737,341077,173197,173224,赵本山病重暴瘦40斤！前妻回国照料，妻子态度却让网友愤怒,死亡率最高的5款车，给再多补贴都不敢买！,Zhao Benshan was seriously ill and lost weight...,"The five cars with the highest mortality rate,..."
47652,369020,180315,12617,阴阳眼男孩（太极八卦阵现形）,步步惊魂，8岁男孩拥有阴阳眼源于藏在心灵深处的魔鬼？,Yin Yang Eye Boys (Taiji Eight Diagrams),"In a panic, an 8-year-old boy has the yin-yang..."
42750,364108,50544,10001,重磅！中考取消！九年制义务教育将升级为十二年制！,【辟谣】中考取消？九年制义务教育将升级十二年制？官方这样回应,Blockbuster! Canceled! Nine-year compulsory ed...,[Resurrection] The exam will be abolished and ...


In [11]:
df_train = df_train[['title1_zh', 'title2_zh', 'label']].dropna(axis=0).reset_index(drop=True)
df_test = df_test[['id', 'title1_zh', 'title2_zh']].dropna(axis=0).reset_index(drop=True)

In [12]:
len(df_train)

320545

In [13]:
len(df_test)

80125

In [14]:
df_train.sample(3)

Unnamed: 0,title1_zh,title2_zh,label
235574,死掉的蛇还会逃走，小狗会说话？又来一个尸兄！,这条狗已经成精了，还会劝架，这让麻麻怎么忍心下手？,unrelated
192131,户口在农村的人，你的身价将倍增！赶紧围观！,好消息！户口是农村的人赶紧看看，你的身价将倍增！,agreed
62134,事实派 | 柿子和螃蟹同食导致3岁女童死亡？食物相克还要骗多少人？,水果染色？螃蟹注油？ 院士现场做实验破谣言,unrelated


In [15]:
df_test.sample(3)

Unnamed: 0,id,title1_zh,title2_zh
17169,338508,赵丽颖想谈恋爱了，据说对方是个大老板,自曝恋爱中很难相信对方 赵丽颖爆料曾经被甩
32081,353433,这张绿叶贰角纸币收藏价值飙升，快找找家里有没有？,辟谣！南阳卧龙大桥五女跳河全死亡？假的
11600,332914,谢娜曝光赵丽颖瞒了十年的恋情，是他？双方携手会见父母,老交警提示：驾驶证替人销分新规已全面实施，被查住扣12分罚5000


In [16]:
ALL_LABELS = ['agreed', 'unrelated', 'disagreed']

In [17]:
MODEL_NAME = 'bert-base-chinese'

In [18]:
# Build the dataset
class NewsPairDataset(Dataset):
    def __init__(self, tokenizer, df, max_len=512):
        self.tokenizer = tokenizer
        self.df = df
        self.max_len = max_len

    def __getitem__(self, idx):
        text1 = self.df.loc[idx, 'title1_zh']
        text2 = self.df.loc[idx, 'title2_zh']
        label = self.df.loc[idx, 'label'] if 'label' in self.df.columns else None

        text1_tokens = self.tokenizer.tokenize(text1)
        text2_tokens = self.tokenizer.tokenize(text2)
        len_all_tokens = len(text1_tokens) + len(text2_tokens) + 2
        if len_all_tokens > self.max_len:
            limit_num = (self.max_len - 2) // 2
            text1_tokens = text1_tokens[:limit_num]
            text2_tokens = text2_tokens[:limit_num]

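        inputs = {}

        # Note: the standard BERT pair layout ends with a second [SEP];
        # this notebook omits it, and the outputs shown below reflect that.
        word_pieces = ['[CLS]'] + text1_tokens + ['[SEP]'] + text2_tokens
        inputs['input_ids'] = torch.tensor(self.tokenizer.convert_tokens_to_ids(word_pieces))

        # Segment ids: 0 for [CLS] + title 1 + [SEP], 1 for title 2
        pos_sep = word_pieces.index('[SEP]')
        inputs['token_type_ids'] = torch.tensor(
            [0] * (pos_sep + 1) + [1] * (len(word_pieces) - pos_sep - 1),
            dtype=torch.long
        )

        inputs['attention_mask'] = torch.tensor(
            [1] * len(word_pieces),
            dtype=torch.long
        )

        # The test set has no label column, so label stays None there
        if label is not None:
            label = torch.tensor(ALL_LABELS.index(label))

        return inputs, label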

    def __len__(self):
        return len(self.df)


def create_mini_batch(samples):
    input_ids = []
    token_type_ids = []
    attention_mask = []
    labels = []
    for s in samples:
        input_ids.append(s[0]['input_ids'].squeeze(0))
        token_type_ids.append(s[0]['token_type_ids'].squeeze(0))
        attention_mask.append(s[0]['attention_mask'].squeeze(0))
        if s[1] is not None:
            labels.append(s[1])
    # Zero-pad to the same sequence length
    input_ids = pad_sequence(input_ids, batch_first=True)
    token_type_ids = pad_sequence(token_type_ids, batch_first=True)
    attention_mask = pad_sequence(attention_mask, batch_first=True)
 
    if len(labels):
        labels = torch.stack(labels)
        return input_ids, token_type_ids, attention_mask, labels
    else:
        return input_ids, token_type_ids, attention_mask
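
The padding step in `create_mini_batch` can be pictured with plain lists. The sketch below is an illustration only: `pad_batch` is a hypothetical helper, not part of PyTorch, and it mimics what `pad_sequence(..., batch_first=True)` does with its default padding value of 0. Because the attention mask is padded with 0 as well, the padded positions are ignored by the model.

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every list to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Two encoded examples of different lengths (ids shortened for readability).
input_ids = [[101, 753, 102, 6382], [101, 126, 102, 3136, 872, 3297]]
attention_mask = [[1] * len(ids) for ids in input_ids]

padded_ids = pad_batch(input_ids)
padded_mask = pad_batch(attention_mask)
print(padded_ids)
print(padded_mask)
```

Padding per batch rather than to a fixed 512 tokens keeps the tensors small, since news titles are usually short.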

In [19]:
train_batch_size = 32
eval_batch_size = 512

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

dataset = NewsPairDataset(tokenizer, df_train)

train_size = int(0.8 * len(dataset))
valid_size = len(dataset) - train_size
train_dataset, valid_dataset = random_split(dataset, [train_size, valid_size])

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=train_batch_size,
    collate_fn=create_mini_batch,
    shuffle=True)
valid_loader = DataLoader(
    dataset=valid_dataset,
    batch_size=eval_batch_size,
    collate_fn=create_mini_batch)

In [20]:
len(dataset)

320545

In [21]:
train_size

256436

In [22]:
valid_size

64109
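
The split sizes above follow directly from the 80/20 rule in the code. The short calculation below reproduces them and also derives how many batches each loader yields, using the batch sizes defined earlier (32 for training, 512 for evaluation). It is a sanity check, not part of the original notebook.

```python
import math

n_total = 320545                      # rows left in train.csv after dropna
train_size = int(0.8 * n_total)
valid_size = n_total - train_size

train_batches = math.ceil(train_size / 32)    # train_batch_size
valid_batches = math.ceil(valid_size / 512)   # eval_batch_size
print(train_size, valid_size, train_batches, valid_batches)
```

With `max_iter = 3000`, the training loop below processes 3,001 batches, which is well under one full pass over the 8,014 training batches.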

In [23]:
trexa0=train_dataset[0]
trexa0

({'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
  'input_ids': tensor([ 101,  753, 2773, 3189, 3315,  782, 6821, 1921, 3647, 2533, 3297, 1914,
          8024,  680, 1333, 2094, 2486, 3187, 1068,  102, 6382, 6237, 8038,  776,
           691, 4635, 3340, 3300, 1525,  763, 1947, 4385, 3175, 3791, 8024, 4635,
          3340, 2990, 4385, 2582,  720, 3082,  868, 3221, 1415, 2128, 1059, 8043]),
  'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])},
 tensor(1))

In [24]:
vaexa0=valid_dataset[0]
vaexa0

({'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
  'input_ids': tensor([ 101,  126,  702, 6395, 3209, 1912, 3215,  782, 2100, 1762, 4638, 6228,
          7574, 1333, 4276,  102, 3136,  872, 3297, 2571, 6862,  679, 1353, 2486,
          4638, 1121, 5503, 3175, 3791,  517,  676,  518,  671, 1453, 4607, 1061,
          3165,  679, 5709,  671, 1146, 7178, 3119, 5966, 1568]),
  'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])},
 tensor(1))

In [25]:
def train_batch(model, data, optimizer, device):
    model.train()
    input_ids, token_type_ids, attention_mask, labels = [d.to(device) for d in data]  
     
    # Code Here
    outputs = model(
        input_ids=input_ids,
        token_type_ids=token_type_ids,
        attention_mask=attention_mask,
        labels=labels
    )
    loss = outputs.loss
    # End

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
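
When `labels` is passed, `BertForSequenceClassification` returns a loss alongside the logits; for integer class labels with more than one class this is a cross-entropy loss averaged over the batch. The sketch below recomputes that quantity in plain Python to show what `outputs.loss` measures. The helper names are hypothetical and the logits are made up.

```python
import math


def cross_entropy(logits, label):
    """Negative log-softmax probability of the true class for one example."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[label]


def batch_loss(batch_logits, labels):
    """Mean cross-entropy over the batch, mirroring a mean-reduced loss."""
    losses = [cross_entropy(row, y) for row, y in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


# Three classes in the order of ALL_LABELS: agreed, unrelated, disagreed.
uniform = batch_loss([[0.0, 0.0, 0.0]], [1])     # model has no preference
confident = batch_loss([[0.0, 5.0, 0.0]], [1])   # model favors the true class
print(uniform, confident)
```

A classifier with no preference among three classes scores ln 3 (about 1.10), which is a useful reference point when reading early training losses.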

In [26]:
def evaluate(model, valid_loader):
    model.eval()
    device = 'cuda' if next(model.parameters()).is_cuda else 'cpu'

    tot_count = 0
    tot_loss = 0
    tot_correct = 0

    with torch.no_grad():
        for data in tqdm(valid_loader):
            input_ids, token_type_ids, attention_mask, labels = [d.to(device) for d in data]

            # Code Here
            outputs = model(
                input_ids=input_ids,
                token_type_ids=token_type_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            
            tot_count += input_ids.size(0)
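            # Note: outputs.loss is already the mean over the batch, so the
            # 'loss' reported below (tot_loss / tot_count) is that mean divided
            # again by the sample count, not a true per-sample loss.
            tot_loss += outputs.loss.item()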
            tot_correct += (outputs.logits.argmax(dim=-1) == labels).sum().item()
            # End
    
    evaluation = {
        'loss': tot_loss / tot_count,
        'acc': tot_correct / tot_count
    }
    return evaluation
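
One detail of `evaluate` is worth spelling out: `outputs.loss` is a per-batch mean, yet the function sums those means and divides by the number of samples, so the reported loss is scaled down by roughly the batch size. The sketch below contrasts that with a sample-weighted average. `mean_loss_per_sample` is a hypothetical helper and the loss values are made up.

```python
def mean_loss_per_sample(batch_mean_losses, batch_sizes):
    """Combine per-batch mean losses into one mean over all samples."""
    total = sum(loss * n for loss, n in zip(batch_mean_losses, batch_sizes))
    return total / sum(batch_sizes)


# Two full batches of 512 and a final partial batch of 100 (made-up losses).
losses = [0.30, 0.28, 0.40]
sizes = [512, 512, 100]

weighted = mean_loss_per_sample(losses, sizes)
naive = sum(losses) / sum(sizes)  # sum of batch means divided by sample count
print(weighted, naive)
```

Accuracy is unaffected, because `tot_correct` counts individual samples; only the loss value changes scale.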

In [27]:
# Train the model
max_iter = 3000
lr = 0.00001

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = BertForSequenceClassification.from_pretrained( # Code Here
    MODEL_NAME,
    num_labels=3,
    output_attentions=False,
    output_hidden_states=False,
    return_dict=True
)
model.to(device)

optimizer = optim.RMSprop(model.parameters(), lr=lr)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

i = 0
is_running = True
while is_running:
    for train_data in train_loader:
        loss = train_batch(model, train_data, optimizer, device)

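        if i > 0 and i % 100 == 0:
            # Use a separate name so the dataset-level train_size is not overwritten.
            # loss is already a batch mean; the printed value divides it by the batch size again.
            batch_size = train_data[0].size(0)
            print('train_loss: ', loss / batch_size)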

        if i > 0 and i % 1000 == 0:
            evaluation = evaluate(model, valid_loader)
            print('valid_evaluation: loss={loss}, acc={acc}'.format(**evaluation))
            scheduler.step()
        
        if i == max_iter:
            is_running = False
            break

        i += 1

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

train_loss:  0.012152617797255516
train_loss:  0.005161607638001442
train_loss:  0.010041726753115654
train_loss:  0.010328305885195732
train_loss:  0.006715657655149698
train_loss:  0.015198231674730778
train_loss:  0.005970424041152
train_loss:  0.005436922889202833
train_loss:  0.011263207532465458
train_loss:  0.00946464017033577




valid_evaluation: loss=0.0005762199950051963, acc=0.8685831942472976
train_loss:  0.006535601802170277
train_loss:  0.012154866941273212
train_loss:  0.0058312914334237576
train_loss:  0.009797118604183197
train_loss:  0.013259915634989738
train_loss:  0.009110102429986
train_loss:  0.009444396942853928
train_loss:  0.0085610281676054
train_loss:  0.00699634337797761
train_loss:  0.009018557146191597




valid_evaluation: loss=0.0005268898799639602, acc=0.8817638709073609
train_loss:  0.008791367523372173
train_loss:  0.004658510442823172
train_loss:  0.011942029930651188
train_loss:  0.008838261477649212
train_loss:  0.009457585401833057
train_loss:  0.008571647107601166
train_loss:  0.006406482309103012
train_loss:  0.010475623421370983
train_loss:  0.003468181937932968
train_loss:  0.0077444082126021385




valid_evaluation: loss=0.000514975650941363, acc=0.8840568406931945
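
The learning-rate schedule in the loop above is easy to misread. `StepLR` multiplies the learning rate by `gamma` once every `step_size` calls to `scheduler.step()`, and here that call happens once per validation run (every 1,000 iterations), not once per epoch. The sketch below evaluates the resulting rate in plain Python; `step_lr` is a hypothetical helper that mirrors the closed form of the schedule.

```python
def step_lr(base_lr, step_size, gamma, num_scheduler_steps):
    """Learning rate after `num_scheduler_steps` calls to scheduler.step()."""
    return base_lr * gamma ** (num_scheduler_steps // step_size)


# Settings from the notebook: lr=1e-5, step_size=2, gamma=0.1.
for calls in range(4):
    print(calls, step_lr(1e-5, step_size=2, gamma=0.1, num_scheduler_steps=calls))
```

Under these settings the rate stays at 1e-5 through the first 2,000 iterations and drops to about 1e-6 after the second validation run.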


## Testing

In [28]:
# Test
test_dataset = NewsPairDataset(tokenizer, df_test)

test_loader = DataLoader(
    dataset=test_dataset,
    batch_size=eval_batch_size,
    collate_fn=create_mini_batch)

with torch.no_grad():
    pred = []
    for data in tqdm(test_loader):
        input_ids, token_type_ids, attention_mask = [d.to(device) for d in data]

        outputs = model(
            input_ids=input_ids,
            token_type_ids=token_type_ids,
            attention_mask=attention_mask
        )
        indexes = outputs.logits.argmax(dim=-1).cpu().tolist()
        pred += [ALL_LABELS[i] for i in indexes]

df_result = df_test[['id']].copy()
df_result['pred'] = pred
df_result.to_csv('result.csv', index=None)

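
The prediction step above takes the argmax of the logits for each row and maps the index back to a label name through `ALL_LABELS`. A minimal plain-Python sketch of that mapping, with made-up logits and a hypothetical helper name, looks like this:

```python
ALL_LABELS = ['agreed', 'unrelated', 'disagreed']


def logits_to_labels(batch_logits):
    """Pick the highest-scoring class per row and map it to its label name."""
    indexes = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    return [ALL_LABELS[i] for i in indexes]


predicted = logits_to_labels([[2.1, 0.3, -1.0], [-0.5, 1.7, 0.2], [0.0, 0.1, 3.4]])
print(predicted)
```

The order of `ALL_LABELS` must match the order used when the training labels were converted to indices, otherwise the predicted names will be wrong even if the model is right.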




In [29]:
tet0=test_dataset[0]
tet0

({'attention_mask': tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
  'input_ids': tensor([ 101, 5855, 2861, 6622,  782, 3698, 4255, 3476,  106, 1812, 1350, 2600,
          5320, 1920, 6848, 3313, 1346, 6848, 5815, 4636,  674, 6848, 4873, 4385,
           818, 2600, 5320, 1327, 1213, 2255, 1920,  102, 6792, 6469, 8013, 7027,
          3203, 2135, 3175, 1415, 6371, 6589, 1825, 2209, 1217, 4673, 1164, 4289,
          3855, 8024, 7410, 6887, 3221,  817, 3419, 3766, 6448, 2879, 8043]),
  'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
          1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])},
 None)

In [30]:
df_ans = pd.read_csv('solution.csv')
df_ans.rename(columns={'Id': 'id'}, inplace=True)
df = df_ans.merge(df_result, on='id', how='left')
test_acc = np.mean(df['Expected'] == df['pred'])
print(f'test accuracy: {test_acc}')

test accuracy: 0.8718518333624541
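
The accuracy above counts every row equally, while `solution.csv` also carries a `Weight` column. Assuming that column is meant for a weighted accuracy (the notebook does not say so; this is a guess based on the column name), the metric could be sketched as follows. The rows are made up and `weighted_accuracy` is a hypothetical helper.

```python
def weighted_accuracy(expected, predicted, weights):
    """Sum of weights on correct rows divided by the sum of all weights."""
    correct = sum(w for e, p, w in zip(expected, predicted, weights) if e == p)
    return correct / sum(weights)


# Toy rows modelled on the columns of solution.csv (values are made up).
expected = ['unrelated', 'unrelated', 'agreed', 'agreed']
predicted = ['unrelated', 'agreed', 'agreed', 'agreed']
weights = [0.0625, 0.0625, 0.066667, 0.066667]

score = weighted_accuracy(expected, predicted, weights)
print(score)
```

With equal weights the function reduces to plain accuracy, so it can be checked against the unweighted number first.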


In [31]:
df

Unnamed: 0,id,Expected,Weight,Usage,pred
0,347448,unrelated,0.062500,Public,unrelated
1,347449,unrelated,0.062500,Private,unrelated
2,359100,unrelated,0.062500,Public,agreed
3,359101,unrelated,0.062500,Private,unrelated
4,359102,unrelated,0.062500,Private,unrelated
...,...,...,...,...,...
80121,398016,agreed,0.066667,Private,agreed
80122,398011,unrelated,0.062500,Private,unrelated
80123,384939,unrelated,0.062500,Private,agreed
80124,398013,agreed,0.066667,Private,agreed


In [33]:
df_result

Unnamed: 0,id,pred
0,321187,unrelated
1,321190,unrelated
2,321189,unrelated
3,321193,unrelated
4,321191,unrelated
...,...,...
80120,401559,unrelated
80121,401560,unrelated
80122,401562,unrelated
80123,401563,unrelated


In [32]:
df_ans

Unnamed: 0,id,Expected,Weight,Usage
0,347448,unrelated,0.062500,Public
1,347449,unrelated,0.062500,Private
2,359100,unrelated,0.062500,Public
3,359101,unrelated,0.062500,Private
4,359102,unrelated,0.062500,Private
...,...,...,...,...
80121,398016,agreed,0.066667,Private
80122,398011,unrelated,0.062500,Private
80123,384939,unrelated,0.062500,Private
80124,398013,agreed,0.066667,Private


test accuracy: 0.8742855003369693