# Project (2): Building the Dataset for a BERT News Stance Classifier

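## Project Goal
- Goal: build NewsPairDataset, a two-sentence classification dataset in the input format that BertForSequenceClassification expects
- Dataset (in archive.zip):
    - Contents: train.csv, test.csv, solution.csv
    - Source: https://www.kaggle.com/wsdmcup/wsdm-fake-news-classification
    - The data contains two news titles, title1_zh and title2_zh, together with the relation between the two news items, which is one of: agreed, unrelated, disagreed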

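## Implementation Hints
- STEP 1: Unzip archive.zip and read train.csv and test.csv
- STEP 2: Subclass torch.utils.data.Dataset to implement NewsPairDataset. This requires the BERT tokenizer (see the official documentation for BertForSequenceClassification). Note that the two sentences must be separated by the [SEP] token
- STEP 3: Samples from NewsPairDataset differ in length, so implement a collate_fn that zero-pads each batch to a common sequence length
- STEP 4: Use torch.utils.data.DataLoader to create train_loader and valid_loader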
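
Step 2 can be sketched without loading a tokenizer. The snippet below is a minimal illustration, assuming a character-level split as a stand-in for the real tokenizer (in practice, use `BertTokenizer`). The helper name `build_pair_tokens` is hypothetical. It shows the `[CLS] A [SEP] B [SEP]` layout of a sentence pair and the matching `token_type_ids` and `attention_mask`.

```python
def build_pair_tokens(text1, text2):
    """Lay out two sentences the way BERT expects a sentence pair.

    Assumption: list(text) stands in for tokenizer.tokenize(text).
    """
    tokens1 = list(text1)
    tokens2 = list(text2)
    tokens = ['[CLS]'] + tokens1 + ['[SEP]'] + tokens2 + ['[SEP]']
    # Segment 0 covers [CLS], sentence 1, and the first [SEP];
    # segment 1 covers sentence 2 and the final [SEP].
    token_type_ids = [0] * (len(tokens1) + 2) + [1] * (len(tokens2) + 1)
    # Every real token is attended to; padding is added later, per batch.
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask


tokens, token_type_ids, attention_mask = build_pair_tokens('你好', '再见')
print(tokens)          # ['[CLS]', '你', '好', '[SEP]', '再', '见', '[SEP]']
print(token_type_ids)  # [0, 0, 0, 0, 1, 1, 1]
```

The three lists always have the same length, which is what lets a collate function pad them together later.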

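## Key Takeaways: What You Will Learn from This Project
- How to read and process NLP data into a dataset suitable for two-sentence classification with BertForSequenceClassification
- How BERT's two-sequence classification task works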
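
One practical detail of the two-sequence task is the input length limit: `bert-base-chinese` accepts at most 512 positions, including the special tokens. The sketch below shows a "longest-first" truncation strategy, which trims whichever sentence is currently longer until the pair fits. This is one common option, not necessarily the one used in this notebook, and `truncate_pair` and its parameters are hypothetical names.

```python
def truncate_pair(tokens1, tokens2, max_len=512, n_special=3):
    """Trim the longer sequence first until the pair fits in max_len.

    n_special reserves room for [CLS] and the two [SEP] tokens.
    """
    tokens1, tokens2 = list(tokens1), list(tokens2)
    budget = max(max_len - n_special, 0)
    while len(tokens1) + len(tokens2) > budget:
        if len(tokens1) > len(tokens2):
            tokens1.pop()
        else:
            tokens2.pop()
    return tokens1, tokens2


# Toy example with a tiny limit so the effect is visible.
a, b = truncate_pair(list('abcdefgh'), list('xy'), max_len=10)
print(a)  # ['a', 'b', 'c', 'd', 'e']
print(b)  # ['x', 'y']
```

Compared with cutting both sentences to the same fixed length, this keeps a short sentence intact when only the other one is long.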

In [1]:
# Mount your own Google Drive first; it holds the training data and stores the trained model
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os

# Current directory
print(os.getcwd())

# change directory
os.chdir('/content/drive/MyDrive/python_training/NLP100Days-part2/project_1_4/')
print(os.getcwd())

/content
/content/drive/MyDrive/python_training/NLP100Days-part2/project_1_4


In [3]:
## from: https://www.kaggle.com/wsdmcup/wsdm-fake-news-classification
# !unzip archive.zip

In [4]:
!python --version

Python 3.7.11


In [5]:
!pip install torch
!pip install transformers
#!pip install -q transformers
# Pin the torchtext version; the runtime must be restarted after installation
!pip install torchtext==0.6.0



In [6]:
import pandas as pd
import numpy as np

import torch
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import random_split
from tqdm.notebook import tqdm

from transformers import BertTokenizer, BertForSequenceClassification

In [7]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

In [8]:
df_train.sample(3)

Unnamed: 0,id,tid1,tid2,title1_zh,title2_zh,title1_en,title2_en,label
55512,55626,48509,48514,中国怒了！喊话美国：敢挑战中国底线，军舰入台，就是武统之日,蔡英文发推谢美国后，大陆再划武统台湾倒计时,China is angry! Shouts US: dare to challenge C...,Tsai Ing-wen tweet her thanks to the United St...,unrelated
66959,67118,56562,56566,什么样的体质更适合冬病夏治？专家教你一招辨别自己是否适合！,一招教你如何辨别刚买的手机是否是全新机，非常实用的手机功能,What kind of physique is more suitable for win...,One way to tell if the phone you just bought i...,unrelated
261204,261781,149851,84843,王思聪放弃网红追腾讯千金，两人机场牵手相约吃饭？网友：我反对,马化腾辟谣王思聪追求其女儿，网友戏侃：瞧不上王思聪？,Wang Sicong gives up the net to chase Tencent'...,Ma Huateng discours the rumor that the rumour ...,disagreed


In [9]:
df_test.sample(3)

Unnamed: 0,id,tid1,tid2,title1_zh,title2_zh,title1_en,title2_en
48776,370140,15679,15682,阿里CEO助理：网传“阿里内部培训”谣言与我司毫无关系,阿里辟谣了：网传“阿里内部培训”谣言与我公司毫无关系,Ali CEO's Assistant: The Rumor of Ali's Intern...,"Ali says: ""Ali in-training"" rumors have nothin..."
66687,378105,186161,186159,韩红希望自己能活久一点，但始终拒绝做体检，背后的原因让人心酸,韩红希望自己能活久一点，但始终拒绝做体检，到底发生了什么事,"Han Hongxiong hoped to live a little longer, b...",Han Hongxiong hoped she could live a little lo...
19852,341189,17020,17062,赵本山病重，徒弟也接连出走！妻子态度让人心寒，网友：缺德了,60岁赵本山瘦骨嶙峋卧病在床，妻子前妻大打出手，只为这些遗产,"Zhao Benshan is very ill, and his disciples ar...","60-year-old Zhao Benshan was in bed, his wife ..."


In [10]:
df_train = df_train[['title1_zh', 'title2_zh', 'label']].dropna(axis=0).reset_index(drop=True)
df_test = df_test[['id', 'title1_zh', 'title2_zh']].dropna(axis=0).reset_index(drop=True)

In [11]:
df_train.sample(3)

Unnamed: 0,title1_zh,title2_zh,label
186706,总是失眠睡不着？多吃点这4种食物，一觉睡到自然醒！,失眠怎么办？别担心，睡前用六个方法轻松入眠到天亮,unrelated
118892,吃这几种水果，比吃肉还容易胖！减肥的女人必看！,1克干燕窝相当于40个蛋？“国燕委”辟谣你怎么看？,unrelated
52593,严查棋牌室，60岁以下打麻将的拘留，央视都播了？,辟谣｜景德镇市新厂有人打麻将猝死了？还惊动了民警！,unrelated


In [12]:
df_test.sample(3)

Unnamed: 0,id,title1_zh,title2_zh
75398,396833,鸡蛋和它一起煮会中毒，甚至致癌，很多人都已经中招了！,鸡蛋和它一起吃，不但容易造成便秘，还会致癌，你还在吃吗？
51556,382931,马云再一次提起2018年赚钱离不开这几个项目。,马云揭秘2018年最赚钱 最吃香的行业 会带动一大批普通人致富 2
22608,343949,身体这几个地方，反映出女人的生育能力，学习了！,"女性朋友生育能力强不强,就看这4个地方!看过你会觉得很准哦!"


In [13]:
ALL_LABELS = ['agreed', 'unrelated', 'disagreed']
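
The model works with integer class indices, so each string label is mapped to its position in `ALL_LABELS`. The following is a small sketch of the round trip. `label2id`, `id2label`, and the example logits are hypothetical names and values used only for illustration.

```python
ALL_LABELS = ['agreed', 'unrelated', 'disagreed']

# Build both directions once instead of calling list.index() per sample.
label2id = {label: i for i, label in enumerate(ALL_LABELS)}
id2label = {i: label for label, i in label2id.items()}

# Hypothetical logits for one title pair, one score per label.
logits = [0.2, 2.5, -1.0]
predicted_id = max(range(len(logits)), key=lambda i: logits[i])

print(label2id['disagreed'])   # 2
print(id2label[predicted_id])  # unrelated
```

Note that `agreed` maps to index 0, which is falsy in Python. A truthiness check such as `if label_id:` would wrongly skip it, so compare against `None` instead.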

In [14]:
MODEL_NAME = 'bert-base-chinese'

In [15]:
# Build the dataset
class NewsPairDataset(Dataset):
    def __init__(self, tokenizer, df, max_len=512):
        self.tokenizer = tokenizer
        self.df = df
        self.max_len = max_len

    def __getitem__(self, idx):
        text1 = self.df.loc[idx, 'title1_zh']
        text2 = self.df.loc[idx, 'title2_zh']
        label = self.df.loc[idx, 'label'] if 'label' in self.df.columns else None

        # Code Here
        # Tokenize both titles
        text1_tokens = self.tokenizer.tokenize(text1)
        text2_tokens = self.tokenizer.tokenize(text2)

        # Reserve 3 positions for [CLS] and the two [SEP] tokens
        num_special = 3
        if len(text1_tokens) + len(text2_tokens) + num_special > self.max_len:
            limit_num = (self.max_len - num_special) // 2
            text1_tokens = text1_tokens[:limit_num]
            text2_tokens = text2_tokens[:limit_num]

        # BERT sentence-pair layout: [CLS] A [SEP] B [SEP]
        word_pieces = ['[CLS]'] + text1_tokens + ['[SEP]'] + text2_tokens + ['[SEP]']
        pos_sep = word_pieces.index('[SEP]')

        inputs = {}  # named `inputs` to avoid shadowing the built-in input()
        inputs['input_ids'] = torch.LongTensor(self.tokenizer.convert_tokens_to_ids(word_pieces))
        # Segment 0: [CLS] + title 1 + first [SEP]; segment 1: title 2 + final [SEP]
        inputs['token_type_ids'] = torch.LongTensor(
            [0] * (pos_sep + 1) + [1] * (len(word_pieces) - pos_sep - 1)
        )
        inputs['attention_mask'] = torch.LongTensor([1] * len(word_pieces))

        # End

        # The test set has no label, so check against None explicitly
        if label is not None:
            label = torch.tensor(ALL_LABELS.index(label))

        return inputs, label

    def __len__(self):
        return len(self.df)


def create_mini_batch(samples):
    input_ids = []
    token_type_ids = []
    attention_mask = []
    labels = []
    for inputs, label in samples:
        # Each tensor is already 1-D, so no squeeze is needed
        input_ids.append(inputs['input_ids'])
        token_type_ids.append(inputs['token_type_ids'])
        attention_mask.append(inputs['attention_mask'])
        # Compare against None: the label tensor for class 0 ('agreed') is falsy,
        # so a plain truthiness check would silently drop those labels
        if label is not None:
            labels.append(label)

    # Zero-pad to a common sequence length
    # Code Here
    input_ids = pad_sequence(input_ids, batch_first=True, padding_value=0)
    token_type_ids = pad_sequence(token_type_ids, batch_first=True, padding_value=0)
    attention_mask = pad_sequence(attention_mask, batch_first=True, padding_value=0)

    # End

    if len(labels):
        labels = torch.stack(labels)
        return input_ids, token_type_ids, attention_mask, labels
    else:
        return input_ids, token_type_ids, attention_mask
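
For readers unfamiliar with `pad_sequence`, here is a plain-Python sketch of what zero padding with `batch_first=True` produces for a batch of variable-length samples. `pad_batch` is a hypothetical helper, not part of PyTorch, and the token ids are made up.

```python
def pad_batch(sequences, padding_value=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]


# Made-up token ids for two samples of different lengths.
input_ids = [[101, 872, 102, 1962, 102], [101, 872, 102]]
attention_mask = [[1] * len(ids) for ids in input_ids]

padded_ids = pad_batch(input_ids)
padded_mask = pad_batch(attention_mask)
print(padded_ids)   # [[101, 872, 102, 1962, 102], [101, 872, 102, 0, 0]]
print(padded_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

Because the attention mask is padded with 0 as well, the model can tell real tokens from padding positions.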

In [16]:
train_batch_size = 32
eval_batch_size = 512

tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)

dataset = NewsPairDataset(tokenizer, df_train)

train_size = int(0.8 * len(dataset))
valid_size = len(dataset) - train_size
train_dataset, valid_dataset = random_split(dataset, [train_size, valid_size])

train_loader = DataLoader(
    dataset=train_dataset,
    batch_size=train_batch_size,
    collate_fn=create_mini_batch,
    shuffle=True)
valid_loader = DataLoader(
    dataset=valid_dataset,
    batch_size=eval_batch_size,
    collate_fn=create_mini_batch)
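
The split sizes above follow from simple arithmetic, and `random_split` draws a new random permutation on every run unless a seed is fixed. The sketch below mirrors that idea with the standard library only. `split_indices` and the seed value are assumptions for illustration and do not reproduce what `random_split` does internally.

```python
import math
import random


def split_indices(n_samples, train_fraction=0.8, seed=42):
    """Shuffle indices reproducibly and cut them into train/valid parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    train_size = int(train_fraction * n_samples)
    return indices[:train_size], indices[train_size:]


train_idx, valid_idx = split_indices(1000)
print(len(train_idx), len(valid_idx))  # 800 200

# Number of batches a loader yields when the last partial batch is kept.
print(math.ceil(len(train_idx) / 32))   # 25
print(math.ceil(len(valid_idx) / 512))  # 1
```

With PyTorch itself, passing `generator=torch.Generator().manual_seed(42)` to `random_split` should give a reproducible split in the same spirit.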