# HW9 - Fine-tuning BERT!

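> Task: Starting from BERT, fine-tune a model that decides whether a post is an advertisement (label "1" = advertisement, "0" = not an advertisement).

> Grading:
1. Preprocessing (20%)
2. Training and validating (30%)
3. Predicting (20%)
4. **Try at least two configurations** (10%) (report each parameter combination you tried and its result in a Markdown cell)
5. Bonus from your highest accuracy: (Acc - 0.8) x 100
   For example, if your best model reaches an accuracy of 0.88, you earn an extra (0.88 - 0.8) x 100 = 8 points!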
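
The bonus rule in item 5 is plain arithmetic and can be sketched as a small helper. The function name `bonus_points` is hypothetical, and the rubric does not say what happens below 0.8 accuracy, so this sketch simply applies the formula as written.

```python
def bonus_points(accuracy, baseline=0.8):
    """Bonus from the grading rule: (Acc - 0.8) x 100."""
    # round() hides floating-point noise such as 7.999999999999996
    return round((accuracy - baseline) * 100, 2)


print(bonus_points(0.88))  # 8.0
```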

You are given two datasets, train_df and test_df: train_df contains 4,000 posts that have not been preprocessed, and test_df contains 1,000 cleaned posts.

# Step 1: Preprocessing steps with BERT 

💪 Use the pretrained model "bert-base-chinese" with BertForSequenceClassification (configured as shown below).

💪 Preprocessing steps

1. Add special tokens:
  - [CLS]: start of every sentence (ID 101)
  - [SEP]: end of every sentence (ID 102)
2. Make every sentence the same length:
  - Set a maximum sequence length, up to 512 tokens
  - Padding ([PAD]): sentences shorter than the max length are filled with ID 0
  - Truncation: sentences longer than the max length are cut to the max length
3. Attention mask:
  - A list of 0/1 values indicating whether the model should attend to each token when learning contextual representations (special tokens are also 1; only [PAD] tokens are 0).

Use the ``tokenizer.encode_plus`` function; the result includes:

- input_ids: list of token IDs.
- attention_mask: list of 0/1 indicating which tokens should be considered by the model (return_attention_mask = True).
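
To make the three steps concrete, here is a minimal pure-Python sketch of what they amount to. It is not the tokenizer's actual implementation: `encode_fixed_length` is a hypothetical helper, and the only values taken from this notebook are the special-token IDs (101, 102, 0).

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # IDs listed in the steps above


def encode_fixed_length(token_ids, max_length):
    """Add [CLS]/[SEP], truncate or pad to max_length, and build the mask."""
    # Reserve two positions for [CLS] and [SEP]; cut whatever does not fit
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    # Real tokens, including the special tokens, get mask value 1
    attention_mask = [1] * len(input_ids)
    # Fill the remaining positions with [PAD], masked out with 0
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return input_ids, attention_mask


ids, mask = encode_fixed_length([6656, 872, 1762], max_length=8)
print(ids)   # [101, 6656, 872, 1762, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

In practice ``tokenizer.encode_plus`` does all of this in one call; the sketch only shows how the two returned lists relate to each other.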


In [None]:
!pip install transformers

In [None]:
import pickle
import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import BertTokenizer, BertForSequenceClassification
from sklearn.model_selection import train_test_split

import pandas as pd
import numpy as np

from tabulate import tabulate
from tqdm import trange
import random

In [None]:
#train_df

Unnamed: 0,text,label
0,V領設計能夠修飾臉型😍\n減齡泡泡袖洋裝😉\n👉https://lihi1.com/JDym...,1
1,【20210303】\n能勇敢追夢的人\n身上都閃著和煦的光芒\n也是好生羨慕！\n-\n敬...,0
2,玉 耳環\n#耳環 #玉 #earrings,1
3,【美國瘋潮WWE Taiwan】\n不管是WWE Elite還是AEW Unrivaled系...,1
4,🌈\n尋晚的post 一po已經被秒殺好多件的Vintage sports windbrea...,1


In [None]:
#test_df

Unnamed: 0,text,label
4000,是黑暗的陽光少女吧伊達邵碼頭,0
4001,本週新品真的是馬不停蹄的一週春裝新品越來越多寶寶客人們都買不停身為女孩妳能有不漂亮的權利嗎讓...,1
4002,霧眉住家工作室預約台中霧眉,1
4003,中長版挺料西裝外套兩側口袋與微腰身設計真的很美奶杏色天空藍共色草帽女孩圖簡約圖也是時髦穿搭的...,1
4004,男生染髮焦糖棕色質感低調的焦糖色讓男生頭髮也能有不一樣的選擇喜歡記得按讚追蹤並儲存焦糖棕色霧...,1


In [None]:
text = train_df.text.values
labels = train_df.label.values

In [None]:
text[455:460]

array(['是那天在機場護送森尼的人原來是新經紀人啊我還以為是保鑣',
       '創業起步並不會導致失敗沒有堅持才會當你堅持下去就會發現自己的能力往往比想像中強大車語錄',
       '從白天聊到黑夜謝謝我的朋友們也成為了彼此的朋友',
       '渣男燙感情上當渣男當然不可以但燙個帥氣渣男燙變得人見人愛是很可以的設計師預約專線國父紀念館號出口左斜前方金色大門營業時間週二週六週日每週一公休東區髮廊東區染髮東區挑染東區髮廊推薦大安區髮廊藝人造型師藝人御用專業美髮挑染台北髮廊京喚羽護髮京喚羽系統修護台北剪髮台北染髮台北燙髮燙髮台北髮廊推薦',
       '新品上架啦天氣好熱快手刀把新品買下來喜歡請私訊留貨留貨代表購買喔衣服尺寸價格細節請私訊提供郵寄服務門市營業時間週一週日地址新北市樹林區大義路號一樓官方我們的'],
      dtype=object)

In [None]:
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-chinese",
    do_lower_case = True
    )

In [None]:
def print_rand_sentence():
  '''Displays the tokens and respective IDs of a random text sample'''
  index = random.randint(0, len(text)-1)
  table = np.array([tokenizer.tokenize(text[index]), 
                    tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text[index]))]).T
  print(tabulate(table,
                 headers = ['Tokens', 'Token IDs'],
                 tablefmt = 'fancy_grid'))

print_rand_sentence()

╒══════════╤═════════════╕
│ Tokens   │   Token IDs │
╞══════════╪═════════════╡
│ 跟       │        6656 │
├──────────┼─────────────┤
│ 你       │         872 │
├──────────┼─────────────┤
│ 在       │        1762 │
├──────────┼─────────────┤
│ 一       │         671 │
├──────────┼─────────────┤
│ 起       │        6629 │
├──────────┼─────────────┤
│ 什       │         784 │
├──────────┼─────────────┤
│ 麼       │        7938 │
├──────────┼─────────────┤
│ 都       │        6963 │
├──────────┼─────────────┤
│ 是       │        3221 │
├──────────┼─────────────┤
│ 最       │        3297 │
├──────────┼─────────────┤
│ 好       │        1962 │
├──────────┼─────────────┤
│ 兩       │        1060 │
├──────────┼─────────────┤
│ 週       │        6867 │
├──────────┼─────────────┤
│ 年       │        2399 │
├──────────┼─────────────┤
│ 快       │        2571 │
├──────────┼─────────────┤
│ 樂       │        3556 │
╘══════════╧═════════════╛


In [None]:
token_id = []
attention_masks = []

def preprocessing(input_text, tokenizer):
  '''
  Returns <class transformers.tokenization_utils_base.BatchEncoding> with the following fields:
    - input_ids: list of token ids
    - token_type_ids: list of token type ids
    - attention_mask: list of 0/1 values specifying which tokens should be considered by the model (return_attention_mask = True).
  '''
  return tokenizer.encode_plus(
                        input_text,
                        add_special_tokens = True,
                        max_length = 32,
                        padding = 'max_length',  # replaces the deprecated pad_to_max_length = True
                        truncation = True,       # explicitly cut longer texts to max_length
                        return_attention_mask = True,
                        return_tensors = 'pt'
                   )


for sample in text:
  encoding_dict = preprocessing(sample, tokenizer)
  token_id.append(encoding_dict['input_ids']) 
  attention_masks.append(encoding_dict['attention_mask'])


token_id = torch.cat(token_id, dim = 0)
attention_masks = torch.cat(attention_masks, dim = 0)
labels = torch.tensor(labels)



In [None]:
token_id

tensor([[ 101, 7526, 6257,  ...,    0,    0,    0],
        [ 101, 5543, 1235,  ..., 4692, 4638,  102],
        [ 101, 4373, 5455,  ...,    0,    0,    0],
        ...,
        [ 101, 3362, 4197,  ..., 2523, 1599,  102],
        [ 101, 6258, 2157,  ...,    0,    0,    0],
        [ 101, 3615, 4638,  ..., 7415, 2130,  102]])

In [None]:
attention_masks

tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 1, 1, 1]])

In [None]:
labels

tensor([1, 0, 1,  ..., 0, 0, 0])

In [None]:
def print_rand_sentence_encoding():
  '''Displays tokens, token IDs and attention mask of a random text sample'''
  index = random.randint(0, len(text) - 1)
  # Map IDs straight back to tokens so all three columns keep the same length
  tokens = tokenizer.convert_ids_to_tokens(token_id[index].tolist())
  token_ids = token_id[index].tolist()
  attention = attention_masks[index].tolist()

  table = np.array([tokens, token_ids, attention]).T
  print(tabulate(table, 
                 headers = ['Tokens', 'Token IDs', 'Attention Mask'],
                 tablefmt = 'fancy_grid'))

print_rand_sentence_encoding()

╒══════════╤═════════════╤══════════════════╕
│ Tokens   │   Token IDs │   Attention Mask │
╞══════════╪═════════════╪══════════════════╡
│ [CLS]    │         101 │                1 │
├──────────┼─────────────┼──────────────────┤
│ 那       │        6929 │                1 │
├──────────┼─────────────┼──────────────────┤
│ 天       │        1921 │                1 │
├──────────┼─────────────┼──────────────────┤
│ 喇       │        1589 │                1 │
├──────────┼─────────────┼──────────────────┤
│ 叭       │        1375 │                1 │
├──────────┼─────────────┼──────────────────┤
│ 開       │        7274 │                1 │
├──────────┼─────────────┼──────────────────┤
│ 到       │        1168 │                1 │
├──────────┼─────────────┼──────────────────┤
│ 最       │        3297 │                1 │
├──────────┼─────────────┼──────────────────┤
│ 大       │        1920 │                1 │
├──────────┼─────────────┼──────────────────┤
│ 音       │        7509 │                1

# Step 2: Training and validation

💪 Tunable parameters (feel free to explore different combinations)
- Max length (already set in the tokenizer.encode_plus preprocessing step)
- Batch size
- Learning rate (Adam)
- Number of epochs
- Validation ratio (the split between training and validation data)
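
One way to organise the "two or more attempts" is to enumerate the candidate settings up front. The sketch below uses the batch sizes, learning rates, and epoch counts this notebook lists as candidates; the dictionary keys and the `configs` name are illustrative assumptions, and nothing is trained here.

```python
from itertools import product

# Candidate values listed in this notebook
batch_sizes = [16, 32]
learning_rates = [5e-5, 3e-5, 2e-5]
epoch_counts = [2, 3, 4]

# One dictionary per combination to try
configs = [
    {"batch_size": b, "lr": lr, "epochs": e}
    for b, lr, e in product(batch_sizes, learning_rates, epoch_counts)
]

print(len(configs))  # 18
print(configs[0])    # {'batch_size': 16, 'lr': 5e-05, 'epochs': 2}
```

Running all 18 combinations is not required; picking a few of them and recording the validation accuracy next to each is enough to compare attempts.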


In [None]:
'''
- Batch size: 16, 32
- Learning rate (Adam): 5e-5, 3e-5, 2e-5
- Number of epochs: 2, 3, 4

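shuffle: randomly shuffles the data before splitting, so the split is random.
stratify: performs stratified sampling, so each split keeps the label proportions of the full dataset. This is especially useful when the labels are imbalanced; in some classification problems the negative class can be several times larger than the positive class, and stratified sampling is recommended in that case.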

'''
val_ratio = 0.2
batch_size = 16 

# Indices of the train and validation splits stratified by labels
train_idx, val_idx = train_test_split(
    np.arange(len(labels)),
    test_size = val_ratio,
    shuffle = True,
    stratify = labels)

# Train and validation sets
train_set = TensorDataset(token_id[train_idx], 
                          attention_masks[train_idx], 
                          labels[train_idx])

val_set = TensorDataset(token_id[val_idx], 
                        attention_masks[val_idx], 
                        labels[val_idx])

# Prepare DataLoader
train_dataloader = DataLoader(
            train_set,
            sampler = RandomSampler(train_set),
            batch_size = batch_size
        )

validation_dataloader = DataLoader(
            val_set,
            sampler = SequentialSampler(val_set),
            batch_size = batch_size
        )

In [None]:
# Load the BertForSequenceClassification model
model = BertForSequenceClassification.from_pretrained(
    'bert-base-chinese',
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False,
)

# Recommended learning rates (Adam): 5e-5, 3e-5, 2e-5. See: https://arxiv.org/pdf/1810.04805.pdf
optimizer = torch.optim.AdamW(model.parameters(), 
                              lr = 5e-5,
                              eps = 1e-08
                              )

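# Run on GPU when one is available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)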

Some weights of the model checkpoint at bert-base-chinese were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(21128, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

In [None]:
def b_tp(preds, labels):
  '''Returns True Positives (TP): count of samples predicted 1 whose actual class is 1'''
  return sum([p == l and p == 1 for p, l in zip(preds, labels)])

def b_fp(preds, labels):
  '''Returns False Positives (FP): count of samples predicted 1 whose actual class is 0'''
  return sum([p != l and p == 1 for p, l in zip(preds, labels)])

def b_tn(preds, labels):
  '''Returns True Negatives (TN): count of samples predicted 0 whose actual class is 0'''
  return sum([p == l and p == 0 for p, l in zip(preds, labels)])

def b_fn(preds, labels):
  '''Returns False Negatives (FN): count of samples predicted 0 whose actual class is 1'''
  return sum([p != l and p == 0 for p, l in zip(preds, labels)])

def b_metrics(preds, labels):
  '''
  Returns the following metrics:
    - accuracy    = (TP + TN) / N
    - precision   = TP / (TP + FP)
    - recall      = TP / (TP + FN)
    - specificity = TN / (TN + FP)
  '''
  preds = np.argmax(preds, axis = 1).flatten()
  labels = labels.flatten()
  tp = b_tp(preds, labels)
  tn = b_tn(preds, labels)
  fp = b_fp(preds, labels)
  fn = b_fn(preds, labels)
  b_accuracy = (tp + tn) / len(labels)
  b_precision = tp / (tp + fp) if (tp + fp) > 0 else 'nan'
  b_recall = tp / (tp + fn) if (tp + fn) > 0 else 'nan'
  b_specificity = tn / (tn + fp) if (tn + fp) > 0 else 'nan'
  return b_accuracy, b_precision, b_recall, b_specificity
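
As a sanity check on the four formulas, here is a small worked example in plain Python. The eight predictions and labels are made up for illustration and are not taken from the dataset.

```python
# Made-up predictions and ground-truth labels (1 = ad, 0 = not an ad)
preds = [1, 1, 1, 0, 0, 0, 1, 0]
labels = [1, 1, 0, 0, 0, 1, 1, 1]

pairs = list(zip(preds, labels))
tp = sum(p == 1 and l == 1 for p, l in pairs)  # predicted ad, is ad
fp = sum(p == 1 and l == 0 for p, l in pairs)  # predicted ad, is not ad
tn = sum(p == 0 and l == 0 for p, l in pairs)  # predicted not ad, is not ad
fn = sum(p == 0 and l == 1 for p, l in pairs)  # predicted not ad, is ad

accuracy = (tp + tn) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
specificity = tn / (tn + fp)

print(tp, fp, tn, fn)       # 3 1 2 2
print(accuracy, precision)  # 0.625 0.75
print(recall, specificity)
```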

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Recommended number of epochs: 2, 3, 4. See: https://arxiv.org/pdf/1810.04805.pdf
epochs = 3

for _ in trange(epochs, desc = 'Epoch'):
    
    # ========== Training ==========
    
    # Set model to training mode
    model.train()
    
    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    for step, batch in enumerate(train_dataloader):
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        optimizer.zero_grad()
        # Forward pass
        train_output = model(b_input_ids, 
                             token_type_ids = None, 
                             attention_mask = b_input_mask, 
                             labels = b_labels)
        # Backward pass
        train_output.loss.backward()
        optimizer.step()
        # Update tracking variables
        tr_loss += train_output.loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    # ========== Validation ==========

    # Set model to evaluation mode
    model.eval()

    # Tracking variables 
    val_accuracy = []
    val_precision = []
    val_recall = []
    val_specificity = []

    for batch in validation_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        with torch.no_grad():
          # Forward pass
          eval_output = model(b_input_ids, 
                              token_type_ids = None, 
                              attention_mask = b_input_mask)
        logits = eval_output.logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()
        # Calculate validation metrics
        b_accuracy, b_precision, b_recall, b_specificity = b_metrics(logits, label_ids)
        val_accuracy.append(b_accuracy)
        # Update precision only when (tp + fp) !=0; ignore nan
        if b_precision != 'nan': val_precision.append(b_precision)
        # Update recall only when (tp + fn) !=0; ignore nan
        if b_recall != 'nan': val_recall.append(b_recall)
        # Update specificity only when (tn + fp) !=0; ignore nan
        if b_specificity != 'nan': val_specificity.append(b_specificity)

    print('\n\t - Train loss: {:.4f}'.format(tr_loss / nb_tr_steps))
    print('\t - Validation Accuracy: {:.4f}'.format(sum(val_accuracy)/len(val_accuracy)))
    print('\t - Validation Precision: {:.4f}'.format(sum(val_precision)/len(val_precision)) if len(val_precision)>0 else '\t - Validation Precision: NaN')
    print('\t - Validation Recall: {:.4f}'.format(sum(val_recall)/len(val_recall)) if len(val_recall)>0 else '\t - Validation Recall: NaN')
    print('\t - Validation Specificity: {:.4f}\n'.format(sum(val_specificity)/len(val_specificity)) if len(val_specificity)>0 else '\t - Validation Specificity: NaN')
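
The loop above reports each validation metric as the mean of per-batch values. For ratios such as precision, that mean can differ from the value computed once over all validation samples. The sketch below illustrates the gap with two made-up batches; `confusion_counts` and `precision` are hypothetical helpers, not part of the notebook.

```python
def confusion_counts(preds, labels):
    """Return (tp, fp, tn, fn) for binary predictions."""
    pairs = list(zip(preds, labels))
    tp = sum(p == 1 and l == 1 for p, l in pairs)
    fp = sum(p == 1 and l == 0 for p, l in pairs)
    tn = sum(p == 0 and l == 0 for p, l in pairs)
    fn = sum(p == 0 and l == 1 for p, l in pairs)
    return tp, fp, tn, fn


def precision(tp, fp):
    return tp / (tp + fp) if (tp + fp) > 0 else float("nan")


# Two made-up batches of (predictions, labels)
batches = [
    ([1, 1, 1, 1], [1, 1, 1, 0]),  # batch precision 3/4
    ([1, 0, 0, 0], [0, 0, 0, 0]),  # batch precision 0/1
]

# Option 1: average the per-batch precision values
per_batch = [precision(*confusion_counts(p, l)[:2]) for p, l in batches]
mean_of_batches = sum(per_batch) / len(per_batch)

# Option 2: pool the counts over all batches, then compute precision once
total_tp = total_fp = 0
for p, l in batches:
    tp, fp, _, _ = confusion_counts(p, l)
    total_tp += tp
    total_fp += fp
pooled = precision(total_tp, total_fp)

print(mean_of_batches)  # 0.375
print(pooled)           # 0.6
```

Pooling the counts weights every sample equally, which is usually what is meant by precision on the validation set; accuracy is affected only when the last batch is smaller than the others.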


Epoch:  33%|███▎      | 1/3 [00:25<00:50, 25.23s/it]


	 - Train loss: 0.4099
	 - Validation Accuracy: 0.8688
	 - Validation Precision: 0.9139
	 - Validation Recall: 0.8013
	 - Validation Specificity: 0.9321



Epoch:  67%|██████▋   | 2/3 [00:50<00:25, 25.02s/it]


	 - Train loss: 0.2723
	 - Validation Accuracy: 0.8213
	 - Validation Precision: 0.7653
	 - Validation Recall: 0.8994
	 - Validation Specificity: 0.7407



Epoch: 100%|██████████| 3/3 [01:15<00:00, 25.03s/it]


	 - Train loss: 0.1778
	 - Validation Accuracy: 0.8588
	 - Validation Precision: 0.8458
	 - Validation Recall: 0.8580
	 - Validation Specificity: 0.8528
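
In this run the training loss keeps falling while validation accuracy is highest after the first epoch, so the final weights are not necessarily the best ones. A minimal sketch of picking the best epoch from the logged accuracies follows; the variable names are illustrative, and the numbers are copied from the log above.

```python
# Validation accuracy per epoch, copied from the training log above
val_accuracy_by_epoch = [0.8688, 0.8213, 0.8588]

# Index of the highest accuracy, converted to a 1-based epoch number
best_index = max(range(len(val_accuracy_by_epoch)),
                 key=lambda i: val_accuracy_by_epoch[i])
best_epoch = best_index + 1

print(best_epoch)                         # 1
print(val_accuracy_by_epoch[best_index])  # 0.8688
```

Inside the real loop, the same comparison could decide when to save a copy of the model, assuming storing a checkpoint per improvement is acceptable for the assignment.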






In [None]:
#with open('/content/gdrive/MyDrive/NTU GIL/CLLT ta/HW9_fine-tuned BERT/model.pkl', 'wb')as f:
 #   pickle.dump(model, f)

# Step 3: Predicting

⭐ Remember to use test_df!

In [None]:
# Testing data prediction

test_txt = test_df.text.values
test_lbl = test_df.label.values

# We need Token IDs and Attention Mask for inference on the new sentence
test_ids = []
test_attention_mask = []

# Apply the tokenizer
for sample in test_txt:
  encoding_dict = preprocessing(sample, tokenizer)
  test_ids.append(encoding_dict['input_ids']) 
  test_attention_mask.append(encoding_dict['attention_mask'])


test_ids = torch.cat(test_ids, dim = 0)
test_attention_mask = torch.cat(test_attention_mask, dim = 0)
test_lbl = torch.tensor(test_lbl)

print(test_ids[0])
print(test_attention_mask[0])
print(test_lbl[0])



tensor([ 101, 3221, 7946, 3266, 4638, 7382, 1045, 2208, 1957, 1416,  823, 6888,
        6939, 4826, 7531,  102,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0])
tensor(0)


In [None]:
test_dataset = TensorDataset(test_ids, test_attention_mask)
test_dataloader = DataLoader(
            test_dataset, # The test samples.
            sampler = SequentialSampler(test_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

In [None]:
predictions = []

for batch in test_dataloader:
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        with torch.no_grad():        
            output= model(b_input_ids, 
                          token_type_ids=None, 
                          attention_mask=b_input_mask)
            logits = output.logits
            logits = logits.detach().cpu().numpy()

            pred_flat = np.argmax(logits, axis=1).flatten()
            predictions.extend(list(pred_flat))

# Work on an explicit copy so pandas does not raise SettingWithCopyWarning
test_df = test_df.copy()
test_df['prediction'] = predictions
test_df.head()


Unnamed: 0,text,label,prediction
4000,是黑暗的陽光少女吧伊達邵碼頭,0,0
4001,本週新品真的是馬不停蹄的一週春裝新品越來越多寶寶客人們都買不停身為女孩妳能有不漂亮的權利嗎讓...,1,1
4002,霧眉住家工作室預約台中霧眉,1,1
4003,中長版挺料西裝外套兩側口袋與微腰身設計真的很美奶杏色天空藍共色草帽女孩圖簡約圖也是時髦穿搭的...,1,1
4004,男生染髮焦糖棕色質感低調的焦糖色讓男生頭髮也能有不一樣的選擇喜歡記得按讚追蹤並儲存焦糖棕色霧...,1,1


In [None]:
# Calculate test metrics
test_preds = test_df['prediction']
test_labels = test_df['label']

# Use fresh names so the b_tp / b_fp / b_tn / b_fn helper functions are not overwritten
tp = sum([p == l and p == 1 for p, l in zip(test_preds, test_labels)])  # true positives
fp = sum([p != l and p == 1 for p, l in zip(test_preds, test_labels)])  # false positives
tn = sum([p == l and p == 0 for p, l in zip(test_preds, test_labels)])  # true negatives
fn = sum([p != l and p == 0 for p, l in zip(test_preds, test_labels)])  # false negatives

# Metrics:
#   - accuracy    = (TP + TN) / N
#   - precision   = TP / (TP + FP)
#   - recall      = TP / (TP + FN)
#   - specificity = TN / (TN + FP)

test_accuracy = (tp + tn) / len(test_labels)
test_precision = tp / (tp + fp) if (tp + fp) > 0 else 'nan'
test_recall = tp / (tp + fn) if (tp + fn) > 0 else 'nan'
test_specificity = tn / (tn + fp) if (tn + fp) > 0 else 'nan'

print(f'test_accuracy: {test_accuracy}')
print(f'test_precision: {test_precision}')
print(f'test_recall: {test_recall}')
print(f'test_specificity: {test_specificity}')


test_accuracy: 0.863
test_precision: 0.8532289628180039
test_recall: 0.8755020080321285
test_specificity: 0.850597609561753
