# Kaggle Competition Report

#### 在這次的competition中，我主要使用Roberta為架構，以訓練對文字八分類的model。一開始我有針對原始資料做在Lab1與Lab2中練習的feature extraction，像是移除stopwords以及標點符號，但是在這次的competition中，我發現這些feature extraction的方法對於model的表現並沒有太大的幫助(在kaggle上的public score約0.3)，反而是會讓model的表現下降，因此我最後選擇使用原始的資料進行訓練。原先是使用bert-base-uncased，但在我的實驗後發現roberta-base的表現最好。
#### 這次的資料集有非常嚴重的unbalanced問題，因此一開始我down sample了資料，讓所有資料的數量與最少數量的label(anger)資料數量相同。但是在實驗後發現，我自己test data的f1 score(~0.48)會比kaggle的public score(~0.42)高，因此我推斷kaggle上的test data也有unbalanced的問題，因此最後我使用全部的資料，在最後的訓練中，以其中99%作為進行訓練，0.5%為validation，0.5%為test。以下是我的各種模型在kaggle上的public score與private score。

| Model | public score | private score |
|:-----|-----|-----:|
|BERT (down sample)|0.42282|0.4066| 
|RoBERTa (down sample)|0.53448|0.51749|
|RoBERTa (all data)|**0.54686**|**0.52984**|

In [1]:
# Construct BERT model to do classification
import torch
from transformers import RobertaTokenizer, RobertaModel
from transformers import BertTokenizer, BertModel
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from transformers import AdamW, get_linear_schedule_with_warmup
import time
import datetime
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import f1_score, accuracy_score
import json
from tqdm import tqdm
import random
from collections import Counter

random.seed(1)
# check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

  from .autonotebook import tqdm as notebook_tqdm


cuda


In [2]:

def read_json_file(file_path):
    total_line = 0
    data = []
    with open(file_path, 'r') as file:
        for line in tqdm(file):
            total_line += 1
            json_line = json.loads(line)
            data.append(json_line)
    return data, total_line

file_path = '../../tweets_DM.json'
tweets_data, lines = read_json_file(file_path)

1867535it [00:33, 56367.52it/s] 


In [3]:
# Read data from json file
print((tweets_data[0]["_source"]["tweet"]))
clean_dicts = []
for i in range(lines):
    clean_dicts.append({"ids":tweets_data[i]["_source"]["tweet"]["tweet_id"], "text":tweets_data[i]["_source"]["tweet"]["text"], "Type":"train"})

{'hashtags': ['Snapchat'], 'tweet_id': '0x376b20', 'text': 'People who post "add me on #Snapchat" must be dehydrated. Cuz man.... that\'s <LH>'}


In [4]:
# read label from emotion.csv
emotion_df = pd.read_csv('../emotion.csv')
data_type = pd.read_csv('../data_identification.csv')

In [5]:
# determine the type(train or test) of each tweet
train_id_set = set(data_type[data_type['identification'] == 'train']['tweet_id'])
test_id_set = data_type[data_type['identification'] == 'test']['tweet_id']
emotion_dict = dict(zip(emotion_df['tweet_id'], emotion_df['emotion']))
for dic in tqdm(clean_dicts):
    if dic['ids'] in train_id_set:
        dic['label'] = emotion_dict[dic['ids']]
    else:
        dic['Type'] = 'test'

100%|██████████| 1867535/1867535 [00:02<00:00, 631198.37it/s]


In [6]:
# extract type == train from X
train_dicts = []
test_dicts = []
for idx, dic in enumerate(clean_dicts):
    if dic['Type'] == 'train':
        train_dicts.append(dic)
    else:
        test_dicts.append(dic)
print(len(train_dicts))
print(len(test_dicts))

1455563
411972


In [7]:
# do statistics on train_dicts

emotion_counter = Counter()
for dic in train_dicts:
    emotion_counter[dic['label']] += 1
print(emotion_counter)
# extract each emotion from train_dicts with the same number of least emotion
emotion_num = emotion_counter.most_common()[-1][1]
print(emotion_num)
train_dicts_same = []
for key in emotion_counter.keys():
    target_dict = [dic for dic in train_dicts if dic['label'] == key]
    random.shuffle(target_dict)
    train_dicts_same += target_dict[:]
    # train_dicts_same += target_dict[:emotion_num]
print(len(train_dicts_same))

Counter({'joy': 516017, 'anticipation': 248935, 'trust': 205478, 'sadness': 193437, 'disgust': 139101, 'fear': 63999, 'surprise': 48729, 'anger': 39867})
39867
1455563


In [8]:
# do one-hot encoding on train_dicts_same
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

label_encoder = LabelEncoder()
one_hot_encoder = OneHotEncoder()
X = [dic['text'] for dic in train_dicts_same]
y = [dic['label'] for dic in train_dicts_same]
label_y = label_encoder.fit_transform(y)
encode_y = one_hot_encoder.fit_transform(label_y.reshape(-1, 1)).toarray()



In [9]:
# train test split
from sklearn.model_selection import train_test_split

train_X, val_X, train_y, val_y = train_test_split(X, encode_y, test_size=0.01, random_state=1)
val_X, test_X, val_y, test_y = train_test_split(val_X, val_y, test_size=0.5, random_state=1)
# do statistics on train_y

train_y = np.argmax(train_y, axis=1)
val_y = np.argmax(val_y, axis=1)
test_y = np.argmax(test_y, axis=1)
print(Counter(train_y))
print(Counter(val_y))
print(Counter(test_y))

Counter({4: 510762, 1: 246450, 7: 203453, 5: 191447, 2: 137779, 3: 63362, 6: 48270, 0: 39484})
Counter({4: 2584, 1: 1287, 5: 1010, 7: 1008, 2: 638, 3: 336, 6: 231, 0: 184})
Counter({4: 2671, 1: 1198, 7: 1017, 5: 980, 2: 684, 3: 301, 6: 228, 0: 199})


In [10]:
# load BERT tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base', do_lower_case=True)

In [11]:
# define model class for BERT

class BertClassifier(torch.nn.Module):
    def __init__(self, freeze_bert=False):
        super(BertClassifier, self).__init__()
        # specify hidden size of BERT, hidden size of our classifier, and number of labels
        D_in, H, D_out = 768, 50, 8

        # instantiate BERT model
        self.bert = RobertaModel.from_pretrained('roberta-base')

        # instantiate an one-layer feed-forward classifier
        self.classifier = torch.nn.Sequential(
            torch.nn.Linear(D_in, H),
            torch.nn.ReLU(),
            torch.nn.Linear(H, D_out),
            torch.nn.Softmax(dim=1)
        )

        # freeze bert layers
        if freeze_bert:
            for param in self.bert.parameters():
                param.requires_grad = False

    def forward(self, input_ids, attention_mask):
        # feed input to BERT
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask)
        # extract last hidden state of the token `[CLS]` for classification task
        last_hidden_state_cls = outputs[0][:, 0, :]
        # feed input to classifier to compute logits
        logits = self.classifier(last_hidden_state_cls)
        return logits

In [12]:
model = BertClassifier(freeze_bert=False)
model.to(device)

# define optimizer and learning rate scheduler
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]
optimizer = AdamW(optimizer_grouped_parameters, lr=2e-5)
# number of training epochs
epochs = 1
# number of batches
batch_size = 32
# calculate number of training steps
num_train_steps = int(len(train_X) / batch_size * epochs)
# create the learning rate scheduler
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=num_train_steps)
# define loss function
loss_fn = torch.nn.CrossEntropyLoss()

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
# define function to train model
def train(model, train_dataloader, val_dataloader=None, epochs=4, evaluation=False):
    # start training loop
    print("Start training...\n")
    for epoch_i in range(epochs):
        # =======================================
        #               Training
        # =======================================
        # print the header of the result table
        print(f"{'Epoch':^7} | {'Batch':^7} | {'Train Loss':^12} | {'Val Loss':^10} | {'Val Acc':^9} | {'Val F1 ':^9} | {'Elapsed':^9}")
        print("-" * 70)
        # measure the elapsed time of each epoch
        t0_epoch, t0_batch = time.time(), time.time()
        # reset tracking variables at the beginning of each epoch
        total_loss, batch_loss, batch_counts = 0, 0, 0
        # put the model into the training mode
        model.train()
        # for each batch of training data
        for step, batch in enumerate(tqdm(train_dataloader)):
            batch_counts += 1
            # load batch to GPU
            b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)
            # zero out any previously calculated gradients
            model.zero_grad()
            # perform forward pass
            logits = model(b_input_ids, b_attn_mask)
            # compute loss and accumulate the loss values
            loss = loss_fn(logits, b_labels)
            batch_loss += loss.item()
            total_loss += loss.item()
            # perform backward pass to calculate gradients
            loss.backward()
            # clip the norm of the gradients to 1.0 to prevent "exploding gradients"
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            # update parameters
            optimizer.step()
            scheduler.step()
            # print the loss values and time elapsed for every 20 batches
            if (step % 1000 == 0 and step != 0) or (step == len(train_dataloader) - 1):
                # calculate time elapsed for 20 batches
                time_elapsed = time.time() - t0_batch
                # print training results
                print(f"{epoch_i + 1:^7} | {step:^7} | {batch_loss / batch_counts:^12.6f} | {'-':^10} | {'-':^9} | {'-':^9} | {time_elapsed:^9.2f}")
                # reset batch tracking variables
                batch_loss, batch_counts = 0, 0
                t0_batch = time.time()
        # calculate the average loss over the entire training data
        avg_train_loss = total_loss / len(train_dataloader)
        print("-" * 70)
        # =======================================
        #               Evaluation
        # =======================================
        if evaluation == True:
            # after the completion of each training epoch, measure the model's performance
            # on our validation set.
            val_loss, val_accuracy, val_f1_score = evaluate(model, val_dataloader)
            # print validation results
            time_elapsed = time.time() - t0_epoch
            print(f"{epoch_i + 1:^7} | {'-':^7} | {avg_train_loss:^12.6f} | {val_loss:^10.6f} | {val_accuracy:^9.2f} | {val_f1_score:^9.2f} | {time_elapsed:^9.2f}")
            print("-" * 70)
        print("\n")
    print("Training complete!")
    
# define function for evaluation
def evaluate(model, val_dataloader):
    # put the model into the evaluation mode
    model.eval()
    # tracking variables
    val_accuracy = []
    val_loss = []
    # for each batch in our validation set
    val_f1_score = []
    for batch in val_dataloader:
        # load batch to GPU
        b_input_ids, b_attn_mask, b_labels = tuple(t.to(device) for t in batch)
        # deactivate autograd
        with torch.no_grad():
            # model predictions
            logits = model(b_input_ids, b_attn_mask)
        # compute loss and accuracy
        loss = loss_fn(logits, b_labels)
        val_loss.append(loss.item())
        # get the predictions
        preds = torch.argmax(logits, dim=1).flatten()
        # calculate the accuracy rate
        # calculate the f1 score
        f1_score_macro = f1_score(b_labels.cpu().numpy(), preds.cpu().numpy(), average='macro')
        accuracy = (preds == b_labels).cpu().numpy().mean() * 100
        val_f1_score.append(f1_score_macro)
        val_accuracy.append(accuracy)
    # compute the average accuracy and loss over the validation set
    val_loss = np.mean(val_loss)
    val_accuracy = np.mean(val_accuracy)
    val_f1_score = np.mean(val_f1_score)
    return val_loss, val_accuracy, val_f1_score

In [14]:
# convert train data to torch tensor
train_inputs = tokenizer(train_X, padding=True, truncation=True, max_length=256, return_tensors="pt")
train_labels = torch.tensor(train_y)
# convert validation data to torch tensor
val_inputs = tokenizer(val_X, padding=True, truncation=True, max_length=256, return_tensors="pt")
val_labels = torch.tensor(val_y)
# convert test data to torch tensor
test_inputs = tokenizer(test_X, padding=True, truncation=True, max_length=256, return_tensors="pt")
test_labels = torch.tensor(test_y)

# create the DataLoader for our training set
train_data = TensorDataset(train_inputs['input_ids'], train_inputs['attention_mask'], train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
# create the DataLoader for our validation set
val_data = TensorDataset(val_inputs['input_ids'], val_inputs['attention_mask'], val_labels)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data, sampler=val_sampler, batch_size=batch_size)
# create the DataLoader for our test set
test_data = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'], test_labels)
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)
                                    

In [15]:
train(model=model, train_dataloader=train_dataloader, val_dataloader=val_dataloader, epochs=epochs, evaluation=True)

Start training...

 Epoch  |  Batch  |  Train Loss  |  Val Loss  |  Val Acc  |  Val F1   |  Elapsed 
----------------------------------------------------------------------


  2%|▏         | 1001/45032 [05:22<4:02:35,  3.02it/s]

   1    |  1000   |   1.926023   |     -      |     -     |     -     |  322.26  


  4%|▍         | 2001/45032 [10:28<3:35:17,  3.33it/s]

   1    |  2000   |   1.877168   |     -      |     -     |     -     |  305.89  


  7%|▋         | 3001/45032 [15:31<3:31:15,  3.32it/s]

   1    |  3000   |   1.852217   |     -      |     -     |     -     |  303.85  


  9%|▉         | 4001/45032 [20:35<3:29:05,  3.27it/s]

   1    |  4000   |   1.793487   |     -      |     -     |     -     |  303.85  


 11%|█         | 5001/45032 [25:39<3:23:56,  3.27it/s]

   1    |  5000   |   1.784153   |     -      |     -     |     -     |  303.63  


 13%|█▎        | 6001/45032 [30:43<3:16:14,  3.31it/s]

   1    |  6000   |   1.771376   |     -      |     -     |     -     |  304.17  


 16%|█▌        | 7001/45032 [35:46<3:14:13,  3.26it/s]

   1    |  7000   |   1.764161   |     -      |     -     |     -     |  303.24  


 18%|█▊        | 8001/45032 [40:50<3:06:53,  3.30it/s]

   1    |  8000   |   1.763137   |     -      |     -     |     -     |  303.76  


 20%|█▉        | 9001/45032 [45:53<3:04:06,  3.26it/s]

   1    |  9000   |   1.757923   |     -      |     -     |     -     |  303.33  


 22%|██▏       | 10001/45032 [50:58<2:57:06,  3.30it/s]

   1    |  10000  |   1.758411   |     -      |     -     |     -     |  304.22  


 24%|██▍       | 11001/45032 [56:02<2:53:48,  3.26it/s]

   1    |  11000  |   1.756003   |     -      |     -     |     -     |  303.97  


 27%|██▋       | 12001/45032 [1:01:06<2:47:30,  3.29it/s]

   1    |  12000  |   1.753402   |     -      |     -     |     -     |  304.10  


 29%|██▉       | 13001/45032 [1:06:09<2:42:33,  3.28it/s]

   1    |  13000  |   1.748551   |     -      |     -     |     -     |  303.62  


 31%|███       | 14001/45032 [1:11:13<2:35:38,  3.32it/s]

   1    |  14000  |   1.742421   |     -      |     -     |     -     |  303.92  


 33%|███▎      | 15001/45032 [1:16:17<2:31:13,  3.31it/s]

   1    |  15000  |   1.744478   |     -      |     -     |     -     |  303.76  


 36%|███▌      | 16001/45032 [1:21:21<2:28:33,  3.26it/s]

   1    |  16000  |   1.744048   |     -      |     -     |     -     |  304.11  


 38%|███▊      | 17001/45032 [1:26:25<2:22:31,  3.28it/s]

   1    |  17000  |   1.737032   |     -      |     -     |     -     |  304.33  


 40%|███▉      | 18001/45032 [1:31:29<2:16:43,  3.30it/s]

   1    |  18000  |   1.732790   |     -      |     -     |     -     |  303.73  


 42%|████▏     | 19001/45032 [1:36:33<2:12:51,  3.27it/s]

   1    |  19000  |   1.725424   |     -      |     -     |     -     |  304.21  


 44%|████▍     | 20001/45032 [1:41:37<2:07:34,  3.27it/s]

   1    |  20000  |   1.714170   |     -      |     -     |     -     |  303.49  


 47%|████▋     | 21001/45032 [1:46:41<2:00:49,  3.31it/s]

   1    |  21000  |   1.710770   |     -      |     -     |     -     |  303.97  


 49%|████▉     | 22001/45032 [1:51:45<1:56:09,  3.30it/s]

   1    |  22000  |   1.705896   |     -      |     -     |     -     |  303.70  


 51%|█████     | 23001/45032 [1:56:49<1:51:47,  3.28it/s]

   1    |  23000  |   1.702272   |     -      |     -     |     -     |  304.17  


 53%|█████▎    | 24001/45032 [2:01:53<1:47:00,  3.28it/s]

   1    |  24000  |   1.697825   |     -      |     -     |     -     |  303.93  


 56%|█████▌    | 25001/45032 [2:06:57<1:41:56,  3.28it/s]

   1    |  25000  |   1.694125   |     -      |     -     |     -     |  304.29  


 58%|█████▊    | 26001/45032 [2:12:01<1:38:00,  3.24it/s]

   1    |  26000  |   1.696137   |     -      |     -     |     -     |  303.89  


 60%|█████▉    | 27001/45032 [2:17:05<1:31:38,  3.28it/s]

   1    |  27000  |   1.689062   |     -      |     -     |     -     |  304.12  


 62%|██████▏   | 28001/45032 [2:22:09<1:25:59,  3.30it/s]

   1    |  28000  |   1.685796   |     -      |     -     |     -     |  303.67  


 64%|██████▍   | 29001/45032 [2:27:12<1:22:00,  3.26it/s]

   1    |  29000  |   1.691085   |     -      |     -     |     -     |  303.36  


 67%|██████▋   | 30001/45032 [2:32:15<1:16:13,  3.29it/s]

   1    |  30000  |   1.686264   |     -      |     -     |     -     |  303.12  


 69%|██████▉   | 31001/45032 [2:37:18<1:11:18,  3.28it/s]

   1    |  31000  |   1.684070   |     -      |     -     |     -     |  303.16  


 71%|███████   | 32001/45032 [2:42:22<1:05:57,  3.29it/s]

   1    |  32000  |   1.683492   |     -      |     -     |     -     |  303.37  


 73%|███████▎  | 33001/45032 [2:47:25<1:01:13,  3.28it/s]

   1    |  33000  |   1.687469   |     -      |     -     |     -     |  303.82  


 76%|███████▌  | 34001/45032 [2:52:29<54:57,  3.35it/s]  

   1    |  34000  |   1.683497   |     -      |     -     |     -     |  303.29  


 78%|███████▊  | 35001/45032 [2:57:32<50:47,  3.29it/s]

   1    |  35000  |   1.683292   |     -      |     -     |     -     |  302.90  


 80%|███████▉  | 36001/45032 [3:02:35<45:55,  3.28it/s]

   1    |  36000  |   1.678795   |     -      |     -     |     -     |  303.29  


 82%|████████▏ | 37001/45032 [3:07:39<40:36,  3.30it/s]

   1    |  37000  |   1.680640   |     -      |     -     |     -     |  303.56  


 84%|████████▍ | 38001/45032 [3:12:42<35:15,  3.32it/s]

   1    |  38000  |   1.673865   |     -      |     -     |     -     |  303.59  


 87%|████████▋ | 39001/45032 [3:17:46<30:38,  3.28it/s]

   1    |  39000  |   1.675897   |     -      |     -     |     -     |  304.08  


 89%|████████▉ | 40001/45032 [3:22:50<25:22,  3.31it/s]

   1    |  40000  |   1.674445   |     -      |     -     |     -     |  303.55  


 91%|█████████ | 41001/45032 [3:27:54<20:33,  3.27it/s]

   1    |  41000  |   1.671047   |     -      |     -     |     -     |  303.91  


 93%|█████████▎| 42001/45032 [3:32:57<15:23,  3.28it/s]

   1    |  42000  |   1.670482   |     -      |     -     |     -     |  303.26  


 95%|█████████▌| 43001/45032 [3:38:00<10:20,  3.27it/s]

   1    |  43000  |   1.671384   |     -      |     -     |     -     |  303.51  


 98%|█████████▊| 44001/45032 [3:43:05<05:13,  3.29it/s]

   1    |  44000  |   1.670280   |     -      |     -     |     -     |  304.28  


100%|█████████▉| 45001/45032 [3:48:08<00:09,  3.34it/s]

   1    |  45000  |   1.673788   |     -      |     -     |     -     |  303.73  


100%|██████████| 45032/45032 [3:48:18<00:00,  3.29it/s]

   1    |  45031  |   1.685205   |     -      |     -     |     -     |   9.51   
----------------------------------------------------------------------





   1    |    -    |   1.723712   |  1.658355  |   61.34   |   0.45    | 13712.42 
----------------------------------------------------------------------


Training complete!


In [16]:
# compute the accuracy and f1 score on the test set
test_loss, test_accuracy, test_f1_score = evaluate(model, test_dataloader)
# print the accuracy and loss on the test set
print(f"Test Accuracy: {test_accuracy}")
print(f"Test F1 Score: {test_f1_score}")


Test Accuracy: 61.39176065162907
Test F1 Score: 0.4463059730763


In [17]:
# save model
torch.save(model.state_dict(), '../../BERT_model_v3.bin')

In [18]:
# load model
model = BertClassifier(freeze_bert=False)
model.load_state_dict(torch.load('../../BERT_model_v3.bin'))
model.to(device)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BertClassifier(
  (bert): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNo

In [19]:
# do predict on test_dicts, which has no label
test_inputs = tokenizer([dic['text'] for dic in test_dicts], padding=True, truncation=True, max_length=256, return_tensors="pt")

# create the DataLoader for our test set
test_data = TensorDataset(test_inputs['input_ids'], test_inputs['attention_mask'])
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

In [20]:


# put the model into the evaluation mode
model.eval()
# tracking variables
predictions = []
# predict
for batch in tqdm(test_dataloader):
    # load batch to GPU
    b_input_ids, b_attn_mask = tuple(t.to(device) for t in batch)
    # deactivate autograd
    with torch.no_grad():
        # model predictions
        logits = model(b_input_ids, b_attn_mask)
    # get the predictions
    preds = torch.argmax(logits, dim=1).flatten()
    # put the predicted labels to a list
    predictions += preds.cpu().numpy().tolist()
# get the prediction result
predictions = label_encoder.inverse_transform(predictions)
# print the result into a csv file
output_df = pd.DataFrame({'id': [dic['ids'] for dic in test_dicts], 'emotion': predictions})



100%|██████████| 12875/12875 [19:25<00:00, 11.05it/s]


In [21]:
pd.DataFrame.to_csv(output_df, 'submission_v3.csv', index=False)

## Discussion

#### 由於資料量真的很大，tokenize data就要十分鐘左右，訓練一個epoch就要花上三個半小時，訓練好之後讓model reference到要交到kaggle上的test data也要20分鐘。從loss的下降來看，訓練很多epoch並沒有甚麼幫助(也浪費時間跟計算資源)，因此我分數最高的版本也只有訓練兩個epoch而已(有load checkpoint然後再訓練一次，因此notebook中的epoch設為1)。
#### 在這次competition中，比較令我意外的是feature extraction對RoBERTa以及BERT model並沒有幫助，反而讓模型的效果下降不少，我認為原因是這些feature extraction的方式影響到了原本的語意，因此不適合用在BERT-base model上再做embedding，反而是這樣的pretrained model可以自己學習到語意以及這些資訊，因此不適合、不需要額外再做之前Lab中練習的feature extraction。但這不代表feature extraction沒有意義，但這裡我並沒有找到針對BERT-base model適合的feature extraction方式。