### **Part 3 - Report**
---

#### Table of Contents

1. Preprocessing
    - Training Data
    - Testing Data
2. Model
    - Preparing
    - Splitting Data
    - Loading Tokenizer and Encoding Data
    - Setting Up Pre-Trained Model
    - Creating Data Loaders
    - Setting Up Optimiser and Scheduler
    - Defining Performance Metrics
    - Creating Training Loop
3. Prediction
4. Different Things I Tried

---

#### **1. Preprocessing** 
---

##### Training Data

In [None]:
import json
import pandas as pd

In [None]:
tweets_id = []
tweets_hashtags = []
tweets_text = []
# Each line of the file is one JSON object; the tweet is nested under _source.tweet.
with open("dm2022-isa5810-lab2-homework/tweets_DM.json", 'r', encoding='utf-8') as file:
    for line in file:
        tweet = json.loads(line)["_source"]["tweet"]
        tweets_id.append(tweet["tweet_id"])
        tweets_hashtags.append(tweet["hashtags"])
        tweets_text.append(tweet["text"])

# Build the DataFrame directly from the collected columns.
src = pd.DataFrame({"id": tweets_id, "hashtags": tweets_hashtags, "text": tweets_text})

src

Unnamed: 0,id,hashtags,text
0,0x376b20,[Snapchat],"People who post ""add me on #Snapchat"" must be ..."
1,0x2d5350,"[freepress, TrumpLegacy, CNN]","@brianklaas As we see, Trump is dangerous to #..."
2,0x28b412,[bibleverse],"Confident of your obedience, I write to you, k..."
3,0x1cd5b0,[],Now ISSA is stalking Tasha 😂😂😂 <LH>
4,0x2de201,[],"""Trust is not the same as faith. A friend is s..."
...,...,...,...
1867530,0x316b80,"[mixedfeeling, butimTHATperson]",When you buy the last 2 tickets remaining for ...
1867531,0x29d0cb,[],I swear all this hard work gone pay off one da...
1867532,0x2a6a4f,[],@Parcel2Go no card left when I wasn't in so I ...
1867533,0x24faed,[],"Ah, corporate life, where you can date <LH> us..."
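
The loading cell above relies on each line of `tweets_DM.json` being a standalone JSON object with the tweet nested under `_source.tweet`. A minimal sketch of that parsing step, run on two made-up records (the sample lines and the `parse_tweet` helper are illustrative, not taken from the dataset or the notebook):

```python
import json

# Two hypothetical records in the same JSON-lines layout as tweets_DM.json.
sample_lines = [
    '{"_source": {"tweet": {"tweet_id": "0x000001", "hashtags": ["demo"], "text": "hello #demo"}}}',
    '{"_source": {"tweet": {"tweet_id": "0x000002", "hashtags": [], "text": "no tags here"}}}',
]

def parse_tweet(line):
    """Return the (id, hashtags, text) triple nested under _source.tweet."""
    tweet = json.loads(line)["_source"]["tweet"]
    return tweet["tweet_id"], tweet["hashtags"], tweet["text"]

rows = [parse_tweet(line) for line in sample_lines]
print(rows)
```

Reading the file line by line like this keeps only one raw record in memory at a time, which matters for a file with about 1.87 million tweets.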


In [None]:
colnames=['id', 'emotion'] 

label = pd.read_csv('dm2022-isa5810-lab2-homework/emotion.csv', names=colnames, header=0)

label

Unnamed: 0,id,emotion
0,0x3140b1,sadness
1,0x368b73,disgust
2,0x296183,anticipation
3,0x2bd6e1,joy
4,0x2ee1dd,anticipation
...,...,...
1455558,0x38dba0,joy
1455559,0x300ea2,joy
1455560,0x360b99,fear
1455561,0x22eecf,joy


In [None]:
train_df = pd.merge(src, label, on="id", how="left")

train_df

Unnamed: 0,id,hashtags,text,emotion
0,0x376b20,[Snapchat],"People who post ""add me on #Snapchat"" must be ...",anticipation
1,0x2d5350,"[freepress, TrumpLegacy, CNN]","@brianklaas As we see, Trump is dangerous to #...",sadness
2,0x28b412,[bibleverse],"Confident of your obedience, I write to you, k...",
3,0x1cd5b0,[],Now ISSA is stalking Tasha 😂😂😂 <LH>,fear
4,0x2de201,[],"""Trust is not the same as faith. A friend is s...",
...,...,...,...,...
1867530,0x316b80,"[mixedfeeling, butimTHATperson]",When you buy the last 2 tickets remaining for ...,
1867531,0x29d0cb,[],I swear all this hard work gone pay off one da...,
1867532,0x2a6a4f,[],@Parcel2Go no card left when I wasn't in so I ...,
1867533,0x24faed,[],"Ah, corporate life, where you can date <LH> us...",joy


In [None]:
train_df = train_df.dropna(axis=0, how='any') #drop all rows that have any NaN values

train_df

Unnamed: 0,id,hashtags,text,emotion
0,0x376b20,[Snapchat],"People who post ""add me on #Snapchat"" must be ...",anticipation
1,0x2d5350,"[freepress, TrumpLegacy, CNN]","@brianklaas As we see, Trump is dangerous to #...",sadness
3,0x1cd5b0,[],Now ISSA is stalking Tasha 😂😂😂 <LH>,fear
5,0x1d755c,"[authentic, LaughOutLoud]",@RISKshow @TheKevinAllison Thx for the BEST TI...,joy
6,0x2c91a8,[],Still waiting on those supplies Liscus. <LH>,anticipation
...,...,...,...,...
1867526,0x321566,"[NoWonder, Happy]",I'm SO HAPPY!!! #NoWonder the name of this sho...,joy
1867527,0x38959e,[],In every circumtance I'd like to be thankful t...,joy
1867528,0x2cbca6,[blessyou],there's currently two girls walking around the...,joy
1867533,0x24faed,[],"Ah, corporate life, where you can date <LH> us...",joy


In [None]:
train_df['emotion'].value_counts()

joy             516017
anticipation    248935
trust           205478
sadness         193437
disgust         139101
fear             63999
surprise         48729
anger            39867
Name: emotion, dtype: int64
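
The counts above are skewed toward `joy`, which is worth keeping in mind when reading the weighted F1 score later. A small sketch that turns the printed counts into class shares and an imbalance ratio (plain Python, with the numbers copied from the output above):

```python
# Label counts copied from the value_counts() output above.
counts = {
    "joy": 516017, "anticipation": 248935, "trust": 205478, "sadness": 193437,
    "disgust": 139101, "fear": 63999, "surprise": 48729, "anger": 39867,
}

total = sum(counts.values())
shares = {emotion: n / total for emotion, n in counts.items()}

for emotion, share in shares.items():
    print(f"{emotion:<13}{share:6.1%}")

# Ratio between the most and the least frequent class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"total={total}, max/min ratio={imbalance_ratio:.1f}")
```

`joy` accounts for roughly 35% of the labelled tweets, and the largest class is about 13 times the size of the smallest one.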

##### Testing Data



In [None]:
colnames=['id', 'emotion'] 

test = pd.read_csv('dm2022-isa5810-lab2-homework/sampleSubmission.csv', names=colnames, header=0)

test

Unnamed: 0,id,emotion
0,0x2c7743,surprise
1,0x2c1eed,surprise
2,0x2826ea,surprise
3,0x356d9a,surprise
4,0x20fd95,surprise
...,...,...
411967,0x351857,surprise
411968,0x2c028e,surprise
411969,0x1f2430,surprise
411970,0x2be24e,surprise


In [None]:
test_df = pd.merge(src, test, on="id", how="left")

test_df

Unnamed: 0,id,hashtags,text,emotion
0,0x376b20,[Snapchat],"People who post ""add me on #Snapchat"" must be ...",
1,0x2d5350,"[freepress, TrumpLegacy, CNN]","@brianklaas As we see, Trump is dangerous to #...",
2,0x28b412,[bibleverse],"Confident of your obedience, I write to you, k...",surprise
3,0x1cd5b0,[],Now ISSA is stalking Tasha 😂😂😂 <LH>,
4,0x2de201,[],"""Trust is not the same as faith. A friend is s...",surprise
...,...,...,...,...
1867530,0x316b80,"[mixedfeeling, butimTHATperson]",When you buy the last 2 tickets remaining for ...,surprise
1867531,0x29d0cb,[],I swear all this hard work gone pay off one da...,surprise
1867532,0x2a6a4f,[],@Parcel2Go no card left when I wasn't in so I ...,surprise
1867533,0x24faed,[],"Ah, corporate life, where you can date <LH> us...",


In [None]:
test_df = test_df.dropna(axis=0, how='any') #drop all rows that have any NaN values

test_df

Unnamed: 0,id,hashtags,text,emotion
2,0x28b412,[bibleverse],"Confident of your obedience, I write to you, k...",surprise
4,0x2de201,[],"""Trust is not the same as faith. A friend is s...",surprise
9,0x218443,"[materialism, money, possessions]",When do you have enough ? When are you satisfi...,surprise
30,0x2939d5,"[GodsPlan, GodsWork]","God woke you up, now chase the day #GodsPlan #...",surprise
33,0x26289a,[],"In these tough times, who do YOU turn to as yo...",surprise
...,...,...,...,...
1867525,0x2913b4,[],"""For this is the message that ye heard from th...",surprise
1867529,0x2a980e,[],"""There is a lad here, which hath five barley l...",surprise
1867530,0x316b80,"[mixedfeeling, butimTHATperson]",When you buy the last 2 tickets remaining for ...,surprise
1867531,0x29d0cb,[],I swear all this hard work gone pay off one da...,surprise


In [None]:
test_df = test_df.drop(['hashtags', 'emotion'], axis=1)

test_df

Unnamed: 0,id,text
2,0x28b412,"Confident of your obedience, I write to you, k..."
4,0x2de201,"""Trust is not the same as faith. A friend is s..."
9,0x218443,When do you have enough ? When are you satisfi...
30,0x2939d5,"God woke you up, now chase the day #GodsPlan #..."
33,0x26289a,"In these tough times, who do YOU turn to as yo..."
...,...,...
1867525,0x2913b4,"""For this is the message that ye heard from th..."
1867529,0x2a980e,"""There is a lad here, which hath five barley l..."
1867530,0x316b80,When you buy the last 2 tickets remaining for ...
1867531,0x29d0cb,I swear all this hard work gone pay off one da...


#### **2. Model** 
---

##### Preparing

In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.10.1 tokenizers-0.13.2 transformers-4.24.0


In [None]:
import torch
import random
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from transformers import BertTokenizer
from transformers import BertForSequenceClassification
#from transformers import RobertaTokenizer
#from transformers import RobertaForSequenceClassification
from transformers import AdamW, get_linear_schedule_with_warmup
from tqdm.notebook import tqdm

In [None]:
df = train_df

df = df.drop(['hashtags'], axis=1)

df.columns = ['id', 'text', 'category']

possible_labels = df.category.unique()

label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

df['label'] = df.category.replace(label_dict)

df

Unnamed: 0,id,text,category,label
0,0x376b20,"People who post ""add me on #Snapchat"" must be ...",anticipation,0
1,0x2d5350,"@brianklaas As we see, Trump is dangerous to #...",sadness,1
3,0x1cd5b0,Now ISSA is stalking Tasha 😂😂😂 <LH>,fear,2
5,0x1d755c,@RISKshow @TheKevinAllison Thx for the BEST TI...,joy,3
6,0x2c91a8,Still waiting on those supplies Liscus. <LH>,anticipation,0
...,...,...,...,...
1867526,0x321566,I'm SO HAPPY!!! #NoWonder the name of this sho...,joy,3
1867527,0x38959e,In every circumtance I'd like to be thankful t...,joy,3
1867528,0x2cbca6,there's currently two girls walking around the...,joy,3
1867533,0x24faed,"Ah, corporate life, where you can date <LH> us...",joy,3
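
The label ids above depend on the order in which the categories first appear in the data, so the same mapping has to be reused at prediction time. A minimal sketch of that mapping and its inverse, using only the first few categories shown above (the short `categories` list is a stand-in for the real column):

```python
# Categories in order of appearance, as in the first rows shown above.
categories = ["anticipation", "sadness", "fear", "joy", "anticipation"]

# dict.fromkeys keeps first-seen order while dropping duplicates,
# which mirrors what Series.unique() does for the real column.
label_dict = {name: index for index, name in enumerate(dict.fromkeys(categories))}
label_dict_inverse = {index: name for name, index in label_dict.items()}

labels = [label_dict[name] for name in categories]
print(label_dict)
print(labels)
```

Keeping `label_dict_inverse` around avoids retyping the id-to-emotion table by hand later.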


##### Splitting Data

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=17, 
                                                  stratify=df.label.values)

In [None]:
df['data_type'] = ['not_set']*df.shape[0]

In [None]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [None]:
df.groupby(['category', 'label', 'data_type']).count()

category,label,data_type,id,text
anger,4,train,33887,33887
anger,4,val,5980,5980
anticipation,0,train,211595,211595
anticipation,0,val,37340,37340
disgust,6,train,118236,118236
disgust,6,val,20865,20865
fear,2,train,54399,54399
fear,2,val,9600,9600
joy,3,train,438614,438614
joy,3,val,77403,77403
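
As a sanity check on the stratified split, the sketch below recomputes the expected split sizes and the validation share per class from the numbers shown above. It assumes the validation size is obtained by rounding `test_size * n_samples` up; only the classes visible in the output are included.

```python
import math

n_samples = 1455563   # labelled tweets (sum of the value_counts() output)
test_size = 0.15

n_val = math.ceil(test_size * n_samples)   # assumption: the validation share is rounded up
n_train = n_samples - n_val
print(n_train, n_val)

# Per-class (train, val) counts copied from the groupby output above.
split_counts = {
    "anger": (33887, 5980),
    "anticipation": (211595, 37340),
    "disgust": (118236, 20865),
    "fear": (54399, 9600),
    "joy": (438614, 77403),
}
val_share = {name: val / (train + val) for name, (train, val) in split_counts.items()}
for name, share in val_share.items():
    print(f"{name:<13}{share:.4f}")
```

Every class lands at a validation share of about 0.15, which is what `stratify=df.label.values` is meant to guarantee.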


##### Loading Tokenizer and Encoding Data

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
#tokenizer = RobertaTokenizer.from_pretrained('roberta-base', do_lower_case=True)
#tokenizer = RobertaTokenizer.from_pretrained('distilroberta-base', do_lower_case=True)

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
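    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # cut tweets longer than max_length so all rows have equal length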
    max_length=256, 
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
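    padding='max_length',  # replaces the deprecated pad_to_max_length=True
    truncation=True,       # cut tweets longer than max_length so all rows have equal length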
    max_length=256, 
    return_tensors='pt'
)

input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)

In [None]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [None]:
print(len(dataset_train), len(dataset_val))

1237228 218335


##### Setting Up Pre-Trained Model

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                            num_labels=len(label_dict),
                            output_attentions=False,
                            output_hidden_states=False)

#model = RobertaForSequenceClassification.from_pretrained('roberta-base',
#                            num_labels=len(label_dict),
#                            output_attentions=False,
#                            output_hidden_states=False)

#model = RobertaForSequenceClassification.from_pretrained('distilroberta-base',
#                            num_labels=len(label_dict),
#                            output_attentions=False,
#                            output_hidden_states=False)

##### Creating Data Loaders

In [None]:
batch_size = 32

dataloader_train = DataLoader(dataset_train, 
                sampler=RandomSampler(dataset_train), 
                batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                  sampler=SequentialSampler(dataset_val), 
                  batch_size=batch_size)

##### Setting Up Optimiser and Scheduler

In [None]:
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)

epochs = 3

scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=len(dataloader_train)*epochs)
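
To make the schedule concrete, the sketch below works out the number of optimizer steps for this setup and re-implements a linear warmup and decay rule in plain Python. It is an illustration of the idea, not the library code; `linear_lr` is a hypothetical helper name.

```python
import math

n_train = 1237228      # training examples, i.e. len(dataset_train)
batch_size = 32
epochs = 3
base_lr = 1e-5
num_warmup_steps = 0

steps_per_epoch = math.ceil(n_train / batch_size)
num_training_steps = steps_per_epoch * epochs

def linear_lr(step):
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

print(steps_per_epoch, num_training_steps)
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step))
```

With no warmup steps, the learning rate starts at `1e-5` and falls in a straight line to zero at the last step of the third epoch.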

##### Defining Performance Metrics

In [None]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
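
For reference, a weighted F1 score is the average of the per-class F1 scores, each weighted by how many true examples that class has. The sketch below computes it by hand on a tiny made-up example; `weighted_f1` is an illustrative helper, not the scikit-learn implementation.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)

# Toy example with an imbalanced label set (hypothetical data).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(weighted_f1(y_true, y_pred))
```

Because the weights follow the class sizes, frequent classes such as `joy` dominate this metric, and weak performance on rare classes is partly hidden.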

##### Creating Training Loop

In [None]:
seed_val = 777
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [None]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        # loss is already the mean over the batch; len(batch) is the number of tensors (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)             
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 1
Training loss: 1.0544061688715913
Validation loss: 0.9543728305153285
F1 Score (Weighted): 0.6456101635917261

Epoch 2
Training loss: 0.8976780332281135
Validation loss: 0.9265554194512927
F1 Score (Weighted): 0.6586710650381051

Epoch 3
Training loss: 0.8207980599967363
Validation loss: 0.9309109879714136
F1 Score (Weighted): 0.6629221272055813

#### **3. Prediction** 
---

In [None]:
test = test_df.set_index('id').T.to_dict('list')

test

In [None]:
# Map label ids back to emotion names (inverse of label_dict), i.e.
# {0: 'anticipation', 1: 'sadness', 2: 'fear', 3: 'joy', 4: 'anger', 5: 'trust', 6: 'disgust', 7: 'surprise'}
ids_to_labels = {index: emotion for emotion, index in label_dict.items()}

model.eval()  # disable dropout for inference

label = []
for tweet_id in test:
  sentence = test[tweet_id]  # one-element list holding the tweet text

  inputs = tokenizer(sentence, padding='max_length', truncation=True, max_length=256, return_tensors="pt")

  # to gpu
  ids = inputs["input_ids"].to(device)
  mask = inputs["attention_mask"].to(device)

  # to model; no gradients are needed for prediction
  with torch.no_grad():
    outputs = model(ids, attention_mask=mask)
  logits = outputs[0]  # shape (batch_size, num_labels)

  prediction = torch.argmax(logits, dim=1).item()  # id of the highest-scoring class
  label.append(ids_to_labels[prediction])
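
The loop above sends one tweet at a time to the model, which is slow for a test set of over 400,000 tweets. One possible speed-up is to group the texts into batches and take the argmax per row. The sketch below shows only the batching and label-lookup logic in plain Python; `batched`, `argmax_labels`, and the fake logits are hypothetical stand-ins, and the model call itself is left out.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def argmax_labels(logit_rows, id_to_label):
    """Map each row of logits to the label with the highest score."""
    return [id_to_label[max(range(len(row)), key=row.__getitem__)] for row in logit_rows]

# Hypothetical stand-ins for tweet texts and for the model's output logits.
texts = ["tweet a", "tweet b", "tweet c"]
id_to_label = {0: "anticipation", 1: "sadness", 2: "fear"}
fake_logits = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [-0.5, 0.0, 0.9]]

print(list(batched(texts, 2)))
print(argmax_labels(fake_logits, id_to_label))
```

In the real notebook, each batch of texts would be tokenized together and passed through the model once, with the resulting logits handled row by row as above.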

In [None]:
fin_df = test_df

fin_df = fin_df.assign(emotion = label)

fin_df = fin_df.drop(['text'], axis=1)

fin_df

In [None]:
fin_df.to_csv('/kaggle/working/submission.csv', index=False)

#### **4. Different Things I Tried** 
---

- I first used `BERT` as my base pre-trained model, and then also tried `RoBERTa`.

- The main differences between `RoBERTa` and `BERT` are: 1. `RoBERTa` is pre-trained on more data. 2. Its pre-training drops the next-sentence prediction task and uses only the cloze (masked language modeling) task.

- I fine-tuned `BERT` for 3 epochs. Because my Kaggle training quota had been used up by then, I fine-tuned `RoBERTa` for only 1 epoch.

- On the public leaderboard, the `BERT` version scored 0.55008 and the `RoBERTa` version scored 0.55377.

- After that I tried `DistilRoBERTa`, a distilled version of the `RoBERTa` model. I trained it for 2 epochs, and it scored 0.54156.

| Model | Epochs | Public leaderboard score |
| ----- | ----- | ----- |
| `BERT` | 3 | 0.55008 |
| `RoBERTa` | 1 | 0.55377 |
| `DistilRoBERTa` | 2 | 0.54156 |