## Sentiment Analysis with Deep Learning using BERT

### Project Outline

**Task 1**: Introduction 

**Task 2**: Exploratory Data Analysis and Preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading Tokenizer and Encoding our Data

**Task 5**: Setting up BERT Pretrained Model

**Task 6**: Creating Data Loaders

**Task 7**: Setting Up Optimizer and Scheduler

**Task 8**: Defining our Performance Metrics

**Task 9**: Creating our Training Loop

**Task 10**: Loading and Evaluating our Model

### Introduction

BERT is a large, transformer-based language model that can be fine-tuned for a variety of downstream tasks, including the sentiment classification done in this notebook.

For more information, see the [original paper](https://arxiv.org/abs/1810.04805) and the [HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html).

### Exploratory Data Analysis and Preprocessing

In [1]:
import torch

In [2]:
import pandas as pd
from tqdm.notebook import tqdm

In [3]:
df = pd.read_csv(
    'smileannotationsfinal.csv',
    names=['id', 'text', 'category'])
df.set_index('id', inplace=True)

In [4]:
df.head()

| id | text | category |
|---|---|---|
| 611857364396965889 | @aandraous @britishmuseum @AndrewsAntonio Merc... | nocode |
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy |


In [5]:
df.text.iloc[1]

'Dorian Gray with Rainbow Scarf #LoveWins (from @britishmuseum http://t.co/Q4XSwL0esu) http://t.co/h0evbTBWRq'

In [6]:
df.category.value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|disgust             2
sad|angry               2
sad|disgust|angry       1
Name: category, dtype: int64

In [7]:
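# drop tweets annotated with more than one emotion (e.g. 'happy|sad')
df = df[~df.category.str.contains(r'\|')]
# drop tweets that carry no emotion annotation
df = df[df.category != 'nocode']
df.category.value_counts()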

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64
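
The counts above are heavily skewed towards `happy`. As a rough illustration (plain Python, with the counts copied by hand from the output above rather than read from `df`), the class shares can be computed like this:

```python
# Class counts copied from the value_counts() output above.
counts = {
    "happy": 1137,
    "not-relevant": 214,
    "angry": 57,
    "surprise": 35,
    "sad": 32,
    "disgust": 6,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"{label:<13}{share:6.1%}")

# Ratio between the largest and the smallest class.
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"largest/smallest class: {imbalance_ratio:.1f}x")
```

With roughly three quarters of the tweets in a single class, a model that always predicts `happy` would already look good on plain accuracy, which is why the split is stratified and F1 is used later on.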

In [8]:
possible_labels = df.category.unique()
possible_labels

array(['happy', 'not-relevant', 'angry', 'disgust', 'sad', 'surprise'],
      dtype=object)

In [9]:
labels_dict = {}
for index, possible_label in enumerate(possible_labels):
    labels_dict[possible_label] = index
labels_dict

{'angry': 2,
 'disgust': 3,
 'happy': 0,
 'not-relevant': 1,
 'sad': 4,
 'surprise': 5}

In [10]:
df['label'] = df.category.replace(labels_dict)
df.head()

| id | text | category | label |
|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 |


### Training/Validation Split

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_val, y_train, y_val = train_test_split(
        df.index.values,
        df.label.values,
        test_size=0.15,
        random_state=17,
        stratify=df.label.values
)

In [13]:
df['data_type'] = 'no_set'  # placeholder, overwritten with 'train' / 'val' in the next cell
df.head()

| id | text | category | label | data_type |
|---|---|---|---|---|
| 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy | 0 | no_set |
| 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy | 0 | no_set |
| 614877582664835073 | @Sofabsports thank you for following me back. ... | happy | 0 | no_set |
| 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy | 0 | no_set |
| 611570404268883969 | @NationalGallery @ThePoldarkian I have always ... | happy | 0 | no_set |


In [14]:
df.loc[X_train,'data_type'] = 'train'
df.loc[X_val,'data_type'] = 'val'
df.groupby(['category','label', 'data_type']).count()

| category | label | data_type | text |
|---|---|---|---|
| angry | 2 | train | 48 |
| angry | 2 | val | 9 |
| disgust | 3 | train | 5 |
| disgust | 3 | val | 1 |
| happy | 0 | train | 966 |
| happy | 0 | val | 171 |
| not-relevant | 1 | train | 182 |
| not-relevant | 1 | val | 32 |
| sad | 4 | train | 27 |
| sad | 4 | val | 5 |
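
The table shows that every class keeps roughly the same 85/15 ratio in both parts, which is what `stratify` is for. The sketch below illustrates the idea in plain Python. It is not scikit-learn's implementation; `stratified_split` is a hypothetical helper and the toy labels are made up.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.15, seed=17):
    """Split indices so each class contributes about `test_size` of its samples to validation."""
    rng = random.Random(seed)

    # Group sample indices by class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one validation sample per class.
        n_val = max(1, round(len(indices) * test_size))
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy, imbalanced label list (made up for illustration).
labels = ["happy"] * 100 + ["sad"] * 20
train_idx, val_idx = stratified_split(labels)

val_counts = Counter(labels[i] for i in val_idx)
print(val_counts)
```

Without stratification, a rare class such as `disgust` (6 samples) could easily end up entirely in the training set, leaving nothing to validate on.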


### Loading Tokenizer and Encoding our Data

In [16]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [17]:
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_lower_case=True
)
import logging
logging.basicConfig(level=logging.ERROR)

In [18]:
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=="train"].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors="pt"
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=="val"].text.values,
    add_special_tokens=True,
    return_attention_mask=True,
    padding=True,
    truncation=True,
    max_length=256,
    return_tensors="pt"
)

input_ids_train = encoded_data_train["input_ids"]
attention_masks_train = encoded_data_train["attention_mask"]
labels_train = torch.tensor(df[df.data_type=="train"].label.values)

input_ids_val = encoded_data_val["input_ids"]
attention_masks_val = encoded_data_val["attention_mask"]
labels_val = torch.tensor(df[df.data_type=="val"].label.values)

In [19]:
dataset_train = TensorDataset(input_ids_train,
                              attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val,
                            attention_masks_val, labels_val)

In [20]:
len(dataset_train)

1258

In [21]:
len(dataset_val)

223
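
To make `padding`, `truncation` and `return_attention_mask` more concrete, here is a minimal sketch of what they amount to for a batch of already tokenized texts. `pad_batch` is a hypothetical helper, not part of the tokenizer API, and the token ids are illustrative; only the roles of `101` (`[CLS]`), `102` (`[SEP]`) and `0` (`[PAD]`) follow the `bert-base-uncased` vocabulary.

```python
def pad_batch(token_id_lists, pad_id=0, max_length=256):
    """Truncate to max_length, pad to the longest sequence, and build attention masks."""
    truncated = [ids[:max_length] for ids in token_id_lists]
    longest = max(len(ids) for ids in truncated)

    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_mask


batch = [
    [101, 7592, 102],             # short text
    [101, 7592, 2088, 999, 102],  # longer text
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets BERT process sequences of different lengths in one tensor without the padding positions influencing the result.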

### Setting up BERT Pretrained Model

In [22]:
from transformers import BertForSequenceClassification

In [23]:
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels = len(labels_dict),
    output_attentions=False,
    output_hidden_states=False
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

### Creating Data Loaders

In [24]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [25]:
batch_size = 32

dataloader_train = DataLoader(
    dataset_train,
    sampler=RandomSampler(dataset_train),
    batch_size=batch_size
)

dataloader_val = DataLoader(
    dataset_val,
    sampler=SequentialSampler(dataset_val),  # order does not matter for evaluation, so no shuffling
    batch_size=batch_size
)

### Setting Up Optimizer and Scheduler

In [26]:
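from torch.optim import AdamW  # the AdamW class shipped with transformers is deprecated in favor of PyTorch's
from transformers import get_linear_schedule_with_warmup

In [27]:
optimizer = AdamW(
    model.parameters(),
    lr=1e-5,           # the BERT paper suggests 5e-5, 3e-5 or 2e-5 for fine-tuning; a smaller value is used here
    eps=1e-8,
    weight_decay=0.0   # keeps the default of the deprecated transformers AdamW
)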

In [28]:
epochs = 10

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    # total number of optimizer steps (batches per epoch times epochs);
    # the learning rate decays linearly to 0 over these steps
    num_training_steps=len(dataloader_train) * epochs
)
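
For intuition, the schedule can be written as a multiplier on the base learning rate. The sketch below is a plain-Python illustration under that assumption; `linear_schedule_factor` is a hypothetical name, and the step count is derived from the 1258 training samples and batch size 32 used above.

```python
import math


def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
steps_per_epoch = math.ceil(1258 / 32)     # number of training batches per epoch
num_training_steps = steps_per_epoch * 10  # 10 epochs

for step in (0, 100, 200, 300, 400):
    factor = linear_schedule_factor(step, 0, num_training_steps)
    print(f"step {step:3d}: lr = {base_lr * factor:.2e}")
```

With `num_warmup_steps=0` the learning rate starts at its full value and shrinks a little after every batch, reaching zero at the last step of the last epoch.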

### Defining our Performance Metrics

#### The accuracy approach is adapted from the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification)

In [29]:
import numpy as np
from sklearn.metrics import f1_score

In [30]:
dict_inverse = {v: k for k, v in labels_dict.items()}
dict_inverse

{0: 'happy',
 1: 'not-relevant',
 2: 'angry',
 3: 'disgust',
 4: 'sad',
 5: 'surprise'}

In [31]:
# We use the F1 score because the classes are imbalanced; plain accuracy would give a skewed picture.
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()  # index of the highest logit per sample, as a flat 1-D array
    labels_flat = labels.flatten()
    # "weighted" averages the per-class F1 scores, weighting each class by its number of true samples
    return f1_score(labels_flat, preds_flat, average="weighted")
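
As a worked example of what the weighted average does, the following sketch recomputes it by hand in plain Python. `weighted_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation, and the toy labels are made up.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with each class weighted by its number of true samples."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n_true - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += f1 * n_true / total
    return score


# Toy example: three samples of class 0, one sample of class 1.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 1, 1]
print(round(weighted_f1(y_true, y_pred), 4))
```

Here class 0 has F1 = 0.8 and class 1 has F1 = 2/3, so the weighted score is 0.75 · 0.8 + 0.25 · 2/3. Note that frequent classes dominate this average, so a high weighted F1 can still hide poor performance on rare classes.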

In [32]:
def accuracy_per_class(preds, labels):
    # For each true class, report how many of its samples were predicted correctly.
    labels_dict_inverse = {v: k for k, v in labels_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()  # predicted class index per sample
    labels_flat = labels.flatten()                   # true class index per sample
    
    for label in np.unique(labels_flat):  # iterate over the unique labels in our dataset
        # boolean mask: take all predictions for samples whose true label is `label` (e.g. angry)
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {labels_dict_inverse[label]}')
        # correctly predicted samples of this class / all true samples of this class
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')

### Creating our Training Loop

#### Approach adapted from [an older version of HuggingFace's run_glue.py script](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128)

In [33]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)  # seeds every GPU (e.g. in Colab); silently ignored when CUDA is unavailable

In [34]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [35]:
# Evaluation only: no weights are updated. model.eval() switches off dropout and
# torch.no_grad() switches off gradient tracking; we only need the loss and the logits.

def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        # no gradient tracking needed during evaluation
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]      # unnormalized class scores, used as our predictions
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy() # when running on a GPU, move the tensor to the CPU so NumPy can use it
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
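
`evaluate` returns raw logits, one row of six scores per tweet. The sketch below shows, in plain Python, how such a row could be turned into probabilities and a class name. The logit values are made up for illustration, and `softmax` and `logits_to_labels` are hypothetical helpers; only the index-to-label mapping is taken from `dict_inverse` above.

```python
import math

# Mapping copied from the dict_inverse output above.
index_to_label = {0: 'happy', 1: 'not-relevant', 2: 'angry',
                  3: 'disgust', 4: 'sad', 5: 'surprise'}


def softmax(row):
    """Turn a row of logits into probabilities that sum to 1."""
    shifted = [x - max(row) for x in row]  # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_labels(logit_rows):
    """Pick the class with the highest logit in each row."""
    names = []
    for row in logit_rows:
        best = max(range(len(row)), key=row.__getitem__)
        names.append(index_to_label[best])
    return names


# Made-up logits for two tweets.
example_logits = [
    [2.1, 0.3, -1.0, -2.2, -1.5, -0.7],
    [-0.4, 1.8, 0.2, -1.9, -1.1, -0.6],
]
print(logits_to_labels(example_logits))
print([round(p, 3) for p in softmax(example_logits[0])])
```

Softmax does not change which class has the highest score, which is why the metric functions above can apply `argmax` to the logits directly.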

In [36]:
# epoch loop
for epoch in tqdm(range(1, epochs+1)):
    
    model.train() 
    
    loss_train_total = 0
    
    progress_bar = tqdm(dataloader_train, 
                        desc=" Epoch {:1d}".format(epoch),
                        leave=False, # to overwrite each epoch
                        disable=False)
    # batch loop
    for batch in progress_bar:
        
        model.zero_grad() # reset the gradients from the previous batch; PyTorch would otherwise accumulate them across backward() calls
        
        batch = tuple(b.to(device) for b in batch) #each item of tuple is on correct device. important if you use gpu and cuda
        
        inputs = {                     # what goes into bert model
        "input_ids"      : batch[0],
        "attention_mask" : batch[1],
        "labels"         : batch[2]
        }
        
        outputs = model(**inputs)  # to get outputs, just run our model. (**inputs - unpacks the dictionary)
        
        # With labels supplied, the model returns the loss first and the logits (unnormalized class scores) second.
        # For training we only need the loss.
        loss = outputs[0]  
        loss_train_total += loss.item() # accumulate the loss over the epoch
        loss.backward() # backpropagation
        
        # Clip the gradients: if their combined norm over all parameters exceeds 1.0, they are scaled down to that norm.
        # This guards against exploding gradients and keeps the updates stable.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        
        optimizer.step()
        scheduler.step()
        
        # update progress_bar to show loss per batch
        # show the loss of the current batch next to the progress bar (the loss is already averaged over the batch)
        progress_bar.set_postfix({"training_loss": "{:.3f}".format(loss.item())})
    
    # save the model weights after every epoch (the Models/ directory must already exist)
    torch.save(model.state_dict(), f"Models/Bert_ft_epoch{epoch}.model")
    
    # Reporting what epoch
    tqdm.write(f"\nEpoch: {epoch}")
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    # The average training loss per epoch shows whether training is still improving or has levelled off
    tqdm.write(f"Training loss: {loss_train_avg}")
    
    # Evaluate on the validation set to detect overfitting: training loss keeps decreasing while validation loss goes up
    val_loss, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals) # weighted f1 score
    tqdm.write(f"Validation loss: {val_loss}")
    tqdm.write(f" F1 Score (weighted): {val_f1}")
    
    

Epoch: 1
Training loss: 1.1075460880994796
Validation loss: 0.7748265436717442
 F1 Score (weighted): 0.6656119824269878

Epoch: 2
Training loss: 0.695071841776371
Validation loss: 0.6787866396563393
 F1 Score (weighted): 0.7224692489127791

Epoch: 3
Training loss: 0.5486723475158215
Validation loss: 0.6016379381929126
 F1 Score (weighted): 0.7609793530131173

Epoch: 4
Training loss: 0.4433597411960363
Validation loss: 0.5416471532412938
 F1 Score (weighted): 0.7983081940481044

Epoch: 5
Training loss: 0.36410230249166486
Validation loss: 0.5575605034828186
 F1 Score (weighted): 0.819874901797898

Epoch: 6
Training loss: 0.3179525567218661
Validation loss: 0.5339785835572651
 F1 Score (weighted): 0.8174600007894299

Epoch: 7
Training loss: 0.2724010307341814
Validation loss: 0.5656996071338654
 F1 Score (weighted): 0.8292513947430034

Epoch: 8
Training loss: 0.24829029217362403
Validation loss: 0.5419247363294873
 F1 Score (weighted): 0.8388620269734415

Epoch: 9
Training loss: 0.23218769934028388
Validation loss: 0.5617750776665551
 F1 Score (weighted): 0.8480024021915179

Epoch: 10
Training loss: 0.22001583548262715
Validation loss: 0.5612936743668148
 F1 Score (weighted): 0.8532466874414127
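
Since a checkpoint is saved after every epoch, the logged metrics can be used to decide which one to keep. The sketch below is one possible way to do that, using the validation numbers copied by hand from the log above; picking by lowest validation loss versus highest weighted F1 is a judgment call, not something the notebook prescribes.

```python
# Validation metrics per epoch, copied from the training log above.
val_loss = [
    0.7748265436717442, 0.6787866396563393, 0.6016379381929126,
    0.5416471532412938, 0.5575605034828186, 0.5339785835572651,
    0.5656996071338654, 0.5419247363294873, 0.5617750776665551,
    0.5612936743668148,
]
val_f1 = [
    0.6656119824269878, 0.7224692489127791, 0.7609793530131173,
    0.7983081940481044, 0.819874901797898, 0.8174600007894299,
    0.8292513947430034, 0.8388620269734415, 0.8480024021915179,
    0.8532466874414127,
]

# Epochs are numbered from 1, list indices from 0.
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
best_f1_epoch = max(range(len(val_f1)), key=val_f1.__getitem__) + 1

print(f"lowest validation loss: epoch {best_loss_epoch}")
print(f"highest weighted F1:    epoch {best_f1_epoch}")
print(f"checkpoint with lowest loss: Models/Bert_ft_epoch{best_loss_epoch}.model")
```

The training loss keeps falling over all ten epochs while the validation loss stops improving around the middle of the run, which is the overfitting pattern described in the training loop comments.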



### Loading and Evaluating our Model

In [42]:
# re-create the model from the pretrained checkpoint; the fine-tuned weights are loaded into it below
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                    num_labels = len(labels_dict),
                                                    output_attentions=False,
                                                    output_hidden_states=False)

In [43]:
model.to(device)
pass

In [53]:
model.load_state_dict(
    torch.load('Models/Bert_ft_epoch10.model',
    map_location=torch.device('cpu')))
# expected output: "All keys matched successfully"

<All keys matched successfully>

In [54]:
_, predictions, true_vals = evaluate(dataloader_val)


In [55]:
accuracy_per_class(predictions, true_vals)

Class: happy
Accuracy: 164/171

Class: not-relevant
Accuracy: 20/32

Class: angry
Accuracy: 8/9

Class: disgust
Accuracy: 0/1

Class: sad
Accuracy: 0/5

Class: surprise
Accuracy: 2/5
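
To put these per-class numbers next to each other, the sketch below computes the overall accuracy and the unweighted mean of the per-class accuracies. It is plain Python with the counts copied by hand from the output above, so it summarizes the reported results rather than re-running the model.

```python
# (correct, total) per class, copied from the accuracy_per_class output above.
per_class = {
    "happy": (164, 171),
    "not-relevant": (20, 32),
    "angry": (8, 9),
    "disgust": (0, 1),
    "sad": (0, 5),
    "surprise": (2, 5),
}

correct = sum(c for c, _ in per_class.values())
total = sum(n for _, n in per_class.values())
overall_accuracy = correct / total

# Unweighted mean over classes: every class counts the same, regardless of size.
macro_accuracy = sum(c / n for c, n in per_class.values()) / len(per_class)

print(f"overall accuracy:        {correct}/{total} = {overall_accuracy:.3f}")
print(f"mean per-class accuracy: {macro_accuracy:.3f}")
```

The gap between the two numbers reflects the class imbalance: the model does well on `happy`, which dominates the validation set, but misses all `sad` and `disgust` examples.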

