# Sentiment Analysis with Deep Learning using BERT

### Prerequisites

- Intermediate-level knowledge of Python 3 (NumPy and Pandas preferably, but not required)
- Exposure to PyTorch usage
- Basic understanding of Deep Learning and Language Models (BERT specifically)

### Project Outline

**Task 1**: Introduction (this section)

**Task 2**: Exploratory Data Analysis and Preprocessing

**Task 3**: Training/Validation Split

**Task 4**: Loading Tokenizer and Encoding our Data

**Task 5**: Setting up BERT Pretrained Model

**Task 6**: Creating Data Loaders

**Task 7**: Setting Up Optimizer and Scheduler

**Task 8**: Defining our Performance Metrics

**Task 9**: Creating our Training Loop

## Introduction

### What is BERT

BERT is a large-scale, transformer-based language model that can be fine-tuned for a variety of tasks.

For more information, the original paper can be found [here](https://arxiv.org/abs/1810.04805). 

[HuggingFace documentation](https://huggingface.co/transformers/model_doc/bert.html)

[Bert documentation](https://characters.fandom.com/wiki/Bert_(Sesame_Street)) ;)

<img src="Images/BERT_diagrams.pdf" width="1000">

## Exploratory Data Analysis and Preprocessing

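The SMILE Twitter dataset is cited below. Note that the cells in this notebook load `labeled_comments.csv`, whose rows carry a comment and a `theme` label, rather than the SMILE file.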

_Wang, Bo; Tsakalidis, Adam; Liakata, Maria; Zubiaga, Arkaitz; Procter, Rob; Jensen, Eric (2016): SMILE Twitter Emotion dataset. figshare. Dataset. https://doi.org/10.6084/m9.figshare.3187909.v2_

In [None]:
import torch
import pandas as pd
from tqdm.notebook import tqdm

In [None]:
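# `names=` without `header=0` keeps the file's own header line (ID, Comments, Theme)
# as a data row, as the output below shows; add `header=0` to skip it.
df = pd.read_csv('labeled_comments.csv', names=['id', 'text', 'theme'])
df.set_index('id', inplace=True)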

In [None]:
df.head()

| id | text | theme |
|---|---|---|
| ID | Comments | Theme |
| 1 | Just filled out the form on “Contact MileagePl... | Contact Form Bug |
| 2 | DOES NOT ACCEPT MY UPLOAD :( | upload |
| 3 | Hello I have an issue. I applied for a united ... | User Error |
| 4 | There NEEDS to be an "AND" or an "OR" between ... | Content Clarity |


In [None]:
# Lowercase the theme labels so differently capitalized spellings share one class
df['theme'] = df['theme'].str.lower()
df.head()


| id | text | theme |
|---|---|---|
| ID | Comments | theme |
| 1 | Just filled out the form on “Contact MileagePl... | contact form bug |
| 2 | DOES NOT ACCEPT MY UPLOAD :( | upload |
| 3 | Hello I have an issue. I applied for a united ... | user error |
| 4 | There NEEDS to be an "AND" or an "OR" between ... | content clarity |


In [None]:
df.theme.value_counts()

customer support              250
refund                        153
error                         132
covid requirements unclear    132
chat                          127
upload                        120
social issues                  95
bug                            86
sweepstakes                    75
etc/ffc                        73
baggage                        67
login and password             65
contact and support issues     63
issues during travel           63
design complaint               38
pet                            27
checkin                        22
content clarity                21
policy complaint               18
account and login              17
ticketing                       7
agent compliment                5
checkin issues                  4
reciepts and other proofs       4
ffc                             4
inflight purch                  3
irrop                           2
booking                         2
contact form bug                2
change fee inf

In [None]:
# Keep only themes with more than 5 comments; rarer themes (and the stray
# header row read in as data) are dropped.
df = df.groupby('theme').filter(lambda x: len(x) > 5)
df.theme.value_counts()

customer support              250
refund                        153
covid requirements unclear    132
error                         132
chat                          127
upload                        120
social issues                  95
bug                            86
sweepstakes                    75
etc/ffc                        73
baggage                        67
login and password             65
contact and support issues     63
issues during travel           63
design complaint               38
pet                            27
checkin                        22
content clarity                21
policy complaint               18
account and login              17
ticketing                       7
Name: theme, dtype: int64

In [None]:
possible_labels = df.theme.unique()

In [None]:
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
# Map each theme string to its integer id (every theme is a key of label_dict)
df['label'] = df.theme.map(label_dict)

In [None]:
df.head()

| id | text | theme | label |
|---|---|---|---|
| 2 | DOES NOT ACCEPT MY UPLOAD :( | upload | 0 |
| 4 | There NEEDS to be an "AND" or an "OR" between ... | content clarity | 1 |
| 6 | I tried submitting this earlier... I upload... | upload | 0 |
| 7 | The Covid related travel restrictions you just... | covid requirements unclear | 2 |
| 8 | I am not getting the option to upload my covid... | upload | 0 |


## Training/Validation Split

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=17, 
                                                  stratify=df.label.values)

In [None]:
df['data_type'] = 'not_set'

In [None]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [None]:
df.groupby(['theme', 'label', 'data_type']).count()

| theme | label | data_type | text |
|---|---|---|---|
| account and login | 20 | train | 14 |
| account and login | 20 | val | 3 |
| baggage | 17 | train | 57 |
| baggage | 17 | val | 10 |
| bug | 8 | train | 73 |
| bug | 8 | val | 13 |
| chat | 3 | train | 108 |
| chat | 3 | val | 19 |
| checkin | 19 | train | 19 |
| checkin | 19 | val | 3 |
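
As a quick sanity check on `stratify`, the validation share of each theme can be computed from the counts in the table above. The sketch below hard-codes the five themes shown there; the helper name `val_share` is hypothetical and not part of the notebook.

```python
# (train, val) counts copied from the groupby output above
split_counts = {
    'account and login': (14, 3),
    'baggage': (57, 10),
    'bug': (73, 13),
    'chat': (108, 19),
    'checkin': (19, 3),
}

def val_share(train_count, val_count):
    """Fraction of a class's rows that landed in the validation set."""
    return val_count / (train_count + val_count)

shares = {theme: val_share(*counts) for theme, counts in split_counts.items()}
for theme, share in shares.items():
    print(f'{theme:<20} {share:.3f}')
```

With `test_size=0.15`, each share should sit near 0.15. Small classes deviate more because the counts have to be whole numbers.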


## Loading Tokenizer and Encoding our Data

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.9.2-py3-none-any.whl (2.6 MB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting huggingface-hub==0.0.12
  Downloading huggingface_hub-0.0.12-py3-none-any.whl (37 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.45-py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninsta

In [None]:
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)

In [None]:
# padding='max_length' replaces the deprecated pad_to_max_length=True;
# truncation=True makes the cut at max_length explicit.
encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length', 
    truncation=True, 
    max_length=256, 
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    df[df.data_type=='val'].text.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    padding='max_length', 
    truncation=True, 
    max_length=256, 
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type=='train'].label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type=='val'].label.values)


In [None]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [None]:
len(dataset_train)

1403

In [None]:
len(dataset_val)

248
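
To make the effect of padding and truncating to `max_length` concrete, here is a minimal pure-Python sketch of how a list of token ids becomes a fixed-length `input_ids` row plus its `attention_mask`. The function `pad_or_truncate` and the toy ids are hypothetical. The real tokenizer also inserts and preserves special tokens, which this sketch ignores.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # cut sequences that are too long
    n_pad = max_length - len(kept)                   # pad slots needed for short ones
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Toy ids for illustration only (not real vocabulary entries)
short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore padded positions. In the notebook every row has length 256, so both tensors have shape (number of examples, 256).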

## Setting up BERT Pretrained Model

In [None]:
from transformers import BertForSequenceClassification

In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)



Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Creating Data Loaders

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [None]:
batch_size = 32

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)

## Setting Up Optimizer and Scheduler

In [None]:
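# transformers.AdamW is deprecated in favor of the PyTorch optimizer
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

In [None]:
# weight_decay=0.0 matches the default of the deprecated transformers.AdamW
optimizer = AdamW(model.parameters(),
                  lr=1e-5, 
                  eps=1e-8,
                  weight_decay=0.0)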

In [None]:
epochs = 20

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
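
The scheduler scales the base learning rate by a factor that ramps up linearly during warmup and then decays linearly to zero. The sketch below reimplements that factor in plain Python for illustration. `linear_warmup_factor` is a hypothetical name, and the step count assumes the 1403 training examples and `batch_size = 32` from the earlier cells.

```python
import math

def linear_warmup_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

steps_per_epoch = math.ceil(1403 / 32)   # 1403 training rows, batch_size = 32
total_steps = steps_per_epoch * 20       # epochs = 20
base_lr = 1e-5

# Learning rate at the start, midpoint, and end of training (no warmup)
for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_warmup_factor(step, 0, total_steps)
    print(step, lr)
```

With `num_warmup_steps=0`, the learning rate starts at its full value of `1e-5` and falls linearly to zero at the final step.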

## Defining our Performance Metrics

The accuracy metric follows the approach of the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [None]:
import numpy as np

In [None]:
from sklearn.metrics import f1_score

In [None]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')

In [None]:
def accuracy_per_class(preds, labels):
    label_dict_inverse = {v: k for k, v in label_dict.items()}
    
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {len(y_preds[y_preds==label])}/{len(y_true)}\n')
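
`average='weighted'` computes an F1 score per class and averages them, weighting each class by its number of true examples (its support). The following pure-Python sketch spells that out on toy labels. `weighted_f1` is a hypothetical helper written for illustration, not the scikit-learn implementation.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        n_pred = sum(1 for p in y_pred if p == cls)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        # F1 is defined as 0 when the class has no true positives
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(round(weighted_f1(y_true, y_pred), 4))
```

Here class 0 has F1 0.75 with support 4 and class 1 has F1 0.5 with support 2, so the weighted score is (0.75·4 + 0.5·2)/6 ≈ 0.667. Because large classes dominate this average, `accuracy_per_class` above is a useful complement for spotting rare themes that the model never predicts correctly.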

## Creating our Training Loop

The training loop is adapted from an older version of HuggingFace's `run_glue.py` script, available [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [None]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [None]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        # loss is already the batch mean; len(batch) is 3 (the tuple size), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
         
        
    torch.save(model.state_dict(), f'finetuned_BERT_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)            
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 1
Training loss: 2.986733252351934
Validation loss: 2.8813958168029785
F1 Score (Weighted): 0.018154020614381466


Epoch 2
Training loss: 2.7962769757617605
Validation loss: 2.705287516117096
F1 Score (Weighted): 0.1182972126402414


Epoch 3
Training loss: 2.596252820708535
Validation loss: 2.4969874024391174
F1 Score (Weighted): 0.21328801771268335


Epoch 4
Training loss: 2.345874748446725
Validation loss: 2.229147434234619
F1 Score (Weighted): 0.3695178014421619


Epoch 5
Training loss: 2.084644244475798
Validation loss: 2.0241172164678574
F1 Score (Weighted): 0.4178790273486356


Epoch 6
Training loss: 1.8482587337493896
Validation loss: 1.8323257863521576
F1 Score (Weighted): 0.49616681207439395


Epoch 7
Training loss: 1.6586092629215934
Validation loss: 1.6843547374010086
F1 Score (Weighted): 0.5082030197690498


Epoch 8
Training loss: 1.4848367734388872
Validation loss: 1.5804213136434555
F1 Score (Weighted): 0.5703604635111745


Epoch 9
Training loss: 1.3558995601805774
Validation loss: 1.4826981276273727
F1 Score (Weighted): 0.5866525445737734


Epoch 10
Training loss: 1.2383924654938958
Validation loss: 1.4194718971848488
F1 Score (Weighted): 0.5894892627730846


Epoch 11
Training loss: 1.1387143514373086
Validation loss: 1.3617082610726357
F1 Score (Weighted): 0.6166667457886029


Epoch 12
Training loss: 1.0553776242516257
Validation loss: 1.3262663930654526
F1 Score (Weighted): 0.6263453653182722


Epoch 13
Training loss: 0.9817622507160361
Validation loss: 1.283662810921669
F1 Score (Weighted): 0.6280407275575173


Epoch 14
Training loss: 0.9195272041992708
Validation loss: 1.2482188045978546
F1 Score (Weighted): 0.6345972835927262


Epoch 15
Training loss: 0.8747890781272541
Validation loss: 1.2273388355970383
F1 Score (Weighted): 0.6510902639245626


Epoch 16
Training loss: 0.8365956802259792
Validation loss: 1.2112998887896538
F1 Score (Weighted): 0.6634059447254806


Epoch 17
Training loss: 0.7989911897615953
Validation loss: 1.1990215554833412
F1 Score (Weighted): 0.6591143530766229


Epoch 18
Training loss: 0.7763664424419403
Validation loss: 1.1857652105391026
F1 Score (Weighted): 0.6650084412960078


Epoch 19
Training loss: 0.7664011039517142
Validation loss: 1.1821846477687359
F1 Score (Weighted): 0.668306548965263


Epoch 20
Training loss: 0.7520389353687112
Validation loss: 1.1804797016084194
F1 Score (Weighted): 0.6610378536264898
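
Since a checkpoint is saved after every epoch, the logged validation scores can be used to decide which one to reload. The sketch below copies the weighted F1 values from the log above (rounded to four decimals) and picks the epoch with the highest score. This selection rule is an assumption for illustration; the notebook itself reloads the epoch 20 checkpoint.

```python
# Weighted validation F1 per epoch, rounded from the training log above
val_f1_by_epoch = {
    1: 0.0182, 2: 0.1183, 3: 0.2133, 4: 0.3695, 5: 0.4179,
    6: 0.4962, 7: 0.5082, 8: 0.5704, 9: 0.5867, 10: 0.5895,
    11: 0.6167, 12: 0.6263, 13: 0.6280, 14: 0.6346, 15: 0.6511,
    16: 0.6634, 17: 0.6591, 18: 0.6650, 19: 0.6683, 20: 0.6610,
}

# Epoch whose checkpoint scored best on the validation set
best_epoch = max(val_f1_by_epoch, key=val_f1_by_epoch.get)
best_checkpoint = f'finetuned_BERT_epoch_{best_epoch}.model'
print(best_epoch, best_checkpoint)
```

By this rule epoch 19 comes out ahead, although validation loss is still marginally lower at epoch 20, so the two criteria disagree slightly and the choice between them is a judgment call.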



In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
model.load_state_dict(torch.load('finetuned_BERT_epoch_20.model', map_location=torch.device('cpu')))

<All keys matched successfully>

In [None]:
_, predictions, true_vals = evaluate(dataloader_validation)

In [None]:
accuracy_per_class(predictions, true_vals)

Class: upload
Accuracy: 15/18

Class: content clarity
Accuracy: 0/3

Class: covid requirements unclear
Accuracy: 15/20

Class: chat
Accuracy: 19/19

Class: refund
Accuracy: 22/23

Class: login and password
Accuracy: 6/10

Class: pet
Accuracy: 1/4

Class: customer support
Accuracy: 33/38

Class: bug
Accuracy: 5/13

Class: issues during travel
Accuracy: 2/9

Class: design complaint
Accuracy: 0/6

Class: etc/ffc
Accuracy: 11/11

Class: error
Accuracy: 12/20

Class: policy complaint
Accuracy: 0/3

Class: contact and support issues
Accuracy: 0/9

Class: social issues
Accuracy: 12/14

Class: ticketing
Accuracy: 0/1

Class: baggage
Accuracy: 10/10

Class: sweepstakes
Accuracy: 11/11

Class: checkin
Accuracy: 0/3

Class: account and login
Accuracy: 0/3
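
The per-class counts above can be folded into two summary numbers: overall accuracy (total correct over total examples) and the unweighted mean of the per-class accuracies, which treats rare themes the same as common ones. The sketch below hard-codes the printed counts; the variable names are hypothetical.

```python
# (correct, total) per class, copied from the accuracy_per_class output above
per_class = {
    'upload': (15, 18), 'content clarity': (0, 3),
    'covid requirements unclear': (15, 20), 'chat': (19, 19),
    'refund': (22, 23), 'login and password': (6, 10), 'pet': (1, 4),
    'customer support': (33, 38), 'bug': (5, 13),
    'issues during travel': (2, 9), 'design complaint': (0, 6),
    'etc/ffc': (11, 11), 'error': (12, 20), 'policy complaint': (0, 3),
    'contact and support issues': (0, 9), 'social issues': (12, 14),
    'ticketing': (0, 1), 'baggage': (10, 10), 'sweepstakes': (11, 11),
    'checkin': (0, 3), 'account and login': (0, 3),
}

n_correct = sum(c for c, _ in per_class.values())
n_total = sum(t for _, t in per_class.values())
overall_accuracy = n_correct / n_total
mean_class_accuracy = sum(c / t for c, t in per_class.values()) / len(per_class)

print(f'{n_correct}/{n_total} = {overall_accuracy:.3f}')
print(f'mean per-class accuracy = {mean_class_accuracy:.3f}')
```

The totals add up to the 248 validation examples. The mean per-class accuracy comes out lower than the overall accuracy because several small classes have no correct predictions at all.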

