# A Visual Notebook to Using BERT

**Status: Work in progress. Check back later.**

Source: https://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-sentence-classification.png" height="600" width="960">

In this notebook, we will use a pre-trained deep learning model to process some text. We will then use the output of that model to classify the text. The text is a list of sentences from film reviews, and we will classify each sentence as speaking either "positively" or "negatively" about its subject.

## Models: Sentence Sentiment Classification
Our goal is to create a model that takes a sentence (just like the ones in our dataset) and produces either 1 (indicating the sentence carries a positive sentiment) or a 0 (indicating the sentence carries a negative sentiment). We can think of it as looking like this:

<img src="https://jalammar.github.io/images/distilBERT/sentiment-classifier-1.png" height="600" width="960">

Under the hood, the model is actually made up of two models.

* DistilBERT processes the sentence and passes along some information it extracted from it on to the next model. DistilBERT is a smaller version of BERT developed and open sourced by the team at HuggingFace. It’s a lighter and faster version of BERT that roughly matches its performance.
* The next model, a basic Logistic Regression model from scikit-learn, will take in the result of DistilBERT’s processing and classify the sentence as either positive or negative (1 or 0, respectively).

The data we pass between the two models is a vector of size 768. We can think of this vector as an embedding for the sentence that we can use for classification.

<img src="https://jalammar.github.io/images/distilBERT/distilbert-bert-sentiment-classifier.png" height="600" width="960">

## Dataset
The dataset we will use in this example is [SST2](https://nlp.stanford.edu/sentiment/index.html), which contains sentences from movie reviews, each labeled as either positive (has the value 1) or negative (has the value 0):

<table class="features-table">
  <tr>
    <th class="mdc-text-light-green-600">
    sentence
    </th>
    <th class="mdc-text-purple-600">
    label
    </th>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      a stirring , funny and finally transporting re imagining of beauty and the beast and 1930s horror films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      apparently reassembled from the cutting room floor of any given daytime soap
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      they presume their audience won't sit still for a sociology lesson
    </td>
    <td class="mdc-bg-purple-50">
      0
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      this is a visually stunning rumination on love , memory , history and the war between art and commerce
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
  <tr>
    <td class="mdc-bg-light-green-50" style="text-align:left">
      jonathan parker 's bartleby should have been the be all end all of the modern office anomie films
    </td>
    <td class="mdc-bg-purple-50">
      1
    </td>
  </tr>
</table>

## The transformers library
Let's start by importing the huggingface transformers library so we can load our deep learning NLP model.

In [0]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
import torch
import transformers as ppb
import warnings
warnings.filterwarnings('ignore')

### DEVICE selection

In [0]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Importing the dataset
We'll use pandas to read the dataset and load it into a dataframe.

In [0]:
df = pd.read_csv('https://github.com/clairett/pytorch-sentiment-classification/raw/master/data/SST2/train.tsv', delimiter='\t', header=None)
df.head() # 13840 rows

Unnamed: 0,0,1
0,"a stirring , funny and finally transporting re...",1
1,apparently reassembled from the cutting room f...,0
2,they presume their audience wo n't sit still f...,0
3,this is a visually stunning rumination on love...,1
4,jonathan parker 's bartleby should have been t...,1


For performance reasons, we'll only use the first 3,200 sentences from the dataset:

In [0]:
batch_1 = df[:3200]

We can ask pandas how many sentences are labeled as "positive" (value 1) and how many are labeled "negative" (having the value 0)

In [0]:
batch_1[1].value_counts()

1    1672
0    1528
Name: 1, dtype: int64

## Loading the Pre-trained BERT model
Let's now load a pre-trained BERT model. 

In [0]:
# For DistilBERT:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

## Want BERT instead of distilBERT? Uncomment the following line:
# model_class, tokenizer_class, pretrained_weights = (ppb.BertModel, ppb.BertTokenizer, 'bert-base-uncased')

# Load pretrained tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)

# model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=numlabels)

# Load pretrained model
model = model_class.from_pretrained(pretrained_weights)
# model.to(device) # In case of using GPUs

Right now, the variable `model` holds a pretrained distilBERT model -- a version of BERT that is smaller, but much faster and requiring a lot less memory.

## Model #1: Preparing the Dataset
Before we can hand our sentences to BERT, we need to do some minimal processing to put them in the format it requires.

### Tokenization
Our first step is to tokenize the sentences -- break them up into words and subwords in the format BERT is comfortable with.

In [0]:
tokenized = batch_1[0].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))
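
To make "words and subwords" concrete, here is a minimal sketch of greedy longest-match-first subword splitting, the general idea behind BERT-style WordPiece tokenization. The vocabulary `toy_vocab` and the helpers `wordpiece` and `encode_toy` are hypothetical and exist only for illustration; the real tokenizer uses a vocabulary of roughly 30,000 entries, applies extra text normalization, and returns integer ids rather than strings.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-tokens by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary (illustration only).
toy_vocab = {"[CLS]", "[SEP]", "[UNK]", "a", "visually", "stun", "##ning", "film"}


def encode_toy(sentence):
    """Lowercase, split on whitespace, and wrap the pieces in [CLS] ... [SEP]."""
    tokens = ["[CLS]"]
    for word in sentence.lower().split():
        tokens.extend(wordpiece(word, toy_vocab))
    tokens.append("[SEP]")
    return tokens


tokens = encode_toy("a visually stunning film")
print(tokens)
# ['[CLS]', 'a', 'visually', 'stun', '##ning', 'film', '[SEP]']
```

The `[CLS]` and `[SEP]` markers correspond to what `add_special_tokens=True` asks the real tokenizer to add.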

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tokenization-2-token-ids.png" height="600" width="960" />
         
#### Flowing Through DistilBERT
Passing the input vector through DistilBERT works just like BERT. The output is a vector for each input token, and each vector is made up of 768 numbers (floats).

<img src="https://jalammar.github.io/images/distilBERT/bert-model-input-output-1.png" height="600" width="960" />

Because this is a sentence classification task, we ignore all but the first vector (the one associated with the [CLS] token). That one vector is what we pass as input to the logistic regression model.

<img src="https://jalammar.github.io/images/distilBERT/bert-model-calssification-output-vector-cls.png" height="600" width="960"/>
          

### Padding
After tokenization, `tokenized` is a list of sentences -- each sentence is represented as a list of tokens. We want BERT to process our examples all at once (as one batch). It's just faster that way. For that reason, we need to pad all lists to the same size, so we can represent the input as one 2-d array, rather than a list of lists (of different lengths).

In [0]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
print("Done")

Our dataset is now in the `padded` variable. We can view its dimensions below:

In [0]:
np.array(padded).shape

### Masking
If we directly send `padded` to BERT, that would slightly confuse it. We need to create another variable to tell it to ignore (mask) the padding we've added when it's processing its input. That's what attention_mask is:

In [0]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape
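
As a small illustration of padding and masking together, the sketch below applies the same two steps to two toy sentences. The names `toy_tokenized`, `toy_padded` and `toy_mask` are hypothetical, and the ids are chosen for illustration: 101 and 102 stand for `[CLS]` and `[SEP]`, and 0 is the padding id this notebook pads with.

```python
import numpy as np

# Two toy "sentences" of different lengths, already converted to token ids.
toy_tokenized = [
    [101, 1037, 102],
    [101, 2023, 2003, 1037, 102],
]

# Pad every list with zeros up to the length of the longest one.
toy_max_len = max(len(ids) for ids in toy_tokenized)
toy_padded = np.array(
    [ids + [0] * (toy_max_len - len(ids)) for ids in toy_tokenized]
)

# The mask is 1 where there is a real token and 0 where there is padding.
toy_mask = np.where(toy_padded != 0, 1, 0)

print(toy_padded)
# [[ 101 1037  102    0    0]
#  [ 101 2023 2003 1037  102]]
print(toy_mask)
# [[1 1 1 0 0]
#  [1 1 1 1 1]]
```

Both arrays have the same shape, which is what lets the model pair each input position with its mask value.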

## Model #1: And Now, Deep Learning!
Now that we have our model and inputs ready, let's run our model!

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-tutorial-sentence-embedding.png" height="600" width="960" />

The `model()` function runs our sentences through BERT. The results of the processing will be returned into `last_hidden_states`.

In [0]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

In [0]:
input_ids = input_ids.type(torch.LongTensor)
attention_mask = attention_mask.type(torch.LongTensor)
# b_labels = b_labels.type(torch.LongTensor)

In [0]:
# In case you're using GPUs
# input_ids = input_ids.to(device)
# attention_mask = attention_mask.to(device)

In [0]:
# https://github.com/huggingface/transformers/issues/2952
with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

### Recapping a sentence’s journey
Each row is associated with a sentence from our dataset. To recap the processing path of the first sentence, we can think of it as looking like this:

<img src="https://jalammar.github.io/images/distilBERT/bert-input-to-output-tensor-recap.png" height="600" width="960"/>

Let's slice only the part of the output that we need: the output corresponding to the first token of each sentence. The way BERT does sentence classification is that it adds a token called `[CLS]` (for classification) at the beginning of every sentence. The output corresponding to that token can be thought of as an embedding for the entire sentence.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-tensor-selection.png" height="600" width="960"/>

After slicing, `features` is a 2-D NumPy array containing the sentence embeddings of all the sentences in our dataset.

<img src="https://jalammar.github.io/images/distilBERT/bert-output-cls-senteence-embeddings.png" height="600" width="960"/>

The tensor we sliced from BERT's output

We'll save those in the `features` variable, as they'll serve as the features for our logistic regression model.

In [0]:
last_hidden_states[0].shape

In [0]:
features = last_hidden_states[0][:,0,:].numpy()
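
The slice `[:, 0, :]` can be tried on a stand-in array without running the model. The sketch below uses random numbers in place of real hidden states; `toy_hidden` and `toy_features` are hypothetical names, and the batch and sequence sizes are arbitrary.

```python
import numpy as np

# Stand-in for the model output: (number of sentences, tokens per sentence, hidden size).
batch, seq_len, hidden = 4, 6, 768
toy_hidden = np.random.default_rng(0).normal(size=(batch, seq_len, hidden))

# Keep every sentence (:), only the first token position (0), and all hidden units (:).
toy_features = toy_hidden[:, 0, :]

print(toy_hidden.shape)    # (4, 6, 768)
print(toy_features.shape)  # (4, 768)
```

Each row of `toy_features` is the vector at position 0 of the matching sentence, which is where the `[CLS]` token sits.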

The labels indicating which sentence is positive and negative now go into the `labels` variable

In [0]:
labels = batch_1[1]

## Model #2: Train/Test Split
Let's now split our dataset into a training set and a testing set (even though we're using 3,200 sentences from the SST2 training set).

In [0]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

<img src="https://jalammar.github.io/images/distilBERT/bert-distilbert-train-test-split-sentence-embedding.png" height="600" width="960" />

### [Bonus] Grid Search for Parameters
We can dive into Logistic regression directly with the Scikit Learn default parameters, but sometimes it's worth searching for the best value of the C parameter, which determines regularization strength.

In [0]:
parameters = {'C': np.linspace(0.0001, 100, 20)}
grid_search = GridSearchCV(LogisticRegression(), parameters)
grid_search.fit(train_features, train_labels)

print('best parameters: ', grid_search.best_params_)
print('best scores: ', grid_search.best_score_)

We now train the LogisticRegression model. If you've chosen to do the gridsearch, you can plug the value of C into the model declaration (e.g. `LogisticRegression(C=5.2)`).
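
To see which values of C the grid search above actually tries, the following sketch prints the grid built by `np.linspace(0.0001, 100, 20)`. The names `c_grid` and `c_step` are hypothetical.

```python
import numpy as np

# Same grid as in the grid-search cell: 20 evenly spaced values of C.
c_grid = np.linspace(0.0001, 100, 20)
c_step = c_grid[1] - c_grid[0]

print(len(c_grid))          # 20
print(round(c_step, 4))     # 5.2632
print(np.round(c_grid[:3], 4))
```

Because the spacing is linear, only the first grid point is below roughly 5, so small values of C are sampled coarsely. A log-spaced grid (for example via `np.logspace`) is a common alternative when stronger regularization should also be explored; that is a suggestion, not part of the original notebook.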

In [0]:
lr_clf = LogisticRegression()
lr_clf.fit(train_features, train_labels)

<img src="https://jalammar.github.io/images/distilBERT/bert-training-logistic-regression.png" />

## Evaluating Model #2
So how well does our model do in classifying sentences? One way is to check the accuracy against the testing dataset:

In [0]:
lr_clf.score(test_features, test_labels)

How good is this score? What can we compare it against? Let's first look at a dummy classifier:

In [0]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
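
Another simple reference point is the accuracy of always predicting the most frequent label. The sketch below computes it from the label counts reported earlier for the 3,200-sentence batch (1,672 positive, 1,528 negative). Note that this is computed on the whole batch rather than on the training split, so it is only a rough guide, and that the default strategy of `DummyClassifier` may differ depending on the scikit-learn version. The names `toy_labels` and `majority_accuracy` are hypothetical.

```python
from collections import Counter

# Rebuild the label distribution from the value_counts() output shown earlier.
toy_labels = [1] * 1672 + [0] * 1528

counts = Counter(toy_labels)
majority_label, majority_count = counts.most_common(1)[0]

# Accuracy obtained by predicting the majority label for every sentence.
majority_accuracy = majority_count / len(toy_labels)

print(majority_label, majority_accuracy)
# 1 0.5225
```

Any classifier worth keeping should score clearly above this value.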

So our model clearly does better than a dummy classifier. But how does it compare against the best models?

## Proper SST2 scores
For reference, the [highest accuracy score](http://nlpprogress.com/english/sentiment_analysis.html) for this dataset is currently **96.8**. DistilBERT can be trained to improve its score on this task – a process called **fine-tuning** which updates BERT’s weights to make it achieve a better performance in this sentence classification task (which we can call the downstream task). The fine-tuned DistilBERT turns out to achieve an accuracy score of **90.7**. The full size BERT model achieves **94.9**.



And that’s it! That’s a good first contact with BERT. The next step would be to head over to the documentation and try your hand at [fine-tuning](https://huggingface.co/transformers/examples.html#glue). You can also go back and switch from distilBERT to BERT and see how that works.

# Sentiment Analysis with Deep Learning using BERT
https://www.coursera.org/learn/sentiment-analysis-bert/

## EDA and pre-processing

In [0]:
import pandas as pd

In [0]:
df = pd.read_csv('datasets/smile-annotations-final.csv',
                 names=['id', 'text', 'category'])
df.set_index('id', inplace = True)
df.head(5) # (3085, 2)

Unnamed: 0_level_0,text,category
id,Unnamed: 1_level_1,Unnamed: 2_level_1
611857364396965889,@aandraous @britishmuseum @AndrewsAntonio Merc...,nocode
614484565059596288,Dorian Gray with Rainbow Scarf #LoveWins (from...,happy
614746522043973632,@SelectShowcase @Tate_StIves ... Replace with ...,happy
614877582664835073,@Sofabsports thank you for following me back. ...,happy
611932373039644672,@britishmuseum @TudorHistory What a beautiful ...,happy


In [0]:
df.category.value_counts()

nocode               1572
happy                1137
not-relevant          214
angry                  57
surprise               35
sad                    32
happy|surprise         11
happy|sad               9
disgust|angry           7
disgust                 6
sad|angry               2
sad|disgust             2
sad|disgust|angry       1
Name: category, dtype: int64

In [0]:
df[~df.category.str.contains(r'\|')].sample(10) # raw string: match a literal '|' (multi-label rows)

Unnamed: 0_level_0,text,category
id,Unnamed: 1_level_1,Unnamed: 2_level_1
614721194097553408,A lot of the scholarly establishment will dism...,nocode
611859875048615936,Sculpture from the Parthenon @britishmuseum w/...,happy
613686536215789570,"FLAMIN: the Artists Present… Screenings, even...",nocode
614113369411289088,Looking forward to Amelia Smith speaking in ‘C...,happy
615141346429435904,What an amazing facility is the @NationalGalle...,happy
613637339794042880,Looking forward to our public engagement event...,happy
610894249719009280,@tateliverpool Really really looking forward t...,happy
615160329086070784,Born #OnThisDay 1577: #PeterPaulReubens. The F...,nocode
610753052333764608,@NationalGallery #AskTheGallery I love the lit...,happy
611086695329480704,"Rayban_Outlet_65,Click_24 http://t.co/sMiDEXlz...",nocode


In [0]:
df = df[~df.category.str.contains(r'\|')] # drop rows with more than one label
df = df[df.category != 'nocode']
display(df.category.value_counts())
df.sample(5) # (1481, 3)

happy           1137
not-relevant     214
angry             57
surprise          35
sad               32
disgust            6
Name: category, dtype: int64

Unnamed: 0_level_0,text,category
id,Unnamed: 1_level_1,Unnamed: 2_level_1
610487926229704704,@NationalGallery #AskTheGallery As a director ...,happy
615059870610497536,@BBC_Culture @britishmuseum @CoppermillPoets s...,happy
612964052684439552,"Very good, unobtrusive use of technology in re...",happy
612193346543583232,Setting up at St Peters Cambridge for #CastleH...,happy
612946639595171840,@britishmuseum turquoise eyes!!!,surprise


In [0]:
possible_labels = df.category.unique()
possible_labels

array(['happy', 'not-relevant', 'angry', 'disgust', 'sad', 'surprise'],
      dtype=object)

In [0]:
label_dict = dict()

for index, possible_label in enumerate(possible_labels): # do not reuse the name `possible_labels` for the loop variable
    label_dict[possible_label] = index
    
label_dict

{'angry': 2,
 'disgust': 3,
 'happy': 0,
 'not-relevant': 1,
 'sad': 4,
 'surprise': 5}

In [0]:
df['label'] = df.category.replace(label_dict)
df.sample(5)

Unnamed: 0_level_0,text,category,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
615136474242138112,Judging by #DefiningBeauty @britishmuseum the ...,happy,0
612701137645502464,@tateliverpool Gorgeous,happy,0
612620979185979392,@Ophiolatrist @britishmuseum The stupid #Frenc...,angry,2
615499699106222080,We are looking forward to the unveiling of the...,happy,0
610850484337967104,@ExeterLiving @ThelmaHulbert Thank you!,happy,0


## Training/Validation Split

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
X_train, X_val, y_train, y_val = train_test_split(df.index.values,
                                                  df.label.values,
                                                  test_size = 0.15,
                                                  random_state = 17,
                                                  stratify = df.label.values)

# X_train -> 1258
# X_val -> 223
# y_train -> 1258
# y_val -> 223

In [0]:
df['data_type'] = ['no-set']*df.shape[0]
df.sample(5)

Unnamed: 0_level_0,text,category,label,data_type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
611573636164681730,@Ophiolatrist @britishmuseum Seriously ? Napol...,angry,2,no-set
612664104013176832,Had a great @IndianaJones tour of the @british...,happy,0,no-set
614789536774778880,#DefiningBeauty @britishmuseum finally visited...,happy,0,no-set
615202453894598656,Had a fantastic day at the @britishmuseum - ma...,happy,0,no-set
614405383889788928,"Piero della Francesca, The #Baptism of #Christ...",not-relevant,1,no-set


In [0]:
df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

In [0]:
df.groupby(['category','label','data_type']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,text
category,label,data_type,Unnamed: 3_level_1
angry,2,train,48
angry,2,val,9
disgust,3,train,5
disgust,3,val,1
happy,0,train,966
happy,0,val,171
not-relevant,1,train,182
not-relevant,1,val,32
sad,4,train,27
sad,4,val,5


In [0]:
df.sample(5)

Unnamed: 0_level_0,text,category,label,data_type
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
614128641329491968,@Smerchant84 RT @lisamaythomas: The Touch Diar...,happy,0,train
614706739645128704,@NationalGallery superb. Great to see you educ...,happy,0,train
610175909195313153,@NationalGallery,not-relevant,1,train
613346237132218372,@NationalGallery The 2rd GENOCIDE against #Bia...,not-relevant,1,val
615490132922249216,"lovely, informative, short film by @HistoryNee...",happy,0,train


## Loading Tokenizer and Encoding the data

In [0]:
import torch
from tqdm.notebook import tqdm
from transformers import BertTokenizer
from torch.utils.data import TensorDataset

In [0]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', # match the checkpoint of the model loaded below
                                          do_lower_case = True) # uncased == lowercase words


In [0]:
encoded_data_train = tokenizer.batch_encode_plus(df[df.data_type == 'train'].text.values,
                                                 add_special_tokens = True,
                                                 return_attention_masks = True,
                                                 pad_to_max_length = True,
                                                 max_length = 256,
                                                 return_tensors = 'pt')

encoded_data_val = tokenizer.batch_encode_plus(df[df.data_type == 'val']['text'].values,
                                                 add_special_tokens = True,
                                                 return_attention_masks = True,
                                                 pad_to_max_length = True,
                                                 max_length = 256,
                                                 return_tensors = 'pt')

In [0]:
inputs_ids_train = encoded_data_train['input_ids']
attention_mask_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df[df.data_type == 'train']['label'].values)

inputs_ids_val = encoded_data_val['input_ids']
attention_mask_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(df[df.data_type == 'val']['label'].values)

In [0]:
dataset_train = TensorDataset(inputs_ids_train,
                              attention_mask_train,
                              labels_train) # len(dataset_train) = 1258 || torch.utils.data.dataset.TensorDataset

dataset_val = TensorDataset(inputs_ids_val,
                            attention_mask_val,
                            labels_val) # len(dataset_val) = 223 || torch.utils.data.dataset.TensorDataset

## Setting up BERT pretrained model

In [0]:
from transformers import BertForSequenceClassification

In [0]:
model = BertForSequenceClassification.from_pretrained('bert-base-uncased',
                                                      num_labels = len(label_dict),
                                                      output_attentions = False,
                                                      output_hidden_states = False)


## Create Data Loaders

In [0]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [0]:
batch_size = 32

dataloader_train = DataLoader(dataset_train,
                              sampler = RandomSampler(dataset_train),
                              batch_size = batch_size) # len(dataloader_train) == 40

dataloader_val = DataLoader(dataset_val,
                            sampler = SequentialSampler(dataset_val), # shuffling is not needed for evaluation
                            batch_size = batch_size) # len(dataloader_val) == 7
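
The loader lengths noted in the comments follow from the dataset sizes and the batch size: the last batch is simply smaller than the others. A minimal sketch of that arithmetic, using the sizes reported above (1,258 training and 223 validation examples); the variable names are hypothetical.

```python
import math

n_train, n_val, toy_batch_size = 1258, 223, 32

# A loader that keeps the final partial batch yields ceil(n / batch_size) batches.
train_batches = math.ceil(n_train / toy_batch_size)
val_batches = math.ceil(n_val / toy_batch_size)

# Size of the final, partial batch in each loader.
last_train_batch = n_train - (train_batches - 1) * toy_batch_size
last_val_batch = n_val - (val_batches - 1) * toy_batch_size

print(train_batches, last_train_batch)  # 40 10
print(val_batches, last_val_batch)      # 7 31
```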

## Setting up Optimizer and Scheduler

In [0]:
from transformers import AdamW, get_linear_schedule_with_warmup

In [0]:
optimizer = AdamW(model.parameters(),
                  lr = 1e-5, #2e-5 > 5e-5
                  eps = 1e-08)

In [0]:
epochs = 12

scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps = 0,
                                            num_training_steps = len(dataloader_train) * epochs, # 40 * 12
                                            last_epoch=-1, # Default value
                                           )
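
To get a feel for what a linear schedule with warmup does to the learning rate, here is a plain-Python sketch of the multiplier applied at each step: it rises linearly from 0 to 1 during warmup, then falls linearly to 0 at the last training step. The helper `linear_lr_factor` is hypothetical and reflects an assumption about how the schedule behaves, not the library's actual code.

```python
def linear_lr_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp up linearly from 0 to 1.
        return step / max(1, num_warmup_steps)
    # Decay: ramp down linearly from 1 to 0 at the final step.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-5
total_steps = 40 * 12  # batches per epoch * epochs, as configured above

# Learning rate at the start, the midpoint, and the end of training (no warmup).
lrs = [base_lr * linear_lr_factor(s, 0, total_steps) for s in (0, 240, 480)]
print(lrs)
```

With `num_warmup_steps = 0`, training starts at the full learning rate, reaches half of it midway, and ends at zero.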

## Defining our performance metrics

The accuracy metric follows the approach of the accuracy function in [this tutorial](https://mccormickml.com/2019/07/22/BERT-fine-tuning/#41-bertforsequenceclassification).

In [0]:
import numpy as np
from sklearn.metrics import f1_score

In [0]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    return f1_score(labels_flat, preds_flat, average = 'weighted')

def accuracy_per_class(preds, labels):
    label_dict_inverted = {v:k for k,v in label_dict.items()} # https://therenegadecoder.com/code/how-to-invert-a-dictionary-in-python/
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat==label]
        y_true = labels_flat[labels_flat==label]
        
        print(f'Class: {label_dict_inverted[label]}')
        
        acc_ =  len(y_preds[y_preds==label])/len(y_true)
        print(f'Accuracy: {acc_}')
        print(f'Result: {len(y_preds[y_preds==label])} of {len(y_true)} \n')
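
The `average = 'weighted'` option matters here because the classes are very imbalanced. As a rough sketch of what it computes: an F1 score is calculated per class, and the scores are averaged with each class weighted by its number of true examples. The function `weighted_f1` below is a hypothetical pure-Python re-implementation for illustration, and the toy labels are made up.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's support."""
    total = len(y_true)
    score = 0.0
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        support = tp + fn  # number of true examples of class c
        score += f1 * support / total
    return score


toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 0]

toy_score = weighted_f1(toy_true, toy_pred)
print(round(toy_score, 4))
# 0.6
```

In this toy case class 2 is never predicted correctly, yet it only lowers the score in proportion to its single example. This is why a weighted F1 can look healthy while rare classes are missed entirely, and why the per-class accuracy function is a useful companion.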

## Creating our Training Loop

Approach adapted from an older version of HuggingFace's `run_glue.py` script. Accessible [here](https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128).

In [0]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [0]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') # call is_available(); the bare function object is always truthy

# device = 'cpu'

model.to(device)
print(device)

cuda


In [0]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in tqdm(dataloader_val):
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals
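
The `predictions` array returned by `evaluate` holds raw logits, one row per example and one column per class. The sketch below shows, on made-up logits, how these are turned into class probabilities and predicted labels. `toy_label_dict` copies the label mapping shown earlier, and all other names and numbers are hypothetical.

```python
import numpy as np

toy_label_dict = {'happy': 0, 'not-relevant': 1, 'angry': 2,
                  'disgust': 3, 'sad': 4, 'surprise': 5}
inverse = {v: k for k, v in toy_label_dict.items()}

# Made-up logits for three examples (rows) and six classes (columns).
toy_logits = np.array([[ 2.1, 0.3, -1.0, -2.0, -1.5,  0.0],
                       [ 0.2, 1.7, -0.5, -1.0, -0.8,  0.1],
                       [-0.3, 0.0,  1.9,  0.4,  0.2, -1.2]])

# Softmax turns logits into probabilities; subtracting the row maximum
# first keeps the exponentials numerically stable.
shifted = toy_logits - toy_logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

# The predicted class is the column with the largest logit.
pred_ids = np.argmax(toy_logits, axis=1)
pred_names = [inverse[int(i)] for i in pred_ids]

print(pred_names)
# ['happy', 'not-relevant', 'angry']
```

Since softmax preserves the ordering within a row, taking the argmax of the logits gives the same prediction as taking the argmax of the probabilities.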

In [0]:
for epoch in tqdm(range(1,epochs+1)):
    model.train()
    
    loss_train_total = 0

    print('--- Training ---')
    
    progress_bar = tqdm(dataloader_train,
                        desc = '-> Train Epoch {:1d}'.format(epoch),
                        leave = False,
                        disable = False)
    
    for batch in progress_bar:
        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }
        
        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
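        loss.backward() # backpropagate to compute the gradients
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) # clip the gradient norm to 1.0 to limit exploding gradients
        
        optimizer.step()
        scheduler.step()
        
        # the loss is already averaged over the batch, so report it directly
        progress_bar.set_postfix({'TRAINING loss':'{:.3f}'.format(loss.item())})
        
    # save one checkpoint per epoch (use `epoch`, not `epochs`, so earlier files are not overwritten)
    torch.save(model.state_dict(), f'Models/BERT_ft_coursera_epoch{epoch}.model')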
    
    tqdm.write(f'-> Val Epoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)
    tqdm.write(f'-> Loss Train Avg: {loss_train_avg}')

    print('--- Evaluating ---')
    
    val_loss_avg, predictions, true_vals = evaluate(dataloader_val)
    val_f1 = f1_score_func(predictions, true_vals)
    
    tqdm.write(f'Validation loss: {val_loss_avg}')
    tqdm.write(f'F1 Score [weighted]: {val_f1}')

-> Val Epoch 1
-> Loss Train Avg: 0.12687300350517033
--- Evaluating ---
Validation loss: 0.5372943324702126
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 2
-> Loss Train Avg: 0.12973791658878325
--- Evaluating ---
Validation loss: 0.5347048044204712
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 3
-> Loss Train Avg: 0.12448257757350803
--- Evaluating ---
Validation loss: 0.5352461380617959
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 4
-> Loss Train Avg: 0.12626283227000384
--- Evaluating ---
Validation loss: 0.5352387470858437
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 5
-> Loss Train Avg: 0.12904668739065528
--- Evaluating ---
Validation loss: 0.5367149476494107
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 6
-> Loss Train Avg: 0.1265225626528263
--- Evaluating ---
Validation loss: 0.5341151697295052
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 7
-> Loss Train Avg: 0.13552328376099468
--- Evaluating ---
Validation loss: 0.5349770890814918
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 8
-> Loss Train Avg: 0.1261987017467618
--- Evaluating ---
Validation loss: 0.5362387214388166
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 9
-> Loss Train Avg: 0.13507912177592515
--- Evaluating ---
Validation loss: 0.5347526669502258
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 10
-> Loss Train Avg: 0.13013515351340174
--- Evaluating ---
Validation loss: 0.5345186633723122
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 11
-> Loss Train Avg: 0.12935251630842687
--- Evaluating ---
Validation loss: 0.5347629402365003
F1 Score [weighted]: 0.8485280819320433

-> Val Epoch 12
-> Loss Train Avg: 0.12594856666401028
--- Evaluating ---
Validation loss: 0.5356734246015549
F1 Score [weighted]: 0.8485280819320433


## Loading and Evaluating our Model

In [0]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=len(label_dict),
                                                      output_attentions=False,
                                                      output_hidden_states=False)

In [0]:
model.to('cuda')
pass

In [0]:
model.load_state_dict(torch.load('Models/BERT_ft_coursera_epoch12.model',  # finetuned_bert_epoch_1_gpu_trained.model
                                 map_location=torch.device('cuda'))) 

<All keys matched successfully>

In [0]:
_, predictions, true_vals = evaluate(dataloader_val)


In [0]:
# BERT_ft_coursera_epoch12
accuracy_per_class(predictions, true_vals)

Class: happy
Accuracy: 0.9649122807017544
Result: 165 of 171 

Class: not-relevant
Accuracy: 0.5625
Result: 18 of 32 

Class: angry
Accuracy: 0.8888888888888888
Result: 8 of 9 

Class: disgust
Accuracy: 0.0
Result: 0 of 1 

Class: sad
Accuracy: 0.0
Result: 0 of 5 

Class: surprise
Accuracy: 0.4
Result: 2 of 5 

