In [None]:
import torch

# Check whether GPU is successfully acquired
if torch.cuda.is_available():    

    #Set device used to be the GPU if available  
    device = torch.device("cuda")
    print('GPU found:', torch.cuda.get_device_name(0))
else:
    print('GPU could not connect')
    device = torch.device("cpu")

GPU found: Tesla T4


In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import os
from datetime import datetime
from collections import Counter
from torch.utils.data import DataLoader
from torch.nn.utils import clip_grad_norm_
import random
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedKFold
!pip install transformers
from transformers import BertTokenizer,BertForSequenceClassification,AdamW
from transformers import get_linear_schedule_with_warmup

In [None]:
#Fix random seeds so that results can be reproduced
torch.manual_seed(42)
np.random.seed(42)
random.seed(42)

In [None]:
# Mount a drive where the model will be exported later on
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## This notebook is meant to familiarize the reader with the process of fine-tuning the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model originally developed by Devlin et al. (2018).

### The notebook is structured as follows:
1. General Information
  1. Bidirectional Encoder Representations from Transformers (BERT) Model
  1. BERT Model Implementation
  1. Fine-tuning Process Overview
 
1. Brief Dataset Overview (the reader can get more detailed information about the dataset in the 'Model Data Generation and Exploration' notebook).
1. Creating a Custom Dataset Class as Required by Pytorch.
1. Generating Train and Test Splits.
1. Training Process
 1. Choice of Hyperparameters
 1. Choice of Evaluation Metrics 
 1. Training Procedure
 1. Evaluation Procedure
2. Exported Model Testing Procedure
  

Detailed description is provided in each subpoint.

- **Important notes about recreating results**: 
 - Should the reader wish to recreate the training procedure, a **Tesla T4** (or better) GPU should be used. GPUs with less memory will likely make training impossible due to memory constraints.
 - Fine-tuning was done using Google Colab, hence some cells contain commands specific to the Colab environment (e.g. mounting Google Drive and importing the dataset). These are not needed in a regular Jupyter notebook or other environment, where the reader can use the dataset found in the project_dataset directory.


In [None]:
# Upload the dataset exported from the 'Model Data Generation and Exploration' notebook.
from google.colab import files
import io
uploaded = files.upload()
train_data = pd.read_csv(io.BytesIO(uploaded['bert_dataset.csv']))

Saving bert_dataset.csv to bert_dataset.csv


## General Information

### Bidirectional Encoder Representations from Transformers (BERT) Model

- BERT was pre-trained on a large corpus of texts from Wikipedia and the BooksCorpus. The model uses two pre-training objectives: masked language modelling (masked LM) and next sentence prediction (NSP). In the former, 15% of the tokens in each sequence are selected for prediction; most of them are replaced by a [MASK] token (a small share is swapped for a random token or left unchanged) and the model has to predict the original token. In the latter, the model receives pairs of sentences in which, 50% of the time, the second sentence is replaced by a random sentence from the corpus, and it has to predict whether the second sentence actually follows the first. The model architecture builds upon the seminal Transformer network (Vaswani et al., 2017) and consists of 12 transformer blocks, a hidden size of 768, 12 attention heads, and 110 million parameters. The authors of the original paper also propose a larger architecture (BERT-large, with 24 transformer blocks, a hidden size of 1024, 16 attention heads, and about 340 million parameters), but it is not considered here because its resource requirements would make it impractical to use. Each token in a sequence is represented by a 768-dimensional vector, and BERT also includes so-called special tokens that are meant for use in downstream tasks. The [CLS] special token is the one employed for classification; it essentially serves as a summary representation of an entire sequence.
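
As a rough illustration of the masked LM objective described above, the sketch below picks about 15% of the tokens in a toy sequence and applies the 80/10/10 replacement rule reported by Devlin et al. (2018). The function name `mask_tokens`, the toy vocabulary and the example sentence are assumptions made for this illustration; this is not BERT's actual pre-training code, which operates on WordPiece token ids.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, target_positions) for a toy masked-LM example."""
    rng = random.Random(seed)
    # Decide how many positions the model has to predict (at least one).
    n_targets = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_targets))
    corrupted = list(tokens)
    for pos in positions:
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, positions

# Hypothetical headline and vocabulary, used only for illustration.
tokens = "the company reported a sharp rise in quarterly operating profit".split()
vocab = ["bank", "loss", "market", "shares"]
corrupted, positions = mask_tokens(tokens, vocab)
print(corrupted)
print("positions to predict:", positions)
```

Only the selected positions contribute to the masked LM loss; every other token is passed through unchanged and serves as context.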

### BERT Implementation

- The Hugging Face library provides various implementations of the BERT model [(Hugging Face.,*BERT*, 2021)](https://huggingface.co/transformers/model_doc/bert.html). For the purposes of this project the PyTorch implementation of the BERT base model was used [(Hugging Face,*bert-base-cased* 2021)](https://huggingface.co/bert-base-cased). The architecture consists of 12 transformer blocks, a hidden size of 768, 12 attention heads, and 110 million parameters. The library also offers many options for pre-trained weights, and bert-base-cased was the one chosen. Additionally, since BERT is used here for a downstream task, the Hugging Face library provides a model that includes an extra fully connected linear layer that can be trained for the task of sentiment classification: BertForSequenceClassification [(Hugging Face.,*BertForSequenceClassification*, 2021)](https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification). This classification head is built on top of the pooled [CLS] output mentioned above. The library also provides text tokenizers for each of the pre-trained models it offers [(Hugging Face.,*Tokenizer*, 2021)](https://huggingface.co/transformers/main_classes/tokenizer.html).

### Fine-tuning process overview
- During fine-tuning the model is first initialized with its pre-trained parameters, which were learned through BERT's pre-training objectives (masked LM and NSP). All of the parameters, including those of the newly added output layer for the given task (in this case sentiment classification), are then updated using the labelled data. In that manner the model is trained end-to-end for the given task (Devlin et al., 2018).




## Brief Dataset Overview (refer to the 'Model Data Generation and Exploration' notebook for detailed dataset information)

- The dataset contains financial news headline texts and their sentiment annotations on a ternary scale of negative/neutral/positive.
Two existing datasets were combined to generate the final set used: [Financial Phrase Bank (Malo, P. et al.,2013)](https://www.researchgate.net/publication/251231107_Good_Debt_or_Bad_Debt_Detecting_Semantic_Orientations_in_Economic_Texts) and data by [Sousa et al.(2019)](https://github.com/stocks-predictor/bert).



In [None]:
print('The dataset contains a total of {} sentences'.format(train_data.shape[0]))

print('Below one can see the label distribution between the three sentiment categories:')
labels_map={'label':{0:'negative',1:'neutral',2:'positive'}}
train_data.replace(labels_map).label.value_counts()
    

The dataset contains a total of 5946 sentences
Below one can see the label distribution between the three sentiment categories:


neutral     3418
positive    1574
negative     954
Name: label, dtype: int64

## Generating the Train/Test Split

- An 80/20 split was used. The training split is used for training the model with 10-fold cross-validation.
  The model that achieves the highest macro-averaged F1-score is the one exported for use in the application. The test split is then used to validate that model's ability to generalize
  well to unseen data. In that manner, the evaluation metrics produced during the training folds provide evidence for the soundness of the hyperparameter choices, and
  the evaluation metrics obtained by testing the exported model on the test data show the performance of the model used in the application.
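
To make the split sizes concrete, the following minimal sketch reproduces an ordered (unshuffled) 80/20 split on a list of 5946 placeholder items, the number of sentences reported above. The helper `ordered_split` is hypothetical and only mimics an unshuffled split under the assumption that the test size is rounded up; it is not scikit-learn's implementation.

```python
import math

def ordered_split(items, test_size=0.2):
    """Split a list into a leading train part and a trailing test part, keeping the original order."""
    n_test = math.ceil(len(items) * test_size)  # assumption: the test size is rounded up
    n_train = len(items) - n_test
    return items[:n_train], items[n_train:]

# Placeholder items standing in for the 5946 headlines.
train, test = ordered_split(list(range(5946)))
print(len(train), len(test))  # 4756 1190
```

With 4756 training observations, each of the 10 cross-validation folds holds out roughly 476 observations for validation.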



In [None]:
train_texts, test_texts, train_labels, test_labels = train_test_split(train_data.headline, train_data.label, test_size=.2,shuffle=False)


## Creating a Custom Dataset Class as Required by Pytorch
- The dataset class extends torch.utils.data.Dataset and implements the required methods [(Pytorch., *Data Loading Tutorial*, 2021)](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html) [(Hugging Face., *Fine-tuning a Pre-trained Model* 2021)](https://huggingface.co/transformers/training.html).


In [None]:

class Dataset(torch.utils.data.Dataset):
  """Params:
      encodings: transformers.tokenization_utils_base.BatchEncoding - the encodings returned by BertTokenizer.batch_encode_plus
      labels: list - list of labels for the dataset
      raw_text: list - list of the raw examples from the dataset
  """
  def __init__(self, encodings, labels,raw_text):
      self.encodings = encodings
      self.labels = labels
      self.raw_text=raw_text

  def __getitem__(self, idx):
      item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
      item['labels'] = torch.tensor(self.labels[idx])
      return item

  def __len__(self):
      return len(self.labels)



## Training Process Details
- Due to the rather small size of the dataset, 10-fold cross-validation was used to best estimate the model's performance and reduce bias as much as possible. More specifically, because of the label imbalance of the dataset, stratified k-fold cross-validation was employed, implemented using the scikit-learn library (Pedregosa et al., 2011). Stratified k-fold keeps the label proportions in each fold close to those of the whole training set, whereas regular k-fold selects observations without regard to their labels, which could distort the performance estimate. The model that achieves the highest macro-averaged F1-score is saved for further evaluation. The training process is repeated multiple times, and the performance of some of the saved models is then compared to select the final one used in the application. The models are judged by how well they generalize to the test portion of the dataset and by how they perform when scoring the sentiment of the articles dataset built specifically for this project (see the 'Selecting Model for Production' notebook). The articles dataset contains the full texts of news articles and measures the model's performance when the extra logic used to arrive at a final sentiment classification is also applied.
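
The idea behind stratification can be sketched in a few lines of plain Python. The function `stratified_folds` below is a simplified, hypothetical illustration that deals the indices of each class across the folds in round-robin order; scikit-learn's `StratifiedKFold` uses its own allocation scheme, so the exact fold membership will differ.

```python
from collections import Counter, defaultdict

def stratified_folds(labels, n_splits):
    """Assign each example index to a fold so that class proportions stay similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    # Deal the indices of every class across the folds in round-robin order.
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds

# Toy imbalanced labels (0=negative, 1=neutral, 2=positive), not the real dataset.
labels = [0] * 10 + [1] * 30 + [2] * 20
folds = stratified_folds(labels, n_splits=5)
for fold_id, fold in enumerate(folds):
    print(fold_id, Counter(labels[i] for i in fold))
```

Every fold in this toy example ends up with the same 1:3:2 class ratio as the full label list, which is the property that makes per-fold metrics comparable.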

### Choice of Hyperparameters
- Devlin et al. (2018) advise choosing among combinations of the following hyperparameters:
 - Batch size:16,32
 - Learning rate (Adam): 5e-5,3e-5,2e-5
 - Number of epochs: 2,3,4
 
 After multiple runs with different combinations of the above, the final selection is as follows: 
  - Batch size: 32
  - Learning rate: 2e-5
  - Number of epochs: 4

The rather large batch size ensures that training completes within a reasonable time. Its potential detrimental impact on generalization is mitigated by using a smaller learning rate, gradient clipping and learning rate decay (Hoffer, Hubara, & Soudry, 2017). Furthermore, gradient clipping reduces the risk of exploding gradients.
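
Gradient clipping by global norm can be illustrated without any deep learning library. The sketch below rescales a flat list of gradient values so that their L2 norm does not exceed `max_norm`; the helper name `clip_by_global_norm` and the small `eps` term are assumptions for this illustration, written in the spirit of `clip_grad_norm_` rather than copied from it.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Gradients that are already small enough are left untouched (scale == 1).
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]  # toy gradient with L2 norm 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

The direction of the gradient is preserved; only its length is capped, which limits the size of any single update step.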

### Choice of Evaluation Metrics
- The model's performance is evaluated using the macro-averaged F1-score. Metrics such as accuracy, or the precision, recall or plain F1-score of a single class, were deemed inappropriate because of the label imbalance of the dataset. Additionally, each class should carry equal importance, which a single-class F1-score does not reflect. The macro-averaged F1-score extends the per-class F1-score by taking the unweighted mean of the F1-scores of all classes. It therefore reflects both recall (the share of a class's observations that are found) and precision (the share of predictions for a class that are correct), while avoiding the pitfall of using these metrics in isolation, which could be misleading given the characteristics of the dataset. One potential weakness of the macro-averaged F1-score is that it could overestimate the model's performance when the model classifies a given smaller class very well but performs worse on the others.
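
The difference between accuracy and the macro-averaged F1-score on imbalanced labels can be seen in a small worked example. The labels and predictions below are invented for illustration (a degenerate model that always predicts the majority class), and the helper functions are a plain-Python sketch rather than the scikit-learn implementation used in the notebook.

```python
def per_class_f1(y_true, y_pred, label):
    """F1-score of a single class, treating it as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1-scores."""
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)

# Toy data: 8 neutral (1), 1 negative (0), 1 positive (2); the model always predicts neutral.
y_true = [1] * 8 + [0] + [2]
y_pred = [1] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_f1(y_true, y_pred, labels=[0, 1, 2])
print("accuracy:", accuracy)
print("macro F1:", round(macro, 3))
```

Accuracy is 0.8 here even though two of the three classes are never predicted, while the macro-averaged F1-score drops to about 0.296 because the two missed classes each contribute an F1-score of zero.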

### Training Procedure
- The training split, consisting of 80% of the observations in the dataset, is used for training. As mentioned, stratified 10-fold cross-validation is used to assess the performance of the model, and the model that scores best (per macro-averaged F1-score) is exported to be used for predictions in the application. After each fold, the model that was trained for 4 epochs in that fold is removed from GPU memory to avoid 'out of memory' exceptions.
- Steps:
  - Generate the stratified k-fold splits, with k=10.
  - Generate the encodings required by BERT using BertTokenizer. The encodings per observation consist of the following:
    - input ids: The ids of each token in the sequence, including word token ids and special token ids. As the model takes fixed-length input id tensors, each sequence is padded to a given length passed as a parameter, which must not exceed the maximum of 512 tokens. A factor considered when selecting the length was the computational cost of self-attention, which grows quadratically with sequence length. Therefore, a maximum length of 300 was chosen as a compromise between allowing the model to learn longer dependencies and keeping computational costs at a reasonable level.
    - attention mask: A tensor of length equal to the maximum length (300) that indicates which positions in input ids hold real tokens the model should attend to and which hold padding tokens.
  - Create the PyTorch datasets and dataloaders, using a batch size of 32 for the dataloaders.
  - Instantiate the model, the Adam optimizer (used for adjusting the model weights) and the learning rate decay scheduler. The learning rate is set at 2e-5 and is linearly decayed.
  - Run the epochs (**more details are provided as comments in the code section**).
    - For each epoch, a training loop and a validation loop are run. After each validation loop a classification report is generated, and the model is saved if its macro-averaged F1-score is the highest so far.
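
The linear learning rate decay mentioned in the steps above can be written down as a small function. This is a sketch of the schedule's shape under the settings used in this notebook (base learning rate 2e-5, no warmup, 4 epochs of 134 batches as shown in the training log); the function name `linear_decay_lr` is hypothetical and the code does not call the Hugging Face scheduler itself.

```python
def linear_decay_lr(step, base_lr, total_steps, warmup_steps=0):
    """Learning rate at a given optimizer step for linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

batches_per_epoch = 134  # taken from the training log ("Batch ... of 134")
epochs = 4
total_steps = batches_per_epoch * epochs

# Learning rate at the start of each epoch and after the final step.
for step in range(0, total_steps + 1, batches_per_epoch):
    print(step, linear_decay_lr(step, 2e-5, total_steps))
```

The rate starts at 2e-5, is halved by the end of the second epoch and reaches zero after the last batch of the fourth epoch.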

### Evaluation Procedure
- After all folds are concluded, the macro average f1-score for each epoch in each fold is averaged to produce the final evaluation metric for the model. 
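
A minimal sketch of this averaging step is shown below. It assumes a `fold_stats` structure shaped like the one built in the training cell (a list of folds, each holding one classification report dictionary per epoch); the numbers are placeholder values chosen for the example, not results from this notebook, and `average_macro_f1` is a hypothetical helper.

```python
from statistics import mean

def average_macro_f1(fold_stats):
    """Average the macro-averaged F1-score over every epoch of every fold."""
    scores = [
        report["macro avg"]["f1-score"]
        for epoch_stats in fold_stats
        for report in epoch_stats
    ]
    return mean(scores)

# Placeholder reports: two folds with two epochs each (illustrative values only).
fold_stats = [
    [{"macro avg": {"f1-score": 0.70}}, {"macro avg": {"f1-score": 0.80}}],
    [{"macro avg": {"f1-score": 0.75}}, {"macro avg": {"f1-score": 0.85}}],
]
print(round(average_macro_f1(fold_stats), 3))
```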



In [None]:
''' As mentioned, different models will be evaluated. For one of the models the dataset is preprocessed to remove punctuation irregularities.
    The reason for this approach is that, according to Devlin et al. (2018), the original model was trained on large text corpora
    without any punctuation preprocessing. Therefore, in order to match the format of the data used for generating the original weights,
    the two functions below were used for preprocessing the dataset. They are only used for one of the exported models - BertFixedPunct.
'''
def punct_recover(text):
    # Re-attach punctuation and contractions that were separated from the preceding word by a space.
    return (text.replace(" .", ".").replace(" 're", "'re").replace(" ,", ",").replace(" 's", "'s")
                .replace(" 've", "'ve").replace(" 't", "'t").replace(" 'd", "'d").replace(" %", "%"))

def restore_punct(df):
    # Apply punct_recover to every row; assigning the whole column avoids pandas chained-assignment issues.
    df['sentence'] = df['sentence'].apply(punct_recover)
    return df

In [None]:
# Instantiate the bert tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
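
To show what padding to a fixed length and the accompanying attention mask look like, here is a small library-free sketch. The helper `pad_with_mask` is hypothetical; in the example, 101 and 102 stand for the [CLS] and [SEP] ids and 0 for the padding id as in BERT's vocabulary, while the two ids in the middle are arbitrary placeholders. A maximum length of 8 is used instead of 300 to keep the output readable.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Truncate or pad a list of token ids and build the matching attention mask."""
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)      # 1 marks a real token
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, attention_mask + [0] * padding  # 0 marks padding

# [CLS] <two placeholder word ids> [SEP]
token_ids = [101, 1109, 1419, 102]
input_ids, attention_mask = pad_with_mask(token_ids, max_length=8)
print(input_ids)       # [101, 1109, 1419, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```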


In [None]:

# a list for storing stats concerning the training process
fold_stats =[]
# based on the best macro averaged f1, a model will be saved for use in the application
best_macro_f1=0
classes=['negative','neutral','positive']
fold=0

# Generate the stratified 10-fold splits. StratifiedKFold.split returns a generator that yields the train and validation indices for each fold.
N_SPLITS=10
skf = StratifiedKFold(n_splits=N_SPLITS)
total_time0=datetime.now()
for train_index, test_index in skf.split(train_texts, train_labels):
  fold_time0=datetime.now()

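  # Remove the model from the previous fold to avoid running out of GPU memory.
  # The reference is deleted before the cache is emptied, since memory that is still referenced cannot be released.
  if fold>0:
    del model
    torch.cuda.empty_cache()
  
  print('=============Fold: {} of {} ============'.format(fold+1,N_SPLITS))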

  # Generate the train and validation split for the fold. The indices returned by skf.split are positional, hence .iloc.
  X_train, X_test = train_texts.iloc[train_index], train_texts.iloc[test_index]
  y_train, y_test = train_labels.iloc[train_index], train_labels.iloc[test_index]

  # Generate the encodings required by BERT. truncation=True guards against sequences longer than max_length.
  train_encodings= tokenizer.batch_encode_plus(list(X_train),add_special_tokens=True,padding='max_length',truncation=True,max_length=300,return_attention_mask=True)
  test_encodings = tokenizer.batch_encode_plus(list(X_test),add_special_tokens=True,padding='max_length',truncation=True,max_length=300,return_attention_mask=True)
  # Generate the pytorch datasets
  train_dataset=Dataset(train_encodings,list(y_train),list(X_train))
  test_dataset= Dataset(test_encodings,list(y_test),list(X_test))
  # Generate the pytorch dataloaders
  train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
  test_loader= DataLoader(test_dataset,batch_size=32)

  #Instantiate the model and move it to the GPU (if available).
  model = BertForSequenceClassification.from_pretrained('bert-base-cased',num_labels=3)
  model.to(device)
  #Instantiate the optimizer. Learning rates of 5e-5, 3e-5, and 2e-5 were tried, adhering to what was proposed by Devlin et al. (2018).
  optim = AdamW(model.parameters(), lr=2e-5,eps = 1e-8)

  #The paper advises for 2,3, or 4 epochs to be performed.
  epochs=4
  training_steps = len(train_loader)*epochs
  epoch_stats=[]
  #A scheduler for reducing the learning rate will be used to aid convergence and counter the potential pitfalls of the large batches.
  scheduler = get_linear_schedule_with_warmup(optim,num_warmup_steps=0,num_training_steps=training_steps)

  #Start the training 
  for epoch in range(epochs):
    epoch_time0=datetime.now()
    print("")
    print('======== Epoch {} of {}. Fold: {}=========='.format(epoch + 1, epochs,fold+1))
    print('Training...')
    
    # The below 2 variables are used to calculate the average loss after each training loop is concluded.
    total_train_loss=0
    avg_train_loss=0
    # The predictions variable is needed for generating the classification report.
    predictions=np.array([])

    #Set the model to training mode
    model.train()

    # Training loop
    for step, batch in enumerate(train_loader):
        # Information is printed after every 30 batches were loaded.
        if step % 30 == 0 and not step == 0:
          print('Batch {} of {}'.format(step,len(train_loader)))

        # Get the variables needed from the batch to be passed as parameters to the model .
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        #The model returns the loss using cross entropy loss and the logits(output tensors).
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        
        # Add the loss output by the model to the total_train_loss variable for tracking.
        loss = outputs[0]
        total_train_loss+=loss.item()
        

        # Perform backpropagation and clip the gradient to avoid the exploding gradient problem.
        loss.backward()
        clip_grad_norm_(model.parameters(),max_norm=1.0)

        # Update the parameters using the Adam optimizer, reduce the learning rate using the scheduler and reset all gradients.
        optim.step()
        scheduler.step()
        optim.zero_grad()

    # Measure average loss over all batches in the loader
    avg_train_loss=total_train_loss/len(train_loader) 
    print('Training pass done, loss is {}'.format(avg_train_loss))

    # Set the model in a validation stage
    model.eval()
    # Validation loop
    for batch in test_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        #PyTorch tracks operations for gradient computation by default, which is not needed in the validation step
        with torch.no_grad():
            outputs=model(input_ids,attention_mask=attention_mask,labels=labels)
        # Get the model outputs
        logits = outputs[1].detach().cpu().numpy()
        # Store the predictions of the model so that they can be used to generate a classification report. np.argmax is applied to get the index of the class with highest value. 
        predictions=np.append(predictions,np.argmax(logits, axis=1).flatten())
        
    # Compute the classification report once, append it to the statistics for the current epoch and extract the macro averaged f1 to be used when deciding whether to save the model.
    epoch_report=classification_report(np.array(list(y_test)),predictions,target_names=classes,output_dict=True, zero_division=0)
    epoch_stats.append(epoch_report)
    epoch_macro_f1 =epoch_report['macro avg']['f1-score']

    print('============Validation pass of epoch {} , fold {} completed!============'.format(epoch+1,fold+1))
    print(classification_report(np.array(list(y_test)),predictions,target_names=classes,zero_division=0))
    
    # Export the model for further use if it has a higher macro averaged f1 than the previous highest.
    if epoch_macro_f1>best_macro_f1:
      best_macro_f1=epoch_macro_f1
      print('Saving model...')
      model.save_pretrained('/content/gdrive/My Drive/BertModel3')
      print('Model saved!!')
    epoch_time1=datetime.now()-epoch_time0
    print('=========Epoch {} out of {} completed. It took {}.============='.format(epoch+1,epochs,str(epoch_time1)))
  
  fold_time1=datetime.now()-fold_time0
  print('============Fold {} completed. It took {}.========'.format(fold+1,str(fold_time1)))
  fold_stats.append(epoch_stats)
  fold+=1
total_time1=datetime.now()-total_time0
print('============TRAINING COMPLETED! It took {}. ==================='.format(str(total_time1)))










Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b


Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.7010188439666335
              precision    recall  f1-score   support

    negative       0.81      0.69      0.75        55
     neutral       0.95      0.70      0.80       286
    positive       0.57      0.93      0.71       135

    accuracy                           0.76       476
   macro avg       0.78      0.77      0.75       476
weighted avg       0.82      0.76      0.77       476

Saving model...
Model saved!!

Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.30660570935526893
              precision    recall  f1-score   support

    negative       0.78      0.78      0.78        55
     neutral       0.96      0.76      0.85       286
    positive       0.63      0.91      0.74       135

    accuracy                           0.80       476
   macro avg       0.79      0.82      0.79       476
weighted avg



Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.5995002613583608
              precision    recall  f1-score   support

    negative       1.00      0.74      0.85        54
     neutral       0.95      0.82      0.88       286
    positive       0.71      0.99      0.83       136

    accuracy                           0.86       476
   macro avg       0.89      0.85      0.85       476
weighted avg       0.89      0.86      0.86       476

Saving model...
Model saved!!

Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.29704392873751584
              precision    recall  f1-score   support

    negative       0.95      0.78      0.86        54
     neutral       0.95      0.86      0.90       286
    positive       0.77      0.99      0.87       136

    accuracy                           0.89       476
   macro avg       0.89      0.87      0.88       476
weighted avg



Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.6360752738233822
              precision    recall  f1-score   support

    negative       0.80      0.37      0.51        54
     neutral       0.91      0.75      0.82       286
    positive       0.58      0.93      0.71       136

    accuracy                           0.76       476
   macro avg       0.77      0.68      0.68       476
weighted avg       0.81      0.76      0.76       476


Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.31726629637292963
              precision    recall  f1-score   support

    negative       0.89      0.74      0.81        54
     neutral       0.89      0.88      0.88       286
    positive       0.73      0.79      0.76       136

    accuracy                           0.84       476
   macro avg       0.84      0.80      0.82       476
weighted avg       0.84      0.84      0.8



Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.5807875553841022
              precision    recall  f1-score   support

    negative       0.60      0.80      0.68        54
     neutral       0.94      0.88      0.91       286
    positive       0.72      0.73      0.73       136

    accuracy                           0.83       476
   macro avg       0.75      0.80      0.77       476
weighted avg       0.84      0.83      0.83       476


Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.2946066168039592
              precision    recall  f1-score   support

    negative       0.62      0.80      0.70        54
     neutral       0.96      0.85      0.90       286
    positive       0.69      0.79      0.73       136

    accuracy                           0.82       476
   macro avg       0.76      0.81      0.78       476
weighted avg       0.85      0.82      0.83



Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.7789658548226998
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00        54
     neutral       0.80      0.87      0.83       286
    positive       0.63      0.78      0.70       136

    accuracy                           0.74       476
   macro avg       0.48      0.55      0.51       476
weighted avg       0.66      0.74      0.70       476


Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.49706809000292823
              precision    recall  f1-score   support

    negative       0.72      0.48      0.58        54
     neutral       0.79      0.94      0.86       286
    positive       0.88      0.62      0.73       136

    accuracy                           0.80       476
   macro avg       0.80      0.68      0.72       476
weighted avg       0.81      0.80      0.7



Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.6050602231007903
              precision    recall  f1-score   support

    negative       0.98      0.80      0.88        55
     neutral       0.85      0.95      0.90       285
    positive       0.88      0.72      0.79       136

    accuracy                           0.87       476
   macro avg       0.90      0.82      0.86       476
weighted avg       0.87      0.87      0.87       476


Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.3038130418689393
              precision    recall  f1-score   support

    negative       0.94      0.84      0.88        55
     neutral       0.87      0.94      0.91       285
    positive       0.86      0.75      0.80       136

    accuracy                           0.88       476
   macro avg       0.89      0.84      0.86       476
weighted avg       0.88      0.88      0.87



Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.5827652277119124
              precision    recall  f1-score   support

    negative       0.85      0.91      0.88        55
     neutral       0.83      0.93      0.88       285
    positive       0.87      0.62      0.72       135

    accuracy                           0.84       475
   macro avg       0.85      0.82      0.83       475
weighted avg       0.84      0.84      0.83       475


Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.28885243099127245
              precision    recall  f1-score   support

    negative       0.97      0.71      0.82        55
     neutral       0.79      0.96      0.87       285
    positive       0.83      0.55      0.66       135

    accuracy                           0.81       475
   macro avg       0.87      0.74      0.78       475
weighted avg       0.82      0.81      0.8



Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.5614181299930188
              precision    recall  f1-score   support

    negative       0.67      0.89      0.77        55
     neutral       0.89      0.85      0.87       285
    positive       0.76      0.74      0.75       135

    accuracy                           0.82       475
   macro avg       0.77      0.83      0.79       475
weighted avg       0.83      0.82      0.82       475


Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.26506515636817735
              precision    recall  f1-score   support

    negative       0.77      0.84      0.80        55
     neutral       0.85      0.92      0.88       285
    positive       0.83      0.67      0.74       135

    accuracy                           0.84       475
   macro avg       0.82      0.81      0.81       475
weighted avg       0.84      0.84      0.8



Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.606054838246374
              precision    recall  f1-score   support

    negative       0.59      0.85      0.70        55
     neutral       0.88      0.93      0.90       285
    positive       0.74      0.52      0.61       135

    accuracy                           0.80       475
   macro avg       0.74      0.77      0.74       475
weighted avg       0.80      0.80      0.79       475


Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.3025396392861409
              precision    recall  f1-score   support

    negative       0.81      0.98      0.89        55
     neutral       0.95      0.85      0.89       285
    positive       0.74      0.84      0.79       135

    accuracy                           0.86       475
   macro avg       0.83      0.89      0.86       475
weighted avg       0.87      0.86      0.86 



Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.5690139517632883
              precision    recall  f1-score   support

    negative       0.46      1.00      0.63        55
     neutral       0.84      0.78      0.81       285
    positive       0.88      0.58      0.70       135

    accuracy                           0.75       475
   macro avg       0.72      0.79      0.71       475
weighted avg       0.81      0.75      0.76       475


Training...
Batch 30 of 134
Batch 60 of 134
Batch 90 of 134
Batch 120 of 134
Training pass done, loss is 0.28406009734121723
              precision    recall  f1-score   support

    negative       0.44      0.98      0.61        55
     neutral       0.87      0.69      0.77       285
    positive       0.76      0.71      0.73       135

    accuracy                           0.73       475
   macro avg       0.69      0.79      0.70       475
weighted avg       0.79      0.73      0.7

Below, the reader can see the macro-averaged f1-score, averaged across all epochs in all folds. The obtained value was 0.8 (rounded to one decimal place), which can be regarded as a good indication of the model's performance with the chosen hyperparameters. From the classification reports after each epoch, the reader can observe that the model seems to perform relatively worse when classifying negative observations. That is to be expected, given that they are the least represented class in the dataset used.

In [None]:
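# Total number of evaluation passes: one per epoch in each fold
TOTAL_STEPS=epochs*N_SPLITS
macro_avg_f1=0
# Sum the macro-averaged f1-score of every epoch in every fold
for fold_run in fold_stats:
  for epoch_run in fold_run:
    macro_avg_f1+=epoch_run['macro avg']['f1-score']
# Average over all evaluation passes and round to one decimal place
final_macro_avg_f1=macro_avg_f1/TOTAL_STEPS
round(final_macro_avg_f1,1)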

0.8
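
To make the metric concrete, the short sketch below recomputes the macro and weighted averages from the per-class f1-scores of one of the classification reports above (negative 0.88, neutral 0.90, positive 0.79, with supports 55, 285 and 136). The variable names are illustrative and are not part of the training code.

```python
# Per-class f1-scores and supports copied from one classification report above
f1_scores = {'negative': 0.88, 'neutral': 0.90, 'positive': 0.79}
supports = {'negative': 55, 'neutral': 285, 'positive': 136}

# Macro average: unweighted mean, so every class counts equally
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class contributes in proportion to its support
total_support = sum(supports.values())
weighted_f1 = sum(f1_scores[c] * supports[c] for c in f1_scores) / total_support

print(round(macro_f1, 2), round(weighted_f1, 2))  # 0.86 0.87
```

Because the macro average gives the small negative class the same weight as the large neutral class, weak performance on negative observations lowers it more than it lowers the weighted average.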

## Exported Model Testing Procedure
- The goal of the testing loop below is to validate the exported models' ability to generalize to unseen data. The test split of the original dataset is used for this purpose. Once again, the macro-averaged f1-score is the evaluation metric chosen to assess the models' performance. As mentioned, the performance of more than one of the exported models is evaluated. After observing the performance metrics for each model, the models are further tested on the custom-built news articles dataset to choose a final model for production use (see the 'Selecting Model for Production' notebook).

In [None]:
# Generate the encodings required by BERT (every text is padded to 300 tokens)
test_encodings = tokenizer.batch_encode_plus(test_texts,add_special_tokens=True,padding='max_length',max_length=300,return_attention_mask=True)
# Generate the PyTorch dataset
test_dataset= Dataset(test_encodings,list(test_labels),list(test_texts))
# Generate the PyTorch dataloader
test_loader= DataLoader(test_dataset,batch_size=32)
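
As a rough illustration of what `padding='max_length'` and `return_attention_mask=True` produce, the sketch below pads a short sequence of token ids and builds the matching attention mask. `pad_with_mask` is a hypothetical helper written for this explanation, not the tokenizer's implementation, and the token ids are illustrative.

```python
def pad_with_mask(token_ids, max_length, pad_id=0):
    """Pad token_ids with pad_id up to max_length and build the attention mask.

    The mask holds 1 for real tokens and 0 for padding, so that the model
    can ignore the padded positions.
    """
    if len(token_ids) > max_length:
        raise ValueError('sequence is longer than max_length')
    n_padding = max_length - len(token_ids)
    padded_ids = list(token_ids) + [pad_id] * n_padding
    attention_mask = [1] * len(token_ids) + [0] * n_padding
    return padded_ids, attention_mask

# Illustrative ids: a start marker, one word piece and an end marker
ids, mask = pad_with_mask([101, 1188, 102], max_length=6)
print(ids)   # [101, 1188, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 0, 0, 0]
```

Padding every text to the same length is what allows the dataloader to stack the encodings into rectangular batches.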


In [None]:
# Test the performance of the models saved from the above training loop.
classes=['negative','neutral','positive']
# Names of the saved models and the drive folder they are stored in
models=['BertModel','BertModel2','BertModel3','BertFixedPunct']
path='/content/gdrive/My Drive/'

for bertmodel in models:
  # Load the saved model
  model_path=path+bertmodel
  model = BertForSequenceClassification.from_pretrained(model_path).to(device)
  model.eval()

  # Reset the accumulators for each model; otherwise the reported loss
  # would also include the losses of the models evaluated before it
  total_test_loss=0
  predictions=np.array([])

  for batch in test_loader:
    input_ids = batch['input_ids'].to(device)
    attention_mask = batch['attention_mask'].to(device)
    labels = batch['labels'].to(device)
    with torch.no_grad():
      outputs=model(input_ids,attention_mask=attention_mask,labels=labels)
    # outputs[0] is the batch loss, outputs[1] holds the logits
    total_test_loss += outputs[0].item()
    logits = outputs[1].detach().cpu().numpy()

    # Store the predictions of the model so that they can be used to generate a classification report
    predictions=np.append(predictions,np.argmax(logits, axis=1).flatten())

  avg_test_loss = total_test_loss / len(test_loader)

  print('===Model:',bertmodel,'===')
  print("Test Loss: {0:.2f}".format(avg_test_loss))
  print('')
  print(classification_report(np.array(test_labels),predictions,target_names=classes))

===Model: BertModel ===
Test Loss: 0.38

              precision    recall  f1-score   support

    negative       0.90      0.88      0.89       408
     neutral       0.87      0.92      0.89       563
    positive       0.88      0.79      0.83       219

    accuracy                           0.88      1190
   macro avg       0.88      0.86      0.87      1190
weighted avg       0.88      0.88      0.88      1190

===Model: BertModel2 ===
Test Loss: 0.80

              precision    recall  f1-score   support

    negative       0.90      0.88      0.89       408
     neutral       0.85      0.88      0.87       563
    positive       0.79      0.75      0.77       219

    accuracy                           0.86      1190
   macro avg       0.85      0.84      0.84      1190
weighted avg       0.86      0.86      0.86      1190

===Model: BertModel3 ===
Test Loss: 1.78

              precision    recall  f1-score   support

    negative       0.73      0.64      0.68       408
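
The two quantities reported for each model can be sketched in plain Python: the predicted class is the index of the largest logit in each row, and the test loss is the mean of the batch losses of that model alone. The logits, batch losses and model names below are made-up values for illustration only, and `predict_labels` and `average_loss` are hypothetical helpers rather than functions from the notebook.

```python
classes = ['negative', 'neutral', 'positive']

def predict_labels(logits_batch):
    """Return, for each row of logits, the class name with the highest logit."""
    predicted = []
    for row in logits_batch:
        best_index = max(range(len(row)), key=lambda i: row[i])
        predicted.append(classes[best_index])
    return predicted

def average_loss(batch_losses):
    """Mean loss over the batches of a single model."""
    return sum(batch_losses) / len(batch_losses)

# Made-up logits for three observations
example_logits = [[2.1, 0.3, -1.0], [-0.5, 1.7, 0.2], [0.1, 0.4, 3.2]]
predicted_labels = predict_labels(example_logits)
print(predicted_labels)

# Made-up batch losses for two hypothetical models; each model gets its own
# list, so the average of one model never includes the losses of another
losses_per_model = {'model_a': [0.4, 0.2], 'model_b': [0.5, 0.7]}
avg_losses = {name: average_loss(losses) for name, losses in losses_per_model.items()}
print(avg_losses)
```

Keeping a separate accumulator per model is what makes the reported losses comparable between models.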
    

### Bibliography
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K., 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Google AI Language.

Hoffer, E., Hubara, I., & Soudry, D. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in Neural Information Processing Systems, 1729-1739.

PyTorch., (2021) *Writing Custom Datasets, Dataloaders and Transforms*. Available from: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html [Accessed 15th July 2021]

Hugging Face., (2021) *BERT*. Available from: https://huggingface.co/transformers/model_doc/bert.html [Accessed 10th July 2021]

Hugging Face., (2021) *BERT Base Model(cased)*. Available from: https://huggingface.co/bert-base-cased [Accessed 10th July 2021]

Hugging Face., (2021) *BertForSequenceClassification*. Available from: https://huggingface.co/transformers/model_doc/bert.html#bertforsequenceclassification [Accessed 10th July 2021]

Hugging Face., (2021) *Tokenizer*. Available from: https://huggingface.co/transformers/main_classes/tokenizer.html [Accessed 11th July 2021]

Hugging Face., (2021) *Fine-tuning a pre-trained model*. Available from: https://huggingface.co/transformers/training.html [Accessed 15th July 2021]

Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.

[Malo, P., Sinha, A., Takala, P., Korhonen, P. and Wallenius, J. (2013): “Good debt or bad debt: Detecting semantic orientations in economic texts.” Journal of the American Society for Information Science and Technology. (in Press)](https://www.researchgate.net/publication/251231107_Good_Debt_or_Bad_Debt_Detecting_Semantic_Orientations_in_Economic_Texts)

[Sousa, M.G., Sakiyama, K., de Souza Rodrigues, L., Moraes, P.H., Fernandes, E.R. and Matsubara, E.T., 2019, November. BERT for stock market sentiment analysis. In 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI) (pp. 1597-1601). IEEE.](https://github.com/stocks-predictor/bert)




