In [None]:
# Fine Tuning Transformer for Sentiment Classification

### Introduction

In this tutorial we will fine tune a transformer model for the **Sentiment classification** problem. **Sentiment classification** is a special case of **Multiclass Classification** in which the classes represent the sentiment expressed by the text.
The number of classes is generally smaller than in a standard multiclass classification problem: the classes represent the polarity, in the form of `positive` and `negative`, and in some cases an additional `neutral` polarity.

This is one of the most common business problems when trying to ascertain the sentiment of a statement made by your customer in a business setup.

#### Flow of the notebook

* As with the previous tutorials, this notebook follows a series of easy-to-follow steps that make fine tuning and training a Transformers model a straightforward task.
* However, unlike the other notebooks, most of the sections in this tutorial have been wrapped into functions, which are called from `run()` at the end of the notebook.
* This is done to leverage the [Weights and Biases Service](https://www.wandb.com/), WandB for short.
* It is an experiment tracking, parameter optimization and artifact management service that can be integrated very easily with any deep learning or machine learning framework.

The notebook is divided into separate sections to provide an organized walk through of the process used. This process can be modified for individual use cases. The sections are:

1. [Preparing Environment and Importing Libraries](#section01)
2. [Pre-Processing and Preparing the Dataset for data processing: Class](#section02)
3. [Defining a Model/Network](#section07)
4. [Fine Tuning the Model: Function](#section03)
5. [Validating the Model Performance: Function](#section04)
6. [Run Function](#section05)
    * [Initializing WandB](#section501)
    * [Importing and Pre-Processing the domain data](#section502)
    * [Creation of Dataset and Dataloader](#section503)
    * [Neural Network and Optimizer](#section504)
    * [Training Model and Logging to WandB](#section505)
    * [Validation](#section506)


#### Technical Details

This script leverages multiple tools designed by other teams. Details of the tools used are given below. Please ensure that these elements are present in your setup to successfully implement this script.

- **Data**:
	- We are using the **IMDB Dataset** available at [Kaggle](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
	- This dataset is a collection of movie reviews obtained from the IMDB website. The reviews are labeled with a positive or negative sentiment.
	- There are approx. `50000` rows of data, where each row has the following data points:
		- **review** : Review of a movie
		- **sentiment** : positive or negative


- **Language Model Used**: 
    - This notebook uses ***RoBERTa*** as its base transformer model. [Research Paper](https://arxiv.org/abs/1907.11692)    
    - ***RoBERTa*** is an incremental improvement on the ***BERT*** architecture, with multiple tweaks in different areas.
    - Some of the changes in RoBERTa were: bigger training data, dynamic masking, and a different self-supervised training objective.
    - You can read about these changes in detail at the following [link](https://medium.com/towards-artificial-intelligence/a-robustly-optimized-bert-pretraining-approach-f6b6e537e6a6).
    - We will be leveraging the ***RoBERTa*** implementation from the HuggingFace team.
    - [Documentation for python](https://huggingface.co/transformers/model_doc/roberta.html)


- **Hardware and Software Requirements**: 
	- Python 3.6 and above
	- Pytorch and Transformers
	- All the stock Python ML libraries
	- GPU/TPU enabled setup 
   

- **Script Objective**:
	- The objective of this script is to fine tune ***RoBERTa*** to be able to classify whether the sentiment of a given text is positive or negative.

---
NOTE: 
We are using the Weights and Biases tool-set in this tutorial. The different components will be explained as we go through the article. This notebook builds incrementally on the work done in the summarization notebook.

[Link](https://app.wandb.ai/abhimishra-91/transformers_tutorials_sentiment?workspace=user-abhimishra-91) to the Project on WandB

<a id='section01'></a>
### Preparing Environment and Importing Libraries

At this step we will be installing the necessary libraries followed by importing the libraries and modules needed to run our script. 
We will be installing:
* transformers
* wandb
* packages to support tpu for pytorch

Libraries imported are:
* Pandas
* Pytorch
* Pytorch Utils for Dataset and Dataloader
* Transformers
* Roberta Model and Tokenizer
* wandb

After that we will prepare the device to support TPU execution for training.

Finally, we will log into the [wandb](https://www.wandb.com/) service using the login command.
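The device setup described above can also be written so that the same notebook runs with or without a TPU. The following is a minimal sketch, not part of the original notebook: the helper name `pick_device` is hypothetical, and it assumes that `torch_xla` is only importable on a TPU runtime.

```python
import torch

def pick_device():
    """Return an XLA (TPU) device when torch_xla is usable, else CUDA, else CPU."""
    try:
        # torch_xla is assumed to be installed only on TPU runtimes
        import torch_xla.core.xla_model as xm
        return xm.xla_device()
    except (ImportError, RuntimeError):
        # Fall back to a GPU if one is visible, otherwise to the CPU
        return torch.device("cuda" if torch.cuda.is_available() else "cpu")

device = pick_device()
print(device)
```

Note that the training function further down calls `xm.optimizer_step`, so on a GPU or CPU device the plain `optimizer.step()` call would be needed there instead.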

In [None]:
# Installing NLP-Transformers library
!pip install -q transformers

# Installing wandb library for experiment tracking and hyper parameter optimization
!pip install -q wandb

# Code for TPU packages install
!curl -q https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py --apt-packages libomp5 libopenblas-dev

In [None]:
# Importing stock libraries
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler

# Importing packages from the Hugging Face Transformers library
from transformers import RobertaConfig, RobertaModel, RobertaTokenizerFast, RobertaForSequenceClassification

# Importing wandb for logging and hyper-parameter tuning
import wandb

In [None]:
# Setting up the accelerators

# # GPU
# from torch import cuda
# device = 'cuda' if cuda.is_available() else 'cpu'

# TPU
import torch_xla
import torch_xla.core.xla_model as xm
device = xm.xla_device()

In [None]:
# login to wandb
!wandb login

[34m[1mwandb[0m: You can find your API key in your browser here: https://app.wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[32mSuccessfully logged in to Weights & Biases![0m


<a id='section02'></a>
### Pre-Processing and Preparing the Dataset for data processing: Class

* We will start with the creation of the `Preprocess` class, which defines how the dataframe is pre-processed before the tokenization, dataset and dataloader steps of the workflow. In this class the `sentiment` column of the loaded dataframe is used to create a new column called `encoded_polarity`. Labels are numbered in the order in which they first appear in the column, which for this dataset gives:
    * `sentiment = positive` then `encoded_polarity = 0`
    * `sentiment = negative` then `encoded_polarity = 1`

* Following this, the `sentiment` column is removed from the dataframe.
* The encoding dictionary and the updated dataframe are returned. 
* This method is called in the `run()` function.

* After this we will work on the Dataset class, which defines how the text is pre-processed before sending it to the neural network. This dataset is used by the Dataloader, which feeds the data in batches to the neural network for training and processing. 
* The Dataloader and Dataset will be used inside the `run()`.
* Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to the neural network. For further reading on Dataset and Dataloader, see the [docs at PyTorch](https://pytorch.org/docs/stable/data.html).
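The label encoding described above can be sketched in plain Python. This is an illustration only: the function name `encode_labels` is hypothetical, and it mirrors the first-appearance numbering used by the `Preprocess` class without depending on pandas.

```python
def encode_labels(labels):
    """Map each distinct label to an integer in order of first appearance."""
    mapping = {}
    encoded = []
    for label in labels:
        if label not in mapping:
            # The next free integer is simply the current size of the mapping
            mapping[label] = len(mapping)
        encoded.append(mapping[label])
    return mapping, encoded

mapping, encoded = encode_labels(["positive", "negative", "negative", "positive"])
print(mapping)  # {'positive': 0, 'negative': 1}
print(encoded)  # [0, 1, 1, 0]
```

Because the numbering follows first appearance, a file whose first row is `negative` would get the opposite mapping, so it is worth checking the returned dictionary before interpreting predictions.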

#### *CustomDataset* Dataset Class
- This class is defined to accept the Dataframe as input and generate tokenized output that is used by the Roberta model for training. 
- We are using the Roberta tokenizer to tokenize the data in the `review` column of the dataframe. 
- The tokenizer uses the `encode_plus` method to perform tokenization and generate the necessary outputs, namely: `ids`, `attention_mask`
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/roberta.html#robertatokenizer)
- `encoded_polarity` is transformed into the `targets` tensor. 
- The *CustomDataset* class is used to create 2 datasets, for training and for validation.
- *Training Dataset* is used to fine tune the model: **70% of the original data**
- *Validation Dataset* is used to evaluate the performance of the model: **the remaining 30% of the original data**. The model has not seen this data during training. 
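To make the `ids` and `attention_mask` outputs concrete, here is a toy sketch of what padding and truncating to a fixed `max_len` produces. It is not the Roberta tokenizer: the whitespace splitting, the tiny `vocab` and the function name `toy_encode` are assumptions made purely for illustration.

```python
def toy_encode(text, max_len, vocab, bos_id=0, pad_id=1, eos_id=2, unk_id=3):
    """Whitespace-tokenize, add start/end tokens, truncate, then pad to max_len."""
    tokens = [vocab.get(word, unk_id) for word in text.split()]
    tokens = tokens[: max_len - 2]       # leave room for the two special tokens
    ids = [bos_id] + tokens + [eos_id]
    mask = [1] * len(ids)                # 1 marks real tokens
    n_pad = max_len - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": mask + [0] * n_pad,  # 0 marks padding
    }

vocab = {"great": 4, "movie": 5}  # hypothetical two-word vocabulary
out = toy_encode("great movie", max_len=6, vocab=vocab)
print(out["input_ids"])       # [0, 4, 5, 2, 1, 1]
print(out["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```

Every example ends up with exactly `max_len` entries, which is what allows the dataloader to stack them into a single batch tensor.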

#### return_dataloader: Called inside the `run()`
- The `return_dataloader` function is used to create the training and validation dataloaders that load data to the neural network in a defined manner. This is needed because all the data from the dataset cannot be loaded into memory at once, hence the amount of data loaded into memory and then passed to the neural network needs to be controlled.
- Internally the `return_dataloader` function calls the pytorch Dataloader class and the CustomDataset class to create the dataloaders for training and validation. 
- This control is achieved using the parameters such as `batch_size` and `max_len`.
- Training and Validation dataloaders are used in the training and validation part of the flow respectively
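The split and batching described above can be sketched on a tiny stand-in dataframe. The ten fake reviews below are invented for illustration, and a plain list of records is used in place of `CustomDataset` so that the sketch needs no tokenizer.

```python
import pandas as pd
from torch.utils.data import DataLoader

# Ten fake rows standing in for the real dataframe
df = pd.DataFrame({
    "review": [f"review-{i}" for i in range(10)],
    "encoded_polarity": [i % 2 for i in range(10)],
})

# 70% of the rows go to training, the remaining rows to validation
train_df = df.sample(frac=0.7, random_state=200)
val_df = df.drop(train_df.index).reset_index(drop=True)
train_df = train_df.reset_index(drop=True)

# A list of dicts is a valid map-style dataset for DataLoader
train_records = train_df.to_dict("records")
loader = DataLoader(train_records, batch_size=4, shuffle=False)
batch_sizes = [len(batch["review"]) for batch in loader]
print(len(train_df), len(val_df), batch_sizes)  # 7 3 [4, 3]
```

Only one batch is materialised at a time, which is how `batch_size` bounds the memory used per step.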

In [None]:
# The Preprocess class defines how the dataframe is processed: which features are generated and which are removed.
# A new encoded_polarity column is added, holding 0 or 1 depending on whether the sentiment is positive or negative.
# The processing method returns both the encoding dictionary and the updated dataframe for further usage.

class Preprocess:
    def __init__(self, df):
        """
        Constructor for the class
        :param df: Input Dataframe to be pre-processed
        """
        self.df = df
        self.encoded_dict = dict()

    def encoding(self, x):
        if x not in self.encoded_dict.keys():
            self.encoded_dict[x] = len(self.encoded_dict)
        return self.encoded_dict[x]

    def processing(self):
        self.df['encoded_polarity'] = self.df['sentiment'].apply(lambda x: self.encoding(x))
        self.df.drop(['sentiment'], axis=1, inplace=True)
        return self.encoded_dict, self.df

In [None]:
# Creating a CustomDataset class that is used to read the updated dataframe and tokenize the text. 
# The class is used in the return_dataloader function

class CustomDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
    def __getitem__(self, index):
        text = str(self.data.review[index])
        text = " ".join(text.split())
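        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            # `pad_to_max_length` is deprecated in recent transformers releases;
            # pad explicitly and truncate reviews longer than max_len
            padding='max_length',
            truncation=True,
            return_token_type_ids=True
        )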
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.data.encoded_polarity[index], dtype=torch.float)
        } 
    
    def __len__(self):
        return self.len

In [None]:
# Creating a function that returns the dataloader based on the dataframe and the specified train and validation batch size. 

def return_dataloader(df, tokenizer, train_batch_size, validation_batch_size, MAX_LEN, train_size=0.7):
    # train_size is the fraction of rows used for training; the remaining rows are used for validation
    train_dataset=df.sample(frac=train_size,random_state=200)
    val_dataset=df.drop(train_dataset.index).reset_index(drop=True)
    train_dataset = train_dataset.reset_index(drop=True)

    print("FULL Dataset: {}".format(df.shape))
    print("TRAIN Dataset: {}".format(train_dataset.shape))
    print("VAL Dataset: {}".format(val_dataset.shape))

    training_set = CustomDataset(train_dataset, tokenizer, MAX_LEN)
    validation_set = CustomDataset(val_dataset, tokenizer, MAX_LEN)

    train_params = {'batch_size': train_batch_size,
                'shuffle': True,
                'num_workers': 1
                }

    val_params = {'batch_size': validation_batch_size,
                    'shuffle': True,
                    'num_workers': 1
                    }

    training_loader = DataLoader(training_set, **train_params)
    validation_loader = DataLoader(validation_set, **val_params)
    
    return training_loader, validation_loader

<a id='section07'></a>
### Defining a Model/Network

#### Neural Network
 - We will be creating a neural network with the `ModelClass`. 
 - This network will have the Roberta language model followed by a `Linear` layer with `ReLU`, a `dropout` and a final `Linear` layer to obtain the final outputs. 
 - The data will be fed to the Roberta language model as defined in the dataset. 
 - The final layer's outputs are what will be compared to the `encoded_polarity` to determine the accuracy of the model's predictions. 
 - We will initiate an instance of the network called `model`. This instance will be used for training and then to save the final trained model for future inference. 
 - The `return_model` function is used in the `run()` to instantiate the model and set it up for TPU execution.
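The layers that sit on top of the language model can be tried out in isolation. The sketch below uses a random tensor as a stand-in for the Roberta hidden states so that nothing needs to be downloaded; the class name `ClassificationHead` and the sequence length of 16 are assumptions for illustration.

```python
import torch

class ClassificationHead(torch.nn.Module):
    """Linear + ReLU, dropout, then a final linear layer producing class logits."""
    def __init__(self, hidden_size=768, num_classes=2, dropout=0.3):
        super().__init__()
        self.pre_classifier = torch.nn.Linear(hidden_size, hidden_size)
        self.dropout = torch.nn.Dropout(dropout)
        self.classifier = torch.nn.Linear(hidden_size, num_classes)

    def forward(self, hidden_state):
        pooled = hidden_state[:, 0]  # vector of the first token in each sequence
        pooled = torch.relu(self.pre_classifier(pooled))
        pooled = self.dropout(pooled)
        return self.classifier(pooled)

head = ClassificationHead()
fake_hidden = torch.randn(4, 16, 768)  # (batch, sequence length, hidden size)
logits = head(fake_hidden)
print(logits.shape)  # torch.Size([4, 2])
```

Each row of `logits` holds one score per class; the index of the larger score is the predicted `encoded_polarity`.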

In [None]:
# Creating the customized model, by adding a drop out and a dense layer on top of roberta to get the final output for the model. 

class ModelClass(torch.nn.Module):
    def __init__(self):
        super(ModelClass, self).__init__()
        self.model_layer = RobertaModel.from_pretrained("roberta-base")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask):
        output_1 = self.model_layer(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output


In [None]:
# Function to return the model based on the definition of ModelClass

def return_model(device):
    model = ModelClass()
    model = model.to(device)
    return model

In [None]:
# Function to calculate the number of correct predictions in a batch

def calcuate_accu(big_idx, targets):
    n_correct = (big_idx==targets).sum().item()
    return n_correct

<a id='section03'></a>
### Fine Tuning the Model: Function

Here we define a training function that trains the model on the training dataset created above for a specified number of epochs (`EPOCHS`). The number of epochs defines how many times the complete data will be passed through the network. 

This function is called in the `run()`

The following events happen in this function to fine tune the neural network:
- The `epoch`, `model`, `device` details, `training_loader`, `optimizer` and `loss_function` are passed to `train()` when it is called from `run()`.
- The dataloader passes data to the model based on the batch size.
- The output from the neural network, `outputs`, is compared to the `targets` tensor and the loss is calculated using `loss_function()`.
- The loss value is used to optimize the weights of the neurons in the network.
- Every 100 steps the running loss and accuracy are logged to the wandb service. This log is then used to generate graphs for analysis, such as [these](https://app.wandb.ai/abhimishra-91/transformers_tutorials_sentiment?workspace=user-abhimishra-91).
- After every epoch the accuracy is printed in the console, and both the loss and the accuracy are logged to the wandb service.
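The sequence of events listed above (forward pass, loss, backward pass, optimizer step) can be sketched on a toy problem. Everything here is a stand-in chosen for illustration: a single linear layer replaces the Roberta network, the data is random, and the plain `optimizer.step()` is used rather than the TPU-specific call.

```python
import torch

torch.manual_seed(0)
features = torch.randn(64, 8)
targets = (features[:, 0] > 0).long()  # toy labels that a linear model can separate

model = torch.nn.Linear(8, 2)          # stand-in for the fine-tuned network
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

losses = []
model.train()
for step in range(50):
    optimizer.zero_grad()                    # clear gradients from the previous step
    outputs = model(features)                # forward pass: logits of shape (64, 2)
    loss = loss_function(outputs, targets)   # compare logits with integer targets
    loss.backward()                          # compute gradients
    optimizer.step()                         # update the weights
    losses.append(loss.item())

accuracy = (outputs.argmax(dim=1) == targets).float().mean().item() * 100
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, accuracy {accuracy:.1f}%")
```

`CrossEntropyLoss` expects raw logits and integer class indices, which is why the targets are cast to `torch.long` in the training function.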

In [None]:
# Function to fine tune the model based on the epochs, model, tokenizer and other arguments

def train(epoch, model, device, training_loader, optimizer, loss_function):
    n_correct = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_loss = 0
    model.train()
    for _,data in enumerate(training_loader, 0):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        # No squeeze() here: the logits must keep shape (batch, 2) even for a batch of one
        outputs = model(ids, mask)
        optimizer.zero_grad()
        loss = loss_function(outputs, targets)
        tr_loss += loss.item()
        big_val, big_idx = torch.max(outputs.data, dim=1)
        n_correct += calcuate_accu(big_idx, targets)

        nb_tr_steps += 1
        nb_tr_examples+=targets.size(0)
        
        if _%100==0:
            loss_step = tr_loss/nb_tr_steps
            accu_step = (n_correct*100)/nb_tr_examples 
            wandb.log({"Training Loss per 100 steps": loss_step})
            wandb.log({"Training Accuracy per 100 steps": accu_step})

        optimizer.zero_grad()
        loss.backward()
        
        # # When using GPU or CPU
        # optimizer.step()
        
        # When using TPU
        xm.optimizer_step(optimizer)
        xm.mark_step()

    print(f'The Total Accuracy for Epoch {epoch}: {(n_correct*100)/nb_tr_examples}')
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    wandb.log({"Training Loss Epoch": epoch_loss})
    wandb.log({"Training Accuracy Epoch": epoch_accu})

<a id='section04'></a>
### Validating the Model Performance: Function

During the validation stage we pass the unseen data (validation dataset), the trained model, and the device details to the function to perform the validation run. This step generates a predicted `encoded_polarity` value for data that the model has not seen during training. 

This is then compared to the actual `encoded_polarity` to give us the validation accuracy and loss.

This function is called in the `run()`

This unseen data is the 30% of the `IMDB Dataset` which was separated during the dataset creation stage. 
During the validation stage the weights of the model are not updated: the model is switched to evaluation mode and gradient tracking is disabled with `torch.no_grad()`. 

The validation accuracy and loss are logged to wandb every 100th step and once for the complete validation run. 
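The evaluation logic can be sketched with hand-written logits so that the result is easy to check by hand. The helper name `evaluate` is hypothetical, and `torch.nn.Identity` is used as a stand-in model that simply returns the logits it is given.

```python
import torch

def evaluate(model, batches, loss_function):
    """Return mean loss per batch and accuracy in percent, without updating weights."""
    model.eval()                       # disables dropout
    total_loss, n_correct, n_examples = 0.0, 0, 0
    with torch.no_grad():              # no gradients are tracked during validation
        for inputs, targets in batches:
            outputs = model(inputs)
            total_loss += loss_function(outputs, targets).item()
            n_correct += (outputs.argmax(dim=1) == targets).sum().item()
            n_examples += targets.size(0)
    return total_loss / len(batches), 100.0 * n_correct / n_examples

model = torch.nn.Identity()  # stand-in: passes the logits through unchanged
batches = [
    (torch.tensor([[2.0, 1.0], [0.0, 3.0]]), torch.tensor([0, 1])),
    (torch.tensor([[1.0, 0.0], [0.5, 0.2]]), torch.tensor([1, 0])),
]
loss, accuracy = evaluate(model, batches, torch.nn.CrossEntropyLoss())
print(accuracy)  # 75.0
```

Three of the four rows have their largest logit at the target index, which gives the 75% accuracy.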

In [None]:
# Function to run the validation dataloader to validate the performance of the fine tuned model. 

def valid(epoch, model, device, validation_loader, loss_function):
    n_correct = 0; total = 0
    nb_tr_examples, nb_tr_steps = 0, 0
    tr_loss = 0
    model.eval()
    with torch.no_grad():
        for _,data in enumerate(validation_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)

            # No squeeze() here: the logits must keep shape (batch, 2) even for a batch of one
            outputs = model(ids, mask)
            loss = loss_function(outputs, targets)
            tr_loss += loss.item()
            big_val, big_idx = torch.max(outputs.data, dim=1)
            n_correct += calcuate_accu(big_idx, targets)

            nb_tr_steps += 1
            nb_tr_examples+=targets.size(0)
            
            if _%100==0:
                loss_step = tr_loss/nb_tr_steps
                accu_step = (n_correct*100)/nb_tr_examples 
                wandb.log({"Validation Loss per 100 steps": loss_step})
                wandb.log({"Validation Accuracy per 100 steps": accu_step})
        
    epoch_loss = tr_loss/nb_tr_steps
    epoch_accu = (n_correct*100)/nb_tr_examples
    wandb.log({"Validation Loss Epoch": epoch_loss})
    wandb.log({"Validation Accuracy Epoch": epoch_accu})
    print(f'The Validation Accuracy: {(n_correct*100)/nb_tr_examples}')

<a id='section05'></a>
### Run Function

The `run()` as the name suggests is the central location to run all the functions/flows created above in the notebook. The following steps are executed in the `run()`:


<a id='section501'></a>
#### Initializing WandB 

* The `run()` begins with initializing WandB run under a specific project. This command initiates a new run for each execution of this command. 

* We have seen wandb in action in one of the previous notebooks. This notebook is used to log some additional metrics. 

* This particular tutorial is logged in the project: **[transformers_tutorials_sentiment](https://app.wandb.ai/abhimishra-91/transformers_tutorials_sentiment?workspace=user-abhimishra-91)**

**One of the dashboards from the project**
![](https://github.com/abhimishra91/transformers-tutorials/blob/master/meta/wandb-sentiment.jpg?raw=1)

* Visit the project page to see the details of different runs and what information is logged by the service. 

* Following the initialization of the WandB service we define configuration parameters that will be used across the tutorial such as `batch_size`, `epoch`, `learning_rate` etc.

* These parameters are also passed to the WandB config. The config construct with all the parameters can be optimized using the Sweep service from WandB. Currently, that is out of scope for this tutorial. 


<a id='section502'></a>
#### Importing and Pre-Processing the domain data

We will be working with the data and preparing it for fine tuning purposes. 
*Assuming that the `IMDB Dataset.csv` is already downloaded in your `data` folder*

* The file is imported as a dataframe with the column headers `review` and `sentiment`.
* The dataframe is cleaned to remove the unwanted columns.
* All these steps are done using the `Preprocess` class defined above.
* The final dataframe will be something like this:

|review|encoded_polarity|
|--|--|
|review-1|0|
|review-2|1|
|review-3|1|


<a id='section503'></a>
#### Creation of Dataset and Dataloader

* The updated dataframe is divided in a 70-30 ratio for training and validation. 
* Both dataframes are passed to the `CustomDataset` class for tokenization of the review and conversion of its sentiment to a tensor.
* The tokenization is done using the Roberta tokenizer.
* Train and validation parameters are defined and passed to the PyTorch `DataLoader` construct to create the `train` and `validation` data loaders.
* These dataloaders will be passed to `train()` and `valid()` respectively for training and validation.
* The shape of the datasets is printed in the console.
* All these actions are performed using the `return_dataloader()` function and the `CustomDataset` class defined above.


<a id='section504'></a>
#### Neural Network and Optimizer

* In this stage we define the model and optimizer that will be used for training and to update the weights of the network. 
* We are using the `roberta-base` transformer model for our project. You can read about the `RoBERTa model` and its features above. 
* The model is instantiated and returned using `return_model()` and `ModelClass`.
* We are using the `Adam` optimizer for our project. This has been the standard for all our tutorials and can be changed to see how different optimizers perform with different learning rates. 
* There is also scope for doing more with the optimizer, such as decay and momentum, to dynamically update the learning rate and other parameters. All those concepts have been kept out of scope for these tutorials. 


<a id='section505'></a>
#### Training Model and Logging to WandB

* Following that, we call `train()` with all the necessary parameters.
* Loss and accuracy at every 100th step are logged to the WandB service. 
* Loss and accuracy at the end of every epoch are logged in WandB, and the accuracy is also printed in the console.


<a id='section506'></a>
#### Validation

* After the training is completed, the validation step is initiated.
* As defined in the validation function, the model weights are not updated. We use the fine tuned model to predict the encoded sentiment.
* An output is printed on the console giving the accuracy at the end of Validation. 

In [None]:
def run():
    
    # WandB – Initialize a new run
    wandb.init(project="transformers_tutorials_sentiment")
    
    # Defining some key variables that will be used later on in the training
    config = wandb.config 
    config.MAX_LEN = 512
    config.TRAIN_BATCH_SIZE = 4
    config.VALID_BATCH_SIZE = 2
    config.EPOCHS = 2
    config.LEARNING_RATE = 1e-05
    tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

    # Reading the dataset and pre-processing it for usage
    df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/IMDB Dataset.csv', encoding='latin-1')
    pre = Preprocess(df)
    encoding_dict, df = pre.processing()

    # Creating the training and validation dataloader using the functions defined above
    training_loader, validation_loader = return_dataloader(df, tokenizer, config.TRAIN_BATCH_SIZE, config.VALID_BATCH_SIZE, config.MAX_LEN)

    # Defining the model based on the function and ModelClass defined above
    model = return_model(device)

    # Creating the loss function and optimizer
    loss_function = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(params =  model.parameters(), lr=config.LEARNING_RATE)

    # Fine tuning the model using the train function:
    for epoch in range(config.EPOCHS):
        train(epoch, model, device, training_loader, optimizer, loss_function)

    # Running the validation function to validate the performance of the trained model
    valid(epoch, model, device, validation_loader, loss_function)

In [None]:
run()

FULL Dataset: (50000, 2)
TRAIN Dataset: (35000, 2)
VAL Dataset: (15000, 2)
The Total Accuracy for Epoch 0: 91.74285714285715
The Total Accuracy for Epoch 1: 95.54
The Validation Accuracy: 94.68
