<a href="https://colab.research.google.com/github/niccronc/AITA/blob/master/AITA_with_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Morality predictions on AITA with Bert

## Introduction

In this notebook we train and validate a sentiment analysis model on the AITA dataset, available [here](https://github.com/iterative/aita_dataset). We thank its authors for making this cleaned dataset publicly available.

We use the BERT model, in the version developed by [HuggingFace](https://huggingface.co/). In fact, this mini-project started off with the goal of playing around with HuggingFace's [Transformers](https://huggingface.co/transformers/) library.

We also thank Abhishek Kumar Mishra, as this notebook was adapted from [his DistilBert notebook](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_multiclass_classification.ipynb).


<a id='section01'></a>
### Importing Python Libraries and preparing the environment

In this step we import the libraries and modules needed to run the notebook:
* Pandas
* PyTorch
* PyTorch utilities for Dataset and DataLoader
* Transformers
* BERT model and tokenizer

After that we prepare the device for CUDA execution. This configuration is needed if you want to make use of an on-board GPU.

In [1]:
!pip install transformers


In [2]:
# Importing the libraries needed
import pandas as pd
import torch
import transformers
from torch.utils.data import Dataset, DataLoader
from transformers import BertModel, BertTokenizer

In [3]:
# Setting up the device for GPU usage

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

device

'cuda'

<a id='section02'></a>
### Importing and Pre-Processing the data

We now load the data and prepare it for fine-tuning.

Copy the `aita_clean.csv` file to a `data` folder inside the `Colab Notebooks` folder of your own Google Drive.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
path='/content/drive/My Drive/Colab Notebooks/data/aita_clean.csv'

In [6]:
# Import the csv into pandas dataframe and add the headers
df=pd.read_csv(path)
df.head()


Unnamed: 0,id,timestamp,title,body,edited,verdict,score,num_comments,is_asshole
0,1ytxov,1393279000.0,[AITA] I wrote an explanation in TIL and came ...,[Here is the post in question](http://www.redd...,False,asshole,52,13.0,1
1,1yu29c,1393281000.0,[AITA] Threw my parent's donuts away,"My parents are diabetic, morbidly obese, and a...",1393290576.0,asshole,140,27.0,1
2,1yu8hi,1393285000.0,I told a goth girl she looked like a clown.,I was four.,False,not the asshole,74,15.0,0
3,1yuc78,1393287000.0,[AItA]: Argument I had with another redditor i...,http://www.reddit.com/r/HIMYM/comments/1vvfkq/...,1393286962.0,everyone sucks,22,3.0,1
4,1yueqb,1393288000.0,[AITA] I let my story get a little long and bo...,,False,not the asshole,6,4.0,0


In [7]:
# Remove unwanted columns, keeping only the post text (title + body) and the label is_asshole (0 for no, 1 for yes)

df['text'] = df['title']+df['body'].fillna('')

df=df[['text','is_asshole']]

***Warning: at the moment, we are not using this trick since the model performs decently without it.***

To account for the imbalanced dataset, we can either undersample the majority class or weight the loss function accordingly. The cells below implement the undersampling.

In [None]:
assholes = len(df[df.is_asshole == 1])
no_assholes = len(df[df.is_asshole == 0])

In [None]:
df_assholes = df[df.is_asshole == 1]

In [None]:
df_no_assholes = df[df.is_asshole == 0].sample(n=assholes,random_state=200)

len(df_no_assholes) == assholes

True

In [None]:
df_undersampled = pd.concat([df_assholes, df_no_assholes])

In [None]:
df_undersampled = df_undersampled.sample(frac=1) #this is to shuffle rows

<a id='section03'></a>
### Preparing the Dataset and Dataloader

We start by defining a few key variables that will be used later during the training/fine-tuning stage.
We then create the Dataset class, which defines how the text is pre-processed before it is sent to the neural network. We also define the Dataloader that feeds the data to the neural network in batches.
Dataset and Dataloader are constructs of the PyTorch library for defining and controlling the data pre-processing and its passage to the neural network. For further reading on Dataset and Dataloader, see the [docs at PyTorch](https://pytorch.org/docs/stable/data.html).

#### *Triage* Dataset Class
- This class accepts the dataframe as input and generates the tokenized output that the BERT model consumes during training.
- We use the BERT tokenizer to tokenize the data in the `text` column of the dataframe.
- The tokenizer's `encode_plus` method performs the tokenization and generates the necessary outputs, namely `ids`, `attention_mask` and `token_type_ids`.
- To read further into the tokenizer, [refer to this document](https://huggingface.co/transformers/model_doc/distilbert.html#distilberttokenizer); the DistilBERT tokenizer described there behaves like the BERT one.
- `targets` is the label of the post, taken from the `is_asshole` column.
- The *Triage* class is used to create 2 datasets, for training and for validation.
- *Training Dataset* is used to fine-tune the model: **80% of the original data**
- *Validation Dataset* is used to evaluate the performance of the model: the remaining 20%, which the model has not seen during training.
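
To make the tokenizer outputs listed above more concrete, here is a minimal pure-Python sketch of the padding and masking step. It is not the tokenizer's actual implementation: `pad_and_mask` is a hypothetical helper, the word ids `7, 8, 9` are made up, and the special-token ids are assumptions that should be checked against `tokenizer.cls_token_id`, `tokenizer.sep_token_id` and `tokenizer.pad_token_id`.

```python
# Assumed special-token ids; verify them on the real tokenizer before relying on them.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0


def pad_and_mask(token_ids, max_len):
    """Truncate, add [CLS]/[SEP], pad to max_len and build the attention mask."""
    body = token_ids[: max_len - 2]      # leave room for [CLS] and [SEP]
    ids = [CLS_ID] + body + [SEP_ID]
    mask = [1] * len(ids)                # 1 marks a real token
    padding = max_len - len(ids)
    ids += [PAD_ID] * padding
    mask += [0] * padding                # 0 marks padding, which attention ignores
    token_type_ids = [0] * max_len       # single-segment input: all zeros
    return ids, mask, token_type_ids


ids, mask, token_type_ids = pad_and_mask([7, 8, 9], max_len=8)
print(ids)   # [101, 7, 8, 9, 102, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Every item returned by the dataset therefore has the same length `max_len`, which is what allows the dataloader to stack items into a batch.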

#### Dataloader
- The Dataloader is used to create the training and validation dataloaders, which feed data to the neural network in a defined manner. This is needed because the whole dataset cannot be loaded into memory at once, so the amount of data loaded into memory and passed to the neural network has to be controlled.
- This control is achieved through `batch_size`, the number of posts per batch, together with `max_len`, the number of tokens kept per post by the dataset.
- Training and validation dataloaders are used in the training and validation parts of the flow respectively.

In [8]:
# Defining some key variables that will be used later on in the training
MAX_LEN = 512
TRAIN_BATCH_SIZE = 4
VALID_BATCH_SIZE = 2
EPOCHS = 1
LEARNING_RATE = 1e-05
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')


In [9]:
class Triage(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.len = len(dataframe)
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        
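    def __getitem__(self, index):
        # Collapse runs of whitespace (newlines, tabs) into single spaces
        text = str(self.data.text[index])
        text = " ".join(text.split())
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            # Posts longer than max_len must be cut explicitly; without this
            # the tokenizer warns on every long post
            truncation=True,
            pad_to_max_length=True,
            return_token_type_ids=True
        )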
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]

        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.data.is_asshole[index], dtype=torch.long)
        } 
    
    def __len__(self):
        return self.len

In [10]:
# Creating the dataset and dataloader for the neural network

train_size = 0.8
train_dataset=df.sample(frac=train_size,random_state=200)
test_dataset=df.drop(train_dataset.index).reset_index(drop=True)
train_dataset = train_dataset.reset_index(drop=True)


print("FULL Dataset: {}".format(df.shape))
print("TRAIN Dataset: {}".format(train_dataset.shape))
print("TEST Dataset: {}".format(test_dataset.shape))

training_set = Triage(train_dataset, tokenizer, MAX_LEN)
testing_set = Triage(test_dataset, tokenizer, MAX_LEN)

FULL Dataset: (97628, 2)
TRAIN Dataset: (78102, 2)
TEST Dataset: (19526, 2)


In [11]:
train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

<a id='section04'></a>
### Creating the Neural Network for Fine Tuning

#### Neural Network
 - We create a neural network with the `BERTClass`.
 - This network consists of the BERT language model followed by a `dropout` and finally a `Linear` layer that produces the final outputs.
 - The data is fed to the BERT language model as defined in the dataset.
 - The final layer outputs one raw score per post, which is compared to the `encoded category` to determine the accuracy of the model's predictions. Passing this score through a sigmoid gives the estimated likelihood of belonging to the positive class.
 - We create an instance of the network called `model`. This instance is used for training and then saved for future inference.
 
#### Loss Function and Optimizer
 - The `Loss Function` and the `Optimizer` are defined in the next cell.
 - The `Loss Function` is used to calculate the difference between the output produced by the model and the actual label.
 - We use `CrossEntropyLoss` as loss function. This requires us to manually adjust the outputs of the model, since this loss function expects an input with one column per class, while the model returns a single score per post. See the [documentation](https://pytorch.org/docs/stable/nn.html#crossentropyloss).
 - The `Optimizer` is used to update the weights of the neural network to improve its performance.
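
The adjustment mentioned above consists in stacking the columns `(1 - z, z)`, where `z` is the raw score of the model, and feeding them to the cross-entropy loss. The following pure-Python sketch (the helper names and the value of `z` are made up for illustration) checks numerically that this is the same as a binary cross-entropy on the logit `2z - 1`, because `softmax([1 - z, z])[1] = sigmoid(2z - 1)`.

```python
import math


def cross_entropy_two_logits(logits, target):
    """Cross-entropy for one example: -log softmax(logits)[target]."""
    log_norm = math.log(sum(math.exp(v) for v in logits))
    return log_norm - logits[target]


def binary_cross_entropy_from_logit(logit, target):
    """Binary cross-entropy for one example, with p = sigmoid(logit)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -math.log(p) if target == 1 else -math.log(1.0 - p)


z = 0.8  # hypothetical raw score coming out of the linear layer
for target in (0, 1):
    stacked = cross_entropy_two_logits([1.0 - z, z], target)
    binary = binary_cross_entropy_from_logit(2.0 * z - 1.0, target)
    # The two numbers on each line agree
    print(target, round(stacked, 6), round(binary, 6))
```

One consequence worth keeping in mind: under this loss the predicted probability of the positive class is `sigmoid(2z - 1)`, which crosses 0.5 at `z = 0.5`, whereas thresholding `sigmoid(z)` at 0.5 puts the boundary at `z = 0`.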
 
#### Further Reading
- [Pytorch Documentation for Loss Functions](https://pytorch.org/docs/stable/nn.html#loss-functions)
- [Pytorch Documentation for Optimizer](https://pytorch.org/docs/stable/optim.html)

In [12]:
# Creating the customized model, by adding a dropout and a dense layer on top of BERT to get the final output for the model.

class BERTClass(torch.nn.Module):
    def __init__(self):
        super(BERTClass, self).__init__()
        self.l1 = transformers.BertModel.from_pretrained('bert-base-cased')
        self.l2 = torch.nn.Dropout(0.3)
        self.l3 = torch.nn.Linear(768, 1)
    
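    def forward(self, ids, mask, token_type_ids):
        # Element 1 of the BERT output is the pooled [CLS] representation, shape (batch, 768)
        output_1 = self.l1(ids, attention_mask=mask, token_type_ids=token_type_ids)[1]
        output_2 = self.l2(output_1)
        # One raw score (logit) per example, shape (batch, 1)
        output = self.l3(output_2)
        return output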

In [13]:
model = BERTClass()
model.to(device)


BERTClass(
  (l1): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(28996, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    

***Warning: at the moment we are not weighting the loss function, and thus the code cell below is not used.***

In [None]:
no_asshole_ratio = no_assholes / (no_assholes + assholes)
asshole_ratio = assholes / (no_assholes + assholes)

#distance = abs(no_asshole_ratio - asshole_ratio)
#n_steps = 2.5
#step = distance / n_steps

# Each class is weighted by the share of the other class, so the minority class counts more
weights=[asshole_ratio, no_asshole_ratio]
weight_tensor = torch.tensor(weights).to(device)

In [None]:
weight_tensor

tensor([0.2716, 0.7284], device='cuda:0')

In [14]:
# Creating the loss function and optimizer
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params =  model.parameters(), lr=LEARNING_RATE)

<a id='section05'></a>
### Fine Tuning the Model

After all the effort of loading and preparing the data and datasets, creating the model and defining its loss and optimizer, this is probably the easiest step in the process.

Here we define a training function that trains the model on the training dataset created above for a specified number of epochs (`EPOCHS`). The number of epochs is how many times the complete training data is passed through the network.

The following events happen in this function to fine-tune the neural network:
- The dataloader passes data to the model based on the batch size.
- The output from the model and the actual category are compared to calculate the loss.
- The loss value is used to optimize the weights of the neurons in the network.
- After every 5000 steps the loss value is printed in the console.
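
As a quick sanity check on the logging interval, the arithmetic below works out how many steps one epoch takes and at which steps the loss is printed. It only uses numbers that appear in this notebook (the size of the training split and `TRAIN_BATCH_SIZE`); the variable names are chosen here for illustration.

```python
import math

train_rows = 78102   # size of the TRAIN dataset printed above
batch_size = 4       # TRAIN_BATCH_SIZE
log_every = 5000     # logging interval used in the training function

# The last batch may be smaller than batch_size, hence the ceiling
steps_per_epoch = math.ceil(train_rows / batch_size)
logged_steps = list(range(0, steps_per_epoch, log_every))

print(steps_per_epoch)  # 19526
print(logged_steps)     # [0, 5000, 10000, 15000]
```

A single epoch therefore prints four loss values, each computed on one batch of four posts only, so they are noisy and should not be read as a smooth training curve.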

In [15]:
import logging

logging.basicConfig(level=logging.ERROR)
# Only log errors. This is the easiest way to silence the warning messages that show up at every training step and ultimately crash the JavaScript code behind the notebook.

In [16]:
# Training function: fine-tunes the BERT model on the 80% training split

def train(epoch):
    model.train()
    for step, data in enumerate(training_loader):
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
        targets = data['targets'].to(device, dtype = torch.long)

        # One raw score z per example; squeeze only the last axis so that a
        # batch of size 1 keeps its batch dimension
        outputs = model(ids, mask, token_type_ids).squeeze(-1)
        # CrossEntropyLoss expects one column per class: use (1 - z, z).
        # The first column is detached, so gradients flow only through z.
        complements = 1 - outputs.detach()

        loss = loss_function(torch.stack([complements, outputs], dim = 1), targets)
        if step % 5000 == 0:
            print(f'Epoch: {epoch}, Loss:  {loss.item()}')

        # Reset the gradients once per step, right before backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In [19]:
torch.cuda.empty_cache()
#This is necessary or else there is not enough memory to train the model.

In [20]:
for epoch in range(EPOCHS):
    train(epoch)

Epoch: 0, Loss:  0.7482631206512451
Epoch: 0, Loss:  0.34433355927467346
Epoch: 0, Loss:  0.6134215593338013
Epoch: 0, Loss:  0.11833330988883972


<a id='section06'></a>
### Validating the Model

During the validation stage we pass the unseen data (the testing dataset) to the model. This step determines how well the model performs on unseen data.

This unseen data is the 20% of `aita_clean.csv` which was separated during the Dataset creation stage.
During the validation stage the weights of the model are not updated. Only the final output is compared to the actual value. This comparison is then used to calculate the accuracy of the model.

***Warning: the two code cells below are currently not used***

In [None]:
def valid(model, testing_loader):
    model.eval()
    n_correct = 0; n_wrong = 0; total = 0
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
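            outputs = model(ids, mask, token_type_ids).squeeze(-1)
            # The model returns one score per example, so there is no class axis
            # to take an argmax over: threshold the sigmoid at 0.5 instead
            predictions = (torch.sigmoid(outputs) >= 0.5).long()
            total+=targets.size(0)
            n_correct+=(predictions==targets).sum().item()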
    return (n_correct*100.0)/total

In [None]:
print('This is the validation section to print the accuracy and see how it performs')
print('Here we are leveraging the dataloader created for the validation dataset, the approach is using more of pytorch')

acc = valid(model, testing_loader)
print("Accuracy on test data = %0.2f%%" % acc)

This is the validation section to print the accuracy and see how it performs
Here we are leveraging the dataloader created for the validation dataset, the approach is using more of pytorch
Accuracy on test data = 72.74%


In [21]:
from sklearn import metrics
import numpy as np

In [22]:
def validation(epoch):
    model.eval()
    fin_targets=[]
    fin_outputs=[]
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype = torch.long)
            mask = data['mask'].to(device, dtype = torch.long)
            token_type_ids = data['token_type_ids'].to(device, dtype = torch.long)
            targets = data['targets'].to(device, dtype = torch.long)
            outputs = model(ids, mask, token_type_ids).squeeze()
            #big_val, big_idx = torch.max(outputs.data, dim=1)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

In [23]:
for epoch in range(EPOCHS):
    outputs, targets = validation(epoch)
    outputs = np.array(outputs) >= 0.5
    accuracy = metrics.accuracy_score(targets, outputs)
    f1_score = metrics.f1_score(targets, outputs)
    recall = metrics.recall_score(targets, outputs)
    precision = metrics.precision_score(targets, outputs)
    print(f"Accuracy Score = {accuracy}")
    print(f"F1 Score = {f1_score}")
    print(f"Recall = {recall}")
    print(f"Precision = {precision}")


Accuracy Score = 0.6116460104476084
F1 Score = 0.4824244078902464
Recall = 0.6640360766629086
Precision = 0.3788187372708758
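
The four metrics are related: F1 is the harmonic mean of precision and recall. The short check below recomputes it from the two values printed above (the helper name `f1_from` is made up for this sketch).

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision = 0.3788187372708758  # printed above
recall = 0.6640360766629086     # printed above

f1 = f1_from(precision, recall)
print(round(f1, 4))  # 0.4824
```

Read together, the numbers say that the model finds about two thirds of the positive posts (recall), but that only about 38% of the posts it flags as positive really are positive (precision).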


<a id='section07'></a>
### Conclusions

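We obtain an accuracy of about 61.2%, with an F1 score of about 0.48. This is not great: judging by the class ratios computed in the loss-weighting cell above, always predicting the majority class would already score roughly 73%. Some difficulty is expected, though, since the data is inherently very noisy; we are, after all, trying to guess the general consensus on pretty messy real-life conundrums.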

This result is a bit worse than what was obtained by the creator of the dataset, as outlined in [this blog post](https://dvc.org/blog/a-public-reddit-dataset).

I suspect that accuracy can be improved, at least a bit, by adjusting tokenization to the problem at hand: the BERT tokenizer is very powerful but was built for somewhat clean text, while these scraped postings are full of grammar mistakes, abbreviations, and so on. In principle, there is no reason why BERT should perform worse than logistic regression.

Another option to try to improve accuracy might be to weight the loss function so that it gives more importance to correctly recognizing the minority class.
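
As a rough illustration of what such a weighting does, here is a pure-Python sketch of a class-weighted cross-entropy. It was not run as part of this notebook. The class shares come from the weight tensor shown earlier; the inverse-frequency scheme is one common choice and an assumption here, and `weighted_cross_entropy` is a hypothetical helper that follows the weighted-mean convention (sum of weighted losses divided by the sum of the weights used).

```python
import math

# Class shares read off the weight tensor shown earlier: class 0, class 1
class_share = [0.7284, 0.2716]

# Inverse-frequency weights, normalised to sum to 1 (assumed scheme)
inverse = [1.0 / share for share in class_share]
weights = [w / sum(inverse) for w in inverse]
print([round(w, 4) for w in weights])  # [0.2716, 0.7284]


def weighted_cross_entropy(batch_logits, targets, weights):
    """Weighted mean of the per-example cross-entropy losses."""
    total, norm = 0.0, 0.0
    for logits, target in zip(batch_logits, targets):
        log_norm = math.log(sum(math.exp(v) for v in logits))
        total += weights[target] * (log_norm - logits[target])
        norm += weights[target]
    return total / norm


# Two examples with uninformative scores: each contributes log(2)
loss = weighted_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1], weights)
print(round(loss, 4))  # 0.6931
```

With two classes, the normalised inverse-frequency weight of each class is simply the share of the other class, so a mistake on a minority-class post costs about 2.7 times as much as a mistake on a majority-class post.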

The technique used by the aforementioned blog post to take care of the imbalance, SMOTE, does not seem to be readily applicable here, as the outputs of the BERT tokenizer are integer-valued tensors. Furthermore, each integer value corresponds to a specific token, so averaging a bunch of them makes intrinsically no sense, as the meaning gets lost.
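
A toy example makes the problem with interpolating token ids visible. The vocabulary below is entirely hypothetical, and `smote_like_interpolation` only mimics the core SMOTE step (a synthetic point placed on the segment between two existing points); it is not the implementation from any library.

```python
# Hypothetical toy vocabulary: the ids are arbitrary labels, not coordinates
vocab = {0: "[PAD]", 1: "my", 2: "roommate", 3: "dog",
         4: "ate", 5: "cake", 6: "wedding", 7: "the"}


def decode(ids):
    """Map a list of token ids back to text."""
    return " ".join(vocab[i] for i in ids)


def smote_like_interpolation(a, b, lam):
    """Synthetic point a + lam * (b - a), rounded back to integer ids."""
    return [round(x + lam * (y - x)) for x, y in zip(a, b)]


post_a = [1, 3, 4, 7, 5]
post_b = [7, 5, 4, 1, 3]
synthetic = smote_like_interpolation(post_a, post_b, lam=0.5)

print(decode(post_a))     # my dog ate the cake
print(decode(post_b))     # the cake ate my dog
print(decode(synthetic))  # ate ate ate ate ate
```

The midpoint of two valid sentences is a valid list of ids, but the text it decodes to is unrelated to either parent, which is why interpolation in token-id space does not produce meaningful synthetic posts.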


<a id='section08'></a>
### Saving the Trained Model Artifacts for inference

This is the final step in the process of fine-tuning the model.

The model and its vocabulary are saved to Google Drive. These files can then be used in the future to make inference on new posts.

Please remember that a trained neural network is only useful when it is actually used for inference after its training.

In the lifecycle of an ML project this is only half the job done. We will leave the inference with this model for some other day.

In [None]:
# Saving the files for re-use

output_model_file = '/content/drive/My Drive/Colab Notebooks/data/pytorch_bert_aita.bin'
output_vocab_file = '/content/drive/My Drive/Colab Notebooks/data/vocab_bert_aita.bin'

model_to_save = model
torch.save(model_to_save, output_model_file)
tokenizer.save_vocabulary(output_vocab_file)

print('All files saved')

All files saved
