<a href="https://colab.research.google.com/github/codistro/Articles/blob/main/covid_tweet_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based language model developed by Google that set state-of-the-art results on a range of NLP benchmarks when it was released. It is pre-trained once on a large text corpus and can later be fine-tuned for a specific task. In this post we will see fine-tuning in action.

<img src="https://miro.medium.com/v2/resize:fit:2000/format:webp/1*kSSld7QWwSggc29EbzX7mQ.png" style="height:500px" />

We will fine-tune BERT on a classification task: predicting the sentiment of COVID-related tweets.

We use the Hugging Face libraries to fine-tune the model. They simplify the whole process, from text preprocessing to training.

## BERT

BERT was pre-trained on the BooksCorpus dataset and English Wikipedia. It obtained state-of-the-art results on eleven natural language processing tasks.

BERT was pre-trained on two tasks simultaneously:

*   Masked language modelling (MLM): 15% of the input tokens are masked, and the model is trained to predict the original tokens
*   Next Sentence Prediction (NSP): given two sentences A and B, predict whether B follows A
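The masking step of MLM can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual pre-training code: it only replaces the selected tokens with `[MASK]`, whereas the original recipe also swaps some selected tokens for random tokens or leaves them unchanged. The function name `mask_tokens` and the toy token list are assumptions made for this example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Mask roughly `mask_prob` of the tokens.

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))

    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]   # what the model must recover
        masked[pos] = "[MASK]"       # what the model actually sees
    return masked, targets

# Toy input: 20 placeholder tokens, so 15% corresponds to 3 masked positions
tokens = [f"tok{i}" for i in range(20)]
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

During pre-training, the loss is computed only at the masked positions, which is what forces the model to use context from both the left and the right.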

BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.

As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.


## Dataset

We are using the Coronavirus tweets NLP — Text Classification dataset available on Kaggle.

The dataset has two files: Corona_NLP_train.csv (about 41k entries) and Corona_NLP_test.csv (about 4k entries).
These are the first five entries of the training data:

<img src="https://miro.medium.com/v2/resize:fit:720/format:webp/1*BAQHuxVUdwfbEeYElGXRkA.png" style="height:150px" />

As you can see, the data has six columns: UserName, ScreenName, Location, TweetAt, OriginalTweet and Sentiment. We are only interested in two of them: OriginalTweet, which contains the actual tweet, and Sentiment, which is the label for that tweet.

These tweets are classified into 5 categories — ‘Neutral’, ‘Positive’, ‘Extremely Negative’, ‘Negative’, ‘Extremely Positive’. Hence the number of labels is 5.


# Loading Data and Preprocessing

We will be using the Hugging Face libraries for this project. We need to install two packages:

*   ***transformers:*** The Hugging Face implementation of transformer models. It lets us download a wide range of pre-trained models
*   ***datasets:*** Loads our own data files and can also download datasets that are available on the Hugging Face Hub

In [None]:
!pip install transformers
!pip install datasets

The Hugging Face AutoTokenizer takes care of tokenization. We can download the tokenizer that corresponds to our model, which is BERT in this case.

In [1]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

tokenizer("Attention is all you need")

{'input_ids': [101, 1335, 5208, 2116, 1110, 1155, 1128, 1444, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
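The output above contains nine ids for a five-word sentence. Ids 101 and 102 are the special `[CLS]` and `[SEP]` tokens that BERT tokenizers add at the start and end, which leaves seven tokens for five words, so at least one word was split into sub-word pieces. BERT tokenizers use WordPiece for this. The sketch below illustrates the greedy longest-match-first idea behind WordPiece with a tiny hypothetical vocabulary; the vocabulary and the resulting pieces are made up for illustration and are not taken from the real `bert-base-cased` vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces, longest match first.

    Pieces that continue a word are prefixed with '##'.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Hypothetical toy vocabulary, for illustration only
vocab = {"At", "##ten", "##tion", "is", "all", "you", "need"}

tokens = []
for word in "Attention is all you need".split():
    tokens.extend(wordpiece(word, vocab))
print(tokens)  # ['At', '##ten', '##tion', 'is', 'all', 'you', 'need']
```

The real tokenizer then maps each piece to its integer id and adds the special tokens, which is what produces `input_ids`.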

Here we use load_dataset from the datasets library. load_dataset can download datasets from the Hugging Face Hub, or it can load our own custom files.

We specify the format as CSV and pass the file names to data_files as a dictionary that maps each split name to its file. This loads our train and test files into the dataset variable.

In [2]:

from datasets import load_dataset
dataset = load_dataset('csv', data_files={'train': 'Corona_NLP_train.csv', 'test': 'Corona_NLP_test.csv'}, encoding = "ISO-8859-1")



As part of preprocessing, we will perform two steps:

*   Convert Sentiment into an integer
*   Tokenize the tweets

We will use the map function of the dataset, which is similar to the apply function of a pandas DataFrame. It takes a function as an argument and applies it to every example in the dataset.

In [3]:
# Integer id for each sentiment class
label2id = {
    'Positive': 0,
    'Negative': 1,
    'Neutral': 2,
    'Extremely Positive': 3,
    'Extremely Negative': 4,
}

def transform_labels(example):
    # Look up the class id; an unexpected label raises KeyError
    # instead of being silently mapped to class 0
    return {'labels': label2id[example['Sentiment']]}

def tokenize_data(example):
    # Pad every tweet to the model's maximum length and truncate anything longer
    return tokenizer(example['OriginalTweet'], padding='max_length', truncation=True)

dataset = dataset.map(tokenize_data, batched=True)

remove_columns = ['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet', 'Sentiment']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)


Map:   0%|          | 0/41157 [00:00<?, ? examples/s]

Map:   0%|          | 0/3798 [00:00<?, ? examples/s]

Map:   0%|          | 0/41157 [00:00<?, ? examples/s]

Map:   0%|          | 0/3798 [00:00<?, ? examples/s]
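To see what `padding='max_length'` does to each example, here is a minimal sketch in plain Python. It reuses the `input_ids` from the tokenizer output shown earlier. The helper name `pad_to_length`, the pad id of 0 and the target length of 12 are assumptions chosen for illustration; the real tokenizer pads to the model's maximum length, which is much longer.

```python
def pad_to_length(input_ids, max_length, pad_id=0):
    """Right-pad a list of token ids and build the matching attention mask."""
    if len(input_ids) > max_length:
        raise ValueError("sequence is longer than max_length; truncate it first")
    n_pad = max_length - len(input_ids)
    return {
        # Real tokens first, then padding ids
        "input_ids": input_ids + [pad_id] * n_pad,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(input_ids) + [0] * n_pad,
    }

# Token ids for "Attention is all you need" from the tokenizer output above
ids = [101, 1335, 5208, 2116, 1110, 1155, 1128, 1444, 102]
padded = pad_to_length(ids, max_length=12)
print(padded["input_ids"])       # [101, 1335, 5208, 2116, 1110, 1155, 1128, 1444, 102, 0, 0, 0]
print(padded["attention_mask"])  # [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

Padding every example to the same length makes it possible to stack them into fixed-size batches, at the cost of extra computation on padding positions.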

## Training
There are two ways to train the model: we can write our own training loop, or we can use the Trainer class from the Hugging Face transformers library.

In this case, we will use Trainer. To use it, we first need to define the training arguments, such as the output directory name, the number of epochs and the batch size.

In [4]:

from transformers import TrainingArguments

training_args = TrainingArguments("test_trainer", num_train_epochs=3)

Let’s download the BERT model now, which is very simple using the AutoModelForSequenceClassification class.

The classification model also expects a num_labels argument, which is the number of classes in our data. A linear layer is attached on top of the BERT model to produce one output per class.
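What the added linear layer computes can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: it ignores details such as dropout, uses random numbers in place of a real BERT sentence representation, and assumes a hidden size of 768 (the size used by BERT-base). The function name `linear_head` is made up for this example.

```python
import random

def linear_head(pooled, weights, bias):
    """Compute logits = W @ pooled + b for a single example."""
    return [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]

hidden_size = 768  # assumption: hidden size of BERT-base
num_labels = 5     # number of sentiment classes in our data

rng = random.Random(0)
# Stand-in for the sentence representation produced by BERT
pooled = [rng.gauss(0, 1) for _ in range(hidden_size)]
# Freshly initialized head: one weight row and one bias per class
weights = [[rng.gauss(0, 0.02) for _ in range(hidden_size)] for _ in range(num_labels)]
bias = [0.0] * num_labels

logits = linear_head(pooled, weights, bias)
predicted_class = max(range(num_labels), key=logits.__getitem__)
print(len(logits), predicted_class)
```

Because this head starts from random weights, the model's predictions are meaningless until it has been fine-tuned on labeled data.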

In [5]:

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=5)

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification 

Before starting the training, we will split our training data into a training set and an evaluation set. We keep 40k examples for training and 1k for evaluation.

In [6]:
train_dataset = dataset['train'].shuffle(seed=10).select(range(40000))
eval_dataset = dataset['train'].shuffle(seed=10).select(range(40000, 41000))
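The split above relies on both shuffle calls using the same seed, so they produce the same ordering and the two index ranges cannot overlap. The sketch below checks that reasoning with Python's `random` module. The library uses its own random generator, so the actual order of examples will differ; only the principle carries over. The helper name `shuffled_indices` is an assumption for this example.

```python
import random

def shuffled_indices(n, seed):
    """Return a deterministic permutation of range(n) for a given seed."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx

n_rows = 41157  # size of the training split reported by the map step

first = shuffled_indices(n_rows, seed=10)
second = shuffled_indices(n_rows, seed=10)

train_idx = first[:40000]
eval_idx = second[40000:41000]

print(first == second)                     # True: same seed, same order
print(set(train_idx).isdisjoint(eval_idx)) # True: no example is in both sets
print(n_rows - 41000)                      # 157 rows are left unused
```

If the two calls used different seeds, the same tweet could end up in both sets and the evaluation score would be inflated.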

To use the Hugging Face Trainer, we import the Trainer class and pass it the model, the training arguments and the datasets.

In [7]:
from transformers import Trainer

trainer = Trainer(
    model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
)

Training will run for 3 epochs, which can be adjusted through the training arguments.
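The total number of training steps follows from the dataset size, the batch size and the number of epochs. A rough sketch, assuming a per-device batch size of 8 on a single device with no gradient accumulation (none of these are set explicitly in this post, so treat them as assumptions):

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

steps = total_training_steps(num_examples=40000, batch_size=8, num_epochs=3)
print(steps)  # 15000
```

Lowering the number of epochs or raising the batch size reduces the step count proportionally.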

In [8]:
trainer.train()



  0%|          | 0/15000 [00:00<?, ?it/s]


Once training is done, we can run trainer.evaluate() to check the accuracy. Before that, we need to load an accuracy metric and define a function that computes it from the model's predictions.


In [7]:
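import numpy as np
# Note: load_metric was deprecated in later releases of datasets in favor of
# the separate evaluate package (import evaluate; evaluate.load("accuracy")).
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair produced by the Trainer
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)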

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
trainer.evaluate()


***** Running Evaluation *****
  Num examples = 3798
  Batch size = 8


{'eval_accuracy': 0.8359662980516062,
 'eval_loss': 0.6989739537239075,
 'eval_runtime': 75.3905,
 'eval_samples_per_second': 50.378,
 'eval_steps_per_second': 6.301}
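The accuracy reported as `eval_accuracy` is the fraction of examples whose highest-scoring class matches the label. The sketch below reproduces that calculation in plain Python on a handful of made-up logits; the numbers are illustrative only and unrelated to the results above.

```python
def argmax(row):
    """Index of the largest value in a list (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose arg-max class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up logits for four examples and five classes
logits = [
    [2.0, 0.1, -1.0, 0.3, 0.0],   # predicts class 0
    [0.2, 1.5, 0.1, 0.0, -0.3],   # predicts class 1
    [0.0, 0.1, 0.2, 3.0, 0.5],    # predicts class 3
    [1.0, 0.9, 0.8, 0.7, 0.6],    # predicts class 0
]
labels = [0, 1, 3, 2]

print(accuracy(logits, labels))  # 0.75
```

Accuracy treats all mistakes equally, so confusing 'Positive' with 'Extremely Positive' counts the same as confusing it with 'Extremely Negative'.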

In [8]:
import torch

# Save only the fine-tuned weights (the state dict), not the whole model object
model_path = "./covidfe_tweet_classification.pt"
torch.save(model.state_dict(), model_path)