# Sentiment Analysis of Tweets with BERT

One BERT model that caught my eye while browsing the Hugging Face website was BERTweet, which was proposed in the paper below:

[BERTweet: A pre-trained language model for English Tweets](https://www.aclweb.org/anthology/2020.emnlp-demos.2.pdf)

BERTweet is the first public large-scale pre-trained language model for English Tweets. The corpus used to pre-train BERTweet consists of 850 million English Tweets.

In this notebook, I am going to attempt to fine-tune the BERTweet model for the purpose of sentiment analysis of Tweets.

## Setup

### Mounting Google Drive

In [29]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Imports

In [30]:
import sys
import os
os.chdir('/content/drive/My Drive/College - 4th Year/CA4023_NLT/Assignment3')

try:
    import torch
except ModuleNotFoundError:
    !{sys.executable} -m pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

try:
    import transformers
except ModuleNotFoundError:
    !{sys.executable} -m pip install transformers

try:
    import emoji
except ModuleNotFoundError:
    !{sys.executable} -m pip install emoji

In [31]:
import pandas as pd

## Loading Data

For the purpose of sentiment prediction of Tweets, I will need a dataset consisting of Tweets annotated with sentiment. The dataset which I have chosen is the [sentiment140 dataset](https://www.kaggle.com/kazanova/sentiment140), which is freely available on Kaggle. This dataset contains 1.6 million Tweets annotated with sentiment (0 = negative, 4 = positive). 

For the sake of practicality, I am only going to use a subset of the 1.6 million Tweets. Below, I extract 50,000 positive and 50,000 negative sentiment Tweets to use.



In [32]:
data_path = "./data/tweets/training.1600000.processed.noemoticon.csv"

In [33]:
df_tweets = pd.read_csv(data_path, names=["target", "id", "date", "flag", "user", "text"], encoding='latin-1')

In [34]:
df_tweets.head()

| | target | id | date | flag | user | text |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


For the purpose of sentiment analysis, I only require the *target* and *text* columns. Therefore, I can drop the remaining columns.

In [35]:
df_tweets = df_tweets.drop(columns=["id", "date", "flag", "user"])

In [36]:
df_tweets.head()

| | target | text |
| --- | --- | --- |
| 0 | 0 | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | is upset that he can't update his Facebook by ... |
| 2 | 0 | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | my whole body feels itchy and like its on fire |
| 4 | 0 | @nationwideclass no, it's not behaving at all.... |


In [37]:
# In sentiment140, 0 = negative and 4 = positive
df_neg = df_tweets[df_tweets["target"] == 0]
df_pos = df_tweets[df_tweets["target"] == 4]

In [38]:
# We only want 50,000 of each
df_pos = df_pos[:50000]
df_neg = df_neg[:50000]

In [39]:
# Concatenate the above to form our new df
df = pd.concat([df_pos, df_neg], ignore_index=True)

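Next, we put this data into lists, as this is the data structure which will be used from here onwards. The labels are remapped so that negative is 0 and positive is 1, since a two-class classifier expects class ids 0 and 1 rather than 0 and 4.

In [40]:
texts = list(df['text'])
# Map sentiment140 labels (0 = negative, 4 = positive) to class ids 0 and 1
labels = [1 if target == 4 else 0 for target in df['target']]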

The final data-preparation step is to split the data into train, validation and test sets, which we can use for tuning and evaluation. First, 10% of the data is held out as the test set.

In [41]:
from sklearn.model_selection import train_test_split

train_texts, test_texts, train_labels, test_labels = train_test_split(texts, labels, test_size=0.1)

A validation set is also useful to have, so 20% of the remaining training data is held out for validation.

In [42]:
train_texts, val_texts, train_labels, val_labels = train_test_split(train_texts, train_labels, test_size=0.2)

Below are the sizes of the resulting sets:

In [43]:
print("Training Set: ", len(train_texts))
print("Validation Set: ", len(val_texts))
print("Test Set: ", len(test_texts))

Training Set:  72000
Validation Set:  18000
Test Set:  10000
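
The sizes above follow from applying the two splits one after the other: the second `test_size` is taken from what remains after the first split, so the validation set is 18% of the full data rather than 20%. A minimal sketch of that arithmetic, where `split_sizes` is a hypothetical helper written for illustration (it uses integer percentages and does not reproduce scikit-learn's exact rounding):

```python
def split_sizes(n_total, test_pct=10, val_pct=20):
    """Return (train, val, test) sizes for two successive percentage splits."""
    n_test = n_total * test_pct // 100        # first split: hold out the test set
    n_remaining = n_total - n_test            # what is left for train + validation
    n_val = n_remaining * val_pct // 100      # second split applies to the remainder
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


print(split_sizes(100_000))  # (72000, 18000, 10000)
```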


## Normalising Tweets

Before fine-tuning the BERTweet model, the Tweets must be normalised by converting user mentions and web/URL links into the special tokens @USER and HTTPURL, respectively. Thankfully, the BERTweet tokeniser performs this pre-processing step itself when it is loaded through AutoTokenizer with the normalization argument enabled.
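
To make the effect of this step concrete, here is a rough sketch of the mention and link replacement using regular expressions. This is only an approximation for illustration: `normalise_tweet` and both patterns are my own assumptions, not the code BERTweet's tokeniser actually runs, and the real normaliser covers more cases.

```python
import re

# Simplified, assumed patterns; the tokeniser's own rules are more thorough.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@\w+")


def normalise_tweet(text):
    """Replace links with HTTPURL and user mentions with @USER."""
    text = URL_RE.sub("HTTPURL", text)
    text = MENTION_RE.sub("@USER", text)
    return text


print(normalise_tweet("@switchfoot http://twitpic.com/2y1zl - Awww"))
# @USER HTTPURL - Awww
```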

In [48]:
import torch
from transformers import AutoTokenizer

As the input Tweets are raw, we load the BERTweet tokeniser with normalisation mode enabled.

In [45]:
tokenizer = AutoTokenizer.from_pretrained("vinai/bertweet-base", normalization=True)

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.


Now, we can simply pass the lists of texts to the tokeniser. The truncation and padding arguments ensure that Tweets are truncated to be no longer than the model's maximum input length and that all of the sequences are padded to the same length.

In [18]:
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
val_encodings = tokenizer(val_texts, truncation=True, padding=True)
test_encodings = tokenizer(test_texts, truncation=True, padding=True)
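
As a toy illustration of what truncation and padding do to a batch of token ids, the sketch below works on plain Python lists. The helper `pad_and_truncate` and the padding id of 0 are assumptions made for the example; the real tokeniser uses its own padding token and also takes care of special tokens when truncating.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad all to the longest one."""
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (target_len - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [
        [1] * len(seq) + [0] * (target_len - len(seq)) for seq in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(batch["input_ids"])       # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Because `padding=True` pads to the longest sequence in each call, the train, validation and test encodings may end up with different padded lengths.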

## Creating PyTorch Datasets

The next step is to take the encodings and labels which we have and turn these into a PyTorch *Dataset* object. This can be done by simply subclassing *torch.utils.data.Dataset* and implementing the `__getitem__` and `__len__` methods. The data is put into this format so that it can be easily batched during training.

In [19]:
class TweetsDataset(torch.utils.data.Dataset):
  def __init__(self, encodings, labels):
    self.encodings = encodings
    self.labels = labels

  def __getitem__(self, idx):
    item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
    item['labels'] = torch.tensor(self.labels[idx])
    return item

  def __len__(self):
    return len(self.labels)

Now, we can create our train, val and test datasets.

In [20]:
train_dataset = TweetsDataset(train_encodings, train_labels)
val_dataset = TweetsDataset(val_encodings, val_labels)
test_dataset = TweetsDataset(test_encodings, test_labels)
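
To show why those two methods are enough for batching, here is a pure-Python sketch without torch. `ToyDataset` and `iter_batches` are hypothetical stand-ins written for this illustration; a real PyTorch data loader additionally shuffles and collates the items into tensors.

```python
class ToyDataset:
    """Pure-Python stand-in with the same two methods as TweetsDataset."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        # One example: the idx-th entry of every encoding field, plus its label
        item = {key: values[idx] for key, values in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)


def iter_batches(dataset, batch_size):
    """Yield lists of items using only len() and indexing."""
    for start in range(0, len(dataset), batch_size):
        stop = min(start + batch_size, len(dataset))
        yield [dataset[i] for i in range(start, stop)]


toy = ToyDataset({"input_ids": [[1, 2], [3, 4], [5, 6]]}, [0, 1, 0])
batches = list(iter_batches(toy, batch_size=2))
print(len(batches))  # 2
print(batches[1])    # [{'input_ids': [5, 6], 'labels': 0}]
```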

## Fine-tuning BERTweet

Now that the datasets are ready, the final step is to fine-tune the BERTweet model. The *transformers* package provides a very useful Trainer object which can be used for this task.

We simply need to define our model, define the TrainingArguments and instantiate a Trainer.

In [27]:
from transformers import AutoModel, Trainer, TrainingArguments

In [49]:
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = AutoModel.from_pretrained("vinai/bertweet-base")

trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

trainer.train()

TypeError: ignored

Unfortunately, the model did not train successfully: `trainer.train()` raised the `TypeError` shown above and, after much trial and error, I could not figure out the issue.
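
A likely cause, offered as a suggestion rather than a verified fix: `AutoModel` loads the bare BERTweet encoder without a classification head, and its forward pass does not accept the `labels` entry that `TweetsDataset` puts in every batch, which would explain a `TypeError`. The sketch below shows one way the model could be set up instead, using `AutoModelForSequenceClassification` with two labels. The helpers `check_labels` and `build_trainer` are hypothetical names introduced here, and this code was not run as part of the notebook.

```python
def check_labels(labels, num_labels=2):
    """Raise if any label falls outside 0..num_labels-1, as a classification head expects."""
    bad = sorted({label for label in labels if label not in range(num_labels)})
    if bad:
        raise ValueError(f"labels outside 0..{num_labels - 1}: {bad}")


def build_trainer(train_dataset, val_dataset, training_args, num_labels=2):
    """Hypothetical helper: build a Trainer around a model with a classification head."""
    # Imported here so check_labels stays usable without transformers installed.
    from transformers import AutoModelForSequenceClassification, Trainer

    # Unlike AutoModel, this class adds a classification layer and computes a
    # loss when the batch contains "labels".
    model = AutoModelForSequenceClassification.from_pretrained(
        "vinai/bertweet-base", num_labels=num_labels
    )
    return Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )
```

Usage would be `check_labels(train_labels)` followed by `trainer = build_trainer(train_dataset, val_dataset, training_args)` and `trainer.train()`. With the raw sentiment140 values the check raises because of the label 4, so positive Tweets need the class id 1 before training.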