# Part 1: **Sentiment Analysis using BERT on the Twitter US Airline Sentiment dataset**

# **DistilBERT-base**

The dataset I've chosen for sentiment analysis with a BERT model is the **Twitter US Airline Sentiment dataset**. It is a popular dataset in natural language processing, used to analyze customer sentiment towards 6 major US airlines. It contains about 14,600 tweets, each categorized by human annotators as Positive, Neutral, or Negative.

This dataset is popular because it reflects real-world sentiment expressed by actual customers of US airlines, providing a diverse and realistic range of opinions. Its size also makes it practical for training and evaluating machine learning models for sentiment analysis. Note that the sentiment categories are not evenly distributed: negative tweets make up roughly 63% of the data, so accuracy should be read against that baseline. People have used this dataset to evaluate many different types of models, from traditional machine learning algorithms to neural networks like BERT. Given that BERT has performed well on a wide range of NLP tasks, it's a natural fit to apply it here and see how well it performs on this specific task.
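
Before training, it helps to check how the three labels are actually distributed, because the share of the largest class is the accuracy a model gets by always predicting that class. Below is a minimal sketch, assuming the labels are available as a plain list of strings. The helper name `class_shares` and the toy labels are made up for illustration and are not part of the dataset.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's share of the total, largest class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}

# Toy labels for illustration only, not the real dataset
toy_labels = ["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1
shares = class_shares(toy_labels)

# The first entry is the majority class and its share
majority_label, majority_share = next(iter(shares.items()))
print(shares)
print(f"Always predicting '{majority_label}' scores {majority_share:.0%} accuracy")
```

Once the CSV is loaded into a DataFrame, `df['airline_sentiment'].value_counts(normalize=True)` gives the same information in one call.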

Dataset source: https://huggingface.co/datasets/osanseviero/twitter-airline-sentiment/viewer/osanseviero--twitter-airline-sentiment/train

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import torch
# !pip install transformers
import transformers
from sklearn.model_selection import train_test_split
from transformers import get_linear_schedule_with_warmup
# transformers' own AdamW is deprecated; the PyTorch implementation takes the same arguments used below
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import LabelEncoder

# Load data
df = pd.read_csv('Tweets.csv')


In [None]:
df.head()

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


# **Data Preprocessing**

In [None]:
import pandas as pd
import re
import nltk
from string import punctuation
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the NLTK resources used below (tokenizer, stop word list, lemmatizer data)
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Load the dataset
data = pd.read_csv("Tweets.csv")

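# Define preprocessing functions
def remove_usernames(text):
    # Twitter handles may contain underscores, so include '_' in the character class
    return re.sub(r'@[A-Za-z0-9_]+', '', text)

def remove_urls(text):
    # Drop anything that starts with 'http' up to the next whitespace
    return re.sub(r'http\S+', '', text)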

def preprocess_tweet_text(tweet):
    # Convert to lowercase
    tweet = tweet.lower()
    
    # Remove URLs
    tweet = remove_urls(tweet)
    
    # Tokenize the tweet
    tokens = word_tokenize(tweet)
    
    # Remove stop words and punctuation
    stop_words = set(stopwords.words('english') + list(punctuation))
    tokens = [token for token in tokens if token not in stop_words]
    
    # Lemmatize the tokens
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    # Join the tokens into a string
    tweet = ' '.join(tokens)
    
    return tweet

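# Apply preprocessing to the 'text' column
data['text'] = data['text'].apply(remove_usernames)
data['text'] = data['text'].apply(preprocess_tweet_text)

# The split below reads from `df`, so copy the cleaned text over;
# without this line the model is trained on the raw tweets
df['text'] = data['text']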

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
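
To see what the regex-based cleaning does to a single tweet, here is a small standalone sketch. The function names `strip_mentions` and `strip_urls` and the sample tweet are hypothetical, chosen so they do not clash with the functions defined in the cell above.

```python
import re

def strip_mentions(text):
    # \w covers letters, digits and underscores, so handles like @some_user are removed whole
    return re.sub(r'@\w+', '', text)

def strip_urls(text):
    # Drop anything that starts with 'http' up to the next whitespace
    return re.sub(r'http\S+', '', text)

sample = "@VirginAmerica thanks! Details at http://example.com/x @some_user"
cleaned = strip_urls(strip_mentions(sample))
cleaned = " ".join(cleaned.split())  # collapse the whitespace left behind
print(cleaned)

# A character class without the underscore leaves part of the handle behind
leftover = re.sub(r'@[A-Za-z0-9]+', '', "@some_user")
print(leftover)
```

The second print shows why the underscore matters: only `@some` is matched, and `_user` stays in the text.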


In [None]:
from sklearn.model_selection import train_test_split
from transformers import DistilBertTokenizer
from torch.utils.data import TensorDataset

# Split data into train, validation, and test sets
train_text, temp_text, train_labels, temp_labels = train_test_split(df['text'], df['airline_sentiment'], 
                                                                    random_state=2022, 
                                                                    test_size=0.2, 
                                                                    stratify=df['airline_sentiment'])
val_text, test_text, val_labels, test_labels = train_test_split(temp_text, temp_labels, 
                                                                random_state=2022, 
                                                                test_size=0.5, 
                                                                stratify=temp_labels)

# Load pre-trained DistilBERT tokenizer and encode text
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased', do_lower_case=True)

train_encodings = tokenizer(train_text.tolist(), truncation=True, padding=True)
val_encodings = tokenizer(val_text.tolist(), truncation=True, padding=True)
test_encodings = tokenizer(test_text.tolist(), truncation=True, padding=True)

# Encode the string labels as integers.
# LabelEncoder sorts the class names, giving negative=0, neutral=1, positive=2.
label_encoder = LabelEncoder()
train_labels = label_encoder.fit_transform(train_labels)
val_labels = label_encoder.transform(val_labels)
test_labels = label_encoder.transform(test_labels)

train_dataset = TensorDataset(torch.tensor(train_encodings['input_ids']),
                              torch.tensor(train_encodings['attention_mask']),
                              torch.tensor(train_labels))

val_dataset = TensorDataset(torch.tensor(val_encodings['input_ids']),
                            torch.tensor(val_encodings['attention_mask']),
                            torch.tensor(val_labels))

test_dataset = TensorDataset(torch.tensor(test_encodings['input_ids']),
                             torch.tensor(test_encodings['attention_mask']),
                             torch.tensor(test_labels))
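
`LabelEncoder` assigns integer ids to the sorted unique class names, so the three sentiments end up as negative → 0, neutral → 1, positive → 2. The sketch below mimics that behavior in plain Python to make the mapping explicit. `fit_label_map` and `encode` are hypothetical helper names, not scikit-learn functions.

```python
def fit_label_map(labels):
    """Map each distinct label to an integer id, in sorted order."""
    return {label: idx for idx, label in enumerate(sorted(set(labels)))}

def encode(labels, label_map):
    """Turn string labels into integer ids; an unseen label raises KeyError."""
    return [label_map[label] for label in labels]

train = ["neutral", "positive", "negative", "negative"]
label_map = fit_label_map(train)
encoded = encode(["positive", "negative"], label_map)
print(label_map)
print(encoded)
```

Fitting the mapping on the training labels only and reusing it for validation and test, as the cell above does, keeps the ids consistent across all three splits.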


In [None]:
# Tensorizing the data using data loaders 
def get_data_loaders(train_inputs, train_labels, val_inputs, val_labels, batch_size):
    # Convert data to PyTorch tensors
    train_inputs = torch.tensor(train_inputs)
    train_labels = torch.tensor(train_labels)
    val_inputs = torch.tensor(val_inputs)
    val_labels = torch.tensor(val_labels)
    
    # Create TensorDataset objects
    train_data = TensorDataset(train_inputs, train_labels)
    val_data = TensorDataset(val_inputs, val_labels)
    
    # Create DataLoader objects
    train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
    val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=False)
    
    return train_dataloader, val_dataloader

In [None]:
from transformers import DistilBertForSequenceClassification

# Define data loaders
train_dataloader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=16, shuffle=False)

# Load pre-trained DistilBERT model
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)

# Set device to GPU (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# Move model to the device
model = model.to(device)

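# Define optimizer and learning rate scheduler
epochs = 5
optimizer = AdamW(model.parameters(), lr=1e-5, eps=1e-8)
# scheduler.step() runs once per batch, so the total is batches per epoch times epochs
# (a total of 5 would drive the learning rate to zero as soon as warmup ends)
num_training_steps = len(train_dataloader) * epochs
num_warmup_steps = int(len(train_dataloader) * 0.1)
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps)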

# Define cross-entropy loss function
loss_fn = torch.nn.CrossEntropyLoss()

# Define early_stop
early_stop = 3
best_val_loss = float('inf')
best_epoch = 0
for epoch in range(epochs):
    # Training
    model.train()
    train_loss = 0
    train_acc = 0
    for batch in train_dataloader:
        batch = tuple(t.to(device) for t in batch)
        b_input_ids, b_input_mask, b_labels = batch
        optimizer.zero_grad()
        outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
        loss = outputs[0]
        train_loss += loss.item()
        loss.backward()
        optimizer.step()
        scheduler.step()
        train_acc += (outputs[1].detach().cpu().numpy().argmax(axis=1) == b_labels.cpu().numpy()).mean()
    train_loss /= len(train_dataloader)
    train_acc /= len(train_dataloader)

    # Evaluation
    model.eval()
    val_loss = 0
    val_acc = 0
    with torch.no_grad():
        for batch in val_dataloader:
            batch = tuple(t.to(device) for t in batch)
            b_input_ids, b_input_mask, b_labels = batch
            outputs = model(b_input_ids, attention_mask=b_input_mask, labels=b_labels)
            loss = outputs[0]
            val_loss += loss.item()
            val_acc += (outputs[1].detach().cpu().numpy().argmax(axis=1) == b_labels.cpu().numpy()).mean()
    val_loss /= len(val_dataloader)
    val_acc /= len(val_dataloader)

    print("Epoch {} - train loss: {:.3f} - train acc: {:.3f} - val loss: {:.3f} - val acc: {:.3f}".format(epoch, train_loss, train_acc, val_loss, val_acc))

    # Save the model
    if val_loss < best_val_loss:
        torch.save(model.state_dict(), 'distilbert_sentiment_model.pt')
        best_val_loss = val_loss
        best_epoch = epoch
        print("The model has been saved")

    # Stop training if the validation loss stops improving after certain epochs
    if epoch - best_epoch >= early_stop:
        print("Validation loss has not improved in {} epochs, stopping training".format(early_stop))
        break

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classi

cuda
Epoch 0 - train loss: 0.879 - train acc: 0.616 - val loss: 0.862 - val acc: 0.627
The model has been saved
Epoch 1 - train loss: 0.865 - train acc: 0.627 - val loss: 0.862 - val acc: 0.627
Epoch 2 - train loss: 0.866 - train acc: 0.627 - val loss: 0.862 - val acc: 0.627
Epoch 3 - train loss: 0.865 - train acc: 0.627 - val loss: 0.862 - val acc: 0.627
Validation loss has not improved in 3 epochs, stopping training
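
The linear warmup schedule scales the base learning rate by a multiplier that rises from 0 to 1 over the warmup steps and then falls linearly to 0 at `num_training_steps`. Because `scheduler.step()` is called once per batch, that total has to count batches, not epochs. The function below is a plain-Python re-implementation of the multiplier as I understand it from the transformers documentation, not the library code itself. The figure of 732 batches per epoch is an assumption (roughly 11,700 training tweets at batch size 16).

```python
def linear_warmup_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier for linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

batches_per_epoch = 732                 # assumed: ~11,700 training tweets / batch size 16
epochs = 5
warmup = int(batches_per_epoch * 0.1)   # 73 steps
steps_to_check = (warmup, 500, 2000)

# Total given as the number of epochs: the multiplier is 0 as soon as warmup ends
too_short = [linear_warmup_multiplier(s, warmup, epochs) for s in steps_to_check]

# Total given as the number of optimizer steps: the rate decays gradually over the run
full_run = [linear_warmup_multiplier(s, warmup, batches_per_epoch * epochs)
            for s in steps_to_check]

print(too_short)
print(full_run)
```

With the short total, every update after warmup is multiplied by zero, so the weights stop changing and the metrics stay flat from one epoch to the next.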



**Outputs with different batch sizes**

**With a batch size of 8**, each epoch takes **~2 minutes** to run.

Results for each epoch:

Epoch 0 - train loss: 0.833 - train acc: 0.649 - val loss: 0.812 - val acc: 0.663

Epoch 1 - train loss: 0.816 - train acc: 0.666 - val loss: 0.812 - val acc: 0.663

Epoch 2 - train loss: 0.817 - train acc: 0.667 - val loss: 0.812 - val acc: 0.663

Epoch 3 - train loss: 0.818 - train acc: 0.666 - val loss: 0.812 - val acc: 0.663

Validation loss has not improved in 3 epochs, stopping training

**With a batch size of 16**, each epoch takes **< 90 seconds** to train.

Results for each epoch:

Epoch 0 - train loss: 0.879 - train acc: 0.616 - val loss: 0.862 - val acc: 0.627

Epoch 1 - train loss: 0.865 - train acc: 0.627 - val loss: 0.862 - val acc: 0.627

Epoch 2 - train loss: 0.866 - train acc: 0.627 - val loss: 0.862 - val acc: 0.627

Epoch 3 - train loss: 0.865 - train acc: 0.627 - val loss: 0.862 - val acc: 0.627

Validation loss has not improved in 3 epochs, stopping training
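
The stopping rule used in the training loop can be traced by hand: the best epoch is only updated on a strict improvement, and training stops once `early_stop` epochs have passed since then. Here is a small sketch that replays that rule on a list of validation losses. `stopping_epoch` is a hypothetical helper, and the loss values are the rounded ones from the logs above.

```python
def stopping_epoch(val_losses, patience):
    """Return (best_epoch, last_epoch_run) under the patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:              # strict improvement only
            best_loss, best_epoch = loss, epoch
        if epoch - best_epoch >= patience:
            return best_epoch, epoch      # stop early
    return best_epoch, len(val_losses) - 1

flat = stopping_epoch([0.862, 0.862, 0.862, 0.862, 0.862], patience=3)
improving = stopping_epoch([0.9, 0.8, 0.7, 0.6, 0.5], patience=3)
print(flat)
print(improving)
```

With a flat validation loss, the best epoch remains epoch 0 and the loop exits at epoch 3, which matches the logs. A loss that keeps improving runs through all five epochs.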


**Observations for the (base) DistilBERT model:**

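The run with a batch size of 16 stops after 4 epochs (epochs 0 to 3) with a training accuracy of 62.7%, a validation accuracy of 62.7%, and a validation loss of 0.862. The validation loss does not improve after the first epoch, and the training accuracy stays at about 62.7% from then on. Since roughly 63% of the tweets are labeled negative, this is the score a model would get by predicting the negative class for every tweet. A plateau this flat suggests that the model has stopped learning altogether. That points more towards the training setup, for example a learning rate that drops to zero shortly after warmup, than towards the model having reached a saturation point on the data.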