# **CSCI323 Group Assignment**

This is our exploration of using Transformers instead of a Naive Bayes classifier to perform sentiment analysis on a dataset.

Our dataset has been truncated to exclude "https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset/resolve/main/train_df.csv", because an error in that file causes our model tuning to crash. This approximately halves the size of our dataset. However, as we will see, the transformer model is able to achieve very similar accuracy to our Naive Bayes model while having far less data to train on.

## **Data Preparation**

### **Loading the dataset**

First, we load the data into pandas dataframes and concatenate them into a single dataset. Our dataset is a combination of 2 Hugging Face datasets consisting of online comments, each pre-labelled as **Negative (0)**, **Neutral (1)** or **Positive (2)**.

In [1]:
import pandas as pd

#Load the dataset
#Dataset 1
url1 = [
    # "https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset/resolve/main/train_df.csv",
    "https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset/resolve/main/test_df.csv",
    "https://huggingface.co/datasets/Sp1786/multiclass-sentiment-analysis-dataset/resolve/main/val_df.csv"
]

df_data1 = pd.concat([pd.read_csv(url) for url in url1], ignore_index=True)

#Dataset 2
url2 = [
    "https://huggingface.co/datasets/mteb/tweet_sentiment_extraction/resolve/main/train.jsonl",
    "https://huggingface.co/datasets/mteb/tweet_sentiment_extraction/resolve/main/test.jsonl"
]
df_data2 = pd.concat([pd.read_json(url, lines=True) for url in url2], ignore_index=True)

#Rename df_data2[label_text] to df_data2[sentiment]
df_data2.rename(columns={'label_text': 'sentiment'}, inplace=True)

#Combine both datasets
df_data = pd.concat([df_data1, df_data2], ignore_index=True)

df_data.head()

Unnamed: 0,id,text,label,sentiment
0,9235,getting cds ready for tour,1,neutral
1,16790,"MC, happy mother`s day to your mom ;).. love yah",2,positive
2,24840,A year from now is graduation....i am pretty s...,0,negative
3,20744,because you had chips and sale w/o me,1,neutral
4,6414,Great for organising my work life balance,2,positive


## **Pre-processing**

We use the **Natural Language Toolkit (NLTK)**, a comprehensive library for working with human language data. NLTK provides useful tools for text processing such as tokenization and lemmatization. Here is a quick overview of the NLTK resources and imports used in this portion of the code:
1. The **punkt** tokenizer divides a text into a list of words or sentences. It is needed for tokenization tasks.
2. **stopwords** is a list of common words such as *and*, *is* and *the*, which are often removed from text data to focus on more meaningful words.
3. The **WordNet** database provides a large dictionary of words and their meanings, synonyms and antonyms. It is used for lemmatization, where words are reduced to their base forms.
4. The **Open Multilingual WordNet** package allows access to WordNet in multiple languages. Our dataset is multilingual so this helps us in multilingual text processing.
5. **Words** is a list of English words that can be used to filter or validate tokens to ensure that they are real words.
6. **word_tokenize** is a function that breaks down text into individual words.
7. **WordNetLemmatizer** is a tool that reduces words to their base form, e.g. *cars* to *car*. It treats every token as a noun unless a part of speech is passed, so verb forms such as *running* are only reduced to *run* when tagged as verbs, which our code does not do.
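
To see what the first cleaning steps do to a single comment, here is a minimal sketch that uses only the standard library. It mirrors the lowercasing and punctuation removal applied in the next cell. The helper name `normalize` is a hypothetical name chosen for illustration, and the sample string is the second row shown earlier. The NLTK steps are left out so that the snippet runs without any downloads.

```python
import re
import string

# Character class that matches any single ASCII punctuation character
PUNCT_PATTERN = re.compile(f"[{re.escape(string.punctuation)}]")

def normalize(text):
    """Lowercase a string and strip ASCII punctuation (hypothetical helper).

    Non-string values (for example NaN from missing rows) are returned as-is.
    """
    if not isinstance(text, str):
        return text
    return PUNCT_PATTERN.sub("", text.lower())

sample = "MC, happy mother`s day to your mom ;).. love yah"
cleaned = normalize(sample)
print(cleaned)
```

Note that punctuation removal also strips text emoticons such as `;)`, which can carry sentiment, and leaves a double space where they used to be.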

In [2]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('words')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, words
from nltk.stem import WordNetLemmatizer
import string
import re
!pip install emoji
import emoji

#Data processing
#Convert text to lowercase
df_data['text'] = df_data['text'].str.lower()

#Function to remove punctuation
def remove_punctuation(text):
    if isinstance(text, str):
        return re.sub(f'[{re.escape(string.punctuation)}]', '', text)
    else:
        return text  # or return an empty string: ''

#Apply punctuation removal to the text column
df_data['text'] = df_data['text'].apply(remove_punctuation)

# Function to convert emojis to text
def convert_emojis(text):
    if isinstance(text, str):
        return emoji.demojize(text)
    else:
        return text

# Apply the function to the 'text' column
df_data['text'] = df_data['text'].apply(convert_emojis)

#Function to remove stop words
stop_words = set(stopwords.words('english'))

def remove_stop_words(text):
    if isinstance(text, str):
        tokens = word_tokenize(text)
        filtered_tokens = [word for word in tokens if word not in stop_words]
        return ' '.join(filtered_tokens)
    else:
        return text

#Apply stop word removal to the text column
df_data['text'] = df_data['text'].apply(remove_stop_words)

#Function to tokenize text
def tokenize_text(text):
    if isinstance(text, str):
        return word_tokenize(text)
    else:
        return []

#Apply tokenization to the text column
df_data['tokens'] = df_data['text'].apply(tokenize_text)

#Function to lemmatize tokens
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    if isinstance(tokens, list):
        return [lemmatizer.lemmatize(token) for token in tokens]
    else:
        return tokens

#Apply lemmatization to the tokens column
df_data['lemmatized_tokens'] = df_data['tokens'].apply(lemmatize_tokens)

#Join lemmatized tokens back into strings
df_data['lemmatized_text'] = df_data['lemmatized_tokens'].apply(lambda tokens: ' '.join(tokens))

#Function to remove null values
def remove_null_values(df, column_name):
    df = df.dropna(subset=[column_name])
    return df

#Apply null value removal to the lemmatized text column
df_data = remove_null_values(df_data, 'lemmatized_text')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ssyab\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ssyab\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ssyab\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\ssyab\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\ssyab\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


Collecting emoji
  Downloading emoji-2.12.1-py3-none-any.whl.metadata (5.4 kB)
Downloading emoji-2.12.1-py3-none-any.whl (431 kB)
Installing collected packages: emoji
Successfully installed emoji-2.12.1


In [3]:
#Join the lemmatized tokens back into a single string
df_data['processed_text'] = df_data['lemmatized_tokens'].apply(lambda tokens: ' '.join(tokens))

### **Vectorization and Train-Test Split**

A vectorizer converts our text column into tokens and then converts the tokens into numerical vectors.  

We import **TfidfVectorizer**, which converts text into a matrix of TF-IDF (Term Frequency-Inverse Document Frequency) features. This matrix quantifies the importance of each word in a document relative to the entire corpus: a term's weight grows with how often it appears in a document and shrinks with the number of documents that contain it, which highlights the words most specific to each document.
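
The weighting can be made concrete with a small hand-rolled sketch. It is not the scikit-learn implementation. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to our understanding matches the scikit-learn defaults. The three toy documents and the function names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus (illustrative only, not taken from our dataset)
docs = [
    "great phone great battery",
    "terrible battery",
    "great service",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency (assumed formula, see lead-in)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, *great* receives the largest weight because it occurs twice, while *phone* outweighs *battery* because it appears in fewer documents.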

Next, we perform the train-test split: 70% of the dataset is used for training the model and the remaining 30% for testing. We import the **train_test_split** function from scikit-learn. Evaluating on held-out data gives a more reliable estimate of performance, because a model that overfits the training data will score noticeably worse on the test set.

In [4]:
#Split the dataframe into inputs and expected outputs
x = df_data['processed_text']
y = df_data['label']

from sklearn.model_selection import train_test_split
#Split x and y into training sets and test sets
#Split the dataset into 70% training and 30% testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=4)

from sklearn.feature_extraction.text import TfidfVectorizer
#Initialize TfidfVectorizer (unigrams and bigrams) and fit on the training data only
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
vectorizer.fit(x_train)

#Transform the training and test data
x_train_vectorized = vectorizer.transform(x_train)
x_test_vectorized = vectorizer.transform(x_test)

## **Model Training**

We fit a Multinomial Naive Bayes model on the vectorized training set and evaluate it on the test set.

### **Multinomial**

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

#Train a naive bayes classifier: MultinomialNB
mnb = MultinomialNB()
mnb.fit(x_train_vectorized, y_train)

#Predict on the test set and compute the accuracy
mnb_predicted = mnb.predict(x_test_vectorized)
accuracy_score_mnb = metrics.accuracy_score(y_test, mnb_predicted)

print('MultinomialNB model accuracy is',str('{:04.2f}'.format(accuracy_score_mnb*100))+'%')
print('------------------------------------------------')
print('Confusion Matrix:')
print(pd.DataFrame(confusion_matrix(y_test, mnb_predicted)))
print('------------------------------------------------')
print('Classification Report:')
print(classification_report(y_test, mnb_predicted))

MultinomialNB model accuracy is 69.51%
------------------------------------------------
Confusion Matrix:
      0     1     2
0  1709  1708   163
1   205  4291   446
2    47  1220  2639
------------------------------------------------
Classification Report:
              precision    recall  f1-score   support

           0       0.87      0.48      0.62      3580
           1       0.59      0.87      0.71      4942
           2       0.81      0.68      0.74      3906

    accuracy                           0.70     12428
   macro avg       0.76      0.67      0.69     12428
weighted avg       0.74      0.70      0.69     12428
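
The figures in the classification report can be reproduced directly from the confusion matrix above, where rows are true labels and columns are predicted labels. The following sketch recomputes accuracy, precision and recall by hand. The variable names are our own.

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = [
    [1709, 1708, 163],   # true Negative (0)
    [205, 4291, 446],    # true Neutral (1)
    [47, 1220, 2639],    # true Positive (2)
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples
accuracy = sum(cm[k][k] for k in range(n_classes)) / total

precision = []
recall = []
for k in range(n_classes):
    true_positives = cm[k][k]
    predicted_as_k = sum(row[k] for row in cm)   # column sum
    actually_k = sum(cm[k])                      # row sum (the "support")
    precision.append(true_positives / predicted_as_k)
    recall.append(true_positives / actually_k)

print(f"accuracy: {accuracy:.4f}")
for k in range(n_classes):
    print(f"class {k}: precision={precision[k]:.2f} recall={recall[k]:.2f}")
```

The low recall for class 0 comes from the first row: 1708 of the 3580 negative comments are predicted as neutral.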



## **Hyperparameter Tuning**

Here we tune the model, using the same hyperparameter grid that we used on our larger dataset for consistency. We then evaluate the tuned model by calculating its accuracy and ROC AUC score as a basis for comparison with our Transformer model.
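
The only hyperparameter in the grid is `alpha`, the additive (Laplace/Lidstone) smoothing term of Multinomial Naive Bayes. As a rough illustration of what it controls, the sketch below applies the textbook smoothed estimate `(count + alpha) / (total + alpha * vocab_size)` to made-up counts. The counts, the vocabulary size and the function name are assumptions for illustration only.

```python
def smoothed_likelihood(count, total, vocab_size, alpha):
    """Additively smoothed estimate of P(word | class)."""
    return (count + alpha) / (total + alpha * vocab_size)

# Hypothetical class statistics: 1000 word occurrences, 5000-word vocabulary
TOTAL = 1000
VOCAB_SIZE = 5000
alphas = [0.1, 0.5, 1.0, 5.0, 10.0]   # same grid as in the cell below

unseen = [smoothed_likelihood(0, TOTAL, VOCAB_SIZE, a) for a in alphas]
seen = [smoothed_likelihood(50, TOTAL, VOCAB_SIZE, a) for a in alphas]

for a, u, s in zip(alphas, unseen, seen):
    print(f"alpha={a:>4}: unseen word={u:.6f}  word seen 50 times={s:.6f}")
```

Larger values of `alpha` give more probability to words never seen in a class and flatten the distribution over seen words, so the grid search is effectively choosing how much to trust rare n-grams.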

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score

#Define the hyperparameter grid
param_grid = {
    'alpha': [0.1, 0.5, 1.0, 5.0, 10.0]
}

#Perform GridSearchCV on multinomial model
grid_search = GridSearchCV(mnb, param_grid, cv=5, scoring='accuracy')
grid_search.fit(x_train_vectorized, y_train)

#Print the best parameters and best score
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Cross-Validation Score: {grid_search.best_score_}")

#Get the best model and hyperparameters
best_clf = grid_search.best_estimator_
print(f'Best Hyperparameters: {grid_search.best_params_}')

#Make predictions on the test set with the best model
predictionsMNB = best_clf.predict(x_test_vectorized)

#GridSearchCV (refit=True by default) has already retrained the best estimator on the full training set
MNB_best_model = best_clf

#Calculate the accuracy
accuracyMNB = accuracy_score(y_test, predictionsMNB)
print(f'Accuracy for multinomial model: {accuracyMNB * 100}%')

#Calculate ROC AUC Score
MNB_prob = best_clf.predict_proba(x_test_vectorized)
MNB_roc_auc = roc_auc_score(y_test, MNB_prob, multi_class='ovr')
print(f'ROC AUC Score: {MNB_roc_auc:.4f}')

Best Parameters: {'alpha': 0.5}
Best Cross-Validation Score: 0.6841853530037877
Best Hyperparameters: {'alpha': 0.5}
Accuracy for multinomial model: 70.95268747988413%
ROC AUC Score: 0.8730
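
The ROC AUC score above is computed one-vs-rest (`multi_class='ovr'`): each class in turn is treated as the positive class, a binary AUC is computed from that class's predicted probability, and the per-class values are averaged. The following is a minimal sketch of that idea on toy data. It assumes an unweighted (macro) average and uses a simple pairwise-ranking definition of AUC. The labels, probabilities and function names are made up for illustration.

```python
def binary_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ovr_macro_auc(labels, probs, n_classes):
    """Average the binary AUC of each class against all other classes."""
    aucs = []
    for k in range(n_classes):
        y_bin = [1 if y == k else 0 for y in labels]
        aucs.append(binary_auc(y_bin, [p[k] for p in probs]))
    return sum(aucs) / n_classes

# Toy example: 6 samples, 3 classes (illustrative values only)
labels = [0, 0, 1, 1, 2, 2]
probs = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.5, 0.3, 0.2],
    [0.1, 0.2, 0.7],
    [0.2, 0.3, 0.5],
]

score = ovr_macro_auc(labels, probs, n_classes=3)
print(f"OvR macro AUC: {score:.4f}")
```

Because AUC only depends on how samples are ranked, it can be high even when accuracy is modest, which is why we report both.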


In [None]:
#Set roc_auc to best model
roc_auc = MNB_roc_auc

#Only one model was tuned here, so MNB is the best model by default
best_model = "MNB"

#Set predictions to best model
predictions = globals()[f"predictions{best_model}"]

#Set accuracy to best model
accuracy = globals()[f"accuracy{best_model}"]

print(f'Best Model: {best_model}')
print(f'ROC AUC Score: {roc_auc:.4f}')
print(f'Accuracy: {accuracy:.4f}')

Best Model: MNB
ROC AUC Score: 0.8730
Accuracy: 0.7095


## **Transformer Exploration**

In [14]:
import joblib

# Save the model to a file
joblib.dump(MNB_best_model, 'naive_bayes_model.pkl')
joblib.dump(vectorizer, 'vectorizer.pkl')

print("Model saved successfully!")

Model saved successfully!


In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from sklearn.metrics import accuracy_score, roc_auc_score
from scipy.special import softmax
import torch
import numpy as np

In [None]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

In [None]:
# Check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Tokenize the data
# RoBERTa reserves 2 of its 514 position embeddings, so the longest usable sequence is 512 tokens
train_encodings = tokenizer(x_train.tolist(), truncation=True, padding='max_length', max_length=512, return_tensors='pt')
test_encodings = tokenizer(x_test.tolist(), truncation=True, padding='max_length', max_length=512, return_tensors='pt')

# Convert to torch Dataset
class SentimentDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}  # Adjusted tensor creation
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = SentimentDataset(train_encodings, y_train.tolist())
test_dataset = SentimentDataset(test_encodings, y_test.tolist())

torch.autograd.set_detect_anomaly(True)

Using device: cuda


<torch.autograd.anomaly_mode.set_detect_anomaly at 0x7d1a3ea17550>

In [None]:
# Function to calculate ROC AUC score and accuracy
def compute_metrics(model, dataset):
    trainer = Trainer(model=model)
    predictions = trainer.predict(dataset)

    # Apply softmax to convert logits to probabilities
    probs = torch.nn.functional.softmax(torch.tensor(predictions.predictions), dim=1).numpy()

    # Compute predicted classes
    preds = np.argmax(probs, axis=1)

    # Compute ROC AUC score for multiclass classification
    roc_auc = roc_auc_score(dataset.labels, probs, multi_class='ovr')

    # Compute accuracy
    accuracy = accuracy_score(dataset.labels, preds)

    return roc_auc, accuracy
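
The two conversion steps inside `compute_metrics` (logits to probabilities, then probabilities to a predicted class) can be illustrated without PyTorch. This is a plain-Python sketch with made-up logits, not the code used in the notebook.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one comment: [Negative, Neutral, Positive]
logits = [-1.2, 0.3, 2.1]
probs = softmax(logits)

# Predicted class is the index of the largest probability (argmax)
pred = max(range(len(probs)), key=probs.__getitem__)

print([round(p, 3) for p in probs], "-> predicted class", pred)
```

The probabilities feed the ROC AUC score, while the argmax feeds the accuracy. Softmax preserves the order of the logits, so the predicted class is the same either way.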

In [None]:
# Calculate ROC AUC score and accuracy before fine-tuning
roc_auc_before, accuracy_before = compute_metrics(model, test_dataset)
print(f"ROC AUC Score before fine-tuning: {roc_auc_before}")
print(f"Accuracy before fine-tuning: {accuracy_before}")

ROC AUC Score before fine-tuning: 0.848275680533778
Accuracy before fine-tuning: 0.6901351786289025


In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    warmup_steps=50,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    gradient_accumulation_steps=2,
    fp16=True,
    evaluation_strategy="steps",
    eval_steps=500,
    max_grad_norm=1.0
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)



In [None]:
import os

# Fine-tune the model
trainer.train()

Step,Training Loss,Validation Loss
500,0.4716,0.593001
1000,0.3318,0.598337


TrainOutput(global_step=1359, training_loss=0.4668948587202868, metrics={'train_runtime': 3992.0374, 'train_samples_per_second': 21.792, 'train_steps_per_second': 0.34, 'total_flos': 2.2960210061348784e+16, 'train_loss': 0.4668948587202868, 'epoch': 2.9966923925027564})
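
The logged `global_step=1359` follows from the training arguments. The sketch below reproduces the arithmetic under a few assumptions that are not stated in the notebook: a single GPU, a training set size inferred from the 12,428 test rows and the 70/30 split, and a Trainer that floors the number of batches per epoch by the gradient accumulation factor.

```python
import math

n_test = 12428                        # test rows, from the classification report
n_train_est = round(n_test * 7 / 3)   # assumption: inferred from the 70/30 split
per_device_batch = 32                 # per_device_train_batch_size
grad_accum = 2                        # gradient_accumulation_steps
epochs = 3                            # num_train_epochs

# Samples contributing to each optimizer update (single device assumed)
effective_batch = per_device_batch * grad_accum

# Mini-batches per epoch, then optimizer updates per epoch
batches_per_epoch = math.ceil(n_train_est / per_device_batch)
updates_per_epoch = batches_per_epoch // grad_accum
total_updates = updates_per_epoch * epochs

# Fraction of epochs actually covered by those updates
epochs_covered = total_updates * grad_accum / batches_per_epoch

print(f"effective batch size: {effective_batch}")
print(f"optimizer updates:    {total_updates}")
print(f"epochs covered:       {epochs_covered:.4f}")
```

Under these assumptions the result agrees with the logged step count and with the reported `epoch` value of about 2.9967. With `eval_steps=500`, this also explains why only two evaluation rows (steps 500 and 1000) appear in the table.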

In [None]:
import torch

# Save only the model weights
torch.save(model.state_dict(), 'model_weights_3_epoch.pth')

# To load the weights later
# model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# model.load_state_dict(torch.load('model_weights_3_epoch.pth'))

In [None]:
# Calculate ROC AUC score and accuracy after fine-tuning
roc_auc_after, accuracy_after = compute_metrics(model, test_dataset)
print(f"ROC AUC Score after fine-tuning: {roc_auc_after:.4f}")
print(f"Accuracy after fine-tuning: {accuracy_after:.4f}")

ROC AUC Score after fine-tuning: 0.9242
Accuracy after fine-tuning: 0.7988
