<a href="https://colab.research.google.com/github/jeelfaldu7/transformer_sentiment_analysis/blob/main/notebook_jeel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# October Code Jam

Prepared by Jeel Faldu, Jimmy Koester, and Raphael Lu

In [27]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import torch
from tqdm import tqdm
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

# Hugging Face tools
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

from google.colab import output

import urllib.request

## Introduction

This project explores sentiment analysis using `distilbert-base-uncased-finetuned-sst-2-english` and `twitter-roberta-base-sentiment` from the Hugging Face Transformers library. The goal is to automatically classify text as positive, negative, or neutral using pretrained transformer models.

We analyzed sentiment in social media posts, compared multiple models, and visualized their performance and confidence levels. We also applied the models to a creative dataset, showcasing how NLP can reveal insights from real-world text such as tweets, song lyrics, or news articles.


## Device Loading

In [2]:
output.disable_custom_widget_manager()

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Device Loaded: {device}")

# model_1 = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
# model_2 = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")

distil_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

Device Loaded: cuda



## Data Preprocessing

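Data exploration found 37 fully duplicated rows (identical `text` and `label`) once the `id` column was dropped; the cells shown here do not remove them. In the original dataset, tweets were labeled with the following codes: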
```
'nocode', 'happy', 'not-relevant', 'angry', 'disgust|angry',
'disgust', 'happy|surprise', 'sad', 'surprise', 'happy|sad',
'sad|disgust', 'sad|angry', 'sad|disgust|angry'
```

**Summary of Data Preprocessing:**
* Fields were renamed `id`, `text`, and `label`
* The `id` field was ultimately dropped
* Target labels were re-coded into multi-hot arrays with `MultiLabelBinarizer` (one-hot encoding generalized to several active labels):
     * e.g. with sentiments   
     `['angry' 'disgust' 'happy' 'nocode' 'not-relevant' 'sad' 'surprise']`,   
     an array of `[1 0 0 0 0 1 0]`
     indicates an `angry|sad` tweet

* Data were split using an 80:20 (train:test) ratio and stored in `X_train, X_test, y_train, y_test`
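
As a rough illustration of the label encoding described above, the following dependency-free sketch reproduces the same multi-hot layout. The helper name `multi_hot` is hypothetical; the notebook itself relies on scikit-learn's `MultiLabelBinarizer`, which sorts the classes alphabetically in the same way.

```python
# Classes in sorted order, matching the `mlb.classes_` output in this notebook.
CLASSES = ['angry', 'disgust', 'happy', 'nocode', 'not-relevant', 'sad', 'surprise']

def multi_hot(label, classes=CLASSES, sep='|'):
    """Encode a pipe-separated label string as a 0/1 list (hypothetical helper)."""
    active = set(label.split(sep))
    return [1 if c in active else 0 for c in classes]

angry_sad = multi_hot('sad|angry')
happy = multi_hot('happy')
print(angry_sad)  # [1, 0, 0, 0, 0, 1, 0]
print(happy)      # [0, 0, 1, 0, 0, 0, 0]
```

The order of the codes inside the label string does not matter; only the position of each class in `CLASSES` determines which slot is set.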

In [3]:
# Load dataset
data_url = 'https://raw.githubusercontent.com/jeelfaldu7/transformer_sentiment_analysis/refs/heads/main/dataset.csv'
df = pd.read_csv(data_url, header=None, names=['id', 'text', 'label'], sep=',')

# Display first few rows of the dataset
display(df.head())

|   | id | text | label |
|---|---|---|---|
| 0 | 611857364396965889 | @aandraous @britishmuseum @AndrewsAntonio Merc... | nocode |
| 1 | 614484565059596288 | Dorian Gray with Rainbow Scarf #LoveWins (from... | happy |
| 2 | 614746522043973632 | @SelectShowcase @Tate_StIves ... Replace with ... | happy |
| 3 | 614877582664835073 | @Sofabsports thank you for following me back. ... | happy |
| 4 | 611932373039644672 | @britishmuseum @TudorHistory What a beautiful ... | happy |


In [4]:
# Display the summary of the dataset
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3085 entries, 0 to 3084
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      3085 non-null   int64 
 1   text    3085 non-null   object
 2   label   3085 non-null   object
dtypes: int64(1), object(2)
memory usage: 72.4+ KB


In [5]:
df = df.drop('id', axis = 1)

In [6]:
# Display the shape of all the DataFrame
n_rows, n_cols = df.shape
print(f"The DataFrame has {n_rows} rows and {n_cols} columns")

The DataFrame has 3085 rows and 2 columns


In [7]:
# Count fully duplicated rows and list the unique label values
df.duplicated().sum(), df['label'].unique()

(np.int64(37),
 array(['nocode', 'happy', 'not-relevant', 'angry', 'disgust|angry',
        'disgust', 'happy|surprise', 'sad', 'surprise', 'happy|sad',
        'sad|disgust', 'sad|angry', 'sad|disgust|angry'], dtype=object))

### Train-Test Split

In [8]:
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

X_train = train_df.drop('label', axis = 1)
X_test = test_df.drop('label', axis = 1)

mlb = MultiLabelBinarizer()
y_train = mlb.fit_transform(train_df['label'].str.split('|'))
y_test = mlb.transform(test_df['label'].str.split('|'))

print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(2468, 1) (2468, 7)
(617, 1) (617, 7)


In [9]:
print(mlb.classes_)
print(y_train[92])

['angry' 'disgust' 'happy' 'nocode' 'not-relevant' 'sad' 'surprise']
[0 0 1 0 0 0 0]


### Data Pre-processing for roBERTa model

* Tweets were preprocessed using the method recommended in the model [documentation](https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment).
     * Specifically, usernames and hyperlinks were reduced to `@user` and `http`, respectively
* Tweets were then tokenized and stored in `X2_train_tokens` and `X2_test_tokens`

In [10]:
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
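
To make the effect of this function concrete, here is a small check on a made-up tweet. The function body is repeated only so the snippet runs on its own; the sample text is an assumption and does not come from the dataset.

```python
def preprocess(text):
    # Same logic as the notebook cell above, repeated so this snippet runs alone.
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Made-up example tweet (not taken from the dataset).
sample = "@britishmuseum loved the exhibit! https://t.co/abc123 see you @ 5"
cleaned = preprocess(sample)
print(cleaned)  # @user loved the exhibit! http see you @ 5
```

A lone `@` is left unchanged because of the `len(t) > 1` check. Since the text is split on single spaces only, a mention that directly follows a newline or tab is not replaced.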

In [11]:
X2_train = X_train['text'].apply(preprocess)
X2_test = X_test['text'].apply(preprocess)

X2_train.head()

|   | text |
|---|---|
| 2634 | I'm at @user in London, Greater London http |
| 2373 | My favourite #oilpainting 'Tiger Tiger Burnin... |
| 839 | Currently @user to discuss #DefeatingDepressio... |
| 2857 | @user thanks for the Favourite! |
| 761 | Farron Gorey selfies before his performance at... |


In [12]:
roberta_tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")


In [13]:
X2_train_tokens = roberta_tokenizer(X2_train.tolist(),
                                    padding=True,
                                    truncation=True,
                                    max_length=128,
                                    return_tensors="pt")

X2_test_tokens = roberta_tokenizer(X2_test.tolist(),
                                    padding=True,
                                    truncation=True,
                                    max_length=128,
                                    return_tensors="pt")

In [14]:
base_sentiments = mlb.classes_

label2id = {label: idx for idx, label in enumerate(base_sentiments)}
id2label = {idx: label for idx, label in enumerate(base_sentiments)}

In [15]:
model_2 = AutoModelForSequenceClassification.from_pretrained(
    "cardiffnlp/twitter-roberta-base-sentiment",
    num_labels=len(base_sentiments),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True,
    problem_type="multi_label_classification"
).to(device)



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment and are newly initialized because the shapes did not match:
- classifier.out_proj.weight: found shape torch.Size([3, 768]) in the checkpoint and torch.Size([7, 768]) in the model instantiated
- classifier.out_proj.bias: found shape torch.Size([3]) in the checkpoint and torch.Size([7]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


#### Fine-tuning the `roBERTa` model

In [16]:
train_dataset = Dataset.from_dict({
    'input_ids': X2_train_tokens['input_ids'],
    'attention_mask': X2_train_tokens['attention_mask'],
    'labels': y_train.astype('float32').tolist()
})

test_dataset = Dataset.from_dict({
    'input_ids': X2_test_tokens['input_ids'],
    'attention_mask': X2_test_tokens['attention_mask'],
    'labels': y_test.astype('float32').tolist()
})

In [17]:
output.disable_custom_widget_manager()

training_args = TrainingArguments(output_dir="./sentiment-model",
                                  num_train_epochs=4,
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=16,
                                  learning_rate=5e-5,
                                  weight_decay=0.1,
                                  eval_strategy="epoch",
                                  save_strategy="epoch",
                                  load_best_model_at_end=True,
                                  report_to="none",
                                  seed = 42)

roberta_trainer = Trainer(model=model_2,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=test_dataset,
                  processing_class=roberta_tokenizer)

roberta_trainer.train()

model_2.save_pretrained("./my-sentiment-model")
roberta_tokenizer.save_pretrained("./my-sentiment-model")




| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | No log | 0.166872 |
| 2 | No log | 0.159592 |
| 3 | No log | 0.179289 |
| 4 | 0.123900 | 0.180858 |
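
The validation loss in this log is lowest at epoch 2 and rises afterwards, which suggests the model begins to overfit. With `load_best_model_at_end=True` and no `metric_for_best_model` set, the `Trainer` compares checkpoints on evaluation loss, so the epoch-2 checkpoint should be the one reloaded at the end. A small sketch of that selection, using the logged values (the variable names are illustrative):

```python
# Validation loss per epoch, copied from the training log above.
val_loss = {1: 0.166872, 2: 0.159592, 3: 0.179289, 4: 0.180858}

# Lower loss is better, so pick the epoch with the minimum value.
best_epoch = min(val_loss, key=val_loss.get)
print(best_epoch, val_loss[best_epoch])  # 2 0.159592
```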


('./my-sentiment-model/tokenizer_config.json',
 './my-sentiment-model/special_tokens_map.json',
 './my-sentiment-model/vocab.json',
 './my-sentiment-model/merges.txt',
 './my-sentiment-model/added_tokens.json',
 './my-sentiment-model/tokenizer.json')

In [18]:
def predict_sentiments_debug(text, threshold=0.5):
    inputs = roberta_tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model_2(**inputs)

    logits = outputs.logits
    probs = torch.sigmoid(logits)[0].cpu().numpy()

    print("All probabilities:")
    for idx, prob in enumerate(probs):
        sentiment = id2label[idx]
        print(f"  {sentiment}: {prob:.4f}")

    predictions = {}
    for idx, prob in enumerate(probs):
        sentiment = id2label[idx]
        if prob >= threshold:
            predictions[sentiment] = float(prob)

    return predictions

result = predict_sentiments_debug("I'm so scared of the recent politcs from both parties", threshold=0.3)
print(f"Sentiments: {result}")

All probabilities:
  angry: 0.3081
  disgust: 0.0710
  happy: 0.2112
  nocode: 0.0387
  not-relevant: 0.1411
  sad: 0.1866
  surprise: 0.1938
Sentiments: {'angry': 0.3081361651420593}
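
Because the model was loaded with `problem_type="multi_label_classification"`, each label gets its own sigmoid probability, so the seven values need not sum to 1 (here they sum to about 1.15). The sketch below replays the thresholding step on the probabilities printed above. The `select_labels` helper and its `fallback_to_top` option are hypothetical additions, shown as one way to avoid an empty prediction when no label clears the threshold.

```python
# Probabilities copied from the output above (rounded to four decimals).
probs = {'angry': 0.3081, 'disgust': 0.0710, 'happy': 0.2112, 'nocode': 0.0387,
         'not-relevant': 0.1411, 'sad': 0.1866, 'surprise': 0.1938}

def select_labels(probs, threshold=0.5, fallback_to_top=False):
    """Keep labels at or above `threshold`; optionally fall back to the top label."""
    picked = {k: p for k, p in probs.items() if p >= threshold}
    if not picked and fallback_to_top:
        # Hypothetical fallback: return the single most probable label.
        top = max(probs, key=probs.get)
        picked = {top: probs[top]}
    return picked

at_03 = select_labels(probs, threshold=0.3)
at_05 = select_labels(probs, threshold=0.5)
at_05_fallback = select_labels(probs, threshold=0.5, fallback_to_top=True)
print(at_03)           # {'angry': 0.3081}
print(at_05)           # {}
print(at_05_fallback)  # {'angry': 0.3081}
```

With the function's default threshold of 0.5, this example sentence would receive no label at all, which is why the call above lowers it to 0.3.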


### Applying the pretrained `DistilBERT` model

In [20]:
# Load pretrained DistilBERT model
distil_tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
distil_model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
).to(device)

# Function to predict sentiment
def distilbert_sentiment(text):
    inputs = distil_tokenizer(text, return_tensors="pt", truncation=True, max_length=128).to(device)
    with torch.no_grad():
        logits = distil_model(**inputs).logits
    pred_id = logits.argmax().item()
    return distil_model.config.id2label[pred_id]


In [22]:
# Apply to the dataset
tqdm.pandas()
test_df["distilbert_sentiment"] = test_df["text"].progress_apply(distilbert_sentiment)


100%|██████████| 617/617 [00:02<00:00, 218.99it/s]


In [23]:
test_df[["text", "label", "distilbert_sentiment"]].head(10)

|   | text | label | distilbert_sentiment |
|---|---|---|---|
| 1505 | pindah lagi ehehe (at @nationalgallery) — http... | nocode | NEGATIVE |
| 2399 | @DavidSmithArt @roshvarosha @LisaLooly @newman... | happy | POSITIVE |
| 1814 | Roman gem engraved with Odysseus, his ship, an... | nocode | NEGATIVE |
| 511 | An exciting announcement soon :) @RAMMuseum #e... | happy | POSITIVE |
| 1565 | Excited to have a FLASH Residency at @Studio44... | happy | POSITIVE |
| 790 | The Astronomers of Babylon (explorable website... | nocode | NEGATIVE |
| 354 | John H. Taylor du @britishmuseum "The collecti... | nocode | NEGATIVE |
| 1651 | @NationalGallery Wait, no, is there a secret L... | surprise | NEGATIVE |
| 2859 | @NationalGallery O retrato para mim é a expres... | nocode | NEGATIVE |
| 408 | @MorpethSch students got to try their hand at ... | nocode | NEGATIVE |


In [30]:
accuracy = accuracy_score(test_df["binary_label"], test_df["distilbert_sentiment"])
print(f"DistilBERT Accuracy (binary): {accuracy:.4f}")

DistilBERT Accuracy (binary): 0.3485
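
The `binary_label` column scored above is not created in any cell shown here, so the exact mapping behind the 0.3485 figure is unknown. One possible mapping, offered purely as an assumption, treats a label as `POSITIVE` only when it consists of `happy` alone and as `NEGATIVE` otherwise. The helper name `to_binary` and the `POSITIVE_CODES` set are hypothetical.

```python
# Assumption: only 'happy' counts as positive; every other code is negative.
POSITIVE_CODES = {'happy'}

def to_binary(label, sep='|'):
    """Map a dataset label to 'POSITIVE' or 'NEGATIVE' (hypothetical rule)."""
    codes = set(label.split(sep))
    # POSITIVE only if every code in the label is in POSITIVE_CODES.
    return 'POSITIVE' if codes <= POSITIVE_CODES else 'NEGATIVE'

examples = ['happy', 'nocode', 'happy|sad', 'sad|disgust|angry']
mapped = [to_binary(x) for x in examples]
print(mapped)  # ['POSITIVE', 'NEGATIVE', 'NEGATIVE', 'NEGATIVE']
```

In the notebook this could be applied with `test_df["binary_label"] = test_df["label"].map(to_binary)` before calling `accuracy_score`. A different mapping, for example one that also counts `surprise` as positive, would give a different accuracy.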
