### RoBERTa Sentiment Analysis (Performed in Google Colab)

To complement the rule-based VADER analysis, I applied **transformer-based sentiment classification** using the **`cardiffnlp/twitter-roberta-base-sentiment`** model. This is a RoBERTa-base checkpoint trained on tweets and fine-tuned for three-class sentiment, so it tends to handle informal language (slang, misspellings, emoji) better than lexicon-based methods.

#### 🔧 Implementation Steps (in Google Colab):
1. **Loaded the preprocessed Reddit dataset** (with `Comment_raw` column).
2. Used Hugging Face Transformers to load the RoBERTa model:
   - Model: `cardiffnlp/twitter-roberta-base-sentiment`
   - Tokenizer: `AutoTokenizer.from_pretrained(...)`
3. Ran batched inference on `Comment_raw`, took the argmax of each comment's logits, and mapped the class index to a label:
   - `0` → `Negative`, `1` → `Neutral`, `2` → `Positive`
4. Stored predictions in a new column: `roberta_sentiment`
5. Exported the updated dataset to a CSV: **`reddit_with_roberta_sentiment.csv`**

This file was then loaded into the current notebook for further analysis and visualization.
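
Step 3 above (logits to labels) can be sketched without loading the model. This is a minimal illustration, not the notebook's actual code: `softmax`, `logits_to_label`, and `ID2LABEL` are hypothetical helper names, and the example logits are made up. Beyond the argmax used in this notebook, it also returns the softmax probability of the winning class, which can be useful if low-confidence predictions need to be inspected later.

```python
import math

# Index-to-label mapping used in this notebook (0 = Negative, 1 = Neutral, 2 = Positive).
ID2LABEL = {0: "Negative", 1: "Neutral", 2: "Positive"}

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_to_label(logits):
    """Return (label, confidence) for one row of three logits."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over class indices
    return ID2LABEL[best], probs[best]

# Made-up logits for illustration only, not real model output.
label, confidence = logits_to_label([-1.2, 0.3, 2.1])
print(label, round(confidence, 3))
```

Because softmax preserves the ordering of the logits, the label is the same as the one obtained from a plain argmax; only the confidence value is extra.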


In [None]:
from google.colab import files
uploaded = files.upload()


Saving reddit_preprocessed_comments.csv to reddit_preprocessed_comments.csv


In [None]:
import pandas as pd
df = pd.read_csv('reddit_preprocessed_comments.csv')

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")



In [None]:
from tqdm import tqdm

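def batch_predict_sentiment(texts, tokenizer, model, batch_size=64):
    """Classify texts in batches and return one sentiment label per input text."""
    # Class index -> label for cardiffnlp/twitter-roberta-base-sentiment
    id2label = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}
    # Use the GPU when available (much faster); otherwise fall back to CPU
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    model.eval()  # inference mode: disables dropout
    sentiments = []
    with torch.no_grad():  # gradients are not needed for inference
        for i in tqdm(range(0, len(texts), batch_size)):
            batch_texts = texts[i:i+batch_size]
            # Pad to the longest comment in the batch; truncate at 512 tokens
            encoded_input = tokenizer(batch_texts, return_tensors='pt', padding=True, truncation=True, max_length=512)
            # Move input tensors to the same device as the model
            encoded_input = {k: v.to(device) for k, v in encoded_input.items()}
            output = model(**encoded_input)
            predictions = torch.argmax(output.logits, dim=1).tolist()
            sentiments.extend(id2label[pred] for pred in predictions)
    return sentiments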


In [None]:
df['roberta_sentiment'] = batch_predict_sentiment(
    df['Comment_raw'].tolist(),
    tokenizer,
    model,
    batch_size=64
)


100%|██████████| 488/488 [09:15<00:00,  1.14s/it]
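
The progress bar gives a quick consistency check on the dataset size. The loop `range(0, len(texts), batch_size)` runs `ceil(len(texts) / batch_size)` times, so 488 iterations at `batch_size=64` bound the number of comments. The sketch below works through that arithmetic; `num_batches` and `text_count_bounds` are hypothetical helper names, not part of the notebook.

```python
import math

def num_batches(n_texts, batch_size):
    """Number of iterations of range(0, n_texts, batch_size)."""
    return math.ceil(n_texts / batch_size)

def text_count_bounds(n_batches, batch_size):
    """Smallest and largest text counts that produce exactly n_batches."""
    return (n_batches - 1) * batch_size + 1, n_batches * batch_size

low, high = text_count_bounds(488, 64)
print(low, high)  # 31169 31232
```

So the dataset holds between 31,169 and 31,232 comments, and 9 min 15 s (555 s) over 488 batches matches the reported 1.14 s per batch.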


In [None]:
# Save full DataFrame with new roberta_sentiment column
df.to_csv("reddit_with_roberta_sentiment.csv", index=False)
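
Before exporting, it can help to check how the predicted labels are distributed, since a heavily skewed split may point to a preprocessing or label-mapping problem. The following is a minimal standard-library sketch; `label_shares` is a hypothetical helper and the toy labels are invented, not results from the Reddit data.

```python
from collections import Counter

def label_shares(labels):
    """Return each label's share of the total, as a dict of fractions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels for illustration; not results from the Reddit data.
toy_labels = ["Negative", "Neutral", "Neutral", "Positive"]
shares = label_shares(toy_labels)
print(shares)
```

On the actual DataFrame, the equivalent pandas call would be `df['roberta_sentiment'].value_counts(normalize=True)`.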


In [None]:
from google.colab import files
files.download("reddit_with_roberta_sentiment.csv")

