# 04 - Sentiment Analysis with RoBERTa

**Author:** Lucas Little  
**Course:** CSCA 5522: Data Mining Project  
**University:** University of Colorado - Boulder

This notebook implements sentiment analysis using a RoBERTa model fine-tuned on Twitter data to analyze samples of cryptocurrency-related tweets.

## Objectives
1. Load cleaned tweet data samples
2. Implement RoBERTa-based sentiment analysis on each sample
3. Aggregate sentiment scores into 15-minute windows for each sample
4. Save sentiment-enriched datasets for feature engineering
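
As a rough illustration of objective 3, the sketch below aggregates a hypothetical four-tweet frame into 15-minute windows. The timestamps, scores, and the `toy` and `windows` names are invented for the example. Only the column names mirror the ones this notebook uses.

```python
import pandas as pd

# Hypothetical toy data: four tweets with already-computed polarity scores.
toy = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2021-01-01 00:01",
                "2021-01-01 00:09",
                "2021-01-01 00:20",
                "2021-01-01 00:50",
            ]
        ),
        "sentiment": [0.8, -0.4, 0.5, -0.9],
        "like_count": [3, 0, 1, 7],
    }
).set_index("timestamp")

# One row per 15-minute window, including windows that contain no tweets.
windows = toy.resample("15min").agg(
    {"sentiment": ["mean", "count"], "like_count": "sum"}
)

# Flatten the two-level column index, e.g. ("sentiment", "mean") -> "sentiment_mean".
windows.columns = ["_".join(col) for col in windows.columns]

# Momentum: change in mean sentiment relative to the previous window.
windows["sentiment_momentum"] = windows["sentiment_mean"].diff().fillna(0)

print(windows)
```

The 00:30 window contains no tweets, so its count is 0 and its mean is NaN. Any downstream step that consumes the mean has to decide how to treat such windows.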

### Validation Note
To validate the data processing pipeline and the overall thesis quickly, this notebook runs a smaller, faster model (`cardiffnlp/twitter-roberta-base-sentiment-latest`) on one-week samples of the data. This allows rapid iteration and verification before committing to the full, computationally expensive analysis with the FinBERT model on the entire dataset.

In [7]:
# Core imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
from pathlib import Path
from tqdm import tqdm
import os

# Sentiment analysis imports
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")
np.random.seed(42)

print("Environment setup complete!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")

Environment setup complete!
PyTorch version: 2.7.1
CUDA available: False
MPS available: True


## 1. Initialize RoBERTa-based Sentiment Model

In [8]:
def initialize_sentiment_model():
    """
    Initialize a RoBERTa model fine-tuned for Twitter sentiment.

    Returns a (pipeline, model_type) tuple, or (None, None) if loading fails.
    """
    try:
        print("🤖 Initializing Twitter sentiment model...")
        model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSequenceClassification.from_pretrained(model_name)
        
        # Device priority: CUDA GPU 0, then Apple MPS, otherwise CPU (-1)
        device = -1
        if torch.cuda.is_available():
            device = 0
        elif torch.backends.mps.is_available():
            device = 'mps'
            
        sentiment_pipeline = pipeline(
            "sentiment-analysis",
            model=model,
            tokenizer=tokenizer,
            device=device
        )
        
        print(f"✅ {model_name} model loaded successfully!")
        return sentiment_pipeline, "roberta_twitter"
        
    except Exception as e:
        print(f"⚠️ Could not load the model: {e}")
        return None, None

sentiment_pipeline, model_type = initialize_sentiment_model()

🤖 Initializing Twitter sentiment model...


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps


✅ cardiffnlp/twitter-roberta-base-sentiment-latest model loaded successfully!
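
The device fallback in `initialize_sentiment_model` can be isolated as a small pure function, which makes the priority order easy to check without any hardware. This is only a sketch: `pick_device` is a hypothetical helper name, not part of the notebook or of `transformers`.

```python
def pick_device(cuda_available, mps_available):
    """Return a pipeline `device` value: GPU index 0, 'mps', or -1 for CPU."""
    if cuda_available:
        return 0
    if mps_available:
        return "mps"
    return -1


# Matches the environment reported above: no CUDA, MPS present.
print(pick_device(False, True))  # prints "mps"
```

In the notebook itself the two flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.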


## 2. Process Sampled Data
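
The cell below converts each pipeline prediction into a single signed polarity. The mapping can be sketched on its own as follows, assuming predictions are dicts with `label` and `score` keys and that labels contain the words "positive" or "negative" (the same assumption the cell makes). `to_polarity` and the example predictions are hypothetical.

```python
def to_polarity(prediction):
    """Map one {'label', 'score'} prediction to a signed score in [-1, 1]."""
    label = prediction["label"].lower()
    score = prediction["score"]
    if "positive" in label:
        return score
    if "negative" in label:
        return -score
    return 0.0  # neutral or unrecognized labels


# Hypothetical predictions shaped like text-classification pipeline output.
examples = [
    {"label": "positive", "score": 0.91},
    {"label": "negative", "score": 0.75},
    {"label": "neutral", "score": 0.60},
]
polarities = [to_polarity(p) for p in examples]
print(polarities)  # [0.91, -0.75, 0.0]
```

Note that the magnitude is the model's confidence in its top label, not a measure of how strongly worded the tweet is, and that every neutral prediction collapses to 0.0 regardless of confidence.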

In [9]:
def get_sentiment_scores(texts, pipeline_model, batch_size=64):
    """
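    Analyze sentiment for a list of texts in batches.

    Returns one signed polarity per text: +score for positive,
    -score for negative, and 0.0 for neutral.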
    """
    if pipeline_model is None:
        # No model available: return neutral scores in the same shape as the normal path
        return [0.0] * len(texts)
    
    results = []
    for i in tqdm(range(0, len(texts), batch_size), desc="Analyzing sentiment"):
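        # Cap each text at 512 characters to bound input size
        batch = [str(t)[:512] for t in texts[i:i+batch_size]]
        try:
            # truncation=True also caps the tokenized length at the model maximum
            batch_results = pipeline_model(batch, truncation=True)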
            for res in batch_results:
                label = res['label'].lower()
                score = res['score']
                if 'positive' in label:
                    polarity = score
                elif 'negative' in label:
                    polarity = -score
                else:
                    polarity = 0.0
                results.append(polarity)
        except Exception as e:
            print(f"Error processing batch: {e}")
            results.extend([0.0] * len(batch))
            
    return results

data_dir = Path('../data')
processed_data_dir = data_dir / 'processed'
sampled_dir = processed_data_dir / 'sampled'

if model_type:
    for i in range(1, 6):
        print(f"\n--- Processing Sample {i} ---")
        tweet_sample_path = sampled_dir / f'tweets_sample_{i}.csv'
        
        if not tweet_sample_path.exists():
            print(f"⚠️ Sample {i} not found. Skipping.")
            continue
            
        tweets_df = pd.read_csv(tweet_sample_path)
        tweets_df['timestamp'] = pd.to_datetime(tweets_df['timestamp'])
        
        print(f"🔍 Analyzing sentiment for {len(tweets_df):,} tweets in sample {i} using {model_type}...")
        sentiments = get_sentiment_scores(tweets_df['text'].tolist(), sentiment_pipeline)
        tweets_df['sentiment'] = sentiments
        print("✅ Sentiment analysis complete!")
        
        print("📅 Aggregating sentiment by 15-minute windows...")
        tweets_df.set_index('timestamp', inplace=True)
        
        agg_config = {
            'sentiment': ['mean', 'var', 'count'],
            'retweet_count': 'sum',
            'like_count': 'sum'
        }
        
        # '15min' replaces the deprecated '15T' frequency alias
        sentiment_15min = tweets_df.resample('15min').agg(agg_config)
        sentiment_15min.columns = ['_'.join(col).strip() for col in sentiment_15min.columns]
        sentiment_15min.reset_index(inplace=True)
        
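        # Momentum is the change in mean sentiment from the previous window
        sentiment_15min['sentiment_momentum'] = sentiment_15min['sentiment_mean'].diff()
        # Variance is NaN for windows with fewer than two tweets, and momentum is NaN
        # next to empty windows; both are set to 0 (sentiment_mean itself stays NaN)
        sentiment_15min['sentiment_var'] = sentiment_15min['sentiment_var'].fillna(0)
        sentiment_15min['sentiment_momentum'] = sentiment_15min['sentiment_momentum'].fillna(0)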
        
        print(f"✅ Created 15-minute aggregation: {len(sentiment_15min)} periods")
        
        print("💾 Saving aggregated sentiment dataset...")
        output_path = sampled_dir / f'sentiment_sample_{i}.csv'
        sentiment_15min.to_csv(output_path, index=False)
        print(f"✅ Saved 15-minute sentiment data: {output_path}")
else:
    print("Sentiment model not available. Skipping analysis.")


--- Processing Sample 1 ---
🔍 Analyzing sentiment for 2,072 tweets in sample 1 using roberta_twitter...


Analyzing sentiment: 100%|██████████| 33/33 [00:22<00:00,  1.44it/s]


✅ Sentiment analysis complete!
📅 Aggregating sentiment by 15-minute windows...
✅ Created 15-minute aggregation: 672 periods
💾 Saving aggregated sentiment dataset...
✅ Saved 15-minute sentiment data: ../data/processed/sampled/sentiment_sample_1.csv

--- Processing Sample 2 ---
🔍 Analyzing sentiment for 240 tweets in sample 2 using roberta_twitter...


Analyzing sentiment: 100%|██████████| 4/4 [00:02<00:00,  1.56it/s]


✅ Sentiment analysis complete!
📅 Aggregating sentiment by 15-minute windows...
✅ Created 15-minute aggregation: 665 periods
💾 Saving aggregated sentiment dataset...
✅ Saved 15-minute sentiment data: ../data/processed/sampled/sentiment_sample_2.csv

--- Processing Sample 3 ---
🔍 Analyzing sentiment for 2,820 tweets in sample 3 using roberta_twitter...


Analyzing sentiment: 100%|██████████| 45/45 [00:34<00:00,  1.32it/s]


✅ Sentiment analysis complete!
📅 Aggregating sentiment by 15-minute windows...
✅ Created 15-minute aggregation: 673 periods
💾 Saving aggregated sentiment dataset...
✅ Saved 15-minute sentiment data: ../data/processed/sampled/sentiment_sample_3.csv

--- Processing Sample 4 ---
🔍 Analyzing sentiment for 2,973 tweets in sample 4 using roberta_twitter...


Analyzing sentiment: 100%|██████████| 47/47 [00:30<00:00,  1.54it/s]


✅ Sentiment analysis complete!
📅 Aggregating sentiment by 15-minute windows...
✅ Created 15-minute aggregation: 673 periods
💾 Saving aggregated sentiment dataset...
✅ Saved 15-minute sentiment data: ../data/processed/sampled/sentiment_sample_4.csv

--- Processing Sample 5 ---
🔍 Analyzing sentiment for 7,663 tweets in sample 5 using roberta_twitter...


Analyzing sentiment: 100%|██████████| 120/120 [01:38<00:00,  1.22it/s]

✅ Sentiment analysis complete!
📅 Aggregating sentiment by 15-minute windows...
✅ Created 15-minute aggregation: 673 periods
💾 Saving aggregated sentiment dataset...
✅ Saved 15-minute sentiment data: ../data/processed/sampled/sentiment_sample_5.csv
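
The period counts in the logs above can be sanity-checked with a little arithmetic. The sketch below assumes each sample spans roughly one week. The dates are hypothetical and chosen only to produce a full seven-day range.

```python
import pandas as pd

# One week of 15-minute windows: 7 days * 24 hours * 4 windows per hour.
windows_per_week = 7 * 24 * 4

# Hypothetical one-week span; the dates are for illustration only.
index = pd.date_range("2021-01-01 00:00", "2021-01-07 23:45", freq="15min")

# Sample 2 reported 240 tweets across 665 windows, so at least this many
# windows contain no tweets at all.
min_empty_windows = 665 - 240

print(windows_per_week, len(index), min_empty_windows)  # 672 672 425
```

Resampling spans from the window of the first tweet to the window of the last, which would explain why the counts vary slightly around 672 (665 to 673). In sparse samples such as sample 2, most windows are empty: their `sentiment_count` is 0 and their `sentiment_mean` is NaN, since the cell above fills only variance and momentum.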



