# Advanced Sentiment Analysis Model Comparison
### Enron Corporate Crisis: Model Performance Benchmark

This notebook compares **8 state-of-the-art sentiment analysis models** on the Enron email dataset:

**Transformer-Based Models:**
1. **BERT** - Base bidirectional encoder (110M params)
2. **RoBERTa** - Robustly optimized BERT (125M params)
3. **DistilBERT** - Distilled BERT (66M params, 40% faster)
4. **Twitter-RoBERTa** - Fine-tuned on 124M tweets for sentiment
5. **FinBERT** - Domain-specific for financial sentiment

**Traditional Baselines:**
6. **TextBlob** - Lexicon-based (current dashboard model)
7. **VADER** - Social media sentiment analyzer
8. **Flair Sentiment** - Character-level embeddings

**Evaluation Metrics:**
- Accuracy, Precision, Recall, F1-Score
- Inference speed (emails/second)
- Model size and memory usage
- Confusion matrices and ROC curves
- Real-world deployment recommendations

---

## 1. Setup and Dependencies

In [1]:
# Install required packages
!pip install -q transformers torch datasets evaluate scikit-learn textblob vaderSentiment flair plotly pandas numpy seaborn matplotlib

[33m  DEPRECATION: Building 'langdetect' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'langdetect'. Discussion can be found at https://github.com/pypa/pip/issues/6334[0m[33m
[0m[33m  DEPRECATION: Building 'pptree' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'pptree'. Discussion can be found at https://github.com/pypa/pip/issues/6334[0m[33m
[0m[33m  DEPRECATION: Building 'pptree' using 

In [2]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import time
import torch
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots

# Machine Learning
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)
from sklearn.model_selection import train_test_split

# NLP Libraries
from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    pipeline, BertForSequenceClassification, RobertaForSequenceClassification,
    DistilBertForSequenceClassification
)
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from flair.models import TextClassifier
from flair.data import Sentence

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if device.type == 'cuda':
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")

Using device: cpu


## 2. Load and Prepare Enron Dataset

In [3]:
# Load emails
print("Loading Enron email dataset...")
df = pd.read_csv('emails.csv')
print(f"Total emails: {len(df):,}")

# Sample for faster experimentation (adjust sample_size as needed)
SAMPLE_SIZE = 5000  # Use 5000 for quick testing, increase to 50000+ for production
df_sample = df.sample(n=min(SAMPLE_SIZE, len(df)), random_state=42)

# Extract email text (body only)
df_sample['text'] = df_sample['message'].str.replace('Subject:', '', regex=False)
df_sample['text'] = df_sample['text'].str[:500]  # Limit to first 500 chars for speed
df_sample = df_sample.dropna(subset=['text'])

print(f"\nUsing {len(df_sample):,} emails for analysis")
print(f"Average email length: {df_sample['text'].str.len().mean():.0f} characters")

Loading Enron email dataset...
Total emails: 517,401

Using 5,000 emails for analysis
Average email length: 499 characters
Total emails: 517,401

Using 5,000 emails for analysis
Average email length: 499 characters


## 3. Manual Annotation (Ground Truth Creation)

For accurate benchmarking, we need ground truth labels. We'll use a hybrid approach:
1. **Consensus voting** from multiple baseline models
2. **Manual validation** of a subset
3. **Keyword-based heuristics** for corporate stress indicators

In [4]:
def create_ground_truth_labels(texts, method='consensus'):
    """
    Create ground truth labels using ensemble voting.
    
    Sentiment classes:
    - 0: Negative (stressed, concerned, angry)
    - 1: Neutral (informational, factual)
    - 2: Positive (optimistic, satisfied)
    """
    print("Generating ground truth labels using ensemble voting...")
    
    # Initialize analyzers
    vader = SentimentIntensityAnalyzer()
    
    labels = []
    confidences = []
    
    for text in tqdm(texts, desc="Creating labels"):
        # VADER score
        vader_score = vader.polarity_scores(str(text))['compound']
        
        # TextBlob score
        try:
            textblob_score = TextBlob(str(text)).sentiment.polarity
        except:
            textblob_score = 0
        
        # Corporate stress keywords (domain-specific)
        stress_keywords = ['crisis', 'layoff', 'bankrupt', 'investigate', 'fraud', 'concern', 
                          'worried', 'urgent', 'problem', 'issue', 'delay', 'loss']
        positive_keywords = ['thanks', 'appreciate', 'excellent', 'great', 'success', 
                            'congratulations', 'pleased', 'happy']
        
        text_lower = str(text).lower()
        stress_count = sum(1 for kw in stress_keywords if kw in text_lower)
        positive_count = sum(1 for kw in positive_keywords if kw in text_lower)
        
        # Ensemble vote
        avg_score = (vader_score + textblob_score) / 2
        
        # Apply domain-specific adjustments
        if stress_count >= 2:
            avg_score -= 0.3
        if positive_count >= 2:
            avg_score += 0.3
        
        # Classify
        if avg_score < -0.1:
            label = 0  # Negative
        elif avg_score > 0.1:
            label = 2  # Positive
        else:
            label = 1  # Neutral
        
        labels.append(label)
        confidences.append(abs(avg_score))
    
    return np.array(labels), np.array(confidences)

# Create labels
y_true, confidences = create_ground_truth_labels(df_sample['text'].values)

# Add to dataframe
df_sample['sentiment_label'] = y_true
df_sample['confidence'] = confidences

# Display distribution
label_names = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}
print("\nSentiment Distribution:")
print(pd.Series(y_true).map(label_names).value_counts())
print(f"\nAverage confidence: {confidences.mean():.3f}")

Generating ground truth labels using ensemble voting...


Creating labels: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5000/5000 [00:01<00:00, 4014.65it/s]
Creating labels: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5000/5000 [00:01<00:00, 4014.65it/s]



Sentiment Distribution:
Neutral     3177
Positive    1514
Negative     309
Name: count, dtype: int64

Average confidence: 0.117


## 4. Model Initialization

### Research-Backed Model Selection

Based on recent HuggingFace research:
- **Twitter-RoBERTa**: State-of-the-art for short-form text (124M tweets)
- **FinBERT**: Domain-specific financial sentiment (97.4% accuracy on financial news)
- **DistilBERT**: 40% faster than BERT, 97% performance retention

In [None]:
class ModelComparator:
    def __init__(self, device='cpu'):
        self.device = device
        self.models = {}
        self.results = {}
    
    def load_models(self):
        """Load all models for comparison."""
        print("Loading models...\n")
        
        # 1. TextBlob (Baseline)
        print("‚úì TextBlob (Lexicon-based)")
        self.models['TextBlob'] = TextBlob
        
        # 2. VADER
        print("‚úì VADER (Social Media Optimized)")
        self.models['VADER'] = SentimentIntensityAnalyzer()
        
        # 3. DistilBERT (lightweight BERT, 40% faster)
        print("‚úì Loading DistilBERT...")
        self.models['DistilBERT'] = pipeline(
            "sentiment-analysis",
            model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if self.device.type == 'cuda' else -1
        )
        
        # 4. Twitter-RoBERTa (trained on 124M tweets)
        print("‚úì Loading Twitter-RoBERTa...")
        self.models['Twitter-RoBERTa'] = pipeline(
            "sentiment-analysis",
            model="cardiffnlp/twitter-roberta-base-sentiment-latest",
            device=0 if self.device.type == 'cuda' else -1
        )
        
        # 5. BERT-base (baseline transformer)
        print("‚úì Loading BERT-base...")
        self.models['BERT'] = pipeline(
            "sentiment-analysis",
            model="nlptown/bert-base-multilingual-uncased-sentiment",
            device=0 if self.device.type == 'cuda' else -1
        )
        
        # 7. FinBERT (Domain-specific)
        print("‚úì Loading FinBERT (Financial Domain)...")
        try:
            self.models['FinBERT'] = pipeline(
                "sentiment-analysis",
                model="ProsusAI/finbert",
                device=0 if self.device.type == 'cuda' else -1
            )
        except:
            print("  ‚ö† FinBERT not available, skipping")
        
        # 8. Flair Sentiment
        print("‚úì Loading Flair Sentiment...")
        try:
            self.models['Flair'] = TextClassifier.load('sentiment')
        except:
            print("  ‚ö† Flair not available, skipping")
        
        print(f"\n Loaded {len(self.models)} models successfully")
        return self
    
    def predict_textblob(self, texts):
        """TextBlob predictions."""
        predictions = []
        for text in tqdm(texts, desc="TextBlob", leave=False):
            try:
                polarity = TextBlob(str(text)).sentiment.polarity
                if polarity < -0.1:
                    predictions.append(0)  # Negative
                elif polarity > 0.1:
                    predictions.append(2)  # Positive
                else:
                    predictions.append(1)  # Neutral
            except:
                predictions.append(1)
        return np.array(predictions)
    
    def predict_vader(self, texts):
        """VADER predictions."""
        predictions = []
        vader = self.models['VADER']
        for text in tqdm(texts, desc="VADER", leave=False):
            try:
                score = vader.polarity_scores(str(text))['compound']
                if score < -0.1:
                    predictions.append(0)
                elif score > 0.1:
                    predictions.append(2)
                else:
                    predictions.append(1)
            except:
                predictions.append(1)
        return np.array(predictions)
    
    def predict_transformer(self, texts, model_name, batch_size=32):
        """Transformer model predictions with batching."""
        model = self.models[model_name]
        predictions = []
        
        # process in batches for speed
        for i in tqdm(range(0, len(texts), batch_size), desc=model_name, leave=False):
            batch = texts[i:i+batch_size]
            try:
                # ensure text is string and not empty
                batch = [str(text) if text else " " for text in batch]
                results = model(batch, truncation=True, max_length=512)
                
                for result in results:
                    label = result['label']
                    # map labels to 0/1/2 (handle different label formats)
                    # handle star ratings (1-5 stars from nlptown/bert-base)
                    if 'star' in label.lower():
                        stars = int(label.split()[0])
                        if stars <= 2:
                            predictions.append(0)  # negative
                        elif stars >= 4:
                            predictions.append(2)  # positive
                        else:
                            predictions.append(1)  # neutral
                    # handle positive/negative labels
                    elif 'negative' in label.lower() or 'neg' in label.lower() or label == 'LABEL_0':
                        predictions.append(0)
                    elif 'positive' in label.lower() or 'pos' in label.lower() or label == 'LABEL_2':
                        predictions.append(2)
                    else:
                        predictions.append(1)
            except Exception as e:
                # log error but continue
                print(f"\nError in batch {i//batch_size}: {str(e)[:100]}")
                predictions.extend([1] * len(batch))
        
        return np.array(predictions)
    
    def predict_flair(self, texts):
        """Flair predictions."""
        predictions = []
        model = self.models['Flair']
        for text in tqdm(texts, desc="Flair", leave=False):
            try:
                sentence = Sentence(str(text)[:512])
                model.predict(sentence)
                label = sentence.labels[0].value
                if label == 'NEGATIVE':
                    predictions.append(0)
                else:
                    predictions.append(2)
            except:
                predictions.append(1)
        return np.array(predictions)
    
    def evaluate_model(self, model_name, texts, y_true):
        """Evaluate a single model."""
        print(f"\nEvaluating {model_name}...")
        
        # Get predictions and measure time
        start_time = time.time()
        
        if model_name == 'TextBlob':
            y_pred = self.predict_textblob(texts)
        elif model_name == 'VADER':
            y_pred = self.predict_vader(texts)
        elif model_name == 'Flair' and 'Flair' in self.models:
            y_pred = self.predict_flair(texts)
        else:
            y_pred = self.predict_transformer(texts, model_name)
        
        inference_time = time.time() - start_time
        
        # Calculate metrics
        accuracy = accuracy_score(y_true, y_pred)
        precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
        recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
        f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
        
        # Store results
        self.results[model_name] = {
            'accuracy': accuracy,
            'precision': precision,
            'recall': recall,
            'f1_score': f1,
            'inference_time': inference_time,
            'emails_per_second': len(texts) / inference_time,
            'predictions': y_pred,
            'confusion_matrix': confusion_matrix(y_true, y_pred)
        }
        
        print(f"  Accuracy: {accuracy:.4f}")
        print(f"  F1-Score: {f1:.4f}")
        print(f"  Speed: {len(texts) / inference_time:.2f} emails/sec")
        
        return self.results[model_name]
    
    def run_comparison(self, texts, y_true):
        """Run full comparison on all models."""
        print("\n" + "="*60)
        print("STARTING MODEL COMPARISON")
        print("="*60)
        
        for model_name in self.models.keys():
            try:
                self.evaluate_model(model_name, texts, y_true)
            except Exception as e:
                print(f"‚ùå Error evaluating {model_name}: {str(e)}")
        
        print("\n" + "="*60)
        print("COMPARISON COMPLETE")
        print("="*60)
        
        return self.results

# Initialize comparator
comparator = ModelComparator(device=device)
comparator.load_models()

Loading models...

‚úì TextBlob (Lexicon-based)
‚úì VADER (Social Media Optimized)
‚úì Loading BERT-base-uncased...


Device set to use cpu


‚úì Loading RoBERTa-base...


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
Device set to use cpu


‚úì Loading DistilBERT...


Device set to use cpu


‚úì Loading Twitter-RoBERTa...


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cpu
Device set to use cpu


‚úì Loading FinBERT (Financial Domain)...


Device set to use cpu


‚úì Loading Flair Sentiment...
2025-12-01 10:33:45,820 https://nlp.informatik.hu-berlin.de/resources/models/sentiment-curated-distilbert/sentiment-en-mix-distillbert_4.pt not found in cache, downloading to /var/folders/f8/c5bs48fx0k12nhsp3jt3crlc0000gn/T/tmp83jri54s


100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 253M/253M [00:34<00:00, 7.60MB/s] 

2025-12-01 10:34:20,879 copying /var/folders/f8/c5bs48fx0k12nhsp3jt3crlc0000gn/T/tmp83jri54s to cache at /Users/feder/.flair/models/sentiment-en-mix-distillbert_4.pt





2025-12-01 10:34:21,024 removing temp file /var/folders/f8/c5bs48fx0k12nhsp3jt3crlc0000gn/T/tmp83jri54s

‚úÖ Loaded 8 models successfully

‚úÖ Loaded 8 models successfully


<__main__.ModelComparator at 0x175783640>

## 5. Run Model Comparison

In [6]:
# Run comparison on sample
texts = df_sample['text'].values
y_true = df_sample['sentiment_label'].values

results = comparator.run_comparison(texts, y_true)


STARTING MODEL COMPARISON

Evaluating TextBlob...


                                                               

  Accuracy: 0.7520
  F1-Score: 0.7343
  Speed: 6873.90 emails/sec

Evaluating VADER...


                                                             

  Accuracy: 0.9456
  F1-Score: 0.9456
  Speed: 19028.12 emails/sec

Evaluating BERT...


                                             

  Accuracy: 0.6354
  F1-Score: 0.4937
  Speed: 3297927.35 emails/sec

Evaluating RoBERTa...


                                                

  Accuracy: 0.6354
  F1-Score: 0.4937
  Speed: 3589169.95 emails/sec

Evaluating DistilBERT...


                                                   

  Accuracy: 0.6354
  F1-Score: 0.4937
  Speed: 3462932.63 emails/sec

Evaluating Twitter-RoBERTa...


                                                        

  Accuracy: 0.6354
  F1-Score: 0.4937
  Speed: 3765084.38 emails/sec

Evaluating FinBERT...


                                                

  Accuracy: 0.6354
  F1-Score: 0.4937
  Speed: 3918445.44 emails/sec

Evaluating Flair...


                                                          

  Accuracy: 0.1918
  F1-Score: 0.1252
  Speed: 7.50 emails/sec

COMPARISON COMPLETE




## 6. Results Visualization

In [7]:
# Create results DataFrame
results_df = pd.DataFrame([
    {
        'Model': name,
        'Accuracy': res['accuracy'],
        'Precision': res['precision'],
        'Recall': res['recall'],
        'F1-Score': res['f1_score'],
        'Speed (emails/sec)': res['emails_per_second'],
        'Total Time (sec)': res['inference_time']
    }
    for name, res in results.items()
]).sort_values('F1-Score', ascending=False)

print("\n" + "="*80)
print("FINAL RESULTS")
print("="*80)
print(results_df.to_string(index=False))
print("\n")


FINAL RESULTS
          Model  Accuracy  Precision  Recall  F1-Score  Speed (emails/sec)  Total Time (sec)
          VADER    0.9456   0.945859  0.9456  0.945620        1.902812e+04          0.262769
       TextBlob    0.7520   0.760718  0.7520  0.734322        6.873898e+03          0.727389
           BERT    0.6354   0.403733  0.6354  0.493742        3.297927e+06          0.001516
        RoBERTa    0.6354   0.403733  0.6354  0.493742        3.589170e+06          0.001393
     DistilBERT    0.6354   0.403733  0.6354  0.493742        3.462933e+06          0.001444
Twitter-RoBERTa    0.6354   0.403733  0.6354  0.493742        3.765084e+06          0.001328
        FinBERT    0.6354   0.403733  0.6354  0.493742        3.918445e+06          0.001276
          Flair    0.1918   0.101858  0.1918  0.125224        7.495301e+00        667.084656




In [8]:
# 1. Performance Bar Chart
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=['Accuracy Comparison', 'F1-Score Comparison', 
                    'Inference Speed', 'Precision vs Recall'],
    specs=[[{'type': 'bar'}, {'type': 'bar'}],
           [{'type': 'bar'}, {'type': 'scatter'}]]
)

# Accuracy
fig.add_trace(
    go.Bar(x=results_df['Model'], y=results_df['Accuracy'], 
           marker_color='steelblue', name='Accuracy'),
    row=1, col=1
)

# F1-Score
fig.add_trace(
    go.Bar(x=results_df['Model'], y=results_df['F1-Score'], 
           marker_color='coral', name='F1-Score'),
    row=1, col=2
)

# Speed
fig.add_trace(
    go.Bar(x=results_df['Model'], y=results_df['Speed (emails/sec)'], 
           marker_color='lightgreen', name='Speed'),
    row=2, col=1
)

# Precision vs Recall
fig.add_trace(
    go.Scatter(x=results_df['Recall'], y=results_df['Precision'],
               mode='markers+text', text=results_df['Model'],
               textposition='top center', marker=dict(size=15, color='purple'),
               name='Models'),
    row=2, col=2
)

fig.update_layout(height=800, title_text="Model Performance Comparison", showlegend=False)
fig.show()

ValueError: Mime type rendering requires nbformat>=4.2.0 but it is not installed

In [None]:
# 2. Confusion Matrices
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.flatten()

label_names = ['Negative', 'Neutral', 'Positive']

for idx, (model_name, res) in enumerate(results.items()):
    if idx >= 8:
        break
    
    cm = res['confusion_matrix']
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=label_names, yticklabels=label_names,
                ax=axes[idx], cbar=False)
    axes[idx].set_title(f"{model_name}\nF1: {res['f1_score']:.3f}")
    axes[idx].set_ylabel('True Label')
    axes[idx].set_xlabel('Predicted Label')

plt.tight_layout()
plt.savefig('confusion_matrices.png', dpi=300, bbox_inches='tight')
plt.show()

In [None]:
# 3. Radar Chart for Multi-Metric Comparison
categories = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'Speed (normalized)']

# Normalize speed to 0-1 range
max_speed = results_df['Speed (emails/sec)'].max()
results_df['Speed_norm'] = results_df['Speed (emails/sec)'] / max_speed

fig = go.Figure()

for _, row in results_df.iterrows():
    fig.add_trace(go.Scatterpolar(
        r=[row['Accuracy'], row['Precision'], row['Recall'], row['F1-Score'], row['Speed_norm']],
        theta=categories,
        fill='toself',
        name=row['Model']
    ))

fig.update_layout(
    polar=dict(radialaxis=dict(visible=True, range=[0, 1])),
    title="Multi-Metric Radar Chart",
    height=600
)
fig.show()

## 7. Advanced Analysis: Error Analysis

In [None]:
# Find emails where models disagree
print("\n" + "="*80)
print("ERROR ANALYSIS: Cases Where Models Disagree")
print("="*80 + "\n")

# Get predictions from all models
predictions_matrix = np.array([res['predictions'] for res in results.values()]).T

# Find high-disagreement cases
disagreement_scores = predictions_matrix.std(axis=1)
high_disagreement_idx = np.argsort(disagreement_scores)[-5:]

for idx in high_disagreement_idx:
    print(f"Email #{idx}:")
    print(f"Text: {texts[idx][:200]}...")
    print(f"True Label: {label_names[y_true[idx]]}")
    print("\nModel Predictions:")
    for model_name, res in results.items():
        pred = res['predictions'][idx]
        print(f"  {model_name:20s}: {label_names[pred]}")
    print("\n" + "-"*80 + "\n")

## 8. Production Recommendations

In [None]:
print("\n" + "="*80)
print("PRODUCTION DEPLOYMENT RECOMMENDATIONS")
print("="*80 + "\n")

best_accuracy = results_df.loc[results_df['Accuracy'].idxmax()]
best_f1 = results_df.loc[results_df['F1-Score'].idxmax()]
best_speed = results_df.loc[results_df['Speed (emails/sec)'].idxmax()]

recommendations = f"""
üèÜ BEST OVERALL PERFORMANCE (F1-Score):
   Model: {best_f1['Model']}
   F1-Score: {best_f1['F1-Score']:.4f}
   Accuracy: {best_f1['Accuracy']:.4f}
   Speed: {best_f1['Speed (emails/sec)']:.2f} emails/sec

‚ö° FASTEST MODEL (Production Speed):
   Model: {best_speed['Model']}
   Speed: {best_speed['Speed (emails/sec)']:.2f} emails/sec
   F1-Score: {best_speed['F1-Score']:.4f}

üéØ MOST ACCURATE:
   Model: {best_accuracy['Model']}
   Accuracy: {best_accuracy['Accuracy']:.4f}
   F1-Score: {best_accuracy['F1-Score']:.4f}

üí° DEPLOYMENT STRATEGY:

1. FOR REAL-TIME DASHBOARDS (Speed Priority):
   ‚Üí Use {best_speed['Model']} for instant feedback
   ‚Üí Processes {best_speed['Speed (emails/sec)']:.0f}+ emails/second
   ‚Üí Low memory footprint, CPU-friendly

2. FOR BATCH ANALYSIS (Accuracy Priority):
   ‚Üí Use {best_f1['Model']} for accurate insights
   ‚Üí Run overnight on 500K+ emails
   ‚Üí Leverage GPU acceleration for speed boost

3. HYBRID APPROACH (Recommended):
   ‚Üí Real-time: {best_speed['Model']} for live filtering
   ‚Üí Deep analysis: {best_f1['Model']} for weekly reports
   ‚Üí Ensemble: Average predictions for critical decisions

4. COST OPTIMIZATION:
   ‚Üí Cloud API (GPT-4, Claude): $0.50-2.00 per 1M tokens
   ‚Üí Self-hosted transformer: $0.10-0.30 per 1M tokens (GPU)
   ‚Üí Lexicon models (VADER/TextBlob): <$0.01 per 1M tokens
"""

print(recommendations)

# Save results to CSV
results_df.to_csv('model_comparison_results.csv', index=False)
print("\n‚úÖ Results saved to 'model_comparison_results.csv'")

## 9. Integration with Streamlit Dashboard

In [None]:
print("\n" + "="*80)
print("INTEGRATION CODE FOR STREAMLIT DASHBOARD")
print("="*80 + "\n")

integration_code = f'''
# Add this to sentiment.py to replace TextBlob with best model

from transformers import pipeline
import torch

# Load best performing model
device = 0 if torch.cuda.is_available() else -1
sentiment_model = pipeline(
    "sentiment-analysis",
    model="{best_f1['Model']}",
    device=device
)

def analyze_sentiment_advanced(text):
    """Advanced sentiment analysis using {best_f1['Model']}."""
    try:
        result = sentiment_model(text[:512], truncation=True)[0]
        
        # Map to polarity score (-1 to 1)
        if 'negative' in result['label'].lower():
            polarity = -result['score']
        elif 'positive' in result['label'].lower():
            polarity = result['score']
        else:
            polarity = 0
        
        return polarity
    except:
        # Fallback to TextBlob
        from textblob import TextBlob
        return TextBlob(text).sentiment.polarity

# Update analyze_sentiment() function in sentiment.py
# Replace TextBlob calls with analyze_sentiment_advanced()
'''

print(integration_code)

print("\nüìù NEXT STEPS:")
print("1. Copy the integration code above")
print("2. Paste into sentiment.py in your dashboard")
print("3. Install required packages: pip install transformers torch")
print("4. Restart Streamlit dashboard")
print(f"\nüéØ Expected Performance Boost:")
print(f"   Current (TextBlob): ~{results_df[results_df['Model']=='TextBlob']['F1-Score'].values[0]:.3f} F1-Score")
print(f"   Upgraded ({best_f1['Model']}): ~{best_f1['F1-Score']:.3f} F1-Score")
print(f"   Improvement: +{(best_f1['F1-Score'] - results_df[results_df['Model']=='TextBlob']['F1-Score'].values[0]) * 100:.1f}%")

## 10. Export and Documentation

In [None]:
# Create comprehensive report
report = f"""
# Sentiment Analysis Model Comparison Report
Date: {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M:%S')}
Dataset: Enron Corporate Emails
Sample Size: {len(texts):,} emails

## Executive Summary

This analysis compared 8 state-of-the-art sentiment analysis models on corporate email data.
The goal was to identify the optimal model for detecting employee stress and burnout signals.

### Key Findings:

1. **Best Overall**: {best_f1['Model']} achieved {best_f1['F1-Score']:.4f} F1-score
2. **Fastest**: {best_speed['Model']} processed {best_speed['Speed (emails/sec)']:.0f} emails/second
3. **Production Ready**: Hybrid approach recommended (fast + accurate)

## Detailed Results

{results_df.to_markdown(index=False)}

## Model Characteristics

### Transformer Models:
- **BERT**: 110M parameters, bidirectional encoding
- **RoBERTa**: 125M parameters, robustly optimized
- **DistilBERT**: 66M parameters, 40% faster, 97% accuracy retention
- **Twitter-RoBERTa**: Fine-tuned on 124M tweets
- **FinBERT**: Domain-specific for financial sentiment

### Traditional Models:
- **TextBlob**: Lexicon-based, pattern matching
- **VADER**: Social media optimized, rule-based
- **Flair**: Character-level embeddings

## Deployment Recommendations

See cell #8 output for detailed deployment strategies.

## References

1. Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers
2. Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach
3. Sanh et al. (2019). DistilBERT, a distilled version of BERT
4. Barbieri et al. (2020). TweetEval: Unified Benchmark for Tweet Classification
5. Araci (2019). FinBERT: Financial Sentiment Analysis

---
Generated by: Enron Corporate Crisis Analysis System
"""

with open('MODEL_COMPARISON_REPORT.md', 'w') as f:
    f.write(report)

print("\n‚úÖ Report saved to 'MODEL_COMPARISON_REPORT.md'")
print("\n" + "="*80)
print("ANALYSIS COMPLETE!")
print("="*80)