# Multi-Competitor NER & Sentiment Analysis - IMPROVED VERSION v2

**Based on User Feedback:**
- NER Model: **90.2% F1** ✅ Excellent (keeping same approach)
- Sentiment Model: **58.0% F1** → **Target: 70%+** 🎯

**🚀 SENTIMENT IMPROVEMENTS INTEGRATED:**
1. ✅ **Data Augmentation** - Expand 212 samples to ~640 using synonym replacement & contextual substitution (+5-10% F1)
2. ✅ **Focal Loss** - Better handle class imbalance, focus on hard examples (+3-5% F1)
3. ✅ **Optimized Hyperparameters** - Lower LR (1e-5), more epochs (10), cosine schedule with a 0.2 warmup ratio (+2-4% F1)
4. ✅ **Early Stopping** - Stop when val F1 plateaus (patience=3) (+1-2% F1)
5. ✅ **Better Contextualization** - Improved prompt: "Tweet: {text} | Sentiment about {competitor}:" (+1-3% F1)
6. ✅ **Detailed Analysis** - Confusion matrix, per-class F1, confidence analysis
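
As an illustration of item 5, here is a minimal sketch of how the contextualized input could be assembled, assuming plain string formatting. `build_sentiment_prompt` is a hypothetical helper name used only for this example; it is not defined elsewhere in this notebook.

```python
def build_sentiment_prompt(text, competitor):
    """Wrap a tweet in the template 'Tweet: {text} | Sentiment about {competitor}:'."""
    # Hypothetical helper: strips surrounding whitespace, then fills the template.
    return f"Tweet: {text.strip()} | Sentiment about {competitor}:"


print(build_sentiment_prompt("Love the new burger", "KFC"))
```

Naming the competitor in the input lets the same tweet be scored once per brand it mentions.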

**Expected Total Improvement:** Sentiment F1: **68-75%** (from 58%)

**All Previous Fixes Still Applied:**
- ✅ AdamW import fixed (torch.optim.AdamW)
- ✅ CSV UTF-8 auto-conversion
- ✅ Excel formatted output with color-coding
- ✅ Separate datasets for NER (3,183 rows) and Sentiment (265 rows)
- ✅ Single-label NER classification
- ✅ No more KeyError: 'SENTIMENT'

**Pipeline:**
1. Load and convert CSVs to UTF-8
2. Prepare separate datasets for NER and Sentiment
3. Train NER classifier (14-class) - **Keep excellent 90.2% F1**
4. **[NEW]** Augment sentiment training data (212 → ~640 samples)
5. **[NEW]** Train sentiment model with Focal Loss, early stopping, optimized hyperparameters
6. **[NEW]** Detailed sentiment analysis with confusion matrix
7. Generate predictions using hybrid NER + regex approach
8. Export to formatted Excel

## 1. Setup & Installation

In [None]:
# Install required libraries
!pip install -q transformers datasets torch torchvision accelerate
!pip install -q scikit-learn pandas numpy matplotlib seaborn
!pip install -q sentencepiece protobuf
!pip install -q openpyxl xlsxwriter  # For Excel export
!pip install -q chardet  # For encoding detection
!pip install -q nlpaug  # For data augmentation (IMPROVED)

# Download NLTK data for augmentation
import nltk
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)

print("✓ All libraries installed")
print("✓ NLTK data downloaded for augmentation")

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import re
import gc
import warnings
import chardet
import random
from pathlib import Path
warnings.filterwarnings('ignore')

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torch.cuda.amp import autocast, GradScaler
from torch.optim import AdamW  # FIXED: Use torch.optim.AdamW instead of transformers

# Transformers
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup,
    get_cosine_schedule_with_warmup  # IMPROVED: Cosine scheduler
)

# Sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, confusion_matrix,
    accuracy_score, f1_score, precision_recall_fscore_support
)
from sklearn.utils.class_weight import compute_class_weight

# Data Augmentation (IMPROVED)
import nlpaug.augmenter.word as naw

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

# Set seeds
SEED = 42
np.random.seed(SEED)
torch.manual_seed(SEED)
random.seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print("✓ Libraries imported successfully!")
print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ Using AdamW from: torch.optim")
print(f"✓ Data augmentation ready (nlpaug)")

In [None]:
# Check GPU and configure device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    gpu_props = torch.cuda.get_device_properties(0)
    gpu_memory_gb = gpu_props.total_memory / 1e9
    print(f"GPU: {gpu_props.name}")
    print(f"GPU Memory: {gpu_memory_gb:.2f} GB")
    
    # Adaptive batch size
    if gpu_memory_gb >= 15:
        BATCH_SIZE = 16
        GRAD_ACCUM_STEPS = 2
    else:
        BATCH_SIZE = 8
        GRAD_ACCUM_STEPS = 4
else:
    BATCH_SIZE = 4
    GRAD_ACCUM_STEPS = 8

EFFECTIVE_BATCH_SIZE = BATCH_SIZE * GRAD_ACCUM_STEPS
print(f"\nBatch size: {BATCH_SIZE}, Accumulation: {GRAD_ACCUM_STEPS}, Effective: {EFFECTIVE_BATCH_SIZE}")

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create directories
!mkdir -p '/content/drive/MyDrive/KFC_ML_Models'
!mkdir -p '/content/results'

MODEL_SAVE_DIR = '/content/drive/MyDrive/KFC_ML_Models'
RESULTS_DIR = '/content/results'

## 2. Configuration

In [None]:
# Competitor list (14 total)
COMPETITORS = [
    'Burger King', 'Deliveroo', "Domino's", 'Five Guys', 'Greggs',
    'Just Eat', 'KFC', "McDonald's", "Nando's", "Papa John's",
    'Pizza Hut', 'Pret a Manger', 'Taco Bell', 'Uber Eats'
]

SENTIMENT_MAP = {0: 'negative', 1: 'neutral', 2: 'positive'}

# Model configs
NER_MODEL_NAME = 'bert-base-uncased'
SENTIMENT_MODEL_NAME = 'cardiffnlp/twitter-roberta-base-sentiment-latest'

# Hyperparameters
MAX_SEQ_LENGTH = 128

# NER hyperparameters (KEEP - performing excellently at 90.2% F1)
NER_LEARNING_RATE = 2e-5
NER_EPOCHS = 5
NER_WARMUP_RATIO = 0.1

# IMPROVED Sentiment hyperparameters (targeting 70%+ F1 from 58%)
SENTIMENT_LEARNING_RATE = 1e-5  # Lower LR for finer adjustments
SENTIMENT_EPOCHS = 10  # More training time
SENTIMENT_WARMUP_RATIO = 0.2  # More gradual warmup
SENTIMENT_EARLY_STOP_PATIENCE = 3  # Early stopping
AUGMENTATION_FACTOR = 3  # 212 samples → ~640 samples

# Other
WEIGHT_DECAY = 0.01
FOCAL_LOSS_GAMMA = 2.0  # Focal loss parameter

print(f"✓ Configuration loaded")
print(f"  Competitors: {len(COMPETITORS)}")
print(f"  NER Model: {NER_MODEL_NAME}")
print(f"  Sentiment Model: {SENTIMENT_MODEL_NAME}")
print(f"\n🎯 SENTIMENT IMPROVEMENTS:")
print(f"  Learning Rate: {SENTIMENT_LEARNING_RATE} (was 2e-5)")
print(f"  Epochs: {SENTIMENT_EPOCHS} (was 5)")
print(f"  Warmup Ratio: {SENTIMENT_WARMUP_RATIO} (was 0.1)")
print(f"  Augmentation Factor: {AUGMENTATION_FACTOR}x")
print(f"  Early Stopping Patience: {SENTIMENT_EARLY_STOP_PATIENCE}")
print(f"  Focal Loss Gamma: {FOCAL_LOSS_GAMMA}")
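
To make the warmup settings concrete, the sketch below works through the step arithmetic for the sentiment run, assuming 636 augmented samples (212 × 3) and the 16 / 2 batch and accumulation branch. `schedule_steps` and `cosine_multiplier` are illustrative names; the multiplier follows the cosine-with-warmup shape as I understand the default half-cycle schedule in `transformers`, so treat it as an approximation rather than the library's code.

```python
import math


def schedule_steps(n_samples, batch_size, accum_steps, epochs, warmup_ratio):
    """Return (total optimizer steps, warmup steps) the way the training loops compute them."""
    # A DataLoader without drop_last yields ceil(n / batch_size) batches per epoch.
    batches_per_epoch = math.ceil(n_samples / batch_size)
    total_steps = batches_per_epoch * epochs // accum_steps
    return total_steps, int(total_steps * warmup_ratio)


def cosine_multiplier(step, warmup_steps, total_steps):
    """Learning-rate multiplier: linear ramp during warmup, then half-cosine decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total, warmup = schedule_steps(636, 16, 2, 10, 0.2)
print(total, warmup)  # 200 40
for step in (0, 20, 40, 120, 200):
    print(step, round(cosine_multiplier(step, warmup, total), 3))
```

With these assumptions the learning rate peaks at step 40 and is back to half its peak by step 120.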

## 3. CSV Loading with UTF-8 Conversion (FIXED)

In [None]:
def detect_encoding(file_path):
    """
    Detect file encoding using chardet.
    """
    with open(file_path, 'rb') as f:
        result = chardet.detect(f.read(100000))  # Read first 100KB
    return result['encoding'], result['confidence']

def convert_to_utf8(input_file, output_file=None):
    """
    Convert CSV file to UTF-8 encoding.
    If output_file is None, overwrites input_file.
    """
    if output_file is None:
        output_file = input_file
    
    # Detect encoding
    encoding, confidence = detect_encoding(input_file)
    print(f"  Detected encoding: {encoding} (confidence: {confidence:.2%})")
    
    if encoding and encoding.lower() != 'utf-8':
        # Read with detected encoding
        try:
            with open(input_file, 'r', encoding=encoding, errors='replace') as f:
                content = f.read()
            
            # Write as UTF-8
            with open(output_file, 'w', encoding='utf-8') as f:
                f.write(content)
            
            print(f"  ✓ Converted to UTF-8")
        except Exception as e:
            print(f"  ⚠ Conversion failed: {e}")
            print(f"  File left unchanged; load_csv_safe() will retry with the detected encoding")
    else:
        print(f"  ✓ Already UTF-8")

def load_csv_safe(file_path):
    """
    Load CSV with automatic UTF-8 conversion.
    """
    print(f"\nLoading: {file_path}")
    
    # Convert to UTF-8 first
    convert_to_utf8(file_path)
    
    # Load with pandas
    try:
        df = pd.read_csv(file_path, encoding='utf-8', low_memory=False)
        print(f"  ✓ Loaded {len(df)} rows")
        return df
    except Exception as e:
        print(f"  ⚠ Error loading with UTF-8: {e}")
        print(f"  Trying with encoding detection...")
        encoding, _ = detect_encoding(file_path)
        # read_csv has no `errors=` argument; pandas >= 1.3 exposes `encoding_errors` instead
        df = pd.read_csv(file_path, encoding=encoding, low_memory=False, encoding_errors='replace')
        print(f"  ✓ Loaded {len(df)} rows with {encoding}")
        return df

print("✓ CSV loading functions defined")
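
If `chardet` reports a wrong or empty encoding, a fixed fallback chain is one alternative. The sketch below is a standard-library-only variant, not the method used above; `decode_with_fallback` and its default encoding order are assumptions chosen for illustration.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this last resort cannot fail.
    return raw.decode("latin-1"), "latin-1"


print(decode_with_fallback("café".encode("utf-8")))
print(decode_with_fallback("café".encode("cp1252")))
```

Strict UTF-8 goes first because it rejects most non-UTF-8 byte sequences, which makes a false positive unlikely.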

In [None]:
# Upload CSV files
from google.colab import files
print("Please upload your CSV files:")
uploaded = files.upload()

In [None]:
# Load datasets with UTF-8 conversion
print("Loading and converting datasets to UTF-8...")

df_large = load_csv_safe('KFC_social_data.xlsx - Sheet1.csv')
df_train_sample = load_csv_safe('KFC_training_sample.csv')
df_test = load_csv_safe('KFC_test_sample.csv')
df_test_pred = load_csv_safe('KFC_test_sample_for_prediction.csv')

print(f"\n✓ All datasets loaded and converted to UTF-8")

## 4. Data Preprocessing

In [None]:
def clean_sentiment(value):
    """Extract numeric sentiment (0, 1, 2)"""
    if pd.isna(value):
        return None
    
    if isinstance(value, (int, float)):
        if value in [0, 1, 2]:
            return int(value)
        return None
    
    value_str = str(value).strip().lower()
    if value_str in ['0', 'negative']: return 0
    elif value_str in ['1', 'neutral']: return 1
    elif value_str in ['2', 'positive']: return 2
    
    match = re.match(r'^(\d)', value_str)
    if match:
        digit = int(match.group(1))
        if digit in [0, 1, 2]:
            return digit
    
    return None

def normalize_competitor_name(comp_str):
    """Normalize competitor names"""
    if pd.isna(comp_str) or not comp_str:
        return None
    
    comp_str = str(comp_str).strip()
    
    if comp_str in COMPETITORS:
        return comp_str
    
    for comp in COMPETITORS:
        if comp.lower() == comp_str.lower():
            return comp
    
    return None

def prepare_dataset(df, name="dataset"):
    """Clean and prepare dataset"""
    print(f"\nPreparing {name}...")
    print(f"  Initial rows: {len(df)}")
    
    # Select columns
    essential_cols = ['Competitor', 'Tweet', 'SENTIMENT']
    if 'Tweet' not in df.columns:
        if 'Full Text' in df.columns:
            df['Tweet'] = df['Full Text']
        elif 'Snippet' in df.columns:
            df['Tweet'] = df['Snippet']
    
    available_cols = [col for col in essential_cols if col in df.columns]
    metadata_cols = [col for col in ['Impact', 'Impressions', 'Reach (new)', 'Date', 'Url'] if col in df.columns]
    
    df_clean = df[available_cols + metadata_cols].copy()
    
    # Clean sentiment
    if 'SENTIMENT' in df_clean.columns:
        df_clean['SENTIMENT'] = df_clean['SENTIMENT'].apply(clean_sentiment)
        before = len(df_clean)
        df_clean = df_clean.dropna(subset=['SENTIMENT'])
        print(f"  Dropped {before - len(df_clean)} rows with invalid sentiment")
        df_clean['SENTIMENT'] = df_clean['SENTIMENT'].astype(int)
    
    # Clean competitor names
    df_clean['Competitor'] = df_clean['Competitor'].apply(normalize_competitor_name)
    before = len(df_clean)
    df_clean = df_clean.dropna(subset=['Competitor', 'Tweet'])
    print(f"  Dropped {before - len(df_clean)} rows with invalid competitor/tweet")
    
    # Clean tweet text
    df_clean['Tweet'] = df_clean['Tweet'].astype(str).str.strip()
    df_clean = df_clean[df_clean['Tweet'].str.len() > 0]
    
    df_clean = df_clean.reset_index(drop=True)
    
    print(f"  Final rows: {len(df_clean)}")
    print(f"  Unique competitors: {df_clean['Competitor'].nunique()}")
    
    if 'SENTIMENT' in df_clean.columns:
        sent_dist = df_clean['SENTIMENT'].value_counts().sort_index()
        print(f"  Sentiment distribution: {dict(sent_dist)}")
    
    return df_clean

# CRITICAL FIX: Prepare datasets separately for NER and Sentiment
print("="*70)
print("PREPARING DATASETS FOR DIFFERENT TASKS")
print("="*70)
print("NER model needs: 'Competitor' labels")
print("Sentiment model needs: 'SENTIMENT' labels")
print("="*70)

# NER Dataset (use large dataset - has Competitor labels)
df_large_clean = prepare_dataset(df_large, "Large dataset for NER")

# Sentiment Dataset (use training sample - has SENTIMENT labels)
df_train_sample_clean = prepare_dataset(df_train_sample, "Training sample for Sentiment")

# Test datasets
df_test_clean = prepare_dataset(df_test, "Test dataset")

# For prediction data (no sentiment)
if 'SENTIMENT' not in df_test_pred.columns:
    df_test_pred['SENTIMENT'] = 1  # Dummy value
df_test_pred_clean = prepare_dataset(df_test_pred, "Test prediction dataset")

print("\n" + "="*70)
print("DATASET SUMMARY:")
print(f"  NER training data: {len(df_large_clean)} rows (from large CSV)")
print(f"  Sentiment training data: {len(df_train_sample_clean)} rows (from training sample)")
print(f"  Test data: {len(df_test_clean)} rows")
print("="*70)

## 5. Multi-Competitor Extraction (Regex)

In [None]:
def extract_all_competitors(tweet_text):
    """
    Extract ALL competitors mentioned using regex.
    """
    found_competitors = set()
    tweet_lower = tweet_text.lower()
    
    patterns = {
        'Burger King': [r'\bburger\s*king\b', r'\bbk\b'],
        'Deliveroo': [r'\bdeliveroo\b'],
        "Domino's": [r'\bdomino(?:s|\'s)?\b'],
        'Five Guys': [r'\bfive\s*guys\b'],
        'Greggs': [r'\bgreggs?\b'],
        'Just Eat': [r'\bjust\s*eat\b'],
        'KFC': [r'\bkfc\b', r'\bkentucky\s*fried\s*chicken\b', r'@kfc'],
        "McDonald's": [r'\bmcdonald(?:s|\'s)?\b', r'\bmaccies\b', r'\bmaccas\b', r'\bmcdonalds\b', r'@mcdonald'],
        "Nando's": [r'\bnando(?:s|\'s)\b', r'@nando'],
        "Papa John's": [r'\bpapa\s*john(?:s|\'s)?\b', r'@papajohn'],
        'Pizza Hut': [r'\bpizza\s*hut\b', r'@pizzahut'],
        'Pret a Manger': [r'\bpret(?:\s*a\s*manger)?\b', r'@pret'],
        'Taco Bell': [r'\btaco\s*bell\b', r'@tacobell'],
        'Uber Eats': [r'\buber\s*eats\b', r'@ubereats']
    }
    
    for competitor, pattern_list in patterns.items():
        for pattern in pattern_list:
            if re.search(pattern, tweet_lower):
                found_competitors.add(competitor)
                break
    
    # Sort so the result order is stable across runs (set iteration order is not)
    return sorted(found_competitors)

# Test
test_tweet = "I love KFC's chicken but McDonald's has better fries!"
print(f"Test tweet: {test_tweet}")
print(f"Found competitors: {extract_all_competitors(test_tweet)}")
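
The `\b` word boundaries matter most for short aliases such as `bk`. The sketch below copies two entries from the pattern table and shows that an alias embedded in a longer token is not matched; `find_brands` is an illustrative name, not part of the pipeline.

```python
import re

# Two entries copied from the pattern table above; the others are omitted for brevity.
PATTERNS = {
    "Burger King": [r"\bburger\s*king\b", r"\bbk\b"],
    "Greggs": [r"\bgreggs?\b"],
}


def find_brands(text):
    """Return the sorted brands whose patterns match the lowercased text."""
    lowered = text.lower()
    return sorted(
        brand
        for brand, brand_patterns in PATTERNS.items()
        if any(re.search(p, lowered) for p in brand_patterns)
    )


print(find_brands("BK was shut so we went to Greggs"))  # ['Burger King', 'Greggs']
print(find_brands("Landed in BKK today"))               # []
```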

## 6. Train/Val Split

In [None]:
# ============================================================
# TRAIN/VAL SPLIT - SEPARATE FOR NER AND SENTIMENT (FIXED)
# ============================================================

# NER: Split large dataset (for competitor identification)
print("\n" + "="*70)
print("Creating NER train/val split...")
print("="*70)
ner_train_df, ner_val_df = train_test_split(
    df_large_clean,
    test_size=0.2,
    random_state=SEED,
    stratify=df_large_clean['Competitor']
)

print(f"NER Dataset Split:")
print(f"  Training: {len(ner_train_df)} samples")
print(f"  Validation: {len(ner_val_df)} samples")
print(f"  Competitors: {ner_train_df['Competitor'].nunique()}")
print(f"\nTop 5 competitors in NER training set:")
print(ner_train_df['Competitor'].value_counts().head())

# Sentiment: Split training sample (for sentiment classification)
print("\n" + "="*70)
print("Creating Sentiment train/val split...")
print("="*70)
sentiment_train_df, sentiment_val_df = train_test_split(
    df_train_sample_clean,
    test_size=0.2,
    random_state=SEED,
    stratify=df_train_sample_clean['SENTIMENT']
)

print(f"Sentiment Dataset Split:")
print(f"  Training: {len(sentiment_train_df)} samples")
print(f"  Validation: {len(sentiment_val_df)} samples")
print(f"  Sentiment distribution in training:")
sentiment_dist = sentiment_train_df['SENTIMENT'].value_counts().sort_index()
for sent, count in sentiment_dist.items():
    print(f"    {SENTIMENT_MAP[sent]:8s}: {count} ({count/len(sentiment_train_df)*100:.1f}%)")

print("\n" + "="*70)
print(f"Test data: {len(df_test_clean)} samples")
print("="*70)

## 6b. Sentiment Data Augmentation (IMPROVED)

In [None]:
# ============================================================
# DATA AUGMENTATION FOR SENTIMENT TRAINING (IMPROVED)
# ============================================================

print("\n" + "="*70)
print("AUGMENTING SENTIMENT TRAINING DATA")
print("="*70)
print(f"Original training samples: {len(sentiment_train_df)}")
print(f"Target: {len(sentiment_train_df) * AUGMENTATION_FACTOR} samples ({AUGMENTATION_FACTOR}x)")

# Create augmenters
try:
    aug_synonym = naw.SynonymAug(aug_src='wordnet', aug_p=0.3)
    print("✓ Synonym augmenter loaded")
except Exception as e:
    print(f"⚠ Synonym augmenter failed: {e}")
    aug_synonym = None

try:
    aug_contextual = naw.ContextualWordEmbsAug(
        model_path='bert-base-uncased',
        action="substitute",
        aug_p=0.3,
        device='cuda' if torch.cuda.is_available() else 'cpu'
    )
    print("✓ Contextual augmenter loaded")
except Exception as e:
    print(f"⚠ Contextual augmenter failed: {e}")
    aug_contextual = None

# Augment data
augmented_rows = []

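# Use a positional counter: iterrows() yields the original index labels, which are shuffled after the split
for idx, (_, row) in enumerate(tqdm(sentiment_train_df.iterrows(), total=len(sentiment_train_df), desc="Augmenting")):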
    # Add original
    augmented_rows.append(row.to_dict())
    
    tweet = row['Tweet']
    
    # Try synonym replacement
    if aug_synonym is not None and len(augmented_rows) < (idx + 1) * AUGMENTATION_FACTOR:
        try:
            aug_text = aug_synonym.augment(tweet)
            if isinstance(aug_text, list):
                aug_text = aug_text[0]
            if aug_text != tweet:
                aug_row = row.to_dict()
                aug_row['Tweet'] = aug_text
                augmented_rows.append(aug_row)
        except Exception:
            # Skip this augmentation if the augmenter fails on the tweet
            pass
    
    # Try contextual substitution
    if aug_contextual is not None and len(augmented_rows) < (idx + 1) * AUGMENTATION_FACTOR:
        try:
            aug_text = aug_contextual.augment(tweet)
            if isinstance(aug_text, list):
                aug_text = aug_text[0]
            if aug_text != tweet:
                aug_row = row.to_dict()
                aug_row['Tweet'] = aug_text
                augmented_rows.append(aug_row)
        except Exception:
            # Skip this augmentation if the augmenter fails on the tweet
            pass

sentiment_train_df_augmented = pd.DataFrame(augmented_rows)

print(f"\n✓ Augmentation complete!")
print(f"  Original: {len(sentiment_train_df)} samples")
print(f"  Augmented: {len(sentiment_train_df_augmented)} samples")
print(f"  Increase: {len(sentiment_train_df_augmented) / len(sentiment_train_df):.1f}x")

print(f"\nAugmented Sentiment Distribution:")
for sent, count in sentiment_train_df_augmented['SENTIMENT'].value_counts().sort_index().items():
    print(f"  {SENTIMENT_MAP[sent]:8s}: {count} ({count/len(sentiment_train_df_augmented)*100:.1f}%)")

print("="*70)
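
The per-sample cap in the loop above can be checked in isolation. This sketch reproduces the cap logic with a positional counter and two stand-in augmenters (`str.upper` and a lambda that appends `!`), which are placeholders for the `nlpaug` augmenters and are chosen only to make the output predictable.

```python
def augment_rows(rows, augmenters, factor):
    """Keep each original row and add augmented copies until the running cap is reached."""
    out = []
    for pos, row in enumerate(rows):
        out.append(row)
        for augment in augmenters:
            # Cap: after processing row `pos`, at most (pos + 1) * factor rows exist.
            if len(out) >= (pos + 1) * factor:
                break
            new_text = augment(row["Tweet"])
            if new_text != row["Tweet"]:
                # Copy the row so the label and metadata carry over unchanged.
                out.append({**row, "Tweet": new_text})
    return out


rows = [
    {"Tweet": "great chicken", "SENTIMENT": 2},
    {"Tweet": "cold fries", "SENTIMENT": 0},
]
stand_ins = [str.upper, lambda t: t + "!"]  # placeholders for real augmenters
print(len(augment_rows(rows, stand_ins, factor=3)))  # 6
print(len(augment_rows(rows, stand_ins, factor=2)))  # 4
```

Because each row can add at most one copy per augmenter, two augmenters cannot exceed a 3x expansion regardless of the factor.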

## 7. NER Model - Single-Label Classification

In [None]:
class CompetitorDataset(Dataset):
    """Dataset for competitor classification (0-13)"""
    
    def __init__(self, dataframe, tokenizer, max_length=128):
        self.data = dataframe.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length
        self.competitor_to_idx = {comp: idx for idx, comp in enumerate(COMPETITORS)}
        
        # Validation
        unique_comps = self.data['Competitor'].unique()
        print(f"\nDataset has {len(unique_comps)} unique competitors:")
        for comp in unique_comps:
            if comp in self.competitor_to_idx:
                count = (self.data['Competitor'] == comp).sum()
                print(f"  ✓ {comp}: {count} samples (label {self.competitor_to_idx[comp]})")
            else:
                print(f"  ✗ {comp}: NOT IN COMPETITOR LIST!")
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        tweet = row['Tweet']
        competitor = row['Competitor']
        label = self.competitor_to_idx[competitor]
        
        encoding = self.tokenizer(
            tweet,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }

print("✓ CompetitorDataset defined")

In [None]:
# Load NER tokenizer
print(f"Loading NER tokenizer: {NER_MODEL_NAME}")
ner_tokenizer = AutoTokenizer.from_pretrained(NER_MODEL_NAME)

# Create datasets - FIXED: Use ner_train_df and ner_val_df
print("\nCreating NER datasets...")
ner_train_dataset = CompetitorDataset(ner_train_df, ner_tokenizer, MAX_SEQ_LENGTH)
ner_val_dataset = CompetitorDataset(ner_val_df, ner_tokenizer, MAX_SEQ_LENGTH)
ner_test_dataset = CompetitorDataset(df_test_clean, ner_tokenizer, MAX_SEQ_LENGTH)

# DataLoaders
ner_train_loader = DataLoader(ner_train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
ner_val_loader = DataLoader(ner_val_dataset, batch_size=BATCH_SIZE*2, shuffle=False, num_workers=2)
ner_test_loader = DataLoader(ner_test_dataset, batch_size=BATCH_SIZE*2, shuffle=False, num_workers=2)

print(f"\n✓ NER DataLoaders created")

In [None]:
# Class weights for imbalanced data - FIXED: Use ner_train_df
train_labels = [ner_train_dataset.competitor_to_idx[comp] for comp in ner_train_df['Competitor']]
ner_class_weights = compute_class_weight(
    'balanced',
    classes=np.arange(len(COMPETITORS)),
    y=train_labels
)
ner_class_weights = torch.tensor(ner_class_weights, dtype=torch.float).to(device)

print("NER Class Weights:")
for i, comp in enumerate(COMPETITORS):
    print(f"  {comp:20s}: {ner_class_weights[i]:.3f}")
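
For reference, the `'balanced'` heuristic weights each class by `n_samples / (n_classes * class_count)`. The sketch below recomputes that formula in plain Python on made-up labels so the effect is visible without scikit-learn; `balanced_weights` is an illustrative name.

```python
from collections import Counter


def balanced_weights(labels, n_classes):
    """Weight per class: n_samples / (n_classes * count of that class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return [n_samples / (n_classes * counts[c]) for c in range(n_classes)]


# Made-up label distribution: 6, 3, and 1 samples for classes 0, 1, and 2.
labels = [0] * 6 + [1] * 3 + [2] * 1
print([round(w, 3) for w in balanced_weights(labels, 3)])  # [0.556, 1.111, 3.333]
```

Rare classes get proportionally larger weights. This sketch raises `ZeroDivisionError` if a class has no samples, so every competitor needs at least one training row.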

## 6c. Focal Loss & Early Stopping Classes (IMPROVED)

In [None]:
# ============================================================
# FOCAL LOSS FOR HANDLING CLASS IMBALANCE (IMPROVED)
# ============================================================

class FocalLoss(nn.Module):
    """
    Focal Loss: FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t)
    
    Focuses training on hard-to-classify examples.
    Down-weights easy examples to prevent them from dominating training.
    """
    def __init__(self, alpha=None, gamma=2.0, reduction='mean'):
        super(FocalLoss, self).__init__()
        self.alpha = alpha  # Class weights
        self.gamma = gamma  # Focusing parameter
        self.reduction = reduction
    
    def forward(self, inputs, targets):
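        # Unweighted CE equals -log(p_t), so exp(-ce_loss) recovers the true p_t
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        p_t = torch.exp(-ce_loss)
        focal_loss = (1 - p_t) ** self.gamma * ce_loss
        # Apply alpha_t afterwards; passing it as `weight` to cross_entropy would distort p_t
        if self.alpha is not None:
            focal_loss = self.alpha.to(focal_loss.device)[targets] * focal_loss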
        
        if self.reduction == 'mean':
            return focal_loss.mean()
        elif self.reduction == 'sum':
            return focal_loss.sum()
        else:
            return focal_loss

# ============================================================
# EARLY STOPPING TO PREVENT OVERFITTING (IMPROVED)
# ============================================================

class EarlyStopping:
    """Stop training when validation F1 stops improving"""
    def __init__(self, patience=3, min_delta=0.001, mode='max'):
        self.patience = patience
        self.min_delta = min_delta
        self.mode = mode
        self.counter = 0
        self.best_score = None
        self.early_stop = False
    
    def __call__(self, val_score):
        if self.best_score is None:
            self.best_score = val_score
            return False
        
        if self.mode == 'max':
            improved = val_score > self.best_score + self.min_delta
        else:
            improved = val_score < self.best_score - self.min_delta
        
        if improved:
            self.best_score = val_score
            self.counter = 0
        else:
            self.counter += 1
        
        if self.counter >= self.patience:
            self.early_stop = True
            return True
        
        return False

print("✓ FocalLoss class defined")
print("✓ EarlyStopping class defined")
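
A worked example of the focal-loss formula from the docstring, FL(p_t) = -α_t (1 - p_t)^γ log(p_t), written in plain Python for a single example. `focal_term` is an illustrative name and the probabilities are made up.

```python
import math


def focal_term(p_t, gamma=2.0, alpha=1.0):
    """Focal loss for one example whose true class has predicted probability p_t."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


for p_t in (0.9, 0.3):
    ce = -math.log(p_t)
    fl = focal_term(p_t)
    # fl / ce equals the modulating factor (1 - p_t) ** gamma
    print(f"p_t={p_t}: CE={ce:.4f}  FL={fl:.4f}  factor={fl / ce:.2f}")
```

With γ = 2, a confident correct prediction (p_t = 0.9) keeps 1% of its cross-entropy loss and a hard example (p_t = 0.3) keeps 49%. Setting γ = 0 recovers plain cross-entropy.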

In [None]:
def train_model(model, train_loader, val_loader, epochs, learning_rate, class_weights, model_name="model"):
    """
    Training function with torch.optim.AdamW (FIXED)
    """
    model = model.to(device)
    
    # FIXED: Use torch.optim.AdamW
    optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=WEIGHT_DECAY)
    
    total_steps = len(train_loader) * epochs // GRAD_ACCUM_STEPS
    warmup_steps = int(total_steps * NER_WARMUP_RATIO)  # WARMUP_RATIO is not defined in the config
    scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
    
    criterion = nn.CrossEntropyLoss(weight=class_weights)
    scaler = GradScaler()
    
    history = {'train_loss': [], 'val_loss': [], 'val_accuracy': [], 'val_f1': []}
    best_val_f1 = 0
    
    print(f"\nTraining {model_name}...")
    print(f"  Epochs: {epochs}, Steps: {total_steps}, Warmup: {warmup_steps}")
    print(f"  Using optimizer: torch.optim.AdamW\n")
    
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}/{epochs}")
        print("-" * 50)
        
        # Training
        model.train()
        train_loss = 0
        optimizer.zero_grad()
        
        train_pbar = tqdm(train_loader, desc="Training")
        for step, batch in enumerate(train_pbar):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            with autocast():
                outputs = model(input_ids, attention_mask)
                loss = criterion(outputs.logits, labels) / GRAD_ACCUM_STEPS
            
            scaler.scale(loss).backward()
            
            if (step + 1) % GRAD_ACCUM_STEPS == 0:
                scaler.step(optimizer)
                scaler.update()
                scheduler.step()
                optimizer.zero_grad()
            
            train_loss += loss.item() * GRAD_ACCUM_STEPS
            train_pbar.set_postfix({'loss': f'{loss.item() * GRAD_ACCUM_STEPS:.4f}'})
        
        avg_train_loss = train_loss / len(train_loader)
        history['train_loss'].append(avg_train_loss)
        
        # Validation
        model.eval()
        val_loss = 0
        all_preds = []
        all_labels = []
        
        with torch.no_grad():
            val_pbar = tqdm(val_loader, desc="Validation")
            for batch in val_pbar:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                
                with autocast():
                    outputs = model(input_ids, attention_mask)
                    loss = criterion(outputs.logits, labels)
                
                val_loss += loss.item()
                preds = torch.argmax(outputs.logits, dim=1)
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())
        
        avg_val_loss = val_loss / len(val_loader)
        val_accuracy = accuracy_score(all_labels, all_preds)
        val_f1 = f1_score(all_labels, all_preds, average='macro')
        
        history['val_loss'].append(avg_val_loss)
        history['val_accuracy'].append(val_accuracy)
        history['val_f1'].append(val_f1)
        
        print(f"\nResults:")
        print(f"  Train Loss: {avg_train_loss:.4f}")
        print(f"  Val Loss: {avg_val_loss:.4f}")
        print(f"  Val Accuracy: {val_accuracy:.4f}")
        print(f"  Val F1 (macro): {val_f1:.4f}")
        
        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            print(f"  ✓ New best F1! Saving model...")
            torch.save(model.state_dict(), f'{MODEL_SAVE_DIR}/{model_name}_best.pt')
        print()
        
        torch.cuda.empty_cache()
        gc.collect()
    
    print(f"\n✓ Training complete! Best val F1: {best_val_f1:.4f}")
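    # Note: the returned model holds the final-epoch weights; the best checkpoint is at {MODEL_SAVE_DIR}/{model_name}_best.pt
    return model, history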

print("✓ Training function defined (using torch.optim.AdamW)")
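
One property of the accumulation logic is worth checking. The optimizer only steps when `(step + 1) % GRAD_ACCUM_STEPS == 0`, so trailing batches in an epoch never trigger an update, and their gradients are cleared by `optimizer.zero_grad()` at the start of the next epoch. The sketch below counts both quantities; `accumulation_summary` is an illustrative name.

```python
def accumulation_summary(num_batches, accum_steps):
    """Return (optimizer updates per epoch, trailing batches that never reach an update)."""
    updates = sum(1 for step in range(num_batches) if (step + 1) % accum_steps == 0)
    leftover = num_batches % accum_steps
    return updates, leftover


print(accumulation_summary(40, 2))  # (20, 0)
print(accumulation_summary(27, 4))  # (6, 3)
```

When the leftover count is non-zero, slightly fewer updates run than batches were processed. That only matters when the number of batches per epoch is small relative to the accumulation factor.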

In [None]:
def train_sentiment_model_improved(model, train_loader, val_loader, epochs, learning_rate,
                                   class_weights, early_stopping_patience=3, model_name="sentiment_model"):
    """
    IMPROVED training function for sentiment model with:
    - Focal Loss
    - Early Stopping
    - Cosine scheduler
    - Better progress tracking
    """
    model = model.to(device)
    
    optimizer = AdamW(model.parameters(), lr=learning_rate, weight_decay=WEIGHT_DECAY)
    
    total_steps = len(train_loader) * epochs // GRAD_ACCUM_STEPS
    warmup_steps = int(total_steps * SENTIMENT_WARMUP_RATIO)
    
    # IMPROVED: Use cosine scheduler for sentiment
    scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)
    
    # IMPROVED: Use Focal Loss instead of CrossEntropyLoss
    criterion = FocalLoss(alpha=class_weights, gamma=FOCAL_LOSS_GAMMA)
    
    # IMPROVED: Add early stopping
    early_stopping = EarlyStopping(patience=early_stopping_patience, mode='max')
    
    scaler = GradScaler()
    
    history = {'train_loss': [], 'val_loss': [], 'val_accuracy': [], 'val_f1': []}
    best_val_f1 = 0
    
    print(f"\nTraining {model_name} with IMPROVEMENTS...")
    print(f"  Epochs: {epochs}, Steps: {total_steps}, Warmup: {warmup_steps}")
    print(f"  Learning Rate: {learning_rate}")
    print(f"  Using: Focal Loss (gamma={FOCAL_LOSS_GAMMA}), Cosine Scheduler, Early Stopping (patience={early_stopping_patience})")
    print(f"  Training samples: {len(train_loader.dataset)}\n")
    
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}/{epochs}")
        print("-" * 50)
        
        # Training
        model.train()
        train_loss = 0
        optimizer.zero_grad()
        
        train_pbar = tqdm(train_loader, desc="Training")
        for step, batch in enumerate(train_pbar):
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            with autocast():
                outputs = model(input_ids, attention_mask)
                loss = criterion(outputs.logits, labels) / GRAD_ACCUM_STEPS
            
            scaler.scale(loss).backward()
            
            if (step + 1) % GRAD_ACCUM_STEPS == 0:
                scaler.step(optimizer)
                scaler.update()
                scheduler.step()
                optimizer.zero_grad()
            
            train_loss += loss.item() * GRAD_ACCUM_STEPS
            train_pbar.set_postfix({'loss': f'{loss.item() * GRAD_ACCUM_STEPS:.4f}'})
        
        avg_train_loss = train_loss / len(train_loader)
        history['train_loss'].append(avg_train_loss)
        
        # Validation
        model.eval()
        val_loss = 0
        all_preds = []
        all_labels = []
        
        with torch.no_grad():
            val_pbar = tqdm(val_loader, desc="Validation")
            for batch in val_pbar:
                input_ids = batch['input_ids'].to(device)
                attention_mask = batch['attention_mask'].to(device)
                labels = batch['labels'].to(device)
                
                with autocast():
                    outputs = model(input_ids, attention_mask)
                    loss = criterion(outputs.logits, labels)
                
                val_loss += loss.item()
                preds = torch.argmax(outputs.logits, dim=1)
                all_preds.extend(preds.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())
        
        avg_val_loss = val_loss / len(val_loader)
        val_accuracy = accuracy_score(all_labels, all_preds)
        val_f1 = f1_score(all_labels, all_preds, average='macro')
        
        history['val_loss'].append(avg_val_loss)
        history['val_accuracy'].append(val_accuracy)
        history['val_f1'].append(val_f1)
        
        print(f"\nResults:")
        print(f"  Train Loss: {avg_train_loss:.4f}")
        print(f"  Val Loss: {avg_val_loss:.4f}")
        print(f"  Val Accuracy: {val_accuracy:.4f}")
        print(f"  Val F1 (macro): {val_f1:.4f}")
        
        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
            print(f"  ✓ New best F1! Saving model...")
            torch.save(model.state_dict(), f'{MODEL_SAVE_DIR}/{model_name}_best.pt')
        
        # IMPROVED: Check early stopping
        if early_stopping(val_f1):
            print(f"  ⚠ Early stopping triggered (no improvement for {early_stopping_patience} epochs)")
            break
        
        print()
        torch.cuda.empty_cache()
        gc.collect()
    
    print(f"\n✓ Training complete! Best val F1: {best_val_f1:.4f}")
    print(f"  Stopped at epoch {epoch + 1}/{epochs}")
    
    return model, history

print("✓ Improved sentiment training function defined")
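
The training loop above steps the optimizer only when `(step + 1) % GRAD_ACCUM_STEPS == 0`, and each epoch begins with `optimizer.zero_grad()`. When the number of batches is not a multiple of `GRAD_ACCUM_STEPS`, the gradients of the trailing batches are therefore cleared without ever being applied. The sketch below is a dependency-free illustration of that bookkeeping; the helper names and the example numbers are assumptions made for this illustration and are not part of the notebook.

```python
from typing import List


def optimizer_step_batches(num_batches: int, accum_steps: int) -> List[int]:
    """Return the 0-based batch indices after which the optimizer would step."""
    return [step for step in range(num_batches) if (step + 1) % accum_steps == 0]


def leftover_batches(num_batches: int, accum_steps: int) -> int:
    """Count trailing batches whose gradients never reach an optimizer step."""
    return num_batches % accum_steps


# Illustrative values only: 10 batches per epoch, accumulating over 4 batches.
print(optimizer_step_batches(num_batches=10, accum_steps=4))
print(leftover_batches(num_batches=10, accum_steps=4))
```

If the leftover count matters for a small dataset, one option is to also step on the final batch of the epoch, provided the scheduler's total step count is computed to match.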

In [None]:
# Train NER model
print(f"Initializing NER model: {NER_MODEL_NAME}")
ner_model = AutoModelForSequenceClassification.from_pretrained(
    NER_MODEL_NAME,
    num_labels=len(COMPETITORS)
)

ner_model, ner_history = train_model(
    ner_model,
    ner_train_loader,
    ner_val_loader,
    epochs=NER_EPOCHS,
    learning_rate=NER_LEARNING_RATE,
    class_weights=ner_class_weights,
    model_name="ner_model"
)

In [None]:
# Plot NER history
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

axes[0].plot(ner_history['train_loss'], label='Train', marker='o')
axes[0].plot(ner_history['val_loss'], label='Val', marker='s')
axes[0].set_title('NER - Loss', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()
axes[0].grid(alpha=0.3)

axes[1].plot(ner_history['val_accuracy'], marker='o', color='blue')
axes[1].set_title('NER - Validation Accuracy', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].grid(alpha=0.3)

axes[2].plot(ner_history['val_f1'], marker='o', color='green')
axes[2].set_title('NER - Validation F1', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('F1 Score')
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(f'{RESULTS_DIR}/ner_training.png', dpi=300)
plt.show()

print(f"✓ Best NER F1: {max(ner_history['val_f1']):.4f}")
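
The F1 values tracked above come from `f1_score(..., average='macro')`, which averages the per-class F1 scores with equal weight regardless of class size. The following sketch recomputes that quantity in plain Python to show why macro F1 can sit well below accuracy when a small class is missed entirely. The function names and the toy labels are assumptions for illustration only.

```python
from typing import Dict, Sequence


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int],
                 labels: Sequence[int]) -> Dict[int, float]:
    """Compute F1 = 2*TP / (2*TP + FP + FN) for each class label."""
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = (2 * tp / denom) if denom else 0.0
    return scores


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int],
             labels: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


# Toy example: class 1 is never predicted, so its F1 is 0.
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 2, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}  macro_f1={macro_f1(y_true, y_pred, [0, 1, 2]):.3f}")
```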

## 8. Sentiment Model

In [None]:
class SentimentDataset(Dataset):
    """IMPROVED: Competitor-aware sentiment dataset with better contextualization"""
    
    def __init__(self, dataframe, tokenizer, max_length=128):
        self.data = dataframe.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        tweet = row['Tweet']
        competitor = row['Competitor']
        sentiment = row['SENTIMENT']
        
        # IMPROVED: Better contextualization
        text = f"Tweet: {tweet} | Sentiment about {competitor}:"
        
        encoding = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(sentiment, dtype=torch.long)
        }

print("✓ IMPROVED SentimentDataset defined")
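
The model only ever sees the prompt string built in `SentimentDataset.__getitem__`, so any code that runs the model later should build its input the same way. A small shared helper is one way to keep the two in sync. The sketch below assumes a hypothetical helper name, `build_sentiment_prompt`, that does not exist in the notebook; the format string itself is the one used in the dataset class above.

```python
def build_sentiment_prompt(tweet: str, competitor: str) -> str:
    """Build the competitor-aware prompt used as model input."""
    return f"Tweet: {tweet} | Sentiment about {competitor}:"


# The same tweet yields a different input for each competitor it mentions.
tweet = "KFC's chicken is amazing but McDonald's has terrible service"
for name in ["KFC", "McDonald's"]:
    print(build_sentiment_prompt(tweet, name))
```

Calling one helper from both the dataset class and the prediction code would avoid a silent mismatch between training and inference inputs.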

In [None]:
# Load sentiment tokenizer
print(f"Loading Sentiment tokenizer: {SENTIMENT_MODEL_NAME}")
sentiment_tokenizer = AutoTokenizer.from_pretrained(SENTIMENT_MODEL_NAME)

# Create datasets - IMPROVED: Use AUGMENTED training data
print("\nCreating IMPROVED Sentiment datasets...")
sentiment_train_dataset = SentimentDataset(sentiment_train_df_augmented, sentiment_tokenizer, MAX_SEQ_LENGTH)  # AUGMENTED!
sentiment_val_dataset = SentimentDataset(sentiment_val_df, sentiment_tokenizer, MAX_SEQ_LENGTH)  # Keep validation original
sentiment_test_dataset = SentimentDataset(df_test_clean, sentiment_tokenizer, MAX_SEQ_LENGTH)

print(f"  Train (AUGMENTED): {len(sentiment_train_dataset)} samples")
print(f"  Val: {len(sentiment_val_dataset)} samples")
print(f"  Test: {len(sentiment_test_dataset)} samples")

# DataLoaders
sentiment_train_loader = DataLoader(sentiment_train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
sentiment_val_loader = DataLoader(sentiment_val_dataset, batch_size=BATCH_SIZE*2, shuffle=False, num_workers=2)
sentiment_test_loader = DataLoader(sentiment_test_dataset, batch_size=BATCH_SIZE*2, shuffle=False, num_workers=2)

print(f"\n✓ IMPROVED Sentiment DataLoaders created with augmented training data")

In [None]:
# Sentiment class weights - IMPROVED: Use augmented training data
sentiment_class_weights = compute_class_weight(
    'balanced',
    classes=np.arange(3),
    y=sentiment_train_df_augmented['SENTIMENT'].values  # AUGMENTED!
)
sentiment_class_weights = torch.tensor(sentiment_class_weights, dtype=torch.float).to(device)

print("Sentiment Class Weights (from augmented data):")
for i, label in SENTIMENT_MAP.items():
    print(f"  {label:8s}: {sentiment_class_weights[i]:.3f}")
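
For reference, the `'balanced'` mode of `compute_class_weight` assigns each class the weight `n_samples / (n_classes * count_of_class)`, so rarer classes get larger weights. The sketch below reproduces that formula without scikit-learn. The function name and the toy label counts are assumptions for illustration, and the sketch assumes every class appears at least once.

```python
from collections import Counter
from typing import List, Sequence


def balanced_class_weights(labels: Sequence[int], num_classes: int) -> List[float]:
    """Weight for class c is n_samples / (num_classes * count of c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return [n_samples / (num_classes * counts[c]) for c in range(num_classes)]


# Toy distribution: 6 negative, 3 neutral, 1 positive.
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(toy_labels, num_classes=3)
print([round(w, 3) for w in weights])
```

Because augmentation changes the class counts, computing the weights from the augmented frame (as the cell above does) keeps them consistent with what the model actually trains on.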

In [None]:
# Train Sentiment model with IMPROVEMENTS
print(f"Initializing Sentiment model: {SENTIMENT_MODEL_NAME}")
sentiment_model = AutoModelForSequenceClassification.from_pretrained(
    SENTIMENT_MODEL_NAME,
    num_labels=3,
    ignore_mismatched_sizes=True
)

# IMPROVED: Use new training function with all improvements
sentiment_model, sentiment_history = train_sentiment_model_improved(
    sentiment_model,
    sentiment_train_loader,
    sentiment_val_loader,
    epochs=SENTIMENT_EPOCHS,
    learning_rate=SENTIMENT_LEARNING_RATE,
    class_weights=sentiment_class_weights,
    early_stopping_patience=SENTIMENT_EARLY_STOP_PATIENCE,
    model_name="sentiment_model_improved"
)
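
The `early_stopping_patience` argument controls how many consecutive epochs without a validation F1 improvement are tolerated before training stops. The early-stopping object itself is defined elsewhere in the notebook. The sketch below shows one common way such a check can work, assuming that higher scores are better; the class name `EarlyStopper` and the `min_delta` parameter are hypothetical.

```python
from typing import Optional


class EarlyStopper:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best: Optional[float] = None
        self.bad_epochs = 0

    def __call__(self, score: float) -> bool:
        # An improvement resets the counter; anything else counts as a bad epoch.
        if self.best is None or score > self.best + self.min_delta:
            self.best = score
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation F1 values, one per epoch.
val_f1_per_epoch = [0.60, 0.64, 0.63, 0.64, 0.62, 0.70]
stopper = EarlyStopper(patience=3)
stopped_at = None
for epoch, score in enumerate(val_f1_per_epoch):
    if stopper(score):
        stopped_at = epoch
        break
print(f"stopped at epoch index {stopped_at}, best F1 {stopper.best}")
```

In this toy run the final epoch would have been the best one, which illustrates the trade-off: a small patience saves compute but can stop before a late improvement.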

In [None]:
# Plot Sentiment history
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

axes[0].plot(sentiment_history['train_loss'], label='Train', marker='o')
axes[0].plot(sentiment_history['val_loss'], label='Val', marker='s')
axes[0].set_title('Sentiment - Loss', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].legend()
axes[0].grid(alpha=0.3)

axes[1].plot(sentiment_history['val_accuracy'], marker='o', color='blue')
axes[1].set_title('Sentiment - Validation Accuracy', fontsize=14, fontweight='bold')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Accuracy')
axes[1].grid(alpha=0.3)

axes[2].plot(sentiment_history['val_f1'], marker='o', color='green')
axes[2].set_title('Sentiment - Validation F1', fontsize=14, fontweight='bold')
axes[2].set_xlabel('Epoch')
axes[2].set_ylabel('F1 Score')
axes[2].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(f'{RESULTS_DIR}/sentiment_training.png', dpi=300)
plt.show()

print(f"✓ Best Sentiment F1: {max(sentiment_history['val_f1']):.4f}")
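
The loss curves above are on the scale of the training criterion, which the improvement list describes as a Focal Loss. As a reminder of what that loss does, here is a minimal sketch of the standard formula, `FL = -alpha_t * (1 - p_t)^gamma * log(p_t)`, in plain Python. It assumes the notebook's criterion follows this standard form; the actual class is defined elsewhere and may differ, for example in how class weights are applied. All names below are illustrative.

```python
import math
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits: Sequence[float], target: int, gamma: float = 2.0,
               alpha: Optional[Sequence[float]] = None) -> float:
    """Focal loss for one example; gamma=0 and alpha=None gives cross-entropy."""
    p_t = softmax(logits)[target]
    weight = alpha[target] if alpha is not None else 1.0
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy example (confidently correct) versus a hard one (wrong class favoured).
easy_logits, hard_logits, target = [4.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0
for name, logits in [("easy", easy_logits), ("hard", hard_logits)]:
    ce = focal_loss(logits, target, gamma=0.0)
    fl = focal_loss(logits, target, gamma=2.0)
    print(f"{name}: cross-entropy={ce:.4f}  focal={fl:.4f}  ratio={fl / ce:.4f}")
```

The ratio column shows the intended effect: the easy example's loss is scaled down far more than the hard example's, so hard examples dominate the gradient.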

## 8b. Detailed Sentiment Analysis (IMPROVED)

In [None]:
# ============================================================
# DETAILED SENTIMENT PERFORMANCE ANALYSIS (IMPROVED)
# ============================================================

print("\n" + "="*70)
print("SENTIMENT MODEL DETAILED ANALYSIS")
print("="*70)

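# Evaluate the best checkpoint saved during training, not the last-epoch weights
sentiment_model.load_state_dict(
    torch.load(f'{MODEL_SAVE_DIR}/sentiment_model_improved_best.pt', map_location=device)
)
sentiment_model.eval()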
all_preds = []
all_labels = []
all_probs = []

with torch.no_grad():
    for batch in tqdm(sentiment_val_loader, desc="Analyzing"):
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        
        outputs = sentiment_model(input_ids, attention_mask)
        probs = F.softmax(outputs.logits, dim=1)
        preds = torch.argmax(outputs.logits, dim=1)
        
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())
        all_probs.extend(probs.cpu().numpy())

all_preds = np.array(all_preds)
all_labels = np.array(all_labels)
all_probs = np.array(all_probs)

# Classification report
print("\nClassification Report:")
print(classification_report(
    all_labels,
    all_preds,
    target_names=['negative', 'neutral', 'positive'],
    digits=4
))

# Confusion matrix
cm = confusion_matrix(all_labels, all_preds)
plt.figure(figsize=(8, 6))
sns.heatmap(
    cm,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=['negative', 'neutral', 'positive'],
    yticklabels=['negative', 'neutral', 'positive']
)
plt.title('Sentiment Confusion Matrix - IMPROVED MODEL', fontsize=14, fontweight='bold')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.savefig(f'{RESULTS_DIR}/sentiment_confusion_matrix_improved.png', dpi=300)
plt.show()

# Per-class metrics
precision, recall, f1, support = precision_recall_fscore_support(
    all_labels, all_preds, average=None
)

print("\nPer-Class Metrics:")
print(f"{'Class':<10} {'Precision':<10} {'Recall':<10} {'F1':<10} {'Support':<10}")
print("-" * 50)
for i, label in enumerate(['negative', 'neutral', 'positive']):
    print(f"{label:<10} {precision[i]:<10.4f} {recall[i]:<10.4f} {f1[i]:<10.4f} {support[i]:<10}")

# Overall metrics
overall_f1 = f1.mean()
print(f"\nOverall F1 (Macro Avg): {overall_f1:.4f}")
print(f"Improvement from baseline: {overall_f1 - 0.58:.4f} ({(overall_f1 - 0.58)/0.58*100:+.1f}% change)")

# Confidence analysis
avg_confidence = np.max(all_probs, axis=1).mean()
print(f"\nAverage Prediction Confidence: {avg_confidence:.3f}")

# Misclassifications
misclassified = all_preds != all_labels
num_misclass = misclassified.sum()
print(f"\nMisclassifications: {num_misclass} / {len(all_labels)} ({num_misclass/len(all_labels)*100:.1f}%)")

print("\n" + "="*70)
print("✓ Detailed analysis complete!")
print("="*70)
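
When reading the heatmap above, note that `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so each row sums to that class's support and the diagonal holds the correct predictions. The sketch below rebuilds such a matrix by hand and derives per-class recall from it. The function names and the toy labels are assumptions for illustration.

```python
from typing import List, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Build a matrix where cm[t][p] counts samples of true class t predicted as p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def recall_per_class(cm: List[List[int]]) -> List[float]:
    """Recall for class i is the diagonal entry divided by the row sum."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


# Toy labels: 0 = negative, 1 = neutral, 2 = positive.
toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 2, 2]
toy_cm = confusion_counts(toy_true, toy_pred, num_classes=3)
for label, row in zip(["negative", "neutral", "positive"], toy_cm):
    print(f"{label:<10} {row}")
print([round(r, 3) for r in recall_per_class(toy_cm)])
```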

## 9. Integrated Pipeline & Predictions

In [None]:
def predict_tweet(tweet_text, ner_model, sentiment_model, ner_tokenizer, sentiment_tokenizer):
    """
    Full pipeline: NER + regex → sentiment per competitor.
    Returns a list of (competitor, sentiment_id) tuples, one per detected competitor.
    """
    ner_model.eval()
    sentiment_model.eval()
    
    # NER prediction
    encoding = ner_tokenizer(
        tweet_text,
        add_special_tokens=True,
        max_length=MAX_SEQ_LENGTH,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    
    with torch.no_grad():
        with autocast():
            outputs = ner_model(encoding['input_ids'].to(device), encoding['attention_mask'].to(device))
    
    predicted_idx = torch.argmax(outputs.logits, dim=1).item()
    primary_competitor = COMPETITORS[predicted_idx]
    
    # Regex extraction
    regex_competitors = extract_all_competitors(tweet_text)
    
    # Combine, de-duplicating while keeping a deterministic order (NER prediction first)
    all_competitors = list(dict.fromkeys([primary_competitor, *regex_competitors]))
    
    # Sentiment for each
    results = []
    for competitor in all_competitors:
        # Use the same prompt format as SentimentDataset so inference matches training
        text = f"Tweet: {tweet_text} | Sentiment about {competitor}:"
        
        sentiment_encoding = sentiment_tokenizer(
            text,
            add_special_tokens=True,
            max_length=MAX_SEQ_LENGTH,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        
        with torch.no_grad():
            with autocast():
                sentiment_outputs = sentiment_model(
                    sentiment_encoding['input_ids'].to(device),
                    sentiment_encoding['attention_mask'].to(device)
                )
        
        predicted_sentiment = torch.argmax(sentiment_outputs.logits, dim=1).item()
        results.append((competitor, predicted_sentiment))
    
    return results

# Test
test_tweet = "KFC's chicken is amazing but McDonald's has terrible service"
print(f"Test tweet: {test_tweet}")
predictions = predict_tweet(test_tweet, ner_model, sentiment_model, ner_tokenizer, sentiment_tokenizer)
for comp, sent in predictions:
    print(f"  {comp}: {SENTIMENT_MAP[sent]}")
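
The hybrid step combines the single competitor predicted by the NER classifier with every competitor found by regex. The sketch below shows one way to do that merge so the result has no duplicates and a stable order. The pattern table is hypothetical and exists only for this illustration; the notebook's own `extract_all_competitors` and `COMPETITORS` list are defined elsewhere and may differ.

```python
import re
from typing import Dict, List

# Hypothetical patterns for illustration only.
EXAMPLE_PATTERNS: Dict[str, str] = {
    "KFC": r"\bkfc\b",
    "McDonald's": r"\bmcdonald'?s?\b",
}


def regex_competitors(text: str) -> List[str]:
    """Return every competitor whose pattern appears in the text."""
    return [name for name, pattern in EXAMPLE_PATTERNS.items()
            if re.search(pattern, text, flags=re.IGNORECASE)]


def merge_competitors(primary: str, regex_matches: List[str]) -> List[str]:
    """De-duplicate while keeping order: classifier prediction first."""
    return list(dict.fromkeys([primary, *regex_matches]))


tweet = "KFC's chicken is amazing but McDonald's has terrible service"
merged = merge_competitors("KFC", regex_competitors(tweet))
print(merged)
```

A plain `set` also removes duplicates, but its iteration order is not guaranteed, which can make the order of output rows vary between runs.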

In [None]:
def process_dataset(df, ner_model, sentiment_model, ner_tokenizer, sentiment_tokenizer):
    """
    Process entire dataset with pipeline.
    Returns DataFrame with one row per detected competitor.
    """
    results = []
    
    for idx, row in tqdm(df.iterrows(), total=len(df), desc="Processing tweets"):
        tweet = row['Tweet']
        
        # Get predictions
        predictions = predict_tweet(tweet, ner_model, sentiment_model, ner_tokenizer, sentiment_tokenizer)
        
        # Create row for each detected competitor
        for competitor, sentiment in predictions:
            result_row = {
                'Competitor': competitor,
                'Tweet': tweet,
                'Predicted_Sentiment': sentiment,
                'Sentiment_Label': SENTIMENT_MAP[sentiment]
            }
            
            # Add metadata
            for col in ['Impact', 'Impressions', 'Reach (new)', 'Date', 'Url']:
                if col in row.index and pd.notna(row[col]):
                    result_row[col] = row[col]
            
            results.append(result_row)
    
    return pd.DataFrame(results)

print("✓ Batch processing function defined")

In [None]:
# Generate predictions for test data
print("Generating predictions for test data...\n")

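# Load best models (file names follow the model_name passed to each training call)
ner_model.load_state_dict(torch.load(f'{MODEL_SAVE_DIR}/ner_model_best.pt', map_location=device))
sentiment_model.load_state_dict(torch.load(f'{MODEL_SAVE_DIR}/sentiment_model_improved_best.pt', map_location=device))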

# Process
predictions_df = process_dataset(
    df_test_pred_clean,
    ner_model,
    sentiment_model,
    ner_tokenizer,
    sentiment_tokenizer
)

print(f"\n✓ Generated {len(predictions_df)} predictions from {len(df_test_pred_clean)} tweets")
print(f"  Average competitors per tweet: {len(predictions_df) / len(df_test_pred_clean):.2f}")

print("\nSample predictions:")
print(predictions_df.head(10))

## 10. Excel Export with Formatting (FIXED)

In [None]:
def export_to_excel(df, filename, include_summary=True):
    """
    Export predictions to formatted Excel file.
    """
    filepath = f'{RESULTS_DIR}/{filename}'
    
    # Create Excel writer
    with pd.ExcelWriter(filepath, engine='xlsxwriter') as writer:
        # Write main data
        df.to_excel(writer, sheet_name='Predictions', index=False)
        
        # Get workbook and worksheet
        workbook = writer.book
        worksheet = writer.sheets['Predictions']
        
        # Define formats
        header_format = workbook.add_format({
            'bold': True,
            'text_wrap': True,
            'valign': 'top',
            'fg_color': '#D7E4BD',
            'border': 1
        })
        
        # Sentiment color formats
        negative_format = workbook.add_format({'bg_color': '#FFC7CE', 'font_color': '#9C0006'})
        neutral_format = workbook.add_format({'bg_color': '#FFEB9C', 'font_color': '#9C6500'})
        positive_format = workbook.add_format({'bg_color': '#C6EFCE', 'font_color': '#006100'})
        
        # Apply header format
        for col_num, value in enumerate(df.columns.values):
            worksheet.write(0, col_num, value, header_format)
        
        # Set column widths
        worksheet.set_column('A:A', 15)  # Competitor
        worksheet.set_column('B:B', 60)  # Tweet
        worksheet.set_column('C:C', 12)  # Predicted_Sentiment
        worksheet.set_column('D:D', 15)  # Sentiment_Label
        worksheet.set_column('E:Z', 12)  # Other columns
        
        # Apply conditional formatting for sentiment
        if 'Sentiment_Label' in df.columns:
            sentiment_col = df.columns.get_loc('Sentiment_Label')
            
            # Negative
            worksheet.conditional_format(1, sentiment_col, len(df), sentiment_col, {
                'type': 'text',
                'criteria': 'containing',
                'value': 'negative',
                'format': negative_format
            })
            
            # Neutral
            worksheet.conditional_format(1, sentiment_col, len(df), sentiment_col, {
                'type': 'text',
                'criteria': 'containing',
                'value': 'neutral',
                'format': neutral_format
            })
            
            # Positive
            worksheet.conditional_format(1, sentiment_col, len(df), sentiment_col, {
                'type': 'text',
                'criteria': 'containing',
                'value': 'positive',
                'format': positive_format
            })
        
        # Add summary sheet if requested
        if include_summary:
            summary_data = []
            
            # Overall stats
            summary_data.append(['Metric', 'Value'])
            summary_data.append(['Total Predictions', len(df)])
            summary_data.append(['Unique Tweets', df['Tweet'].nunique()])
            summary_data.append(['Unique Competitors', df['Competitor'].nunique()])
            summary_data.append([''])
            
            # Sentiment distribution
            summary_data.append(['Sentiment', 'Count', 'Percentage'])
            sent_dist = df['Sentiment_Label'].value_counts()
            for sent, count in sent_dist.items():
                pct = count / len(df) * 100
                summary_data.append([sent, count, f'{pct:.1f}%'])
            
            summary_data.append([''])
            
            # Per-competitor stats
            summary_data.append(['Competitor', 'Mentions', 'Positive', 'Neutral', 'Negative'])
            for comp in sorted(df['Competitor'].unique()):
                comp_df = df[df['Competitor'] == comp]
                pos = len(comp_df[comp_df['Sentiment_Label'] == 'positive'])
                neu = len(comp_df[comp_df['Sentiment_Label'] == 'neutral'])
                neg = len(comp_df[comp_df['Sentiment_Label'] == 'negative'])
                summary_data.append([comp, len(comp_df), pos, neu, neg])
            
            # Write summary
            summary_df = pd.DataFrame(summary_data)
            summary_df.to_excel(writer, sheet_name='Summary', index=False, header=False)
            
            # Format summary sheet
            summary_ws = writer.sheets['Summary']
            summary_ws.set_column('A:A', 20)
            summary_ws.set_column('B:E', 12)
    
    print(f"\n✓ Excel file saved: {filepath}")
    return filepath

print("✓ Excel export function defined")
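
The per-competitor block of the Summary sheet is a tally of sentiment labels grouped by competitor. The sketch below shows the same tally in plain Python, which can be handy for checking the sheet against expectations. The function name and the sample rows are assumptions for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def competitor_sentiment_counts(
    rows: Iterable[Tuple[str, str]]
) -> Dict[str, Dict[str, int]]:
    """Tally (competitor, sentiment_label) pairs into per-competitor counts."""
    table = defaultdict(Counter)
    for competitor, label in rows:
        table[competitor][label] += 1
    return {
        comp: {
            "mentions": sum(counts.values()),
            "positive": counts["positive"],
            "neutral": counts["neutral"],
            "negative": counts["negative"],
        }
        for comp, counts in sorted(table.items())
    }


# Made-up rows standing in for the Competitor and Sentiment_Label columns.
sample_rows = [
    ("KFC", "positive"), ("KFC", "negative"), ("KFC", "positive"),
    ("McDonald's", "neutral"),
]
summary = competitor_sentiment_counts(sample_rows)
for comp, stats in summary.items():
    print(comp, stats)
```

With the DataFrame at hand, `pd.crosstab(df['Competitor'], df['Sentiment_Label'])` gives a comparable table in one call.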

In [None]:
# Export to Excel
excel_path = export_to_excel(
    predictions_df,
    'KFC_Predictions_Complete.xlsx',
    include_summary=True
)

print("\n✓ Predictions exported to formatted Excel file!")
print(f"  File: {excel_path}")
print(f"  Sheets: 'Predictions' (main data), 'Summary' (statistics)")
print(f"  Format: Color-coded sentiment, fixed column widths, summary stats")

In [None]:
# Also save a plain copy to Google Drive (data only: no color-coding or Summary sheet)
drive_path = f'{MODEL_SAVE_DIR}/KFC_Predictions_Complete.xlsx'
predictions_df.to_excel(drive_path, index=False, engine='openpyxl')
print(f"\n✓ Also saved to Google Drive: {drive_path}")
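
The cell above re-exports `predictions_df` with openpyxl, which writes the data but not the xlsxwriter formatting or the Summary sheet. If a formatted copy in Drive is preferred, copying the finished file is one option. The sketch below assumes a hypothetical helper, `copy_report`, and uses temporary folders as stand-ins for `RESULTS_DIR` and the Drive folder so that it runs anywhere.

```python
import shutil
import tempfile
from pathlib import Path


def copy_report(src_path: str, dest_dir: str) -> Path:
    """Copy a finished report into dest_dir byte-for-byte and return the new path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / Path(src_path).name
    shutil.copy2(src_path, target)
    return target


# Self-contained demo: temporary folders stand in for the real directories.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "results" / "report.xlsx"
    src.parent.mkdir()
    src.write_bytes(b"stand-in for a formatted workbook")
    copied = copy_report(str(src), str(Path(tmp) / "drive"))
    same_bytes = copied.read_bytes() == src.read_bytes()

print(same_bytes)
```

In the notebook this would correspond to something like `copy_report(excel_path, MODEL_SAVE_DIR)`, assuming `MODEL_SAVE_DIR` points at the mounted Drive folder.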

## 11. Summary & Download

In [None]:
print("\n" + "="*70)
print("TRAINING & PREDICTION COMPLETE - IMPROVED VERSION")
print("="*70)

print("\n✅ All Fixes Applied:")
print("  1. AdamW import (torch.optim.AdamW)")
print("  2. CSV encoding (auto-converted to UTF-8)")
print("  3. Excel output (formatted .xlsx with color-coding)")
print("  4. NER model (single-label classification)")
print("  5. Separate datasets (NER: large, Sentiment: training sample)")

print("\n🚀 Sentiment Improvements Applied:")
print(f"  1. Data Augmentation: {len(sentiment_train_df)} → {len(sentiment_train_df_augmented)} samples")
print(f"  2. Focal Loss (gamma={FOCAL_LOSS_GAMMA})")
print(f"  3. Optimized Hyperparameters (LR={SENTIMENT_LEARNING_RATE}, Epochs={SENTIMENT_EPOCHS})")
print(f"  4. Early Stopping (patience={SENTIMENT_EARLY_STOP_PATIENCE})")
print("  5. Better Contextualization")
print("  6. Detailed Analysis with Confusion Matrix")

print("\n📊 Results:")
print(f"  NER Best F1: {max(ner_history['val_f1']):.4f}")
print(f"  Sentiment Best F1: {max(sentiment_history['val_f1']):.4f}")
print(f"  Sentiment Improvement: {(max(sentiment_history['val_f1']) - 0.58) / 0.58 * 100:+.1f}% from baseline (0.58)")
print(f"  Total predictions: {len(predictions_df)}")
print(f"  From {len(df_test_pred_clean)} tweets")

print("\n📁 Output Files:")
print(f"  Excel (formatted): {excel_path}")
print(f"  Google Drive: {drive_path}")
print(f"  Models: {MODEL_SAVE_DIR}/")
print(f"  Plots: {RESULTS_DIR}/")

print("\n" + "="*70)
print("✓ Ready to download results!")
print("="*70)

In [None]:
# Download Excel file
from google.colab import files
files.download(excel_path)