# Phase 3: Model Development (Classification)

In this notebook, we will train and evaluate baseline classification models using the **Balanced Corpus** generated in Phase 2.

**Objectives:**
1. Load `processed_corpus_balanced.csv`.
2. Train Baseline Models (Logistic Regression, SVM, Random Forest) using TF-IDF.
3. Compare test-set accuracy across the models and inspect the confusion matrix of the best one.

In [None]:
%load_ext autoreload
%autoreload 2
import sys
import os
from pathlib import Path
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Make ../src importable so the project's models module can be loaded
sys.path.append(os.path.abspath("../src"))
from models import SentimentClassifier

## 1. Load Data

In [None]:
data_path = Path("../data/processed_corpus_balanced.csv")
df = pd.read_csv(data_path)
print(f"Loaded {len(df)} rows.")
df.head()

## 2. Train/Test Split

In [None]:
# Drop rows with missing text or labels (safety check before splitting)
df = df.dropna(subset=['clean_text', 'sentiment_score'])

X = df['clean_text']
y = df['sentiment_score']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}")
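
The `stratify=y` argument in the split above asks scikit-learn to keep the class proportions the same in the train and test sets. The following is a minimal sketch of how that can be checked, using a toy frame rather than the real corpus; the label values `0`, `1`, `2` and the `tr_*`/`te_*` variable names are assumptions chosen for illustration so that the notebook's own `X_train`/`X_test` are not overwritten.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 3:1:1 class ratio (illustrative only, not the real corpus).
toy = pd.DataFrame({
    "clean_text": [f"doc {i}" for i in range(100)],
    "sentiment_score": [0] * 60 + [1] * 20 + [2] * 20,
})

tr_X, te_X, tr_y, te_y = train_test_split(
    toy["clean_text"],
    toy["sentiment_score"],
    test_size=0.2,
    random_state=42,
    stratify=toy["sentiment_score"],
)

# Share of each class in each split; with stratify these should closely match.
train_share = tr_y.value_counts(normalize=True).sort_index()
test_share = te_y.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Running the same `value_counts(normalize=True)` comparison on the notebook's `y_train` and `y_test` should show whether the balanced corpus stays balanced after the split.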

## 3. Baseline Experiments
We train and evaluate three baselines on the same split: Logistic Regression, SVM, and Random Forest.
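
The `SentimentClassifier` class lives in `../src/models.py` and is not shown in this notebook. As a rough sketch of what such a wrapper might look like, the class below pairs a TF-IDF vectorizer with one of the three estimators in a scikit-learn `Pipeline`. The class name `SketchSentimentClassifier`, the choice of `LinearSVC` for the SVM, and all hyperparameters are assumptions, not the project's actual implementation; only the `model_type` keys and the `accuracy` / `confusion_matrix` result keys are taken from how the notebook uses the class.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


class SketchSentimentClassifier:
    """Hypothetical stand-in for models.SentimentClassifier (not the project's code)."""

    def __init__(self, model_type="logreg"):
        # Assumed estimator choices and settings, keyed like the notebook's model list.
        estimators = {
            "logreg": LogisticRegression(max_iter=1000),
            "svm": LinearSVC(),
            "rf": RandomForestClassifier(n_estimators=100, random_state=42),
        }
        if model_type not in estimators:
            raise ValueError(f"Unknown model_type: {model_type!r}")
        # TF-IDF is fitted inside the pipeline, so it only ever sees training text.
        self.pipeline = Pipeline([
            ("tfidf", TfidfVectorizer()),
            ("clf", estimators[model_type]),
        ])

    def train(self, X, y):
        self.pipeline.fit(X, y)
        return self

    def evaluate(self, X, y):
        y_pred = self.pipeline.predict(X)
        return {
            "accuracy": accuracy_score(y, y_pred),
            "confusion_matrix": confusion_matrix(y, y_pred),
        }
```

Keeping the vectorizer and the estimator in one pipeline means the vocabulary and IDF weights are learned from the training split alone, which avoids leaking test-set statistics into the features.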

In [None]:
results = {}
models = ['logreg', 'svm', 'rf']

for m in models:
    clf = SentimentClassifier(model_type=m)
    clf.train(X_train, y_train)
    res = clf.evaluate(X_test, y_test)
    results[m] = res

## 4. Evaluation

In [None]:
# Compare Accuracies
acc_df = pd.DataFrame([(m, res['accuracy']) for m, res in results.items()], columns=['Model', 'Accuracy'])
sns.barplot(data=acc_df, x='Model', y='Accuracy')
plt.title('Baseline Model Comparison')
plt.ylim(0, 1)
plt.show()

In [None]:
# Confusion-matrix heatmap for the most accurate model (expected to be SVM or LogReg)
best_model = acc_df.sort_values('Accuracy', ascending=False).iloc[0]['Model']
print(f"Best Model: {best_model}")

cm = results[best_model]['confusion_matrix']
plt.figure(figsize=(8,6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title(f'Confusion Matrix ({best_model})')
plt.ylabel('True')
plt.xlabel('Predicted')
plt.show()
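
The heatmap above shows raw counts. To read it as per-class rates instead, each row can be divided by its total, which puts per-class recall on the diagonal. The sketch below assumes the confusion matrix is a NumPy array with rows as true classes and columns as predicted classes (the scikit-learn convention); the numbers in `cm_example` are made up for illustration and are not results from this notebook.

```python
import numpy as np

# Illustrative counts only (rows = true class, columns = predicted class).
cm_example = np.array([
    [50, 5, 5],
    [10, 30, 10],
    [2, 8, 40],
])

# Divide each row by the number of true samples in that class.
row_totals = cm_example.sum(axis=1, keepdims=True)
cm_normalized = cm_example / row_totals  # every row now sums to 1

# The diagonal of the row-normalized matrix is the recall of each class.
per_class_recall = np.diag(cm_normalized)

print(np.round(cm_normalized, 2))
print(np.round(per_class_recall, 2))
```

If the normalized matrix is passed to `sns.heatmap`, the format string needs to change from `fmt='d'` to something like `fmt='.2f'`, since the values are no longer integers.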