# Quick Start: Genre Classification

This notebook shows how to use the refactored `src/` modules to run a quick genre-classification experiment end to end.

**What this notebook does:**
- Loads and prepares data with a single function call
- Trains a TF-IDF baseline model
- Evaluates performance and visualizes results

**The data processing, model, and evaluation code now lives in the `src/` modules.**

In [None]:
import sys
from pathlib import Path

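# Add the project root to sys.path so `src` is importable.
# Path.cwd().parent assumes the notebook runs from a folder one level below the root.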
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Import from our modules
from src.data_loader import load_and_prepare_data
from src.models import TFIDFModel
from src.evaluate import evaluate_model, plot_confusion_matrix, plot_per_genre_metrics
from src.utils import set_seed

print("✓ Modules imported successfully!")

## 1. Load Data

A single call to `load_and_prepare_data` loads, cleans, balances, and splits the data.
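
To make the "balance and split" step concrete, here is a minimal sketch in plain Python. It is not the code in `src/data_loader.py`, which this notebook does not show: the function name `balance_and_split` and the toy labels are hypothetical, and the sketch assumes balancing means downsampling every genre to the same size and splitting each genre separately (a stratified split).

```python
import random
from collections import defaultdict


def balance_and_split(texts, labels, samples_per_genre, test_size=0.2, seed=42):
    """Downsample each genre to the same size, then split each genre separately."""
    rng = random.Random(seed)

    # Group texts by genre label.
    by_genre = defaultdict(list)
    for text, label in zip(texts, labels):
        by_genre[label].append(text)

    train, test = [], []
    for genre in sorted(by_genre):
        items = by_genre[genre]
        rng.shuffle(items)
        # Cap every genre at the same number of samples.
        items = items[:samples_per_genre]
        # Hold out the same fraction of every genre (stratified split).
        n_test = round(len(items) * test_size)
        test.extend((text, genre) for text in items[:n_test])
        train.extend((text, genre) for text in items[n_test:])

    rng.shuffle(train)
    rng.shuffle(test)
    return train, test


# Toy, deliberately imbalanced data: 50 texts of one genre, 20 of another.
texts = [f"text a{i}" for i in range(50)] + [f"text b{i}" for i in range(20)]
labels = ["genre_a"] * 50 + ["genre_b"] * 20

train, test = balance_and_split(texts, labels, samples_per_genre=10)
print(len(train), len(test))  # 16 4
```

Fixing the seed makes the sample and the split reproducible, which is presumably why the notebook passes `random_state=42` and calls `set_seed(42)`.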

In [None]:
set_seed(42)

# Load balanced dataset with 5,000 samples per genre
X_train, X_test, y_train, y_test = load_and_prepare_data(
    samples_per_genre=5000,  # Start small for quick testing
    test_size=0.2,
    use_cached=True,
    random_state=42
)

print(f"Training samples: {len(X_train):,}")
print(f"Test samples: {len(X_test):,}")
print(f"Genres: {sorted(y_train.unique())}")

## 2. Train TF-IDF Model

Training takes two calls: construct the model, then fit it.
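
The vocabulary parameters passed to `TFIDFModel` decide which terms the model ever sees. The sketch below illustrates them in plain Python, assuming they carry the same meaning as in scikit-learn's `TfidfVectorizer` (an integer `min_df` is a minimum document count, a float `max_df` is a maximum fraction of documents, `ngram_range=(1, 2)` means unigrams and bigrams). Whether `TFIDFModel` forwards them unchanged is an assumption, and `build_vocabulary` is a hypothetical helper, not part of `src/models.py`.

```python
from collections import Counter


def ngrams(tokens, ngram_range=(1, 2)):
    """Yield word n-grams for every n in the inclusive range."""
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def build_vocabulary(docs, ngram_range=(1, 2), min_df=1, max_df=1.0, max_features=None):
    """Keep n-grams whose document frequency lies within [min_df, max_df]."""
    doc_freq = Counter()   # number of documents containing the term
    term_freq = Counter()  # total occurrences across the corpus
    for doc in docs:
        terms = list(ngrams(doc.lower().split(), ngram_range))
        term_freq.update(terms)
        doc_freq.update(set(terms))

    max_count = max_df * len(docs)  # max_df is a fraction of all documents
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_count]
    # If capped, keep the most frequent terms, ties broken alphabetically.
    kept.sort(key=lambda t: (-term_freq[t], t))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)


# Toy corpus of four short documents.
docs = [
    "the guitar solo",
    "the drum solo",
    "the quiet piano",
    "the loud guitar",
]
vocab = build_vocabulary(docs, ngram_range=(1, 2), min_df=2, max_df=0.8)
print(vocab)  # ['guitar', 'solo']
```

Here "the" is dropped because it appears in all four documents (above the 0.8 ceiling), and every bigram is dropped because each appears in only one document (below `min_df=2`).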

In [None]:
# Create and train model
model = TFIDFModel(
    classifier_type='logistic',
    max_features=10000,
    ngram_range=(1, 2),
    min_df=5,
    max_df=0.8
)

model.fit(X_train, y_train)
print("✓ Model trained!")

## 3. Evaluate Performance
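
The cell in this section reports both overall accuracy and macro F1. As a rough illustration of why the two can disagree, the sketch below computes both by hand on toy labels. The helper functions are hypothetical and are not the implementation in `src/evaluate.py`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over the classes in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        # F1 = 2TP / (2TP + FP + FN); taken as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# A classifier that always predicts class "a".
y_true_toy = ["a"] * 8 + ["b"] * 2
y_pred_toy = ["a"] * 10

print(f"{accuracy(y_true_toy, y_pred_toy):.4f}")  # 0.8000
print(f"{macro_f1(y_true_toy, y_pred_toy):.4f}")  # 0.4444
```

Because macro F1 weights every class equally, a genre the model never predicts correctly pulls the score down sharply even when accuracy looks acceptable.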

In [None]:
# Make predictions
y_pred = model.predict(X_test)

# Comprehensive evaluation
results = evaluate_model(y_test, y_pred, model_name="TF-IDF Baseline")

print(f"\nOverall Accuracy: {results['accuracy']:.4f}")
print(f"Macro F1 Score: {results['macro_avg']['f1']:.4f}")

## 4. Visualize Results
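
The confusion matrix in this section is plotted with `normalize=True`. A plausible reading, assumed here rather than confirmed by `src/evaluate.py`, is row normalization: each row is divided by the number of true samples of that genre, so the diagonal shows per-genre recall. A minimal sketch with hypothetical helper names:

```python
def confusion_counts(y_true, y_pred, labels):
    """Count matrix: rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    return counts


def normalize_rows(counts):
    """Divide each row by its total so that every non-empty row sums to 1."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


cm_true = ["a", "a", "a", "a", "b", "b"]
cm_pred = ["a", "a", "a", "b", "b", "a"]

cm_counts = confusion_counts(cm_true, cm_pred, labels=["a", "b"])
cm_normalized = normalize_rows(cm_counts)
print(cm_counts)      # [[3, 1], [1, 1]]
print(cm_normalized)  # [[0.75, 0.25], [0.5, 0.5]]
```

Row normalization keeps the plot readable when genres have different numbers of test samples, since every row is on the same 0 to 1 scale.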

In [None]:
# Confusion matrix
plot_confusion_matrix(
    y_test,
    y_pred,
    title="TF-IDF Confusion Matrix",
    normalize=True
)

In [None]:
# Per-genre performance
plot_per_genre_metrics(
    results,
    title="TF-IDF Per-Genre Performance"
)

## 5. Analyze Top Features (Optional)

Inspect which words and two-word phrases carry the most weight for each genre.
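
For a linear classifier, one common way to rank features is by the size of each genre's coefficients. The sketch below assumes that approach and returns a dictionary mapping each genre to a list of `(feature, importance)` pairs, the shape the loop in the next cell expects. The function `top_features` and the toy weights are hypothetical; `model.get_feature_importance` may be implemented differently.

```python
def top_features(feature_names, weights_by_genre, top_n=10):
    """Return the top_n highest-weighted features for each genre."""
    result = {}
    for genre, weights in weights_by_genre.items():
        # Pair each feature with its weight and sort from largest to smallest.
        ranked = sorted(
            zip(feature_names, weights),
            key=lambda pair: pair[1],
            reverse=True,
        )
        result[genre] = ranked[:top_n]
    return result


# Toy vocabulary and per-genre weights (made up for illustration).
toy_features = ["love", "night", "road"]
toy_weights = {
    "genre_a": [1.2, -0.3, 0.4],
    "genre_b": [-0.8, 0.9, 0.1],
}

toy_top = top_features(toy_features, toy_weights, top_n=2)
print(toy_top["genre_a"])  # [('love', 1.2), ('road', 0.4)]
print(toy_top["genre_b"])  # [('night', 0.9), ('road', 0.1)]
```

Sorting by signed weight surfaces the terms that push a prediction toward a genre; sorting by absolute value would also surface terms that push strongly away from it.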

In [None]:
import pandas as pd

# Get top features for each genre
feature_importance = model.get_feature_importance(top_n=10)

# Display as DataFrame for easy viewing
for genre, features in feature_importance.items():
    print(f"\n{genre.upper()}")
    df = pd.DataFrame(features, columns=['Feature', 'Importance'])
    print(df.to_string(index=False))

## Summary

**That's it!** Notice how clean this notebook is:
- No data processing code (handled by `src/data_loader.py`)
- No model implementation details (handled by `src/models.py`)
- No evaluation code (handled by `src/evaluate.py`)

**Next steps:**
- Open `02_compare_models.ipynb` to compare TF-IDF, Word2Vec, and BERT
- Or train from the command line: `python scripts/train.py --config experiments/configs/tfidf_config.yaml`