# Model Comparison: TF-IDF vs Word2Vec vs BERT

This notebook compares three approaches to genre classification:
1. **TF-IDF + Logistic Regression** - Keyword-based baseline
2. **Word2Vec + Logistic Regression** - Semantic embeddings
3. **BERT/DistilBERT** - Transformer-based contextual understanding

All models use the same train/test split for fair comparison.

In [None]:
import sys
from pathlib import Path
import time

# Add the project root to the import path
# (assumes the notebook is run from a direct subdirectory of the project root)
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

from src.data_loader import load_and_prepare_data
from src.models import TFIDFModel, Word2VecModel, BERTModel
from src.evaluate import (
    evaluate_model, 
    plot_confusion_matrix, 
    plot_per_genre_metrics,
    compare_models,
    create_results_table
)
from src.utils import set_seed, get_gpu_info

print("✓ Modules imported successfully!")

## Configuration

In [None]:
# Set seeds for reproducibility
RANDOM_STATE = 42
set_seed(RANDOM_STATE)

# Data parameters
SAMPLES_PER_GENRE = 20000  # Adjust based on available time/resources
TEST_SIZE = 0.2

# Model flags - set to False to skip training
TRAIN_TFIDF = True
TRAIN_WORD2VEC = True
TRAIN_BERT = True  # Set to False if no GPU available

print("Configuration:")
print(f"  Samples per genre: {SAMPLES_PER_GENRE:,}")
print(f"  Test size: {TEST_SIZE}")
print(f"  Random state: {RANDOM_STATE}")
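
The configuration cell relies on `set_seed` from `src.utils`, whose implementation is not shown in this notebook. As a rough sketch of what such a helper typically does, the hypothetical `set_seed_sketch` below seeds Python's `random` module and, if they happen to be installed, NumPy and PyTorch. The function name and the choice of libraries are assumptions for illustration, not the project's actual code.

```python
import random


def set_seed_sketch(seed: int) -> None:
    """Seed every random number generator that is importable here."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


# Re-seeding with the same value replays the same random sequence.
set_seed_sketch(42)
first = random.random()
set_seed_sketch(42)
second = random.random()
print(first == second)
```

Seeding alone does not make GPU training bit-for-bit deterministic, so small run-to-run differences in the BERT results would not be unusual.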

## Check GPU Availability (for BERT)

In [None]:
gpu_info = get_gpu_info()
print(f"CUDA available: {gpu_info['cuda_available']}")
print(f"Number of GPUs: {gpu_info['device_count']}")

for gpu in gpu_info['devices']:
    print(f"  GPU {gpu['id']}: {gpu['name']} ({gpu['memory_total']:.1f} GB)")

if not gpu_info['cuda_available'] and TRAIN_BERT:
    print("\n⚠️  No GPU detected. BERT training will be very slow on CPU.")
    print("Consider setting TRAIN_BERT = False")

## Load Data

Load once and use for all models to ensure fair comparison.

In [None]:
X_train, X_test, y_train, y_test = load_and_prepare_data(
    samples_per_genre=SAMPLES_PER_GENRE,
    test_size=TEST_SIZE,
    use_cached=True,
    random_state=RANDOM_STATE
)

print(f"Training samples: {len(X_train):,}")
print(f"Test samples: {len(X_test):,}")
print(f"Genres: {sorted(y_train.unique())}")
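
The point of loading the data once is that every model sees exactly the same training and test examples. The internals of `load_and_prepare_data` are not shown here, so the following is only a sketch of the general idea: a seeded, per-genre (stratified) split that returns the same partition every time it is called with the same `random_state`. The helper name `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_split(texts, labels, test_size=0.2, random_state=42):
    """Split so that each label contributes the same fraction to the test set."""
    rng = random.Random(random_state)  # local RNG; the same seed gives the same split
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):  # fixed iteration order for reproducibility
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    X_train = [texts[i] for i in train_idx]
    X_test = [texts[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Toy data: 10 examples for each of two genres.
toy_texts = [f"lyric {i}" for i in range(20)]
toy_labels = ["pop"] * 10 + ["rock"] * 10

split_a = stratified_split(toy_texts, toy_labels)
split_b = stratified_split(toy_texts, toy_labels)
print(split_a == split_b)   # identical partitions on repeated calls
print(Counter(split_a[3]))  # test labels per genre
```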

## 1. TF-IDF Model

In [None]:
if TRAIN_TFIDF:
    print("="*70)
    print("TRAINING TF-IDF MODEL")
    print("="*70)
    
    start_time = time.time()
    
    tfidf_model = TFIDFModel(
        classifier_type='logistic',
        max_features=10000,
        ngram_range=(1, 2),
        min_df=5,
        max_df=0.8
    )
    
    tfidf_model.fit(X_train, y_train)
    y_pred_tfidf = tfidf_model.predict(X_test)
    
    tfidf_time = time.time() - start_time
    results_tfidf = evaluate_model(y_test, y_pred_tfidf, model_name="TF-IDF")
    
    print(f"\n✓ TF-IDF trained in {tfidf_time:.1f}s")
    print(f"Accuracy: {results_tfidf['accuracy']:.4f}")
    print(f"Macro F1: {results_tfidf['macro_avg']['f1']:.4f}")
else:
    print("Skipping TF-IDF training")
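
The `min_df=5` and `max_df=0.8` arguments have the same names as scikit-learn's `TfidfVectorizer` parameters. Assuming `TFIDFModel` passes them through with the same meaning (this notebook does not confirm that), they prune the vocabulary by document frequency: an integer `min_df` drops terms found in fewer than that many documents, and a float `max_df` drops terms found in more than that fraction of documents. The sketch below applies that rule to a toy corpus with unigrams only. The helper `prune_vocabulary` is hypothetical.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=2, max_df=0.8):
    """Keep terms with min_df <= document count and document fraction <= max_df."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))  # count each term once per document
    return sorted(
        term
        for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )


toy_docs = [
    "love you baby",
    "love the truck and the road",
    "love the street and the money",
    "love my truck",
    "love money on my mind",
]

vocab = prune_vocabulary(toy_docs, min_df=2, max_df=0.8)
print(vocab)
```

In this toy corpus "love" appears in every document and is removed by `max_df`, while words that occur in a single document are removed by `min_df`. What remains are the mid-frequency terms that can actually separate genres.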

## 2. Word2Vec Model

In [None]:
if TRAIN_WORD2VEC:
    print("="*70)
    print("TRAINING WORD2VEC MODEL")
    print("="*70)
    
    start_time = time.time()
    
    w2v_model = Word2VecModel(
        classifier_type='logistic',
        vector_size=200,
        window=5,
        min_count=5,
        epochs=10
    )
    
    w2v_model.fit(X_train, y_train)
    y_pred_w2v = w2v_model.predict(X_test)
    
    w2v_time = time.time() - start_time
    results_w2v = evaluate_model(y_test, y_pred_w2v, model_name="Word2Vec")
    
    print(f"\n✓ Word2Vec trained in {w2v_time:.1f}s")
    print(f"Accuracy: {results_w2v['accuracy']:.4f}")
    print(f"Macro F1: {results_w2v['macro_avg']['f1']:.4f}")
else:
    print("Skipping Word2Vec training")
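
Word2Vec produces one vector per word, but logistic regression needs one fixed-length vector per lyric. A common way to bridge the two is to average the vectors of the words in each document. Whether `Word2VecModel` does exactly this is an assumption, since its implementation is not shown here. The sketch uses a hypothetical two-dimensional toy embedding table in place of trained vectors.

```python
def mean_pool(tokens, word_vectors, vector_size):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return [0.0] * vector_size
    # zip(*known) groups the values of each dimension across all word vectors
    return [sum(dim_values) / len(known) for dim_values in zip(*known)]


# Hypothetical 2-dimensional embeddings (the notebook uses vector_size=200).
toy_vectors = {
    "love": [1.0, 0.0],
    "truck": [0.0, 1.0],
}

doc_vector = mean_pool(["love", "truck", "unseen"], toy_vectors, vector_size=2)
empty_vector = mean_pool(["unseen"], toy_vectors, vector_size=2)
print(doc_vector, empty_vector)
```

Words that fall below `min_count` during training have no vector and are skipped here, in the same way as `"unseen"` in the example.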

## 3. BERT Model

**Note:** Training takes significantly longer than the other two models (roughly 50-60 minutes with 20K samples per genre on 8x A16 GPUs).

To shorten or skip this step, you can:
- Reduce `SAMPLES_PER_GENRE` for faster testing
- Set `TRAIN_BERT = False` in the configuration cell to skip it
- Use the command-line script instead: `python scripts/train.py --config experiments/configs/bert_config.yaml`

In [None]:
if TRAIN_BERT:
    print("="*70)
    print("TRAINING BERT MODEL")
    print("="*70)
    
    start_time = time.time()
    
    bert_model = BERTModel(
        model_name='distilbert-base-multilingual-cased',
        max_length=256,  # Shorter sequences train faster
        batch_size=96,   # Total batch size across all GPUs
        learning_rate=2e-5,
        epochs=5,        # Early stopping recommended; overfitting was observed after epoch 3
        warmup_ratio=0.1,
        weight_decay=0.1,  # Stronger regularization to reduce overfitting
        use_amp=True     # Mixed precision (up to ~2x faster on supported GPUs)
    )
    
    bert_model.fit(X_train, y_train)
    y_pred_bert = bert_model.predict(X_test)
    
    bert_time = time.time() - start_time
    results_bert = evaluate_model(y_test, y_pred_bert, model_name="BERT")
    
    print(f"\n✓ BERT trained in {bert_time/60:.1f} minutes")
    print(f"Accuracy: {results_bert['accuracy']:.4f}")
    print(f"Macro F1: {results_bert['macro_avg']['f1']:.4f}")
else:
    print("Skipping BERT training")
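
The `warmup_ratio=0.1` setting means the learning rate is ramped up during the first 10% of optimizer steps before it decays. The scheduler that `BERTModel` actually uses is not shown in this notebook. The sketch below assumes the common linear warmup followed by linear decay to zero, and uses a hypothetical training-set size to work out the step counts.

```python
import math


def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


n_train = 10_000  # hypothetical number of training examples
batch_size = 96   # total batch size, as in the cell above
epochs = 5

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * 0.1)

print(f"steps/epoch={steps_per_epoch}, total={total_steps}, warmup={warmup_steps}")
for step in (0, warmup_steps // 2, warmup_steps, total_steps):
    print(step, linear_warmup_lr(step, total_steps))
```

The rate starts at zero, reaches the configured `learning_rate` at the end of warmup, and returns to zero at the final step.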

## Results Comparison

In [None]:
# Collect results from trained models
results_list = []
if TRAIN_TFIDF:
    results_list.append(results_tfidf)
if TRAIN_WORD2VEC:
    results_list.append(results_w2v)
if TRAIN_BERT:
    results_list.append(results_bert)

# Create comparison table
if results_list:
    comparison_df = create_results_table(results_list)
    print("\n" + "="*70)
    print("MODEL COMPARISON")
    print("="*70)
    print(comparison_df.to_string(index=False))
else:
    print("No models trained - enable at least one model above")
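
The training cells report both accuracy and macro F1. Macro F1 is the unweighted mean of the per-genre F1 scores, so a genre that is classified poorly lowers it even when overall accuracy looks fine. The sketch below computes both metrics by hand on made-up labels. It illustrates the metric only and is not the implementation of `evaluate_model`.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up labels: one "rap" example is misclassified as "pop".
toy_true = ["pop", "pop", "pop", "pop", "rap", "rap"]
toy_pred = ["pop", "pop", "pop", "pop", "pop", "rap"]

print(f"Accuracy: {accuracy(toy_true, toy_pred):.4f}")
print(f"Macro F1: {macro_f1(toy_true, toy_pred):.4f}")
```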

## Visualize Model Comparison

In [None]:
if results_list:
    # Compare F1 scores across genres
    compare_models(
        results_list,
        metric='f1',
        title="F1 Score Comparison by Genre"
    )

In [None]:
if results_list:
    # Compare precision across genres
    compare_models(
        results_list,
        metric='precision',
        title="Precision Comparison by Genre"
    )

## Individual Confusion Matrices

In [None]:
if TRAIN_TFIDF:
    plot_confusion_matrix(
        y_test, 
        y_pred_tfidf,
        title="TF-IDF Confusion Matrix",
        normalize=True
    )

In [None]:
if TRAIN_WORD2VEC:
    plot_confusion_matrix(
        y_test, 
        y_pred_w2v,
        title="Word2Vec Confusion Matrix",
        normalize=True
    )

In [None]:
if TRAIN_BERT:
    plot_confusion_matrix(
        y_test, 
        y_pred_bert,
        title="BERT Confusion Matrix",
        normalize=True
    )
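
All three plots pass `normalize=True`. A common convention, assumed here because the implementation of `plot_confusion_matrix` is not shown, is to divide each row by its total so that every row sums to 1 and the diagonal reads as per-genre recall. The helper name and toy labels below are hypothetical.

```python
def normalized_confusion(y_true, y_pred, labels):
    """Row-normalized confusion matrix: rows are true labels, columns are predictions."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for true_label, pred_label in zip(y_true, y_pred):
        counts[index[true_label]][index[pred_label]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([count / total if total else 0.0 for count in row])
    return matrix


# Made-up labels: half of the "rap" examples are predicted as "pop".
cm_true = ["pop", "pop", "pop", "pop", "rap", "rap"]
cm_pred = ["pop", "pop", "pop", "pop", "pop", "rap"]

cm = normalized_confusion(cm_true, cm_pred, labels=["pop", "rap"])
for label, row in zip(["pop", "rap"], cm):
    print(label, row)
```

With row normalization, off-diagonal cells show where a genre's examples end up, which is what makes confusions such as rap versus R&B easy to read off the plot.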

## Key Findings

**Expected performance (based on previous experiments):**

1. **TF-IDF (~61% accuracy)**
   - Fast training (< 1 minute)
   - Good baseline for keyword-based features
   - Best for: country, rap, rock (distinctive vocabulary)
   - Worst for: pop (generic language)

2. **Word2Vec (~56% accuracy)**
   - Moderate training time (2-3 minutes)
   - Underperforms TF-IDF by about 5 percentage points, which is surprising for a semantic model
   - May benefit from a larger corpus or pre-trained embeddings

3. **BERT (~61-63% accuracy)**
   - Slow training (50-60 minutes with 20K samples/genre)
   - Accuracy similar to TF-IDF, with contextual rather than keyword-based features
   - Prone to overfitting after epoch 3 (use early stopping)
   - Best overall, but the marginal gain over TF-IDF may not justify the training cost
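
The early-stopping advice for BERT can be sketched as a patience rule: stop once the validation loss has failed to improve for a set number of consecutive epochs, and keep the checkpoint from the best epoch. The function below is a hypothetical illustration. The loss values are invented to show the mechanism and are not results from this project.

```python
def best_epoch_with_patience(val_losses, patience=1):
    """Return (best_epoch, stopped_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = 0
    stopped_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        stopped_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_epoch, stopped_epoch


# Invented validation losses for five epochs (illustrative only).
toy_val_losses = [1.10, 0.98, 0.95, 0.99, 1.05]

best, stopped = best_epoch_with_patience(toy_val_losses, patience=1)
print(f"best epoch: {best}, training stopped after epoch: {stopped}")
```

With these invented numbers, training stops after epoch 4 and the epoch 3 checkpoint is kept, so the fifth epoch is never run.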

**Common patterns:**
- Pop is consistently the hardest genre to classify across all models
- Rap and R&B are frequently confused, likely because of overlapping themes
- Country is classified best (most distinctive features)

## Next Steps

1. **Error Analysis**: Examine misclassified examples to understand model limitations
2. **Hyperparameter Tuning**: Use the config files in `experiments/configs/` to experiment
3. **Feature Engineering**: Add metadata (artist, year) to improve predictions
4. **Ensemble Methods**: Combine TF-IDF and BERT predictions
5. **Publication**: Use the command-line scripts for reproducible experiments:
   ```bash
   python scripts/train.py --config experiments/configs/tfidf_config.yaml
   python scripts/train.py --config experiments/configs/word2vec_config.yaml
   python scripts/train.py --config experiments/configs/bert_config.yaml
   ```
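
For the ensemble idea in step 4, one simple option is soft voting: average the per-genre probabilities from two models and predict the genre with the highest combined score. This assumes both models can output class probabilities, which the notebook does not show. The function name, the weight, and the probabilities below are all hypothetical.

```python
def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Weighted average of two per-label probability dicts; returns (label, combined)."""
    combined = {
        label: weight_a * probs_a[label] + (1 - weight_a) * probs_b[label]
        for label in sorted(probs_a)
    }
    predicted = max(combined, key=combined.get)
    return predicted, combined


# Invented probabilities for a single lyric (illustrative only).
tfidf_probs = {"pop": 0.5, "rap": 0.1, "rock": 0.4}
bert_probs = {"pop": 0.3, "rap": 0.1, "rock": 0.6}

label, scores = soft_vote(tfidf_probs, bert_probs)
print(label, scores)
```

Here the two models disagree on their own (pop versus rock), and the averaged scores settle on rock. The weight could be tuned on a validation set, which should be kept separate from the test split used for the comparison above.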