# Approach 5: Alternative Models

## Spotify Song Popularity Prediction

**Goal:** Explore models beyond linear regression and basic tree ensembles to improve our best CV RMSE of 10.42.

### Models to Test:
1. **SVR (Support Vector Regression)** - RBF, linear, and polynomial kernels
2. **KNN Regression** - Distance-based predictions
3. **XGBoost** - Advanced gradient boosting
4. **LightGBM** - Fast gradient boosting
5. **Bayesian Ridge** - Automatic regularization

### Pipeline:
- Same leakage-safe methodology as Approach 3 v2
- 5-fold CV with preprocessing inside each fold
- Hyperparameter tuning for top performers
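
As a minimal sketch of what "preprocessing inside each fold" means, the snippet below wraps a scaler and a model in a scikit-learn pipeline, so the scaler is refitted on each training fold and never sees validation rows. It runs on synthetic data; the `_demo` names and the plain `Ridge` model are assumptions for illustration. The notebook itself uses a hand-written loop instead, because the genre list is also learned per fold.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (illustration only, not the Spotify dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# The pipeline is cloned and refitted per fold, so scaling statistics
# come from the training part of each split only
pipe_demo = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv_demo = KFold(n_splits=5, shuffle=True, random_state=42)
neg_rmse = cross_val_score(pipe_demo, X_demo, y_demo, cv=cv_demo,
                           scoring='neg_root_mean_squared_error')
fold_rmse = -neg_rmse  # scikit-learn reports errors as negative scores
print(fold_rmse.round(3))
```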

---

## 1. Import Libraries

In [None]:
# Core libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler

# Previous best models (for comparison)
from sklearn.linear_model import Ridge, Lasso, ElasticNet, BayesianRidge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# NEW models to test
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# XGBoost and LightGBM
try:
    import xgboost as xgb
    XGB_AVAILABLE = True
    print("XGBoost version:", xgb.__version__)
except ImportError:
    XGB_AVAILABLE = False
    print("XGBoost not installed. Run: pip install xgboost")

try:
    import lightgbm as lgb
    LGB_AVAILABLE = True
    print("LightGBM version:", lgb.__version__)
except ImportError:
    LGB_AVAILABLE = False
    print("LightGBM not installed. Run: pip install lightgbm")

# Metrics
from sklearn.metrics import mean_squared_error, r2_score

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

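# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

# Output directory for saved figures (plt.savefig fails if it is missing)
import os
os.makedirs('figures', exist_ok=True)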

print("\nAll libraries loaded successfully!")

---
## 2. Load Data

In [None]:
# Load data
DATA_DIR = '/Users/barbarawerobaobayi/Documents/Strathclyde/Semester 2/Machine Learning for Data Analytics/Spotify Project/data'  # adjust to your local copy
train_df = pd.read_csv(f'{DATA_DIR}/CS98XRegressionTrain.csv')
test_df = pd.read_csv(f'{DATA_DIR}/CS98XRegressionTest.csv')

# Handle missing genres
train_df['top genre'] = train_df['top genre'].fillna('Unknown').replace('', 'Unknown')
test_df['top genre'] = test_df['top genre'].fillna('Unknown').replace('', 'Unknown')

print(f"Training set: {train_df.shape[0]} rows, {train_df.shape[1]} columns")
print(f"Test set:     {test_df.shape[0]} rows, {test_df.shape[1]} columns")
print(f"\nTarget variable (pop) statistics:")
print(train_df['pop'].describe().round(2))

---
## 3. Feature Engineering Functions

Reusing the same pipeline from Approach 3 v2 (leakage-safe).

In [None]:
# Configuration
TOP_N_GENRES = 15
numerical_features = ['bpm', 'nrgy', 'dnce', 'dB', 'live', 'val', 'dur', 'acous', 'spch']

def encode_genres(df, top_genres):
    """
    Encode genres using pre-defined top genres list.
    Genres not in the list are mapped to 'other'.
    """
    df = df.copy()
    df['genre_simplified'] = df['top genre'].apply(
        lambda x: x if x in top_genres else 'other'
    )
    genre_dummies = pd.get_dummies(df['genre_simplified'], prefix='genre')
    df = pd.concat([df, genre_dummies], axis=1)
    return df

def engineer_features(df):
    """
    Create engineered features from the original numerical features.
    """
    df = df.copy()
    
    # INTERACTION TERMS
    df['nrgy_x_dnce'] = df['nrgy'] * df['dnce']
    df['nrgy_x_val'] = df['nrgy'] * df['val']
    df['nrgy_x_dB'] = df['nrgy'] * df['dB']
    df['dnce_x_val'] = df['dnce'] * df['val']
    df['dnce_x_bpm'] = df['dnce'] * df['bpm']
    df['acous_x_nrgy'] = df['acous'] * df['nrgy']
    
    # RATIOS
    df['nrgy_per_bpm'] = df['nrgy'] / (df['bpm'] + 1)
    df['dnce_per_nrgy'] = df['dnce'] / (df['nrgy'] + 1)
    df['val_per_nrgy'] = df['val'] / (df['nrgy'] + 1)
    df['spch_per_dur'] = df['spch'] / (df['dur'] + 1)
    
    # POLYNOMIAL FEATURES
    df['dur_squared'] = df['dur'] ** 2
    df['acous_squared'] = df['acous'] ** 2
    df['dB_squared'] = df['dB'] ** 2
    df['nrgy_squared'] = df['nrgy'] ** 2
    
    # BINNED FEATURES
    df['bpm_slow'] = (df['bpm'] < 100).astype(int)
    df['bpm_medium'] = ((df['bpm'] >= 100) & (df['bpm'] < 130)).astype(int)
    df['bpm_fast'] = (df['bpm'] >= 130).astype(int)
    df['low_energy'] = (df['nrgy'] < 50).astype(int)
    df['high_energy'] = (df['nrgy'] >= 70).astype(int)
    df['is_acoustic'] = (df['acous'] > 50).astype(int)
    df['short_song'] = (df['dur'] < 180).astype(int)
    df['long_song'] = (df['dur'] > 300).astype(int)
    
    # COMPOSITE SCORES
    df['party_score'] = (df['nrgy'] + df['dnce'] + df['val'] - df['acous']) / 4
    df['chill_score'] = (df['acous'] + (100 - df['nrgy']) + (100 - df['dnce'])) / 3
    df['vocal_score'] = df['spch'] + df['live']
    
    return df

# Engineered features list
engineered_features = [
    'nrgy_x_dnce', 'nrgy_x_val', 'nrgy_x_dB', 'dnce_x_val', 'dnce_x_bpm', 'acous_x_nrgy',
    'nrgy_per_bpm', 'dnce_per_nrgy', 'val_per_nrgy', 'spch_per_dur',
    'dur_squared', 'acous_squared', 'dB_squared', 'nrgy_squared',
    'bpm_slow', 'bpm_medium', 'bpm_fast', 'low_energy', 'high_energy',
    'is_acoustic', 'short_song', 'long_song',
    'party_score', 'chill_score', 'vocal_score'
]

print(f"Numerical features: {len(numerical_features)}")
print(f"Engineered features: {len(engineered_features)}")
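
The genre handling can be illustrated on a toy frame. This is a sketch, not the notebook's `encode_genres`: the `encode_demo` helper, the two-genre list, and the `_demo` frames are hypothetical. It shows how genres outside the list collapse to `other`, and how `reindex` gives the validation frame the same columns as the training frame in a single call.

```python
import pandas as pd

top_genres_demo = ['pop', 'rock']  # hypothetical list learned from training rows
train_demo = pd.DataFrame({'top genre': ['pop', 'rock', 'jazz', 'pop']})
val_demo = pd.DataFrame({'top genre': ['pop', 'metal']})

def encode_demo(df, top_genres):
    """One-hot encode genres, mapping anything outside top_genres to 'other'."""
    simplified = df['top genre'].where(df['top genre'].isin(top_genres), 'other')
    return pd.get_dummies(simplified, prefix='genre').astype(int)

train_enc = encode_demo(train_demo, top_genres_demo)
# Align validation columns to training columns; missing ones are filled with 0
val_enc = encode_demo(val_demo, top_genres_demo).reindex(
    columns=train_enc.columns, fill_value=0
)
print(val_enc)
```

Here `metal` never appears in the training rows, so it lands in `genre_other`, while `genre_rock` is added to the validation frame as an all-zero column.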

---
## 4. Leakage-Safe Cross-Validation Function

In [None]:
def full_pipeline_cv(df, features_to_use, model, scale=True, cv=5, verbose=False):
    """
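    Cross-validation with genre selection, encoding, and scaling fitted on
    each fold's training rows only, so no validation rows leak into preprocessing.
    Note: `model` is refitted in place on every fold; pass a fresh or cloned estimator.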
    """
    kf = KFold(n_splits=cv, shuffle=True, random_state=42)
    val_rmses = []
    val_r2s = []
    
    for fold, (train_idx, val_idx) in enumerate(kf.split(df)):
        # Split THIS fold
        fold_train = df.iloc[train_idx].copy()
        fold_val = df.iloc[val_idx].copy()
        
        # Learn top genres from THIS FOLD's training data only
        fold_top_genres = fold_train['top genre'].value_counts().head(TOP_N_GENRES).index.tolist()
        
        # Encode genres
        fold_train_enc = encode_genres(fold_train, fold_top_genres)
        fold_val_enc = encode_genres(fold_val, fold_top_genres)
        
        # Align columns
        fold_genre_cols = [c for c in fold_train_enc.columns if c.startswith('genre_') and c != 'genre_simplified']
        for col in fold_genre_cols:
            if col not in fold_val_enc.columns:
                fold_val_enc[col] = 0
        
        # Apply feature engineering
        fold_train_fe = engineer_features(fold_train_enc)
        fold_val_fe = engineer_features(fold_val_enc)
        
        # Get available features
        available_features = [f for f in features_to_use if f in fold_train_fe.columns]
        
        X_train = fold_train_fe[available_features].copy()
        X_val = fold_val_fe[available_features].copy()
        
        # Handle missing columns in validation
        for col in available_features:
            if col not in X_val.columns:
                X_val[col] = 0
        X_val = X_val[available_features]
        
        y_train = fold_train_fe['pop']
        y_val = fold_val_fe['pop']
        
        # Scale if needed
        if scale:
            scaler = StandardScaler()
            X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns, index=X_train.index)
            X_val = pd.DataFrame(scaler.transform(X_val), columns=X_val.columns, index=X_val.index)
        
        # Train and evaluate
        model.fit(X_train.values, y_train.values)
        y_pred = model.predict(X_val.values)
        
        rmse = np.sqrt(mean_squared_error(y_val, y_pred))
        r2 = r2_score(y_val, y_pred)
        
        val_rmses.append(rmse)
        val_r2s.append(r2)
        
        if verbose:
            print(f"  Fold {fold+1}: RMSE = {rmse:.4f}, R² = {r2:.4f}")
    
    return {
        'cv_rmse': np.mean(val_rmses),
        'cv_rmse_std': np.std(val_rmses),
        'cv_r2': np.mean(val_r2s),
        'cv_r2_std': np.std(val_r2s),
        'fold_rmses': val_rmses
    }

print("Leakage-safe CV function defined!")
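
One consequence of learning the top genres inside each fold is that the selected list can change from fold to fold. The sketch below uses a made-up genre column (`genres_demo`) and a top-2 cut-off, both hypothetical, to print the list each fold would learn. In `full_pipeline_cv`, genre columns that are not in `features_to_use`, or that are absent from a given fold, are dropped when `available_features` is built.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Made-up genre column for illustration only
genres_demo = pd.Series(
    ['pop'] * 6 + ['rock'] * 5 + ['jazz'] * 4 + ['folk'] * 3, name='top genre'
)
kf_demo = KFold(n_splits=3, shuffle=True, random_state=42)

fold_tops = []
for train_idx, val_idx in kf_demo.split(genres_demo):
    # Count genres on the training part of the split only
    top2 = genres_demo.iloc[train_idx].value_counts().head(2).index.tolist()
    fold_tops.append(top2)
    print(top2)
```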

---
## 5. Prepare Feature List

In [None]:
# Get candidate genre column names from the full training data
# (names only; the encoding itself is re-learned inside each CV fold)
top_genres = train_df['top genre'].value_counts().head(TOP_N_GENRES).index.tolist()
temp_encoded = encode_genres(train_df, top_genres)
genre_features = [col for col in temp_encoded.columns if col.startswith('genre_') and col != 'genre_simplified']

# Full feature set
features_all = numerical_features + genre_features + engineered_features

print("="*60)
print("FEATURE SUMMARY")
print("="*60)
print(f"Numerical features:  {len(numerical_features)}")
print(f"Genre features:      {len(genre_features)}")
print(f"Engineered features: {len(engineered_features)}")
print(f"{'─'*30}")
print(f"TOTAL:               {len(features_all)}")

---
## 6. Model Definitions

Let's define all the models we want to test.

In [None]:
# Define models to test
models = {}

# ============================================
# PREVIOUS BEST (for comparison)
# ============================================
models['ElasticNet (baseline)'] = {
    'model': ElasticNet(alpha=0.1, l1_ratio=0.5, random_state=42),
    'scale': True,
    'type': 'Linear'
}

# ============================================
# SUPPORT VECTOR REGRESSION
# ============================================
models['SVR (RBF)'] = {
    'model': SVR(kernel='rbf', C=10, gamma='scale', epsilon=0.1),
    'scale': True,
    'type': 'SVM'
}

models['SVR (Linear)'] = {
    'model': SVR(kernel='linear', C=1),
    'scale': True,
    'type': 'SVM'
}

models['SVR (Poly)'] = {
    'model': SVR(kernel='poly', degree=2, C=1, gamma='scale'),
    'scale': True,
    'type': 'SVM'
}

# ============================================
# K-NEAREST NEIGHBORS
# ============================================
models['KNN (k=5)'] = {
    'model': KNeighborsRegressor(n_neighbors=5, weights='distance', metric='euclidean'),
    'scale': True,
    'type': 'Distance-based'
}

models['KNN (k=10)'] = {
    'model': KNeighborsRegressor(n_neighbors=10, weights='distance', metric='euclidean'),
    'scale': True,
    'type': 'Distance-based'
}

models['KNN (k=15)'] = {
    'model': KNeighborsRegressor(n_neighbors=15, weights='distance', metric='euclidean'),
    'scale': True,
    'type': 'Distance-based'
}

# ============================================
# BAYESIAN RIDGE
# ============================================
models['Bayesian Ridge'] = {
    'model': BayesianRidge(alpha_1=1e-6, alpha_2=1e-6, lambda_1=1e-6, lambda_2=1e-6),
    'scale': True,
    'type': 'Bayesian'
}

# ============================================
# XGBOOST
# ============================================
if XGB_AVAILABLE:
    models['XGBoost'] = {
        'model': xgb.XGBRegressor(
            n_estimators=100,
            max_depth=4,
            learning_rate=0.1,
            subsample=0.8,
            colsample_bytree=0.8,
            reg_alpha=0.1,
            reg_lambda=1.0,
            random_state=42,
            verbosity=0
        ),
        'scale': False,
        'type': 'Boosting'
    }
    
    models['XGBoost (tuned)'] = {
        'model': xgb.XGBRegressor(
            n_estimators=200,
            max_depth=3,
            learning_rate=0.05,
            subsample=0.7,
            colsample_bytree=0.7,
            reg_alpha=0.5,
            reg_lambda=2.0,
            min_child_weight=3,
            random_state=42,
            verbosity=0
        ),
        'scale': False,
        'type': 'Boosting'
    }

# ============================================
# LIGHTGBM
# ============================================
if LGB_AVAILABLE:
    models['LightGBM'] = {
        'model': lgb.LGBMRegressor(
            n_estimators=100,
            max_depth=4,
            learning_rate=0.1,
            subsample=0.8,
            colsample_bytree=0.8,
            reg_alpha=0.1,
            reg_lambda=1.0,
            random_state=42,
            verbosity=-1
        ),
        'scale': False,
        'type': 'Boosting'
    }
    
    models['LightGBM (tuned)'] = {
        'model': lgb.LGBMRegressor(
            n_estimators=200,
            max_depth=3,
            learning_rate=0.05,
            subsample=0.7,
            colsample_bytree=0.7,
            reg_alpha=0.5,
            reg_lambda=2.0,
            min_child_samples=10,
            random_state=42,
            verbosity=-1
        ),
        'scale': False,
        'type': 'Boosting'
    }

print(f"Total models to test: {len(models)}")
print("\nModels by type:")
for name, config in models.items():
    print(f"  - {name} ({config['type']})")
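
The `scale` flag is set to `True` for the distance- and kernel-based models and `False` for the tree boosters. A small NumPy sketch of why this matters for KNN follows. The three "songs" and the query point are hypothetical, and the second column is an invented 0-1 feature. Without scaling, the large-valued column dominates the Euclidean distance and changes which neighbour is nearest.

```python
import numpy as np

# Hypothetical rows: column 0 = duration in seconds, column 1 = a 0-1 ratio
X_demo = np.array([
    [200.0, 0.10],
    [205.0, 0.90],
    [260.0, 0.12],
])
query_demo = np.array([228.0, 0.11])

def nearest(X, q):
    """Index of the row in X with the smallest Euclidean distance to q."""
    return int(np.argmin(np.linalg.norm(X - q, axis=1)))

nearest_raw = nearest(X_demo, query_demo)

# Standardize with statistics from X_demo only, then apply them to the query
mu, sigma = X_demo.mean(axis=0), X_demo.std(axis=0)
nearest_scaled = nearest((X_demo - mu) / sigma, (query_demo - mu) / sigma)

print(nearest_raw, nearest_scaled)
```

On the raw values the nearest row is the one closest in duration, even though its ratio feature is very different. After standardization both columns contribute comparably and a different row wins.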

---
## 7. Run Model Comparison

In [None]:
print("="*70)
print("MODEL COMPARISON WITH LEAKAGE-SAFE 5-FOLD CV")
print("="*70)
print(f"\nDataset: {len(train_df)} samples, {len(features_all)} features")
print(f"Previous best (ElasticNet): CV RMSE = 10.42")
print("\n" + "-"*70)

results = []

for name, config in models.items():
    print(f"\nEvaluating: {name}...")
    
    # Clone the model to avoid state issues
    from sklearn.base import clone
    model = clone(config['model'])
    
    result = full_pipeline_cv(
        train_df, 
        features_all, 
        model, 
        scale=config['scale'], 
        cv=5,
        verbose=False
    )
    
    result['Model'] = name
    result['Type'] = config['type']
    result['Scale'] = config['scale']
    results.append(result)
    
    # Status label relative to the previous best (10.42)
    if result['cv_rmse'] < 10.42:
        status = "NEW BEST!"
    elif result['cv_rmse'] < 10.50:
        status = "Competitive"
    else:
        status = ""
    
    print(f"  CV RMSE: {result['cv_rmse']:.4f} (+/- {result['cv_rmse_std']:.4f}) {status}")
    print(f"  CV R²:   {result['cv_r2']:.4f} (+/- {result['cv_r2_std']:.4f})")
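
A model is labelled "NEW BEST!" whenever its mean RMSE falls below 10.42, regardless of the fold-to-fold spread. Because every model is evaluated on the same `KFold` splits, the `fold_rmses` lists can be compared fold by fold. The sketch below does this with made-up fold values (not results from this notebook). The standard error is only a rough guide, since CV folds are not independent.

```python
import numpy as np

# Made-up fold RMSEs for illustration only (NOT results from this notebook)
baseline_folds = np.array([10.1, 10.6, 10.3, 10.7, 10.4])
candidate_folds = np.array([10.0, 10.7, 10.2, 10.6, 10.4])

# Paired differences: both models were scored on identical splits
diff = candidate_folds - baseline_folds
mean_diff = diff.mean()
se_diff = diff.std(ddof=1) / np.sqrt(len(diff))

print(f"mean difference: {mean_diff:.3f} +/- {se_diff:.3f} (standard error)")
```

If the mean difference is no larger than its standard error, as in this invented example, the ranking between the two models should be treated as a tie rather than a clear improvement.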

---
## 8. Results Summary

In [None]:
# Create results DataFrame
results_df = pd.DataFrame(results)[['Model', 'Type', 'cv_rmse', 'cv_rmse_std', 'cv_r2', 'cv_r2_std']]
results_df = results_df.sort_values('cv_rmse').reset_index(drop=True)

print("="*80)
print("RESULTS SUMMARY (sorted by CV RMSE)")
print("="*80)
print(results_df.round(4).to_string(index=False))

# Highlight best
best_model = results_df.iloc[0]
print(f"\n{'='*80}")
print(f"BEST MODEL: {best_model['Model']}")
print(f"CV RMSE: {best_model['cv_rmse']:.4f}")
print(f"CV R²: {best_model['cv_r2']:.4f}")
print(f"{'='*80}")

In [None]:
# Visualization: Model Comparison Bar Chart
fig, ax = plt.subplots(figsize=(14, 8))

# Color by model type
type_colors = {
    'Linear': '#3498db',
    'SVM': '#e74c3c',
    'Distance-based': '#2ecc71',
    'Bayesian': '#9b59b6',
    'Boosting': '#f39c12'
}

colors = [type_colors.get(t, 'gray') for t in results_df['Type']]

# Highlight best model
colors[0] = '#1a5c1a'  # Dark green for best

bars = ax.barh(results_df['Model'], results_df['cv_rmse'], 
               xerr=results_df['cv_rmse_std'], capsize=4,
               color=colors, edgecolor='black', linewidth=1)

# Add value labels
for bar, val in zip(bars, results_df['cv_rmse']):
    ax.text(val + 0.05, bar.get_y() + bar.get_height()/2, f'{val:.3f}',
            va='center', fontsize=10, fontweight='bold')

# Reference line for previous best
ax.axvline(x=10.42, color='red', linestyle='--', linewidth=2, label='Previous Best (10.42)')

ax.set_xlabel('CV RMSE (Lower is Better)', fontsize=12, fontweight='bold')
ax.set_title('Alternative Models Comparison\n(5-Fold CV, Leakage-Safe Pipeline)', fontsize=14, fontweight='bold')
ax.set_xlim(9.5, max(results_df['cv_rmse']) + 0.5)

# Legend for model types
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=c, edgecolor='black', label=t) for t, c in type_colors.items()]
legend_elements.append(plt.Line2D([0], [0], color='red', linestyle='--', linewidth=2, label='Previous Best'))
ax.legend(handles=legend_elements, loc='lower right', fontsize=10)

plt.tight_layout()
plt.savefig('figures/alternative_models_comparison.png', dpi=150, bbox_inches='tight', facecolor='white')
plt.show()

print("\nFigure saved: figures/alternative_models_comparison.png")

In [None]:
# Visualization: Model Type Comparison
fig, ax = plt.subplots(figsize=(10, 6))

# Group by type and get best from each
# (results_df is sorted by cv_rmse, so 'first' picks each type's best model)
type_best = results_df.groupby('Type').agg({
    'cv_rmse': 'min',
    'Model': 'first'
}).reset_index().sort_values('cv_rmse')

colors = [type_colors.get(t, 'gray') for t in type_best['Type']]

bars = ax.bar(type_best['Type'], type_best['cv_rmse'], color=colors, edgecolor='black', linewidth=2)

for bar, val, model in zip(bars, type_best['cv_rmse'], type_best['Model']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05, 
            f'{val:.3f}\n({model.split(" ")[0]})',
            ha='center', fontsize=10, fontweight='bold')

ax.axhline(y=10.42, color='red', linestyle='--', linewidth=2, label='Previous Best (10.42)')

ax.set_ylabel('Best CV RMSE', fontsize=12, fontweight='bold')
ax.set_title('Best Model from Each Category', fontsize=14, fontweight='bold')
ax.set_ylim(9.5, max(type_best['cv_rmse']) + 0.5)
ax.legend(loc='upper right')

plt.tight_layout()
plt.savefig('figures/model_type_comparison.png', dpi=150, bbox_inches='tight', facecolor='white')
plt.show()

print("\nFigure saved: figures/model_type_comparison.png")

---
## 9. Hyperparameter Tuning for Top Performers

Let's tune the best performing models to see if we can squeeze out more performance.
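
The tuning cells in this section walk their grids with nested `for` loops. A compact equivalent, sketched here with `itertools.product` on the SVR grid, builds one parameter dict per combination. The names `svr_grid_demo` and `combos` are hypothetical.

```python
from itertools import product

svr_grid_demo = {
    'C': [0.1, 1, 10, 50, 100],
    'gamma': ['scale', 'auto', 0.01, 0.1],
    'epsilon': [0.01, 0.1, 0.5],
}

# One dict per combination, e.g. {'C': 0.1, 'gamma': 'scale', 'epsilon': 0.01}
combos = [
    dict(zip(svr_grid_demo, values))
    for values in product(*svr_grid_demo.values())
]

print(len(combos))
print(combos[0])
```

Each dict can then be unpacked into the estimator, for example `SVR(kernel='rbf', **params)`, which keeps the loop body flat however many parameters are searched.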

In [None]:
# Identify top 3 models (excluding baseline)
top_models = results_df[results_df['Model'] != 'ElasticNet (baseline)'].head(3)

print("="*60)
print("TOP 3 NEW MODELS (candidates for tuning)")
print("="*60)
print(top_models[['Model', 'cv_rmse', 'cv_r2']].to_string(index=False))

In [None]:
# Tune SVR (RBF kernel) if any SVR variant ranks in the top 2
if 'SVR' in results_df.iloc[0]['Model'] or 'SVR' in results_df.iloc[1]['Model']:
    print("\n" + "="*60)
    print("HYPERPARAMETER TUNING: SVR (RBF)")
    print("="*60)
    
    svr_params = {
        'C': [0.1, 1, 10, 50, 100],
        'gamma': ['scale', 'auto', 0.01, 0.1],
        'epsilon': [0.01, 0.1, 0.5]
    }
    
    print(f"Testing {len(svr_params['C']) * len(svr_params['gamma']) * len(svr_params['epsilon'])} combinations...")
    
    best_svr_rmse = float('inf')
    best_svr_params = None
    
    for C in svr_params['C']:
        for gamma in svr_params['gamma']:
            for epsilon in svr_params['epsilon']:
                model = SVR(kernel='rbf', C=C, gamma=gamma, epsilon=epsilon)
                result = full_pipeline_cv(train_df, features_all, model, scale=True, cv=5)
                
                if result['cv_rmse'] < best_svr_rmse:
                    best_svr_rmse = result['cv_rmse']
                    best_svr_params = {'C': C, 'gamma': gamma, 'epsilon': epsilon}
                    print(f"  New best: C={C}, gamma={gamma}, epsilon={epsilon} -> RMSE={result['cv_rmse']:.4f}")
    
    print(f"\nBest SVR params: {best_svr_params}")
    print(f"Best SVR RMSE: {best_svr_rmse:.4f}")

In [None]:
# Tune KNN if any KNN variant ranks in the top 3
if any('KNN' in m for m in results_df.head(3)['Model'].values):
    print("\n" + "="*60)
    print("HYPERPARAMETER TUNING: KNN")
    print("="*60)
    
    knn_params = {
        'n_neighbors': [3, 5, 7, 10, 15, 20, 25, 30],
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan']
    }
    
    print(f"Testing {len(knn_params['n_neighbors']) * len(knn_params['weights']) * len(knn_params['metric'])} combinations...")
    
    best_knn_rmse = float('inf')
    best_knn_params = None
    
    for k in knn_params['n_neighbors']:
        for weights in knn_params['weights']:
            for metric in knn_params['metric']:
                model = KNeighborsRegressor(n_neighbors=k, weights=weights, metric=metric)
                result = full_pipeline_cv(train_df, features_all, model, scale=True, cv=5)
                
                if result['cv_rmse'] < best_knn_rmse:
                    best_knn_rmse = result['cv_rmse']
                    best_knn_params = {'n_neighbors': k, 'weights': weights, 'metric': metric}
                    print(f"  New best: k={k}, weights={weights}, metric={metric} -> RMSE={result['cv_rmse']:.4f}")
    
    print(f"\nBest KNN params: {best_knn_params}")
    print(f"Best KNN RMSE: {best_knn_rmse:.4f}")

In [None]:
# Tune XGBoost if it is installed and ranks in the top 5
if XGB_AVAILABLE and any('XGBoost' in m for m in results_df.head(5)['Model'].values):
    print("\n" + "="*60)
    print("HYPERPARAMETER TUNING: XGBoost")
    print("="*60)
    
    # Reduced grid: n_estimators, subsample, colsample_bytree and reg_lambda stay fixed
    xgb_params = {
        'max_depth': [2, 3, 4],
        'learning_rate': [0.03, 0.05, 0.1],
        'reg_alpha': [0.1, 0.5, 1.0]
    }
    
    print(f"Testing {len(xgb_params['max_depth']) * len(xgb_params['learning_rate']) * len(xgb_params['reg_alpha'])} combinations...")
    
    best_xgb_rmse = float('inf')
    best_xgb_params = None
    
    for max_depth in xgb_params['max_depth']:
        for lr in xgb_params['learning_rate']:
            for reg_alpha in xgb_params['reg_alpha']:
                model = xgb.XGBRegressor(
                    n_estimators=200,
                    max_depth=max_depth,
                    learning_rate=lr,
                    subsample=0.8,
                    colsample_bytree=0.8,
                    reg_alpha=reg_alpha,
                    reg_lambda=2.0,
                    random_state=42,
                    verbosity=0
                )
                result = full_pipeline_cv(train_df, features_all, model, scale=False, cv=5)
                
                if result['cv_rmse'] < best_xgb_rmse:
                    best_xgb_rmse = result['cv_rmse']
                    best_xgb_params = {'max_depth': max_depth, 'learning_rate': lr, 'reg_alpha': reg_alpha}
                    print(f"  New best: depth={max_depth}, lr={lr}, alpha={reg_alpha} -> RMSE={result['cv_rmse']:.4f}")
    
    print(f"\nBest XGBoost params: {best_xgb_params}")
    print(f"Best XGBoost RMSE: {best_xgb_rmse:.4f}")

---
## 10. Final Model Selection

In [None]:
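# Build the leaderboard from the Section 7 comparison results
# (tuned scores from Section 9 are printed above but not merged in here)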
final_results = []

# Add baseline ElasticNet
final_results.append({
    'Model': 'ElasticNet (Approach 3 v2)',
    'cv_rmse': 10.42,
    'Status': 'Previous Best'
})

# Add every new model from the comparison (the leaderboard keeps the 10 best)
for _, row in results_df.iterrows():
    if row['Model'] != 'ElasticNet (baseline)':
        final_results.append({
            'Model': row['Model'],
            'cv_rmse': row['cv_rmse'],
            'Status': 'New' if row['cv_rmse'] < 10.42 else 'Tested'
        })

final_df = pd.DataFrame(final_results).sort_values('cv_rmse').head(10)

print("="*70)
print("FINAL LEADERBOARD")
print("="*70)
print(final_df.to_string(index=False))

---
## 11. Generate Submissions for Top Models

In [None]:
def generate_submission(model, model_name, scale=True):
    """
    Train on full training data and generate test predictions.
    """
    # Learn top genres from full training data
    final_top_genres = train_df['top genre'].value_counts().head(TOP_N_GENRES).index.tolist()
    
    # Encode
    full_train_encoded = encode_genres(train_df, final_top_genres)
    final_test_encoded = encode_genres(test_df, final_top_genres)
    
    # Align columns
    final_genre_cols = [c for c in full_train_encoded.columns if c.startswith('genre_') and c != 'genre_simplified']
    for col in final_genre_cols:
        if col not in final_test_encoded.columns:
            final_test_encoded[col] = 0
    
    # Feature engineering
    full_train_fe = engineer_features(full_train_encoded)
    final_test_fe = engineer_features(final_test_encoded)
    
    # Final feature set
    final_features = numerical_features + final_genre_cols + engineered_features
    
    X_full = full_train_fe[final_features].copy()
    X_test_final = final_test_fe[final_features].copy()
    y_full = full_train_fe['pop']
    
    # Ensure columns match
    for col in final_features:
        if col not in X_test_final.columns:
            X_test_final[col] = 0
    X_test_final = X_test_final[final_features]
    
    # Scale if needed
    if scale:
        scaler = StandardScaler()
        X_full = pd.DataFrame(scaler.fit_transform(X_full), columns=X_full.columns, index=X_full.index)
        X_test_final = pd.DataFrame(scaler.transform(X_test_final), columns=X_test_final.columns, index=X_test_final.index)
    
    # Train final model
    model.fit(X_full.values, y_full.values)
    
    # Generate predictions
    test_predictions = model.predict(X_test_final.values)
    
    # Create submission
    submission = pd.DataFrame({
        'Id': final_test_fe['Id'],
        'pop': test_predictions
    })
    
    # Clean filename
    clean_name = model_name.lower().replace(' ', '_').replace('(', '').replace(')', '')
    filename = f'./submission_approach5_{clean_name}.csv'
    submission.to_csv(filename, index=False)
    
    return filename, test_predictions

print("Submission function defined!")
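
Regression models can predict values outside the range of the target. Assuming `pop` is on Spotify's 0-100 popularity scale (an assumption, not something this notebook checks), predictions could be clipped before they are written to the submission file. The sketch below uses hypothetical prediction values.

```python
import numpy as np

# Hypothetical raw predictions, two of them outside the assumed 0-100 range
raw_preds_demo = np.array([-3.2, 41.7, 68.0, 104.5])

# Clip to the assumed valid range of the target
clipped_preds = np.clip(raw_preds_demo, 0, 100)
print(clipped_preds)
```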

In [None]:
# Generate submissions for top models
print("="*70)
print("GENERATING SUBMISSIONS FOR TOP MODELS")
print("="*70)

submissions = []

# Get top 5 models
top_5 = results_df.head(5)

for _, row in top_5.iterrows():
    model_name = row['Model']
    model_config = models[model_name]
    
    print(f"\nGenerating: {model_name}...")
    
    # Clone model
    from sklearn.base import clone
    model = clone(model_config['model'])
    
    filename, predictions = generate_submission(model, model_name, scale=model_config['scale'])
    
    submissions.append({
        'Model': model_name,
        'CV RMSE': row['cv_rmse'],
        'File': filename,
        'Pred Mean': predictions.mean(),
        'Pred Std': predictions.std()
    })
    
    print(f"  Saved: {filename}")
    print(f"  Predictions: mean={predictions.mean():.2f}, std={predictions.std():.2f}")

print("\n" + "="*70)
print("SUBMISSION FILES CREATED")
print("="*70)
submissions_df = pd.DataFrame(submissions)
print(submissions_df.to_string(index=False))

---
## 12. Ensemble: Combine Best Models

In [None]:
# Create an ensemble of top performers
print("="*70)
print("ENSEMBLE: COMBINING TOP MODELS")
print("="*70)

# Get predictions from top 3 models for blending
top_3_names = results_df.head(3)['Model'].values
print(f"\nBlending predictions from: {list(top_3_names)}")

# Generate predictions for each
ensemble_preds = {}

for model_name in top_3_names:
    model_config = models[model_name]
    from sklearn.base import clone
    model = clone(model_config['model'])
    _, preds = generate_submission(model, model_name, scale=model_config['scale'])
    ensemble_preds[model_name] = preds
    print(f"  {model_name}: mean={preds.mean():.2f}")

# Simple average blend
blend_avg = np.mean([ensemble_preds[m] for m in top_3_names], axis=0)

# Weighted blend (inversely proportional to CV RMSE)
rmses = results_df.head(3)['cv_rmse'].values
weights = 1 / rmses
weights = weights / weights.sum()  # Normalize

blend_weighted = np.average([ensemble_preds[m] for m in top_3_names], axis=0, weights=weights)

print(f"\nBlend weights (inverse RMSE): {dict(zip(top_3_names, weights.round(3)))}")
print(f"\nSimple Average: mean={blend_avg.mean():.2f}, std={blend_avg.std():.2f}")
print(f"Weighted Blend: mean={blend_weighted.mean():.2f}, std={blend_weighted.std():.2f}")
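
When the blended models have similar CV RMSEs, inverse-RMSE weights end up very close to a simple average. The sketch below shows this with hypothetical RMSE values (not results from this notebook).

```python
import numpy as np

# Hypothetical CV RMSEs for three models (illustration only)
rmses_demo = np.array([10.3, 10.4, 10.5])

weights_demo = 1 / rmses_demo
weights_demo = weights_demo / weights_demo.sum()  # normalize to sum to 1

print(weights_demo.round(4))
print(f"spread between largest and smallest weight: {np.ptp(weights_demo):.4f}")
```

With RMSEs this close together, every weight stays within about one percentage point of 1/3, so the weighted and simple blends should produce almost identical predictions.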

In [None]:
# Save ensemble submissions
# Load test IDs
test_ids = test_df['Id']

# Simple average blend
submission_avg = pd.DataFrame({'Id': test_ids, 'pop': blend_avg})
submission_avg.to_csv('./submission_approach5_ensemble_avg.csv', index=False)
print("Saved: submission_approach5_ensemble_avg.csv")

# Weighted blend
submission_weighted = pd.DataFrame({'Id': test_ids, 'pop': blend_weighted})
submission_weighted.to_csv('./submission_approach5_ensemble_weighted.csv', index=False)
print("Saved: submission_approach5_ensemble_weighted.csv")

---
## 13. Summary and Conclusions

In [None]:
# Final visualization: All approaches comparison
fig, ax = plt.subplots(figsize=(12, 8))

all_approaches = [
    ('Baseline (no genre)', 11.27, 'Previous'),
    ('Approach 1 (Top 15 genres)', 10.93, 'Previous'),
    ('Approach 2 (Hybrid genre)', 11.05, 'Previous'),
    ('Approach 3 v1 (Feature eng)', 10.80, 'Previous'),
    ('Approach 3 v2 (Leakage-safe)', 10.42, 'Previous Best'),
    ('Approach 4 (Ensemble)', 10.43, 'Previous'),
]

# Add top 3 from this approach
for _, row in results_df.head(3).iterrows():
    all_approaches.append((f"Approach 5: {row['Model']}", row['cv_rmse'], 'New'))

# Sort by RMSE
all_approaches.sort(key=lambda x: x[1])

names = [a[0] for a in all_approaches]
rmses = [a[1] for a in all_approaches]
categories = [a[2] for a in all_approaches]

colors = []
for cat in categories:
    if cat == 'New':
        colors.append('#27ae60')  # Green for new
    elif cat == 'Previous Best':
        colors.append('#f39c12')  # Orange for previous best
    else:
        colors.append('#3498db')  # Blue for previous

bars = ax.barh(names, rmses, color=colors, edgecolor='black', linewidth=1)

# Add value labels
for bar, val in zip(bars, rmses):
    ax.text(val + 0.02, bar.get_y() + bar.get_height()/2, f'{val:.2f}',
            va='center', fontsize=10, fontweight='bold')

ax.set_xlabel('CV RMSE (Lower is Better)', fontsize=12, fontweight='bold')
ax.set_title('All Approaches Comparison\nSpotify Popularity Prediction', fontsize=14, fontweight='bold')
ax.set_xlim(10, max(11.5, max(rmses) + 0.3))  # widen the axis if a new model exceeds 11.5

# Legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='#27ae60', edgecolor='black', label='New (Approach 5)'),
    Patch(facecolor='#f39c12', edgecolor='black', label='Previous Best'),
    Patch(facecolor='#3498db', edgecolor='black', label='Previous Approaches')
]
ax.legend(handles=legend_elements, loc='lower right')

plt.tight_layout()
plt.savefig('figures/all_approaches_comparison.png', dpi=150, bbox_inches='tight', facecolor='white')
plt.show()

print("\nFigure saved: figures/all_approaches_comparison.png")

In [None]:
print("="*70)
print("APPROACH 5 SUMMARY")
print("="*70)

print("\n MODELS TESTED:")
print("  - SVR (RBF, Linear, Polynomial kernels)")
print("  - KNN Regression (k=5, 10, 15)")
print("  - Bayesian Ridge")
if XGB_AVAILABLE:
    print("  - XGBoost (default and tuned)")
if LGB_AVAILABLE:
    print("  - LightGBM (default and tuned)")

print("\n TOP PERFORMERS:")
for i, row in results_df.head(5).iterrows():
    status = " (BEATS BASELINE!)" if row['cv_rmse'] < 10.42 else ""
    print(f"  {i+1}. {row['Model']}: {row['cv_rmse']:.4f}{status}")

best = results_df.iloc[0]
print(f"\n BEST MODEL: {best['Model']}")
print(f"   CV RMSE: {best['cv_rmse']:.4f}")
print(f"   CV R²: {best['cv_r2']:.4f}")

print("\n SUBMISSIONS GENERATED:")
print("  - Individual model submissions (top 5)")
print("  - Ensemble average blend")
print("  - Ensemble weighted blend")

print("\n" + "="*70)

---
## Key Takeaways

### What We Learned:

1. **SVR with RBF kernel** can be competitive on small datasets
   - Captures non-linear relationships
   - Requires careful tuning of C, gamma, and epsilon

2. **KNN Regression** provides a simple but effective baseline
   - Works well when similar songs have similar popularity
   - Sensitive to the choice of k and distance metric

3. **XGBoost/LightGBM** need careful regularization on small datasets
   - Prone to overfitting with default parameters
   - Low max_depth and high regularization help

4. **Bayesian Ridge** estimates its regularization strength from the data
   - No manual alpha search is needed, and predictions can come with uncertainty estimates
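
As a minimal sketch of the Bayesian Ridge point, the snippet below fits scikit-learn's `BayesianRidge` on synthetic data (the `_demo` arrays are assumptions, not the Spotify features). It reads back the estimated precisions `alpha_` and `lambda_`, then requests a per-prediction standard deviation with `return_std=True`.

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=80)

br_demo = BayesianRidge().fit(X_demo, y_demo)

# Noise precision and weight precision, both estimated during fitting
print(f"alpha_ = {br_demo.alpha_:.3f}, lambda_ = {br_demo.lambda_:.3f}")

# Predictive mean and standard deviation for the first five rows
mean_pred, std_pred = br_demo.predict(X_demo[:5], return_std=True)
print(std_pred.round(3))
```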

### Recommendations for Kaggle:

Try submitting in this order:
1. Best individual model from this notebook
2. Weighted ensemble blend
3. ElasticNet (previous best) if new models don't improve

Remember: a better CV score does not guarantee a better Kaggle score, because the leaderboard is computed on different data.