# 03 - Model Training: Computer Price Prediction

This notebook trains and evaluates models to predict computer prices.

**Data is pre-processed in notebook 02 with:**
- Features with >60% missing values removed
- CPU/GPU matching with close neighbor and family mean imputation
- Complete benchmark features: mark, rank, value, price for both CPU and GPU
- Target leakage columns excluded (cpu_match_score, gpu_match_score, Ofertas)

**Models compared:**
1. Baseline (DummyRegressor - predicts mean)
2. Ridge Regression (L2 regularization)
3. RandomForestRegressor
4. HistGradientBoostingRegressor
5. CatBoostRegressor (with native categorical handling)
6. CatBoost Quantile models (for price range prediction)

**Metrics:**
- RMSE (primary)
- MAE
- R²
- MAPE
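
For reference, the four metrics can be computed directly with NumPy. This is a minimal sketch and not the code behind `evaluate_predictions`, whose internals are not shown in this notebook; the function name `regression_metrics` is hypothetical, and the dictionary keys are assumed to mirror the ones printed in section 6.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R² and MAPE (in percent) for two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    ss_res = np.sum(err ** 2)                        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares

    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": float(1.0 - ss_res / ss_tot),
        # MAPE is undefined when a true value is 0; prices are assumed positive
        "mape": float(np.mean(np.abs(err / y_true)) * 100),
    }


# Toy values for illustration only, not data from this project
metrics = regression_metrics([500, 1000, 1500], [550, 900, 1500])
print(metrics)
```

RMSE penalizes large misses more heavily than MAE, which is one reason to treat it as the primary metric when expensive machines must not be badly mispriced.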

---

## 1. Imports and Setup

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import warnings

# Add src to path
sys.path.append('..')

# Reload modules
for mod in ['src.modeling', 'modeling', 'src.features', 'features']:
    if mod in sys.modules:
        del sys.modules[mod]

from src.modeling import (
    infer_feature_types,
    get_feature_summary,
    build_sklearn_pipeline,
    build_catboost_model,
    prepare_catboost_data,
    evaluate_sklearn_cv,
    evaluate_predictions,
    compare_models,
    save_model,
    load_features_data,
    CATBOOST_AVAILABLE
)

from src.features import get_feature_columns

from sklearn.model_selection import train_test_split

# Display settings
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.2f}'.format)
sns.set_theme(style='whitegrid')
warnings.filterwarnings('ignore')

print("Libraries loaded!")
print(f"CatBoost available: {CATBOOST_AVAILABLE}")

## 2. Load Pre-Processed Data

Data has been prepared in notebook 02:
- Rows without target removed
- Features with >60% missing values removed
- Only engineered features (starting with _) included
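
The three preparation steps listed above could look roughly like the sketch below. It is an assumption about what notebook 02 does, not a copy of it: the helper `prepare_features` and the columns `_ram_gb` and `_sparse` are hypothetical, and the 60% threshold is taken from the list.

```python
import numpy as np
import pandas as pd


def prepare_features(df, target_col, max_missing=0.60):
    """Hypothetical re-creation of the notebook 02 filtering steps."""
    # 1. Keep only rows that have a target value
    out = df[df[target_col].notna()]

    # 2. Keep engineered columns (names starting with "_") plus the target
    keep = [c for c in out.columns if c.startswith("_")]
    if target_col not in keep:
        keep.append(target_col)
    out = out[keep]

    # 3. Drop feature columns whose share of missing values exceeds the threshold
    missing_share = out.drop(columns=[target_col]).isna().mean()
    too_sparse = missing_share[missing_share > max_missing].index
    return out.drop(columns=too_sparse)


# Toy frame for illustration only
toy = pd.DataFrame({
    "_precio_num": [999.0, np.nan, 1499.0, 799.0, 1200.0],
    "_ram_gb": [16, 8, 32, 8, 16],
    "_sparse": [np.nan, np.nan, np.nan, 1.0, np.nan],
    "Ofertas": ["a", "b", "c", "d", "e"],
})
prepared = prepare_features(toy, "_precio_num")
print(prepared.columns.tolist(), len(prepared))
```

Filtering rows before measuring missingness means the 60% threshold is evaluated only on rows that can actually be used for training.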

In [None]:
DATA_DIR = Path('../data')

# Try to load parquet, fall back to CSV
try:
    df = pd.read_parquet(DATA_DIR / 'db_features.parquet')
    print("Loaded from parquet")
except (ImportError, FileNotFoundError):
    df = pd.read_csv(DATA_DIR / 'db_features.csv')
    print("Loaded from CSV")

print(f"\nDataset shape: {df.shape}")
print(f"Rows: {len(df):,}")
print(f"Columns: {len(df.columns)}")

In [None]:
# Check target variable
TARGET_COL = '_precio_num'

print(f"Target: {TARGET_COL}")
print(f"  Non-null: {df[TARGET_COL].notna().sum():,}")
print(f"  Min: {df[TARGET_COL].min():,.2f}")
print(f"  Max: {df[TARGET_COL].max():,.2f}")
print(f"  Mean: {df[TARGET_COL].mean():,.2f}")
print(f"  Median: {df[TARGET_COL].median():,.2f}")

## 3. Prepare Features

Use pre-computed feature types from the features module.
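
As a rough illustration of what a numeric/categorical split involves, the sketch below separates columns by dtype. It is only an assumption about how such a split could work; `get_feature_columns` may use different rules, and `split_feature_types` and `_brand` are hypothetical names.

```python
import pandas as pd


def split_feature_types(df, target_col):
    """Return (feature_cols, numeric_cols, categorical_cols) based on dtypes."""
    features = df.drop(columns=[target_col])
    numeric_cols = features.select_dtypes(include="number").columns.tolist()
    # Everything that is not numeric is treated as categorical in this sketch
    categorical_cols = [c for c in features.columns if c not in numeric_cols]
    return features.columns.tolist(), numeric_cols, categorical_cols


# Toy frame for illustration only
toy = pd.DataFrame({
    "_precio_num": [999.0, 1499.0],
    "_cpu_mark": [15000.0, 28000.0],
    "_brand": ["acme", "globex"],
})
all_cols, num_cols, cat_cols = split_feature_types(toy, "_precio_num")
print(num_cols, cat_cols)
```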

In [None]:
# Get feature types (data is already prepared in notebook 02)
feature_cols, numeric_cols, categorical_cols = get_feature_columns(df, TARGET_COL)

print(f"Total features: {len(feature_cols)}")
print(f"  Numeric: {len(numeric_cols)}")
print(f"  Categorical: {len(categorical_cols)}")

print(f"\nNumeric features: {numeric_cols}")
print(f"Categorical features: {categorical_cols}")

In [None]:
# Show feature summary
summary = get_feature_summary(df, numeric_cols, categorical_cols)

print("=== Numeric Features ===")
display(summary[summary['type'] == 'numeric'].head(20))

print("\n=== Categorical Features (sample) ===")
display(summary[summary['type'] == 'categorical'].head(20))

In [None]:
# Prepare X and y (data is already cleaned, no need to filter)
X = df[feature_cols].copy()
y = df[TARGET_COL].copy()

print(f"Training data shape: X={X.shape}, y={y.shape}")

In [None]:
# Train/test split for final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Train: {len(X_train):,} samples")
print(f"Test: {len(X_test):,} samples")

## 4. Train and Evaluate Models

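We train each model on the training split only and compare them with 5-fold cross-validation, collecting the per-model results in `all_results`. The held-out test split is reserved for the final evaluation in section 6.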
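
The cross-validation pattern can be sketched with plain scikit-learn as follows. The helpers `make_ridge_pipeline` and `cv_summary` are hypothetical stand-ins for `build_sklearn_pipeline` and `evaluate_sklearn_cv`, whose implementations are not shown here; the preprocessing choices (median imputation, scaling, one-hot encoding) and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_ridge_pipeline(numeric_cols, categorical_cols):
    """Preprocessing plus Ridge in one pipeline, so each CV fold is fitted cleanly."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, numeric_cols),
        ("cat", categorical, categorical_cols),
    ])
    return Pipeline([("preprocess", preprocess), ("model", Ridge(alpha=1.0))])


def cv_summary(pipeline, X, y, cv=5):
    """Run k-fold CV once and summarize RMSE, MAE and R² as mean and std."""
    scoring = {
        "rmse": "neg_root_mean_squared_error",
        "mae": "neg_mean_absolute_error",
        "r2": "r2",
    }
    scores = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)
    return {
        # sklearn reports errors as negative scores, so flip the sign of the means
        "rmse_mean": float(-scores["test_rmse"].mean()),
        "rmse_std": float(scores["test_rmse"].std()),
        "mae_mean": float(-scores["test_mae"].mean()),
        "mae_std": float(scores["test_mae"].std()),
        "r2_mean": float(scores["test_r2"].mean()),
        "r2_std": float(scores["test_r2"].std()),
    }


# Synthetic data for illustration only
rng = np.random.default_rng(42)
n = 200
X_toy = pd.DataFrame({
    "_cpu_mark": rng.uniform(5_000, 30_000, size=n),
    "_brand": pd.Series(rng.choice(["a", "b", "c"], size=n), dtype=object),
})
brand_offset = X_toy["_brand"].map({"a": 0.0, "b": 100.0, "c": 200.0}).astype(float)
y_toy = 500 + 0.05 * X_toy["_cpu_mark"] + brand_offset + rng.normal(0, 50, size=n)

summary = cv_summary(make_ridge_pipeline(["_cpu_mark"], ["_brand"]), X_toy, y_toy)
print(summary)
```

Keeping imputation and encoding inside the pipeline means they are re-fitted on each training fold, so nothing from the validation fold leaks into preprocessing.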

In [None]:
# Dictionary to store results
all_results = {}

### 4.1 Baseline Model (DummyRegressor)

In [None]:
print("=" * 60)
print("BASELINE MODEL (Predict Mean)")
print("=" * 60)

baseline_pipeline = build_sklearn_pipeline('dummy', numeric_cols, categorical_cols)
baseline_results = evaluate_sklearn_cv(baseline_pipeline, X_train, y_train, cv=5)

all_results['Baseline (Mean)'] = baseline_results

print(f"\nCross-Validation Results:")
print(f"  RMSE: {baseline_results['rmse_mean']:,.2f} (+/- {baseline_results['rmse_std']:,.2f})")
print(f"  MAE:  {baseline_results['mae_mean']:,.2f} (+/- {baseline_results['mae_std']:,.2f})")
print(f"  R²:   {baseline_results['r2_mean']:.4f} (+/- {baseline_results['r2_std']:.4f})")

### 4.2 Ridge Regression

In [None]:
print("=" * 60)
print("RIDGE REGRESSION")
print("=" * 60)

ridge_pipeline = build_sklearn_pipeline('ridge', numeric_cols, categorical_cols)
ridge_results = evaluate_sklearn_cv(ridge_pipeline, X_train, y_train, cv=5)

all_results['Ridge'] = ridge_results

print(f"\nCross-Validation Results:")
print(f"  RMSE: {ridge_results['rmse_mean']:,.2f} (+/- {ridge_results['rmse_std']:,.2f})")
print(f"  MAE:  {ridge_results['mae_mean']:,.2f} (+/- {ridge_results['mae_std']:,.2f})")
print(f"  R²:   {ridge_results['r2_mean']:.4f} (+/- {ridge_results['r2_std']:.4f})")

### 4.3 Random Forest

In [None]:
print("=" * 60)
print("RANDOM FOREST")
print("=" * 60)

rf_pipeline = build_sklearn_pipeline('random_forest', numeric_cols, categorical_cols)
rf_results = evaluate_sklearn_cv(rf_pipeline, X_train, y_train, cv=5)

all_results['Random Forest'] = rf_results

print(f"\nCross-Validation Results:")
print(f"  RMSE: {rf_results['rmse_mean']:,.2f} (+/- {rf_results['rmse_std']:,.2f})")
print(f"  MAE:  {rf_results['mae_mean']:,.2f} (+/- {rf_results['mae_std']:,.2f})")
print(f"  R²:   {rf_results['r2_mean']:.4f} (+/- {rf_results['r2_std']:.4f})")

### 4.4 HistGradientBoosting

In [None]:
print("=" * 60)
print("HIST GRADIENT BOOSTING")
print("=" * 60)

hgb_pipeline = build_sklearn_pipeline('hist_gradient_boosting', numeric_cols, categorical_cols)
hgb_results = evaluate_sklearn_cv(hgb_pipeline, X_train, y_train, cv=5)

all_results['HistGradientBoosting'] = hgb_results

print(f"\nCross-Validation Results:")
print(f"  RMSE: {hgb_results['rmse_mean']:,.2f} (+/- {hgb_results['rmse_std']:,.2f})")
print(f"  MAE:  {hgb_results['mae_mean']:,.2f} (+/- {hgb_results['mae_std']:,.2f})")
print(f"  R²:   {hgb_results['r2_mean']:.4f} (+/- {hgb_results['r2_std']:.4f})")

### 4.5 CatBoost (if available)

In [None]:
if CATBOOST_AVAILABLE:
    from sklearn.model_selection import cross_validate
    from catboost import CatBoostRegressor
    
    print("=" * 60)
    print("CATBOOST")
    print("=" * 60)
    
    # Prepare data for CatBoost (data is already cleaned from notebook 02)
    X_cb, y_cb = prepare_catboost_data(
        df.copy(),
        numeric_cols,
        categorical_cols,
        TARGET_COL
    )
    
    X_cb_train, X_cb_test, y_cb_train, y_cb_test = train_test_split(
        X_cb, y_cb, test_size=0.2, random_state=42
    )
    
    # Build CatBoost model
    cat_model = build_catboost_model(categorical_cols, loss_function='RMSE')
    
    # Cross-validation using sklearn: a single pass computes all three metrics,
    # so each fold is fitted once instead of three times
    cv_scores = cross_validate(
        cat_model, X_cb_train, y_cb_train,
        cv=5,
        scoring={
            'rmse': 'neg_root_mean_squared_error',
            'mae': 'neg_mean_absolute_error',
            'r2': 'r2',
        },
    )
    
    # Error scorers are negated by sklearn, so flip the sign of the means
    cat_results = {
        'rmse_mean': -cv_scores['test_rmse'].mean(),
        'rmse_std': cv_scores['test_rmse'].std(),
        'mae_mean': -cv_scores['test_mae'].mean(),
        'mae_std': cv_scores['test_mae'].std(),
        'r2_mean': cv_scores['test_r2'].mean(),
        'r2_std': cv_scores['test_r2'].std(),
    }
    
    all_results['CatBoost'] = cat_results
    
    print(f"\nCross-Validation Results:")
    print(f"  RMSE: {cat_results['rmse_mean']:,.2f} (+/- {cat_results['rmse_std']:,.2f})")
    print(f"  MAE:  {cat_results['mae_mean']:,.2f} (+/- {cat_results['mae_std']:,.2f})")
    print(f"  R²:   {cat_results['r2_mean']:.4f} (+/- {cat_results['r2_std']:.4f})")
else:
    print("CatBoost not available. Install with: pip install catboost")

## 5. Model Comparison
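
A comparison table of this kind can be built from the results dictionary with a few lines of pandas. The sketch below is an assumption about the shape `compare_models` returns (a `Model` column, sorted by ascending `rmse_mean`), inferred from how `comparison` is used in this notebook; `build_comparison` is a hypothetical name and the numbers are placeholders, not results from this project.

```python
import pandas as pd


def build_comparison(all_results):
    """Turn {model_name: metrics_dict} into a table sorted by RMSE (best first)."""
    table = pd.DataFrame.from_dict(all_results, orient="index")
    table = table.rename_axis("Model").reset_index()
    return table.sort_values("rmse_mean").reset_index(drop=True)


# Placeholder numbers for illustration only
toy_results = {
    "Baseline (Mean)": {"rmse_mean": 600.0, "mae_mean": 450.0, "r2_mean": -0.01},
    "Ridge": {"rmse_mean": 300.0, "mae_mean": 210.0, "r2_mean": 0.75},
    "Random Forest": {"rmse_mean": 220.0, "mae_mean": 140.0, "r2_mean": 0.86},
}
toy_comparison = build_comparison(toy_results)
print(toy_comparison)
```

With the table sorted this way, the first row is the model with the lowest cross-validated RMSE.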

In [None]:
# Compare all models
comparison = compare_models(all_results)

print("=" * 80)
print("MODEL COMPARISON (sorted by RMSE)")
print("=" * 80)
display(comparison)

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# RMSE
ax = axes[0]
ax.barh(comparison['Model'], comparison['rmse_mean'], xerr=comparison['rmse_std'], capsize=5)
ax.set_xlabel('RMSE (€)')
ax.set_title('RMSE by Model')
ax.invert_yaxis()

# MAE
ax = axes[1]
ax.barh(comparison['Model'], comparison['mae_mean'], xerr=comparison['mae_std'], capsize=5, color='orange')
ax.set_xlabel('MAE (€)')
ax.set_title('MAE by Model')
ax.invert_yaxis()

# R²
ax = axes[2]
ax.barh(comparison['Model'], comparison['r2_mean'], xerr=comparison['r2_std'], capsize=5, color='green')
ax.set_xlabel('R²')
ax.set_title('R² by Model')
ax.invert_yaxis()
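# The baseline's R² can be slightly negative, so extend the axis below 0 if needed
ax.set_xlim(min(0, comparison['r2_mean'].min()), 1)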

plt.tight_layout()
plt.show()

## 6. Train Best Model on Full Training Data

In [None]:
# Select best model based on RMSE
best_model_name = comparison.iloc[0]['Model']
print(f"Best model: {best_model_name}")

# Train on full training data
if best_model_name == 'CatBoost' and CATBOOST_AVAILABLE:
    # Train CatBoost
    final_model = build_catboost_model(categorical_cols, loss_function='RMSE')
    final_model.fit(X_cb_train, y_cb_train)
    
    # Evaluate on test set
    y_pred = final_model.predict(X_cb_test)
    final_metrics = evaluate_predictions(y_cb_test, y_pred)
    
    # Feature columns for metadata
    model_feature_cols = list(X_cb.columns)
    
else:
    # Train sklearn model
    model_type_map = {
        'Random Forest': 'random_forest',
        'HistGradientBoosting': 'hist_gradient_boosting',
        'Ridge': 'ridge',
        'ElasticNet': 'elasticnet',
        'Baseline (Mean)': 'dummy'
    }
    model_type = model_type_map.get(best_model_name, 'random_forest')
    
    final_model = build_sklearn_pipeline(model_type, numeric_cols, categorical_cols)
    final_model.fit(X_train, y_train)
    
    # Evaluate on test set
    y_pred = final_model.predict(X_test)
    final_metrics = evaluate_predictions(y_test, y_pred)
    
    model_feature_cols = feature_cols

print(f"\nTest Set Performance:")
print(f"  RMSE: {final_metrics['rmse']:,.2f}")
print(f"  MAE:  {final_metrics['mae']:,.2f}")
print(f"  R²:   {final_metrics['r2']:.4f}")
print(f"  MAPE: {final_metrics['mape']:.2f}%")

## 7. Prediction Analysis

In [None]:
# Actual vs Predicted plot
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Get the right y values
if best_model_name == 'CatBoost' and CATBOOST_AVAILABLE:
    y_actual = y_cb_test
else:
    y_actual = y_test

# Scatter plot
ax = axes[0]
ax.scatter(y_actual, y_pred, alpha=0.3, s=10)
ax.plot([0, y_actual.max()], [0, y_actual.max()], 'r--', lw=2, label='Perfect prediction')
ax.set_xlabel('Actual Price (€)')
ax.set_ylabel('Predicted Price (€)')
ax.set_title('Actual vs Predicted Prices')
ax.legend()

# Residuals
ax = axes[1]
residuals = y_actual - y_pred
ax.hist(residuals, bins=50, edgecolor='black', alpha=0.7)
ax.axvline(0, color='red', linestyle='--', lw=2)
ax.set_xlabel('Residual (€)')
ax.set_ylabel('Frequency')
ax.set_title(f'Residual Distribution (Mean: {residuals.mean():,.0f}€, Std: {residuals.std():,.0f}€)')

plt.tight_layout()
plt.show()

In [None]:
# Error by price range
results_df = pd.DataFrame({
    'actual': y_actual,
    'predicted': y_pred,
    'error': np.abs(y_actual - y_pred),
    'pct_error': np.abs(y_actual - y_pred) / y_actual * 100
})

# Price bins (open-ended top bin so prices above 10,000 are not dropped as NaN)
results_df['price_bin'] = pd.cut(results_df['actual'], 
                                  bins=[0, 500, 1000, 1500, 2000, 3000, np.inf],
                                  labels=['0-500', '500-1000', '1000-1500', '1500-2000', '2000-3000', '3000+'])

# Stats by price range
print("=== Error by Price Range ===")
error_by_range = results_df.groupby('price_bin', observed=True).agg({
    'actual': 'count',
    'error': ['mean', 'std'],
    'pct_error': 'mean'
}).round(2)
error_by_range.columns = ['Count', 'MAE (€)', 'Std (€)', 'MAPE (%)']
display(error_by_range)

## 8. CatBoost Quantile Regression (Price Range Prediction)
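
Quantile models are trained with the pinball (quantile) loss, and the gap between the 0.1 and 0.9 quantile predictions gives a nominal 80% interval. The NumPy sketch below illustrates both ideas on synthetic prices using constant predictions instead of a fitted model; `pinball_loss` and `interval_coverage` are hypothetical helper names, not part of `src.modeling`.

```python
import numpy as np


def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q (0 < q < 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-predictions are weighted by q, over-predictions by (1 - q)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))


def interval_coverage(y_true, lower, upper):
    """Share of true values that fall inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    return float(np.mean((y_true >= lower) & (y_true <= upper)))


# Synthetic, right-skewed "prices" for illustration only
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=7.0, sigma=0.5, size=5000)

# Constant 10% and 90% quantile "predictions"
low, high = np.quantile(prices, [0.1, 0.9])
coverage = interval_coverage(prices, low, high)

# For q=0.9, the 90% quantile scores better than the median under pinball loss
loss_at_q90 = pinball_loss(prices, high, 0.9)
loss_at_median = pinball_loss(prices, np.median(prices), 0.9)
print(f"coverage={coverage:.3f}, loss@q90={loss_at_q90:.1f}, loss@median={loss_at_median:.1f}")
```

On real test data the observed coverage can deviate from the nominal 80%, which is why the cell below compares expected and actual coverage.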

In [None]:
if CATBOOST_AVAILABLE:
    print("=" * 60)
    print("QUANTILE REGRESSION (Price Range Prediction)")
    print("=" * 60)
    
    # Train models for different quantiles
    quantiles = [0.1, 0.5, 0.9]
    quantile_models = {}
    
    for q in quantiles:
        print(f"\nTraining quantile={q} model...")
        q_model = build_catboost_model(categorical_cols, loss_function='Quantile', quantile=q)
        q_model.fit(X_cb_train, y_cb_train)
        quantile_models[q] = q_model
    
    # Predict on test set
    pred_low = quantile_models[0.1].predict(X_cb_test)
    pred_mid = quantile_models[0.5].predict(X_cb_test)
    pred_high = quantile_models[0.9].predict(X_cb_test)
    
    # Check coverage
    in_range = (y_cb_test >= pred_low) & (y_cb_test <= pred_high)
    coverage = in_range.mean() * 100
    
    print(f"\n=== Quantile Prediction Results ===")
    print(f"Expected coverage (10%-90% interval): 80%")
    print(f"Actual coverage: {coverage:.1f}%")
    print(f"Average interval width: {(pred_high - pred_low).mean():,.0f}€")
else:
    print("CatBoost not available - skipping quantile regression")

In [None]:
if CATBOOST_AVAILABLE:
    # Visualize quantile predictions
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # Sample for visualization (seeded so the plot is reproducible)
    rng = np.random.default_rng(42)
    sample_idx = rng.choice(len(y_cb_test), min(100, len(y_cb_test)), replace=False)
    sample_idx = np.sort(sample_idx)
    
    x_pos = np.arange(len(sample_idx))
    
    # Plot intervals (pred_low, pred_mid, pred_high are numpy arrays)
    ax.fill_between(x_pos, pred_low[sample_idx], pred_high[sample_idx], 
                    alpha=0.3, label='80% Prediction Interval')
    ax.scatter(x_pos, y_cb_test.iloc[sample_idx].values, c='red', s=20, label='Actual', zorder=5)
    ax.plot(x_pos, pred_mid[sample_idx], 'b-', lw=1, label='Median Prediction', alpha=0.7)
    
    ax.set_xlabel('Sample Index')
    ax.set_ylabel('Price (€)')
    ax.set_title('Price Predictions with Uncertainty Intervals')
    ax.legend()
    
    plt.tight_layout()
    plt.show()

## 9. Save Best Model

In [None]:
# Save the best model with metadata
metadata = {
    'model_type': best_model_name,
    'feature_cols': model_feature_cols,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'target_col': TARGET_COL,
    'test_metrics': final_metrics,
    'cv_metrics': all_results.get(best_model_name, {}),
}

save_model(final_model, '../models/price_model.pkl', metadata)

print(f"\nModel saved with metadata:")
for key, value in metadata.items():
    if key in ['feature_cols', 'numeric_cols', 'categorical_cols']:
        print(f"  {key}: {len(value)} columns")
    else:
        print(f"  {key}: {value}")

## Summary

### Results

We trained and compared multiple models for predicting computer prices:

1. **Baseline** - Simple mean prediction
2. **Ridge Regression** - Linear model with L2 regularization
3. **Random Forest** - Ensemble of decision trees
4. **HistGradientBoosting** - sklearn's fast gradient boosting
5. **CatBoost** - Native categorical handling (if available)

### Data Preparation (from notebook 02)

- Features with >60% missing values removed before modeling
- CPU/GPU benchmarks matched using close neighbor + family mean imputation
- Complete benchmark features included:
  - CPU: `_cpu_mark`, `_cpu_rank`, `_cpu_value`, `_cpu_price_usd`
  - GPU: `_gpu_mark`, `_gpu_rank`, `_gpu_value`, `_gpu_price_usd`
- Excluded columns: `cpu_match_score`, `gpu_match_score`, `Ofertas`

### Key Findings

- Best model saved to `models/price_model.pkl`
- Quantile regression enables price range predictions
- Error varies by price range (larger absolute errors for expensive items)

### Next Steps

1. Hyperparameter tuning (notebook 04)
2. Feature selection to reduce complexity
3. Deploy model via backend API
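
For the deployment step, the model and its metadata need to be written and read back as one unit. The sketch below shows one way to do that with `pickle`; it is an assumption, since the file format used by `save_model` is not shown in this notebook. `save_bundle` and `load_bundle` are hypothetical names, and a `DummyRegressor` stands in for the trained model.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.dummy import DummyRegressor


def save_bundle(model, path, metadata):
    """Pickle the model together with its metadata in a single file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump({"model": model, "metadata": metadata}, fh)


def load_bundle(path):
    """Load a bundle written by save_bundle. Only unpickle files you trust."""
    with Path(path).open("rb") as fh:
        bundle = pickle.load(fh)
    return bundle["model"], bundle["metadata"]


# Stand-in model fitted on toy values, for illustration only
model = DummyRegressor(strategy="mean").fit([[0], [1], [2]], [900.0, 1000.0, 1100.0])

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "price_model.pkl"
    save_bundle(model, target, {"target_col": "_precio_num"})
    restored, meta = load_bundle(target)

prediction = float(restored.predict([[5]])[0])
print(prediction, meta)
```

Storing the feature column list in the metadata lets an API check incoming requests against the columns the model was trained on.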