# Model Evaluation - House Price Dataset
## Introduction
This notebook performs **model evaluation** for the house price prediction project. It assesses the performance of the trained models using various metrics, cross-validation, and residual analysis, and provides visualizations to interpret the results.

**Dataset:** Housing Price Prediction Data (Kaggle)

**Objective:** Evaluate the predictive performance of the final models and interpret their results.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**Evaluation steps:**
1. Import libraries and load the processed data
2. Load the best model and align its features with the data
3. Split the data and evaluate performance with regression metrics
4. Perform cross-validation analysis
5. Analyze and visualize residuals
6. Summarize findings and next steps

## 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split, cross_val_score
import joblib
import os
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

df = pd.read_csv('../data/processed/engineered_data.csv')
print('Columns in engineered_data.csv:', list(df.columns))
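# Infer the target column: the first column whose name contains "price"
possible_targets = [col for col in df.columns if 'price' in col.lower()]
if not possible_targets:
    raise KeyError('No column containing "price" found in engineered_data.csv. Please check your data.')
if len(possible_targets) > 1:
    # Engineered features may also contain "price" in their name; make the choice visible
    print(f"Multiple candidate target columns found: {possible_targets}")
target_col = possible_targets[0]
print(f"Using '{target_col}' as the target column.")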
features_list = [col for col in df.columns if col != target_col]

## 2. Load the Best Model

In [None]:
models_dir = '../models/'
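# Sort for a deterministic choice and fail with a clear message if no model file exists
best_model_files = sorted(f for f in os.listdir(models_dir) if 'best_model' in f)
if not best_model_files:
    raise FileNotFoundError(f"No file containing 'best_model' found in {models_dir}")
best_model_file = best_model_files[0]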
best_model = joblib.load(os.path.join(models_dir, best_model_file))
print(f"Loaded best model: {best_model_file}")

# Get model's feature names if available
model_features = None
if hasattr(best_model, 'feature_names_in_'):
    model_features = list(best_model.feature_names_in_)
    print('Model was trained with features:', model_features)
else:
    print('Model does not have feature_names_in_ attribute. Using all features from data.')

# Use only the intersection of features for prediction
data_features = features_list
if model_features is not None:
    used_features = [f for f in model_features if f in data_features]
    missing_in_data = [f for f in model_features if f not in data_features]
    extra_in_data = [f for f in data_features if f not in model_features]
    print('Features used for prediction:', used_features)
    if missing_in_data:
        print('Warning: These features were in the model but not in the data:', missing_in_data)
    if extra_in_data:
        print('Note: These features are in the data but not in the model:', extra_in_data)
else:
    used_features = data_features

X = df[used_features]
y = df[target_col]
print(f"Final features used: {used_features}")
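
The feature check above only prints a warning when the model expects columns that the data lacks. For scikit-learn estimators that expose `feature_names_in_`, predicting on a reduced column set would typically fail later with a less direct error. A stricter variant could look like the sketch below; the helper name `align_features` and the toy column names are hypothetical and not part of the original notebook.

```python
import pandas as pd

def align_features(df, model_features):
    """Return df restricted to model_features, in the model's column order.

    Raises ValueError listing any columns the model expects but the data
    lacks, instead of continuing with a partial feature set.
    """
    missing = [f for f in model_features if f not in df.columns]
    if missing:
        raise ValueError(f"Data is missing features required by the model: {missing}")
    return df[list(model_features)]

# Toy data standing in for engineered_data.csv (column names are made up)
toy = pd.DataFrame({"area": [50, 80], "rooms": [2, 3], "price": [100, 160]})
X_toy = align_features(toy, ["rooms", "area"])
print(list(X_toy.columns))
```

In the notebook, this would be called as `align_features(df, best_model.feature_names_in_)` when the attribute is available.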

## 3. Evaluate Model Performance

In [None]:
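# Split data 60/20/20 into train/validation/test (0.25 of the remaining 80% = 20%);
# this only reproduces the model-development split if the same seed and ratios were used there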
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
print(f"Train set: {X_train.shape}, Validation set: {X_val.shape}, Test set: {X_test.shape}")

# Evaluate model performance on each split with regression metrics
def regression_metrics(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R2': r2}

results = {}
for split, X_, y_ in zip(['Train', 'Validation', 'Test'], [X_train, X_val, X_test], [y_train, y_val, y_test]):
    y_pred = best_model.predict(X_)
    results[split] = regression_metrics(y_, y_pred)
    print(f"{split} set:")
    for k, v in results[split].items():
        print(f"  {k}: {v:.4f}")
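
To make the four metrics concrete, the sketch below recomputes them with plain NumPy from their definitions. The function name `regression_metrics_manual` and the toy values are illustrative assumptions; the notebook itself relies on the scikit-learn functions.

```python
import numpy as np

def regression_metrics_manual(y_true, y_pred):
    """Compute MSE, RMSE, MAE and R2 directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mse = np.mean(errors ** 2)            # mean squared error
    rmse = np.sqrt(mse)                   # same units as the target
    mae = np.mean(np.abs(errors))         # mean absolute error
    ss_res = np.sum(errors ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot            # fraction of variance explained
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R2': r2}

# Toy values, not taken from the dataset
toy_true = [100.0, 200.0, 300.0]
toy_pred = [110.0, 190.0, 300.0]
toy_metrics = regression_metrics_manual(toy_true, toy_pred)
print(toy_metrics)
```

RMSE and MAE are in the units of the target, which makes them easier to interpret than MSE. RMSE penalizes large errors more heavily than MAE.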

## 4. Cross-validation

In [None]:
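# cross_val_score clones best_model and refits it on each fold, so this measures
# how the model's configuration generalizes, not the already-fitted saved model.
# The scorer returns negated RMSE (higher is better), hence the sign flip on the mean.
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='neg_root_mean_squared_error')
print(f"\n5-Fold CV RMSE: {-cv_scores.mean():.4f} ± {cv_scores.std():.4f}")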
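
Passing `cv=5` to `cross_val_score` with a regressor uses unshuffled K-fold splits. If the rows of the data are ordered in some way, an explicit shuffled `KFold` may give a more representative estimate. The sketch below shows this on synthetic data; the `LinearRegression` model and the generated arrays are stand-ins, not the notebook's model or dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (assumption: stands in for X_train / y_train)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

# Shuffled 5-fold splitter with a fixed seed for reproducibility
cv = KFold(n_splits=5, shuffle=True, random_state=42)
neg_rmse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    cv=cv, scoring='neg_root_mean_squared_error',
)
fold_rmse = -neg_rmse  # flip the sign to get RMSE per fold
print(f"Per-fold RMSE: {np.round(fold_rmse, 4)}")
print(f"CV RMSE: {fold_rmse.mean():.4f} ± {fold_rmse.std():.4f}")
```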

## 5. Residual Analysis

In [None]:
y_test_pred = best_model.predict(X_test)
# Residual = actual - predicted; a positive residual means the model under-predicted
residuals = y_test - y_test_pred
plt.figure()
sns.histplot(residuals, kde=True)
plt.title('Residuals Distribution (Test Set)')
plt.xlabel('Residual')
plt.show()

plt.figure()
plt.scatter(y_test_pred, residuals, alpha=0.6)
plt.axhline(0, color='red', linestyle='--')
plt.title('Residuals vs. Predicted (Test Set)')
plt.xlabel(f'Predicted {target_col}')
plt.ylabel('Residual')
plt.show()
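
The plots can be complemented with a few numeric checks: a mean residual near zero suggests no overall bias, and a correlation near zero between residuals and predictions suggests no linear trend in the errors. The sketch below is one possible way to compute these; the helper name `residual_summary` and the toy values are assumptions for illustration only.

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Summarize residuals (actual - predicted) with a few simple statistics."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    return {
        'mean': residuals.mean(),                  # overall bias
        'std': residuals.std(ddof=1),              # sample standard deviation
        'max_abs': np.abs(residuals).max(),        # largest single error
        'corr_with_pred': np.corrcoef(y_pred, residuals)[0, 1],  # linear trend
    }

# Toy values, not taken from the dataset
toy_true = np.array([100.0, 200.0, 300.0, 400.0])
toy_pred = np.array([110.0, 190.0, 310.0, 390.0])
toy_summary = residual_summary(toy_true, toy_pred)
print(toy_summary)
```

In the notebook, the same helper could be applied to `y_test` and `y_test_pred`.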

## 6. Summarize Findings

In [None]:
print("\nSummary:")
print(f"Best model: {best_model_file}")
print(f"Test RMSE: {results['Test']['RMSE']:.4f}")
print(f"Test R2: {results['Test']['R2']:.4f}")
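
A side-by-side table of the per-split metrics can make the train/test gap easier to read than separate print statements. The sketch below shows the layout with placeholder numbers; the values in `demo_results` are hypothetical and are not results from this project. In the notebook, the `results` dictionary built during evaluation would take their place.

```python
import pandas as pd

# Placeholder metric values, only to show the table layout (not real results)
demo_results = {
    'Train': {'RMSE': 10.0, 'R2': 0.95},
    'Validation': {'RMSE': 12.0, 'R2': 0.91},
    'Test': {'RMSE': 12.5, 'R2': 0.90},
}

# Outer keys become columns, so transpose to get one row per split
summary = pd.DataFrame(demo_results).T
print(summary)

# Ratio of test to train RMSE; values well above 1 hint at overfitting
rmse_ratio = summary.loc['Test', 'RMSE'] / summary.loc['Train', 'RMSE']
print(f"Test/Train RMSE ratio: {rmse_ratio:.2f}")
```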

## Model Evaluation Summary & Next Steps
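The evaluation covered regression metrics (MSE, RMSE, MAE, R²) on the train, validation, and test splits, 5-fold cross-validated RMSE on the training set, and residual analysis on the test set. A test RMSE close to the training and cross-validated RMSE indicates that the model generalizes, while a large gap points to overfitting. Residuals centered on zero with no visible pattern against the predicted values suggest that the model has no systematic bias. If these checks hold, the model is a reasonable candidate for deployment; otherwise, the next steps are further feature engineering or hyperparameter tuning.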