# Model Development - House Price Dataset
## Introduction
This notebook trains and compares two regression models, Linear Regression and Random Forest, for house price prediction. Using the engineered features produced by `03_feature_engineering.ipynb`, we train each model, evaluate it on a held-out test set, and save the best performer.

**Dataset:** Housing Price Prediction Data (Kaggle)

**Objective:** Train and compare regression models to predict house prices as accurately as possible.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

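**Model development steps:**

1. Import libraries and load data
2. Train-test split
3. Model training
4. Model evaluation
5. Save best model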

### 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

df = pd.read_csv("../data/processed/engineered_features.csv")

print(f"Dataset loaded: {df.shape[0]} rows, {df.shape[1]} columns")
df.head()
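
Before modeling, it can help to confirm that the loaded table is ready for scikit-learn estimators, which expect numeric inputs and, for the two models used here, no missing values. The following is a minimal sketch of such a check, not part of the original pipeline: `summarize_readiness` is a hypothetical helper, and the toy frame with its column names and values is an assumption standing in for `df`.

```python
import pandas as pd

def summarize_readiness(frame: pd.DataFrame, target: str) -> dict:
    """Report issues that would stop a scikit-learn regressor from fitting."""
    features = frame.drop(columns=target)
    return {
        # Non-numeric columns would need encoding first
        # (boolean columns are also listed here).
        "non_numeric": features.select_dtypes(exclude="number").columns.tolist(),
        # Total count of missing cells across features and target.
        "missing_values": int(frame.isna().sum().sum()),
    }

# Toy stand-in for the engineered dataset (hypothetical columns and values).
toy = pd.DataFrame({
    "SquareFeet": [1200, 1500, None],
    "Neighborhood": ["Urban", "Rural", "Suburb"],
    "Price": [200000, 250000, 230000],
})
report = summarize_readiness(toy, target="Price")
print(report)
```

In the notebook, calling `summarize_readiness(df, target="Price")` right after loading would surface any column that still needs encoding or imputation before the split.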

### 2. Train-Test Split

In [None]:
X = df.drop(columns='Price')
y = df['Price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Train: {X_train.shape}")
print(f"Test: {X_test.shape}")

### 3. Model Training

In [None]:
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(
        n_estimators=50,
        max_depth=10,
        min_samples_split=10,
        random_state=42
    )
}

results = []

for name, model in models.items():
    model.fit(X_train, y_train)
    
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)
    
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
    test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))
    test_r2 = r2_score(y_test, y_pred_test)
    test_mae = mean_absolute_error(y_test, y_pred_test)
    
    results.append({
        'Model': name,
        'Train_RMSE': train_rmse,
        'Test_RMSE': test_rmse,
        'Test_MAE': test_mae,
        'Test_R2': test_r2
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
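
The loop records `Train_RMSE` alongside `Test_RMSE`, which makes a quick overfitting check possible: a test error much larger than the training error suggests the model has memorized the training set. The sketch below shows one way to express that as a ratio. `add_overfit_ratio` is a hypothetical helper, and the numbers in `demo` are placeholders for illustration, not results from this notebook.

```python
import pandas as pd

def add_overfit_ratio(results: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with Test_RMSE / Train_RMSE; values well above 1 suggest overfitting."""
    out = results.copy()
    out["RMSE_Ratio"] = out["Test_RMSE"] / out["Train_RMSE"]
    return out

# Placeholder numbers for illustration only.
demo = pd.DataFrame({
    "Model": ["A", "B"],
    "Train_RMSE": [100.0, 50.0],
    "Test_RMSE": [110.0, 100.0],
})
print(add_overfit_ratio(demo).to_string(index=False))
```

Applied to the real table, this would be `add_overfit_ratio(results_df)`. Tree ensembles such as Random Forest often show a larger ratio than a linear model, so the ratio is best read together with the absolute test error.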

### 4. Model Evaluation

In [None]:
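# Pick the model with the highest test-set R2.
# Note: selecting on the test set makes the reported test metrics slightly optimistic.
best_model_name = results_df.loc[results_df['Test_R2'].idxmax(), 'Model']
best_model = models[best_model_name]  # already fitted in the training loop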

y_pred = best_model.predict(X_test)

print(f"Best model: {best_model_name}")
print(f"R2 Score: {r2_score(y_test, y_pred):.4f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.2f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.2f}")

plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred, alpha=0.5, color='blue')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title(f'Actual vs Predicted - {best_model_name}')
plt.tight_layout()
plt.show()
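
Step 4 picks the winner by its test-set R², which means the test set is used both to choose the model and to report its score. A common alternative is to choose with cross-validation on the training data and keep the test set for a single final evaluation. The sketch below illustrates that idea under stated assumptions: `cv_r2` is a hypothetical helper, and the synthetic data from `make_regression` stands in for `X_train` and `y_train`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_r2(candidates: dict, X, y, folds: int = 5) -> dict:
    """Mean cross-validated R2 per model, computed on the data passed in."""
    return {
        name: float(np.mean(cross_val_score(model, X, y, cv=folds, scoring="r2")))
        for name, model in candidates.items()
    }

# Synthetic stand-in for X_train / y_train (not the housing data).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(
        n_estimators=50, max_depth=10, min_samples_split=10, random_state=42
    ),
}
scores = cv_r2(candidates, X_demo, y_demo)
best_name = max(scores, key=scores.get)
print(scores)
print(f"Selected by cross-validation: {best_name}")
```

`cross_val_score` fits clones of each estimator, so the objects in `candidates` stay unfitted. The selected model would still need a final `fit` on the full training set before being evaluated once on the test set.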

### 5. Save Best Model

In [None]:
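from pathlib import Path

model_path = Path('../models/best_model.pkl')
# Create the target directory if it does not exist yet; joblib.dump will not create it
model_path.parent.mkdir(parents=True, exist_ok=True)
joblib.dump(best_model, model_path)

print(f"Model saved: {model_path}")
print(f"Model type: {type(best_model).__name__}")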
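
After saving, a short round-trip check can confirm that the file loads and that the reloaded estimator predicts the same values as the original. The sketch below uses a tiny stand-in model and a temporary directory so that it runs on its own. In the notebook, the same check would use `best_model`, `model_path`, and `X_test`.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model; in the notebook this would be `best_model`.
X_demo = np.array([[1.0], [2.0], [3.0], [4.0]])
y_demo = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_model.pkl"
    joblib.dump(model, path)
    reloaded = joblib.load(path)

# The reloaded estimator should reproduce the original predictions.
same = np.allclose(model.predict(X_demo), reloaded.predict(X_demo))
print(f"Predictions match after reload: {same}")
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them, so recording the scikit-learn version next to the file is a reasonable precaution.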

### Conclusion

Two regression models, Linear Regression and Random Forest, were trained on the engineered features and compared on a held-out 20% test set using RMSE, MAE, and R². The model with the highest test R² was selected and saved to `../models/best_model.pkl`.