# Team Leader's Contribution: Advanced Analysis & Ensemble Learning

## Objective
This notebook integrates the team's SVM and Random Forest models and extends them with further analysis.
**CRITICAL UPDATE**: Because some contributions were missing, the Team Leader implemented the **Linear Regression (Baseline)** and **Deep Learning (MLP)** models so that the project is complete.

### Scope:
1.  **Baseline Model**: Linear Regression (Implemented by Team Lead).
2.  **Teammate Models**: SVM (Optimized) and Random Forest (Tuned).
3.  **Advanced Model**: Deep Learning / MLPRegressor (Implemented by Team Lead).
4.  **Ensemble Learning**: Averaging the predictions of all four models with `VotingRegressor`.
5.  **Model Comparison**: Visualizing performance across all approaches.

## 1. Imports & Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Set plot style
sns.set(style="whitegrid")

## 2. Data Loading & Preprocessing
We reuse the preprocessing pipeline defined by the team so that every model is trained on identically transformed features.

In [None]:
# Load dataset
df = pd.read_csv('../data/audi.csv')

# Define features
numeric_features = ['year', 'mileage', 'tax', 'mpg']
categorical_features = ['model', 'transmission', 'fuelType']

X = df[numeric_features + categorical_features]
y = df['price']

# Numeric transformer
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical transformer
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    # sparse_output requires scikit-learn >= 1.2 (older releases name this argument `sparse`)
    ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training samples: {X_train.shape[0]}")
print(f"Testing samples: {X_test.shape[0]}")

### Baseline Model: Linear Regression (Implemented by Team Lead)
The Linear Regression module was missing from the repository, so I implement it here as the baseline for comparison.

In [None]:
lr_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

print("Training Linear Regression (Baseline)...")
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
r2_lr = r2_score(y_test, y_pred_lr)
print(f"Linear Regression R2 Score: {r2_lr:.4f}")
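
R2 alone does not show how large the errors are in price units. As a sketch, a small helper (the name `regression_report` and the demo numbers are hypothetical, not part of the team's code) can report MAE and RMSE alongside R2 using the metric functions already imported in Section 1.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MAE, RMSE, and R2 for one set of predictions (hypothetical helper)."""
    return {
        'MAE': mean_absolute_error(y_true, y_pred),
        'RMSE': float(np.sqrt(mean_squared_error(y_true, y_pred))),
        'R2': r2_score(y_true, y_pred),
    }


# Illustrative values only, not results from this notebook
demo_true = np.array([10000.0, 20000.0, 30000.0])
demo_pred = np.array([11000.0, 19000.0, 33000.0])
demo_report = regression_report(demo_true, demo_pred)
print(demo_report)
```

In this notebook the same helper could be called as `regression_report(y_test, y_pred_lr)`.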

## 3. Model 1: Optimized SVM
This pipeline uses the best parameters found in the SVM notebook: `C=500`, `epsilon=0.2`, `gamma='scale'`.

In [None]:
svm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', SVR(kernel='rbf', C=500, epsilon=0.2, gamma='scale'))
])

print("Training Optimized SVM...")
svm_pipeline.fit(X_train, y_train)
y_pred_svm = svm_pipeline.predict(X_test)

r2_svm = r2_score(y_test, y_pred_svm)
print(f"Optimized SVM R2 Score: {r2_svm:.4f}")
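
One thing worth checking with SVR is the scale of the target: `epsilon` and `C` are expressed in target units, so `epsilon=0.2` is a very narrow tube when prices run into the thousands. The sketch below is not the team's method. It shows one possible alternative, wrapping the regressor in scikit-learn's `TransformedTargetRegressor` so that the target is standardized during fitting and predictions are mapped back to price units. The data is synthetic, the `*_demo` names are illustrative, and the tuned `C` and `epsilon` would likely need re-tuning under this setup.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (not audi.csv): one standardized feature, price-like target
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 1))
y_demo = 20000 + 5000 * X_demo[:, 0] + rng.normal(scale=500, size=200)

# Same SVR settings as the notebook, but fitted on a standardized target;
# predict() inverts the scaling, so outputs are back in price units.
svr_scaled_target = TransformedTargetRegressor(
    regressor=SVR(kernel='rbf', C=500, epsilon=0.2, gamma='scale'),
    transformer=StandardScaler(),
)
svr_scaled_target.fit(X_demo, y_demo)
pred_demo = svr_scaled_target.predict(X_demo[:5])
print(pred_demo)
```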

## 4. Model 2: Random Forest (Tuning)
We first fit a Random Forest with default hyperparameters, then tune `n_estimators`, `max_depth`, and `min_samples_split` with a 3-fold grid search.

In [None]:
# Base Random Forest
rf_base = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42, n_jobs=-1))
])

print("Training Base Random Forest...")
rf_base.fit(X_train, y_train)
y_pred_rf_base = rf_base.predict(X_test)
r2_rf_base = r2_score(y_test, y_pred_rf_base)
print(f"Base Random Forest R2 Score: {r2_rf_base:.4f}")

# Hyperparameter Tuning for Random Forest
param_grid_rf = {
    'regressor__n_estimators': [100, 200],
    'regressor__max_depth': [None, 10, 20],
    'regressor__min_samples_split': [2, 5]
}

print("\nTuning Random Forest (this may take a while)...")
grid_rf = GridSearchCV(rf_base, param_grid_rf, cv=3, scoring='r2', n_jobs=-1, verbose=1)
grid_rf.fit(X_train, y_train)

best_rf = grid_rf.best_estimator_
print("Best RF Params:", grid_rf.best_params_)

y_pred_rf_opt = best_rf.predict(X_test)
r2_rf_opt = r2_score(y_test, y_pred_rf_opt)
print(f"Optimized Random Forest R2 Score: {r2_rf_opt:.4f}")
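
Beyond `best_params_`, the full grid search results show how close the other parameter combinations came. The following sketch runs a small search on synthetic data (the `demo_grid` and `cv_table` names and the reduced grid are illustrative assumptions) and turns `cv_results_` into a ranked table.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the notebook would use X_train / y_train instead
X_demo, y_demo = make_regression(n_samples=120, n_features=5, noise=10.0, random_state=42)

demo_grid = GridSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_grid={'max_depth': [None, 5], 'min_samples_split': [2, 5]},
    cv=3,
    scoring='r2',
)
demo_grid.fit(X_demo, y_demo)

# One row per parameter combination, best mean cross-validated score first
cv_table = (
    pd.DataFrame(demo_grid.cv_results_)
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score')
)
print(cv_table)
```

Applied to this notebook, the same table could be built from `grid_rf.cv_results_`. A small gap between the top rows relative to `std_test_score` would suggest the tuning gain is within cross-validation noise.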

### Advanced Model: Deep Learning / MLP (Implemented by ntthang-dev)
We implement a Multi-Layer Perceptron (MLP) regressor with three hidden layers (128, 64, and 32 units) to capture non-linear relationships between the features and price.

In [None]:
mlp_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', MLPRegressor(hidden_layer_sizes=(128, 64, 32),
                               activation='relu',
                               solver='adam',
                               max_iter=500,
                               random_state=42))
])

print("Training MLP Regressor (Deep Learning)...")
mlp_model.fit(X_train, y_train)
y_pred_mlp = mlp_model.predict(X_test)
r2_mlp = r2_score(y_test, y_pred_mlp)
print(f"MLP Regressor R2 Score: {r2_mlp:.4f}")
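
With `max_iter=500`, the MLP may stop at the iteration cap before it has converged, in which case scikit-learn emits a `ConvergenceWarning`. As a minimal sketch on synthetic data (the smaller network and the `demo_mlp` name are illustrative assumptions), the fitted attributes `n_iter_` and `loss_curve_` can be used to check this.

```python
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, standardized as the notebook's pipeline would do
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

demo_mlp = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=500, random_state=42)
demo_mlp.fit(X_demo, y_demo)

# If training used every allowed iteration, the optimizer did not converge on its own
hit_iteration_cap = demo_mlp.n_iter_ >= demo_mlp.max_iter
print(f"Iterations run: {demo_mlp.n_iter_} (cap {demo_mlp.max_iter})")
print(f"Stopped at the cap without converging: {hit_iteration_cap}")
print(f"Final training loss: {demo_mlp.loss_curve_[-1]:.2f}")
```

For the pipeline in this notebook, the fitted network is available as `mlp_model.named_steps['regressor']`.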

## 5. Model 3: Ensemble Learning (Voting Regressor)
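Combining all four models: Linear Regression, Optimized SVM, Optimized Random Forest, and MLP. `VotingRegressor` refits a clone of each pipeline on the training data and averages their predictions with equal weights.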

In [None]:
voting_regressor = VotingRegressor(
    estimators=[
        ('lr', lr_model),
        ('svm', svm_pipeline),
        ('rf', best_rf),
        ('mlp', mlp_model)
    ]
)

print("Training Ensemble Model (LR + SVM + RF + MLP)...")
voting_regressor.fit(X_train, y_train)
y_pred_ensemble = voting_regressor.predict(X_test)

r2_ensemble = r2_score(y_test, y_pred_ensemble)
print(f"Ensemble Model R2 Score: {r2_ensemble:.4f}")
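
To illustrate what the voting step computes, the sketch below fits a two-member `VotingRegressor` on synthetic data (the `demo_voter` name and the choice of members are illustrative assumptions) and checks that its prediction is the plain average of the members' predictions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data; the notebook would use X_train / y_train instead
X_demo, y_demo = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=42)

demo_voter = VotingRegressor(estimators=[
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor(n_estimators=20, random_state=42)),
])
demo_voter.fit(X_demo, y_demo)

# Predictions of the fitted clones, one row per ensemble member
member_preds = np.array([est.predict(X_demo[:5]) for est in demo_voter.estimators_])
manual_average = member_preds.mean(axis=0)

print(np.allclose(demo_voter.predict(X_demo[:5]), manual_average))
```

Since every member counts equally by default, a weak member lowers the average. `VotingRegressor` also accepts a `weights` list, which could be used to favor the stronger models.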

## 6. Model Comparison & Visualization

In [None]:
# Collect results
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Optimized SVM', 'Optimized RF', 'MLP Regressor', 'Ensemble'],
    'R2 Score': [r2_lr, r2_svm, r2_rf_opt, r2_mlp, r2_ensemble]
})

print(results)

# Bar Chart Comparison
plt.figure(figsize=(12, 6))
sns.barplot(x='R2 Score', y='Model', data=results, palette='viridis')
plt.title('Final Model Comparison: R2 Score')
# Zoom in on high scores without cutting off the weakest model's bar
plt.xlim(max(0.0, results['R2 Score'].min() - 0.05), 1.0)
plt.show()

# Scatter Plot: True vs Predicted (Ensemble)
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_ensemble, alpha=0.5, color='purple')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', lw=2)
plt.xlabel('True Price')
plt.ylabel('Predicted Price')
plt.title('Ensemble Model: True vs Predicted Price')
plt.show()

## Conclusion
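The Team Leader has integrated all four models, filling the gaps left by missing contributions. The Ensemble model averages the Linear Regression, SVM, Random Forest, and MLP predictions with equal weights. Whether it outperforms its strongest member should be read from the R2 comparison above, since a weaker member such as the linear baseline can pull an equal-weight average down.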