# Ford Fiesta Price Prediction: Advanced Models Comparison

**Goal:** Compare Ridge Regression, Lasso Regression, and Random Forest models for predicting Ford Fiesta prices using the full feature set (including one-hot encoded state and trim variables).

**Dataset:** Used Ford Fiestas scraped from cars.com with one-hot encoded categorical variables

**Models:** 
1. Ridge Regression (L2 Regularization)
2. Lasso Regression (L1 Regularization)
3. Random Forest Regressor

**Why These Models?**
- **Ridge & Lasso:** Handle multicollinearity better than ordinary least squares (unregularized linear regression), which matters with many correlated one-hot encoded features
- **Random Forest:** Captures non-linear relationships and feature interactions automatically

## Step 1: Import Libraries

Import all the libraries needed for analysis and modeling.

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# Settings
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
%matplotlib inline

# Random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

## Step 2: Load the Data

Load the one-hot encoded Ford Fiesta dataset.

In [None]:
# Load the one-hot encoded data
df = pd.read_csv('ford_fiestas_extrap_one_hot.csv')

# Display first few rows
print("First 5 rows of the dataset:")
print(df.head())

# Basic info
print("\nDataset shape:", df.shape)
print("\nColumn names and types:")
print(df.info())

## Step 3: Data Preparation

Prepare the feature matrix (X) and target variable (y). We'll use the numeric features plus all one-hot encoded columns.
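
As a rough illustration of what "one-hot encoded" means here, the sketch below encodes a tiny, made-up table with `pd.get_dummies`. The rows are hypothetical and are not taken from the scraped dataset; the only thing borrowed from this notebook is the `trim_<value>` / `state_<value>` column naming used later in Step 14. Unlike the CSV loaded above, this sketch drops the original `trim` and `state` columns during encoding.

```python
import pandas as pd

# Toy listings (hypothetical values, not from the scraped dataset)
raw = pd.DataFrame({
    "year": [2019, 2016, 2017],
    "mileage": [50000, 100000, 75000],
    "trim": ["SE", "ST", "Titanium"],
    "state": ["CO", "TX", "CA"],
})

# One indicator column per category value, named "<column>_<value>"
encoded = pd.get_dummies(raw, columns=["trim", "state"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each row ends up with exactly one `1` among the `trim_*` columns and one among the `state_*` columns.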

In [None]:
# Separate features and target
# We want: year, mileage, distance, plus all one-hot encoded state and trim columns
# Exclude: title, price, location, trim, state (original categorical columns)

exclude_columns = ['title', 'price', 'location', 'trim', 'state']
feature_columns = [col for col in df.columns if col not in exclude_columns]

X = df[feature_columns]
y = df['price']

print(f"Feature matrix shape: {X.shape}")
print(f"Target variable shape: {y.shape}")
print(f"\nFeatures being used ({len(feature_columns)}):")
print(feature_columns)

## Step 4: Train-Test Split

Split the data into training and testing sets (80-20 split).

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print(f"\nTraining set price range: ${y_train.min():,.0f} - ${y_train.max():,.0f}")
print(f"Test set price range: ${y_test.min():,.0f} - ${y_test.max():,.0f}")

## Step 5: Feature Scaling

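Scale features for Ridge and Lasso regression. This matters because the L1 and L2 penalties act on coefficient magnitudes, which depend on each feature's units: without scaling, a feature measured in large units (such as mileage) is penalized differently from a 0/1 indicator column.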

**Note:** Random Forest doesn't require scaling, but we'll create scaled versions for Ridge/Lasso.
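
The idea of fitting the scaler on the training rows only can be sketched on synthetic data. The two columns below are made up to mimic a year-like and a mileage-like feature; the names `X_demo` and `demo_scaler` are illustrative and separate from the notebook's variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic columns on very different scales (year-like and mileage-like)
X_demo = np.column_stack([
    rng.normal(2016, 2, size=200),
    rng.normal(80000, 30000, size=200),
])
X_tr, X_te = X_demo[:160], X_demo[160:]

# Mean and standard deviation are estimated from the training rows only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_s = demo_scaler.transform(X_tr)
# Test rows reuse the training statistics, so no test information leaks in
X_te_s = demo_scaler.transform(X_te)

print("Train mean per column:", X_tr_s.mean(axis=0).round(3))
print("Train std per column: ", X_tr_s.std(axis=0).round(3))
print("Test mean per column: ", X_te_s.mean(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1; the test columns are close to that but not exact, which is expected.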

In [None]:
# Create a scaler and fit on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Features have been scaled using StandardScaler")
print(f"Scaled training data shape: {X_train_scaled.shape}")
print(f"Scaled test data shape: {X_test_scaled.shape}")

## Step 6: Model 1 - Ridge Regression

**Ridge Regression** adds L2 regularization, which penalizes large coefficients. This helps prevent overfitting when you have many features.

The regularization term is controlled by the `alpha` parameter (higher alpha = more regularization).
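
To make the effect of `alpha` concrete, here is a small sketch on synthetic data from `make_regression` (not the Fiesta dataset). It fits Ridge at a few arbitrary `alpha` values and records the overall size of the coefficient vector.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression problem, used only to illustrate shrinkage
X_demo, y_demo = make_regression(
    n_samples=100, n_features=20, noise=10.0, random_state=0
)

norms = {}
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector: smaller means more shrinkage
    norms[alpha] = float(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>8}: coefficient norm = {norms[alpha]:.2f}")
```

The norm shrinks as `alpha` grows, but no coefficient is forced to exactly zero.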

In [None]:
# Train Ridge Regression model
ridge_model = Ridge(alpha=1.0, random_state=RANDOM_STATE)
ridge_model.fit(X_train_scaled, y_train)

# Make predictions
ridge_train_pred = ridge_model.predict(X_train_scaled)
ridge_test_pred = ridge_model.predict(X_test_scaled)

# Calculate metrics
ridge_train_mae = mean_absolute_error(y_train, ridge_train_pred)
ridge_test_mae = mean_absolute_error(y_test, ridge_test_pred)
ridge_train_rmse = np.sqrt(mean_squared_error(y_train, ridge_train_pred))
ridge_test_rmse = np.sqrt(mean_squared_error(y_test, ridge_test_pred))
ridge_train_r2 = r2_score(y_train, ridge_train_pred)
ridge_test_r2 = r2_score(y_test, ridge_test_pred)

print("\n" + "="*60)
print("RIDGE REGRESSION RESULTS")
print("="*60)
print(f"\nTraining Set:")
print(f"  MAE:  ${ridge_train_mae:,.2f}")
print(f"  RMSE: ${ridge_train_rmse:,.2f}")
print(f"  R²:   {ridge_train_r2:.4f}")
print(f"\nTest Set:")
print(f"  MAE:  ${ridge_test_mae:,.2f}")
print(f"  RMSE: ${ridge_test_rmse:,.2f}")
print(f"  R²:   {ridge_test_r2:.4f}")
print("="*60)

## Step 7: Model 2 - Lasso Regression

**Lasso Regression** adds L1 regularization, which can shrink some coefficients to exactly zero. This performs automatic feature selection.

This is useful when you suspect many features aren't important.
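
The feature-selection behaviour can be sketched on synthetic data where only a few features carry signal. The dataset and the `alpha` values below are arbitrary choices for illustration, not settings tuned for the Fiesta data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# 30 features, of which only 5 actually influence the target
X_demo, y_demo = make_regression(
    n_samples=200, n_features=30, n_informative=5, noise=5.0, random_state=0
)

nonzero_counts = {}
for alpha in [0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha, max_iter=10000).fit(X_demo, y_demo)
    # Count the coefficients that Lasso did not shrink to exactly zero
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print(f"alpha={alpha:>5}: {nonzero_counts[alpha]} of 30 coefficients are non-zero")
```

Larger `alpha` values generally leave fewer non-zero coefficients, which is the same count reported as "Features selected" in the cell below.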

In [None]:
# Train Lasso Regression model
lasso_model = Lasso(alpha=1.0, random_state=RANDOM_STATE, max_iter=10000)
lasso_model.fit(X_train_scaled, y_train)

# Make predictions
lasso_train_pred = lasso_model.predict(X_train_scaled)
lasso_test_pred = lasso_model.predict(X_test_scaled)

# Calculate metrics
lasso_train_mae = mean_absolute_error(y_train, lasso_train_pred)
lasso_test_mae = mean_absolute_error(y_test, lasso_test_pred)
lasso_train_rmse = np.sqrt(mean_squared_error(y_train, lasso_train_pred))
lasso_test_rmse = np.sqrt(mean_squared_error(y_test, lasso_test_pred))
lasso_train_r2 = r2_score(y_train, lasso_train_pred)
lasso_test_r2 = r2_score(y_test, lasso_test_pred)

# Count non-zero coefficients (features selected)
n_features_selected = np.sum(lasso_model.coef_ != 0)

print("\n" + "="*60)
print("LASSO REGRESSION RESULTS")
print("="*60)
print(f"\nFeatures selected: {n_features_selected} out of {len(feature_columns)}")
print(f"\nTraining Set:")
print(f"  MAE:  ${lasso_train_mae:,.2f}")
print(f"  RMSE: ${lasso_train_rmse:,.2f}")
print(f"  R²:   {lasso_train_r2:.4f}")
print(f"\nTest Set:")
print(f"  MAE:  ${lasso_test_mae:,.2f}")
print(f"  RMSE: ${lasso_test_rmse:,.2f}")
print(f"  R²:   {lasso_test_r2:.4f}")
print("="*60)

## Step 8: Model 3 - Random Forest

**Random Forest** is an ensemble method that builds many decision trees and averages their predictions. It can:
- Capture non-linear relationships
- Automatically detect feature interactions
- Handle features of different scales without standardization

We'll use unscaled data since Random Forest doesn't require it.
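
A quick way to see the "non-linear relationships" point is a toy problem where the target depends on the square of a single feature. The data below is synthetic and the hyperparameters are arbitrary; it is only meant to contrast a linear model with a forest, not to predict anything about car prices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Target is a noisy quadratic in x, which a straight line cannot follow
x = rng.uniform(-3, 3, size=(400, 1))
y = x[:, 0] ** 2 + rng.normal(0, 0.3, size=400)
x_tr, x_te, y_tr, y_te = x[:300], x[300:], y[:300], y[300:]

linear = Ridge(alpha=1.0).fit(x_tr, y_tr)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(x_tr, y_tr)

linear_r2 = r2_score(y_te, linear.predict(x_te))
forest_r2 = r2_score(y_te, forest.predict(x_te))
print(f"Linear model test R²:  {linear_r2:.3f}")
print(f"Random Forest test R²: {forest_r2:.3f}")
```

On this toy problem the forest tracks the curve while the linear model does not; on real data the gap may be much smaller or absent.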

In [None]:
# Train Random Forest model
rf_model = RandomForestRegressor(
    n_estimators=100,      # Number of trees
    max_depth=10,          # Maximum depth of each tree
    min_samples_split=5,   # Minimum samples to split a node
    min_samples_leaf=2,    # Minimum samples in a leaf node
    random_state=RANDOM_STATE,
    n_jobs=-1              # Use all CPU cores
)

print("Training Random Forest model (this may take a moment)...")
rf_model.fit(X_train, y_train)
print("Training complete!")

# Make predictions
rf_train_pred = rf_model.predict(X_train)
rf_test_pred = rf_model.predict(X_test)

# Calculate metrics
rf_train_mae = mean_absolute_error(y_train, rf_train_pred)
rf_test_mae = mean_absolute_error(y_test, rf_test_pred)
rf_train_rmse = np.sqrt(mean_squared_error(y_train, rf_train_pred))
rf_test_rmse = np.sqrt(mean_squared_error(y_test, rf_test_pred))
rf_train_r2 = r2_score(y_train, rf_train_pred)
rf_test_r2 = r2_score(y_test, rf_test_pred)

print("\n" + "="*60)
print("RANDOM FOREST RESULTS")
print("="*60)
print(f"\nTraining Set:")
print(f"  MAE:  ${rf_train_mae:,.2f}")
print(f"  RMSE: ${rf_train_rmse:,.2f}")
print(f"  R²:   {rf_train_r2:.4f}")
print(f"\nTest Set:")
print(f"  MAE:  ${rf_test_mae:,.2f}")
print(f"  RMSE: ${rf_test_rmse:,.2f}")
print(f"  R²:   {rf_test_r2:.4f}")
print("="*60)

## Step 9: Model Comparison

Compare all three models side-by-side.

In [None]:
# Create comparison dataframe
comparison_df = pd.DataFrame({
    'Model': ['Ridge Regression', 'Lasso Regression', 'Random Forest'],
    'Train MAE': [ridge_train_mae, lasso_train_mae, rf_train_mae],
    'Test MAE': [ridge_test_mae, lasso_test_mae, rf_test_mae],
    'Train RMSE': [ridge_train_rmse, lasso_train_rmse, rf_train_rmse],
    'Test RMSE': [ridge_test_rmse, lasso_test_rmse, rf_test_rmse],
    'Train R²': [ridge_train_r2, lasso_train_r2, rf_train_r2],
    'Test R²': [ridge_test_r2, lasso_test_r2, rf_test_r2]
})

print("\n" + "="*90)
print("MODEL COMPARISON SUMMARY")
print("="*90)
print(comparison_df.to_string(index=False))
print("="*90)

# Find best model by test MAE (lower is better)
best_model_idx = comparison_df['Test MAE'].idxmin()
best_model_name = comparison_df.loc[best_model_idx, 'Model']
print(f"\n🏆 Best Model (by Test MAE): {best_model_name}")
print(f"   Test MAE: ${comparison_df.loc[best_model_idx, 'Test MAE']:,.2f}")
print(f"   Test R²:  {comparison_df.loc[best_model_idx, 'Test R²']:.4f}")

## Step 10: Visualize Model Performance

Create visualizations to compare model performance.

In [None]:
# Bar plot comparing test metrics
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Test MAE comparison
axes[0].bar(comparison_df['Model'], comparison_df['Test MAE'], color=['steelblue', 'coral', 'forestgreen'], alpha=0.7)
axes[0].set_ylabel('Test MAE ($)', fontsize=11)
axes[0].set_title('Test Mean Absolute Error', fontsize=12, fontweight='bold')
axes[0].tick_params(axis='x', rotation=15)
axes[0].grid(True, alpha=0.3, axis='y')

# Test RMSE comparison
axes[1].bar(comparison_df['Model'], comparison_df['Test RMSE'], color=['steelblue', 'coral', 'forestgreen'], alpha=0.7)
axes[1].set_ylabel('Test RMSE ($)', fontsize=11)
axes[1].set_title('Test Root Mean Squared Error', fontsize=12, fontweight='bold')
axes[1].tick_params(axis='x', rotation=15)
axes[1].grid(True, alpha=0.3, axis='y')

# Test R² comparison
axes[2].bar(comparison_df['Model'], comparison_df['Test R²'], color=['steelblue', 'coral', 'forestgreen'], alpha=0.7)
axes[2].set_ylabel('Test R² Score', fontsize=11)
axes[2].set_title('Test R² Score', fontsize=12, fontweight='bold')
axes[2].tick_params(axis='x', rotation=15)
axes[2].set_ylim([0, 1])
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

## Step 11: Prediction vs Actual Plots

Visualize how well each model's predictions match the actual prices.

In [None]:
# Create prediction vs actual plots for all three models
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

models_data = [
    ('Ridge Regression', ridge_test_pred, 'steelblue'),
    ('Lasso Regression', lasso_test_pred, 'coral'),
    ('Random Forest', rf_test_pred, 'forestgreen')
]

for idx, (model_name, predictions, color) in enumerate(models_data):
    axes[idx].scatter(y_test, predictions, alpha=0.5, color=color, s=30)
    
    # Add perfect prediction line
    min_val = min(y_test.min(), predictions.min())
    max_val = max(y_test.max(), predictions.max())
    axes[idx].plot([min_val, max_val], [min_val, max_val], 'r--', lw=2, label='Perfect Prediction')
    
    axes[idx].set_xlabel('Actual Price ($)', fontsize=11)
    axes[idx].set_ylabel('Predicted Price ($)', fontsize=11)
    axes[idx].set_title(f'{model_name}', fontsize=12, fontweight='bold')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 12: Feature Importance Analysis

Let's examine which features are most important for each model.

In [None]:
# Random Forest Feature Importances (top 15)
rf_importances = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': rf_model.feature_importances_
}).sort_values('Importance', ascending=False)

print("\n" + "="*60)
print("TOP 15 FEATURES - RANDOM FOREST")
print("="*60)
print(rf_importances.head(15).to_string(index=False))
print("="*60)

In [None]:
# Visualize Random Forest feature importances (top 15)
top_15_features = rf_importances.head(15)

plt.figure(figsize=(10, 7))
plt.barh(range(len(top_15_features)), top_15_features['Importance'], color='forestgreen', alpha=0.7)
plt.yticks(range(len(top_15_features)), top_15_features['Feature'])
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Top 15 Features - Random Forest', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

In [None]:
# Ridge Regression Coefficients (top 15 by absolute value)
ridge_coefficients = pd.DataFrame({
    'Feature': feature_columns,
    'Coefficient': ridge_model.coef_
})
ridge_coefficients['Abs_Coefficient'] = ridge_coefficients['Coefficient'].abs()
ridge_coefficients = ridge_coefficients.sort_values('Abs_Coefficient', ascending=False)

print("\n" + "="*60)
print("TOP 15 FEATURES - RIDGE REGRESSION (by absolute coefficient)")
print("="*60)
print(ridge_coefficients[['Feature', 'Coefficient']].head(15).to_string(index=False))
print("="*60)

In [None]:
# Lasso Regression Coefficients (only non-zero)
lasso_coefficients = pd.DataFrame({
    'Feature': feature_columns,
    'Coefficient': lasso_model.coef_
})
# .copy() avoids pandas' SettingWithCopyWarning when adding a column below
lasso_coefficients = lasso_coefficients[lasso_coefficients['Coefficient'] != 0].copy()
lasso_coefficients['Abs_Coefficient'] = lasso_coefficients['Coefficient'].abs()
lasso_coefficients = lasso_coefficients.sort_values('Abs_Coefficient', ascending=False)

print("\n" + "="*60)
print(f"NON-ZERO FEATURES - LASSO REGRESSION ({len(lasso_coefficients)} selected)")
print("="*60)
print(lasso_coefficients[['Feature', 'Coefficient']].to_string(index=False))
print("="*60)

## Step 13: Residual Analysis

Analyze the prediction errors (residuals) for each model.

In [None]:
# Calculate residuals
ridge_residuals = y_test - ridge_test_pred
lasso_residuals = y_test - lasso_test_pred
rf_residuals = y_test - rf_test_pred

# Plot residuals
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

residuals_data = [
    ('Ridge Regression', ridge_residuals, ridge_test_pred, 'steelblue'),
    ('Lasso Regression', lasso_residuals, lasso_test_pred, 'coral'),
    ('Random Forest', rf_residuals, rf_test_pred, 'forestgreen')
]

for idx, (model_name, residuals, predictions, color) in enumerate(residuals_data):
    axes[idx].scatter(predictions, residuals, alpha=0.5, color=color, s=30)
    axes[idx].axhline(y=0, color='r', linestyle='--', lw=2)
    axes[idx].set_xlabel('Predicted Price ($)', fontsize=11)
    axes[idx].set_ylabel('Residuals ($)', fontsize=11)
    axes[idx].set_title(f'{model_name}', fontsize=12, fontweight='bold')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Step 14: Make Predictions with All Models

Test all three models by predicting prices for sample cars.

In [None]:
# Create sample cars for prediction
sample_cars = [
    {'year': 2019, 'mileage': 50000, 'distance': 500, 'trim': 'SE', 'state': 'CO'},
    {'year': 2016, 'mileage': 100000, 'distance': 1000, 'trim': 'ST', 'state': 'TX'},
    {'year': 2017, 'mileage': 75000, 'distance': 750, 'trim': 'Titanium', 'state': 'CA'}
]

print("\n" + "="*90)
print("SAMPLE PRICE PREDICTIONS")
print("="*90)

for i, car in enumerate(sample_cars, 1):
    # Create feature vector with one-hot encoding
    new_car = pd.DataFrame(0, index=[0], columns=feature_columns)
    new_car['year'] = car['year']
    new_car['mileage'] = car['mileage']
    new_car['distance'] = car['distance']
    
    # Set one-hot encoded columns
    trim_col = f"trim_{car['trim']}"
    state_col = f"state_{car['state']}"
    
    # Use 1 (not True) so the indicator columns keep a numeric dtype
    if trim_col in new_car.columns:
        new_car[trim_col] = 1
    if state_col in new_car.columns:
        new_car[state_col] = 1
    
    # Make predictions
    new_car_scaled = scaler.transform(new_car)
    
    ridge_pred = ridge_model.predict(new_car_scaled)[0]
    lasso_pred = lasso_model.predict(new_car_scaled)[0]
    rf_pred = rf_model.predict(new_car)[0]
    
    print(f"\nCar #{i}: {car['year']} Ford Fiesta {car['trim']}")
    print(f"  Mileage: {car['mileage']:,} miles")
    print(f"  Location: {car['state']} ({car['distance']} mi. from Denver)")
    print("  " + "-" * 30)
    print(f"  Ridge Prediction:  ${ridge_pred:,.2f}")
    print(f"  Lasso Prediction:  ${lasso_pred:,.2f}")
    print(f"  RF Prediction:     ${rf_pred:,.2f}")
    print(f"  Average:           ${(ridge_pred + lasso_pred + rf_pred) / 3:,.2f}")

print("\n" + "="*90)

## Step 15: Summary and Key Insights

### Model Comparison Summary

**Ridge Regression:**
- Uses L2 regularization to prevent overfitting
- Shrinks all coefficients but keeps all features
- Good baseline for high-dimensional data
- Interpretable coefficients

**Lasso Regression:**
- Uses L1 regularization for automatic feature selection
- Sets some coefficients to exactly zero
- Helps identify the most important features
- Creates a simpler, more interpretable model

**Random Forest:**
- Ensemble of decision trees
- Captures non-linear relationships automatically
- Handles feature interactions well
- More complex but often more accurate

### Key Findings

1. **Most Important Features:**
   - Year, mileage, and distance are typically the strongest predictors
   - Certain trim levels (ST, Titanium) have significant impact
   - Geographic location (state) has varying influence

2. **Model Selection:**
   - If interpretability is key → Use Ridge or Lasso
   - If accuracy is paramount → Use Random Forest
   - If feature selection is needed → Use Lasso

3. **Next Steps:**
   - Hyperparameter tuning (GridSearchCV)
   - Cross-validation for more robust evaluation
   - Feature engineering (interactions, polynomials)
   - Try gradient boosting models (XGBoost, LightGBM)
   - Ensemble methods (combining multiple models)
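
The first two next steps (hyperparameter tuning and cross-validation) could be combined roughly as follows. This is a minimal sketch on synthetic data, assuming a Ridge model wrapped in a `Pipeline` so that scaling is refit inside each fold; the `alpha` grid and the 5-fold setting are arbitrary illustrative choices, and the same pattern would apply to `X_train` and `y_train` from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data
X_demo, y_demo = make_regression(
    n_samples=200, n_features=15, noise=20.0, random_state=0
)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler
pipe = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names the step after the lowercased class name: "ridge"
param_grid = {"ridge__alpha": [0.1, 1.0, 10.0, 100.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
best_mae = -search.best_score_  # scorer is negated MAE, so flip the sign
print(f"Best alpha: {best_alpha}")
print(f"Cross-validated MAE at best alpha: {best_mae:.2f}")
```

Keeping the scaler inside the pipeline means the held-out fold never influences the scaling statistics, which mirrors the train-only fit used in Step 5.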