# Project Milestone Two: Modeling and Feature Engineering

### Due: Midnight on August 3 (with 2-hour grace period) and worth 50 points

### Overview

This milestone builds on your work from Milestone 1 and will complete the coding portion of your project. You will:

1. Pick 3 modeling algorithms from those we have studied.
2. Evaluate baseline models using default settings.
3. Engineer new features and re-evaluate models.
4. Use feature selection techniques and re-evaluate.
5. Fine-tune for optimal performance.
6. Select your best model and report on your results. 

You must do all work in this notebook and upload to your team leader's account in Gradescope. There is no
Individual Assessment for this Milestone. 


In [33]:
# ===================================
# Useful Imports: Add more as needed
# ===================================

# Standard Libraries
import os
import time
import math
import io
import zipfile
import requests
from urllib.parse import urlparse
from itertools import chain, combinations

# Data Science Libraries
import numpy as np
import pandas as pd
import seaborn as sns

# Visualization (seaborn is already imported above)
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import matplotlib.ticker as mticker  # Optional: Format y-axis labels as dollars

# Scikit-learn (Machine Learning)
from sklearn.model_selection import (
    train_test_split, 
    cross_val_score, 
    GridSearchCV, 
    RandomizedSearchCV, 
    RepeatedKFold
)
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.feature_selection import SequentialFeatureSelector, f_regression, SelectKBest
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, GradientBoostingRegressor

# Progress Tracking

from tqdm import tqdm

# =============================
# Global Variables
# =============================
random_state = 42

# =============================
# Utility Functions
# =============================

# Format y-axis labels as dollars with commas (optional)
def dollar_format(x, pos):
    return f'${x:,.0f}'

# Convert seconds to HH:MM:SS format
def format_hms(seconds):
    return time.strftime("%H:%M:%S", time.gmtime(seconds))



### Prelude: Load your Preprocessed Dataset from Milestone 1

In Milestone 1, you handled missing values, encoded categorical features, and explored your data. Before you begin this milestone, you’ll need to load that cleaned dataset and prepare it for modeling. We do **not yet** want the feature-engineered dataset you developed in the last part of Milestone 1; that will come a bit later!

Here’s what to do:

1. Return to your Milestone 1 notebook and rerun your code through Part 3, where your dataset was fully cleaned (assume it’s called `df_cleaned`).

2. **Save** the cleaned dataset to a file by running:

>   df_cleaned.to_csv("zillow_cleaned.csv", index=False)

3. Switch to this notebook and **load** the saved data:

>   df = pd.read_csv("zillow_cleaned.csv")

4. Create a **train/test split** using `train_test_split`.  
   
5. **Standardize** the features (but not the target!) using **only the training data.** This ensures consistency across models without introducing data leakage from the test set:

>   scaler = StandardScaler()   
>   X_train_scaled = scaler.fit_transform(X_train)    
  
**Notes:** 

- You will have to redo the scaling step if you introduce new features (which have to be scaled as well).
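One way to avoid redoing the scaling by hand is to wrap the scaler and the model in a scikit-learn `Pipeline`, so the scaler is refit on the training portion of every CV fold. The sketch below is a minimal illustration under that assumption; the data from `make_regression` and the names `X_demo`, `pipe`, and `cv_demo` are stand-ins, not the Zillow dataset or required notebook variables.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned dataset (assumption, not the real data)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler is fit inside each CV training fold, never on validation rows
pipe = make_pipeline(StandardScaler(), LinearRegression())
cv_demo = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
mae_scores = -cross_val_score(pipe, X_tr, y_tr, cv=cv_demo, scoring="neg_mean_absolute_error")
print(f"Mean CV MAE: {mae_scores.mean():.2f}  Std: {mae_scores.std():.2f}")

# fit() scales and trains in one step; predict() reuses the training-set scaling
pipe.fit(X_tr, y_tr)
test_pred = pipe.predict(X_te)
```

With this design, adding a new feature only requires changing the input DataFrame; the scaling step follows automatically.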


In [34]:
df = pd.read_csv('zillow_cleaned.csv')

# Separate features and target variable
X = df.drop('taxvaluedollarcnt', axis=1)
y = df['taxvaluedollarcnt']

# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

# Standardize the features using only the training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"\nTraining set shape: {X_train_scaled.shape}")
print(f"Test set shape: {X_test_scaled.shape}")



Training set shape: (60820, 23)
Test set shape: (15206, 23)


### Part 1: Picking Three Models and Establishing Baselines [6 pts]

Apply the following regression models to the scaled training dataset using **default parameters** for **three** of the models we have worked with this term:

- Linear Regression
- Ridge Regression
- Lasso Regression
- Decision Tree Regression
- Bagging
- Random Forest
- Gradient Boosting Trees

For each of the three models:
- Use **repeated cross-validation** (e.g., 5 folds, 5 repeats).
- Report the **mean and standard deviation of CV MAE Score**. 
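The evaluation step is the same for every model, so it can be factored into a small helper. The following is a sketch only: `cv_mae` is a hypothetical helper name, and the synthetic data stands in for the scaled training set.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score


def cv_mae(model, X, y, cv):
    """Return (mean, std) of MAE across all repeated-CV folds."""
    scores = -cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    return scores.mean(), scores.std()


# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=42)
cv_demo = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

results = {}
for name, model in {"Linear Regression": LinearRegression(), "Ridge": Ridge()}.items():
    mean_mae, std_mae = cv_mae(model, X_demo, y_demo, cv_demo)
    results[name] = {"mean_mae": mean_mae, "std_mae": std_mae}
    print(f"{name}: MAE = {mean_mae:.2f} std = {std_mae:.2f}")
```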


In [35]:
# Define repeated cross-validation
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=random_state)

# Store results
baseline_results = {}


In [36]:
# Model 1: Linear Regression
model = LinearRegression()

# Perform repeated cross-validation using MAE scoring
cv_scores = -cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)

# Calculate statistics
mean_mae = cv_scores.mean()
std_mae = cv_scores.std()

baseline_results['Linear Regression'] = {
    'mean_mae': mean_mae,
    'std_mae': std_mae,
    'cv_scores': cv_scores
}

print(f"Mean CV MAE: ${mean_mae:,.0f}")
print(f"Std CV MAE:  ${std_mae:,.0f}")


Mean CV MAE: $193,975
Std CV MAE:  $1,559


In [37]:
# Model 2: Random Forest
model = RandomForestRegressor(random_state=random_state)

# Perform repeated cross-validation using MAE scoring
cv_scores = -cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)

# Calculate statistics
mean_mae = cv_scores.mean()
std_mae = cv_scores.std()

baseline_results['Random Forest'] = {
    'mean_mae': mean_mae,
    'std_mae': std_mae,
    'cv_scores': cv_scores
}

print(f"Mean CV MAE: ${mean_mae:,.0f}")
print(f"Std CV MAE:  ${std_mae:,.0f}")


Mean CV MAE: $165,687
Std CV MAE:  $1,344


In [38]:
# Model 3: Gradient Boosting
model = GradientBoostingRegressor(random_state=random_state)

# Perform repeated cross-validation using MAE scoring
cv_scores = -cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)

# Calculate statistics
mean_mae = cv_scores.mean()
std_mae = cv_scores.std()

baseline_results['Gradient Boosting'] = {
    'mean_mae': mean_mae,
    'std_mae': std_mae,
    'cv_scores': cv_scores
}

print(f"Mean CV MAE: ${mean_mae:,.0f}")
print(f"Std CV MAE:  ${std_mae:,.0f}")




Mean CV MAE: $170,862
Std CV MAE:  $1,259


In [39]:
# Summary of Baseline Results
print("BASELINE RESULTS SUMMARY:")

for model_name, results in baseline_results.items():
    print(f"{model_name}: MAE = ${results['mean_mae']:,.0f} std = ${results['std_mae']:,.0f}")
    
# Find best performing model
best_model = min(baseline_results.keys(), key=lambda x: baseline_results[x]['mean_mae'])
print(f"\nBest baseline model: {best_model}")
print(f"Best MAE: ${baseline_results[best_model]['mean_mae']:,.0f}")

BASELINE RESULTS SUMMARY:
Linear Regression: MAE = $193,975 std = $1,559
Random Forest: MAE = $165,687 std = $1,344
Gradient Boosting: MAE = $170,862 std = $1,259

Best baseline model: Random Forest
Best MAE: $165,687


### Part 1: Discussion [3 pts]

In a paragraph or well-organized set of bullet points, briefly compare and discuss:

  - Which model performed best overall?
  - Which was most stable (lowest std)?
  - Any signs of overfitting or underfitting?

**Part 1 Analysis:**

• **Best Overall Performance**: Random Forest achieved the lowest mean CV MAE ($165,687), about $5,200 lower than Gradient Boosting ($170,862) and about $28,300 lower than Linear Regression ($193,975). Both gaps are several times larger than the standard deviations of the fold scores, so the ranking is unlikely to be an artifact of the CV splits. This suggests that an ensemble of decision trees is well suited to the non-linear relationships in housing price data.

• **Most Stable Model**: Gradient Boosting had the lowest standard deviation ($1,259), followed by Random Forest ($1,344) and Linear Regression ($1,559). The differences are small: in every case the standard deviation is under 1% of the mean MAE, so all three models are stable across folds.

• **Signs of Model Behavior**:
  - **Linear Regression**: The much higher MAE is consistent with underfitting. A purely linear model cannot capture non-linear effects or interactions among square footage, location, and amenities.
  - **Random Forest**: Strong performance suggests that tree-based ensemble methods capture feature interactions and non-linear patterns in real estate data without explicit feature engineering.
  - **Gradient Boosting**: Slightly worse than Random Forest, possibly because the default parameters are not optimized for this dataset.
  - **Overfitting**: Only validation scores were recorded, so overfitting cannot be judged from these results alone; that would require comparing training and validation MAE.

The results indicate that tree-based ensemble methods are more appropriate for this housing price prediction task than an untransformed linear model, likely due to interactions between location, property characteristics, and market factors.

  - Which model performed best overall?
    - The Random Forest model performed the best as it had the lowest mean MAE at $165,687.
  - Which was most stable (lowest std)?
    - The most stable model was the Gradient Boosting model, with a std of $1,259.
  - Any signs of overfitting or underfitting?
    - The stds are close to one another, which shows that all three models are stable across folds. Stability alone does not rule out overfitting or underfitting, because only validation scores were recorded. Linear Regression's much higher MAE suggests it underfits relative to the tree-based models.
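A direct check for overfitting compares training and validation error within the same folds. The sketch below uses `cross_validate` with `return_train_score=True` on synthetic data; the dataset and the reduced `n_estimators=30` are assumptions chosen to keep the example fast, not settings from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=42)

cv_out = cross_validate(
    RandomForestRegressor(n_estimators=30, random_state=42),
    X_demo,
    y_demo,
    cv=KFold(n_splits=5, shuffle=True, random_state=42),
    scoring="neg_mean_absolute_error",
    return_train_score=True,  # also score each model on its own training fold
)

train_mae = -cv_out["train_score"].mean()
val_mae = -cv_out["test_score"].mean()
print(f"Train MAE: {train_mae:.2f}  Validation MAE: {val_mae:.2f}")
```

A validation MAE far above the training MAE points toward overfitting; two similar but high values point toward underfitting.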

### Part 2: Feature Engineering [6 pts]

Pick **at least three new features** based on your Milestone 1, Part 5, results. You may pick new ones or
use the same ones you chose for Milestone 1. 

Add these features to `X_train` (use your code and/or files from Milestone 1) and then:
- Scale using `StandardScaler` 
- Re-run the 3 models listed above (using default settings and repeated cross-validation again).
- Report the **mean and standard deviation of CV MAE Scores**.  


In [40]:
# Start with copies for feature engineering
X_train_fe = X_train.copy()
X_test_fe = X_test.copy()

# Log transform the most important size feature
X_train_fe['log_sqft'] = np.log1p(X_train_fe['calculatedfinishedsquarefeet'])
X_test_fe['log_sqft'] = np.log1p(X_test_fe['calculatedfinishedsquarefeet'])

# Create interaction between size and bathroom amenities
X_train_fe['sqft_x_bathroom'] = X_train_fe['calculatedfinishedsquarefeet'] * X_train_fe['bathroomcnt']
X_test_fe['sqft_x_bathroom'] = X_test_fe['calculatedfinishedsquarefeet'] * X_test_fe['bathroomcnt']

# House age
current_year = 2025
X_train_fe['house_age'] = current_year - X_train_fe['yearbuilt']
X_test_fe['house_age'] = current_year - X_test_fe['yearbuilt']

# Handle any infinite or NaN values that might have been created
X_train_fe = X_train_fe.replace([np.inf, -np.inf], np.nan)
X_test_fe = X_test_fe.replace([np.inf, -np.inf], np.nan)

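# Fill any remaining NaN values with the median from the training set
for col in X_train_fe.columns:
    if X_train_fe[col].isnull().any():
        median_val = X_train_fe[col].median()
        # Assign back instead of inplace=True on a column selection,
        # which may not update the DataFrame in recent pandas versions
        X_train_fe[col] = X_train_fe[col].fillna(median_val)
        X_test_fe[col] = X_test_fe[col].fillna(median_val)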

# Scale all features including new ones
scaler_fe = StandardScaler()
X_train_fe_scaled = scaler_fe.fit_transform(X_train_fe)
X_test_fe_scaled = scaler_fe.transform(X_test_fe)

# Total features after feature engineering
print(f"Total features: {X_train_fe.shape[1]}")

# List the 3 new features created
print("Three new engineered features: 'log_sqft', 'sqft_x_bathroom', 'house_age'")


Total features: 26
Three new engineered features: 'log_sqft', 'sqft_x_bathroom', 'house_age'


In [41]:
# Store results for comparison
feature_eng_results = {}


In [42]:
# Model 1: Linear Regression with Engineered Features
model = LinearRegression()

# Perform repeated cross-validation using MAE scoring
cv_scores = -cross_val_score(model, X_train_fe_scaled, y_train, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)

# Calculate statistics
mean_mae = cv_scores.mean()
std_mae = cv_scores.std()

feature_eng_results['Linear Regression'] = {
    'mean_mae': mean_mae,
    'std_mae': std_mae,
}

print(f"Mean CV MAE: ${mean_mae:,.0f}")
print(f"Std CV MAE:  ${std_mae:,.0f}")

# Compare with baseline
baseline_mae = baseline_results['Linear Regression']['mean_mae']
print(f"Baseline MAE: ${baseline_mae:,.0f}")
print(f"Engineered MAE: ${mean_mae:,.0f}")


Mean CV MAE: $193,069
Std CV MAE:  $1,581
Baseline MAE: $193,975
Engineered MAE: $193,069


In [43]:
# Model 2: Random Forest with Engineered Features
model = RandomForestRegressor(random_state=random_state)

# Perform repeated cross-validation using MAE scoring
cv_scores = -cross_val_score(model, X_train_fe_scaled, y_train, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)

# Calculate statistics  
mean_mae = cv_scores.mean()
std_mae = cv_scores.std()

feature_eng_results['Random Forest'] = {
    'mean_mae': mean_mae,
    'std_mae': std_mae,
}

print(f"Mean CV MAE: ${mean_mae:,.0f}")
print(f"Std CV MAE:  ${std_mae:,.0f}")

# Compare with baseline
baseline_mae = baseline_results['Random Forest']['mean_mae']
print(f"  Baseline MAE: ${baseline_mae:,.0f}")
print(f"  Engineered MAE: ${mean_mae:,.0f}")




Mean CV MAE: $165,881
Std CV MAE:  $1,279
  Baseline MAE: $165,687
  Engineered MAE: $165,881


In [44]:
# Model 3: Gradient Boosting with Engineered Features
model = GradientBoostingRegressor(random_state=random_state)

# Perform repeated cross-validation using MAE scoring
cv_scores = -cross_val_score(model, X_train_fe_scaled, y_train, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)

# Calculate statistics
mean_mae = cv_scores.mean()
std_mae = cv_scores.std()

feature_eng_results['Gradient Boosting'] = {
    'mean_mae': mean_mae,
    'std_mae': std_mae,
}

print(f"  Mean CV MAE: ${mean_mae:,.0f}")
print(f"  Std CV MAE:  ${std_mae:,.0f}")

# Compare with baseline
baseline_mae = baseline_results['Gradient Boosting']['mean_mae']
print(f"  Baseline MAE: ${baseline_mae:,.0f}")
print(f"  Engineered MAE: ${mean_mae:,.0f}")


  Mean CV MAE: $170,939
  Std CV MAE:  $1,236
  Baseline MAE: $170,862
  Engineered MAE: $170,939


In [45]:
# Summary: Baseline vs Feature Engineered Performance
print(f"{'Model':<20} {'Baseline MAE':<15} {'Engineered MAE':<15} {'Improvement':<12}")
print("-" * 62)

for model_name in ['Linear Regression', 'Random Forest', 'Gradient Boosting']:
    baseline_mae = baseline_results[model_name]['mean_mae']
    engineered_mae = feature_eng_results[model_name]['mean_mae']
    improvement = baseline_mae - engineered_mae
    
    print(f"{model_name:<20} ${baseline_mae:<14,.0f} ${engineered_mae:<14,.0f} ${improvement:<11,.0f}")

# Find best feature engineered model
best_fe_model = min(feature_eng_results.keys(), key=lambda x: feature_eng_results[x]['mean_mae'])
print(f"\nBest feature engineered model: {best_fe_model}")
print(f"Best MAE: ${feature_eng_results[best_fe_model]['mean_mae']:,.0f}")


Model                Baseline MAE    Engineered MAE  Improvement 
--------------------------------------------------------------
Linear Regression    $193,975        $193,069        $906        
Random Forest        $165,687        $165,881        $-194       
Gradient Boosting    $170,862        $170,939        $-77        

Best feature engineered model: Random Forest
Best MAE: $165,881


### Part 2: Discussion [3 pts]

Reflect on the impact of your new features:

- Did any models show notable improvement in performance?

- Which new features seemed to help — and in which models?

- Do you have any hypotheses about why a particular feature helped (or didn’t)?




**Part 2 Analysis - Feature Engineering Impact:**

• **Notable Model Improvements**: Only Linear Regression improved, and only slightly: its mean CV MAE fell by $906 (from $193,975 to $193,069), which is less than one standard deviation of its fold scores ($1,581). Random Forest and Gradient Boosting were marginally worse ($194 and $77 higher MAE), differences well within fold-to-fold noise. None of the changes is notable.

• **Feature Impact Analysis**: The three features were added together, so their individual contributions were not measured. The points below are hypotheses.
  - **`log_sqft`**: The log transformation reduces the right skew in house sizes, which may make the relationship with price more linear. This can help Linear Regression but gives tree-based models nothing new, because tree splits are unaffected by monotonic transformations of a feature.
  - **`sqft_x_bathroom`**: This interaction feature captures the combined effect of house size and bathroom count, which a linear model cannot represent from the two original columns alone.
  - **`house_age`**: This is a linear function of `yearbuilt` (2025 minus the year), so it is easier to read but carries no information the models did not already have.

• **Model-Specific Insights**:
  - **Linear Regression**: Showed the only improvement, consistent with linear models benefiting most from transformed and interaction features.
  - **Random Forest**: No improvement, which suggests it was already capturing these relationships through its tree-based splits.
  - **Gradient Boosting**: No improvement, for the same reason as Random Forest.

• **Hypotheses for the Small Effect**: All three engineered features are derived from columns already in the dataset (`calculatedfinishedsquarefeet`, `bathroomcnt`, `yearbuilt`). They reshape existing information and add none, so the tree-based models gain nothing and the linear model gains only a little.

- Did any models show notable improvement in performance?
    - None of the models showed notable improvements but the Linear Regression model did show the most improvement. This could mean that the engineered features are capturing relationships that the original features could not.
- Which new features seemed to help — and in which models?
    - The log_sqft feature seemed to help as it normalized the skewness of the house sizes making it easier for Linear Regression to model.
- Do you have any hypotheses about why a particular feature helped (or didn’t)?
    - I do not believe the house_age feature helped much as it does not add additional information. The models can learn that newer houses cost more. The feature helps humans more intuitively see the data.
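The point about `house_age` can be checked numerically. The sketch below uses a handful of hypothetical construction years (not rows from the dataset) and the same `2025 - yearbuilt` formula as the notebook. After standardization the two columns are exact negatives of each other.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical construction years, used only for illustration
demo = pd.DataFrame({"yearbuilt": [1950, 1975, 1990, 2005, 2018]})
demo["house_age"] = 2025 - demo["yearbuilt"]

scaled = StandardScaler().fit_transform(demo)
corr = np.corrcoef(demo["yearbuilt"], demo["house_age"])[0, 1]

print(f"Correlation: {corr:.3f}")
print("Scaled columns are exact negatives:", np.allclose(scaled[:, 0], -scaled[:, 1]))
```

Because the columns are perfectly collinear, a linear model gains nothing from having both.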

### Part 3: Feature Selection [6 pts]

Using the full set of features (original + engineered):
- Apply **feature selection** methods to investigate whether you can improve performance.
  - You may use forward selection, backward selection, or feature importance from tree-based models.
- For each model, identify the **best-performing subset of features**.
- Re-run each model using only those features (with default settings and repeated cross-validation again).
- Report the **mean and standard deviation of CV MAE Scores**.  
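Forward selection is expensive for ensemble models: choosing 13 of 26 features with 5-fold CV requires on the order of 1,300 model fits. Selecting by tree-based feature importance, which the list above permits, needs only one fit. The sketch below uses `SelectFromModel` on synthetic data; the data, the feature names `f0` to `f9`, and the choice of keeping the top 5 are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only 3 of them informative (assumption)
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=42
)
names = [f"f{i}" for i in range(X_demo.shape[1])]

# threshold=-np.inf disables the importance cutoff, so exactly
# max_features columns with the highest importances are kept
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=50, random_state=42),
    max_features=5,
    threshold=-np.inf,
)
selector.fit(X_demo, y_demo)

kept = [n for n, keep in zip(names, selector.get_support()) if keep]
X_selected = selector.transform(X_demo)
print("Kept features:", kept)
print("Selected shape:", X_selected.shape)
```

Impurity-based importances can be biased toward high-cardinality features, so the selected subset should still be confirmed with repeated CV.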


In [46]:
# Store feature selection results
feature_selection_results = {}

# Get feature names
feature_names = list(X_train_fe.columns)
engineered_features = ['log_sqft', 'sqft_x_bathroom', 'house_age']


In [47]:
# Create the model for feature selection
lr_model = LinearRegression()

# Sequential Forward Selection with cross-validation
sfs_lr = SequentialFeatureSelector(
    lr_model, 
    n_features_to_select=13,
    direction='forward',
    scoring='neg_mean_absolute_error',
    cv=5,  # Use 5-fold CV for selection
    n_jobs=-1,
)

# Fit the selector
sfs_lr.fit(X_train_fe_scaled, y_train)

# Get selected features
selected_indices_lr = sfs_lr.get_support(indices=True)
selected_features_lr = [feature_names[i] for i in selected_indices_lr]

# Transform the data using selected features
X_train_lr_selected = sfs_lr.transform(X_train_fe_scaled)
X_test_lr_selected = sfs_lr.transform(X_test_fe_scaled)

print(f"\nTransformed data shape: {X_train_lr_selected.shape}")

# Evaluate with repeated cross-validation
print("\nEvaluating Linear Regression with selected features")
cv_scores_lr = -cross_val_score(lr_model, X_train_lr_selected, y_train, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)

mean_mae_lr = cv_scores_lr.mean()
std_mae_lr = cv_scores_lr.std()

# Store results
feature_selection_results['Linear Regression'] = {
    'mean_mae': mean_mae_lr,
    'std_mae': std_mae_lr,
    'n_features': len(selected_features_lr),
    'features': selected_features_lr,
    'feature_indices': selected_indices_lr,
    'engineered_included': [f for f in selected_features_lr if f in engineered_features]
}

print(f"\nLinear Regression with Forward Selection:")
print(f"Features selected: {selected_features_lr}")
print(f"Mean CV MAE: ${mean_mae_lr:,.0f}")
print(f"Std CV MAE:  ${std_mae_lr:,.0f}")
print(f"Engineered features included: {[f for f in selected_features_lr if f in engineered_features]}")



Transformed data shape: (60820, 13)

Evaluating Linear Regression with selected features

Linear Regression with Forward Selection:
Features selected: ['bedroomcnt', 'buildingqualitytypeid', 'finishedsquarefeet12', 'garagecarcnt', 'garagetotalsqft', 'heatingorsystemtypeid', 'latitude', 'longitude', 'lotsizesquarefeet', 'regionidcounty', 'roomcnt', 'numberofstories', 'sqft_x_bathroom']
Mean CV MAE: $192,707
Std CV MAE:  $1,585
Engineered features included: ['sqft_x_bathroom']


In [None]:
# Create the model for feature selection
rf_model = RandomForestRegressor(random_state=random_state)

# Sequential Forward Selection for Random Forest
sfs_rf = SequentialFeatureSelector(
    rf_model,
    n_features_to_select=13,
    direction='forward',
    scoring='neg_mean_absolute_error',
    cv=5,
    n_jobs=-1
)

# Fit the selector
sfs_rf.fit(X_train_fe_scaled, y_train)

# Get selected features
selected_indices_rf = sfs_rf.get_support(indices=True)
selected_features_rf = [feature_names[i] for i in selected_indices_rf]

# Transform the data using selected features
X_train_rf_selected = sfs_rf.transform(X_train_fe_scaled)
X_test_rf_selected = sfs_rf.transform(X_test_fe_scaled)

print(f"\nTransformed data shape: {X_train_rf_selected.shape}")

# Evaluate with repeated cross-validation
print("\nEvaluating Random Forest with selected features")
cv_scores_rf = -cross_val_score(rf_model, X_train_rf_selected, y_train, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)

mean_mae_rf = cv_scores_rf.mean()
std_mae_rf = cv_scores_rf.std()

# Store results
feature_selection_results['Random Forest'] = {
    'mean_mae': mean_mae_rf,
    'std_mae': std_mae_rf,
    'n_features': len(selected_features_rf),
    'features': selected_features_rf,
    'feature_indices': selected_indices_rf,
    'engineered_included': [f for f in selected_features_rf if f in engineered_features]
}

print(f"Random Forest with Forward Selection:")
print(f"Features selected: {selected_features_rf}")
print(f"Mean CV MAE: ${mean_mae_rf:,.0f}")
print(f"Std CV MAE:  ${std_mae_rf:,.0f}")
print(f"Engineered features included: {[f for f in selected_features_rf if f in engineered_features]}")


In [None]:
# Create the model for feature selection
gb_model = GradientBoostingRegressor(random_state=random_state)

# Sequential Forward Selection for Gradient Boosting
sfs_gb = SequentialFeatureSelector(
    gb_model,
    n_features_to_select=13,
    direction='forward',
    scoring='neg_mean_absolute_error',
    cv=5,
    n_jobs=-1
)

# Fit the selector
sfs_gb.fit(X_train_fe_scaled, y_train)

# Get selected features
selected_indices_gb = sfs_gb.get_support(indices=True)
selected_features_gb = [feature_names[i] for i in selected_indices_gb]

# Transform the data using selected features
X_train_gb_selected = sfs_gb.transform(X_train_fe_scaled)
X_test_gb_selected = sfs_gb.transform(X_test_fe_scaled)

print(f"\nTransformed data shape: {X_train_gb_selected.shape}")

# Evaluate with repeated cross-validation
print("\nEvaluating Gradient Boosting with selected features")
cv_scores_gb = -cross_val_score(gb_model, X_train_gb_selected, y_train, cv=cv, scoring='neg_mean_absolute_error', n_jobs=-1)

mean_mae_gb = cv_scores_gb.mean()
std_mae_gb = cv_scores_gb.std()

# Store results
feature_selection_results['Gradient Boosting'] = {
    'mean_mae': mean_mae_gb,
    'std_mae': std_mae_gb,
    'n_features': len(selected_features_gb),
    'features': selected_features_gb,
    'feature_indices': selected_indices_gb,
    'engineered_included': [f for f in selected_features_gb if f in engineered_features]
}

print(f"\nGradient Boosting with Forward Selection:")
print(f"  Features selected: {selected_features_gb}")
print(f"  Mean CV MAE: ${mean_mae_gb:,.0f}")
print(f"  Std CV MAE:  ${std_mae_gb:,.0f}")
print(f"  Engineered features included: {[f for f in selected_features_gb if f in engineered_features]}")



Transformed data shape: (60820, 13)

Evaluating Gradient Boosting with selected features

Gradient Boosting with Forward Selection:
  Features selected: ['airconditioningtypeid', 'buildingqualitytypeid', 'calculatedfinishedsquarefeet', 'finishedsquarefeet12', 'latitude', 'longitude', 'lotsizesquarefeet', 'propertylandusetypeid', 'roomcnt', 'yearbuilt', 'assessmentyear', 'log_sqft', 'house_age']
  Mean CV MAE: $170,919
  Std CV MAE:  $1,204
  Engineered features included: ['log_sqft', 'house_age']


### Part 3: Discussion [3 pts]

Analyze the effect of feature selection on your models:

- Did performance improve for any models after reducing the number of features?

- Which features were consistently retained across models?

- Were any of your newly engineered features selected as important?


**Part 3 Analysis - Feature Selection Impact:**

• **Performance After Feature Selection**: Forward selection cut the feature set from 26 to 13 with essentially no change in performance. Linear Regression's mean CV MAE went from $193,069 to $192,707 and Gradient Boosting's from $170,939 to $170,919. Both differences are far smaller than the fold standard deviations (about $1,600 and $1,200). The Random Forest selection cell has no recorded output, so its result is not reported here. The benefit is a model with half as many features at equal accuracy, not a lower MAE.

• **Consistently Retained Features**: Six features were selected by both Linear Regression and Gradient Boosting:
  - **Core size/space features**: `finishedsquarefeet12`, `lotsizesquarefeet`, `roomcnt`
  - **Location indicators**: `latitude`, `longitude`
  - **Property quality metrics**: `buildingqualitytypeid`

• **Engineered Features Recognition**: Each engineered feature was selected by one model, and none by both:
  - **`log_sqft`**: Selected by Gradient Boosting only, alongside the raw `calculatedfinishedsquarefeet`
  - **`sqft_x_bathroom`**: Selected by Linear Regression only, where it was the sole engineered feature retained
  - **`house_age`**: Selected by Gradient Boosting only, together with `yearbuilt`, not in place of it

• **Model-Specific Selection Patterns**:
  - **Linear Regression**: Kept the interaction term plus structural features such as `bedroomcnt`, `garagecarcnt`, `garagetotalsqft`, `heatingorsystemtypeid`, `regionidcounty`, and `numberofstories`
  - **Random Forest**: No recorded output
  - **Gradient Boosting**: Kept redundant pairs (`yearbuilt` with `house_age`, `calculatedfinishedsquarefeet` with `log_sqft`). This may mean the later picks were made among features with near-zero marginal gain.

• **Key Insights**: Feature selection identified a smaller subset that performs as well as the full set, which should reduce training and prediction cost. It did not validate the engineered features as a group: `sqft_x_bathroom` appears useful to Linear Regression, while the Gradient Boosting selections of `log_sqft` and `house_age` duplicate information already present in the original columns.
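The overlap between the selected subsets can be computed directly. The lists below are copied from the printed Linear Regression and Gradient Boosting outputs above; Random Forest is omitted because its selection cell has no recorded output. The variable names `selected_lr`, `selected_gb`, and `common` are illustrative.

```python
# Feature lists copied from the forward-selection outputs above
selected_lr = [
    "bedroomcnt", "buildingqualitytypeid", "finishedsquarefeet12", "garagecarcnt",
    "garagetotalsqft", "heatingorsystemtypeid", "latitude", "longitude",
    "lotsizesquarefeet", "regionidcounty", "roomcnt", "numberofstories",
    "sqft_x_bathroom",
]
selected_gb = [
    "airconditioningtypeid", "buildingqualitytypeid", "calculatedfinishedsquarefeet",
    "finishedsquarefeet12", "latitude", "longitude", "lotsizesquarefeet",
    "propertylandusetypeid", "roomcnt", "yearbuilt", "assessmentyear",
    "log_sqft", "house_age",
]
engineered = {"log_sqft", "sqft_x_bathroom", "house_age"}

# Features retained by both models
common = sorted(set(selected_lr) & set(selected_gb))
print("Retained by both:", common)
print("Engineered features retained by both:", sorted(engineered & set(common)))
```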


### Part 4: Fine-Tuning Your Three Models [6 pts]

In this final phase of Milestone 2, you’ll select and refine your **three most promising models and their corresponding data pipelines** based on everything you've done so far, and pick a winner!

1. For each of your three models:
    - Choose your best engineered features and best selection of features as determined above.
    - Perform hyperparameter tuning using `sweep_parameters`, `GridSearchCV`, `RandomizedSearchCV`, `Optuna`, etc., as you have practiced in previous homeworks.
2. Decide on the best hyperparameters for each model, then run each tuned model with repeated CV and record the final results:
    - Report the **mean and standard deviation of CV MAE Score**.
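As one possible starting point, the sketch below runs `RandomizedSearchCV` over a small Gradient Boosting search space. The parameter ranges, `n_iter=5`, the 3-fold CV, and the synthetic data are assumptions chosen to keep the example fast. They are not recommended settings for the Zillow data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, RandomizedSearchCV

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Illustrative search space; widen or narrow it based on early results
param_distributions = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.03, 0.1, 0.3],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,  # number of sampled parameter combinations
    scoring="neg_mean_absolute_error",
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(X_demo, y_demo)

best_mae = -search.best_score_
print("Best params:", search.best_params_)
print(f"Best CV MAE: {best_mae:.2f}")
```

After the search, the chosen parameters can be re-scored with the same repeated CV used in Parts 1 to 3 so the results stay comparable.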

In [None]:
# Add as many cells as you need


### Part 4: Discussion [3 pts]

Reflect on your tuning process and final results:

- What was your tuning strategy for each model? Why did you choose those hyperparameters?
- Did you find that certain types of preprocessing or feature engineering worked better with specific models?


> Your text here

### Part 5: Final Model and Design Reassessment [6 pts]

In this part, you will finalize your best-performing model. You’ll also consolidate and present the key code used to run your model on the preprocessed dataset.

**Requirements:**

- Decide on your final model among the three contestants.

- Below, include all code necessary to **run your final model** on the processed dataset, reporting

    - Mean and standard deviation of CV MAE Score.
    
    - Test score on held-out test set. 
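The two reported numbers can be produced as in the sketch below: repeated-CV MAE on the training set, then a single evaluation on the held-out test set. The model choice, the reduced `n_estimators=50`, and the synthetic data are placeholders for illustration, not the final pipeline.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RepeatedKFold, cross_val_score, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

final_model = RandomForestRegressor(n_estimators=50, random_state=42)

# 1) Repeated CV on the training set only
cv_demo = RepeatedKFold(n_splits=5, n_repeats=2, random_state=42)
cv_scores = -cross_val_score(
    final_model, X_tr, y_tr, cv=cv_demo, scoring="neg_mean_absolute_error"
)
print(f"Mean CV MAE: {cv_scores.mean():.2f}  Std: {cv_scores.std():.2f}")

# 2) Refit on the full training set and score once on the held-out test set
final_model.fit(X_tr, y_tr)
test_mae = mean_absolute_error(y_te, final_model.predict(X_te))
print(f"Test MAE: {test_mae:.2f}")
```

The test set is touched only once, after all model and hyperparameter decisions are final, so the test MAE remains an unbiased estimate.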




In [None]:
# Add as many cells as you need


### Part 5: Discussion [8 pts]

In this final step, your goal is to synthesize your entire modeling process and assess how your earlier decisions influenced the outcome. Please address the following:

1. Model Selection:
- Clearly state which model you selected as your final model and why.

- What metrics or observations led you to this decision?

- Were there trade-offs (e.g., interpretability vs. performance) that influenced your choice?

2. Revisiting an Early Decision:

- Identify one specific preprocessing or feature engineering decision from Milestone 1 (e.g., how you handled missing values, how you scaled or encoded a variable, or whether you created interaction or polynomial terms).

- Explain the rationale for that decision at the time: What were you hoping it would achieve?

- Now that you've seen the full modeling pipeline and final results, reflect on whether this step helped or hindered performance. Did you keep it, modify it, or remove it?

- Justify your final decision with evidence—such as validation scores, visualizations, or model diagnostics.

3. Lessons Learned:

- What insights did you gain about your dataset or your modeling process through this end-to-end workflow?

- If you had more time or data, what would you explore next?