### Reflection on Milestone 1: Redundant Feature Groups

During Milestone 1, we examined several highly correlated feature groups, particularly the square footage (`finishedsquarefeet*`) and bathroom count variants. Although we visualized their correlations, we retained most of these features in the final dataset.

In retrospect, consolidating each group into a single representative feature, or engineering a composite (e.g., an average or a PCA component), could have reduced dimensionality and noise and might have improved generalization. Gradient Boosting handled the redundancy well, since tree-based models are generally robust to correlated inputs, but consolidation remains an opportunity to simplify the model and make it easier to interpret in future iterations.
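
As a rough illustration of the consolidation idea, the sketch below keeps one representative column from each highly correlated group. It is not what we ran in Milestone 1: the helper `drop_highly_correlated`, the 0.95 threshold, and the synthetic data are all assumptions made for the example, and only the column names are borrowed from our dataset.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Keep the first column of each highly correlated group (hypothetical helper)."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair of columns is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    # Drop a column if it correlates above the threshold with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Synthetic stand-in data (assumption); two near-duplicate square footage columns.
rng = np.random.default_rng(0)
base = rng.normal(1800, 400, size=200)
demo = pd.DataFrame({
    "finishedsquarefeet12": base,
    "finishedsquarefeet15": base + rng.normal(0, 5, size=200),
    "bedroomcnt": rng.integers(1, 6, size=200).astype(float),
})

reduced = drop_highly_correlated(demo)
print(list(reduced.columns))
```

Which column survives depends on column order, so in practice the most complete or most interpretable variant should be placed first. Averaging the group or projecting it onto a PCA component would be alternative ways to build a composite.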


### Reflection on Milestone 2: Evaluation Strategy

In Milestone 2, we evaluated model performance using a single train-test split and reported RMSE. While this approach was practical and consistent, using a cross-validation strategy like K-Fold would have produced more stable and representative performance estimates.

Cross-validation reduces the risk that a reported score depends on one particular split, and it would have shown us how much RMSE varies from fold to fold. This is especially valuable with a moderately sized dataset like ours.
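
A minimal sketch of what a K-Fold evaluation could look like is shown below. It uses synthetic data from `make_regression` as a stand-in for our features and target, and the choice of five folds is an assumption for illustration rather than a setting we tested.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the project's features and target (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

# Shuffle before splitting so that fold membership does not depend on row order.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingRegressor(n_estimators=25, random_state=42)

# scikit-learn scorers follow "higher is better", so MSE is returned negated.
neg_mse = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error")
fold_rmse = np.sqrt(-neg_mse)

print(f"RMSE per fold: {np.round(fold_rmse, 2)}")
print(f"Mean RMSE: {fold_rmse.mean():.2f} (std {fold_rmse.std():.2f})")
```

Reporting the mean together with the standard deviation across folds gives a sense of how sensitive the estimate is to the split, which a single train-test RMSE cannot show.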


### Final Model: Gradient Boosting Pipeline

The following code reproduces the full modeling pipeline using our best-performing model, Gradient Boosting. It integrates the key preprocessing steps and is informed by the design lessons identified during this project.


In [None]:
# ===================== Final Model Pipeline: Gradient Boosting =====================
# This code fulfills Final Report Part A requirement 3.
# It runs our selected model using the cleaned data and reflects our Milestone 1 & 2 learnings.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import numpy as np

# Assume df_encoded is our preprocessed feature set from Milestone 1
X = df_encoded.drop(columns='taxvaluedollarcnt')
y = df_encoded['taxvaluedollarcnt']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

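# Apply standard scaling, fitting on the training set only to avoid leakage.
# Tree-based models such as Gradient Boosting are insensitive to feature
# scaling, so this step is optional here.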
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Gradient Boosting model
final_model = GradientBoostingRegressor(n_estimators=25, random_state=42)
final_model.fit(X_train_scaled, y_train)

# Predict and evaluate
y_pred = final_model.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Final Gradient Boosting RMSE: {rmse:.2f}")
