# Day 4 – Model Selection & Cross Validation

Yesterday, we engineered powerful features and saved them to encoded_data.csv. Today, we'll use that dataset to evaluate three different regression models and find the most reliable one using robust validation techniques.

Objective:

- Load the pre-engineered data from Day 3.

- Evaluate Linear Regression, k-Nearest Neighbors (kNN), and Polynomial Regression.

- Compare three validation strategies: Train-Test Split, Train-Validation-Test Split, and K-Fold Cross-Validation.


## Load & Prepare Data

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Load the engineered dataset from Day 3
df = pd.read_csv('/Users/amrithaa/Downloads/12DayDataSeries/notebooks/data/encoded_data.csv')


# Define the features (X) and target (y)
# We select the most impactful engineered features
features = [
    'crew_size',
    'mental_health_score',
    'risk_score',
    'mission_type_Rescue',
    'mission_type_Resupply',
    'duration_bin_medium',
    'duration_bin_long'
]
target = 'mission_success_score'

X = df[features]
y = df[target]

# Note: The numerical features were already scaled with StandardScaler in Day 3.
# We'll proceed directly to modeling.

print("Shape of loaded features (X):", X.shape)


Shape of loaded features (X): (425, 7)
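
The note in the cell above relies on Day 3 having scaled the numeric columns over the whole dataset. As a hedged alternative, the sketch below places `StandardScaler` inside a pipeline, so the scaler is refit on the training rows of each fold only. This matters most for distance-based models such as kNN. The data is synthetic (`make_regression`), not the mission dataset, and names like `X_demo` and `scaled_knn` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Day 3 features (assumption: 425 rows, 7 columns)
X_demo, y_demo = make_regression(n_samples=425, n_features=7, noise=5.0, random_state=0)

# Scaler + kNN as one estimator: each CV fold fits the scaler on its own training rows
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=7))

# cross_val_score returns negated MSE, so flip the sign
fold_mse = -cross_val_score(scaled_knn, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
print("Per-fold MSE:", np.round(fold_mse, 2))
```

Whether this changes the numbers noticeably on the real data is not something this notebook measures; the pattern simply removes one possible source of optimistic bias.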


## Method 1: Train-Test Split (80/20)
This is our first and simplest check. We split the data once to get a baseline for how our models perform on unseen data.

In [7]:
# Split data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
linear_reg = LinearRegression()
knn_reg = KNeighborsRegressor(n_neighbors=7) # Using k=7 as a reasonable starting point
poly_reg = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())

# Train models
linear_reg.fit(X_train, y_train)
knn_reg.fit(X_train, y_train)
poly_reg.fit(X_train, y_train)

# Evaluate on the test set
mse_linear = mean_squared_error(y_test, linear_reg.predict(X_test))
mse_knn = mean_squared_error(y_test, knn_reg.predict(X_test))
mse_poly = mean_squared_error(y_test, poly_reg.predict(X_test))

In [8]:
# Code to display results

# Collect model names and their corresponding test MSE
results = {
    "Model": ["Linear Regression", "kNN Regression (k=7)", "Polynomial Regression (deg=2)"],
    "Test MSE": [mse_linear, mse_knn, mse_poly]
}

# Convert to DataFrame for clean display
results_df = pd.DataFrame(results)

# Display the results table
print("Results: Train-Test Split")
print(results_df.to_string(index=False))

Results: Train-Test Split
                        Model  Test MSE
            Linear Regression 17.245092
         kNN Regression (k=7) 25.510034
Polynomial Regression (deg=2) 16.005268


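Conclusion: On this single split, Polynomial Regression has the lowest test MSE (16.01), ahead of Linear Regression (17.25) and well ahead of kNN (25.51). The margin over Linear Regression is small and comes from one 85-row test set, so treat it as a hint, not proof, that degree-2 interaction terms capture structure the linear model misses.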

## Method 2: Train-Validation-Test Split (60/20/20)

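A three-way split is more rigorous. The validation set lets us tune hyperparameters (such as k for kNN) and compare models without "peeking" at the final test set, which would bias our results. In this notebook we keep k=7 fixed, so the validation scores serve only to compare the three models.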

In [9]:
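# Split data: 60% train, 20% validation, 20% test
# Step 1: hold out 20% of all rows as the final test set
X_train_val, X_test_final, y_train_val, y_test_final = train_test_split(X, y, test_size=0.2, random_state=42)
# Step 2: take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20 of all rows)
X_train_final, X_val, y_train_final, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25, random_state=42)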

# Train models on the 60% training set
linear_reg.fit(X_train_final, y_train_final)
knn_reg.fit(X_train_final, y_train_final)
poly_reg.fit(X_train_final, y_train_final)

# Evaluate on the validation set (used for tuning)
val_mse_linear = mean_squared_error(y_val, linear_reg.predict(X_val))
val_mse_knn = mean_squared_error(y_val, knn_reg.predict(X_val))
val_mse_poly = mean_squared_error(y_val, poly_reg.predict(X_val))

# Final evaluation on the unseen test set
final_mse_linear = mean_squared_error(y_test_final, linear_reg.predict(X_test_final))
final_mse_knn = mean_squared_error(y_test_final, knn_reg.predict(X_test_final))
final_mse_poly = mean_squared_error(y_test_final, poly_reg.predict(X_test_final))

In [10]:
# Code to display TV results

# Collect validation and test MSE for each model
results_tv = {
    "Model": ["Linear Regression", "kNN Regression (k=7)", "Polynomial Regression (deg=2)"],
    "Validation MSE": [val_mse_linear, val_mse_knn, val_mse_poly],
    "Final Test MSE": [final_mse_linear, final_mse_knn, final_mse_poly]
}

# Convert to DataFrame
results_tv_df = pd.DataFrame(results_tv)

# Display results
print("Results: Train-Validation-Test Split")
print(results_tv_df.to_string(index=False))

Results: Train-Validation-Test Split
                        Model  Validation MSE  Final Test MSE
            Linear Regression       24.155318       17.222822
         kNN Regression (k=7)       29.383277       28.112484
Polynomial Regression (deg=2)       23.795181       16.980122


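Conclusion: The ranking holds. Polynomial Regression has the lowest MSE on both the validation set (23.80 vs. 24.16 for Linear Regression) and the final test set (16.98 vs. 17.22), though both margins are small. Two caveats: the test set is the same 85 rows used in Method 1 (same `random_state=42` and `test_size=0.2`), so it is not independent confirmation, and the gap between validation and test MSE (about 24 vs. 17 for the same models) shows how much a score can move with the choice of split.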
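
The validation set earns its keep when we actually tune something. The sketch below, under stated assumptions, picks k for kNN on the validation set and touches the test set only once at the end. It uses synthetic data from `make_regression` rather than the mission dataset, and the candidate list `candidate_ks` is an arbitrary illustration.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for (X, y); same row and column counts as our data
X_demo, y_demo = make_regression(n_samples=425, n_features=7, noise=10.0, random_state=0)

# Same 60/20/20 recipe as in the cell above
X_tv, X_te, y_tv, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tv, y_tv, test_size=0.25, random_state=42)

# Score each candidate k on the validation set only
candidate_ks = [1, 3, 5, 7, 9, 11, 15]
val_mse_by_k = {}
for k in candidate_ks:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_tr, y_tr)
    val_mse_by_k[k] = mean_squared_error(y_va, model.predict(X_va))

best_k = min(val_mse_by_k, key=val_mse_by_k.get)

# Refit on train + validation with the chosen k, then evaluate on the test set once
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X_tv, y_tv)
test_mse = mean_squared_error(y_te, final_model.predict(X_te))
print(f"best k = {best_k}, test MSE = {test_mse:.2f}")
```

Refitting on train plus validation before the final test is one common choice; keeping the model trained on the 60% alone, as the notebook does, is equally valid and slightly more conservative.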

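## Method 3: K-Fold Cross-Validation (The Gold Standard)

K-Fold Cross-Validation splits the data into K equal folds, trains on K-1 of them, tests on the remaining one, and repeats until every fold has served as the test set once. Averaging the K scores removes the dependence on a single lucky or unlucky split. With K=5 and 425 rows, each fold holds 85 rows.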

In [11]:
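# Perform 5-fold cross-validation
# Note: We pass the full dataset (X, y); cross_val_score creates the 5 splits itself
# and fits a fresh clone of each model per fold. With cv=5 and a regressor, the folds
# are consecutive blocks of rows (no shuffling).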
cv_scores_linear = -cross_val_score(linear_reg, X, y, cv=5, scoring='neg_mean_squared_error')
cv_scores_knn = -cross_val_score(knn_reg, X, y, cv=5, scoring='neg_mean_squared_error')
cv_scores_poly = -cross_val_score(poly_reg, X, y, cv=5, scoring='neg_mean_squared_error')

# Calculate the average MSE
avg_mse_linear = np.mean(cv_scores_linear)
avg_mse_knn = np.mean(cv_scores_knn)
avg_mse_poly = np.mean(cv_scores_poly)

In [12]:
# Print results for CV 

# Create results dictionary
cv_results = {
    "Model": ["Linear Regression", "kNN Regression (k=7)", "Polynomial Regression (deg=2)"],
    "CV Fold MSEs": [cv_scores_linear, cv_scores_knn, cv_scores_poly],
    "Average MSE": [avg_mse_linear, avg_mse_knn, avg_mse_poly]
}

# Convert to DataFrame
cv_results_df = pd.DataFrame(cv_results)

# Format float precision for better readability
pd.options.display.float_format = '{:,.2f}'.format

# Display results
print("Results: 5-Fold Cross-Validation")
print(cv_results_df.to_string(index=False))


Results: 5-Fold Cross-Validation
                        Model                                                                                        CV Fold MSEs  Average MSE
            Linear Regression [20.254832397999557, 21.231303364720834, 19.554667576047173, 21.98905701752795, 18.552247740787656]        20.32
         kNN Regression (k=7) [29.512259591836738, 30.947534621848746, 25.82201126050422, 29.568015606242486, 26.498327539015587]        28.47
Polynomial Regression (deg=2)  [18.708230746416785, 19.55556801858651, 20.51452139147108, 20.622538248239277, 15.366709977129435]        18.95


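Conclusion: Averaged over five folds, Polynomial Regression has the lowest MSE (18.95, versus 20.32 for Linear Regression and 28.47 for kNN) and wins four of the five folds; Linear Regression is ahead only in fold 3. Because every row is used for testing exactly once, this average is a more trustworthy estimate of performance on future, unseen data than either single-split score.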
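
One detail worth knowing: `cross_val_score(..., cv=5)` on a regressor cuts the rows into consecutive blocks without shuffling, so any ordering in the CSV carries into the folds. The sketch below is one way to guard against that, passing an explicit `KFold(shuffle=True)` and reporting the spread as well as the mean. It runs on synthetic data, and the names `cv`, `models`, and `summary` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for (X, y)
X_demo, y_demo = make_regression(n_samples=425, n_features=7, noise=10.0, random_state=0)

# Fixed random_state means every model is scored on the same shuffled folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "linear": LinearRegression(),
    "poly_deg2": make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression()),
}

summary = {}
for name, model in models.items():
    fold_mse = -cross_val_score(model, X_demo, y_demo, cv=cv, scoring="neg_mean_squared_error")
    summary[name] = (fold_mse.mean(), fold_mse.std())
    print(f"{name:10s} mean MSE = {fold_mse.mean():.2f}, std = {fold_mse.std():.2f}")
```

Reporting the standard deviation alongside the mean makes it easier to judge whether a gap between two models is larger than the fold-to-fold noise.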

### Takeaway

| Evaluation Method | Linear Regression MSE | kNN Regression MSE | Polynomial Regression MSE | **Best Model** |
| :--- | :--- | :--- | :--- | :--- |
| **Train-Test Split (80/20)** | 17.25 | 25.51 | **16.01** | **Polynomial** |
| **Train-Val-Test (60/20/20)**| 17.22 (Test) | 28.11 (Test) | **16.98** (Test) | **Polynomial** |
| **5-Fold Cross-Validation**| 20.32 (Avg) | 28.47 (Avg) | **18.95** (Avg) | **Polynomial** |

The story is clear and consistent across all evaluation methods:

* **Feature Engineering Paid Off**: The features we created in Day 3 let even plain Linear Regression land within about 1.4 MSE points of the best model in cross-validation (20.32 vs. 18.95). We did not train a baseline on the raw features today, so this is suggestive rather than a measured gain.
* **The Right Complexity Matters**: Polynomial Regression had the lowest MSE under every evaluation method, which is consistent with degree-2 interaction terms capturing structure that a purely linear model misses. kNN (k=7) had the highest MSE every time, trailing the best model by roughly 9.5 to 11 points, so it is a poor fit for this problem at that setting.
* **Cross-Validation is Trustworthy**: While the Polynomial model scored exceptionally well on single test sets (MSE of ~16-17), the cross-validation score gives us a more stable and reliable estimate of its true predictive power. We should expect its real-world performance to have an MSE closer to **18.95**.
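
To back up the last bullet with the numbers already printed, here is a short sketch that summarizes the per-fold MSEs from the 5-fold output above (copied and rounded to four decimals). It adds the sample standard deviation, the RMSE implied by the mean MSE, and a count of folds where the polynomial model beats the linear one. The dictionary keys are illustrative labels.

```python
import numpy as np

# Per-fold MSEs copied from the 5-fold cross-validation output, rounded to 4 decimals
fold_mse = {
    "linear": np.array([20.2548, 21.2313, 19.5547, 21.9891, 18.5522]),
    "knn_k7": np.array([29.5123, 30.9475, 25.8220, 29.5680, 26.4983]),
    "poly_deg2": np.array([18.7082, 19.5556, 20.5145, 20.6225, 15.3667]),
}

for name, scores in fold_mse.items():
    mean_mse = scores.mean()
    print(
        f"{name:10s} mean MSE = {mean_mse:.2f}, "
        f"std = {scores.std(ddof=1):.2f}, "
        f"RMSE = {np.sqrt(mean_mse):.2f}"
    )

# Fold-by-fold comparison: in how many folds does the polynomial model win?
poly_wins = int(np.sum(fold_mse["poly_deg2"] < fold_mse["linear"]))
print(f"Polynomial beats Linear in {poly_wins} of {len(fold_mse['linear'])} folds")
```

The RMSE is in the same units as `mission_success_score`, which often makes it easier to interpret than the MSE itself.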

**Final Takeaway**: Yesterday we created the right ingredients. Today, we rigorously tested the recipes and found the one that works best. **Cross-validation is the taste test that shows whether a model is consistently effective rather than lucky on a single split.**