# Week 4 â€“ Model Training

In this notebook, I train and compare multiple regression models to predict studentsâ€™ final grades (**G3**).  

**Goals for Week 4:**
- Load the **processed dataset** from Week 3.
- Create **train / validation / test** splits.
- Train a **baseline model** (DummyRegressor).
- Train several candidate models:
  - Linear Regression
  - Lasso Regression (regularized)
  - Random Forest Regressor (tree-based)
- Use **GridSearchCV** to tune key hyperparameters.
- Compare models using **RMSE, MAE, and RÂ²**.
- Save the **best model(s)** to the `models/` folder for:
  - Week 5: Fairness evaluation  
  - Week 6: Explainability (SHAP, LIME)


In [1]:
# 1. Setup & Imports

import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import joblib

RANDOM_STATE = 42  # for reproducibility
np.random.seed(RANDOM_STATE)

# Create models directory if it doesn't exist
os.makedirs("models", exist_ok=True)

## 2. Load Processed Data

Here I load the **cleaned, preprocessed dataset** that was created in Week 3.  
Update the file path if your processed file is named differently.

Assumptions:
- The target variable is **`G3`** (final grade).
- All features are already **numeric** and cleaned.

In [2]:
# 2. Load Processed Data

data_path = r"C:\Users\Kal\processed_student_data.csv"

df = pd.read_csv(data_path)
print("Data shape:", df.shape)
df.head()


Data shape: (649, 61)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,51,52,53,54,55,56,57,58,59,G3
0,1.031695,1.310216,1.540715,0.576718,0.083653,-0.374305,0.072606,-0.171647,0.693785,-0.543555,...,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,11
1,0.210137,-1.336039,-1.188832,-0.760032,0.083653,-0.374305,1.119748,-0.171647,-0.15738,-0.543555,...,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,11
2,-1.43298,-1.336039,-1.188832,-0.760032,0.083653,-0.374305,0.072606,-0.171647,-1.008546,0.538553,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,12
3,-1.43298,1.310216,-0.278983,-0.760032,1.290114,-0.374305,-0.974536,-1.123771,-1.008546,-0.543555,...,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,14
4,-0.611422,0.428131,0.630866,-0.760032,0.083653,-0.374305,0.072606,-0.171647,-1.008546,-0.543555,...,0.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,0.0,13


## 3. Define Features and Target

- **Target (`y`)**: `G3` (studentsâ€™ final grade).
- **Features (`X`)**: all remaining columns.

I will also create **two feature sets**:
1. **With G1 and G2** â€“ uses all available predictors.
2. **Without G1 and G2** â€“ simulates a more realistic scenario where only background and behavior data are used.

> If you donâ€™t want the second version, you can comment out that part.

In [3]:
# 3. Define Features and Target

TARGET_COL = "G3"

if TARGET_COL not in df.columns:
    raise ValueError(f"Target column '{TARGET_COL}' not found in dataframe. Check your data.")

# Full feature set (all columns except target)
feature_cols = [col for col in df.columns if col != TARGET_COL]

X_full = df[feature_cols].copy()
y = df[TARGET_COL].copy()

print("Number of features (full):", X_full.shape[1])

# Two versions:
# 1) With G1 and G2
X_with_G1_G2 = X_full.copy()

# 2) Without G1 and G2 (if they exist in the data)
cols_to_drop = [c for c in ["G1", "G2"] if c in X_full.columns]
X_without_G1_G2 = X_full.drop(columns=cols_to_drop) if cols_to_drop else X_full.copy()

print("Number of features with G1 & G2:", X_with_G1_G2.shape[1])
print("Number of features without G1 & G2:", X_without_G1_G2.shape[1])

Number of features (full): 60
Number of features with G1 & G2: 60
Number of features without G1 & G2: 60


## 4. Train / Validation / Test Split

To get an honest estimate of performance, I split the data into:

1. **Train + Validation vs Test**  
2. Then split the **Train+Validation** part again into **Train** and **Validation**.

Example:
- 60% Train
- 20% Validation
- 20% Test

In [4]:
# 4. Train / Validation / Test Split

def make_splits(X, y, test_size=0.2, val_size=0.25, random_state=RANDOM_STATE):
    """
    Splits X, y into train, validation, and test sets.
    
    test_size: proportion of data for test (e.g., 0.2 -> 20%)
    val_size: proportion of (train+val) reserved for validation.
              e.g., 0.25 of 0.8 -> 0.2 -> final: 60/20/20
    """
    # First: train_val vs test
    X_train_val, X_test, y_train_val, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    
    # Second: train vs val from train_val
    X_train, X_val, y_train, y_val = train_test_split(
        X_train_val, y_train_val, test_size=val_size, random_state=random_state
    )
    
    return X_train, X_val, X_test, y_train, y_val, y_test


# Splits for both feature sets
splits_with = make_splits(X_with_G1_G2, y)
splits_without = make_splits(X_without_G1_G2, y)

(X_train_with, X_val_with, X_test_with,
 y_train_with, y_val_with, y_test_with) = splits_with

(X_train_wo, X_val_wo, X_test_wo,
 y_train_wo, y_val_wo, y_test_wo) = splits_without

print("With G1/G2 - Train:", X_train_with.shape, "Val:", X_val_with.shape, "Test:", X_test_with.shape)
print("Without G1/G2 - Train:", X_train_wo.shape, "Val:", X_val_wo.shape, "Test:", X_test_wo.shape)

With G1/G2 - Train: (389, 60) Val: (130, 60) Test: (130, 60)
Without G1/G2 - Train: (389, 60) Val: (130, 60) Test: (130, 60)


## 5. Evaluation Helper Function

To keep things clean, I define a helper function that:
- fits a model,
- makes predictions,
- returns **RMSE, MAE, and RÂ²**.

In [5]:
# 5. Evaluation Helper

def evaluate_regression_model(model, X_train, y_train, X_val, y_val, model_name="model"):
    """
    Fits the model on training data and evaluates on validation data.
    Returns a dictionary of metrics.
    """
    model.fit(X_train, y_train)
    
    y_train_pred = model.predict(X_train)
    y_val_pred = model.predict(X_val)
    
    metrics = {
        "model": model_name,
        "rmse_train": mean_squared_error(y_train, y_train_pred, squared=False),
        "mae_train": mean_absolute_error(y_train, y_train_pred),
        "r2_train": r2_score(y_train, y_train_pred),
        "rmse_val": mean_squared_error(y_val, y_val_pred, squared=False),
        "mae_val": mean_absolute_error(y_val, y_val_pred),
        "r2_val": r2_score(y_val, y_val_pred),
    }
    return metrics

## 6. Baseline Model â€“ DummyRegressor

The baseline model predicts a **constant value** (mean of the training target).

This gives a **minimum performance level**. Any useful model should beat this baseline.

In [6]:
# 6. Baseline Model â€“ DummyRegressor

results_with = []
results_wo = []

dummy_with = DummyRegressor(strategy="mean")
metrics_dummy_with = evaluate_regression_model(
    dummy_with,
    X_train_with, y_train_with,
    X_val_with, y_val_with,
    model_name="Dummy (mean) â€“ with G1/G2"
)
results_with.append(metrics_dummy_with)

dummy_wo = DummyRegressor(strategy="mean")
metrics_dummy_wo = evaluate_regression_model(
    dummy_wo,
    X_train_wo, y_train_wo,
    X_val_wo, y_val_wo,
    model_name="Dummy (mean) â€“ without G1/G2"
)
results_wo.append(metrics_dummy_wo)

pd.DataFrame(results_with)



Unnamed: 0,model,rmse_train,mae_train,r2_train,rmse_val,mae_val,r2_val
0,Dummy (mean) â€“ with G1/G2,3.359542,2.498926,0.0,2.876065,2.196955,-0.010676


## 7. Linear & Lasso Regression (Initial Training)

Next, I train:
- **Linear Regression** â€“ basic baseline for linear relationships.
- **Lasso Regression** â€“ adds L1 regularization to reduce overfitting and possibly perform feature selection.

Here I first train with **default settings** before tuning.

In [7]:
# 7. Linear & Lasso Regression â€“ initial fits

# Linear Regression â€“ with G1/G2
lin_with = LinearRegression()
metrics_lin_with = evaluate_regression_model(
    lin_with,
    X_train_with, y_train_with,
    X_val_with, y_val_with,
    model_name="Linear Regression â€“ with G1/G2"
)
results_with.append(metrics_lin_with)

# Linear Regression â€“ without G1/G2
lin_wo = LinearRegression()
metrics_lin_wo = evaluate_regression_model(
    lin_wo,
    X_train_wo, y_train_wo,
    X_val_wo, y_val_wo,
    model_name="Linear Regression â€“ without G1/G2"
)
results_wo.append(metrics_lin_wo)

# Lasso â€“ with G1/G2 (default alpha as a starting point)
lasso_with = Lasso(alpha=0.01, random_state=RANDOM_STATE, max_iter=10000)
metrics_lasso_with = evaluate_regression_model(
    lasso_with,
    X_train_with, y_train_with,
    X_val_with, y_val_with,
    model_name="Lasso (Î±=0.01) â€“ with G1/G2"
)
results_with.append(metrics_lasso_with)

# Lasso â€“ without G1/G2
lasso_wo = Lasso(alpha=0.01, random_state=RANDOM_STATE, max_iter=10000)
metrics_lasso_wo = evaluate_regression_model(
    lasso_wo,
    X_train_wo, y_train_wo,
    X_val_wo, y_val_wo,
    model_name="Lasso (Î±=0.01) â€“ without G1/G2"
)
results_wo.append(metrics_lasso_wo)

pd.DataFrame(results_with)




Unnamed: 0,model,rmse_train,mae_train,r2_train,rmse_val,mae_val,r2_val
0,Dummy (mean) â€“ with G1/G2,3.359542,2.498926,0.0,2.876065,2.196955,-0.010676
1,Linear Regression â€“ with G1/G2,1.261116,0.815135,0.859087,1.091688,0.780965,0.854383
2,Lasso (Î±=0.01) â€“ with G1/G2,1.269001,0.81327,0.85732,1.106514,0.780954,0.850401


## 8. Random Forest Regressor (Initial Training)

Now I add a **tree-based** model:

- **RandomForestRegressor** â€“ can capture non-linear relationships and interactions between features.

Here I start with a **simple configuration** and will tune hyperparameters later.

In [8]:
# 8. Random Forest â€“ initial fits

rf_with = RandomForestRegressor(
    n_estimators=100,
    random_state=RANDOM_STATE,
    n_jobs=-1
)
metrics_rf_with = evaluate_regression_model(
    rf_with,
    X_train_with, y_train_with,
    X_val_with, y_val_with,
    model_name="Random Forest â€“ with G1/G2"
)
results_with.append(metrics_rf_with)

rf_wo = RandomForestRegressor(
    n_estimators=100,
    random_state=RANDOM_STATE,
    n_jobs=-1
)
metrics_rf_wo = evaluate_regression_model(
    rf_wo,
    X_train_wo, y_train_wo,
    X_val_wo, y_val_wo,
    model_name="Random Forest â€“ without G1/G2"
)
results_wo.append(metrics_rf_wo)

pd.DataFrame(results_wo)



Unnamed: 0,model,rmse_train,mae_train,r2_train,rmse_val,mae_val,r2_val
0,Dummy (mean) â€“ without G1/G2,3.359542,2.498926,0.0,2.876065,2.196955,-0.010676
1,Linear Regression â€“ without G1/G2,1.261116,0.815135,0.859087,1.091688,0.780965,0.854383
2,Lasso (Î±=0.01) â€“ without G1/G2,1.269001,0.81327,0.85732,1.106514,0.780954,0.850401
3,Random Forest â€“ without G1/G2,0.525626,0.330154,0.975521,1.200176,0.781692,0.824003


## 9. Hyperparameter Tuning with GridSearchCV

To improve performance, I tune key hyperparameters using **GridSearchCV**.

I will tune:
- **Lasso**: `alpha`
- **Random Forest**: `n_estimators`, `max_depth`, `min_samples_split`

For speed, I keep the parameter grids small. You can expand them if training time is acceptable.


In [9]:
# 9. Hyperparameter Tuning â€“ Lasso and Random Forest

def tune_lasso(X_train, y_train):
    lasso = Lasso(random_state=RANDOM_STATE, max_iter=10000)
    param_grid = {
        "alpha": [0.001, 0.01, 0.1, 1.0]
    }
    grid = GridSearchCV(
        lasso,
        param_grid,
        cv=5,
        scoring="neg_root_mean_squared_error",
        n_jobs=-1
    )
    grid.fit(X_train, y_train)
    return grid

def tune_random_forest(X_train, y_train):
    rf = RandomForestRegressor(random_state=RANDOM_STATE, n_jobs=-1)
    param_grid = {
        "n_estimators": [100, 200],
        "max_depth": [None, 5, 10],
        "min_samples_split": [2, 5]
    }
    grid = GridSearchCV(
        rf,
        param_grid,
        cv=3,
        scoring="neg_root_mean_squared_error",
        n_jobs=-1
    )
    grid.fit(X_train, y_train)
    return grid

# Tune for the "with G1/G2" feature set
print("Tuning Lasso (with G1/G2)...")
lasso_grid_with = tune_lasso(X_train_with, y_train_with)
print("Best Lasso params (with):", lasso_grid_with.best_params_)

print("Tuning Random Forest (with G1/G2)...")
rf_grid_with = tune_random_forest(X_train_with, y_train_with)
print("Best RF params (with):", rf_grid_with.best_params_)

# Tune for the "without G1/G2" feature set
print("\nTuning Lasso (without G1/G2)...")
lasso_grid_wo = tune_lasso(X_train_wo, y_train_wo)
print("Best Lasso params (without):", lasso_grid_wo.best_params_)

print("Tuning Random Forest (without G1/G2)...")
rf_grid_wo = tune_random_forest(X_train_wo, y_train_wo)
print("Best RF params (without):", rf_grid_wo.best_params_)


Tuning Lasso (with G1/G2)...
Best Lasso params (with): {'alpha': 0.1}
Tuning Random Forest (with G1/G2)...
Best RF params (with): {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 100}

Tuning Lasso (without G1/G2)...
Best Lasso params (without): {'alpha': 0.1}
Tuning Random Forest (without G1/G2)...
Best RF params (without): {'max_depth': 5, 'min_samples_split': 5, 'n_estimators': 100}


## 10. Compare Models on Validation Set

Now I take the **best estimators** from each grid search and evaluate them on the validation set, then compare:

- Dummy baseline  
- Linear Regression  
- Lasso (tuned)  
- Random Forest (tuned)

In [10]:
# 10. Evaluation of tuned models on validation set

# Best Lasso & RF from grid search
best_lasso_with = lasso_grid_with.best_estimator_
best_rf_with = rf_grid_with.best_estimator_

best_lasso_wo = lasso_grid_wo.best_estimator_
best_rf_wo = rf_grid_wo.best_estimator_

# Evaluate tuned models â€“ with G1/G2
metrics_lasso_with_tuned = evaluate_regression_model(
    best_lasso_with,
    X_train_with, y_train_with,
    X_val_with, y_val_with,
    model_name="Lasso (tuned) â€“ with G1/G2"
)
results_with.append(metrics_lasso_with_tuned)

metrics_rf_with_tuned = evaluate_regression_model(
    best_rf_with,
    X_train_with, y_train_with,
    X_val_with, y_val_with,
    model_name="Random Forest (tuned) â€“ with G1/G2"
)
results_with.append(metrics_rf_with_tuned)

# Evaluate tuned models â€“ without G1/G2
metrics_lasso_wo_tuned = evaluate_regression_model(
    best_lasso_wo,
    X_train_wo, y_train_wo,
    X_val_wo, y_val_wo,
    model_name="Lasso (tuned) â€“ without G1/G2"
)
results_wo.append(metrics_lasso_wo_tuned)

metrics_rf_wo_tuned = evaluate_regression_model(
    best_rf_wo,
    X_train_wo, y_train_wo,
    X_val_wo, y_val_wo,
    model_name="Random Forest (tuned) â€“ without G1/G2"
)
results_wo.append(metrics_rf_wo_tuned)

print("=== WITH G1/G2 â€“ Validation Results ===")
df_results_with = pd.DataFrame(results_with).sort_values("rmse_val")
display(df_results_with)

print("\n=== WITHOUT G1/G2 â€“ Validation Results ===")
df_results_wo = pd.DataFrame(results_wo).sort_values("rmse_val")
display(df_results_wo)



=== WITH G1/G2 â€“ Validation Results ===




Unnamed: 0,model,rmse_train,mae_train,r2_train,rmse_val,mae_val,r2_val
1,Linear Regression â€“ with G1/G2,1.261116,0.815135,0.859087,1.091688,0.780965,0.854383
2,Lasso (Î±=0.01) â€“ with G1/G2,1.269001,0.81327,0.85732,1.106514,0.780954,0.850401
4,Lasso (tuned) â€“ with G1/G2,1.33043,0.811416,0.843172,1.113662,0.752936,0.848462
5,Random Forest (tuned) â€“ with G1/G2,0.902147,0.604312,0.92789,1.185266,0.772352,0.828349
3,Random Forest â€“ with G1/G2,0.525626,0.330154,0.975521,1.200176,0.781692,0.824003
0,Dummy (mean) â€“ with G1/G2,3.359542,2.498926,0.0,2.876065,2.196955,-0.010676



=== WITHOUT G1/G2 â€“ Validation Results ===


Unnamed: 0,model,rmse_train,mae_train,r2_train,rmse_val,mae_val,r2_val
1,Linear Regression â€“ without G1/G2,1.261116,0.815135,0.859087,1.091688,0.780965,0.854383
2,Lasso (Î±=0.01) â€“ without G1/G2,1.269001,0.81327,0.85732,1.106514,0.780954,0.850401
4,Lasso (tuned) â€“ without G1/G2,1.33043,0.811416,0.843172,1.113662,0.752936,0.848462
5,Random Forest (tuned) â€“ without G1/G2,0.902147,0.604312,0.92789,1.185266,0.772352,0.828349
3,Random Forest â€“ without G1/G2,0.525626,0.330154,0.975521,1.200176,0.781692,0.824003
0,Dummy (mean) â€“ without G1/G2,3.359542,2.498926,0.0,2.876065,2.196955,-0.010676


## 11. Final Test Evaluation

From the validation results, I choose **one best model** for each feature set:

- **With G1/G2** â€“ best model by lowest validation RMSE.
- **Without G1/G2** â€“ best model by lowest validation RMSE.

Then I retrain on **Train + Validation** and evaluate on the **Test** set.


In [11]:
# 11. Final Test Evaluation on held-out test set

def get_best_model(df_results, candidate_models_dict):
    """
    Given a results dataframe (sorted by rmse_val ascending) and a dict of
    {model_name: model_object}, return the best model object.
    """
    best_name = df_results.iloc[0]["model"]
    print("Selected best model:", best_name)
    return candidate_models_dict[best_name]

# Collect final candidate models used so far (WITH G1/G2)
candidate_models_with = {
    "Dummy (mean) â€“ with G1/G2": dummy_with,
    "Linear Regression â€“ with G1/G2": lin_with,
    "Lasso (Î±=0.01) â€“ with G1/G2": lasso_with,
    "Random Forest â€“ with G1/G2": rf_with,
    "Lasso (tuned) â€“ with G1/G2": best_lasso_with,
    "Random Forest (tuned) â€“ with G1/G2": best_rf_with,
}

candidate_models_wo = {
    "Dummy (mean) â€“ without G1/G2": dummy_wo,
    "Linear Regression â€“ without G1/G2": lin_wo,
    "Lasso (Î±=0.01) â€“ without G1/G2": lasso_wo,
    "Random Forest â€“ without G1/G2": rf_wo,
    "Lasso (tuned) â€“ without G1/G2": best_lasso_wo,
    "Random Forest (tuned) â€“ without G1/G2": best_rf_wo,
}

# Ensure dataframes are sorted by rmse_val (ascending)
df_results_with = df_results_with.sort_values("rmse_val")
df_results_wo = df_results_wo.sort_values("rmse_val")

best_model_with = get_best_model(df_results_with, candidate_models_with)
best_model_wo = get_best_model(df_results_wo, candidate_models_wo)

# Retrain best models on Train + Val, then evaluate on Test
def train_on_trainval_and_test(model, X_train, X_val, X_test, y_train, y_val, y_test):
    X_train_val = pd.concat([X_train, X_val], axis=0)
    y_train_val = pd.concat([y_train, y_val], axis=0)
    
    model.fit(X_train_val, y_train_val)
    y_test_pred = model.predict(X_test)
    
    test_metrics = {
        "rmse_test": mean_squared_error(y_test, y_test_pred, squared=False),
        "mae_test": mean_absolute_error(y_test, y_test_pred),
        "r2_test": r2_score(y_test, y_test_pred),
    }
    return test_metrics

print("\n=== Final Test Performance â€“ WITH G1/G2 ===")
test_metrics_with = train_on_trainval_and_test(
    best_model_with,
    X_train_with, X_val_with, X_test_with,
    y_train_with, y_val_with, y_test_with
)
test_metrics_with

print("\n=== Final Test Performance â€“ WITHOUT G1/G2 ===")
test_metrics_wo = train_on_trainval_and_test(
    best_model_wo,
    X_train_wo, X_val_wo, X_test_wo,
    y_train_wo, y_val_wo, y_test_wo
)
test_metrics_wo

Selected best model: Linear Regression â€“ with G1/G2
Selected best model: Linear Regression â€“ without G1/G2

=== Final Test Performance â€“ WITH G1/G2 ===

=== Final Test Performance â€“ WITHOUT G1/G2 ===




{'rmse_test': 1.2229218170886718,
 'mae_test': 0.7845590444711539,
 'r2_test': 0.8466385019476248}

## 12. Save Final Models

Finally, I save the **best models** for both scenarios to the `models/` folder.

These will be used in:
- **Week 5** â€“ Fairness evaluation  
- **Week 6** â€“ Explainability (SHAP, LIME)

In [12]:
# 12. Save final best models

model_path_with = "models/best_model_with_G1_G2.pkl"
model_path_wo = "models/best_model_without_G1_G2.pkl"

joblib.dump(best_model_with, model_path_with)
joblib.dump(best_model_wo, model_path_wo)

print("Saved best model WITH G1/G2 to:", model_path_with)
print("Saved best model WITHOUT G1/G2 to:", model_path_wo)

Saved best model WITH G1/G2 to: models/best_model_with_G1_G2.pkl
Saved best model WITHOUT G1/G2 to: models/best_model_without_G1_G2.pkl


## 13. Conclusions & Next Steps

**Summary of Week 4:**
- Trained baseline and multiple regression models to predict **G3**.
- Compared models using **RMSE, MAE, and RÂ²** on validation data.
- Tuned **Lasso** and **Random Forest** with GridSearchCV.
- Selected the best-performing model for:
  - **With G1/G2** features.
  - **Without G1/G2** features.
- Evaluated final models on the **held-out test set**.
- Saved final models to the `models/` directory.

**Next Steps:**
- **Week 5 (Fairness Evaluation):**
  - Load the saved models.
  - Evaluate performance across student subgroups (e.g., gender, school, study time).
- **Week 6 (Explainability):**
  - Use SHAP and/or LIME to explain predictions.
  - Analyze which features contribute most to predicted grades and how this relates to fairness.
es to fairness.
Load the cleaned dataset generated in Week 3 (`processed_student_data.csv`).
Update the full path if your file location changes.
Code (Ready to Paste)
python
Copy code
# 2. Load Processed Data

data_path = r"C:\Users\Kal\processed_student_data.csv"

df = pd.read_csv(data_path)
print("Data shape:", df.shape)
df.head()
ðŸŽ‰ This will load your data correctly with no error.
If you want, I can now integrate this path directly into the full Week 4 notebook.











