<a href="https://colab.research.google.com/github/jcsmcmendes/Step_Class/blob/main/Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📘 Regression – Model Evaluation Overview

This notebook explores different regression models and validation strategies for predicting final grades from student features.


In [16]:
# Import the necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns


# 📥 1. Load and Define Regression Models

In [17]:
# Load regression dataset
df = pd.read_excel("Student_datasets.xlsx", sheet_name="regression")
X = df[['attendance', 'assignments_completed', 'participation']]
y = df['final_grade']


# Define regression models
models = {
    'Linear Regression': LinearRegression(),
    'K-Nearest Neighbors': KNeighborsRegressor(n_neighbors=5),
    'Decision Tree': DecisionTreeRegressor(max_depth=4, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=100, max_depth=6, random_state=42),
    'XGBoost': XGBRegressor(n_estimators=100, max_depth=4, random_state=42)
}

# 📊 3. Single Train/Test Split

In this section, we evaluate all models using a single 80/20 split of the dataset.

Advantages:
- Simple to implement
- Fast for a quick estimate

Limitations:
- The estimate has high variance: it depends on which rows happen to land in the test set
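
The sensitivity to the split can be checked directly. The sketch below is a minimal illustration on synthetic stand-in data (an assumption: it does not use the student dataset, and the names `X_demo`, `y_demo`, and `mae_by_seed` are hypothetical). It fits the same model on ten different 80/20 splits and reports how far the test MAE moves.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the student dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = 60 + X_demo @ np.array([5.0, 3.0, 2.0]) + rng.normal(scale=3.0, size=100)

# Same model, same data, only the random split changes
mae_by_seed = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    fitted = LinearRegression().fit(X_tr, y_tr)
    mae_by_seed.append(mean_absolute_error(y_te, fitted.predict(X_te)))

mae_by_seed = np.array(mae_by_seed)
print(f"MAE across splits: min={mae_by_seed.min():.2f}, max={mae_by_seed.max():.2f}")
```

The gap between the smallest and largest MAE gives a rough sense of how much a single-split result could shift by chance.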

Metrics used:
- MAE (Mean Absolute Error)
- MAPE (Mean Absolute Percentage Error; scikit-learn returns it as a fraction, so 0.067 means 6.7%)
- RMSE (Root Mean Squared Error)
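
To make the three metrics concrete, the following sketch computes them by hand with NumPy and checks the results against scikit-learn. The three grades are hypothetical values chosen so the arithmetic is easy to follow; they are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
)

# Hypothetical grades, for illustration only
y_true = np.array([50.0, 80.0, 100.0])
y_pred = np.array([45.0, 84.0, 110.0])

errors = y_true - y_pred                          # [5, -4, -10]
mae = np.mean(np.abs(errors))                     # (5 + 4 + 10) / 3
mape = np.mean(np.abs(errors) / np.abs(y_true))   # (0.10 + 0.05 + 0.10) / 3, a fraction
rmse = np.sqrt(np.mean(errors ** 2))              # sqrt((25 + 16 + 100) / 3)

# The hand-computed values should agree with scikit-learn
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mape, mean_absolute_percentage_error(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))

print(f"MAE={mae:.3f}  MAPE={mape:.3f} ({mape:.1%})  RMSE={rmse:.3f}")
```

RMSE squares the errors before averaging, so the single 10-point miss weighs more heavily in RMSE than in MAE.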

In [18]:
# Store results
results_single = []

# Single split (identical for every model, so it is computed once)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling: fit on the training split only, then apply to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Loop through each model
for name, model in models.items():

    # Train the model
    model.fit(X_train_scaled, y_train)

    # Predict
    preds = model.predict(X_test_scaled)

    # Evaluate metrics
    mae = mean_absolute_error(y_test, preds)
    mape = mean_absolute_percentage_error(y_test, preds)
    mse = mean_squared_error(y_test, preds)
    rmse = np.sqrt(mse)

    results_single.append({
        'Model': name,
        'MAE': round(mae, 3),
        'MAPE': round(mape, 3),
        'RMSE': round(rmse, 3)
    })


# Show results
results_df_single = pd.DataFrame(results_single)
print("\n📊 Regression Metrics (Single Train/Test Split):")
print(results_df_single)


📊 Regression Metrics (Single Train/Test Split):
                 Model    MAE   MAPE   RMSE
0    Linear Regression  2.129  0.067  2.634
1  K-Nearest Neighbors  3.384  0.104  4.234
2        Decision Tree  4.954  0.153  6.384
3        Random Forest  3.641  0.110  4.352
4              XGBoost  3.735  0.115  4.611


# 🔁 4. K-Fold Cross-Validation

Here, we use 5-fold cross-validation to evaluate model performance more robustly.

Advantages:
- Each data point is used for validation exactly once and for training in the remaining folds
- Reduces the variance of the performance estimate compared with a single split

We aggregate predictions across all folds and compute the same metrics (MAE, MAPE, RMSE).
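
One way to obtain pooled out-of-fold predictions is scikit-learn's `cross_val_predict` combined with a `Pipeline`, which refits the scaler on the training folds of every split. The sketch below is an illustration under stated assumptions, not the notebook's own procedure: the data is synthetic, and `X_demo`, `y_demo`, `pipe`, and `oof_preds` are hypothetical names.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the student dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(120, 3))
y_demo = 60 + X_demo @ np.array([5.0, 3.0, 2.0]) + rng.normal(scale=3.0, size=120)

# The pipeline scales and fits inside each fold, using training rows only
pipe = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

# One out-of-fold prediction per row, pooled across the 5 folds
oof_preds = cross_val_predict(pipe, X_demo, y_demo, cv=kf_demo)

mae_pooled = mean_absolute_error(y_demo, oof_preds)
rmse_pooled = np.sqrt(mean_squared_error(y_demo, oof_preds))
print(f"Pooled MAE={mae_pooled:.3f}  Pooled RMSE={rmse_pooled:.3f}")
```

Pooling predictions before computing RMSE is not the same as averaging five per-fold RMSE values, so the two conventions can give slightly different numbers.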


In [19]:
# Perform K-Fold Cross-Validation and calculate global metrics
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Store results for each model
results = []

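# Feature scaling
# Note: the scaler is fit on the full dataset here, so every validation fold
# contributes to the scaling statistics (a mild form of data leakage).
# Fitting the scaler on the training folds only gives a stricter estimate.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)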

# Evaluate each model using aggregated predictions over K-Folds
for name, model in models.items():
    all_preds = []
    all_true = []

    for train_idx, val_idx in kf.split(X_scaled):
        X_train, X_val = X_scaled[train_idx], X_scaled[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        model.fit(X_train, y_train)  # Train the model
        preds = model.predict(X_val)  # Predict on validation set

        all_preds.extend(preds)       # Store predictions
        all_true.extend(y_val)        # Store actual values

    # Compute global regression metrics across all folds
    mae = mean_absolute_error(all_true, all_preds)
    mape = mean_absolute_percentage_error(all_true, all_preds)
    mse = mean_squared_error(all_true, all_preds)
    rmse = np.sqrt(mse)

    # Store metrics
    results.append({
        'Model': name,
        'MAE': round(mae, 3),
        'MAPE': round(mape, 3),
        'RMSE': round(rmse, 3)
    })

# 🧾 Display final results for all models
results_df = pd.DataFrame(results)
print(" Aggregated K-Fold Regression Metrics for All Models:")
print(results_df)

 Aggregated K-Fold Regression Metrics for All Models:
                 Model    MAE   MAPE   RMSE
0    Linear Regression  2.574  0.094  3.248
1  K-Nearest Neighbors  3.360  0.130  4.235
2        Decision Tree  5.447  0.204  6.905
3        Random Forest  3.560  0.138  4.460
4              XGBoost  3.663  0.138  4.584


# 🔁 5. Repeated Holdout Validation

We repeat the train/test split 5 times using different random seeds.

Advantages:
- Simulates model performance on multiple random train/test partitions
- Balances simplicity with some robustness

The same metrics (MAE, MAPE, RMSE) are calculated on the predictions aggregated across all repetitions.
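
Repeated holdout can also be expressed with scikit-learn's `ShuffleSplit`, which makes it easy to keep one score per repetition and report the spread alongside the mean. This is a sketch on synthetic data (an assumption: the names `X_demo`, `y_demo`, `pipe`, and `mae_per_split` are hypothetical and the student dataset is not used).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the student dataset)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = 60 + X_demo @ np.array([5.0, 3.0, 2.0]) + rng.normal(scale=3.0, size=100)

# Five independent random 80/20 splits
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
pipe = make_pipeline(StandardScaler(), LinearRegression())

# scikit-learn maximizes scores, so MAE is exposed as a negated value
scores = cross_val_score(
    pipe, X_demo, y_demo, cv=splitter, scoring="neg_mean_absolute_error"
)
mae_per_split = -scores

print(f"MAE per split: {np.round(mae_per_split, 3)}")
print(f"Mean MAE={mae_per_split.mean():.3f}  Std={mae_per_split.std():.3f}")
```

Unlike K-fold, the test sets of different repetitions can overlap, so some rows may be evaluated several times and others never.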

In [20]:
# Repeated Holdout Validation

# Define number of repeated splits
n_repeats = 5
results = []

# Loop through each model
for name, model in models.items():
    all_preds = []
    all_true = []

    for seed in range(n_repeats):
        # Random train/test split with different seed each time
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=seed)

        # Standardize features
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_test_scaled = scaler.transform(X_test)

        # Train and predict
        model.fit(X_train_scaled, y_train)
        preds = model.predict(X_test_scaled)

        # Collect predictions and true values for global evaluation
        all_preds.extend(preds)
        all_true.extend(y_test)

    # Calculate regression metrics over all repetitions
    mae = mean_absolute_error(all_true, all_preds)
    mape = mean_absolute_percentage_error(all_true, all_preds)
    mse = mean_squared_error(all_true, all_preds)
    rmse = np.sqrt(mse)

    # Save results
    results.append({
        'Model': name,
        'MAE': round(mae, 3),
        'MAPE': round(mape, 3),
        'RMSE': round(rmse, 3)
    })

# 🧾 Display all results
results_df_repeated = pd.DataFrame(results)
print("\n📊 Aggregated Metrics for All Models (Repeated Holdout):")
print(results_df_repeated)


📊 Aggregated Metrics for All Models (Repeated Holdout):
                 Model    MAE   MAPE   RMSE
0    Linear Regression  2.440  0.088  3.094
1  K-Nearest Neighbors  3.280  0.125  4.185
2        Decision Tree  5.867  0.202  7.136
3        Random Forest  3.587  0.130  4.617
4              XGBoost  3.606  0.128  4.531
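
To compare the three validation strategies side by side, the RMSE values printed in the sections above can be collected into one table. The sketch below copies those printed numbers by hand; `rmse_comparison`, `spread`, and `best_model` are hypothetical names introduced for illustration.

```python
import pandas as pd

model_names = [
    "Linear Regression",
    "K-Nearest Neighbors",
    "Decision Tree",
    "Random Forest",
    "XGBoost",
]

# RMSE values copied from the printed outputs of the three sections
rmse_comparison = pd.DataFrame(
    {
        "Single split": [2.634, 4.234, 6.384, 4.352, 4.611],
        "5-fold CV": [3.248, 4.235, 6.905, 4.460, 4.584],
        "Repeated holdout": [3.094, 4.185, 7.136, 4.617, 4.531],
    },
    index=model_names,
)

# How much each model's RMSE changes with the validation strategy
spread = rmse_comparison.max(axis=1) - rmse_comparison.min(axis=1)

# Model with the lowest RMSE under each strategy
best_model = rmse_comparison.idxmin()

print(rmse_comparison.assign(Spread=spread.round(3)))
print(best_model)
```

Linear Regression has the lowest RMSE under all three strategies, although its single-split RMSE (2.634) is noticeably lower than its cross-validated RMSE (3.248).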
