### Regression Models

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

In [2]:
from pathlib import Path

DATA_PATH = Path("..") / "data" / "raw" / "student-mat.csv"
df = pd.read_csv(DATA_PATH, sep=";")

In [3]:
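# Drop the earlier period grades so only non-grade features remain
df_model = df.drop(["G1", "G2"], axis=1)

# One-hot encode categorical columns; drop_first avoids a redundant dummy per category
df_encoded = pd.get_dummies(df_model, drop_first=True)

# Separate the features from the target (final grade G3)
X = df_encoded.drop("G3", axis=1)
y = df_encoded["G3"]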

df_encoded.info()

<class 'pandas.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 40 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   age                395 non-null    int64
 1   Medu               395 non-null    int64
 2   Fedu               395 non-null    int64
 3   traveltime         395 non-null    int64
 4   studytime          395 non-null    int64
 5   failures           395 non-null    int64
 6   famrel             395 non-null    int64
 7   freetime           395 non-null    int64
 8   goout              395 non-null    int64
 9   Dalc               395 non-null    int64
 10  Walc               395 non-null    int64
 11  health             395 non-null    int64
 12  absences           395 non-null    int64
 13  G3                 395 non-null    int64
 14  school_MS          395 non-null    bool 
 15  sex_M              395 non-null    bool 
 16  address_U          395 non-null    bool 
 17  famsize_LE3        395 non-

In [4]:
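# Add a log-transformed absence count; log1p is defined at 0 and compresses large values
X["absences_log"] = np.log1p(X["absences"])

# Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)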

In [5]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("--- Linear Regression ---")
print(f"MSE:  {mean_squared_error(y_test, y_pred_lr):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_lr)):.3f}")
print(f"MAE:  {mean_absolute_error(y_test, y_pred_lr):.3f}")
print(f"R2:   {r2_score(y_test, y_pred_lr):.3f}")

--- Linear Regression ---
MSE:  17.722
RMSE: 4.210
MAE:  3.442
R2:   0.136


In [6]:
rf = RandomForestRegressor(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("--- Random Forest Regressor ---")
print(f"MSE: {mean_squared_error(y_test, y_pred_rf):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_rf)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred_rf):.3f}")
print(f"R2: {r2_score(y_test, y_pred_rf):.3f}")

--- Random Forest Regressor ---
MSE: 15.231
RMSE: 3.903
MAE: 3.111
R2: 0.257


In [7]:
xgb = XGBRegressor(
    n_estimators=300, learning_rate=0.05, max_depth=4,
    subsample=0.8, colsample_bytree=0.8, random_state=42,
)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

print("--- XGBoost Regressor ---")
print(f"MSE: {mean_squared_error(y_test, y_pred_xgb):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_xgb)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred_xgb):.3f}")
print(f"R2: {r2_score(y_test, y_pred_xgb):.3f}")

--- XGBoost Regressor ---
MSE: 15.259
RMSE: 3.906
MAE: 3.086
R2: 0.256


## Regression Model Performance (Predicting Exact Final Grade: G3)

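Three regression models were trained to predict students’ final grade (**G3**) using only non-grade features (family background, study habits, lifestyle, absences, etc.); the period grades G1 and G2 were dropped. All metrics below are computed on the held-out 20% test split.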

| Model | MSE | RMSE | MAE | R² |
|------|----:|-----:|----:|---:|
| Linear Regression | 17.72 | 4.21 | 3.44 | 0.136 |
| Random Forest Regressor | **15.23** | **3.90** | 3.11 | **0.257** |
| XGBoost Regressor | 15.26 | 3.91 | **3.09** | 0.256 |
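
For reference, the four columns in the table can be recomputed from any pair of true and predicted grade vectors. The sketch below does this with plain NumPy so the definitions are explicit. `regression_report` is a hypothetical helper name, and the toy arrays are made up for illustration rather than taken from the dataset.

```python
import numpy as np


def regression_report(y_true, y_pred):
    """Return MSE, RMSE, MAE and R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mse = float(np.mean(residuals ** 2))       # mean squared error
    rmse = float(np.sqrt(mse))                 # same units as the target
    mae = float(np.mean(np.abs(residuals)))    # mean absolute error

    # R2 compares the model's squared error with that of predicting the mean
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2}


# Toy example (made-up numbers, not the student data)
report = regression_report([3, 5, 7], [2, 5, 9])
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

In the notebook, a call such as `regression_report(y_test, y_pred_rf)` could stand in for the four repeated `print` lines in each model cell.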

### Interpretation (in my own words)
- The **tree-based models (Random Forest and XGBoost)** outperformed **Linear Regression** on every metric (R² of about 0.26 versus 0.14). This suggests that the relationship between student factors and grades is not purely linear, or that it involves interactions a plain linear model cannot capture.
- **Random Forest** had the highest **R²** (0.257) and the lowest **RMSE** (3.90), but its margin over XGBoost (0.256 and 3.91) is too small to count as a real difference on a single test split.
- **XGBoost** had the lowest **MAE** (3.09 versus 3.11), so its average prediction error was marginally smaller. The two tree-based models are effectively tied here.
- Even with the best model, predictions are still off by about **3 grade points on average** (G3 is on a 0 to 20 scale), and the models explain only about **a quarter** of the variation in grades. That tells me the dataset has some predictive power, but there are likely other important factors that aren’t captured in these features.
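
One way to check whether gaps this small are meaningful is to score each model over several folds and compare it against a predict-the-mean baseline. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` as a stand-in for `X` and `y`, leaves out XGBoost to keep dependencies minimal, and the fold count and model settings are arbitrary choices rather than anything from the original analysis.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the encoded student features (assumption, not the real data)
X_demo, y_demo = make_regression(
    n_samples=395, n_features=10, n_informative=4, noise=25.0, random_state=42
)

models = {
    "Mean baseline": DummyRegressor(strategy="mean"),
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Shuffled 5-fold split, so every row is used for testing exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, model in models.items():
    scores = cross_validate(
        model, X_demo, y_demo, cv=cv,
        scoring=("neg_mean_absolute_error", "r2"),
    )
    # scikit-learn negates MAE so that larger scores are always better
    mae_per_fold = -scores["test_neg_mean_absolute_error"]
    r2_per_fold = scores["test_r2"]
    cv_results[name] = {
        "mae_mean": float(mae_per_fold.mean()),
        "mae_std": float(mae_per_fold.std()),
        "r2_mean": float(r2_per_fold.mean()),
    }
    print(
        f"{name:18s} MAE {mae_per_fold.mean():.2f} ± {mae_per_fold.std():.2f}"
        f"   R2 {r2_per_fold.mean():.3f}"
    )
```

If the fold-to-fold spread in MAE is larger than the gap between two models, a single-split ranking should be read as a tie. The mean baseline also gives R² a concrete reference point, since it scores at or slightly below zero by construction.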