# 📘 Modeling: Student Performance and Risk Prediction

This notebook covers the modeling phase of the student performance project: regression models that predict student GPA, classification models that flag at-risk students, and the evaluation used to choose between them.

In this notebook, we:
- Develop a regression model to predict student GPA
- Derive a binary risk label (`risk_flag`) based on GPA
- Train a classification model to identify at-risk students
- Evaluate and compare model performance
- Tune the top models
- Save models for deployment
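The `risk_flag` derivation listed above can be sketched as follows. This is a minimal illustration, assuming the GPA < 2.0 threshold stated in this notebook's summary; the helper name `add_risk_flag` and the constant `RISK_GPA_THRESHOLD` are hypothetical and not part of the original pipeline. The toy GPA values are copied from the dataset preview in this notebook.

```python
import pandas as pd

# Assumed threshold: the notebook's summary mentions GPA < 2.0 (adjustable).
RISK_GPA_THRESHOLD = 2.0


def add_risk_flag(frame: pd.DataFrame, threshold: float = RISK_GPA_THRESHOLD) -> pd.DataFrame:
    """Return a copy of `frame` with a binary `risk_flag` column (1 = at risk)."""
    out = frame.copy()
    out["risk_flag"] = (out["gpa"] < threshold).astype(int)
    return out


# GPA values from the first five rows of the dataset preview.
toy = pd.DataFrame({"gpa": [2.929196, 3.042915, 0.112602, 2.054218, 1.288061]})
flagged = add_risk_flag(toy)
print(flagged)
```

On these five rows the derived flags agree with the `risk_flag` values shown in the preview, which is consistent with the stated threshold.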

In [22]:
# Load packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from xgboost import XGBRegressor, XGBClassifier
from sklearn.metrics import (r2_score, mean_absolute_error, mean_squared_error,
                             accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score)
import joblib
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

## Data Load and Split

In [None]:
# Load Transformed Data
file_path = "../data/feature_engineered_student_data.csv"
df = pd.read_csv(file_path)
df.head()

Unnamed: 0,ethnicity_1,ethnicity_2,ethnicity_3,studytimeweekly,absences,age,gender,tutoring,extracurricular,sports,music,volunteering,parentalsupport,parentaleducation,gpa,gradeclass,risk_flag
0,0.0,0.0,0.0,1.780336,-0.890822,0.472919,1.0,1.0,0.0,0.0,1.0,0.0,2.0,2.0,2.929196,2.0,0
1,0.0,0.0,0.0,0.997376,-1.717694,1.362944,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,3.042915,1.0,0
2,0.0,1.0,0.0,-0.984045,1.353542,-1.307132,0.0,0.0,0.0,0.0,0.0,0.0,2.0,3.0,0.112602,4.0,1
3,0.0,0.0,0.0,0.045445,-0.063951,0.472919,1.0,0.0,1.0,0.0,0.0,0.0,3.0,3.0,2.054218,3.0,0
4,0.0,0.0,0.0,-0.902311,0.290422,0.472919,1.0,1.0,0.0,0.0,0.0,0.0,3.0,2.0,1.288061,4.0,1


In [19]:
print(df.columns)
print(len(df.columns), "columns")

Index(['ethnicity_1', 'ethnicity_2', 'ethnicity_3', 'studytimeweekly',
       'absences', 'age', 'gender', 'tutoring', 'extracurricular', 'sports',
       'music', 'volunteering', 'parentalsupport', 'parentaleducation', 'gpa',
       'gradeclass', 'risk_flag'],
      dtype='object')
17 columns
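Since `gradeclass` sits next to `gpa` and `risk_flag` in the column list, it may be worth checking whether it already encodes the classification target. The sketch below is one possible check, not part of the original notebook: it cross-tabulates the two columns and tests whether each grade class maps to a single `risk_flag` value. The five rows are copied from the dataset preview, so the result only illustrates the method; the same `pd.crosstab` call would need to run on the full `df` to draw any conclusion.

```python
import pandas as pd

# First five rows of the dataset preview (gpa, gradeclass, risk_flag only).
preview = pd.DataFrame({
    "gpa": [2.929196, 3.042915, 0.112602, 2.054218, 1.288061],
    "gradeclass": [2.0, 1.0, 4.0, 3.0, 4.0],
    "risk_flag": [0, 0, 1, 0, 1],
})

# Rows: grade classes; columns: risk_flag values; cells: row counts.
leak_table = pd.crosstab(preview["gradeclass"], preview["risk_flag"])
print(leak_table)

# If every grade class has exactly one non-zero cell, the feature alone
# determines the label on this sample.
determines_label = bool(((leak_table > 0).sum(axis=1) == 1).all())
print("gradeclass determines risk_flag on this sample:", determines_label)
```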


In [23]:
# Define Targets
regression_target = 'gpa'
classification_target = 'risk_flag'

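# Feature-Target Split
# Note: `gradeclass` stays in X. It appears to be a GPA-derived grade band, so it may
# leak target information into both models; consider dropping it as well.
X = df.drop([regression_target, classification_target], axis=1)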
y_reg = df[regression_target]
y_cls = df[classification_target]

# Train-Test Split
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X, y_reg, test_size=0.2, random_state=42)
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(X, y_cls, test_size=0.2, random_state=42)

# Initialize Models
regression_models = {
    'LinearRegression': LinearRegression(),
    'RandomForestRegressor': RandomForestRegressor(random_state=42),
    'XGBRegressor': XGBRegressor(random_state=42)
}

classification_models = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'RandomForestClassifier': RandomForestClassifier(random_state=42),
    'XGBClassifier': XGBClassifier(random_state=42)
}

# Train & Evaluate Regression Models
print("\n--- Regression Results ---")
for name, model in regression_models.items():
    model.fit(X_train_reg, y_train_reg)
    preds = model.predict(X_test_reg)
    print(f"\n{name}:")
    print(f"R2 Score: {r2_score(y_test_reg, preds):.3f}")
    print(f"MAE: {mean_absolute_error(y_test_reg, preds):.3f}")
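    # RMSE as the square root of MSE (the `squared` argument is no longer available in recent scikit-learn releases)
    print(f"RMSE: {np.sqrt(mean_squared_error(y_test_reg, preds)):.3f}")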

# Train & Evaluate Classification Models
print("\n--- Classification Results ---")
for name, model in classification_models.items():
    model.fit(X_train_cls, y_train_cls)
    preds = model.predict(X_test_cls)
    print(f"\n{name}:")
    print(f"Accuracy: {accuracy_score(y_test_cls, preds):.3f}")
    print(f"Precision: {precision_score(y_test_cls, preds):.3f}")
    print(f"Recall: {recall_score(y_test_cls, preds):.3f}")
    print(f"F1 Score: {f1_score(y_test_cls, preds):.3f}")
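    # Computed from hard labels, so this value equals balanced accuracy;
    # pass model.predict_proba(X_test_cls)[:, 1] instead for a threshold-independent ROC AUC.
    print(f"ROC AUC: {roc_auc_score(y_test_cls, preds):.3f}")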



--- Regression Results ---

LinearRegression:
R2 Score: 0.957
MAE: 0.151
RMSE: 0.189

RandomForestRegressor:
R2 Score: 0.936
MAE: 0.170
RMSE: 0.230

XGBRegressor:
R2 Score: 0.943
MAE: 0.163
RMSE: 0.217

--- Classification Results ---

LogisticRegression:
Accuracy: 0.956
Precision: 0.949
Recall: 0.968
F1 Score: 0.958
ROC AUC: 0.956

RandomForestClassifier:
Accuracy: 0.983
Precision: 0.988
Recall: 0.980
F1 Score: 0.984
ROC AUC: 0.983

XGBClassifier:
Accuracy: 0.985
Precision: 0.988
Recall: 0.984
F1 Score: 0.986
ROC AUC: 0.985
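The ROC AUC values above track accuracy closely, which is what one would expect when the metric is computed from predicted labels rather than predicted probabilities. The sketch below illustrates the difference on synthetic stand-in data (an assumption: `make_classification` output, not the student dataset); all `*_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)              # 0/1 predictions at the default 0.5 cutoff
pos_scores = clf.predict_proba(X_te)[:, 1]   # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, pos_scores)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"ROC AUC from labels:        {auc_from_labels:.3f}")
print(f"Balanced accuracy:          {bal_acc:.3f}")
print(f"ROC AUC from probabilities: {auc_from_scores:.3f}")
```

With hard labels the ROC curve has a single operating point, so the area reduces to the mean of sensitivity and specificity. Scoring with probabilities evaluates the ranking across all thresholds, which is usually the quantity of interest when the decision cutoff may be adjusted later.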


In [None]:
# # Save Best Models (based on evaluation)
# best_reg_model = regression_models['LinearRegression']   # highest R2 and lowest MAE/RMSE above
# best_cls_model = classification_models['XGBClassifier']  # highest accuracy and F1 above

# joblib.dump(best_reg_model, 'linear_student_grade_model.pkl')
# joblib.dump(best_cls_model, 'xgboost_student_risk_model.pkl')

# print("\n✅ Models saved successfully!")
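Before relying on a saved model for deployment, a quick round trip can confirm that the file reloads and reproduces the original predictions. The following is a minimal sketch under stated assumptions: the data is synthetic, the file name `demo_model.pkl` is hypothetical, and the file is written to a temporary directory rather than the project folder.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny synthetic dataset so the sketch runs on its own.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([0.5, -1.0, 2.0]) + 1.0

model = LinearRegression().fit(X_demo, y_demo)

# Save and reload inside a temporary directory (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded estimator should give the same predictions as the original.
same_predictions = bool(np.allclose(model.predict(X_demo), reloaded.predict(X_demo)))
print("Reloaded model reproduces predictions:", same_predictions)
```

Pickled estimators are generally tied to the library versions that produced them, so recording the scikit-learn and xgboost versions alongside the file is a reasonable precaution.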

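- Linear Regression achieved the best regression fit, with a test-set R² of `0.957` (MAE 0.151, RMSE 0.189), ahead of XGBRegressor (R² 0.943) and RandomForestRegressor (R² 0.936).
- Logistic Regression identified at-risk students with 94.9% precision and 96.8% recall; XGBClassifier scored highest overall (98.8% precision, 98.4% recall, F1 0.986).
- The risk threshold was set at GPA < 2.0 and can be adjusted.
- Future work: hyperparameter tuning of the top models, ensemble models, and re-running the evaluation without `gradeclass`, which appears to be derived from GPA and may inflate these scores.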
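The hyperparameter tuning mentioned above could start from a small grid search. The sketch below is one possible setup, not the notebook's actual tuning procedure: the data is synthetic, the parameter grid is illustrative, and scoring by recall is an assumption motivated by the cost of missing an at-risk student.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data so the example runs on its own.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Illustrative grid; real ranges would depend on the dataset.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="recall",  # assumption: prioritize catching at-risk students
    cv=3,
)
search.fit(X_tr, y_tr)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.3f}")

# The refit best estimator is evaluated once on the held-out test split.
test_recall = recall_score(y_te, search.predict(X_te))
print(f"Test recall: {test_recall:.3f}")
```

Keeping the test split out of the search means the final recall figure is not biased by the parameter selection itself.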