# 🎓 Student Performance Predictor

This notebook walks through the full ML pipeline:
1. Data Loading
2. Exploratory Data Analysis (EDA)
3. Preprocessing
4. Regression Modeling (predicting exam score)
5. Classification Modeling (pass / fail)
6. Feature Importance & Interpretation
7. Summary & Next Steps

In [None]:
import os
import sys

# Add the project root (the parent of the current working directory)
# to the import path so that the `src` package resolves.
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..'))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from src.data_loader import load_data
from src.preprocessing import (
    prepare_data, add_pass_fail, get_feature_names,
    NUMERIC_FEATURES, CATEGORICAL_FEATURES,
    TARGET_REGRESSION, TARGET_CLASSIFICATION,
)
from src.model import (
    train_regression_models, train_classification_models,
    tune_model, save_model,
)
from src.evaluation import evaluate_regression, evaluate_classification, print_report
from src.visualizations import (
    plot_score_distribution, plot_correlation_heatmap,
    plot_scatter_with_regression, plot_feature_importance,
    plot_predicted_vs_actual, plot_confusion_matrix,
)

%matplotlib inline
sns.set_theme(style='whitegrid', palette='muted', font_scale=1.1)
print('✅ All imports loaded')

---
## 1 · Data Loading

In [None]:
DATA_PATH = os.path.join(PROJECT_ROOT, 'data', 'student_performance.csv')
df = load_data(DATA_PATH)
df.head(10)

In [None]:
print(f'Shape: {df.shape}')
print(f'\nData types:\n{df.dtypes}')
print(f'\nMissing values:\n{df.isnull().sum()}')

In [None]:
df.describe()

---
## 2 · Exploratory Data Analysis

### 2.1 Score Distribution

In [None]:
fig = plot_score_distribution(df)
plt.show()

### 2.2 Correlation Heatmap

In [None]:
fig = plot_correlation_heatmap(df)
plt.show()

### 2.3 Scatter Plots with Regression Lines

In [None]:
scatter_features = ['study_hours_per_week', 'attendance_rate', 'previous_exam_score', 'internal_marks']
for feat in scatter_features:
    fig = plot_scatter_with_regression(df, x=feat)
    plt.show()

### 2.4 Categorical Distributions

In [None]:
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
for ax, col in zip(axes.flatten(), CATEGORICAL_FEATURES):
    sns.boxplot(data=df, x=col, y='final_exam_score', ax=ax, palette='muted')
    ax.set_title(f'{col.replace("_", " ").title()} vs Final Score', weight='bold')
    ax.tick_params(axis='x', rotation=20)
plt.tight_layout()
plt.show()

---
## 3 · Preprocessing

In [None]:
# Regression data
X_train_r, X_test_r, y_train_r, y_test_r, preprocessor_r = prepare_data(
    df, target=TARGET_REGRESSION
)
print(f'Regression split  →  Train: {X_train_r.shape[0]}  |  Test: {X_test_r.shape[0]}')

# Classification data
df_cls = add_pass_fail(df)
print(f'\nPass/Fail distribution:\n{df_cls[TARGET_CLASSIFICATION].value_counts()}')

X_train_c, X_test_c, y_train_c, y_test_c, preprocessor_c = prepare_data(
    df_cls, target=TARGET_CLASSIFICATION
)
print(f'Classification split  →  Train: {X_train_c.shape[0]}  |  Test: {X_test_c.shape[0]}')

---
## 4 · Regression Modeling

Predicting the **exact final exam score** using Linear Regression, Random Forest, and Gradient Boosting.

In [None]:
print('Training regression models …')
reg_models = train_regression_models(X_train_r, y_train_r, preprocessor_r)

In [None]:
for name, model in reg_models.items():
    metrics, y_pred = evaluate_regression(model, X_test_r, y_test_r)
    print_report(name, metrics)
    fig = plot_predicted_vs_actual(y_test_r, y_pred)
    plt.suptitle(name, y=1.02, fontsize=13, weight='bold')
    plt.show()
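
The exact metrics returned by `evaluate_regression` are not shown in this notebook. As a minimal, dependency-free sketch of what such a report commonly contains, the snippet below computes MAE, RMSE and R² by hand. The function name `regression_metrics` and the choice of these three metrics are assumptions for illustration, and the input values are toy numbers, not results from this dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, RMSE and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    # R^2 is undefined when the target is constant; report NaN in that case.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Toy values for illustration only (hypothetical, not from the dataset).
toy_metrics = regression_metrics([60.0, 70.0, 80.0], [62.0, 69.0, 77.0])
print(toy_metrics)
```

An R² close to 1 means the model explains most of the variance in the scores, while MAE and RMSE are expressed in score points, which makes them easier to read alongside the predicted-vs-actual plots.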

### 4.1 Hyperparameter Tuning (Random Forest Regressor)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from src.preprocessing import build_preprocessor

rf_pipe = Pipeline([
    ('preprocessor', build_preprocessor()),
    ('model', RandomForestRegressor(random_state=42, n_jobs=-1)),
])

param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5],
}

best_rf = tune_model(rf_pipe, param_grid, X_train_r, y_train_r, scoring='r2')

metrics, y_pred = evaluate_regression(best_rf, X_test_r, y_test_r)
print_report('Tuned Random Forest Regressor', metrics)

# Save best regression model
save_model(best_rf, os.path.join(PROJECT_ROOT, 'models', 'best_regressor.pkl'))
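
To get a feel for the cost of the search above, the sketch below enumerates every combination in the same `param_grid`. It assumes that `tune_model` performs an exhaustive grid search with cross-validation. Its internals are not shown here, so the fold count (`ASSUMED_CV_FOLDS`) is a labeled assumption rather than the project's actual setting.

```python
from itertools import product

# Same grid as the tuning cell above.
param_grid = {
    'model__n_estimators': [100, 200],
    'model__max_depth': [None, 10, 20],
    'model__min_samples_split': [2, 5],
}

# Every combination an exhaustive search would try.
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

ASSUMED_CV_FOLDS = 5  # assumption: the fold count used by tune_model is not shown
total_fits = len(candidates) * ASSUMED_CV_FOLDS

print(f'{len(candidates)} candidates, {total_fits} fits with {ASSUMED_CV_FOLDS}-fold CV')
```

Because the number of fits grows multiplicatively with every added parameter value, it is worth checking this count before widening the grid.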

---
## 5 · Classification Modeling

Predicting **Pass / Fail** (threshold: 50) using Logistic Regression and Random Forest Classifier.
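
As a minimal sketch of the labelling step, the snippet below derives a binary label from a score using the threshold of 50 stated above. The helper name `label_pass_fail` is hypothetical, and treating a score of exactly 50 as a pass is an assumption; the project's `add_pass_fail` may differ on both counts.

```python
PASS_THRESHOLD = 50  # threshold stated in this section


def label_pass_fail(scores, threshold=PASS_THRESHOLD):
    """Return 1 (pass) or 0 (fail) for each score.

    Assumption: a score equal to the threshold counts as a pass.
    """
    return [1 if score >= threshold else 0 for score in scores]


# Illustrative scores, not taken from the dataset.
example_scores = [35.0, 49.9, 50.0, 72.5]
example_labels = label_pass_fail(example_scores)
print(example_labels)
```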

In [None]:
print('Training classification models …')
cls_models = train_classification_models(X_train_c, y_train_c, preprocessor_c)

In [None]:
for name, model in cls_models.items():
    metrics, y_pred = evaluate_classification(model, X_test_c, y_test_c)
    print_report(name, metrics)
    fig = plot_confusion_matrix(metrics['Confusion Matrix'])
    plt.suptitle(name, y=1.02, fontsize=13, weight='bold')
    plt.show()
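
To connect the confusion matrices above to the F1 score quoted in the summary, here is a small sketch that derives precision, recall and F1 from a 2×2 matrix. It assumes the scikit-learn layout (rows are true labels, columns are predicted labels, ordered negative then positive). Whether `metrics['Confusion Matrix']` follows that layout is not confirmed by this notebook, and `binary_scores` is a hypothetical helper with made-up counts.

```python
def binary_scores(confusion_matrix):
    """Precision, recall and F1 for the positive class from a 2x2 matrix.

    Assumed layout (scikit-learn convention): rows are true labels,
    columns are predicted labels, ordered [negative, positive].
    """
    (tn, fp), (fn, tp) = confusion_matrix
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up counts for illustration, not results from this notebook.
toy_scores = binary_scores([[40, 10], [5, 45]])
print(toy_scores)
```

For a pass/fail task, recall on the fail class is often the figure of interest, since a missed at-risk student is usually the costlier error; swapping which class is treated as positive only requires reordering the matrix.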

### 5.1 Save Best Classifier

In [None]:
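# The Random Forest Classifier is selected by name, not by metric;
# check the reports above to confirm it is the stronger model before saving.
best_cls = cls_models['Random Forest Classifier']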
save_model(best_cls, os.path.join(PROJECT_ROOT, 'models', 'best_classifier.pkl'))

---
## 6 · Feature Importance

In [None]:
# Get feature names from the fitted preprocessor
fitted_preprocessor = best_rf.named_steps['preprocessor']
feat_names = get_feature_names(fitted_preprocessor)

fig = plot_feature_importance(best_rf, feat_names)
if fig:
    plt.suptitle('Regression: Feature Importance', y=1.02, fontsize=13, weight='bold')
    plt.show()

In [None]:
fitted_preprocessor_c = best_cls.named_steps['preprocessor']
feat_names_c = get_feature_names(fitted_preprocessor_c)

fig = plot_feature_importance(best_cls, feat_names_c)
if fig:
    plt.suptitle('Classification: Feature Importance', y=1.02, fontsize=13, weight='bold')
    plt.show()

---
## 7 · Summary & Next Steps

| Task | Best Model | Key Metric |
|------|-----------|------------|
| Score Prediction | Tuned Random Forest | R² (see above) |
| Pass/Fail | Random Forest Classifier | F1 / ROC-AUC (see above) |

### Next Steps
- Launch the **Streamlit app** for interactive predictions: `streamlit run app/streamlit_app.py`
- Collect user feedback via the in-app form
- Experiment with additional features or real-world datasets
- Perform fairness audits across demographic groups