
# 🔁 Rebuilding JAMB Regression Model using Scikit-learn

This notebook recreates the best model found with PyCaret using native scikit-learn tools, for transparency and control over each step.


## 🧾 Data Dictionary: JAMB Exam Results

| Column Name                     | Data Type | Description |
|--------------------------------|-----------|-------------|
| **JAMB_Score**                 | `int64`   | Final exam score (target variable) |
| **Study_Hours_Per_Week**       | `int64`   | Average number of study hours per week |
| **Attendance_Rate**            | `int64`   | Attendance percentage |
| **Teacher_Quality**            | `int64`   | Teacher quality rating (1–5 scale) |
| **Distance_To_School**         | `float64` | Distance from home to school in kilometers |
| **School_Type**                | `object`  | Type of school (Public/Private) |
| **School_Location**            | `object`  | Location of the school (Urban/Rural) |
| **Extra_Tutorials**            | `object`  | Whether student attended extra tutorials (Yes/No) |
| **Access_To_Learning_Materials** | `object` | Whether student had access to learning materials (Yes/No) |
| **Parent_Involvement**         | `object`  | Level of parental involvement (Low/Medium/High) |
| **IT_Knowledge**               | `object`  | Student's IT knowledge level (Low/Medium/High) |
| **Student_ID**                 | `int64`   | Unique student identifier (not used in modeling) |
| **Age**                        | `int64`   | Age of the student |
| **Gender**                     | `object`  | Student's gender (Male/Female) |
| **Socioeconomic_Status**       | `object`  | Family’s financial background (Low/Medium/High) |
| **Parent_Education_Level**     | `object`  | Highest education level of parents |
| **Assignments_Completed**      | `int64`   | Number of assignments completed by the student (1–5 scale) |


[EDA Report](https://mukhtarasif.github.io/EDA_Reports/ydata/JAMB_YData_Profile_Report.html)


import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

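# Load the data and drop the identifier column, which is not used in modeling
df = pd.read_csv("jamb_exam_results.csv")
df = df.drop(columns=["Student_ID"])

# Encode categorical features as integer codes.
# Note: LabelEncoder numbers classes alphabetically, so ordered levels such as
# Low/Medium/High become High=0, Low=1, Medium=2 rather than rank order.
df_encoded = df.copy()
for col in df_encoded.select_dtypes(include="object").columns:
    df_encoded[col] = LabelEncoder().fit_transform(df_encoded[col])

# Separate the features from the target
X = df_encoded.drop("JAMB_Score", axis=1)
y = df_encoded["JAMB_Score"]

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Fit a random forest and predict on the test set
model = RandomForestRegressor(n_estimators=100, random_state=123)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)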

# Metrics
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R² Score:", r2_score(y_test, y_pred))


MAE: 31.24795
RMSE: 38.95153830210047
R² Score: 0.2562509137195814
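
The encoding loop above relies on `LabelEncoder`, which assigns integer codes in alphabetical order. For columns the data dictionary lists as Low/Medium/High, that ordering does not match the rank of the levels. The following is a minimal sketch of one alternative, `OrdinalEncoder` with an explicit category list. The toy rows are hypothetical; only the column name and its levels come from the data dictionary, and this is not the encoding the notebook actually uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy rows for illustration; not taken from the JAMB dataset
toy = pd.DataFrame({"Parent_Involvement": ["Low", "High", "Medium", "Low"]})

# LabelEncoder sorts classes alphabetically: High=0, Low=1, Medium=2
label_codes = LabelEncoder().fit_transform(toy["Parent_Involvement"])

# OrdinalEncoder with an explicit category list keeps rank order:
# Low=0, Medium=1, High=2
ordinal = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordinal_codes = ordinal.fit_transform(toy[["Parent_Involvement"]]).ravel()

print("LabelEncoder codes:  ", label_codes.tolist())
print("OrdinalEncoder codes:", ordinal_codes.tolist())
```

Tree-based models split on thresholds and can often cope with arbitrary integer codes, so whether this change would move the metrics reported here is untested.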


## 📈 Model Evaluation Summary

Performance of the random forest regressor on the held-out test set (20% of the data) when predicting **JAMB scores**:

- **Mean Absolute Error (MAE)**: `31.25`  
  → On average, the model's predictions are off by about 31 marks.

- **Root Mean Squared Error (RMSE)**: `38.95`  
  → Squares each error before averaging, so large misses count more than they do in MAE. Read it as a typical error size, in marks, that is sensitive to outliers.

- **R² Score**: `0.2563`  
  → The model explains about 25.63% of the variance in test-set JAMB scores, leaving roughly three quarters unexplained, so there is substantial room for improvement.
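
To make the three metrics above concrete, the sketch below computes them by hand with NumPy and checks the results against the scikit-learn functions used in this notebook. The four score pairs are hypothetical values chosen for easy arithmetic, not rows from the JAMB data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true and predicted scores, for illustration only
y_true = np.array([200.0, 150.0, 180.0, 250.0])
y_hat = np.array([190.0, 170.0, 180.0, 220.0])

errors = y_true - y_hat

# MAE: mean of the absolute errors
mae = np.mean(np.abs(errors))

# RMSE: square root of the mean squared error
rmse = np.sqrt(np.mean(errors ** 2))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# The manual values should agree with scikit-learn's implementations
assert np.isclose(mae, mean_absolute_error(y_true, y_hat))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_hat)))
assert np.isclose(r2, r2_score(y_true, y_hat))

print("MAE:", mae)
print("RMSE:", rmse)
print("R²:", r2)
```

Because `ss_tot` measures the error of always predicting the mean score, an R² of 0 corresponds to that constant baseline, and the notebook's 0.2563 is an improvement measured against it.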
