# Student Performance & Academic Risk Prediction  
## Data Understanding & Cleaning

**Objective:** Prepare a clean, leakage-free dataset to predict academic risk using interpretable machine learning.


In [2]:
import numpy as np 
import pandas as pd


## Problem Context

Students often become academically at risk through academic and behavioral patterns that go unnoticed.
This project frames student performance prediction as a **risk classification problem** so that at-risk students can be flagged early enough for intervention.


In [4]:
# The raw file is semicolon-delimited, hence sep=";".
data_path = "C:/Users/prasad/OneDrive/Documents/student-performance-risk-prediction/data/raw/student-mat.csv"
student_df = pd.read_csv(data_path, sep=";")

student_df.shape

(395, 33)

In [5]:
student_df.head()

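|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|--------|-----|-----|---------|---------|---------|------|------|------|------|-----|--------|----------|-------|------|------|--------|----------|----|----|----|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |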


In [6]:
student_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  higher    

In [7]:
student_df.describe()

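|       | age | Medu | Fedu | traveltime | studytime | failures | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|-------|-----|------|------|------------|-----------|----------|--------|----------|-------|------|------|--------|----------|----|----|----|
| count | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 | 395.0 |
| mean | 16.696203 | 2.749367 | 2.521519 | 1.448101 | 2.035443 | 0.334177 | 3.944304 | 3.235443 | 3.108861 | 1.481013 | 2.291139 | 3.55443 | 5.708861 | 10.908861 | 10.713924 | 10.41519 |
| std | 1.276043 | 1.094735 | 1.088201 | 0.697505 | 0.83924 | 0.743651 | 0.896659 | 0.998862 | 1.113278 | 0.890741 | 1.287897 | 1.390303 | 8.003096 | 3.319195 | 3.761505 | 4.581443 |
| min | 15.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 |
| 25% | 16.0 | 2.0 | 2.0 | 1.0 | 1.0 | 0.0 | 4.0 | 3.0 | 2.0 | 1.0 | 1.0 | 3.0 | 0.0 | 8.0 | 9.0 | 8.0 |
| 50% | 17.0 | 3.0 | 2.0 | 1.0 | 2.0 | 0.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 4.0 | 4.0 | 11.0 | 11.0 | 11.0 |
| 75% | 18.0 | 4.0 | 3.0 | 2.0 | 2.0 | 0.0 | 5.0 | 4.0 | 4.0 | 2.0 | 3.0 | 5.0 | 8.0 | 13.0 | 13.0 | 14.0 |
| max | 22.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 75.0 | 19.0 | 19.0 | 20.0 |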


## Dataset Overview

- 395 students described by 33 columns covering demographic, family, behavioral, and academic history features
- 16 numeric (`int64`) and 17 categorical (`object`) variables
- No missing values were observed in any column
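The overview points above can be checked programmatically. The following is a minimal sketch on a toy frame (`toy_df` and its values are made up for illustration; only the column names come from the dataset), showing one way to separate numeric from categorical columns and count missing values.

```python
import pandas as pd

# Toy stand-in for student_df; the values are illustrative, not real rows.
toy_df = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [18, 17, 15],
    "absences": [6, 4, 10],
    "higher": ["yes", "yes", "no"],
})

# Split columns by dtype: numeric columns vs. everything else.
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
categorical_cols = toy_df.select_dtypes(exclude="number").columns.tolist()

# Total number of missing cells across the whole frame.
total_missing = int(toy_df.isna().sum().sum())

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
print("missing cells:", total_missing)
```

Running the same three statements on `student_df` should reproduce the column split and the zero missing-value count reported in this notebook.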


## Target Definition

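The final grade (`G3`, which ranges from 0 to 20 in this dataset) is binned into three academic risk levels:

- **High Risk:** G3 < 10  
- **Medium Risk:** 10 ≤ G3 < 15  
- **Low Risk:** G3 ≥ 15  

Predicting a risk level instead of an exact grade turns the task into a three-class classification problem, and each class maps directly to a support decision: who needs intervention first.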


In [10]:
def assign_risk(grade):
    """Map a final grade (G3) to a risk level."""
    if grade < 10:
        return "High"
    elif grade < 15:
        return "Medium"
    else:
        return "Low"

student_df["risk_level"] = student_df["G3"].apply(assign_risk)
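As an alternative to the row-wise `apply`, the same thresholds could be expressed with `pd.cut`. This is only a sketch on a small made-up series of grades, not the method used in this notebook; `right=False` is what makes each bin include its lower edge, which matches the `<` comparisons in `assign_risk`.

```python
import pandas as pd

def assign_risk(grade):
    """Same thresholds as the notebook's function, repeated for comparison."""
    if grade < 10:
        return "High"
    elif grade < 15:
        return "Medium"
    else:
        return "Low"

# Made-up grades that cover every boundary (0, 9, 10, 14, 15, 20).
grades = pd.Series([6, 6, 10, 15, 10, 9, 14, 20, 0], name="G3")

# right=False gives left-closed bins: [0, 10), [10, 15), [15, inf).
risk = pd.cut(
    grades,
    bins=[0, 10, 15, float("inf")],
    labels=["High", "Medium", "Low"],
    right=False,
)

# Check that the binned labels agree with the row-wise function.
matches = bool((risk.astype(str) == grades.apply(assign_risk)).all())
print(risk.tolist())
print("agrees with assign_risk:", matches)
```

One difference worth knowing: `pd.cut` returns a categorical column whose categories are ordered by bin (High, Medium, Low), while `apply` returns plain strings.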

In [11]:
student_df[["G3","risk_level"]].head()

|   | G3 | risk_level |
|---|----|------------|
| 0 | 6  | High       |
| 1 | 6  | High       |
| 2 | 10 | Medium     |
| 3 | 15 | Low        |
| 4 | 10 | Medium     |


In [13]:
student_df["risk_level"].value_counts()

risk_level
Medium    192
High      130
Low        73
Name: count, dtype: int64
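The counts above are uneven across classes, which matters later when choosing evaluation metrics. A small sketch, using the three counts copied from the output above, turns them into class shares and a simple largest-to-smallest ratio (the name `imbalance_ratio` is just illustrative).

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"Medium": 192, "High": 130, "Low": 73}, name="count")

# Share of each class, rounded for readability.
shares = (counts / counts.sum()).round(3)

# Ratio between the most and least frequent class.
imbalance_ratio = counts.max() / counts.min()

print(shares)
print("largest / smallest class:", round(imbalance_ratio, 2))
```

On the real frame, `student_df["risk_level"].value_counts(normalize=True)` gives the shares directly.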

In [14]:
student_df.isnull().sum()

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
risk_level    0
dtype: int64

## Data Leakage Prevention

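Grade-related features (`G1`, `G2`, `G3`) were removed to prevent data leakage.
`G3` is the source of the `risk_level` target, so keeping it would reveal the label directly.
`G1` and `G2` are the first- and second-period grades; they are closely tied to `G3` and would not yet be available at the early point when an intervention decision has to be made.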
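A lightweight guard can make this rule hard to break by accident in later notebooks. The sketch below is an assumption about how such a check might look: `check_no_leakage` is a hypothetical helper, and `toy_df` holds made-up rows with a few of the dataset's column names.

```python
import pandas as pd

LEAKAGE_COLUMNS = ["G1", "G2", "G3"]  # grade columns named in the text
TARGET = "risk_level"

def check_no_leakage(features):
    """Hypothetical guard: raise if any grade column is still a feature."""
    leaked = sorted(set(LEAKAGE_COLUMNS) & set(features.columns))
    if leaked:
        raise ValueError(f"Leakage columns still present: {leaked}")

# Toy rows for illustration only.
toy_df = pd.DataFrame({
    "age": [18, 17, 15],
    "absences": [6, 4, 10],
    "G1": [5, 5, 7],
    "G2": [6, 5, 8],
    "G3": [6, 6, 10],
    "risk_level": ["High", "High", "Medium"],
})

# Features exclude both the grade columns and the target itself.
X = toy_df.drop(columns=LEAKAGE_COLUMNS + [TARGET])
y = toy_df[TARGET]

check_no_leakage(X)  # passes silently when no grade column remains
print(list(X.columns))
```

Calling `check_no_leakage(toy_df)` on the unfiltered frame raises a `ValueError`, which is the behavior wanted just before model training.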


In [15]:
leakage_features = ["G1", "G2", "G3"]
student_df_clean = student_df.drop(columns=leakage_features)

In [16]:
student_df_clean.shape

(395, 31)

In [17]:
student_df_clean.head()

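|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | risk_level |
|---|--------|-----|-----|---------|---------|---------|------|------|------|------|-----|----------|----------|--------|----------|-------|------|------|--------|----------|------------|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 | High |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 | High |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 | Medium |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 | Low |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 | Medium |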


In [19]:
student_df_clean.duplicated().sum()

np.int64(0)

In [21]:
student_df_clean.dtypes

school        object
sex           object
age            int64
address       object
famsize       object
Pstatus       object
Medu           int64
Fedu           int64
Mjob          object
Fjob          object
reason        object
guardian      object
traveltime     int64
studytime      int64
failures       int64
schoolsup     object
famsup        object
paid          object
activities    object
nursery       object
higher        object
internet      object
romantic      object
famrel         int64
freetime       int64
goout          int64
Dalc           int64
Walc           int64
health         int64
absences       int64
risk_level    object
dtype: object

## Data Cleaning Summary

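- No missing values were found, so no imputation was needed
- No duplicate rows were found
- `risk_level` target derived from `G3` (192 Medium, 130 High, 73 Low)
- Grade-related features (`G1`, `G2`, `G3`) removed to avoid leakage, leaving 395 rows and 31 columns
- Dataset is ready for exploratory data analysis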
