# Preprocessing Pipeline Overview

This notebook performs the **preprocessing stage** of the *Student Performance Classification* project.  
It includes the following steps in accordance with project requirements:


In [None]:
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# 📥 Load raw data
raw_path = '../data/raw/StudentPerformanceFactors.csv'
df = pd.read_csv(raw_path)

print(f"Dataset shape: {df.shape}")
df.head()


Dataset shape: (6607, 20)


Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70


## 1. Handling Missing Values

We treat missing values using different strategies depending on the data type:

- **Numerical Features:**
  - Missing values are replaced with the **median** of each column.
  - This is robust to outliers and helps preserve the distribution of the data.

- **Categorical Features:**
  - Missing values are filled with the **mode** (most frequent category).
  - Assumes that missing entries follow the most common class.

```python
# Numeric: fill missing with median
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Categorical: fill missing with mode
df[categorical_cols] = df[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))
```

In [3]:
# Check for missing values
df.info()
df.isnull().sum().sort_values(ascending=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Hours_Studied               6607 non-null   int64 
 1   Attendance                  6607 non-null   int64 
 2   Parental_Involvement        6607 non-null   object
 3   Access_to_Resources         6607 non-null   object
 4   Extracurricular_Activities  6607 non-null   object
 5   Sleep_Hours                 6607 non-null   int64 
 6   Previous_Scores             6607 non-null   int64 
 7   Motivation_Level            6607 non-null   object
 8   Internet_Access             6607 non-null   object
 9   Tutoring_Sessions           6607 non-null   int64 
 10  Family_Income               6607 non-null   object
 11  Teacher_Quality             6529 non-null   object
 12  School_Type                 6607 non-null   object
 13  Peer_Influence              6607 non-null   obje

Parental_Education_Level      90
Teacher_Quality               78
Distance_from_Home            67
Hours_Studied                  0
Attendance                     0
Gender                         0
Learning_Disabilities          0
Physical_Activity              0
Peer_Influence                 0
School_Type                    0
Family_Income                  0
Tutoring_Sessions              0
Internet_Access                0
Motivation_Level               0
Previous_Scores                0
Sleep_Hours                    0
Extracurricular_Activities     0
Access_to_Resources            0
Parental_Involvement           0
Exam_Score                     0
dtype: int64

In [4]:
# Fill missing numeric values with median
numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Fill missing categorical values with mode
categorical_cols = df.select_dtypes(include=["object"]).columns
df[categorical_cols] = df[categorical_cols].apply(lambda x: x.fillna(x.mode()[0]))


## 2. Target Label Creation

We convert the continuous `Exam_Score` into a categorical variable `Target` with 3 classes:

| Label    | Rule             | Meaning             |
|----------|------------------|---------------------|
| `Bad`    | score < 60       | Poor performance    |
| `Medium` | 60 ≤ score < 80  | Average performance |
| `Good`   | score ≥ 80       | High performance    |



In [5]:
def score_to_class(score):
    if score < 60:
        return "Bad"
    elif score < 80:
        return "Medium"
    else:
        return "Good"


df["Target"] = df["Exam_Score"].apply(score_to_class)
df["Target"].value_counts()

Target
Medium    6491
Bad         68
Good        48
Name: count, dtype: int64


## 3. Categorical Encoding

We apply **One-Hot Encoding** to all categorical variables.  
This step transforms each category into its own binary (0/1) column, ensuring compatibility with ML models.


In [6]:
# One-hot encode categorical variables
df_encoded = pd.get_dummies(df.drop(["Exam_Score"], axis=1), drop_first=True)

# Map target to numeric
target_map = {"Bad": 0, "Medium": 1, "Good": 2}
df_encoded["Target"] = df["Target"].map(target_map)

## 4. Min-Max Normalization

We normalize all numerical features using **Min-Max Normalization**:

$$
x_{\text{norm}} = \frac{x - \min(x)}{\max(x) - \min(x)}
$$


This scales values to the range **[0, 1]**, ensuring that no single feature dominates due to scale.


In [7]:
# Min-Max normalization
features = df_encoded.drop("Target", axis=1)
scaler = MinMaxScaler()
features_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)

# Combine with target
df_final = pd.concat([features_scaled, df_encoded["Target"]], axis=1)

## 5. Train/Validation/Test Split

To evaluate our model effectively, we split the dataset into three parts:

- **Training set** (60%) – used to train the machine learning models.
- **Validation set** (20%) – used to tune hyperparameters and avoid overfitting.
- **Test set** (20%) – used for final performance evaluation on unseen data.

We use **stratified sampling** based on the `Target` column to ensure that the class distribution is preserved across all three subsets.


In [None]:
# Split into train / val / test (60/20/20)
train_val, test = train_test_split(
    df_final, test_size=0.2, random_state=42, stratify=df_final["Target"]
)
train, val = train_test_split(
    train_val, test_size=0.25, random_state=42, stratify=train_val["Target"]
)  # 0.25 * 0.8 = 0.2

# Save to CSV
processed_path = "../data/processed/"
os.makedirs(processed_path, exist_ok=True)

train.to_csv(os.path.join(processed_path, "train.csv"), index=False)
val.to_csv(os.path.join(processed_path, "val.csv"), index=False)
test.to_csv(os.path.join(processed_path, "test.csv"), index=False)

print(f"Train size: {train.shape}, Val size: {val.shape}, Test size: {test.shape}")

Train size: (3963, 30), Val size: (1322, 30), Test size: (1322, 30)
