# Student Academic Performance Prediction
## Second Notebook: Data Preparation and Feature Engineering

Following the exploratory analysis conducted in the first notebook, the focus of this phase is on preparing the dataset for predictive modeling and constructing informative features that capture the underlying patterns in student performance. Proper data preparation and feature engineering are essential steps to enhance model accuracy, reduce bias, and ensure that the predictive algorithms can effectively leverage both academic and contextual information.

This notebook covers the main preparation steps: treating outliers in absences and grades, isolating zero-grade (dropout) records, splitting each dataset into training and test sets, encoding categorical variables, and standardizing numerical features. It also engineers composite and alternative variables intended for sensitivity analysis.

This notebook provides a high-level overview of the data preparation and feature engineering process, laying the groundwork for building predictive models that estimate students’ final academic performance.

**Author**: J-F Jutras  
**Date**: January 2026  
**Dataset**: Student Performance — UCI / Kaggle (Portuguese Secondary Education)

## 2.1-Data Loading

In [1]:
import pandas as pd
import os
import kagglehub

#Download latest version of the dataset
path = kagglehub.dataset_download("jaimeh1/acamedicperfomance")

#Define dataset path
dataset_dir = "/kaggle/input/acamedicperfomance"

#Load Portuguese dataset
port_csv = os.path.join(dataset_dir, "student_language.csv")
df_port = pd.read_csv(port_csv, sep = ";")

#Load Math dataset
math_csv = os.path.join(dataset_dir, "student_math.csv")
df_math = pd.read_csv(math_csv, sep = ";")

#Clone the public GitHub repository "jfj-utils" into the current Kaggle working directory.
#This downloads all files and folders from the repo so they can be used in the notebook.
!rm -rf /kaggle/working/jfj-utils
!git clone https://github.com/jfjutras07/jfj-utils.git

#Add the cloned repository to the Python path so Python can import modules from it
import sys
sys.path.append("/kaggle/working/jfj-utils")

Cloning into 'jfj-utils'...
remote: Enumerating objects: 2016, done.
remote: Counting objects: 100% (91/91), done.
remote: Compressing objects: 100% (70/70), done.
remote: Total 2016 (delta 69), reused 21 (delta 21), pack-reused 1925 (from 3)
Receiving objects: 100% (2016/2016), 664.84 KiB | 14.77 MiB/s, done.
Resolving deltas: 100% (1301/1301), done.


### Column Description

| Variable | Description | Variable | Description |
|---------|-------------|---------|-------------|
| school | Student's school (GP or MS) | sex | Student's sex (F or M) |
| age | Student's age (15–22) | address | Home address type (Urban or Rural) |
| famsize | Family size (≤3 or >3) | Pstatus | Parents' cohabitation status |
| Medu | Mother's education level (0–4) | Fedu | Father's education level (0–4) |
| Mjob | Mother's occupation | Fjob | Father's occupation |
| reason | Reason for choosing the school | guardian | Student's guardian |
| traveltime | Home-to-school travel time (1–4) | studytime | Weekly study time (1–4) |
| failures | Number of past class failures | schoolsup | Extra educational support |
| famsup | Family educational support | paid | Extra paid classes (subject-specific) |
| activities | Extra-curricular activities | nursery | Attended nursery school |
| higher | Intention to pursue higher education | internet | Internet access at home |
| romantic | In a romantic relationship | famrel | Family relationship quality (1–5) |
| freetime | Free time after school (1–5) | goout | Going out with friends (1–5) |
| Dalc | Workday alcohol consumption (1–5) | Walc | Weekend alcohol consumption (1–5) |
| health | Current health status (1–5) | absences | Number of school absences |
| G1 | First period grade (0–20) | G2 | Second period grade (0–20) |
| G3 | Final grade (0–20) |  |  |


## 2.2-Handling Outliers - Absences and Grades

In [2]:
#Detect outliers in continuous columns of both Portuguese and Math datasets using the IQR method
continuous_cols = ['age', 'absences', 'G1', 'G2', 'G3']

discrete_cols = ["Medu","Fedu","traveltime","studytime","failures","famrel","freetime",
                "goout","Dalc","Walc","health"]

categorical_cols = ["school","sex","address","famsize","Pstatus","Mjob","Fjob","reason","guardian",
                    "schoolsup","famsup","paid","activities","nursery","higher","internet","romantic"]

from data_preprocessing.outliers import detect_outliers_iqr
print(detect_outliers_iqr(df_port, continuous_cols))
print(detect_outliers_iqr(df_math, continuous_cols))

{'age': 1, 'absences': 21, 'G1': 16, 'G2': 25, 'G3': 16, 'Total_outliers': 79}
{'age': 1, 'absences': 15, 'G1': 0, 'G2': 13, 'G3': 0, 'Total_outliers': 29}
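The `detect_outliers_iqr` helper comes from the jfj-utils repository and its implementation is not shown in this notebook. As a rough sketch of the usual 1.5 × IQR rule that such a helper presumably applies, the hypothetical function `count_iqr_outliers` below counts values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. The function name, the `k` parameter, and the toy data are assumptions made for illustration.

```python
import pandas as pd

def count_iqr_outliers(df, cols, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column (illustrative helper)."""
    counts = {}
    for col in cols:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        counts[col] = int(((df[col] < lower) | (df[col] > upper)).sum())
    #Sum the per-column counts before adding the total key itself
    counts["Total_outliers"] = sum(counts.values())
    return counts

#Toy data: a single extreme absence count among otherwise small values
toy = pd.DataFrame({"absences": [0, 2, 2, 4, 4, 6, 8, 75]})
result = count_iqr_outliers(toy, ["absences"])
print(result)
```

The real helper may differ in its quantile interpolation or in how it treats missing values, so its counts need not match this sketch exactly.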


Extreme values in age are intentionally retained. These observations represent legitimate late-stage student profiles (e.g., 20+ years old) that carry vital predictive signals regarding academic maturity and historical delays.

The absences column exhibits significant positive skewness. We apply a logarithmic transformation to these values to neutralize the disproportionate impact of extreme outliers while preserving the ordinal relationship between student attendance and final success.

In [3]:
import numpy as np

#Apply log transformation to 'absences' in both datasets
df_port['absences_log'] = np.log1p(df_port['absences']) 
df_math['absences_log'] = np.log1p(df_math['absences'])

#Only one representation of the variable is kept in the modeling dataset
df_port.drop(columns=['absences'], inplace = True)
df_math.drop(columns=['absences'], inplace = True)

#Quick check
print(detect_outliers_iqr(df_port, ['absences_log']))
print(detect_outliers_iqr(df_math, ['absences_log']))

{'absences_log': 0, 'Total_outliers': 0}
{'absences_log': 0, 'Total_outliers': 0}


The raw absences variable was log-transformed to address skewness, and only the transformed version was retained in the modeling dataset to avoid redundancy.
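To make the effect of the transformation concrete, the sketch below compares the sample skewness of a small right-skewed series before and after `np.log1p`, and checks that `np.expm1` recovers the original values. The numbers are invented toy data, not values from the student datasets.

```python
import numpy as np
import pandas as pd

#Toy right-skewed counts (assumed data for illustration only)
raw = pd.Series([0, 0, 1, 2, 2, 3, 4, 6, 10, 30, 75], dtype=float)

#log1p computes log(1 + x), so zero absences map to 0 instead of -inf
logged = np.log1p(raw)

skew_raw = raw.skew()
skew_log = logged.skew()
print(f"skew before: {skew_raw:.2f}, after: {skew_log:.2f}")

#The transformation is monotonic and invertible with expm1
recovered = np.expm1(logged)
```

Because the mapping is monotonic, the ranking of students by absences is unchanged; only the spacing between large values is compressed.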

Outliers detected in G1, G2, and G3 (specifically zero-inflation) are systematically isolated. By separating these cases for a dedicated descriptive analysis in a later phase, we ensure that our primary regression engines focus on "active" academic performance without being biased by dropout events.

In [4]:
def analyze_zero_grades(df, dataset_name):
    target_grades = ['G1', 'G2', 'G3']
    
    #Count occurrences per column
    zero_counts = (df[target_grades] == 0).sum()
    
    #Filter students with at least one zero
    zero_anywhere = df[(df['G1'] == 0) | (df['G2'] == 0) | (df['G3'] == 0)][['G1', 'G2', 'G3']]
    
    #Identify sudden dropouts (nonzero G1 and G2, but 0 in G3)
    sudden_dropouts = zero_anywhere[(zero_anywhere['G1'] > 0) & (zero_anywhere['G2'] > 0) & (zero_anywhere['G3'] == 0)]
    
    #Identify the Recovery Case (0 in G1, but > 0 in G3)
    recovery_cases = zero_anywhere[(zero_anywhere['G1'] == 0) & (zero_anywhere['G3'] > 0)]
    
    print(f"--- {dataset_name} Analysis ---")
    print(f"Zero counts per period:\n{zero_counts}")
    print("Detailed breakdown:")
    print(zero_anywhere)
    print("\n" + "="*40 + "\n")

#Run for both datasets
analyze_zero_grades(df_port, "Portuguese")
analyze_zero_grades(df_math, "Mathematics")

--- Portuguese Analysis ---
Zero counts per period:
G1     1
G2     7
G3    15
dtype: int64
Detailed breakdown:
     G1  G2  G3
0     0  11  11
163  11   9   0
440   7   0   0
519   8   7   0
563   7   0   0
567   4   0   0
583   8   6   0
586   8   8   0
597   9   0   0
603   5   0   0
605   5   0   0
610   8   0   0
626   7   5   0
637   7   7   0
639   5   8   0
640   7   7   0


--- Mathematics Analysis ---
Zero counts per period:
G1     0
G2    13
G3    38
dtype: int64
Detailed breakdown:
     G1  G2  G3
128   7   4   0
130  12   0   0
131   8   0   0
134   9   0   0
135  11   0   0
136  10   0   0
137   4   0   0
140   7   9   0
144   5   0   0
146   6   7   0
148   7   6   0
150   6   5   0
153   5   0   0
160   7   6   0
162   7   0   0
168   6   7   0
170   6   5   0
173   8   7   0
221   6   5   0
239   7   7   0
242   6   0   0
244   7   0   0
259  10   9   0
264   9  10   0
269   6   0   0
296  10   9   0
310   9   9   0
316   8   8   0
332   7   0   0
333   8   8   0
334  

The 53 students (15 in Portuguese and 38 in Mathematics) who finished with a final grade of 0 are removed from this stage. A final zero is usually an "administrative" event (dropping out, long-term absence) rather than a reflection of actual academic skill. Keeping these cases would compromise the regression models by forcing them to predict a total collapse that follows a different logic than regular grading. These observations are moved to Notebook 4 for a specific descriptive study.

The student in the Portuguese dataset who started with a 0 in G1 but improved to an 11 in G3 is intentionally retained. This case represents a real academic success story and provides the model with a crucial signal for resilience, demonstrating that an initial failure is not an inescapable outcome.

By separating "dropouts" from "active students," the models become significantly more accurate at predicting the actual grades of the population remaining within the school system.
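The separation rule described above can be sketched with a boolean mask on a toy frame. The four rows are invented; the first one mimics the retained "recovery" profile (0 in G1, positive G3).

```python
import pandas as pd

#Toy grades (assumed data): row 0 is a recovery case, rows 1 and 2 end with G3 = 0
toy = pd.DataFrame({
    "G1": [0, 11, 7, 12],
    "G2": [11, 9, 0, 13],
    "G3": [11, 0, 0, 14],
})

#Only the final grade decides whether a record is treated as a dropout
is_dropout = toy["G3"] == 0
dropouts = toy[is_dropout].copy()
active = toy[~is_dropout].copy()

print(f"dropouts: {len(dropouts)}, active: {len(active)}")
```

Filtering on G3 alone is what keeps the recovery case in the modeling data even though its G1 is zero.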

In [5]:
#Identify and isolate dropouts (G3 = 0)
#We create temporary copies and add a 'subject' column to track origin after merging
port_dropouts = df_port[df_port['G3'] == 0].copy()
port_dropouts['subject'] = 'Portuguese'

math_dropouts = df_math[df_math['G3'] == 0].copy()
math_dropouts['subject'] = 'Mathematics'

#Create the combined dropout dataset for further analysis in Notebook 4
#Merging the 53 cases (15 Port + 38 Math) into a single dataframe
df_all_dropouts = pd.concat([port_dropouts, math_dropouts], axis=0).reset_index(drop=True)

#Clean original datasets
df_port = df_port[df_port['G3'] > 0].copy()
df_math = df_math[df_math['G3'] > 0].copy()

#Verification
print("Extraction successful:")
print(f"-> df_all_dropouts: {df_all_dropouts.shape[0]} rows")
print(f"-> Cleaned df_port: {df_port.shape[0]} rows")
print(f"-> Cleaned df_math: {df_math.shape[0]} rows")

Extraction successful:
-> df_all_dropouts: 53 rows
-> Cleaned df_port: 634 rows
-> Cleaned df_math: 357 rows


## 2.3-Feature Engineering Design and Variable Transformation Strategy

### Demographics

| Variable | Transformation / Option | Type | Justification | Priority / Scenario |
|----------|------------------------|------|---------------|------------------|
| Age | Keep numeric | Continuous | Weak direct correlation with grades, captures older repeaters | Standard |
| Sex | Binary encoding (0=male,1=female) | Binary | Stable effect across grades | Priority |
| School (GP/MS) | Binary (GP=1, MS=0) | Binary | Moderate, consistent effect | Priority |
| Address (Urban/Rural) | Binary (Urban=1, Rural=0) | Binary | Weak effect, stable | Standard |


### Family Structure and Socio-Economic Background

| Variable | Transformation / Option | Type | Justification | Priority / Scenario |
|----------|------------------------|------|---------------|------------------|
| Famsize | Binary (GT3 / LE3) | Binary | Family structure control variable; weak but standard contextual effect | Standard |
| Pstatus | Binary (T = 1, A = 0) | Binary | Minor but stable association; captures household stability | Standard |
| Medu | Keep ordinal (0–4) | Ordinal | Positive, stable effect on academic performance | Priority |
| Fedu | Keep ordinal (0–4) | Ordinal | Positive, stable effect on academic performance | Priority |
| Mjob | One-hot encoding | Categorical | Occupational context not fully captured by education | Standard |
| Fjob | One-hot encoding | Categorical | Complementary socio-economic signal | Standard |
| Guardian | One-hot (mother / father / other) | Categorical | Weak/modest contextual effect | Standard |
| Parental Education Level | Composite: mean(Medu, Fedu) | Continuous composite | Captures parental educational capital while reducing multicollinearity | Priority (alternative core) |
| Parental Education Level (binned) | Low (mean ≤ 1) / Medium (1 < mean ≤ 2) / High (mean > 2) | Ordinal | Could model socio-educational strata and potential non-linear effects | Considered |

### School Context and Academic Support

| Variable | Transformation / Option | Type | Justification | Priority / Scenario |
|----------|------------------------|------|---------------|------------------|
| Reason | One-hot | Categorical | Weak effect, useful for interactions | Priority |
| Traveltime | Keep numeric | Ordinal | Slight effect on grades | Standard |
| Schoolsup | Binary (Yes=1, No=0) | Binary | Indicates prior academic difficulty | Standard |
| Famsup | Binary (Yes=1, No=0) | Binary | Slight positive effect, stable | Standard |
| Paid | Binary (Yes=1, No=0) | Binary | Weak effect, may reflect extra support | Standard |
| Nursery | Binary (Yes=1, No=0) | Binary | Stable, minor effect | Standard |
| Higher | Binary (Yes=1, No=0) | Binary | Strong positive effect, major predictor | Priority |
| Internet | Binary (Yes=1, No=0) | Binary | Minor effect, optional for interactions | Standard |


### Student Behavior and Lifestyle

| Variable | Transformation / Option | Type | Justification | Priority / Scenario |
|----------|------------------------|------|---------------|------------------|
| Studytime | Keep numeric | Ordinal | Slight positive effect on grades | Standard |
| Failures | Keep numeric | Continuous | Strong negative effect on G3 | Priority |
| Failures | Bin: 0 / 1 / ≥2 | Ordinal | Could capture non-linear impact on grades | Considered |
| Activities | Binary (Yes=1, No=0) | Binary | Small effect | Standard |
| Freetime | Keep numeric | Ordinal | Minor effect, stable | Standard |
| Goout | Keep numeric | Ordinal | Modest negative effect | Standard |
| Romantic | Binary (Yes=1, No=0) | Binary | Slight negative effect | Standard |
| Absences | Keep numeric (already log-transformed) | Continuous | Weak/modest effect | Standard |


### Health and Well-being

| Variable | Transformation / Option | Type | Justification | Priority / Scenario |
|----------|------------------------|------|---------------|------------------|
| Health | Keep numeric | Ordinal | Minor effect, stable across grades | Standard |
| Famrel | Keep numeric | Ordinal | Moderate positive, stable family effect | Standard |
| Dalc (workday alcohol) | Keep numeric | Ordinal | Slight negative effect, behavior-related | Standard |
| Walc (weekend alcohol) | Keep numeric | Ordinal | Slight negative effect, correlated with Dalc | Standard |
| Alcohol Consumption Index | Composite (Dalc + Walc) | Composite numeric | Captures latent alcohol behavior, reduces redundancy, more stable than individual effects | Sensitivity scenario |


### Academic Performance

| Variable | Transformation / Option | Type | Justification | Priority / Scenario |
|----------|------------------------|------|---------------|------------------|
| G1 | Keep numeric | Continuous | Early academic level; component of overall achievement | Standard |
| G2 | Keep numeric | Continuous | Mid-year academic level; strong contributor to overall achievement | Standard |
| Prior Grades Mean | (G1 + G2) / 2 | Continuous composite | Could capture prior academic level before final evaluation as an alternative baseline | Considered |
| Early Progress Index | G2 − G1 | Continuous | Captures early learning dynamics (improvement or decline) | Sensitivity scenario |

## 2.4-Data Partitioning for Modeling

In [6]:
from sklearn.model_selection import train_test_split

#Define the target variable
target_col = 'G3'

#Portuguese dataset
#Split into train and test sets (80/20) with a shuffled random split (not stratified on the target)
train_port, test_port = train_test_split(
    df_port, 
    test_size = 0.2, 
    random_state = 42, 
    shuffle = True
)

#Quick check of sizes
print(f"Portuguese dataset: train={train_port.shape}, test={test_port.shape}")

#Mathematics dataset
train_math, test_math = train_test_split(
    df_math,
    test_size = 0.2,
    random_state = 42,
    shuffle = True
)

print(f"Math dataset: train={train_math.shape}, test={test_math.shape}")

Portuguese dataset: train=(507, 33), test=(127, 33)
Math dataset: train=(285, 33), test=(72, 33)
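The split above is a plain shuffled split. If preserving the distribution of G3 across train and test is desired, one option is to stratify on grade bands. The sketch below does this with pandas only, on invented data; the band edges and the `grade_band` name are assumptions. With scikit-learn, passing a binned target to the `stratify` argument of `train_test_split` is a comparable approach.

```python
import numpy as np
import pandas as pd

#Toy target (assumed data): each grade from 1 to 20 appears five times
toy = pd.DataFrame({"G3": np.tile(np.arange(1, 21), 5)})

#Bin the continuous target into four bands to stratify on
grade_band = pd.cut(toy["G3"], bins=[0, 5, 10, 15, 20])

#Draw 20% of the rows from every band, then use the remainder as the training set
test = toy.groupby(grade_band, observed=True).sample(frac=0.2, random_state=42)
train = toy.drop(test.index)

#Each band should contribute the same share to the test set
band_counts = grade_band.loc[test.index].value_counts()
print(band_counts)
```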


## 2.5-Baseline Feature Encoding

In [7]:
#Define binary_mappings for binary encoding
binary_mappings = {
    'sex': {'F': 1, 'M': 0},
    'school': {'GP': 1, 'MS': 0},
    'address': {'U': 1, 'R': 0},
    'famsize': {'GT3': 1, 'LE3': 0},
    'Pstatus': {'T': 1, 'A': 0},
    'schoolsup': {'yes': 1, 'no': 0},
    'famsup': {'yes': 1, 'no': 0},
    'paid': {'yes': 1, 'no': 0},
    'nursery': {'yes': 1, 'no': 0},
    'higher': {'yes': 1, 'no': 0},
    'activities': {'yes': 1, 'no': 0},
    'romantic': {'yes': 1, 'no': 0},
    'internet': {'yes': 1, 'no': 0}
}

from data_preprocessing.encoding import binary_encode_columns

#Apply binary encoding on train sets first
train_port, train_math = binary_encode_columns(
    dfs=[train_port, train_math],
    binary_mappings=binary_mappings,
    strict=True
)

#Apply the same encoding to the test sets
test_port, test_math = binary_encode_columns(
    dfs=[test_port, test_math],
    binary_mappings=binary_mappings,
    strict=True
)

Binary encoding successfully applied to 13 columns on 2 dataset(s).
Binary encoding successfully applied to 13 columns on 2 dataset(s).
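The `binary_encode_columns` helper is part of jfj-utils and is not reproduced here. As a hedged sketch of what a strict binary mapping could look like, the hypothetical function `binary_encode` below maps each column with `Series.map` and raises an error when a value has no entry in the mapping. Treating `strict=True` this way is an assumption about the helper's behavior, not a description of it.

```python
import pandas as pd

def binary_encode(df, mappings, strict=True):
    """Map categorical values to 0/1; optionally fail on unmapped values (illustrative helper)."""
    out = df.copy()
    for col, mapping in mappings.items():
        unknown = set(out[col].dropna().unique()) - set(mapping)
        if strict and unknown:
            raise ValueError(f"Unmapped values in '{col}': {sorted(unknown)}")
        out[col] = out[col].map(mapping)
    return out

#Toy data (assumed) using two of the mappings defined in this notebook
toy = pd.DataFrame({"sex": ["F", "M", "F"], "higher": ["yes", "no", "yes"]})
encoded = binary_encode(
    toy,
    {"sex": {"F": 1, "M": 0}, "higher": {"yes": 1, "no": 0}},
)
print(encoded)
```

Failing loudly on unexpected categories avoids silently introducing NaN values, which is what `Series.map` would otherwise produce for unmapped entries.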


In [8]:
#Columns to one-hot encode (categorical features with multiple categories)
parent_job_cols = ['Mjob', 'Fjob', 'guardian', 'reason']

from data_preprocessing.encoding import one_hot_encode_columns

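#Apply one-hot encoding on train sets
train_port, train_math = one_hot_encode_columns(
    dfs=[train_port, train_math],
    categorical_cols=parent_job_cols,
    drop_first=False 
)

#Apply the same one-hot encoding to the test sets; without this step, align_columns
#below would fill every dummy column of the test sets with 0
test_port, test_math = one_hot_encode_columns(
    dfs=[test_port, test_math],
    categorical_cols=parent_job_cols,
    drop_first=False
)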

#After one-hot encoding, the train and test sets may have different columns because some categories
#may only appear in train or test. We must ensure the test set has the same columns as train set
#to avoid errors during scaling or model training.
def align_columns(train_df, test_df):
    #Add missing columns in test set with 0 (the category did not appear in test)
    for col in train_df.columns:
        if col not in test_df.columns:
            test_df[col] = 0
    #Keep only the columns that exist in train set (remove extra unseen categories in test)
    test_df = test_df[train_df.columns]
    return test_df

#This guarantees that both train and test datasets have identical column names and order.
#Without this step, applying normalization or feeding data to models would raise errors.
test_port = align_columns(train_port, test_port)
test_math = align_columns(train_math, test_math)

One-hot encoding successfully applied to 4 columns on 2 dataset(s).
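The alignment step can also be written with `DataFrame.reindex`, which adds missing columns and drops extra ones in a single call without modifying the test frame in place. The sketch below uses `pd.get_dummies` on invented data as a stand-in for the jfj-utils encoder, so column names produced by the real helper may differ.

```python
import pandas as pd

#Toy data (assumed): 'services' appears only in test, 'teacher' and 'health' only in train
train = pd.DataFrame({"Mjob": ["teacher", "health", "other"], "age": [15, 16, 17]})
test = pd.DataFrame({"Mjob": ["other", "services"], "age": [16, 18]})

#Encode both sets independently
train_enc = pd.get_dummies(train, columns=["Mjob"], dtype=int)
test_enc = pd.get_dummies(test, columns=["Mjob"], dtype=int)

#Match the train layout: unseen-in-test categories become 0, unseen-in-train ones are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```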


## 2.6-Sensitivity Features - Composite and Alternative Variables

In [9]:
import numpy as np

#Parental Education Level (mean)
train_port['Parental_Edu_Level'] = train_port[['Medu','Fedu']].mean(axis=1)
train_math['Parental_Edu_Level'] = train_math[['Medu','Fedu']].mean(axis=1)

#Test sets (use same columns)
test_port['Parental_Edu_Level'] = test_port[['Medu','Fedu']].mean(axis=1)
test_math['Parental_Edu_Level'] = test_math[['Medu','Fedu']].mean(axis=1)

In [10]:
from data_preprocessing.encoding import label_encode_columns

#Parental Education Level (binned)
edu_bins = [-0.1, 1, 2, 4]          
edu_labels = ['Low', 'Medium', 'High']

for df in [train_port, train_math]:
    df['Parental_Edu_Bin'] = pd.cut(
        df['Parental_Edu_Level'],
        bins=edu_bins,
        labels=edu_labels,
        include_lowest=True
    )

for df in [test_port, test_math]:
    df['Parental_Edu_Bin'] = pd.cut(
        df['Parental_Edu_Level'],
        bins=edu_bins,
        labels=edu_labels,
        include_lowest=True
    )

#Failures (binned)
fail_bins = [-0.1, 0, 1, np.inf]    
fail_labels = ['0', '1', '2']       

for df in [train_port, train_math]:
    df['Failures_Bin'] = pd.cut(
        df['failures'],
        bins=fail_bins,
        labels=fail_labels,
        include_lowest=True
    )

for df in [test_port, test_math]:
    df['Failures_Bin'] = pd.cut(
        df['failures'],
        bins=fail_bins,
        labels=fail_labels,
        include_lowest=True
    )

#Label encoding
bins_cols = ["Parental_Edu_Bin", "Failures_Bin"]
train_port, train_math = label_encode_columns([train_port, train_math], bins_cols)
test_port, test_math = label_encode_columns([test_port, test_math], bins_cols)


Label encoding successfully applied to 2 columns on 2 dataset(s).
Label encoding successfully applied to 2 columns on 2 dataset(s).
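Since `Parental_Edu_Level` is a mean, it can take half-point values such as 1.5 or 2.5. The sketch below applies the same bin edges to a few example values to show where they fall, and uses the ordered category codes as one possible integer encoding. The `label_encode_columns` helper may encode differently, so the codes here are an assumption consistent with the sample output shown further down.

```python
import pandas as pd

#Example parental education means, including half-point values
levels = pd.Series([0.0, 1.0, 1.5, 2.0, 2.5, 4.0])

#Right-closed intervals: (-0.1, 1] -> Low, (1, 2] -> Medium, (2, 4] -> High
binned = pd.cut(
    levels,
    bins=[-0.1, 1, 2, 4],
    labels=["Low", "Medium", "High"],
    include_lowest=True,
)

#Ordered category codes keep the Low < Medium < High ordering
codes = binned.cat.codes
print(pd.DataFrame({"level": levels, "bin": binned, "code": codes}))
```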


In [11]:
#Alcohol Consumption Index
train_port['Alcohol_Index'] = train_port['Dalc'] + train_port['Walc']
train_math['Alcohol_Index'] = train_math['Dalc'] + train_math['Walc']

test_port['Alcohol_Index'] = test_port['Dalc'] + test_port['Walc']
test_math['Alcohol_Index'] = test_math['Dalc'] + test_math['Walc']

In [12]:
#Prior Grades Mean
train_port['Prior_Grades_Mean'] = (train_port['G1'] + train_port['G2']) / 2
train_math['Prior_Grades_Mean'] = (train_math['G1'] + train_math['G2']) / 2

test_port['Prior_Grades_Mean'] = (test_port['G1'] + test_port['G2']) / 2
test_math['Prior_Grades_Mean'] = (test_math['G1'] + test_math['G2']) / 2

In [13]:
#Early Progress Index
train_port['Early_Progress_Index'] = train_port['G2'] - train_port['G1']
train_math['Early_Progress_Index'] = train_math['G2'] - train_math['G1']

test_port['Early_Progress_Index'] = test_port['G2'] - test_port['G1']
test_math['Early_Progress_Index'] = test_math['G2'] - test_math['G1']


In [14]:
#Quick check on train and test sets
sensitivity_cols = [
    'Parental_Edu_Level', 'Parental_Edu_Bin', 'Failures_Bin',
    'Alcohol_Index', 'Prior_Grades_Mean', 'Early_Progress_Index'
]

print(train_port[sensitivity_cols].head(5))
print(train_math[sensitivity_cols].head(5))
print(test_port[sensitivity_cols].head(5))
print(test_math[sensitivity_cols].head(5))

     Parental_Edu_Level  Parental_Edu_Bin  Failures_Bin  Alcohol_Index  \
489                 1.0                 0             1              7   
422                 2.5                 2             0              3   
104                 3.5                 2             0              2   
114                 1.5                 1             0              2   
350                 1.5                 1             1              2   

     Prior_Grades_Mean  Early_Progress_Index  
489                8.5                     1  
422               11.5                     3  
104               16.0                     0  
114                9.5                    -1  
350                9.5                     1  
     Parental_Edu_Level  Parental_Edu_Bin  Failures_Bin  Alcohol_Index  \
308                 3.0                 2             1              3   
368                 2.5                 2             0              3   
315                 2.5                 2          

## 2.7-Feature Normalization

In [15]:
from sklearn.preprocessing import StandardScaler

#Datasets dictionary
datasets = {
    "portuguese": {"train": train_port, "test": test_port},
    "math": {"train": train_math, "test": test_math}
}

#Loop over datasets
normalized_datasets = {}

for name, dfs in datasets.items():
    train_df = dfs["train"]
    test_df = dfs["test"]
    
    #Identify numeric columns (including MCA components)
    numeric_cols = train_df.select_dtypes(include=[np.number]).columns.tolist()
    
    #Remove target G3 from scaling to keep it on its original 0-20 scale
    if 'G3' in numeric_cols:
        numeric_cols.remove('G3')
    
    #Remove constant columns (std = 0) to avoid division by zero errors
    numeric_cols = [c for c in numeric_cols if train_df[c].std() > 1e-10]
    
    non_numeric_cols = train_df.columns.difference(numeric_cols)
    
    scaler = StandardScaler()
    
    #Scale train numeric columns
    train_norm_num = pd.DataFrame(
        scaler.fit_transform(train_df[numeric_cols]),
        columns=numeric_cols,
        index=train_df.index
    )
    
    #Scale test numeric columns using the fitted scaler
    test_norm_num = pd.DataFrame(
        scaler.transform(test_df[numeric_cols]),
        columns=numeric_cols,
        index=test_df.index
    )
    
    #Reconstruct datasets by merging scaled and non-scaled columns
    normalized_datasets[name] = {
        "train": pd.concat([train_norm_num, train_df[non_numeric_cols]], axis=1),
        "test": pd.concat([test_norm_num, test_df[non_numeric_cols]], axis=1),
        "scaler": scaler,
        "numeric_cols": numeric_cols
    }
    
    #Print inside the loop to see progress for each dataset
    print(f"Normalization complete for {name}. No warnings.")

#Update the original variables with normalized data
train_port = normalized_datasets["portuguese"]["train"]
test_port = normalized_datasets["portuguese"]["test"]
train_math = normalized_datasets["math"]["train"]
test_math = normalized_datasets["math"]["test"]



To accommodate a wide range of potential architectures, all numeric and ordinal columns were normalized (except our target).
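The key point of the scaling cell is that the mean and standard deviation are estimated on the training set only and then reused for the test set. The sketch below reproduces that logic by hand with NumPy on invented numbers; `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

#Toy feature matrices (assumed data): two features, four train rows, two test rows
train = np.array([[15.0, 2.0], [16.0, 4.0], [17.0, 6.0], [18.0, 8.0]])
test = np.array([[16.5, 5.0], [22.0, 0.0]])

#Statistics come from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  #ddof=0, matching StandardScaler

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The scaled test set is generally not centered at zero, which is expected: it is expressed in units of the training distribution.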

In [16]:
#Check dataset quality (train Portuguese example)
from eda.premodeling_check import premodeling_regression_check
print(premodeling_regression_check(train_port))

#Missing values
No missing values detected.

#Feature types
No non-numeric columns detected.

#Feature variance
No constant columns detected.

#Outliers
##Original continuous features
- Total outliers detected: 30
- Top contributing features:
  1. G2: 14
  2. G1: 10
  3. Early_Progress_Index: 5
  4. G3: 1
##PCA/MCA-derived features
No significant PCA/MCA-related outliers detected.

#High correlations (|r| ≥ 0.70)
- Strongest correlations:
  1. failures ↔ Failures_Bin: 0.980
  2. G1 ↔ Prior_Grades_Mean: 0.971
  3. G2 ↔ Prior_Grades_Mean: 0.969
  4. Walc ↔ Alcohol_Index: 0.935
  5. G2 ↔ G3: 0.931
  6. Prior_Grades_Mean ↔ G3: 0.926
  7. Medu ↔ Parental_Edu_Level: 0.918
  8. Fedu ↔ Parental_Edu_Level: 0.909
  9. Parental_Edu_Level ↔ Parental_Edu_Bin: 0.899
  10. G1 ↔ G2: 0.880
- 6 additional correlated pairs above threshold

#Target validation
No target specified.

#Dataset size
No size-related risks detected.

#Final assessment
Dataset is usable for regression, but issues above should be 

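The diagnostic report finds no missing values, no non-numeric columns, and no constant features. The flagged correlations and outliers are expected consequences of feature engineering rather than data flaws: most of the strongest pairs link a composite or binned variable to its own source columns (e.g., failures ↔ Failures_Bin, G1 ↔ Prior_Grades_Mean), and the remaining outliers are concentrated in the grade columns.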

These findings justify our sensitivity analysis strategy: by testing different feature subsets, we can leverage this information while neutralizing multicollinearity. The dataset is fully validated and ready for robust predictive modeling.
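As an illustration of how such correlated pairs can be listed before choosing feature subsets, the sketch below scans the upper triangle of an absolute correlation matrix for pairs at or above 0.70. The data are invented, and this is not the implementation of `premodeling_regression_check`.

```python
import pandas as pd

#Toy grades (assumed data) plus the composite derived from them
df = pd.DataFrame({
    "G1": [8, 10, 12, 14, 9, 15],
    "G2": [9, 11, 12, 15, 8, 16],
})
df["Prior_Grades_Mean"] = (df["G1"] + df["G2"]) / 2

threshold = 0.70
corr = df.corr().abs()
cols = list(corr.columns)

#Visit each unordered pair once (upper triangle, diagonal excluded)
pairs = []
for i in range(len(cols)):
    for j in range(i + 1, len(cols)):
        r = float(corr.iloc[i, j])
        if r >= threshold:
            pairs.append((cols[i], cols[j], round(r, 3)))

for a, b, r in sorted(pairs, key=lambda p: -p[2]):
    print(f"{a} <-> {b}: {r}")
```

A composite and its source columns should normally not enter the same model; each sensitivity scenario would use one or the other.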

## 2.9-Summary - Notebook 2


| Category | Main Transformation | Notes / Insights |
| :--- | :--- | :--- |
| Outlier Management | Log-transformation of Absences | Raw absences showed strong positive skew. Applying log(1+x) compressed extreme values while preserving their rank order, so no rows had to be deleted; the IQR check reports 0 outliers after the transformation. |
| Data Partitioning | 80/20 Train-Test Split | Shuffled random split (random_state=42 for reproducibility; not stratified on G3), performed before encoding, composite features, and scaling so that the scaler is fit on training data only. |
| Categorical Encoding | Binary & One-Hot Encoding | 13 binary features (sex, school, etc.) mapped to 0/1. Multi-category features (Mjob, Fjob, guardian, reason) expanded via One-Hot. align_columns logic ensures consistency between Train and Test. |
| Parental Education | Composite Index & Binning | Created Parental_Edu_Level (mean) and Parental_Edu_Bin. Reduces Medu/Fedu redundancy while capturing the household's total educational capital. |
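| Academic Momentum | Early Progress & Prior Mean | Engineered Early_Progress_Index (G2-G1) and Prior_Grades_Mean. The first captures the direction and size of early improvement or decline; the second summarizes prior academic level and correlates strongly with G1 and G2 (r ≈ 0.97), so it is treated as an alternative to them. |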
| Behavioral Indexes | Alcohol Index & Failures Bin | Combined Dalc and Walc into a single Alcohol_Index to reduce noise. Binned failures into 0, 1, and 2+ to capture non-linear impacts on performance. |
| Feature Scaling | Standardization (Z-score) | All numeric and ordinal features (except the target G3) standardized with StandardScaler, fit on Train only to prevent data leakage; this benefits scale-sensitive models such as neural networks. |

## 2.10-Data Export

In [17]:
import pickle

# Reuse the variable lists defined earlier to rebuild the list of base predictors
# (G1, G2, and G3 are excluded; raw 'absences' was replaced by 'absences_log' in section 2.2)
base_features = (
    [c for c in continuous_cols if c not in ['G1', 'G2', 'G3', 'absences']] + 
    discrete_cols + 
    categorical_cols + 
    ['absences_log']
)

# Output filename
export_filename = "student_performance_sensitivity_bundle.pkl"

# Bundle dictionary
modeling_bundle = {
    "data": {
        "portuguese": {
            "train": normalized_datasets["portuguese"]["train"],
            "test": normalized_datasets["portuguese"]["test"]
        },
        "math": {
            "train": normalized_datasets["math"]["train"],
            "test": normalized_datasets["math"]["test"]
        },
        "dropouts": df_all_dropouts
    },
    "metadata": {
        "features_base": base_features,
        "features_sensitivity": sensitivity_cols, # Engineered indices: Parental_Edu_Level, Alcohol_Index, etc.
        "target": "G3"
    }
}

# Save the bundle
with open(export_filename, "wb") as f:
    pickle.dump(modeling_bundle, f, protocol=pickle.HIGHEST_PROTOCOL)

print(f"--- Export Successful: {export_filename} ---")
print(f"Metadata: 'features_base' and 'features_sensitivity' lists included.")

--- Export Successful: student_performance_sensitivity_bundle.pkl ---
Metadata: 'features_base' and 'features_sensitivity' lists included.
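A later notebook would read the bundle back with `pickle.load`. The sketch below shows the round trip on a small stand-in dictionary written to a temporary directory; the stand-in contents and the temporary path are assumptions, while the key layout mirrors `modeling_bundle` above.

```python
import os
import pickle
import tempfile

#Small stand-in with the same key layout as modeling_bundle (contents are placeholders)
bundle = {
    "data": {"portuguese": {"train": [1, 2], "test": [3]}},
    "metadata": {
        "features_sensitivity": ["Alcohol_Index", "Prior_Grades_Mean"],
        "target": "G3",
    },
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bundle.pkl")

    #Write, then read back in binary mode
    with open(path, "wb") as f:
        pickle.dump(bundle, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

target = loaded["metadata"]["target"]
print(target, loaded["metadata"]["features_sensitivity"])
```

Pickle files should only be loaded from trusted sources, and DataFrames stored this way are easiest to reload with the same pandas version that wrote them.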
