# 🧠 Feature Engineering Notebook

**Notebook Objective:**

In this notebook, we transform the preprocessed student data to prepare it for modeling. Nominal features are one-hot encoded, numerical features are standardized, and binary and ordinal features are passed through unchanged. The final output is a clean, model-ready dataset saved for future use.

In [6]:
# Load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [7]:
# Load preprocessed dataset
processed_data_path = "../data/processed_student_data.csv"
df = pd.read_csv(processed_data_path)

print(f"✅ Data loaded successfully! Shape: {df.shape}")
df.head()

✅ Data loaded successfully! Shape: (2392, 15)


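|   | studentid | age | gender | ethnicity | parentaleducation | studytimeweekly | absences | tutoring | parentalsupport | extracurricular | sports | music | volunteering | gpa | gradeclass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 17 | 1 | 0 | 2 | 19.833723 | 7 | 1 | 2 | 0 | 0 | 1 | 0 | 2.929196 | 2.0 |
| 1 | 1002 | 18 | 0 | 0 | 1 | 15.408756 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3.042915 | 1.0 |
| 2 | 1003 | 15 | 0 | 2 | 3 | 4.21057 | 26 | 0 | 2 | 0 | 0 | 0 | 0 | 0.112602 | 4.0 |
| 3 | 1004 | 17 | 1 | 0 | 3 | 10.028829 | 14 | 0 | 3 | 1 | 0 | 0 | 0 | 2.054218 | 3.0 |
| 4 | 1005 | 17 | 1 | 0 | 2 | 4.672495 | 17 | 1 | 3 | 0 | 0 | 0 | 0 | 1.288061 | 4.0 |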


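### Dealing with Categorical Variables

- Binary Categorical Variables:
  - `Gender, Tutoring, Extracurricular, Sports, Music, Volunteering`
    - These are already binary (0 or 1). No further encoding is needed.
- Ordinal Variables:
  - `ParentalSupport, ParentalEducation`:
    - Both have an inherent order (`ParentalSupport` runs from 0 to 4), so we keep their integer codes as they are. The codes preserve the ranking, although models that use them directly treat the gaps between levels as equal.
- Nominal Variables:
  - `Ethnicity`:
    - This has no inherent order, so we'll use One-Hot Encoding to avoid assigning an artificial ordinal relationship.

### Scaling

- Numerical Variables:
  - `StudyTimeWeekly, Absences, GPA, Age`:
    - These are standardized with `StandardScaler` (zero mean, unit variance) so that features measured on different scales are comparable.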

In [9]:
df.columns

Index(['studentid', 'age', 'gender', 'ethnicity', 'parentaleducation',
       'studytimeweekly', 'absences', 'tutoring', 'parentalsupport',
       'extracurricular', 'sports', 'music', 'volunteering', 'gpa',
       'gradeclass'],
      dtype='object')

In [17]:
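# Feature groups (column names are lowercase in the preprocessed data)
binary_features = ['gender', 'tutoring', 'extracurricular', 'sports', 'music', 'volunteering']
ordinal_features = ['parentalsupport', 'parentaleducation']  # passed through unchanged
nominal_features = ['ethnicity']  # one-hot encoded
# NOTE: 'gpa' is also listed in target_cols below, so it ends up in the final
# dataset twice: once as a scaled feature and once as a raw target.
numerical_features = ['studytimeweekly', 'absences', 'gpa', 'age']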

In [18]:
features = binary_features + ordinal_features + nominal_features + numerical_features
target_cols = ['gpa', 'gradeclass']  # Keep all useful targets

X = df[features]
y = df[target_cols]
print("✅ Features and targets separated!")
print(f"X shape: {X.shape}, y shape: {y.shape}")

✅ Features and targets separated!
X shape: (2392, 13), y shape: (2392, 2)
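Because `gpa` appears in both `numerical_features` and `target_cols`, a quick overlap check can make that visible before modeling. The sketch below copies the column lists from the cells above. The `features_without_targets` list is a hypothetical alternative, not something this notebook applies; whether to exclude `gpa` from the features depends on which target is being modeled.

```python
# Column groups copied from the cells above
binary_features = ['gender', 'tutoring', 'extracurricular', 'sports', 'music', 'volunteering']
ordinal_features = ['parentalsupport', 'parentaleducation']
nominal_features = ['ethnicity']
numerical_features = ['studytimeweekly', 'absences', 'gpa', 'age']

features = binary_features + ordinal_features + nominal_features + numerical_features
target_cols = ['gpa', 'gradeclass']

# Columns that are used as both a feature and a target
overlap = sorted(set(features) & set(target_cols))
print(f"Used as feature and target: {overlap}")

# Hypothetical alternative: keep only features that are not targets
features_without_targets = [col for col in features if col not in target_cols]
print(f"Features without targets: {len(features_without_targets)}")
```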


In [19]:
# Apply Transformations

preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(drop='first'), nominal_features),
        ('scale', StandardScaler(), numerical_features)
    ],
    remainder='passthrough'
)

# Fit and transform full X
X_transformed = preprocessor.fit_transform(X)

# Convert to DataFrame with correct column names.
# ColumnTransformer outputs columns in transformer order ('onehot', then 'scale'),
# followed by the passthrough columns in the order they appear in X.
encoded_columns = preprocessor.named_transformers_['onehot'].get_feature_names_out(nominal_features)
scaled_columns = numerical_features
remaining_columns = binary_features + ordinal_features

all_columns = list(encoded_columns) + scaled_columns + remaining_columns
X_transformed_df = pd.DataFrame(X_transformed, columns=all_columns)

In [22]:
# Combine with target columns
final_df = pd.concat([X_transformed_df, y.reset_index(drop=True)], axis=1)
# Check the final DataFrame
final_df.head()

|   | ethnicity_1 | ethnicity_2 | ethnicity_3 | studytimeweekly | absences | gpa | age | gender | tutoring | extracurricular | sports | music | volunteering | parentalsupport | parentaleducation | gpa.1 | gradeclass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.780336 | -0.890822 | 1.118086 | 0.472919 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 2.0 | 2.929196 | 2.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.997376 | -1.717694 | 1.242374 | 1.362944 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.042915 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | -0.984045 | 1.353542 | -1.960277 | -1.307132 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 | 0.112602 | 4.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.045445 | -0.063951 | 0.16179 | 0.472919 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 2.054218 | 3.0 |
| 4 | 0.0 | 0.0 | 0.0 | -0.902311 | 0.290422 | -0.675573 | 0.472919 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 1.288061 | 4.0 |
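The scaled columns in the table above are z-scores: each value has the column mean subtracted and is divided by the column standard deviation. The sketch below reproduces that computation by hand on the first five `absences` values from the data preview. Because it uses only five rows, the results will differ from the notebook's values, which are computed over all 2392 rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five absences values from the preview (a tiny sample, for illustration only)
absences = np.array([[7.0], [0.0], [26.0], [14.0], [17.0]])

scaled = StandardScaler().fit_transform(absences)

# Same computation by hand. StandardScaler uses the population
# standard deviation (ddof=0), unlike pandas' default .std() (ddof=1).
manual = (absences - absences.mean()) / absences.std(ddof=0)

print(np.allclose(scaled, manual))
```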


In [23]:
# Save the feature-engineered dataset

feature_engineered_data_path = "../data/feature_engineered_student_data.csv"
final_df.to_csv(feature_engineered_data_path, index=False)

print(f"✅ Feature-engineered data saved: {feature_engineered_data_path}")


✅ Feature-engineered data saved: ../data/feature_engineered_student_data.csv
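One thing to keep in mind when reloading this file: the saved frame contains two columns named `gpa` (the scaled feature and the raw target), and `pd.read_csv` renames the second one to `gpa.1` by default. The sketch below shows that round trip in memory, using the first two rows of GPA values from the output above and an `io.StringIO` buffer instead of a file on disk.

```python
import io
import pandas as pd

# Scaled gpa (feature) and raw gpa (target) for the first two rows shown above
features = pd.DataFrame({"gpa": [1.118086, 1.242374]})
targets = pd.DataFrame({"gpa": [2.929196, 3.042915], "gradeclass": [2.0, 1.0]})

# Same pattern as final_df: concatenate features and targets side by side
combined = pd.concat([features, targets], axis=1)

# Round-trip through CSV using an in-memory buffer
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(list(combined.columns))  # duplicate names before saving
print(list(reloaded.columns))  # second duplicate renamed on load
```

Downstream code that selects the target by name should therefore check which of the two columns it is actually reading.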
