# 🧠 Feature Engineering Notebook

**Notebook Objective:** 

In this notebook, we engineer new features and transform existing ones to prepare the data for modeling. This includes one-hot encoding the nominal feature, standardizing the numerical features, and passing the binary and ordinal features through unchanged. The final output is a clean, model-ready dataset saved for future use.

In [13]:
# Load packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
import joblib

In [14]:
# Load preprocessed dataset
processed_data_path = "../data/processed/processed_student_data.csv"
df = pd.read_csv(processed_data_path)

print(f"✅ Data loaded successfully! Shape: {df.shape}")
df.head()

✅ Data loaded successfully! Shape: (2392, 15)


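| | studentid | age | gender | ethnicity | parentaleducation | studytimeweekly | absences | tutoring | parentalsupport | extracurricular | sports | music | volunteering | gpa | gradeclass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 17 | 1 | 0 | 2 | 19.833723 | 7 | 1 | 2 | 0 | 0 | 1 | 0 | 2.929196 | 2.0 |
| 1 | 1002 | 18 | 0 | 0 | 1 | 15.408756 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3.042915 | 1.0 |
| 2 | 1003 | 15 | 0 | 2 | 3 | 4.21057 | 26 | 0 | 2 | 0 | 0 | 0 | 0 | 0.112602 | 4.0 |
| 3 | 1004 | 17 | 1 | 0 | 3 | 10.028829 | 14 | 0 | 3 | 1 | 0 | 0 | 0 | 2.054218 | 3.0 |
| 4 | 1005 | 17 | 1 | 0 | 2 | 4.672495 | 17 | 1 | 3 | 0 | 0 | 0 | 0 | 1.288061 | 4.0 |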


### Dealing with Categorical Variables

- Binary Categorical Variables:
  - `Gender, Tutoring, Extracurricular, Sports, Music, Volunteering`
    - These are already binary (0 or 1). No further encoding needed.
- Ordinal Variables:
  - `ParentalSupport, ParentalEducation`:
    - These have an inherent order, so we keep their integer codes as is (for example, `ParentalSupport` runs from 0 to 4). Keeping the codes preserves the ranking between levels.
- Nominal Variables:
  - `Ethnicity`:
    - This has no inherent order, so we use One-Hot Encoding to avoid implying an artificial ordinal relationship.

### Scaling

- Numerical Variables:
  - `StudyTimeWeekly, Absences, Age`:
    - These are standardized with `StandardScaler` (zero mean, unit variance) so that they share a comparable scale.

In [15]:
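# Step 1: Define custom wrapper
class ColumnNameTransformer(BaseEstimator, TransformerMixin):
    """Wrap a transformer so that transform() returns a labeled DataFrame."""

    def __init__(self, transformer, feature_names):
        self.transformer = transformer      # the wrapped scikit-learn transformer
        self.feature_names = feature_names  # output column names, in output order

    def fit(self, X, y=None):
        # Delegate fitting to the wrapped transformer
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        # X must be a DataFrame, because its index is reused for the output
        X_t = self.transformer.transform(X)
        return pd.DataFrame(X_t, columns=self.feature_names, index=X.index)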
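To show what the wrapper buys us, here is a small usage sketch. It repeats the class from the cell above so that it runs on its own, and applies it to a hypothetical one-column frame with a non-default index; the toy data and variable names are assumptions for illustration only.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler


class ColumnNameTransformer(BaseEstimator, TransformerMixin):
    """Same wrapper as in the cell above, repeated so this sketch runs alone."""

    def __init__(self, transformer, feature_names):
        self.transformer = transformer
        self.feature_names = feature_names

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        X_t = self.transformer.transform(X)
        return pd.DataFrame(X_t, columns=self.feature_names, index=X.index)


# Hypothetical toy input with string index labels
toy = pd.DataFrame({"absences": [0, 10, 20]}, index=["a", "b", "c"])

wrapped = ColumnNameTransformer(StandardScaler(), feature_names=["absences"])
toy_scaled = wrapped.fit_transform(toy)

# The result is a DataFrame that keeps the input index and the given column name
print(toy_scaled)
```

Recent scikit-learn releases (1.2 and later) also offer `set_output(transform="pandas")` on transformers, which may make a wrapper like this unnecessary depending on the installed version.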

In [16]:
# Step 2: Define Features
binary_features = ['gender', 'tutoring', 'extracurricular', 'sports', 'music', 'volunteering']
ordinal_features = ['parentalsupport', 'parentaleducation']
nominal_features = ['ethnicity']
numerical_features = ['studytimeweekly', 'absences', 'age']

In [17]:
features = binary_features + ordinal_features + nominal_features + numerical_features
target_cols = ['gpa', 'gradeclass']  # Keep all useful targets

X = df[features]
y = df[target_cols]
print("✅ Features and targets separated!")
print(f"X shape: {X.shape}, y shape: {y.shape}")

✅ Features and targets separated!
X shape: (2392, 12), y shape: (2392, 2)


In [18]:
# Step 3: Create base preprocessor
base_preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(drop='first'), nominal_features),
        ('scale', StandardScaler(), numerical_features)
    ],
    remainder='passthrough'
)
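The `scale` step in the preprocessor above standardizes each numerical column as z = (x - mean) / std. As a minimal sketch with made-up values (the array and variable names are hypothetical), the manual formula can be compared against `StandardScaler`, which uses the population standard deviation (`ddof=0`):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical weekly study hours for three students
hours = np.array([[4.0], [10.0], [16.0]])

# Manual z-score; np.std defaults to ddof=0, matching StandardScaler
manual_z = (hours - hours.mean(axis=0)) / hours.std(axis=0)

# The same transformation through scikit-learn
sklearn_z = StandardScaler().fit_transform(hours)

scaling_matches = np.allclose(manual_z, sklearn_z)
print(scaling_matches)
```

After scaling, each column has mean 0 and unit variance on the data it was fitted on, which is why the transformed `studytimeweekly`, `absences`, and `age` values are centered around zero.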

In [19]:
# Step 4: Fit base preprocessor and get column names
base_preprocessor.fit(X)
# Output order of the ColumnTransformer: one-hot columns, then scaled columns,
# then the passthrough columns in the order they appear in X
encoded_columns = base_preprocessor.named_transformers_['onehot'].get_feature_names_out(nominal_features)
scaled_columns = numerical_features
remaining_columns = binary_features + ordinal_features
all_columns = list(encoded_columns) + scaled_columns + remaining_columns
print("✅ Base preprocessor fitted!")

✅ Base preprocessor fitted!


In [20]:
# Step 5: Wrap base preprocessor to retain column names
wrapped_preprocessor = ColumnNameTransformer(base_preprocessor, feature_names=all_columns)

### Save Preprocessor Pipeline

In [21]:
# Step 6: Save full pipeline (wrapped)
pipeline = Pipeline(steps=[('preprocessor', wrapped_preprocessor)])
pipeline.fit(X)  # Fit via the Pipeline itself so the saved object is fitted as a whole
joblib.dump(pipeline, "../models/preprocessor_pipeline.pkl")
print("✅ Preprocessor pipeline saved successfully!")

✅ Preprocessor pipeline saved successfully!
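For later reuse, the saved pipeline can be restored with `joblib.load`. The following is a minimal round-trip sketch using a hypothetical toy pipeline and a temporary directory rather than the notebook's actual file; all names in it are assumptions.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy pipeline, fitted before saving
X_toy = np.array([[1.0], [2.0], [3.0]])
toy_pipeline = Pipeline(steps=[("scale", StandardScaler())]).fit(X_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_pipeline.pkl"
    joblib.dump(toy_pipeline, path)  # serialize the fitted pipeline
    reloaded = joblib.load(path)     # restore it, as a later notebook would

# The reloaded pipeline reproduces the original transformation
roundtrip_ok = np.allclose(toy_pipeline.transform(X_toy), reloaded.transform(X_toy))
print(roundtrip_ok)
```

Because the notebook's pipeline contains the custom `ColumnNameTransformer`, that class definition needs to be available in whatever session loads the pickle; otherwise unpickling is likely to fail.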


In [22]:
# Step 7: Transform X and save feature-engineered dataset
X_transformed_df = pipeline.transform(X)

# Combine with target columns
final_df = pd.concat([X_transformed_df, y.reset_index(drop=True)], axis=1)
# Check the final DataFrame
final_df.head()

| | ethnicity_1 | ethnicity_2 | ethnicity_3 | studytimeweekly | absences | age | gender | tutoring | extracurricular | sports | music | volunteering | parentalsupport | parentaleducation | gpa | gradeclass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.780336 | -0.890822 | 0.472919 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 2.0 | 2.929196 | 2.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.997376 | -1.717694 | 1.362944 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 3.042915 | 1.0 |
| 2 | 0.0 | 1.0 | 0.0 | -0.984045 | 1.353542 | -1.307132 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 | 0.112602 | 4.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.045445 | -0.063951 | 0.472919 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 2.054218 | 3.0 |
| 4 | 0.0 | 0.0 | 0.0 | -0.902311 | 0.290422 | 0.472919 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 1.288061 | 4.0 |
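The `pd.concat(..., axis=1)` call in Step 7 pairs rows by index label, not by position, so it only works as intended when both frames share the same index. A minimal sketch with hypothetical toy frames shows what happens when the labels differ and how resetting both indexes avoids it:

```python
import pandas as pd

# Hypothetical toy frames: same number of rows, different index labels
left = pd.DataFrame({"x": [1, 2]}, index=[10, 11])
right = pd.DataFrame({"y": [3, 4]})  # default index 0, 1

# Label-based alignment: no labels overlap, so no rows are paired up
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes pairs the rows by position instead
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

In this notebook the data comes straight from `pd.read_csv`, so `X` already carries the default index and the concatenation lines up; the check matters mainly if rows are filtered or shuffled beforehand.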


### Deriving Risk Flag
The `risk_flag` is a binary label derived from GPA: students with GPA < 2.0 are considered at risk (1), and all others are not (0). This label will be used as the target for classification modeling.
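The thresholding rule above can be sketched on a few made-up GPA values (the series and variable names are hypothetical). Note the strict inequality: a GPA of exactly 2.0 is not flagged.

```python
import pandas as pd

# Hypothetical GPA values, including the boundary case 2.0
toy_gpa = pd.Series([1.5, 2.0, 3.2, 0.4], name="gpa")

# Vectorized comparison yields booleans; astype(int) maps True/False to 1/0
toy_flag = (toy_gpa < 2.0).astype(int)

print(toy_flag.tolist())
```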

In [23]:
# Create 'risk_flag' from the actual GPA: GPA < 2.0 means at risk (1), otherwise 0
final_df['risk_flag'] = (final_df['gpa'] < 2.0).astype(int)

# View class distribution
print(final_df['risk_flag'].value_counts())

# Save final engineered dataset
final_df.to_csv("../data/feature_engineered_student_data.csv", index=False)
print("✅ Transformed & feature-engineered dataset saved successfully!")


1    1274
0    1118
Name: risk_flag, dtype: int64
✅ Transformed & feature-engineered dataset saved successfully!
