# 🧠 Feature Engineering Notebook

**Notebook Objective:** 

In this notebook, we engineer new features and transform existing ones to prepare the data for modeling. This includes one-hot encoding the nominal feature, scaling the numerical features, and passing the binary and ordinal features through unchanged. The final output is a clean, model-ready dataset saved for future use.

In [1]:
# Load packages
import pandas as pd
import joblib
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline

**1: Load Preprocessed Data**

In [2]:
# Load preprocessed dataset
processed_data_path = "../data/processed/processed_student_data.csv"
df = pd.read_csv(processed_data_path)

print(f"Data loaded successfully! Shape: {df.shape}")
df.head()

Data loaded successfully! Shape: (2392, 15)


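| | studentid | age | gender | ethnicity | parentaleducation | studytimeweekly | absences | tutoring | parentalsupport | extracurricular | sports | music | volunteering | gpa | gradeclass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1001 | 17 | 1 | 0 | 2 | 19.833723 | 7 | 1 | 2 | 0 | 0 | 1 | 0 | 2.929196 | 2.0 |
| 1 | 1002 | 18 | 0 | 0 | 1 | 15.408756 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3.042915 | 1.0 |
| 2 | 1003 | 15 | 0 | 2 | 3 | 4.21057 | 26 | 0 | 2 | 0 | 0 | 0 | 0 | 0.112602 | 4.0 |
| 3 | 1004 | 17 | 1 | 0 | 3 | 10.028829 | 14 | 0 | 3 | 1 | 0 | 0 | 0 | 2.054218 | 3.0 |
| 4 | 1005 | 17 | 1 | 0 | 2 | 4.672495 | 17 | 1 | 3 | 0 | 0 | 0 | 0 | 1.288061 | 4.0 |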


### Dealing with Categorical Variables

- Binary Categorical Variables:
  - `gender, tutoring, extracurricular, sports, music, volunteering`
    - These are already binary (0 or 1). No further encoding needed.
- Ordinal Variables:
  - `parentalsupport, parentaleducation`:
    - These are integer codes with an inherent order (`parentalsupport` runs from 0 to 4), so we keep them as is. Tree-based models can use the ordering directly; linear models will treat the codes as evenly spaced.
- Nominal Variables:
  - `ethnicity`:
    - This has no inherent order, so we'll use One-Hot Encoding to avoid assigning artificial ordinal relationships.

### Scaling

- Numerical Variables:
  - `studytimeweekly, absences, age`:
    - These are standardized with `StandardScaler` so they share a comparable scale.

**2: Define Column Types & Separate Features/Targets**

In [3]:
# 2. Define feature types
binary_features = ['gender', 'tutoring', 'extracurricular', 'sports', 'music', 'volunteering']
ordinal_features = ['parentalsupport', 'parentaleducation']
nominal_features = ['ethnicity']
numerical_features = ['studytimeweekly', 'absences', 'age']

features = binary_features + ordinal_features + nominal_features + numerical_features
target_cols = ['gpa', 'gradeclass']

# 3. Split into X and y
X = df[features]
y = df[target_cols]
print(f"Features and targets separated! X shape: {X.shape}, y shape: {y.shape}")

Features and targets separated! X shape: (2392, 12), y shape: (2392, 2)


**3: Define Preprocessor (Base Transformer)**

In [4]:
# 4. Define base preprocessor
base_preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(drop='first'), nominal_features),
        ('scale', StandardScaler(), numerical_features)
    ],
    remainder='passthrough'  # Keeps binary + ordinal as-is
)
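
As a rough illustration of what the `scale` step does to each numerical column, the sketch below compares `StandardScaler` with a manual z-score. The study-hour values are made up. The manual version uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up weekly study hours (illustrative only)
hours = np.array([[4.0], [10.0], [15.0], [20.0]])

scaled = StandardScaler().fit_transform(hours)

# Manual z-score: subtract the column mean, divide by the population std (ddof=0)
manual = (hours - hours.mean(axis=0)) / hours.std(axis=0)

matches = np.allclose(scaled, manual)
print(matches)
```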

**4: ColumnNameTransformer to Retain Column Names**

In [5]:
# 5. Custom transformer to retain feature names
class ColumnNameTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, transformer, feature_names):
        self.transformer = transformer
        self.feature_names = feature_names

    def fit(self, X, y=None):
        self.transformer.fit(X, y)
        return self

    def transform(self, X):
        X_t = self.transformer.transform(X)
        return pd.DataFrame(X_t, columns=self.feature_names, index=X.index)

**5: Fit Preprocessor and Get Column Names**

In [6]:
# 6. Fit preprocessor to data
base_preprocessor.fit(X)

# 7. Get transformed column names
encoded_columns = base_preprocessor.named_transformers_['onehot'].get_feature_names_out(nominal_features)
scaled_columns = numerical_features
remaining_columns = binary_features + ordinal_features

# Final column order
all_transformed_columns = list(encoded_columns) + scaled_columns + remaining_columns
print("Preprocessor fitted and column names extracted!")

Preprocessor fitted and column names extracted!


**6: Wrap Preprocessor and Save**

In [7]:
# 8. Wrap with ColumnNameTransformer
wrapped_preprocessor = ColumnNameTransformer(base_preprocessor, feature_names=all_transformed_columns)

# 9. Create and save pipeline
preprocessor_pipeline = Pipeline([("preprocessor", wrapped_preprocessor)])
joblib.dump(preprocessor_pipeline, "../models/preprocessor_pipeline.pkl")
print("Preprocessor pipeline saved successfully!")

Preprocessor pipeline saved successfully!
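
A saved pipeline is typically reloaded with `joblib.load` before being applied to new data. The sketch below round-trips a small stand-in pipeline through a temporary file; the file name and data are hypothetical. Because joblib pickles objects, a pipeline that contains a custom class such as `ColumnNameTransformer` needs that class to be importable in the session that loads it.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

data = np.array([[1.0], [2.0], [3.0]])  # made-up values
pipe = Pipeline([("scale", StandardScaler())]).fit(data)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipe, path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original transformation
same = np.allclose(pipe.transform(data), reloaded.transform(data))
print(same)
```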


**7: Transform Features & Create Final Dataset**

In [8]:
# 10. Apply transformation
X_transformed_df = preprocessor_pipeline.transform(X)

# 11. Combine with targets (X_transformed_df keeps X's index, so y already aligns with it)
final_df = pd.concat([X_transformed_df, y], axis=1)

# 12. Create risk_flag based on GPA
final_df['risk_flag'] = final_df['gpa'].apply(lambda gpa: 1 if gpa < 2.0 else 0)
print("risk_flag column added!")

# 13. View distribution
print("Class distribution (risk_flag):")
print(final_df['risk_flag'].value_counts())

# 14. Save feature-engineered dataset
final_df.to_csv("../data/feature_engineered_student_data.csv", index=False)
print("Feature-engineered dataset saved successfully!")

risk_flag column added!
Class distribution (risk_flag):
1    1274
0    1118
Name: risk_flag, dtype: int64
Feature-engineered dataset saved successfully!
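
The `risk_flag` rule above (1 when GPA is strictly below 2.0) can also be written as a vectorized comparison. The sketch below uses made-up GPA values and checks that both forms agree; `value_counts(normalize=True)` then reports class shares rather than raw counts.

```python
import pandas as pd

gpa = pd.Series([2.93, 3.04, 0.11, 2.05, 1.29, 2.00])  # made-up values

# Same rule as the notebook: 1 when GPA is strictly below 2.0, else 0
flag_apply = gpa.apply(lambda g: 1 if g < 2.0 else 0)
flag_vectorized = (gpa < 2.0).astype(int)

print(flag_vectorized.tolist())
print(flag_vectorized.value_counts(normalize=True))
```

A GPA of exactly 2.0 is not flagged, since the comparison is strict.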
