# Week 3 — Data Preprocessing & Feature Engineering

This notebook prepares the Portuguese Student Performance dataset for machine learning modeling.  
It includes:
- Data cleaning  
- Handling missing values  
- Encoding categorical features  
- Scaling numerical features  
- Basic feature engineering  
- Saving a processed dataset for Week 4 model training.


## 1. Import Required Libraries

In this step, we import the main Python libraries needed for preprocessing and transforming the dataset before training machine-learning models.


In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

## 2. Load the Dataset

We load the Portuguese student performance dataset (`student-por.csv`), which originates from the UCI repository. The file is semicolon-delimited, so we pass `sep=";"`.  
This raw dataset is the starting point for all preprocessing steps.

In [2]:
df = pd.read_csv("student-por.csv", sep=";")
df.head()

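| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ... | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ... | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ... | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ... | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |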


## 3. Inspect Data Structure and Missing Values
We check the data types of all columns and verify whether any feature contains missing values.  
In this dataset, every column is complete (the missing-value count is 0 for all 33 columns), but we still keep imputation in the pipeline for robustness.

In [3]:
# Check basic info
df.info()

# Check missing values
df.isna().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 649 entries, 0 to 648
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      649 non-null    object
 1   sex         649 non-null    object
 2   age         649 non-null    int64 
 3   address     649 non-null    object
 4   famsize     649 non-null    object
 5   Pstatus     649 non-null    object
 6   Medu        649 non-null    int64 
 7   Fedu        649 non-null    int64 
 8   Mjob        649 non-null    object
 9   Fjob        649 non-null    object
 10  reason      649 non-null    object
 11  guardian    649 non-null    object
 12  traveltime  649 non-null    int64 
 13  studytime   649 non-null    int64 
 14  failures    649 non-null    int64 
 15  schoolsup   649 non-null    object
 16  famsup      649 non-null    object
 17  paid        649 non-null    object
 18  activities  649 non-null    object
 19  nursery     649 non-null    object
 20  higher    

school        0
sex           0
age           0
address       0
famsize       0
Pstatus       0
Medu          0
Fedu          0
Mjob          0
Fjob          0
reason        0
guardian      0
traveltime    0
studytime     0
failures      0
schoolsup     0
famsup        0
paid          0
activities    0
nursery       0
higher        0
internet      0
romantic      0
famrel        0
freetime      0
goout         0
Dalc          0
Walc          0
health        0
absences      0
G1            0
G2            0
G3            0
dtype: int64

## 4. Basic Feature Engineering

We create a few new features that may improve model performance and interpretability:
- `study_absence_interaction`: combines study time and absences  
- `age_group`: bins age into categories (young, middle, older)  
- `log_absences`: log-transformed absences to reduce skewness

In [4]:
# Interaction term: studytime × absences
df["study_absence_interaction"] = df["studytime"] * df["absences"]

# Age groups using bins
df["age_group"] = pd.cut(
    df["age"],
    bins=[15, 17, 19, 22],
    labels=["young", "middle", "older"],
    include_lowest=True
)

# Log-transform absences (add 1 to avoid log(0))
df["log_absences"] = np.log1p(df["absences"])

df[["studytime", "absences", "study_absence_interaction", "age", "age_group", "log_absences"]].head()

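| | studytime | absences | study_absence_interaction | age | age_group | log_absences |
|---|---|---|---|---|---|---|
| 0 | 2 | 4 | 8 | 18 | middle | 1.609438 |
| 1 | 2 | 2 | 4 | 17 | young | 1.098612 |
| 2 | 2 | 6 | 12 | 15 | young | 1.94591 |
| 3 | 3 | 0 | 0 | 15 | young | 0.0 |
| 4 | 2 | 0 | 0 | 16 | young | 0.0 |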


We engineered new features that may capture more nuanced relationships between behavior (study time, absences, age) and performance, which can later be analyzed and modeled.
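
To make the bin edges concrete, here is a minimal sketch on a hypothetical toy frame (the values are invented for illustration, not taken from the dataset). It reuses the same `pd.cut` arguments as above: with `include_lowest=True`, the bins are effectively [15, 17], (17, 19], and (19, 22], and an age outside that range would become `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, chosen to hit every bin edge
toy = pd.DataFrame({
    "age": [15, 17, 18, 19, 20, 22, 23],
    "absences": [0, 2, 4, 6, 0, 1, 3],
})

# Same bins and labels as in the notebook cell above
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[15, 17, 19, 22],
    labels=["young", "middle", "older"],
    include_lowest=True,
)

# log1p(x) = log(1 + x), so zero absences map to exactly 0.0
toy["log_absences"] = np.log1p(toy["absences"])

print(toy)
```

Ages 15 and 17 both land in `young`, 18 and 19 in `middle`, 20 and 22 in `older`, and the out-of-range age 23 gets no label. If future data could contain ages outside 15 to 22, the bin edges would need widening.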

## 5. Split Features (X) and Target (y)

We separate the input features (`X`) from the target variable (`y`).  
Here, `G3` (final grade) is the prediction target, and all other columns are used as features.

In [5]:
# Target variable
y = df["G3"]

# Features: drop G3 from the dataset
X = df.drop("G3", axis=1)

print("X shape:", X.shape)
print("y shape:", y.shape)

X shape: (649, 35)
y shape: (649,)


**Step 5 completed:**  
We defined `y` as the final grade (G3) and `X` as all remaining columns, ensuring that the target is not used as an input feature.

## 6. Identify Numerical and Categorical Columns

We automatically detect numerical and categorical columns from `X`.  
This allows us to apply different preprocessing steps to each type (e.g., scaling for numeric, encoding for categorical).

In [6]:
num_cols = X.select_dtypes(include=['int64', 'float64']).columns
cat_cols = X.select_dtypes(include=['object']).columns

print("Numeric columns:", num_cols)
print("Categorical columns:", cat_cols)

Numeric columns: Index(['age', 'Medu', 'Fedu', 'traveltime', 'studytime', 'failures', 'famrel',
       'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences', 'G1', 'G2',
       'study_absence_interaction', 'log_absences'],
      dtype='object')
Categorical columns: Index(['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob',
       'reason', 'guardian', 'schoolsup', 'famsup', 'paid', 'activities',
       'nursery', 'higher', 'internet', 'romantic'],
      dtype='object')


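**Step 6 completed:**  
We identified numeric features (e.g., age, grades, absences) and categorical features (e.g., sex, address, parental jobs), which will be handled differently in the preprocessing pipeline.  
Note that the engineered `age_group` column has the pandas `category` dtype, so it appears in neither list (17 + 17 = 34 of the 35 columns in `X`). Because the `ColumnTransformer` in Step 7 drops unlisted columns by default, `age_group` does not reach the processed output.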
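
One possible way to keep `category` columns such as `age_group` in the categorical list is to select by "number" versus "not number" instead of naming specific dtypes. The sketch below uses a hypothetical toy frame and hypothetical variable names (`num_cols_alt`, `cat_cols_alt`); it is not the selection used to produce the outputs in this notebook, and switching to it would change the shape of `X_processed`.

```python
import pandas as pd

# Hypothetical toy frame with one numeric, one string, and one binned column
toy = pd.DataFrame({"age": [15, 18, 21], "sex": ["F", "M", "F"]})
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[15, 17, 19, 22],
    labels=["young", "middle", "older"],
    include_lowest=True,
)

# pd.cut returns a categorical column, not an object column
print(toy.dtypes)

# Alternative selection: everything numeric vs. everything else
num_cols_alt = toy.select_dtypes(include="number").columns
cat_cols_alt = toy.select_dtypes(exclude="number").columns

print("Numeric columns:", list(num_cols_alt))
print("Categorical columns:", list(cat_cols_alt))
```

With this selection, `age_group` is listed alongside `sex` as a categorical column, so it would be one-hot encoded rather than dropped.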

## 7. Build the Preprocessing Pipeline

We create a `ColumnTransformer` that:
- Imputes missing numeric values and scales them using `StandardScaler`
- Imputes missing categorical values and encodes them using `OneHotEncoder`

This ensures consistent preprocessing across the selected numeric and categorical columns.

In [7]:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler())
        ]), num_cols),

        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore"))
        ]), cat_cols)
    ]
)

**Step 7 completed:**  
We defined a preprocessing pipeline that standardizes numerical features and one-hot encodes categorical features, while safely handling any missing values.
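
As a rough illustration of what this pipeline does, the sketch below fits the same kind of `ColumnTransformer` on a tiny hypothetical frame (the column values and the names `train`, `new`, `toy_pre`, and `to_dense` are invented for the example). It shows three behaviours: the median fills a missing numeric value, the scaled column has mean 0, and a category not seen during fitting is encoded as all zeros because of `handle_unknown="ignore"`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with one missing numeric value
train = pd.DataFrame({
    "absences": [0.0, 2.0, np.nan, 6.0],
    "Mjob": ["teacher", "other", "other", "health"],
})
# Hypothetical new row with a category ("services") not seen in training
new = pd.DataFrame({"absences": [4.0], "Mjob": ["services"]})

toy_pre = ColumnTransformer(
    transformers=[
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), ["absences"]),
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("encoder", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Mjob"]),
    ]
)

def to_dense(matrix):
    """Return a dense NumPy array whether the input is sparse or dense."""
    return matrix.toarray() if hasattr(matrix, "toarray") else np.asarray(matrix)

train_out = to_dense(toy_pre.fit_transform(train))  # learn statistics on train
new_out = to_dense(toy_pre.transform(new))          # reuse them on new data

print(train_out.shape)  # 1 scaled numeric column + 3 one-hot columns
print(new_out)          # one-hot part is all zeros for the unseen category
```

The transformer is fitted once and then reused with `transform`, which is the pattern that would apply if the data were later split into training and test sets.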

## 8. Fit and Transform Features (Create `X_processed`)

We now apply the preprocessing pipeline to `X` to create a fully transformed feature matrix, ready for model training in Week 4.

In [8]:
X_processed = preprocessor.fit_transform(X)
print("X_processed created. Shape:", X_processed.shape)

X_processed created. Shape: (649, 60)


**Step 8 completed:**  
We successfully transformed the selected features into a numerical matrix `X_processed` with 60 columns: 17 scaled numeric features plus 43 one-hot encoded columns derived from the 17 categorical features.

## 9. Save the Processed Dataset

We convert the processed feature matrix into a DataFrame, attach the target variable `G3`, and save the result as `processed_student_data.csv` for Week 4 modeling.


In [9]:
# Convert X_processed to a dense array if it is sparse
if hasattr(X_processed, "toarray"):
    X_array = X_processed.toarray()
else:
    X_array = X_processed

# Build processed DataFrame
processed_df = pd.DataFrame(X_array)
processed_df["G3"] = y.values

# Save to CSV
processed_df.to_csv("processed_student_data.csv", index=False)

print("Saved processed_student_data.csv with shape:", processed_df.shape)

Saved processed_student_data.csv with shape: (649, 61)


**Step 9 completed:**  
We created `processed_student_data.csv`, which contains fully preprocessed features and the target grade G3.  
This file will be used in **Week 4** for model training (e.g., Linear Regression, Random Forest, SVM).

Note: SMOTE will be applied later if we convert G3 into a classification label (e.g., pass/fail). For now, we skip balancing for the regression setup.

## 10. Note on Class Balancing (SMOTE for Future Work)

In this notebook, `G3` is treated as a continuous regression target (0–20).  
Balancing methods like **SMOTE** are mainly designed for classification problems (e.g., Pass/Fail), so they are not applied in this regression setup.

➡️ In a future step, if we convert `G3` into a binary label such as:
- Pass (G3 ≥ threshold)
- Fail (G3 < threshold)

then we can apply SMOTE to the training data to oversample the minority class and reduce class imbalance.
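
The label conversion itself could look like the following minimal sketch. The grades are hypothetical, and the threshold of 10 is an assumption made for illustration only; the notebook does not fix a threshold. SMOTE is not shown here, since it requires the separate imbalanced-learn package.

```python
import pandas as pd

# Hypothetical final grades on the 0-20 scale
g3 = pd.Series([0, 8, 10, 11, 13, 14, 17], name="G3")

# Assumed pass mark, for illustration only
PASS_THRESHOLD = 10

# 1 = Pass (G3 >= threshold), 0 = Fail (G3 < threshold)
passed = (g3 >= PASS_THRESHOLD).astype(int)

# Class counts show how imbalanced the resulting label is
class_counts = passed.value_counts().sort_index()
print(class_counts)
```

Checking the class counts first indicates whether balancing is needed at all for the chosen threshold.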