 # Data Preprocessing for Student Performance, Breast Cancer, and Penguins Datasets



 This notebook preprocesses three datasets:



 - **Student_Performance.csv**: A regression dataset predicting `Performance Index`.

 - **breast-cancer.csv**: A classification dataset predicting `diagnosis`. Two target versions are produced:

    - For logistic regression: target labels {0, 1}.

    - For SVM: target labels {-1, 1}.

 - **penguins.csv**: A clustering dataset for a K-Means task. The preprocessing:

    - Handles missing values using the "differentiated" strategy (numeric columns are imputed with the median, categorical columns with the mode).

    - Encodes categorical columns as numeric values.

    - Normalizes the features to the range [0, 1].

    - Reduces the normalized features to two principal components with PCA.

 ## Setup

 Import necessary libraries and set up the project root for file paths.

In [None]:
import os
import sys
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.decomposition import PCA

# Set project root directory
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)

# Import custom utility functions from data_utils.py
from src.scratch.utils.data_utils import (
    load_data, 
    shuffle_data_pandas, 
    encode_categorical, 
    handle_missing_values, 
    feature_target_split, 
    normalize, 
    split_data,
    drop_columns
)


 ## Student Performance Preprocessing (Regression)



 This section prepares the `Student_Performance.csv` dataset for predicting the `Performance Index`.

 Steps include shuffling, encoding categorical columns, handling missing values, feature‐target splitting, normalization, and splitting into training/test sets.

 ### Load and Inspect Data

 Load the dataset and display basic information to confirm structure.

In [None]:
# Check current working directory
print("Current working directory:", os.getcwd())


In [None]:
# Load the Student Performance dataset and shuffle it
data_path = Path("../data/raw/Regression_Dataset/Student_Performance.csv")
df_student = load_data(data_path)
df_student = shuffle_data_pandas(df_student)

# Display basic information and the first few rows
print("Student Performance Dataset Info:")
print(df_student.info())
print("\nFirst few rows:")
print(df_student.head())


 ### Encode Categorical Columns



 Here the `Extracurricular Activities` column (with values 'Yes'/'No') is encoded to numerical values.
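
The helper `encode_categorical` is defined in `data_utils.py`, which is not shown in this notebook. As a rough sketch of what such an encoding might look like, assuming it factorizes non-numeric columns into integer codes (the name `encode_non_numeric` and the demo frame are hypothetical):

```python
import pandas as pd

def encode_non_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: replace each non-numeric column with integer codes."""
    out = df.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # factorize assigns codes in order of first appearance
            codes, _ = pd.factorize(out[col])
            out[col] = codes
    return out

demo = pd.DataFrame({
    "Hours Studied": [7, 4, 8],
    "Extracurricular Activities": ["Yes", "No", "Yes"],
})
encoded = encode_non_numeric(demo)
print(encoded)
```

With factorization, the code given to each label depends on which value appears first, so after shuffling 'Yes' may map to 0 in one run and 1 in another unless the shuffle is seeded.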

In [None]:
df_student = encode_categorical(df_student)


 ### Handle Missing Values

 - From `df.info()`, there are no missing values (10,000 non-null entries per column).

 - Apply the function for completeness and robustness.

In [None]:
df_student = handle_missing_values(df_student, strategy="mean")


 ### Split Features and Target

 - **Target**: `Performance Index` (float64, continuous for regression).

 - **Features**: All other columns (`Hours Studied`, `Previous Scores`, `Extracurricular Activities`, `Sleep Hours`, `Sample Question Papers Practiced`).

In [None]:
target_column = "Performance Index"
X_student, y_student = feature_target_split(df_student, target_column)


### Normalize Features and Convert to NumPy Arrays

- Normalize only the feature columns (X) to ensure consistent scale.

- Do not normalize the target (`Performance Index`) as it’s a regression output.

- Convert features and target to NumPy arrays for compatibility with machine learning models.
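
The `normalize` helper is not shown here. A minimal sketch of column-wise scaling, assuming it performs min-max normalization to the [0, 1] range that this notebook mentions for the Penguins features (`min_max_scale` is a hypothetical name):

```python
import numpy as np

def min_max_scale(X: np.ndarray) -> np.ndarray:
    """Hypothetical helper: rescale each column to the [0, 1] range."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Guard against constant columns, which would otherwise divide by zero
    col_range[col_range == 0] = 1.0
    return (X - col_min) / col_range

X_demo = np.array([[1.0, 40.0], [5.0, 70.0], [9.0, 100.0]])
X_scaled = min_max_scale(X_demo)
print(X_scaled)
```

Computing the minimum and maximum on the full dataset before splitting lets test-set statistics influence the training features. A stricter variant would compute them on the training split only and reuse them for the test split.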


In [None]:
X_student = normalize(X_student)
X_student = X_student.to_numpy()
y_student = y_student.to_numpy()


 ### Split into Training and Test Sets and Save



 The data is split 80/20 for model training and evaluation.
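
`split_data` comes from `data_utils.py` and its implementation is not shown. One plausible way to build an 80/20 holdout split, assuming a random permutation of row indices (`holdout_split` and its `seed` parameter are hypothetical):

```python
import numpy as np

def holdout_split(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: split arrays into train and test partitions."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    # The same indices are applied to X and y so rows stay aligned
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, X_te, y_tr, y_te = holdout_split(X_demo, y_demo, test_size=0.2)
print(X_tr.shape, X_te.shape)
```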

In [None]:
X_train_student, X_test_student, y_train_student, y_test_student = split_data(X_student, y_student, test_size=0.2)

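# Create the output directory if it does not exist yet, so np.save does not fail on a missing path
Path("../data/processed").mkdir(parents=True, exist_ok=True)
np.save("../data/processed/student_X_train.npy", X_train_student)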
np.save("../data/processed/student_X_test.npy", X_test_student)
np.save("../data/processed/student_y_train.npy", y_train_student)
np.save("../data/processed/student_y_test.npy", y_test_student)

print("\nStudent Performance data processed and saved.")


 ## Breast Cancer Preprocessing (Classification)



 This section prepares the `breast-cancer.csv` dataset for predicting the `diagnosis` label.

 Two target versions are produced:



 - **Logistic Regression Version**: Targets are encoded as {0,1} (‘M’ is mapped to 1 and ‘B’ to 0).

 - **SVM Version**: Targets are transformed to {-1,1} (with 0 converted to -1).

 ### Load and Inspect Data

 Load the dataset and display basic information to confirm structure.

In [None]:
# Load the Breast Cancer dataset and shuffle it
data_path = Path("../data/raw/Classification_Dataset/breast-cancer.csv")
df_bc = load_data(data_path)
df_bc = shuffle_data_pandas(df_bc)

# Display basic info and first few rows
print("Breast Cancer Dataset Info:")
print(df_bc.info())
print("\nFirst few rows:")
print(df_bc.head())


### Drop Irrelevant Columns and Encode Target

- `id` column is irrelevant for modeling and should be removed.
- `diagnosis` is the target column (object type, 'M' for malignant, 'B' for benign).

- Map 'M' to 1 and 'B' to 0 for binary classification.


In [None]:
df_bc = drop_columns(df_bc, ["id"])
df_bc["diagnosis"] = df_bc["diagnosis"].map({"M": 1, "B": 0}).astype(int)


 ### Handle Missing Values

 - From `df.info()`, there are no missing values (569 non-null entries per column).

 - Apply the function for completeness.

In [None]:
df_bc = handle_missing_values(df_bc, strategy="mean")


 ### Split Features and Target

 - **Target**: `diagnosis` (now int, binary for classification).

 - **Features**: All other columns (30 numerical features like `radius_mean`, `texture_mean`, etc.).

In [None]:
target_column = "diagnosis"
X_bc, y_bc = feature_target_split(df_bc, target_column)


### Encode Categorical Columns in Features (Safeguard) and Normalize

- No categorical columns in features (all are float64 after dropping `id` and encoding `diagnosis`).

- Apply the function as a safeguard for future datasets.

- Normalize only the feature columns (X) to ensure consistent scale.

- Do not normalize the target (`diagnosis`) as it’s a binary label.


In [None]:
X_bc = encode_categorical(X_bc)
X_bc = normalize(X_bc)


 ### Convert to NumPy Arrays

 - Convert features and target to NumPy arrays for model compatibility.

In [None]:
X_bc = X_bc.to_numpy()
y_bc = y_bc.to_numpy()


 ### Split into Training and Test Sets and Create Two Versions of the Target



 The data is split 80/20. Additionally, an SVM version of the target is created by converting 0 to -1.
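
As a small illustration of why the {-1, 1} encoding is convenient, the sketch below applies the same `np.where` mapping to a toy label vector and uses the result in a hinge loss, which is a common SVM objective. The labels and `scores` values are made up for the example, and whether the SVM in this project uses exactly this loss is an assumption.

```python
import numpy as np

y_demo = np.array([1, 0, 0, 1])            # 1 = malignant, 0 = benign
y_svm = np.where(y_demo == 0, -1, 1)       # 0 becomes -1, 1 stays 1
y_alt = 2 * y_demo - 1                     # equivalent arithmetic form

# Hypothetical decision scores f(x) for the four samples
scores = np.array([2.0, -0.5, 0.3, 0.1])
# Hinge loss relies on the sign of y * f(x), hence the {-1, 1} labels
hinge = np.maximum(0.0, 1.0 - y_svm * scores)
print(y_svm, hinge)
```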

In [None]:
X_train_bc, X_test_bc, y_train_bc, y_test_bc = split_data(X_bc, y_bc, test_size=0.2)

# Create SVM targets by replacing 0 with -1
y_train_bc_svm = np.where(y_train_bc == 0, -1, 1)
y_test_bc_svm = np.where(y_test_bc == 0, -1, 1)


 ### Save Processed Data for Both Logistic Regression and SVM



 The logistic regression version uses {0,1} targets. The SVM version uses {-1,1}.

In [None]:
# Save logistic regression data
np.save("../data/processed/breast_cancer_X_train.npy", X_train_bc)
np.save("../data/processed/breast_cancer_X_test.npy", X_test_bc)
np.save("../data/processed/breast_cancer_y_train.npy", y_train_bc)
np.save("../data/processed/breast_cancer_y_test.npy", y_test_bc)

# Save SVM-specific targets
np.save("../data/processed/breast_cancer_y_train_svm.npy", y_train_bc_svm)
np.save("../data/processed/breast_cancer_y_test_svm.npy", y_test_bc_svm)

print("\nBreast Cancer data (both logistic regression and SVM versions) processed and saved.")


## Penguins Preprocessing for K-Means with PCA

This section preprocesses the Penguins dataset to create a fully numeric dataset,
applies normalization, and then reduces its dimensionality using PCA.

The resulting data (with `n_components=2`) is ready for training a K-Means clustering model.

Steps:
1. Load and shuffle the dataset.
2. Handle missing values using a "differentiated" strategy.
3. Encode categorical features to numeric values.
4. Normalize all numeric features to the [0, 1] range.
5. Apply PCA to reduce dimensionality.
6. Save the final dataset for clustering.

In [None]:
# Load the Penguins dataset
file_path = Path("../data/raw/K_Means_Dataset/penguins.csv")
df_penguins = load_data(file_path)

# Optionally, shuffle the dataset
df_penguins = shuffle_data_pandas(df_penguins)

# Display basic info and the first few rows
print("Penguins Dataset Info:")
print(df_penguins.info())
print("\nFirst few rows:")
print(df_penguins.head())


 ### Handle Missing Values



 Use the "differentiated" strategy to:

 - Impute missing numeric values with the median.

 - Impute missing categorical values with the mode.
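
The two rules above can be sketched as follows. This is an illustrative stand-in, not the `handle_missing_values` implementation from `data_utils.py`; the function name `impute_differentiated` and the column `body_mass_g` in the demo frame are assumptions.

```python
import numpy as np
import pandas as pd

def impute_differentiated(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: median for numeric columns, mode for the rest."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # mode() ignores missing values; iloc[0] picks the most frequent label
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

demo = pd.DataFrame({
    "body_mass_g": [3500.0, np.nan, 4200.0, 5000.0],
    "sex": ["MALE", "FEMALE", None, "FEMALE"],
})
filled = impute_differentiated(demo)
print(filled)
```

The median is less affected by outliers than the mean, which is one reason to prefer it for numeric columns. Note that this sketch would fail on a categorical column that is entirely missing, since its mode is empty.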

In [None]:
df_penguins = handle_missing_values(df_penguins, strategy="differentiated")

# Verify that there are no missing values left
print("Missing values after handling:")
print(df_penguins.isnull().sum())
print(df_penguins.head())


### Encode Categorical Features

Convert all columns of type object or category to numeric using factorization.
This ensures the data is fully numeric for K-Means training.


In [None]:
df_penguins_encoded = encode_categorical(df_penguins)

print("After encoding categorical features:")
print(df_penguins_encoded.head())

 ### Normalize Features



 K-Means is sensitive to the scale of features. We normalize the numeric features to the [0, 1] range.

In [None]:
df_penguins_norm = normalize(df_penguins_encoded)
print("\nNormalized features sample:")
print(df_penguins_norm.head())


### Apply PCA to Reduce Dimensionality

We use PCA from scikit-learn to reduce the data to two principal components.
This step is useful for visualization and to simplify clustering with K-Means.
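
To make the projection step concrete, here is a NumPy-only sketch of PCA via singular value decomposition on random demo data. The function `pca_svd` is hypothetical and is meant only to illustrate the idea; the notebook itself uses scikit-learn's `PCA`, and component signs may differ between the two.

```python
import numpy as np

def pca_svd(X: np.ndarray, n_components: int = 2):
    """Project X onto its top principal components using SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T
    explained_ratio = (S**2 / np.sum(S**2))[:n_components]
    return projected, explained_ratio

rng = np.random.default_rng(0)
X_demo = rng.random((50, 4))
Z, ratio = pca_svd(X_demo, n_components=2)
print(Z.shape, ratio)
```

If the two explained variance ratios sum to a small fraction, much of the structure in the data is lost in the projection, which is worth checking before clustering on the reduced data.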


In [None]:
data = df_penguins_norm.to_numpy()

# Initialize PCA with desired number of components (e.g., 2)
pca = PCA(n_components=2)
data_pca = pca.fit_transform(data)

print("Data shape after PCA transformation:", data_pca.shape)
print("Explained variance ratio by components:", pca.explained_variance_ratio_)

### Save Processed Data

- Save the PCA-transformed data as a NumPy array.
- This file, `kmeans_penguins_data.npy`, is now ready to be fed directly into a K-Means model.
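
A quick way to confirm that a saved array can be read back unchanged is a save/load round trip. The sketch below uses a temporary directory and a small made-up array rather than the real processed data.

```python
import os
import tempfile
import numpy as np

data_demo = np.array([[0.1, -0.2], [0.3, 0.4]])

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "kmeans_penguins_data.npy")
    np.save(out_path, data_demo)
    # np.load restores the array with its original shape and dtype
    loaded = np.load(out_path)

print(loaded.shape, loaded.dtype)
```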


In [None]:
# Save the PCA-transformed features
np.save("../data/processed/kmeans_penguins_data.npy", data_pca)

print("\nPenguins dataset processed and saved for K-Means clustering. ^__^")
