 # Data Preprocessing for Student Performance and Breast Cancer Datasets

 This notebook preprocesses two datasets:

 - **Student_Performance.csv**: A regression dataset predicting `Performance Index`.

 - **breast-cancer.csv**: A classification dataset predicting `diagnosis` (Malignant/Benign).



 The steps include loading the data, encoding categorical variables, handling missing values, normalizing features, splitting into training and test sets, and saving the processed data as NumPy arrays.

 ## Setup

 Import necessary libraries and set up the project root for file paths.

In [20]:
import os
import sys

import numpy as np
import pandas as pd

# Set project root directory
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
sys.path.append(project_root)

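# Import only the helpers this notebook uses (avoids a wildcard import)
from src.scratch.utils.data_utils import (
    drop_columns,
    encode_categorical,
    feature_target_split,
    handle_missing_values,
    load_data,
    normalize,
    split_data,
)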

 ## Student Performance Preprocessing (Regression)

 Prepares the `Student_Performance.csv` dataset for a regression task to predict `Performance Index`.

 ### Load and Inspect Data

 Load the dataset and display basic information to confirm structure.

In [21]:
# Load dataset
data_path = "../data/raw/Regression_Dataset/Student_Performance.csv"
df = load_data(data_path)  # Assumes load_data() is a utility function that reads CSV

# Display basic info and first few rows
print("Student Performance Dataset Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nFirst few rows:")
print(df.head())

Student Performance Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 6 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Hours Studied                     10000 non-null  int64  
 1   Previous Scores                   10000 non-null  int64  
 2   Extracurricular Activities        10000 non-null  object 
 3   Sleep Hours                       10000 non-null  int64  
 4   Sample Question Papers Practiced  10000 non-null  int64  
 5   Performance Index                 10000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 468.9+ KB

First few rows:
   Hours Studied  Previous Scores Extracurricular Activities  Sleep Hours  \
0              7               99                        Yes            9   
1              4               82                         No            4   
2              8               51     

 ### Encode Categorical Columns

 - `Extracurricular Activities` is the only categorical column (object type, 'Yes'/'No').

 - Encode it to numerical values (e.g., Yes=1, No=0).

In [22]:
df = encode_categorical(df)
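
The body of `encode_categorical` is not shown in this notebook. As a minimal sketch of the Yes=1, No=0 mapping described above, assuming pandas and a hypothetical helper named `encode_yes_no` (not part of `data_utils`), the encoding could look like this:

```python
import pandas as pd


def encode_yes_no(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical helper: map 'Yes'/'No' strings in one column to 1/0."""
    mapping = {"Yes": 1, "No": 0}
    # Fail loudly on categories the mapping does not cover.
    unexpected = set(df[column].dropna().unique()) - set(mapping)
    if unexpected:
        raise ValueError(f"Unexpected categories in {column!r}: {sorted(map(str, unexpected))}")
    out = df.copy()  # leave the caller's DataFrame untouched
    out[column] = out[column].map(mapping)
    return out


sample = pd.DataFrame({"Extracurricular Activities": ["Yes", "No", "Yes"]})
encoded = encode_yes_no(sample, "Extracurricular Activities")
print(encoded["Extracurricular Activities"].tolist())  # → [1, 0, 1]
```

Raising on unexpected categories is a design choice in this sketch; the real utility may handle them differently.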

 ### Handle Missing Values

 - From `df.info()`, there are no missing values (10,000 non-null entries per column).

 - `handle_missing_values` is still applied so the pipeline keeps working if a later version of the data contains gaps.

In [23]:
df = handle_missing_values(df, strategy="mean")  # Uses mean imputation if needed
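
How `handle_missing_values` implements `strategy="mean"` is not shown here. One plausible reading is that it fills gaps in numeric columns with the column mean. The sketch below illustrates that idea with a hypothetical helper `impute_numeric_mean` and toy values.

```python
import numpy as np
import pandas as pd


def impute_numeric_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill NaNs in numeric columns with the column mean."""
    out = df.copy()
    numeric_cols = out.select_dtypes(include="number").columns
    # fillna with a Series fills each column with the value under its own label.
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].mean())
    return out


# Toy data, not rows from the dataset.
sample = pd.DataFrame({"Sleep Hours": [8.0, np.nan, 6.0], "Hours Studied": [2, 4, 6]})
print(sample.isna().sum().sum())       # missing cells before imputation → 1
filled = impute_numeric_mean(sample)
print(filled["Sleep Hours"].tolist())  # → [8.0, 7.0, 6.0]
```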

 ### Split Features and Target

 - **Target**: `Performance Index` (float64, continuous for regression).

 - **Features**: All other columns (`Hours Studied`, `Previous Scores`, `Extracurricular Activities`, `Sleep Hours`, `Sample Question Papers Practiced`).

In [24]:
target_column = "Performance Index"
X, y = feature_target_split(df, target_column)  # Splits features (X) and target (y)
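
As an illustration of the split described above, the sketch below separates one target column from the remaining feature columns. `split_features_target` is a hypothetical stand-in for `feature_target_split`, whose source is not shown, and the values are toy data.

```python
import pandas as pd


def split_features_target(df: pd.DataFrame, target: str):
    """Hypothetical helper: return (features, target) without modifying df."""
    X = df.drop(columns=[target])  # every column except the target
    y = df[target]                 # the target as a Series
    return X, y


# Toy values, not rows from the dataset.
sample = pd.DataFrame(
    {"Hours Studied": [7, 4], "Sleep Hours": [9, 4], "Performance Index": [50.0, 60.0]}
)
X_demo, y_demo = split_features_target(sample, "Performance Index")
print(list(X_demo.columns))  # → ['Hours Studied', 'Sleep Hours']
print(y_demo.tolist())       # → [50.0, 60.0]
```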

 ### Normalize Features

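 - Normalize only the feature columns (X) so that features measured on different scales (for example `Previous Scores` and `Sleep Hours`) contribute comparably.

 - Leave the target (`Performance Index`) unnormalized so that predictions stay in the original units.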

In [25]:
X = normalize(X)
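
The notebook does not say whether `normalize` applies min-max scaling, z-score standardization, or something else. Purely as an assumption, the sketch below shows z-score standardization with a hypothetical helper `zscore_normalize`, including a guard for constant columns.

```python
import pandas as pd


def zscore_normalize(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: scale each column to zero mean and unit variance."""
    mean = X.mean()
    # Replace a zero standard deviation with 1 so constant columns map to 0
    # instead of dividing by zero.
    std = X.std(ddof=0).replace(0, 1.0)
    return (X - mean) / std


features = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [5.0, 5.0, 5.0]})
scaled = zscore_normalize(features)
print(scaled["a"].round(3).tolist())  # → [-1.225, 0.0, 1.225]
print(scaled["b"].tolist())           # → [0.0, 0.0, 0.0]
```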

 ### Convert to NumPy Arrays

 - Convert features and target to NumPy arrays for compatibility with machine learning models.

In [26]:
X = X.to_numpy()
y = y.to_numpy()

 ### Split into Training and Test Sets

 - Split data into 80% training and 20% testing for model evaluation.

In [27]:
X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.2)
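
Whether `split_data` shuffles or accepts a random seed is not visible from this call. The sketch below shows one way an 80/20 split can be made reproducible with NumPy. The helper name `shuffled_split` and its `seed` parameter are assumptions for illustration only.

```python
import numpy as np


def shuffled_split(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: shuffle row indices, then hold out a test fraction."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.arange(10, dtype=float)
X_tr, X_te, y_tr, y_te = shuffled_split(X_demo, y_demo, test_size=0.2, seed=42)
print(X_tr.shape, X_te.shape)  # → (8, 2) (2, 2)
```

Fixing the seed means that rerunning the notebook yields the same partition, which keeps saved arrays and later evaluation results comparable.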

 ### Save Processed Data

 - Save the processed arrays to the `../data/processed/` directory for use in training scripts.

In [28]:
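# Create the output directory if it does not exist yet
os.makedirs("../data/processed", exist_ok=True)
np.save("../data/processed/student_X_train.npy", X_train)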
np.save("../data/processed/student_X_test.npy", X_test)
np.save("../data/processed/student_y_train.npy", y_train)
np.save("../data/processed/student_y_test.npy", y_test)

print("\nStudent Performance data processed and saved.")


Student Performance data processed and saved.
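
A training script would read these files back with `np.load`. The following round-trip check is a small sketch of that step. It writes to a temporary directory as a stand-in for `../data/processed/` and uses a toy array rather than the real data.

```python
import os
import tempfile

import numpy as np

# Toy array standing in for the processed feature matrix.
arr = np.linspace(0.0, 1.0, 6).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "student_X_train.npy")
    np.save(path, arr)      # same call the notebook uses
    loaded = np.load(path)  # what a training script would do

print(loaded.shape, loaded.dtype)   # → (3, 2) float64
print(np.array_equal(arr, loaded))  # → True
```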


 ## Breast Cancer Preprocessing (Classification)

 Prepares the `breast-cancer.csv` dataset for a classification task to predict `diagnosis`.

 ### Load and Inspect Data

 Load the dataset and display basic information to confirm structure.

In [29]:
# Load dataset
data_path = "../data/raw/Classification_Dataset/breast-cancer.csv"
df = load_data(data_path)

# Display basic info and first few rows
print("Breast Cancer Dataset Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nFirst few rows:")
print(df.head())

Breast Cancer Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 32 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       569 non-null    int64  
 1   diagnosis                569 non-null    object 
 2   radius_mean              569 non-null    float64
 3   texture_mean             569 non-null    float64
 4   perimeter_mean           569 non-null    float64
 5   area_mean                569 non-null    float64
 6   smoothness_mean          569 non-null    float64
 7   compactness_mean         569 non-null    float64
 8   concavity_mean           569 non-null    float64
 9   concave points_mean      569 non-null    float64
 10  symmetry_mean            569 non-null    float64
 11  fractal_dimension_mean   569 non-null    float64
 12  radius_se                569 non-null    float64
 13  texture_se               569 non-null    float64
 14

 ### Drop Irrelevant Columns

 - `id` column is irrelevant for modeling and should be removed.

In [30]:
df = drop_columns(df, ["id"])
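
The source of `drop_columns` is not shown. A minimal pandas sketch of the same idea, using a hypothetical helper `drop_existing_columns` that complains if a requested column is absent, might be:

```python
import pandas as pd


def drop_existing_columns(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: drop the named columns, returning a new DataFrame."""
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return df.drop(columns=columns)


# Toy values, not rows from the dataset.
sample = pd.DataFrame({"id": [1, 2], "diagnosis": ["M", "B"], "radius_mean": [1.5, 2.5]})
trimmed = drop_existing_columns(sample, ["id"])
print(list(trimmed.columns))  # → ['diagnosis', 'radius_mean']
```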

 ### Encode Target Column

 - `diagnosis` is the target column (object type, 'M' for malignant, 'B' for benign).

 - Map 'M' to 1 and 'B' to 0 for binary classification.

In [31]:
df["diagnosis"] = df["diagnosis"].map({"M": 1, "B": 0}).astype(int)
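
`Series.map` turns any value missing from the dictionary into `NaN`, and `.astype(int)` then raises an error. The sketch below, on a toy label series that deliberately contains an unexpected value, shows one way to surface such labels before casting.

```python
import pandas as pd

# Toy labels; "?" is deliberately not covered by the mapping.
labels = pd.Series(["M", "B", "B", "M", "?"], name="diagnosis")
mapped = labels.map({"M": 1, "B": 0})

# Values missing from the mapping become NaN; list them before casting to int.
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)  # → ['?']

# Dropping the unmapped rows is one option; raising an error is another.
clean = mapped.dropna().astype(int)
print(clean.tolist())  # → [1, 0, 0, 1]
```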

 ### Handle Missing Values

 - From `df.info()`, there are no missing values (569 non-null entries per column).

 - Apply the function for completeness.

In [32]:
df = handle_missing_values(df, strategy="mean")

 ### Split Features and Target

 - **Target**: `diagnosis` (now int, binary for classification).

 - **Features**: All other columns (30 numerical features like `radius_mean`, `texture_mean`, etc.).

In [33]:
target_column = "diagnosis"
X, y = feature_target_split(df, target_column)

 ### Encode Categorical Columns in Features

 - No categorical columns remain in the features: after dropping `id` and separating out `diagnosis`, all 30 columns are float64.

 - Apply the function as a safeguard for future datasets.

In [34]:
X = encode_categorical(X)  # No-op in this case, but ensures robustness
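
The claim that no categorical features remain can be checked directly rather than assumed. A small sketch using `DataFrame.select_dtypes` on toy data:

```python
import pandas as pd

# Toy values, not rows from the dataset.
features = pd.DataFrame({"radius_mean": [1.5, 2.5], "texture_mean": [3.0, 4.0]})

# List every column whose dtype is not numeric; an empty list means
# there is nothing left to encode.
non_numeric = features.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # → []
```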

 ### Normalize Features

 - Normalize only the feature columns (X) to ensure consistent scale.

 - Do not normalize the target (`diagnosis`) as it’s a binary label.

In [35]:
X = normalize(X)
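
In this notebook `normalize` runs before `split_data`. If `normalize` computes per-column statistics (its source is not shown), those statistics include rows that later end up in the test set. A common alternative, sketched here with synthetic data and z-score scaling as an assumed technique, is to compute the statistics on the training rows only and reuse them for the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=10.0, scale=3.0, size=(100, 4))  # synthetic stand-in
X_train_demo, X_test_demo = X_demo[:80], X_demo[80:]

# Statistics come from the training rows only.
mean = X_train_demo.mean(axis=0)
std = X_train_demo.std(axis=0)
std[std == 0] = 1.0  # guard against constant columns

# The same training statistics are applied to both partitions.
X_train_scaled = (X_train_demo - mean) / std
X_test_scaled = (X_test_demo - mean) / std

print(np.allclose(X_train_scaled.mean(axis=0), 0.0))  # → True
```

The scaled test set will generally not have exactly zero mean, which is expected: it is treated as unseen data.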

 ### Convert to NumPy Arrays

 - Convert features and target to NumPy arrays for model compatibility.

In [36]:
X = X.to_numpy()
y = y.to_numpy()

 ### Split into Training and Test Sets

 - Split data into 80% training and 20% testing for model evaluation.

In [37]:
X_train, X_test, y_train, y_test = split_data(X, y, test_size=0.2)
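
For classification, a purely random split can leave the malignant/benign ratio different in the training and test sets. Whether `split_data` stratifies is not stated. The sketch below shows a stratified split with a hypothetical helper `stratified_split`, which holds out the same fraction of each class.

```python
import numpy as np


def stratified_split(X, y, test_size=0.2, seed=0):
    """Hypothetical helper: hold out test_size of each class separately."""
    rng = np.random.default_rng(seed)
    test_idx = []
    for label in np.unique(y):
        # Shuffle the row indices belonging to this class only.
        class_idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(len(class_idx) * test_size))
        test_idx.extend(class_idx[:n_test])
    test_mask = np.zeros(len(y), dtype=bool)
    test_mask[test_idx] = True
    return X[~test_mask], X[test_mask], y[~test_mask], y[test_mask]


# Toy labels: 60 of class 0 and 40 of class 1.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(200, dtype=float).reshape(100, 2)
X_tr, X_te, y_tr, y_te = stratified_split(X_demo, y_demo, test_size=0.2, seed=1)
print(y_tr.mean(), y_te.mean())  # → 0.4 0.4
```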

 ### Save Processed Data

 - Save the processed arrays to the `../data/processed/` directory for use in training scripts.

In [38]:
np.save("../data/processed/breast_cancer_X_train.npy", X_train)
np.save("../data/processed/breast_cancer_X_test.npy", X_test)
np.save("../data/processed/breast_cancer_y_train.npy", y_train)
np.save("../data/processed/breast_cancer_y_test.npy", y_test)

print("\nBreast Cancer data processed and saved.")


Breast Cancer data processed and saved.
