## Data Preprocessing - Diabetes Dataset
### Introduction
This notebook performs **data preprocessing** for the diabetes dataset analyzed in `02_exploratory_data_analysis.ipynb`. Based on the insights from the exploratory phase, we will clean and transform the raw dataset to make it suitable for feature engineering and predictive modeling.

**Dataset:** Diabetes Dataset (Kaggle)

**Objective:** Produce a clean dataset that can be directly used in the feature engineering stage.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**Preprocessing Steps:**
1. Import Libraries and Load Data
2. Missing Values Handling
3. Outliers Handling
4. Train-Test Split
5. Feature Scaling
6. Save Processed Data

### 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from collections import Counter
import joblib

plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

df = pd.read_csv("../data/raw/diabetes.csv")
df.columns = df.columns.str.strip()

print(f"Dataset loaded: {df.shape[0]} rows and {df.shape[1]} columns")
print(f"Dataset size: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

df_original = df.copy()

df.head()

### 2. Missing Values Handling

In [None]:
zero_cols = ["Glucose","BloodPressure","SkinThickness","Insulin","BMI"]
df[zero_cols] = df[zero_cols].replace(0, np.nan)
summary = pd.DataFrame({
    "Missing Count": df[zero_cols].isnull().sum(),
    "Median Used": [df[col].median() for col in zero_cols]
})

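# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which may not update df under pandas copy-on-write.
for col in zero_cols:
    df[col] = df[col].fillna(df[col].median())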

print("Missing Values Handling Summary:")
print(summary)

### 3. Outliers Handling

In [None]:
features = ["Glucose", "BloodPressure", "BMI", "Age"]
for col in features:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower, upper = Q1 - 1.5*IQR, Q3 + 1.5*IQR
    outliers = ((df[col] < lower) | (df[col] > upper)).sum()
    print(f"{col}: {outliers} outliers")

plt.figure(figsize=(12, 8))
for i, col in enumerate(features, 1):
    plt.subplot(2, 2, i)
    sns.boxplot(y=df[col], color='lightpink')
    plt.title(f'{col} – Boxplot')
    plt.ylabel(col)
plt.suptitle("Outlier Visualization via Boxplots", y=1.02)
plt.tight_layout()
plt.show()

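# Create the output directory first so that to_csv does not fail on a fresh checkout.
import os
os.makedirs('../data/processed', exist_ok=True)
df.to_csv('../data/processed/cleaned_data.csv', index=False)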
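
The cell above only counts and plots outliers; it does not modify them. If a later stage needs them handled, one common option is to clip values to the IQR fences. The sketch below is not part of this notebook's pipeline: `iqr_bounds` and `clip_outliers` are hypothetical helper names, and the toy values are invented for illustration.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clip_outliers(frame: pd.DataFrame, columns: list[str], k: float = 1.5) -> pd.DataFrame:
    """Return a copy of `frame` with each listed column clipped to its IQR fences."""
    clipped = frame.copy()
    for col in columns:
        lower, upper = iqr_bounds(clipped[col], k)
        clipped[col] = clipped[col].clip(lower=lower, upper=upper)
    return clipped


# Toy values invented for illustration; 300 is the only point outside the fences.
toy = pd.DataFrame({"Glucose": [90.0, 100.0, 110.0, 120.0, 300.0]})
toy_clipped = clip_outliers(toy, ["Glucose"])
print(toy_clipped["Glucose"].tolist())
```

Clipping keeps every row, which matters on a small dataset, but it also flattens genuinely extreme measurements. Whether that trade-off is acceptable depends on the model used downstream.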

### 4. Train-Test Split

In [None]:
X = df.drop(columns='Outcome')
y = df['Outcome']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")
print(f"Train class distribution: {Counter(y_train)}")
print(f"Test class distribution:  {Counter(y_test)}")

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

sns.countplot(x=y_train, ax=axes[0], palette='Set2')
axes[0].set_title("Train Class Distribution")

sns.countplot(x=y_test, ax=axes[1], palette='Set2')
axes[1].set_title("Test Class Distribution")

plt.tight_layout()
plt.show()
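
In this notebook the medians are computed on the full dataset in step 2, before the split. A stricter variant, shown here only as a sketch and not as what the notebook does, learns the medians from the training rows and reuses them on the test rows. It relies on scikit-learn's `SimpleImputer`; the toy frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame invented for illustration; zeros stand for missing measurements.
toy = pd.DataFrame({
    "Glucose": [0, 85, 90, 100, 110, 120, 130, 0, 140, 150],
    "BMI": [22.0, 0.0, 25.0, 27.0, 0.0, 30.0, 31.0, 33.0, 35.0, 36.0],
    "Outcome": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
zero_cols = ["Glucose", "BMI"]
toy[zero_cols] = toy[zero_cols].replace(0, np.nan)

X = toy.drop(columns="Outcome")
y = toy["Outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit on the training rows only, then apply the same medians to the test rows.
imputer = SimpleImputer(strategy="median")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(dict(zip(X_train.columns, imputer.statistics_)))
```

The fitted imputer could be saved alongside the scaler so that new data goes through the same transformation.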

### 5. Feature Scaling

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

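# Keep the original row index so the scaled frames stay aligned with y_train / y_test.
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)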

print("Scaling completed")
print(f"Train scaled mean: {X_train_scaled.mean().round(3).values}")
# StandardScaler divides by the population std (ddof=0); pandas defaults to ddof=1.
print(f"Train scaled std: {X_train_scaled.std(ddof=0).round(3).values}")
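
A note on checking the scaled standard deviation: `StandardScaler` normalizes by the population standard deviation (`ddof=0`), while pandas' `.std()` defaults to the sample standard deviation (`ddof=1`). The small sketch below, on invented toy values, shows the two estimates side by side.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values invented for illustration.
toy = pd.DataFrame({"Age": [21.0, 30.0, 45.0, 50.0, 64.0]})
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

pop_std = scaled["Age"].std(ddof=0)  # population std, the quantity StandardScaler sets to 1
sample_std = scaled["Age"].std()     # pandas default (ddof=1), larger by sqrt(n / (n - 1))

print(f"ddof=0: {pop_std:.4f}, ddof=1: {sample_std:.4f}")
```

With `n` rows the two differ by a factor of `sqrt(n / (n - 1))`, so the gap shrinks as the training set grows.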

### 6. Save Processed Data

In [None]:
joblib.dump(scaler, '../data/processed/scaler.pkl')

X_train_scaled.to_csv('../data/processed/X_train_scaled.csv', index=False)
X_test_scaled.to_csv('../data/processed/X_test_scaled.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)
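
A later notebook would typically reload these artifacts with `joblib.load` and `pd.read_csv`. The sketch below shows that round trip on invented toy data, writing to a temporary directory that stands in for `../data/processed/`. The file names match the ones above; everything else is an assumption made for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training data invented for illustration.
X_train = pd.DataFrame({"Glucose": [90.0, 110.0, 130.0], "BMI": [22.0, 28.0, 34.0]})
scaler = StandardScaler().fit(X_train)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # stands in for ../data/processed/

    # Save the fitted scaler and the scaled features, as in the cell above.
    joblib.dump(scaler, out_dir / "scaler.pkl")
    pd.DataFrame(scaler.transform(X_train), columns=X_train.columns).to_csv(
        out_dir / "X_train_scaled.csv", index=False
    )

    # Reload both artifacts, as a downstream notebook might.
    loaded_scaler = joblib.load(out_dir / "scaler.pkl")
    X_train_scaled = pd.read_csv(out_dir / "X_train_scaled.csv")

# The reloaded scaler can map scaled values back to the original units.
recovered = loaded_scaler.inverse_transform(X_train_scaled.to_numpy())
print(np.allclose(recovered, X_train.to_numpy()))
```

Pickled scikit-learn objects are generally only safe to load with the same library version that wrote them, so recording the version next to `scaler.pkl` can save trouble later.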

### Conclusion
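The preprocessing pipeline replaced zeros that encode missing values with column medians, counted and visualized IQR outliers without removing them, produced a stratified 80/20 train/test split, and standardized the features. The StandardScaler was fitted exclusively on the training data to prevent leakage from the test set; the imputation medians, by contrast, were computed on the full dataset before the split. The cleaned dataset, the scaled train/test sets, and the fitted scaler were saved to `data/processed/` for use in the feature engineering and modeling stages.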