## Feature Engineering - Diabetes Dataset
### Introduction
This notebook focuses on **feature engineering** for the diabetes prediction project. Using the cleaned dataset from `02_data_preprocessing.ipynb`, we will create informative features that enhance model performance and interpretability.

**Dataset:** Diabetes Dataset (Kaggle)

**Objective:** Engineer new features, scale and encode them, and rank them by importance so that the data is ready for modeling.

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**Feature Engineering Steps:**
1. Import Libraries and Load Data
2. Feature Creation
3. Feature Transformation
4. Encoding Categorical Features
5. Feature Selection
6. Final Feature Set

### 1. Import Libraries and Load Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel

plt.style.use('default')
sns.set_palette("husl")
%matplotlib inline

df = pd.read_csv("../data/processed/cleaned_data.csv")

print(f"Cleaned dataset loaded: {df.shape[0]} rows and {df.shape[1]} columns")
print(f"Dataset size: {df.memory_usage(deep=True).sum() / 1024:.1f} KB")

df.head()

### 2. Feature Creation

In [None]:
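def create_features(df):
    """Return a copy of df with binned and ratio features added."""
    df = df.copy()

    # Age bands. right=False makes each bin left-closed, e.g. [20, 30),
    # so ages outside [20, 100) become NaN.
    df['AgeGroup'] = pd.cut(df['Age'],
                            bins=[20, 30, 40, 50, 60, 100],
                            labels=['20-29', '30-39', '40-49', '50-59', '60+'],
                            right=False)

    # BMI bands at the usual 18.5 / 25 / 30 cut-offs, also left-closed.
    df['BMI_Category'] = pd.cut(df['BMI'],
                                bins=[0, 18.5, 25, 30, 100],
                                labels=['Underweight', 'Normal', 'Overweight', 'Obese'],
                                right=False)

    # Insulin-to-glucose ratio. A glucose value of 0 is replaced by NaN so the
    # division yields NaN rather than inf.
    df['Insulin_Glucose_Ratio'] = df['Insulin'] / df['Glucose'].replace(0, np.nan)

    return df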

df = create_features(df)
X = df.drop('Outcome', axis=1)
y = df['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f'Train shape: {X_train.shape}, Test shape: {X_test.shape}')
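
To make the binning convention concrete, here is a small sketch on made-up values (they are not taken from the dataset). It reuses the same bin edges and labels as `create_features` and shows how values that sit exactly on an edge are assigned when `right=False`, and what the ratio does when glucose is 0.

```python
import numpy as np
import pandas as pd

# Toy values chosen to sit on bin edges; illustrative only, not real records.
toy = pd.DataFrame({"Age": [21, 30, 59, 60, 100],
                    "BMI": [18.4, 18.5, 25.0, 29.9, 30.0]})

age_group = pd.cut(toy["Age"],
                   bins=[20, 30, 40, 50, 60, 100],
                   labels=["20-29", "30-39", "40-49", "50-59", "60+"],
                   right=False)
bmi_cat = pd.cut(toy["BMI"],
                 bins=[0, 18.5, 25, 30, 100],
                 labels=["Underweight", "Normal", "Overweight", "Obese"],
                 right=False)

# An edge value belongs to the bin that starts at it; 100 is outside [60, 100).
print(age_group.tolist())  # ['20-29', '30-39', '50-59', '60+', nan]
print(bmi_cat.tolist())    # ['Underweight', 'Normal', 'Overweight', 'Overweight', 'Obese']

# Ratio with a zero denominator replaced by NaN: the second entry is NaN, not inf.
ratio = pd.Series([100.0, 50.0]) / pd.Series([200.0, 0.0]).replace(0, np.nan)
print(ratio.tolist())      # [0.5, nan]
```

Any NaN produced this way would need to be handled before fitting a model that does not accept missing values.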

### 3. Feature Transformation

In [None]:
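num_cols = ['Pregnancies','Glucose','BloodPressure','SkinThickness',
            'Insulin','BMI','DiabetesPedigreeFunction','Age','Insulin_Glucose_Ratio']

# Work on explicit copies so the column assignments below do not raise
# SettingWithCopyWarning on the frames returned by train_test_split.
X_train, X_test = X_train.copy(), X_test.copy()

# Fit the scaler on the training split only, then reuse its mean and standard
# deviation on the test split to avoid leaking test statistics.
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])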
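
The sketch below illustrates the fit-on-train, transform-on-test pattern on toy numbers. The notebook imports `joblib` without showing it in use; one plausible use, assumed here, is persisting the fitted scaler so that later notebooks apply the same transform. The frames, the `demo_scaler` name, and the `scaler.joblib` file name are all hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy train/test frames; the values are made up for illustration.
train = pd.DataFrame({"Glucose": [90.0, 110.0, 130.0], "BMI": [22.0, 28.0, 34.0]})
test = pd.DataFrame({"Glucose": [110.0, 150.0], "BMI": [28.0, 40.0]})

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean/std from train only
test_scaled = demo_scaler.transform(test)        # reuses the train statistics

# Train columns are centred on 0; the first test row equals the train mean,
# so it maps to zeros, while the second row lands above zero.
print(train_scaled.mean(axis=0))
print(test_scaled)

# Persist the fitted scaler (hypothetical file name, written to a temp dir here).
path = os.path.join(tempfile.mkdtemp(), "scaler.joblib")
joblib.dump(demo_scaler, path)
reloaded = joblib.load(path)
print(np.allclose(reloaded.transform(test), test_scaled))  # True
```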

### 4. Encoding Categorical Features

In [None]:
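def encode_categorical(df, categories=None):
    """One-hot encode the binned columns, optionally aligning to a reference column list."""
    # drop_first=True removes one level per feature to avoid redundant columns.
    df_encoded = pd.get_dummies(df, columns=['AgeGroup', 'BMI_Category'], drop_first=True)

    if categories is not None:
        # Add any column seen in the reference set but missing here, filled with 0,
        # then reorder so the layout matches the reference exactly.
        missing_cols = set(categories) - set(df_encoded.columns)
        for col in missing_cols:
            df_encoded[col] = 0
        df_encoded = df_encoded[categories]
    return df_encoded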

X_train_encoded = encode_categorical(X_train)
X_test_encoded = encode_categorical(X_test, categories=X_train_encoded.columns)
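
As a rough illustration of why the test columns are aligned to the training columns, the sketch below encodes two toy frames whose observed levels differ. With plain string columns `get_dummies` only creates columns for values it actually sees, so the layouts diverge; `reindex` with `fill_value=0` is one compact way to realign them. Columns produced by `pd.cut` are Categorical, in which case all declared levels are encoded and the alignment acts mainly as a safeguard. All names and values here are illustrative.

```python
import pandas as pd

# Plain string columns (not Categorical): only observed values get a dummy column.
train = pd.DataFrame({"BMI_Category": ["Normal", "Obese", "Overweight"]})
test = pd.DataFrame({"BMI_Category": ["Normal", "Normal"]})

train_enc = pd.get_dummies(train, columns=["BMI_Category"], drop_first=True)
test_enc = pd.get_dummies(test, columns=["BMI_Category"], drop_first=True)

print(list(train_enc.columns))  # ['BMI_Category_Obese', 'BMI_Category_Overweight']
print(list(test_enc.columns))   # []

# Align test to the training layout; absent dummy columns are filled with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.shape)       # (2, 2)

# With a Categorical dtype, every declared level is encoded even if unobserved.
levels = ["Underweight", "Normal", "Overweight", "Obese"]
test_cat = test.astype({"BMI_Category": pd.CategoricalDtype(levels)})
test_cat_enc = pd.get_dummies(test_cat, columns=["BMI_Category"], drop_first=True)
print(list(test_cat_enc.columns))
```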

### 5. Feature Selection

In [None]:
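# Fit a random forest on the training split to rank features by importance.
# This step only ranks features; none are dropped from the dataset.
rf = RandomForestClassifier(random_state=42, n_estimators=100)
rf.fit(X_train_encoded, y_train)

# Impurity-based importances, sorted from most to least important.
importances = pd.Series(rf.feature_importances_, index=X_train_encoded.columns).sort_values(ascending=False)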

plt.figure(figsize=(10, 6))
importances.head(10).plot(kind='barh', color='teal')
plt.title("Top 10 Feature Importances (Random Forest)")
plt.xlabel("Importance")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

print("\nTop 10 features:")
print(importances.head(10))
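
The cell above ranks features but keeps all of them. The notebook imports `SelectFromModel` without applying it; the sketch below shows one way it could turn the ranking into an actual selection. It runs on synthetic data from `make_classification`, and the `"median"` threshold is an assumption for illustration, not a setting taken from this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in data: 10 features, of which 3 carry signal.
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# prefit=True reuses the already fitted forest; threshold="median" keeps the
# features whose importance is at or above the median importance.
selector = SelectFromModel(forest, prefit=True, threshold="median")
mask = selector.get_support()          # boolean mask over the input columns
X_selected = selector.transform(X_demo)

print(f"Kept {mask.sum()} of {mask.size} features")
print("Selected column indices:", np.flatnonzero(mask))
```

On a DataFrame, the same mask can be used to index the column names, for example `X_train_encoded.columns[mask]`, provided the selector was built from a forest fitted on that frame.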

### 6. Final Feature Set

In [None]:
X_train_encoded.to_csv('../data/processed/X_train_final.csv', index=False)
X_test_encoded.to_csv('../data/processed/X_test_final.csv', index=False)
y_train.to_csv('../data/processed/y_train.csv', index=False)
y_test.to_csv('../data/processed/y_test.csv', index=False)
print(f"Final feature set shape: {X_train_encoded.shape}")

### Conclusion
Feature engineering produced 16 final features: 9 scaled numeric columns (the 8 original measurements plus Insulin_Glucose_Ratio) and 7 one-hot columns (4 from AgeGroup and 3 from BMI_Category, after dropping the first level of each). Random forest feature importances ranked Glucose, BMI, and Age as the most predictive features. The engineered train and test sets were saved to `data/processed/` and are ready for model training.