# Task 1: Data Preprocessing & Exploration

#### Our goal is to prepare and understand the given dataset to build a solid foundation for the model


### 0 Setup

In [141]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

# Plotting settings
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

print("Libraries imported successfully!")
print(f"Random seed set to: {RANDOM_SEED}")

Libraries imported successfully!
Random seed set to: 42


In [142]:
df = pd.read_csv('../data/raw/diabetes_dataset.csv')
print(df.shape)
print(df.columns) # Display column names (our features)
print(df.dtypes)
df.head(10) # Display first 10 records

(100000, 31)
Index(['age', 'gender', 'ethnicity', 'education_level', 'income_level',
       'employment_status', 'smoking_status', 'alcohol_consumption_per_week',
       'physical_activity_minutes_per_week', 'diet_score',
       'sleep_hours_per_day', 'screen_time_hours_per_day',
       'family_history_diabetes', 'hypertension_history',
       'cardiovascular_history', 'bmi', 'waist_to_hip_ratio', 'systolic_bp',
       'diastolic_bp', 'heart_rate', 'cholesterol_total', 'hdl_cholesterol',
       'ldl_cholesterol', 'triglycerides', 'glucose_fasting',
       'glucose_postprandial', 'insulin_level', 'hba1c', 'diabetes_risk_score',
       'diabetes_stage', 'diagnosed_diabetes'],
      dtype='object')
age                                     int64
gender                                 object
ethnicity                              object
education_level                        object
income_level                           object
employment_status                      object
smoking_status     

Unnamed: 0,age,gender,ethnicity,education_level,income_level,employment_status,smoking_status,alcohol_consumption_per_week,physical_activity_minutes_per_week,diet_score,...,hdl_cholesterol,ldl_cholesterol,triglycerides,glucose_fasting,glucose_postprandial,insulin_level,hba1c,diabetes_risk_score,diabetes_stage,diagnosed_diabetes
0,58,Male,Asian,Highschool,Lower-Middle,Employed,Never,0,215,5.7,...,41,160,145,136,236,6.36,8.18,29.6,Type 2,1
1,48,Female,White,Highschool,Middle,Employed,Former,1,143,6.7,...,55,50,30,93,150,2.0,5.63,23.0,No Diabetes,0
2,60,Male,Hispanic,Highschool,Middle,Unemployed,Never,1,57,6.4,...,66,99,36,118,195,5.07,7.51,44.7,Type 2,1
3,74,Female,Black,Highschool,Low,Retired,Never,0,49,3.4,...,50,79,140,139,253,5.28,9.03,38.2,Type 2,1
4,46,Male,White,Graduate,Middle,Retired,Never,1,109,7.2,...,52,125,160,137,184,12.74,7.2,23.5,Type 2,1
5,46,Female,White,Highschool,Upper-Middle,Employed,Never,2,124,9.0,...,61,119,179,100,133,8.77,6.03,23.5,Pre-Diabetes,0
6,75,Female,White,Graduate,Upper-Middle,Retired,Never,0,53,9.2,...,46,161,155,101,100,10.14,5.24,36.1,Pre-Diabetes,0
7,62,Male,White,Postgraduate,Middle,Unemployed,Current,1,75,4.1,...,49,159,120,110,189,8.96,7.04,34.2,Type 2,1
8,42,Male,Black,Highschool,Lower-Middle,Employed,Current,1,114,6.7,...,33,132,98,116,172,5.7,6.9,26.7,Type 2,1
9,59,Female,White,Graduate,Middle,Employed,Current,3,86,8.2,...,52,103,104,76,109,4.49,4.99,30.0,No Diabetes,0


### 1.1 Cleaning Process

We start by displaying the dataset before any changes are applied to it. The provided [dataset](https://www.kaggle.com/datasets/mohankrishnathalla/diabetes-health-indicators-dataset/data) is already preprocessed and clean, so a model could work with it as-is. We still run the full cleaning process so that the same steps apply if we add further datasets later. There is one issue: the feature *diabetes_stage* is not needed for our prediction model and effectively gives away the answer, so we drop it.
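
A quick way to see why *diabetes_stage* gives away the answer is to cross-tabulate it against the target. The sketch below uses only the ten rows printed by `df.head(10)` above, so it illustrates the check rather than proving it for the full dataset; the names `toy`, `leak_table`, and `stage_is_deterministic` are illustrative and not part of the notebook.

```python
import pandas as pd

# The diabetes_stage / diagnosed_diabetes pairs from the ten rows shown by df.head(10).
toy = pd.DataFrame({
    "diabetes_stage": ["Type 2", "No Diabetes", "Type 2", "Type 2", "Type 2",
                       "Pre-Diabetes", "Pre-Diabetes", "Type 2", "Type 2", "No Diabetes"],
    "diagnosed_diabetes": [1, 0, 1, 1, 1, 0, 0, 1, 1, 0],
})

# Count how often each stage co-occurs with each target value.
leak_table = pd.crosstab(toy["diabetes_stage"], toy["diagnosed_diabetes"])
print(leak_table)

# The stage "leaks" the target if every stage only ever appears with one target value.
stage_is_deterministic = bool((leak_table > 0).sum(axis=1).eq(1).all())
print(f"diabetes_stage fully determines the target in this sample: {stage_is_deterministic}")
```

Running the same `pd.crosstab` on the full `df` would show whether the pattern holds for all 100,000 records.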





In [105]:
print("Dataset before cleaning:")
for column in df.columns:
    if np.issubdtype(df[column].dtype, np.number):
        print(f"{column:<40} Mean: {df[column].mean():>8.3f} | Median: {df[column].median():>8.3f}")

Dataset before cleaning:
age                                      Mean:   50.120 | Median:   50.000
alcohol_consumption_per_week             Mean:    2.004 | Median:    2.000
physical_activity_minutes_per_week       Mean:  118.912 | Median:  100.000
diet_score                               Mean:    5.995 | Median:    6.000
sleep_hours_per_day                      Mean:    6.998 | Median:    7.000
screen_time_hours_per_day                Mean:    5.996 | Median:    6.000
family_history_diabetes                  Mean:    0.219 | Median:    0.000
hypertension_history                     Mean:    0.251 | Median:    0.000
cardiovascular_history                   Mean:    0.079 | Median:    0.000
bmi                                      Mean:   25.613 | Median:   25.600
waist_to_hip_ratio                       Mean:    0.856 | Median:    0.860
systolic_bp                              Mean:  115.800 | Median:  116.000
diastolic_bp                             Mean:   75.232 | Median:   75.000


In [143]:
# Handling Null Values (empty records)
df_null = df.isnull() # bool
total_null = df_null.sum().sum() # count

print(f"Total Before:   {total_null} null") # Summing up null values for each column before handling
print(df.shape)

df_cleaned = df.copy()

# Handling Outliers
exclude_columns = ['family_history_diabetes','hypertension_history','cardiovascular_history','diagnosed_diabetes'] # Categorical columns to exclude (These are True/False binary values)
for col in exclude_columns:
    df_cleaned[col] = df_cleaned[col].clip(0,1) # reassigns any outliers to the closest bounds (0 or 1)

allowed_columns = df_cleaned.select_dtypes(include=[np.number]).columns.difference(exclude_columns) # columns to check for outliers

# Using IQR https://www.kaggle.com/code/aman2626786/outlier-removal-using-iqr-method
# In this run, the full cleaning step removes about 11k of the original 100k records (see the shapes printed below)
for col in allowed_columns:
    p25 = df_cleaned[col].quantile(0.25)
    p75 = df_cleaned[col].quantile(0.75)
    iqr = p75 - p25
    lower_bound = p25 - 1.5 * iqr
    upper_bound = p75 + 1.5 * iqr
    df_cleaned = df_cleaned[(df_cleaned[col] >= lower_bound) & (df_cleaned[col] <= upper_bound)] # Remove outliers

# Handling duplicate and empty records
df_cleaned = df_cleaned.drop_duplicates().dropna().drop('diabetes_stage', axis=1) # Drop duplicates, empty records, and 'diabetes_stage' column


# Impute any remaining nulls. After dropna() above this loop is a no-op;
# it is kept as a safeguard in case dropna() is removed for future datasets.
for column in df_cleaned.columns:
    null_count = df_cleaned[column].isnull().sum()
    if null_count > 0:
        if pd.api.types.is_numeric_dtype(df_cleaned[column]): # Numerical variables: fill with the median
            print(f"{column} contains {null_count} null values.")
            df_cleaned[column] = df_cleaned[column].fillna(df_cleaned[column].median())
        else:
            df_cleaned[column] = df_cleaned[column].fillna("UNKNOWN") # Categorical variables

cleandf_null = df_cleaned.isnull() # bool

print(f"Total After:    {cleandf_null.sum().sum()} null") # Summing up null values for each column
print(df_cleaned.shape)

Total Before:   0 null
(100000, 31)
Total After:    0 null
(88637, 30)
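
The IQR rule used in the cleaning cell can be sketched on a small toy frame. This is a minimal illustration, not the notebook's exact procedure: the helpers `iqr_bounds` and `iqr_mask` and the toy values are hypothetical, and all fences are computed once on the unfiltered data.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences of the IQR rule for one column."""
    p25, p75 = series.quantile(0.25), series.quantile(0.75)
    iqr = p75 - p25
    return p25 - k * iqr, p75 + k * iqr

def iqr_mask(frame: pd.DataFrame, columns, k: float = 1.5) -> pd.Series:
    """True for rows that lie inside the fences of every listed column."""
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        lower, upper = iqr_bounds(frame[col], k)
        keep &= frame[col].between(lower, upper)  # inclusive on both ends
    return keep

# Toy data (hypothetical values) with one obvious outlier per column.
toy = pd.DataFrame({
    "bmi": [22.0, 24.0, 25.0, 26.0, 27.0, 28.0, 60.0],
    "heart_rate": [200, 70, 72, 68, 75, 66, 74],
})

mask = iqr_mask(toy, ["bmi", "heart_rate"])
print(iqr_bounds(toy["bmi"]))
print(toy[mask])
```

Note that the loop in the cleaning cell recomputes the quartiles on the already-filtered frame for each column, so the rows it removes can differ slightly from this all-at-once version and can depend on column order.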


In [133]:
print("Dataset after cleaning:")
for column in df_cleaned.columns:
    if np.issubdtype(df_cleaned[column].dtype, np.number):
        print(f"{column:<40} Mean: {df_cleaned[column].mean():>8.3f} | Median: {df_cleaned[column].median():>8.3f}")

Dataset after cleaning:
age                                      Mean:   49.764 | Median:   50.000
gender                                   Mean:    0.518 | Median:    0.000
ethnicity                                Mean:    2.532 | Median:    3.000
education_level                          Mean:    1.001 | Median:    1.000
income_level                             Mean:    2.499 | Median:    3.000
employment_status                        Mean:    0.697 | Median:    0.000
smoking_status                           Mean:    1.399 | Median:    2.000
alcohol_consumption_per_week             Mean:    1.979 | Median:    2.000
physical_activity_minutes_per_week       Mean:  110.947 | Median:   97.000
diet_score                               Mean:    6.026 | Median:    6.000
sleep_hours_per_day                      Mean:    7.000 | Median:    7.000
screen_time_hours_per_day                Mean:    5.966 | Median:    6.000
family_history_diabetes                  Mean:    0.211 | Median:    0.000
h

### 1.2 Feature Engineering (WIP)

To make our dataset understandable to machine learning algorithms, we must encode its categorical features. Categorical variables are stored as data types that most algorithms cannot use directly (e.g., strings or descriptive labels). There are two types: *Ordinal* and *Nominal*.

**Ordinal Data** - 
    Values that carry an "order" or "hierarchy". For example, the feature *education_level* is ordered, much like our own progression from elementary school to university.

**Nominal Data** - 
    Values that carry no order. For example, the feature *gender* has options such as Male or Female, with no ranking between them.
    
Encoding properly is vital for our prediction model: the wrong encoding can introduce bias by implying an order between nominal values, or discard useful information by ignoring the order between ordinal ones. There are several approaches to encoding; we selected *One Hot Encoding* and *Ordinal Encoding*:

| Column Name          | Data Type  | Categorical Type | Encoding Approach |
|----------------------|------------|------------------|-------------------|
| gender               | object     | nominal          | one hot encoding  |
| ethnicity            | object     | nominal          | one hot encoding  |
| education_level      | object     | ordinal          | ordinal encoding  |
| income_level         | object     | ordinal          | ordinal encoding  |
| employment_status    | object     | nominal          | one hot encoding  |
| smoking_status       | object     | ordinal          | ordinal encoding  |


In [150]:
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

nominal_columns = ['gender', 'ethnicity', 'employment_status']
ordinal_columns = ['education_level', 'income_level', 'smoking_status']
ordinal_categories = [
    ['No formal', 'Highschool', 'Graduate', 'Postgraduate'],  # education_level
    ['Low', 'Lower-Middle', 'Middle', 'Upper-Middle', 'High'],  # income_level
    ['Never', 'Former', 'Current']  # smoking_status
]

# One Hot Encoding
df_feature = df_cleaned.copy()
df_feature = pd.get_dummies(df_feature, columns=nominal_columns,drop_first=True,dtype=int)

# Ordinal Encoding
encoder = OrdinalEncoder(categories=ordinal_categories)
df_feature[ordinal_columns] = encoder.fit_transform(df_feature[ordinal_columns])


In [None]:
# implementation to print df_feature w/ encoded columns
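
One possible way to fill in this placeholder is to print each ordinal column's category-to-code mapping alongside the encoded frame. The sketch below is self-contained and runs on a three-row toy frame (hypothetical values) rather than on `df_feature`; it repeats the encoding steps from the previous cell so the mapping can be inspected in isolation.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for df_cleaned (hypothetical rows, same category labels as the dataset).
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "education_level": ["Highschool", "Graduate", "Postgraduate"],
    "income_level": ["Lower-Middle", "Middle", "Low"],
    "smoking_status": ["Never", "Former", "Current"],
})

ordinal_columns = ["education_level", "income_level", "smoking_status"]
ordinal_categories = [
    ["No formal", "Highschool", "Graduate", "Postgraduate"],
    ["Low", "Lower-Middle", "Middle", "Upper-Middle", "High"],
    ["Never", "Former", "Current"],
]

# One hot encode the nominal column, then ordinal encode the ordered ones.
encoded = pd.get_dummies(toy, columns=["gender"], drop_first=True, dtype=int)
encoder = OrdinalEncoder(categories=ordinal_categories)
encoded[ordinal_columns] = encoder.fit_transform(encoded[ordinal_columns])

# Show which integer code each category received.
for col, cats in zip(ordinal_columns, encoder.categories_):
    print(col, {cat: code for code, cat in enumerate(cats)})

print(encoded.head())
```

In the notebook itself, the same loop over `encoder.categories_` followed by `df_feature.head()` should give the equivalent view for the real data.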

#### Creating Derived Features

Within these categorical variables there are two different ways we can encode the data: One Hot Encoding and Ordinal Encoding. Specifically, *education_level*, *income_level*, and *smoking_status* come with a hierarchical order, so they are ordinal encoded.

### 1.3 Exploratory Data Analysis

### 1.4 Train / Test Split
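
This section is still empty, so the following is only a sketch of how the split could look with the `train_test_split` and `StandardScaler` imports from the setup cell. It assumes a stratified 80/20 split on `diagnosed_diabetes` and a scaler fitted on the training set only; the toy frame, the 80/20 ratio, and all variable names are assumptions, not decisions taken in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

RANDOM_SEED = 42
rng = np.random.default_rng(RANDOM_SEED)

# Hypothetical stand-in for df_feature: two numeric features and a binary target.
toy = pd.DataFrame({
    "bmi": rng.normal(25.6, 3.5, size=200),
    "age": rng.integers(18, 90, size=200),
    "diagnosed_diabetes": np.repeat([0, 1], [80, 120]),
})

X = toy.drop(columns="diagnosed_diabetes")
y = toy["diagnosed_diabetes"]

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_SEED
)

# Fit the scaler on the training data only, then apply it to the test data,
# so no information from the test set influences the scaling parameters.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
print(f"Positive rate: train={y_train.mean():.2f}, test={y_test.mean():.2f}")
```

Stratifying matters most when the classes are imbalanced; without it, a random split can leave the test set with a noticeably different positive rate than the training set.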

# Task 2 ?
### Same notebook or a separate one?