Objectives

- Understand key risk factors in diabetes data through visualizations.
- Identify patterns and relationships between features and the diabetes outcome.
- Spot common data issues (imbalance, outliers) and their impact on ML models.
- Learn domain-specific EDA techniques for healthcare datasets.
- Apply insights to inform feature engineering for predictive modeling.

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.read_csv("diabetes_prediction_dataset.csv")
df.head(10)

   gender   age  hypertension  heart_disease  smoking_history    bmi  HbA1c_level  blood_glucose_level  diabetes
0  Female  80.0             0              1            never  25.19          6.6                  140         0
1  Female  54.0             0              0          No Info  27.32          6.6                   80         0
2    Male  28.0             0              0            never  27.32          5.7                  158         0
3  Female  36.0             0              0          current  23.45          5.0                  155         0
4    Male  76.0             1              1          current  20.14          4.8                  155         0
5  Female  20.0             0              0            never  27.32          6.6                   85         0
6  Female  44.0             0              0            never  19.31          6.5                  200         1
7  Female  79.0             0              0          No Info  23.86          5.7                   85         0
8    Male  42.0             0              0            never  33.64          4.8                  145         0
9  Female  32.0             0              0            never  27.32          5.0                  100         0


In [7]:
# Data quality check: missing values, duplicate rows, and target balance

print(f"Missing Values:\n{df.isnull().sum()}")
print(f"\nDuplicate Rows: {df.duplicated().sum()}")
print(f"\nTarget Distribution:\n{df['diabetes'].value_counts()}")
print(f"\nTarget Balance: {(df['diabetes'].value_counts(normalize=True) * 100).round(2).to_dict()}")

Missing Values:
gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

Duplicate Rows: 3854

Target Distribution:
diabetes
0    91500
1     8500
Name: count, dtype: int64

Target Balance: {0: 91.5, 1: 8.5}
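
The check above reports 3,854 duplicate rows and a positive class of only 8.5%. The following is a minimal sketch of one way to handle both before modeling. It runs on a small invented frame (`toy`) rather than the real dataset, and `balanced_class_weights` is a hypothetical helper that applies the common "balanced" heuristic, n_samples / (n_classes * class_count). Whether duplicates should be dropped at all is a judgment call, since distinct patients can share identical values.

```python
import pandas as pd

# Toy stand-in for the real data; all values are invented for illustration.
toy = pd.DataFrame({
    "age":      [50, 50, 30, 61, 45, 45, 70, 22, 38, 55],
    "bmi":      [27.3, 27.3, 22.1, 31.0, 25.4, 25.4, 29.9, 20.5, 24.0, 33.2],
    "diabetes": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
})


def balanced_class_weights(target: pd.Series) -> dict:
    """Hypothetical helper: weight = n_samples / (n_classes * class_count)."""
    counts = target.value_counts()
    n_samples, n_classes = len(target), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Drop exact duplicate rows, keeping the first occurrence of each.
deduped = toy.drop_duplicates().reset_index(drop=True)

# Rarer classes receive larger weights, which many estimators can consume.
weights = balanced_class_weights(deduped["diabetes"])

print(f"Rows before: {len(toy)}, after: {len(deduped)}")
print(f"Class weights: {weights}")
```

Class weights of this kind are one option among several (resampling and threshold tuning are others); the sketch only shows the arithmetic.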


In [9]:
# Statistical Summary

df.describe()

             age  hypertension  heart_disease        bmi  HbA1c_level  blood_glucose_level  diabetes
count   100000.0      100000.0       100000.0   100000.0     100000.0             100000.0  100000.0
mean   41.885856       0.07485        0.03942  27.320767     5.527507            138.05806     0.085
std     22.51684       0.26315       0.194593   6.636783     1.070672            40.708136  0.278883
min         0.08           0.0            0.0      10.01          3.5                 80.0       0.0
25%         24.0           0.0            0.0      23.63          4.8                100.0       0.0
50%         43.0           0.0            0.0      27.32          5.8                140.0       0.0
75%         60.0           0.0            0.0      29.58          6.2                159.0       0.0
max         80.0           1.0            1.0      95.69          9.0                300.0       1.0
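
In the summary above, `bmi` has a maximum of 95.69 against a 75th percentile of 29.58, which suggests a long right tail. As a rough sketch, the 1.5 * IQR rule (Tukey fences) can turn the quartiles into outlier thresholds. The function name `iqr_fences` is hypothetical, the quartiles are copied from the table, and the five-value series reuses the min, quartiles, and max of `bmi` as stand-in data. The 1.5 multiplier is a convention, not a clinical cutoff.

```python
import pandas as pd


def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles for bmi, copied from the describe() output.
lower, upper = iqr_fences(q1=23.63, q3=29.58)
print(f"bmi fences: [{lower:.3f}, {upper:.3f}]")

# On real data the same rule becomes a boolean mask over the column.
# Stand-in values: min, 25%, 50%, 75%, and max of bmi from the table.
bmi = pd.Series([10.01, 23.63, 27.32, 29.58, 95.69], name="bmi")
outliers = bmi[(bmi < lower) | (bmi > upper)]
print(outliers.tolist())
```

With the real frame, `df["bmi"].quantile([0.25, 0.75])` would supply the quartiles directly instead of hard-coded numbers.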


In [11]:
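# A notebook cell displays only its last expression, so the gender
# values computed on the first line are not shown in the output.
df['gender'].unique()
df['smoking_history'].unique()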

array(['never', 'No Info', 'current', 'former', 'ever', 'not current'],
      dtype=object)
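
Several of the six `smoking_history` labels overlap in meaning ('former', 'ever', 'not current'), and 'No Info' acts as a missing-value marker. The sketch below shows one possible consolidation before encoding. The grouping in `SMOKING_GROUPS` is an assumption made for illustration, not something taken from the dataset's documentation, and the right mapping depends on how the labels were originally collected.

```python
import pandas as pd

# Hypothetical grouping of the six observed labels into four buckets.
SMOKING_GROUPS = {
    "never": "non_smoker",
    "No Info": "unknown",
    "current": "current",
    "former": "past",
    "ever": "past",
    "not current": "past",
}

# The six labels returned by unique() above, used here as stand-in data.
smoking = pd.Series(["never", "No Info", "current", "former", "ever", "not current"])
grouped = smoking.map(SMOKING_GROUPS)

# Series.map yields NaN for any label missing from the dict, so verify coverage.
assert grouped.notna().all(), "unmapped smoking_history label found"

print(grouped.value_counts().to_dict())
```

Keeping 'unknown' as its own category, rather than imputing it, preserves the information that the value was not recorded.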

In [22]:
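print("=" * 60 + "\n Feature Cardinality Test Begins:\n" + "=" * 60)

# Rough heuristic: fewer than 10 distinct values is treated as categorical.
for column in df.columns:
    num_distinct = df[column].nunique()
    feature_type = "Categorical" if num_distinct < 10 else "Numerical"
    print(f"{column:20s} | {num_distinct:6,} unique values | {feature_type}")

============================================================
 Feature Cardinality Test Begins:
============================================================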
gender               |      3 unique values | Categorical
age                  |    102 unique values | Numerical
hypertension         |      2 unique values | Categorical
heart_disease        |      2 unique values | Categorical
smoking_history      |      6 unique values | Categorical
bmi                  |  4,247 unique values | Numerical
HbA1c_level          |     18 unique values | Numerical
blood_glucose_level  |     18 unique values | Numerical
diabetes             |      2 unique values | Categorical
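
A fixed threshold of 10 distinct values works for this dataset, but it ignores the column's dtype and does not separate 0/1 flags from multi-level categories. The sketch below is one possible refinement that checks dtype first and cardinality second. `classify_feature` and its `max_categories` parameter are hypothetical names, and the toy columns only mimic the real ones.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def classify_feature(series: pd.Series, max_categories: int = 10) -> str:
    """Hypothetical helper: label a column by dtype first, then cardinality."""
    if not is_numeric_dtype(series):
        return "Categorical"  # string columns such as gender
    n_unique = series.nunique()
    if n_unique <= 2:
        return "Binary"       # 0/1 flags such as hypertension
    return "Categorical" if n_unique < max_categories else "Numerical"


# Toy columns that mimic the real ones; all values are invented.
toy = pd.DataFrame({
    "gender": ["Female", "Male"] * 6,
    "hypertension": [0, 1] * 6,
    "bmi": [18.5 + i for i in range(12)],
})

labels = {col: classify_feature(toy[col]) for col in toy.columns}
print(labels)
```

Separating binary flags from other categoricals can matter later, since flags usually need no encoding while multi-level string columns do.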
