#### COGS 108 Project

In [1]:
import numpy as np
import pandas as pd

#### Fetching data

The dataset is available on Kaggle [here](https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset?resource=download).

In [2]:
# loading in the dataset
# csv assumed to be in the same directory as this notebook
df_original = pd.read_csv('diabetes_prediction_dataset.csv')
df_original.head()

,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


In [3]:
# there are 100,000 records
df_original.shape

(100000, 9)

In [12]:
# checking for null values
df_original.isna().sum()

gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0
dtype: int64

In [10]:
# checking the type of each column
df_original.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


In [15]:
# checking the distribution of numerical values
df_original.describe()[['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']]

,age,bmi,HbA1c_level,blood_glucose_level
count,100000.0,100000.0,100000.0,100000.0
mean,41.885856,27.320767,5.527507,138.05806
std,22.51684,6.636783,1.070672,40.708136
min,0.08,10.01,3.5,80.0
25%,24.0,23.63,4.8,100.0
50%,43.0,27.32,5.8,140.0
75%,60.0,29.58,6.2,159.0
max,80.0,95.69,9.0,300.0


In [16]:
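# work on a copy so the original frame stays untouched
df = df_original.copy()
# normalize category labels (e.g. 'No Info' -> 'no info')
df['smoking_history'] = df['smoking_history'].str.lower()

# number of records per gender
df.groupby('gender')['age'].count()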

gender
Female    58552
Male      41430
Other        18
Name: age, dtype: int64

In [17]:
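# drop the 18 'Other' records; .copy() avoids SettingWithCopyWarning when columns are added later
df = df[df['gender'] != 'Other'].copy()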
df.head()

,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,no info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0


#### EDA

In [5]:
# proportion of 'hypertension', 'heart_disease', 'diabetes' by gender
# select the numeric columns before aggregating; .mean() on the object columns raises in pandas 2.x
df.groupby('gender')[['hypertension', 'heart_disease', 'diabetes']].mean()

gender,hypertension,heart_disease,diabetes
Female,0.07168,0.026677,0.076189
Male,0.079363,0.057446,0.09749


In [6]:
# average 'age', 'bmi', 'HbA1c_level', 'blood_glucose_level' by gender
df.groupby('gender')[['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']].mean()  # or .median()

gender,age,bmi,HbA1c_level,blood_glucose_level
Female,42.463291,27.449287,5.509477,137.468951
Male,41.075139,27.139108,5.553041,138.890031


In [7]:
# proportion of 'hypertension', 'heart_disease' by diabetes status and gender
df.groupby(['diabetes', 'gender'])[['hypertension', 'heart_disease']].mean()

diabetes,gender,hypertension,heart_disease
0,Female,0.056923,0.019153
0,Male,0.061994,0.043834
1,Female,0.250616,0.117911
1,Male,0.240158,0.183461


In [8]:
# average 'age', 'bmi', 'HbA1c_level', 'blood_glucose_level' by diabetes status and gender
df.groupby(['diabetes', 'gender'])[['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']].mean()  # or .median()

diabetes,gender,age,bmi,HbA1c_level,blood_glucose_level
0,Female,40.935065,27.022526,5.391509,132.811429
0,Male,38.934209,26.691107,5.404354,132.908668
1,Female,60.993499,32.623898,6.939879,193.942838
1,Male,60.894776,31.286467,6.929512,194.262441


In [9]:
# create age group bins to explore features by age group
# justify and/or change ages chosen here
bins = pd.cut(df['age'], [0, 5, 13, 18, 30, 45, 60, 80]) 
df['age_group'] = bins
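
One way to justify (or revise) the bin edges is to check what share of the records falls into each bin. The snippet below is a minimal sketch on made-up ages: `toy` and `age_share` are hypothetical names, and the toy values only stand in for `df['age']`.

```python
import pandas as pd

# Toy ages standing in for df['age']; the values are made up for illustration.
toy = pd.DataFrame({'age': [0.08, 3, 10, 16, 25, 40, 55, 70, 80, 80]})

# Same edges as the cell above; pd.cut builds right-closed bins such as (0, 5]
edges = [0, 5, 13, 18, 30, 45, 60, 80]
toy['age_group'] = pd.cut(toy['age'], edges)

# Share of records per bin, listed in bin order rather than by frequency
age_share = toy['age_group'].value_counts(normalize=True).sort_index()
print(age_share)
```

Running the same two lines on `df` would show how thinly the youngest bins are populated, which is the kind of evidence the bin choice needs.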

In [10]:
# proportion of 'hypertension', 'heart_disease', 'diabetes' by age group
df.groupby('age_group', observed=False)[['hypertension', 'heart_disease', 'diabetes']].mean()

age_group,hypertension,heart_disease,diabetes
"(0, 5]",0.000159,0.000159,0.001276
"(5, 13]",0.000427,0.000569,0.004979
"(13, 18]",0.001178,0.0,0.009619
"(18, 30]",0.00809,0.000652,0.013636
"(30, 45]",0.041902,0.006951,0.045328
"(45, 60]",0.105177,0.038068,0.11416
"(60, 80]",0.175271,0.124027,0.199687


In [11]:
# average 'bmi', 'HbA1c_level', 'blood_glucose_level' by age group
df.groupby('age_group', observed=False)[['bmi', 'HbA1c_level', 'blood_glucose_level']].mean()  # or .median()

age_group,bmi,HbA1c_level,blood_glucose_level
"(0, 5]",18.691292,5.407208,132.498007
"(5, 13]",21.110745,5.424822,133.264438
"(13, 18]",24.96332,5.407087,133.620338
"(18, 30]",27.229268,5.401142,133.935408
"(30, 45]",29.050945,5.468747,135.056102
"(45, 60]",29.495225,5.57836,139.937205
"(60, 80]",28.481819,5.699678,145.359428


In [12]:
# proportion of 'heart_disease', 'diabetes' by hypertension status
df.groupby('hypertension')[['heart_disease', 'diabetes']].mean()

hypertension,heart_disease,diabetes
0,0.032715,0.069321
1,0.122378,0.278958


In [13]:
# proportion of 'hypertension', 'diabetes' by heart_disease status
df.groupby('heart_disease')[['hypertension', 'diabetes']].mean()

heart_disease,hypertension,diabetes
0,0.068399,0.075312
1,0.232369,0.32141


To do: 

- Add bins for `bmi` and compare proportions and averages across them. Explore `HbA1c_level` and/or `blood_glucose_level` the same way, either by binning the numerical values or by assigning levels such as `high`, `normal`, and `low` (see the references in the next section).


- Create plots to visualize EDA.


- Since we are building a classifier to predict a patient's `diabetes` status, we can focus on patients aged 18 and above, or possibly 13 and above, because younger patients account for only a small proportion of the dataset.


- I am open to suggestions if anyone would like to build a classifier or regressor for a different target, whether a feature already in the dataset or one derived from it (such as `bmi_level` rather than `bmi`). I chose `diabetes` status because it is the most obvious target and, for this dataset, the most reasonable one to predict.
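
The first to-do item could be sketched as follows. This is only an illustration on made-up rows: `toy`, `bmi_edges`, `bmi_labels`, `bmi_level`, and `rate_by_bmi` are hypothetical names, and the cut points assume the CDC adult BMI categories (which are defined for adults, so they may not suit the youngest age groups).

```python
import pandas as pd

# Toy rows standing in for df; values are made up for illustration only.
toy = pd.DataFrame({
    'bmi':      [17.0, 22.0, 24.9, 27.3, 29.9, 31.0, 40.0, 35.5],
    'diabetes': [0,    0,    0,    0,    1,    1,    1,    0],
})

# Assumed cut points (CDC adult categories):
# underweight < 18.5, healthy 18.5 to < 25, overweight 25 to < 30, obesity >= 30
bmi_edges = [0, 18.5, 25, 30, float('inf')]
bmi_labels = ['underweight', 'healthy', 'overweight', 'obesity']

# right=False makes each bin closed on the left: [18.5, 25), [25, 30), ...
toy['bmi_level'] = pd.cut(toy['bmi'], bins=bmi_edges, labels=bmi_labels, right=False)

# proportion of diabetes by bmi level, mirroring the age_group tables above
rate_by_bmi = toy.groupby('bmi_level', observed=False)['diabetes'].mean()
print(rate_by_bmi)
```

Using `pd.cut` with labels keeps the levels as an ordered categorical, so the resulting table lists them from `underweight` to `obesity` rather than alphabetically.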

#### Data Cleaning and Feature Engineering

Additional references for feature engineering:
- BMI level: [CDC](https://www.cdc.gov/obesity/basics/adult-defining.html)
- HbA1c level: [BMC Endocrine Disorders](https://bmcendocrdisord.biomedcentral.com/articles/10.1186/s12902-019-0338-7)
- Blood glucose level: [MedlinePlus](https://medlineplus.gov/ency/patientinstructions/000086.htm#:~:text=From%2090%20to%20130%20mg,children%20under%206%20years%20old)
- Diabetes likelihood: [CDC](https://www.cdc.gov/media/releases/2017/p0718-diabetes-report.html)

More related or relevant links can be added here.
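
As a rough sketch of how such references could turn into a feature, the snippet below labels `HbA1c_level` values. The thresholds are an assumption (the commonly cited ranges: normal below 5.7, prediabetes from 5.7 to below 6.5, diabetes range at 6.5 and above) and should be checked against whichever source the project settles on. `toy`, `hba1c_edges`, `hba1c_labels`, `hba1c_level_cat`, and `level_counts` are hypothetical names.

```python
import pandas as pd

# Toy values standing in for df['HbA1c_level']; made up for illustration only.
toy = pd.DataFrame({'HbA1c_level': [3.5, 5.0, 5.6, 5.7, 6.4, 6.5, 9.0]})

# Assumed thresholds: normal < 5.7, prediabetes 5.7 to < 6.5, diabetes range >= 6.5
hba1c_edges = [0, 5.7, 6.5, float('inf')]
hba1c_labels = ['normal', 'prediabetes', 'diabetes_range']

# right=False keeps the lower edge inside each bin, so 5.7 counts as prediabetes
toy['hba1c_level_cat'] = pd.cut(
    toy['HbA1c_level'], bins=hba1c_edges, labels=hba1c_labels, right=False
)

# number of records per level, in level order
level_counts = toy['hba1c_level_cat'].value_counts().sort_index()
print(level_counts)
```

The same pattern would apply to `blood_glucose_level` once cut points are chosen from the linked reference; none are assumed here because they depend on how the measurement was taken.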