# **Diabetes Risk Analysis**

## Objectives

- Download the preprocessed cardiovascular disease dataset from Kaggle
- Load the dataset into a pandas dataframe
- Perform basic data exploration
- Create new features based on existing data
- Prepare the dataset for calculating diabetes risk using ChatGPT
- Reload the dataset with diabetes risk percentage
- Visualize the data using matplotlib and seaborn

## Inputs

- **Dataset:** cardio_data_processed.csv. The dataset is available on Kaggle at [Cardiovascular Disease Dataset](https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset).
- **Diabetes Risk Percentage:** With the help of ChatGPT, I will calculate a diabetes risk percentage for each individual in a separate notebook and merge the results with the main dataset.
- **Python Version:** 3.12.8
- **Python Libraries:** pandas, numpy, matplotlib, seaborn
- **Environment:** Jupyter Notebook or any Python IDE that supports data analysis


## Outputs

- **Cleaned dataset:** `dataset/cleaned/cardio_data_processed_clean.csv`

---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so the working directory must be changed when running a notebook in the editor.

We need to change the working directory from the current folder to its parent folder.
* `os.getcwd()` returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/raihannasir/Documents/DA_AI/diabetes_risk/diabetes_risk/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/raihannasir/Documents/DA_AI/diabetes_risk/diabetes_risk'
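Re-running the `os.chdir` cell moves one level further up each time. A minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the path above; the helper name `project_root` is hypothetical:

```python
from pathlib import Path

def project_root(cwd, notebook_dir="jupyter_notebooks"):
    """Return the parent of `cwd` if `cwd` is the notebooks folder, else `cwd` unchanged."""
    cwd = Path(cwd)
    return cwd.parent if cwd.name == notebook_dir else cwd

# Usage in the notebook (commented out so the sketch has no side effects):
# import os
# os.chdir(project_root(os.getcwd()))
```

Because the helper only steps up when the current folder is the notebooks folder, running the cell twice leaves the working directory at the project root.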

---

# Import necessary libraries and Packages

**I will import the necessary libraries and packages including pandas, numpy, matplotlib, and seaborn, which will be used for data analysis and visualization purposes.**

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load the raw dataset

In [22]:
raw_path = 'dataset/raw/cardio_data_processed.csv'

In [6]:
df = pd.read_csv(raw_path)
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_years,bmi,bp_category,bp_category_encoded
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1,Hypertension Stage 1
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2,Hypertension Stage 2
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1,Hypertension Stage 1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2,Hypertension Stage 2
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal,Normal


**Using the `.info()` method, I will explore the general structure of the dataset, including the number of entries, column names, data types, and non-null counts. This will help identify any missing values or inconsistencies in the dataset.**

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68205 entries, 0 to 68204
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   68205 non-null  int64  
 1   age                  68205 non-null  int64  
 2   gender               68205 non-null  int64  
 3   height               68205 non-null  int64  
 4   weight               68205 non-null  float64
 5   ap_hi                68205 non-null  int64  
 6   ap_lo                68205 non-null  int64  
 7   cholesterol          68205 non-null  int64  
 8   gluc                 68205 non-null  int64  
 9   smoke                68205 non-null  int64  
 10  alco                 68205 non-null  int64  
 11  active               68205 non-null  int64  
 12  cardio               68205 non-null  int64  
 13  age_years            68205 non-null  int64  
 14  bmi                  68205 non-null  float64
 15  bp_category          68205 non-null 

## Initial data screening and exploration

**Use `describe()` to summarize the numerical features.**
- Check the count (to spot missing values), the min and max (to spot possible outliers), and basic statistics such as the mean, median (50%), and standard deviation.

In [8]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,68205.0,49972.410498,28852.13829,0.0,24991.0,50008.0,74878.0,99999.0
age,68205.0,19462.667737,2468.381854,10798.0,17656.0,19700.0,21323.0,23713.0
gender,68205.0,1.348625,0.476539,1.0,1.0,1.0,2.0,2.0
height,68205.0,164.372861,8.176756,55.0,159.0,165.0,170.0,250.0
weight,68205.0,74.100688,14.288862,11.0,65.0,72.0,82.0,200.0
ap_hi,68205.0,126.434924,15.961685,90.0,120.0,120.0,140.0,180.0
ap_lo,68205.0,81.263925,9.143985,60.0,80.0,80.0,90.0,120.0
cholesterol,68205.0,1.363243,0.67808,1.0,1.0,1.0,1.0,3.0
gluc,68205.0,1.225174,0.571288,1.0,1.0,1.0,1.0,3.0
smoke,68205.0,0.087662,0.282805,0.0,0.0,0.0,0.0,1.0
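The summary shows some extreme values, for example a minimum height of 55 and a minimum weight of 11. One common way to count such values is the 1.5 × IQR rule. The sketch below is an illustration, not part of the original analysis; the helper name `iqr_outlier_counts` and the toy values are assumptions.

```python
import pandas as pd

def iqr_outlier_counts(frame, columns, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        counts[col] = int(((frame[col] < lower) | (frame[col] > upper)).sum())
    return pd.Series(counts, name="n_outliers")

# Toy values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "height": [160, 162, 165, 168, 170, 55],
    "weight": [60.0, 65.0, 70.0, 72.0, 75.0, 80.0],
})
outliers = iqr_outlier_counts(toy, ["height", "weight"])
print(outliers)
```

In the notebook, the same call could be made on `df` with columns such as `height`, `weight`, and `bmi`.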


**Use `describe(include='object')` to summarize the categorical features.**
- Check the number of unique values, the most frequent value, and how many times it appears.

In [9]:
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
bp_category,68205,4,Hypertension Stage 1,39750
bp_category_encoded,68205,4,Hypertension Stage 1,39750


## Drop unnecessary columns from the raw dataset
 - **id:** Unique identifier for each individual, not needed for analysis.
 - **age:** Age in days; the same information is already given in years in the `age_years` column.
 - **bp_category_encoded:** This column duplicates `bp_category` and is not actually encoded, so I will drop it.
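
Before dropping a column as redundant, the claim can be checked with an element-wise comparison. A minimal sketch on a toy frame (the values are illustrative only); in the notebook the same comparison would be run on `df` before the drop:

```python
import pandas as pd

# Toy frame for illustration only; it stands in for the raw `df`
toy = pd.DataFrame({
    "bp_category": ["Normal", "Hypertension Stage 1", "Hypertension Stage 2"],
    "bp_category_encoded": ["Normal", "Hypertension Stage 1", "Hypertension Stage 2"],
})

# True only if every row holds the same value in both columns
is_redundant = (toy["bp_category"] == toy["bp_category_encoded"]).all()
print(is_redundant)
```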

In [10]:
df.drop(columns=['id', 'age', 'bp_category_encoded'], inplace=True)
df.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_years,bmi,bp_category
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal


**I will check the following:**
- the shape of the dataset after dropping the unnecessary columns.
- the column names in the dataset.
- the number of missing values per column in the dataset.

In [11]:
print(df.shape)
print(df.columns)
print(df.isnull().sum())


(68205, 14)
Index(['gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc',
       'smoke', 'alco', 'active', 'cardio', 'age_years', 'bmi', 'bp_category'],
      dtype='object')
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
age_years      0
bmi            0
bp_category    0
dtype: int64


**I will rename the `age_years` column to `age` for consistency with the original dataset.**

In [12]:
df.rename(columns={'age_years': 'age'}, inplace=True)
df.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal


**For convenience in the analysis, I will create two new categorical columns using functions:**
- `age_group`: Categorizing individuals into age groups.
- `bmi_category`: Categorizing individuals into BMI groups.

**These columns will be used during the dashboard creation to visualize the data more effectively.**

In [14]:
# Function to categorize age into groups

def age_group(age):
    if age < 30:
        return '18-29'
    elif age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    elif age >= 60:
        return '60+'

# Function to categorize BMI into groups
def bmi_category(bmi):
    if bmi < 18:
        return 'Underweight'
    elif 18 <= bmi < 25:
        return 'Normal weight'
    elif 25 <= bmi < 30:
        return 'Overweight'
    elif 30 <= bmi < 35:
        return 'Obesity I'
    elif 35 <= bmi < 40:
        return 'Obesity II'
    else:
        return 'Obesity III'

df = df.pipe(
    lambda x: x.assign(
        age_group=x['age'].apply(age_group),
        bmi_category=x['bmi'].apply(bmi_category)
    )
)

df.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category,age_group,bmi_category
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1,50-59,Normal weight
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2,50-59,Obesity I
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1,50-59,Normal weight
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2,40-49,Overweight
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal,40-49,Normal weight
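
The same grouping can also be expressed without row-wise `apply`, using `pd.cut`. The sketch below is an alternative under the assumption that the bin edges should mirror the thresholds in `age_group()` and `bmi_category()`; the constant names, the helper `add_groups`, and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

# Bin edges mirror the thresholds in age_group() and bmi_category();
# right=False makes each bin closed on the left, e.g. [30, 40).
AGE_BINS = [-np.inf, 30, 40, 50, 60, np.inf]
AGE_LABELS = ['18-29', '30-39', '40-49', '50-59', '60+']
BMI_BINS = [-np.inf, 18, 25, 30, 35, 40, np.inf]
BMI_LABELS = ['Underweight', 'Normal weight', 'Overweight',
              'Obesity I', 'Obesity II', 'Obesity III']

def add_groups(frame):
    """Return a copy of `frame` with age_group and bmi_category columns added."""
    return frame.assign(
        age_group=pd.cut(frame['age'], bins=AGE_BINS,
                         labels=AGE_LABELS, right=False).astype(str),
        bmi_category=pd.cut(frame['bmi'], bins=BMI_BINS,
                            labels=BMI_LABELS, right=False).astype(str),
    )

# Toy values chosen to sit on the bin boundaries (illustration only)
toy = pd.DataFrame({'age': [29, 30, 47, 59, 60],
                    'bmi': [17.9, 18.0, 25.0, 34.9, 40.0]})
grouped = add_groups(toy)
print(grouped)
```

`astype(str)` keeps the columns as plain strings, matching what the functions return. It would turn a missing value into the string `'nan'`, which does not matter here because the dataset has no missing values.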


**I will further simplify the `age_group` and `bmi_category` columns for better readability and for hypothesis testing.**

In [15]:
def simplify_age_group(group):
    if group in ['18-29', '30-39']:
        return 'Adult'
    elif group in ['40-49', '50-59']:
        return 'Middle-aged'
    else:
        return 'Senior'

def simplify_bmi_category(category):
    if category in ['Underweight', 'Normal weight']:
        return 'Low BMI'
    elif category in ['Overweight', 'Obesity I']:
        return 'Mid BMI'
    else:
        return 'High BMI'

df = df.pipe(
    lambda x: x.assign(
        age_simp_group=x['age_group'].apply(simplify_age_group),
        bmi_simp_cat=x['bmi_category'].apply(simplify_bmi_category)
    )
)

df.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category,age_group,bmi_category,age_simp_group,bmi_simp_cat
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1,50-59,Normal weight,Middle-aged,Low BMI
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2,50-59,Obesity I,Middle-aged,Mid BMI
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1,50-59,Normal weight,Middle-aged,Low BMI
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2,40-49,Overweight,Middle-aged,Mid BMI
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal,40-49,Normal weight,Middle-aged,Low BMI
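
The simplification is a fixed lookup, so it can also be written as dictionaries passed to `Series.map`. This is a sketch of that alternative, not the notebook's method; the dictionary names and toy rows are assumptions.

```python
import pandas as pd

# Lookup tables equivalent to simplify_age_group() and simplify_bmi_category()
AGE_SIMPLE = {'18-29': 'Adult', '30-39': 'Adult',
              '40-49': 'Middle-aged', '50-59': 'Middle-aged',
              '60+': 'Senior'}
BMI_SIMPLE = {'Underweight': 'Low BMI', 'Normal weight': 'Low BMI',
              'Overweight': 'Mid BMI', 'Obesity I': 'Mid BMI',
              'Obesity II': 'High BMI', 'Obesity III': 'High BMI'}

# Toy rows for illustration only
toy = pd.DataFrame({'age_group': ['30-39', '50-59', '60+'],
                    'bmi_category': ['Normal weight', 'Obesity I', 'Obesity III']})

simplified = toy.assign(
    age_simp_group=toy['age_group'].map(AGE_SIMPLE),
    bmi_simp_cat=toy['bmi_category'].map(BMI_SIMPLE),
)
print(simplified)
```

One behavioural difference: a label missing from the dictionary becomes `NaN` with `map`, whereas the functions send any unrecognised label to the `else` branch (`'Senior'` or `'High BMI'`).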


### Since the cardiovascular disease dataset has some key features that are relevant to diabetes risk analysis, I will add a new column, `diab_risk_percent`, to store the calculated diabetes risk percentage for each individual.

**Save the cleaned dataset in the `cleaned` folder as a new CSV file, to be used for the diabetes risk percentage calculation with the help of ChatGPT.**

In [20]:
df.to_csv('dataset/cleaned/cardio_data_processed_clean.csv', index=False)

**Load the cleaned dataset with diabetes risk percentage.**

In [31]:
df_diab = pd.read_csv('dataset/raw/cardio_data_with_diabetes_risk.csv')
print(df_diab.shape)
df_diab.info()  # info() prints its report directly and returns None
print(df_diab.isnull().sum())
df_diab.head()

(68205, 19)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68205 entries, 0 to 68204
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             68205 non-null  int64  
 1   height             68205 non-null  int64  
 2   weight             68205 non-null  float64
 3   ap_hi              68205 non-null  int64  
 4   ap_lo              68205 non-null  int64  
 5   cholesterol        68205 non-null  int64  
 6   gluc               68205 non-null  int64  
 7   smoke              68205 non-null  int64  
 8   alco               68205 non-null  int64  
 9   active             68205 non-null  int64  
 10  cardio             68205 non-null  int64  
 11  age                68205 non-null  int64  
 12  bmi                68205 non-null  float64
 13  bp_category        68205 non-null  object 
 14  age_group          68205 non-null  object 
 15  bmi_category       68205 non-null  object 
 16  age_simp_g

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category,age_group,bmi_category,age_simp_group,bmi_simp_cat,diab_risk_percent
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,14.42
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2,50-59,Obese I,Middle-aged,Mid BMI,51.35
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,16.08
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2,40-49,Overweight,Middle-aged,Mid BMI,19.75
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal,40-49,Normal,Middle-aged,Low BMI,10.92
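
Because the risk column was produced outside this notebook, it can be worth comparing the reloaded frame with the cleaned one. In the preview above, for instance, `bmi_category` holds labels such as `Normal` and `Obese I`, whereas the notebook's own function produces `Normal weight` and `Obesity I`. The sketch below is a hypothetical helper (`compare_frames`) run on toy frames; the 0 to 100 range check is an assumption about what a percentage column should contain.

```python
import pandas as pd

def compare_frames(clean, with_risk, risk_col='diab_risk_percent'):
    """Summarise how the reloaded frame differs from the cleaned one."""
    return {
        'same_row_count': len(clean) == len(with_risk),
        'new_columns': sorted(set(with_risk.columns) - set(clean.columns)),
        'risk_in_range': bool(with_risk[risk_col].between(0, 100).all()),
        'bmi_labels_only_in_clean': sorted(
            set(clean['bmi_category']) - set(with_risk['bmi_category'])
        ),
    }

# Toy frames for illustration only
clean_toy = pd.DataFrame({'bmi_category': ['Normal weight', 'Obesity I']})
risk_toy = pd.DataFrame({'bmi_category': ['Normal', 'Obese I'],
                         'diab_risk_percent': [14.42, 51.35]})

report = compare_frames(clean_toy, risk_toy)
print(report)
```

In the notebook, the call would be `compare_frames(df, df_diab)`. A non-empty `bmi_labels_only_in_clean` list signals that the labels were changed somewhere between saving and reloading.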


**With 68205 rows, the dataset may contain duplicate rows, so I will check for them.**

In [30]:
print(df_diab.duplicated().sum())
df_diab[df_diab.duplicated()]

16


Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category,age_group,bmi_category,age_simp_group,bmi_simp_cat,diab_risk_percent
17495,1,160,60.0,120,80,1,1,0,0,1,0,58,23.4375,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,15.39
44606,1,165,65.0,120,80,1,1,0,0,1,0,54,23.875115,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,16.7
48560,2,170,75.0,120,80,1,1,0,0,1,0,50,25.951557,Hypertension Stage 1,50-59,Overweight,Middle-aged,Mid BMI,28.63
50573,2,170,70.0,120,80,1,1,0,0,1,1,59,24.221453,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,14.95
54628,1,170,70.0,120,80,1,1,0,0,1,1,49,24.221453,Hypertension Stage 1,40-49,Normal,Middle-aged,Low BMI,10.64
57338,1,165,65.0,120,80,1,1,0,0,1,0,55,23.875115,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,14.95
57881,1,165,65.0,120,80,1,1,0,0,1,0,50,23.875115,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,17.23
58920,1,165,63.0,120,80,1,1,0,0,1,0,39,23.140496,Hypertension Stage 1,30-39,Normal,Adult,Low BMI,5.96
59266,1,165,65.0,120,80,1,1,0,0,1,0,58,23.875115,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,16.98
60494,1,170,68.0,120,80,1,1,0,0,1,0,57,23.529412,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,15.36


**`duplicated()` flags 16 rows, each of which exactly matches an earlier row in every column. The flagged rows differ from one another, and since the `id` column has been dropped, identical measurements may well belong to different individuals. Therefore, I will keep them in the dataset for further analysis.**
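
With its default `keep='first'`, `duplicated()` marks only the later copies, so the rows it lists need not resemble one another. Passing `keep=False` marks every member of a duplicate group, and sorting then places each flagged row next to its match. A minimal sketch on toy data (the values are illustrative only):

```python
import pandas as pd

# Toy frame: rows 2 and 3 repeat rows 0 and 1 (illustration only)
toy = pd.DataFrame({'age': [58, 54, 58, 54, 39],
                    'bmi': [23.4, 23.9, 23.4, 23.9, 23.1]})

# Default keep='first': only the later copies are flagged
later_copies = toy[toy.duplicated()]

# keep=False: every row that has an exact twin is flagged
all_members = toy[toy.duplicated(keep=False)].sort_values(list(toy.columns))

print(later_copies)
print(all_members)
```

In the notebook, the same pattern applied to `df_diab` would show the 16 flagged rows together with the earlier rows they match.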