# **Diabetes Risk Analysis**

## Objectives

- Download the preprocessed cardiovascular disease dataset from Kaggle
- Load the dataset into a pandas dataframe
- Perform basic data exploration


## Inputs

- **Dataset:** cardio_data_processed.csv. The dataset is available on Kaggle at [Cardiovascular Disease Dataset](https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset).
- **Diabetes Risk Percentage:** With the help of ChatGPT, I will calculate, in a separate notebook, the percentage of individuals at risk of diabetes based on this dataset, and then merge the results with the main dataset.
- **Python Version:** 3.12.8
- **Python Libraries:** pandas, numpy, matplotlib, seaborn
- **Environment:** Jupyter Notebook or any Python IDE that supports data analysis


## Outputs

- **Cleaned dataset:**


---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor, the working directory must be changed.

The working directory needs to change from the notebook's current folder to its parent folder.
* `os.getcwd()` returns the current directory

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/raihannasir/Documents/DA_AI/diabetes_risk/diabetes_risk/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/raihannasir/Documents/DA_AI/diabetes_risk/diabetes_risk'
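
Because the `os.chdir` cell moves up one level every time it runs, re-running it would leave the project folder. The following is a minimal sketch of a guard, assuming the notebooks folder is named `jupyter_notebooks` as in the path printed earlier. `resolve_project_root` is a hypothetical helper and is not part of the original notebook.

```python
import os
from pathlib import Path


def resolve_project_root(cwd: Path, notebooks_dir: str = "jupyter_notebooks") -> Path:
    """Return the parent folder if cwd is the notebooks folder, else cwd unchanged.

    The folder name is an assumption taken from the path shown in this notebook.
    """
    return cwd.parent if cwd.name == notebooks_dir else cwd


# Safe to run more than once: only moves up when currently inside the notebooks folder
os.chdir(resolve_project_root(Path.cwd()))
```

With this guard, running the cell a second time leaves the working directory where it is instead of climbing another level.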

# Import necessary libraries and packages

I will import the necessary libraries and packages including pandas, numpy, matplotlib, and seaborn, which will be used for data analysis and visualization purposes.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Load the raw dataset

In [5]:
raw_path = 'dataset/raw/cardio_data_processed.csv'

In [6]:
df = pd.read_csv(raw_path)
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_years,bmi,bp_category,bp_category_encoded
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1,Hypertension Stage 1
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2,Hypertension Stage 2
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1,Hypertension Stage 1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2,Hypertension Stage 2
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal,Normal
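
The derived columns in the preview can be sanity-checked by hand. The sketch below rebuilds `bmi` and `age_years` for the first two rows, assuming `height` is in centimetres, `weight` is in kilograms, and `age` is in days. The year conversion (floor division by 365.25) is an assumption that happens to match the rows shown, not a documented rule of the dataset.

```python
import pandas as pd

# First two rows copied from the preview above
sample = pd.DataFrame({
    "age": [18393, 20228],     # assumed to be days
    "height": [168, 156],      # assumed to be cm
    "weight": [62.0, 85.0],    # assumed to be kg
})

# BMI = weight (kg) / height (m) squared
sample["bmi_check"] = sample["weight"] / (sample["height"] / 100) ** 2

# Whole years from days; 365.25 days per year is an assumption
sample["age_years_check"] = (sample["age"] // 365.25).astype(int)

print(sample)
```

If the recomputed values differed noticeably from the `bmi` and `age_years` columns, that would point to a unit mismatch worth investigating before any analysis.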


Using the `.info()` method, I will explore general information about the structure of the dataset: the number of entries, column names, data types, and non-null counts. This helps identify missing values or inconsistencies in the dataset.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68205 entries, 0 to 68204
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   68205 non-null  int64  
 1   age                  68205 non-null  int64  
 2   gender               68205 non-null  int64  
 3   height               68205 non-null  int64  
 4   weight               68205 non-null  float64
 5   ap_hi                68205 non-null  int64  
 6   ap_lo                68205 non-null  int64  
 7   cholesterol          68205 non-null  int64  
 8   gluc                 68205 non-null  int64  
 9   smoke                68205 non-null  int64  
 10  alco                 68205 non-null  int64  
 11  active               68205 non-null  int64  
 12  cardio               68205 non-null  int64  
 13  age_years            68205 non-null  int64  
 14  bmi                  68205 non-null  float64
 15  bp_category          68205 non-null 
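
The non-null counts reported by `.info()` can be complemented with an explicit count of missing values and duplicate rows. This is a minimal sketch on a small made-up frame (`toy` is illustrative data, not the real dataset); the same calls would apply to `df`.

```python
import numpy as np
import pandas as pd

# Illustrative data only: two columns with one missing value each
toy = pd.DataFrame({
    "height": [168, 156, np.nan],
    "bp_category": ["Normal", None, "Normal"],
})

# Number of missing values per column
missing = toy.isna().sum()

# Number of fully duplicated rows
duplicates = toy.duplicated().sum()

print(missing)
print("Duplicate rows:", duplicates)
```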

**Initial data screening and exploration:**

- Use `describe()` to summarize the numerical features of the dataset.
    - Check the counts for missing values, the min and max for outliers, and basic statistics such as the mean, median (the 50% row), and standard deviation.

In [8]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,68205.0,49972.410498,28852.13829,0.0,24991.0,50008.0,74878.0,99999.0
age,68205.0,19462.667737,2468.381854,10798.0,17656.0,19700.0,21323.0,23713.0
gender,68205.0,1.348625,0.476539,1.0,1.0,1.0,2.0,2.0
height,68205.0,164.372861,8.176756,55.0,159.0,165.0,170.0,250.0
weight,68205.0,74.100688,14.288862,11.0,65.0,72.0,82.0,200.0
ap_hi,68205.0,126.434924,15.961685,90.0,120.0,120.0,140.0,180.0
ap_lo,68205.0,81.263925,9.143985,60.0,80.0,80.0,90.0,120.0
cholesterol,68205.0,1.363243,0.67808,1.0,1.0,1.0,1.0,3.0
gluc,68205.0,1.225174,0.571288,1.0,1.0,1.0,1.0,3.0
smoke,68205.0,0.087662,0.282805,0.0,0.0,0.0,0.0,1.0
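
Extreme values such as a minimum `height` of 55 or a maximum of 250 suggest possible outliers. One common way to flag them is the 1.5 × IQR rule, sketched here on five illustrative values taken from the `height` row of the summary. The rule and its 1.5 multiplier are a conventional choice, not something the original notebook prescribes.

```python
import pandas as pd

# Illustrative values only: min, quartiles, and max of height from the summary
height = pd.Series([55, 159, 165, 170, 250], name="height")

q1 = height.quantile(0.25)
q3 = height.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = height[(height < lower) | (height > upper)]

print(f"Bounds: {lower} to {upper}")
print(outliers.tolist())
```

Flagged values are candidates for review rather than automatic removal; whether a height of 55 cm is a data-entry error depends on domain knowledge.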


- Use `describe(include='object')` to summarize the categorical features of the dataset.
    - Check the number of unique values, the most frequent value, and how many times it appears.

In [9]:
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
bp_category,68205,4,Hypertension Stage 1,39750
bp_category_encoded,68205,4,Hypertension Stage 1,39750
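
The two categorical columns report identical summaries, and the preview shows `bp_category_encoded` holding the same text labels as `bp_category`. A quick way to confirm whether one column is redundant is to compare them directly. The sketch below uses the five preview rows as illustrative data; on the full dataset the same comparison would be run on `df`.

```python
import pandas as pd

labels = [
    "Hypertension Stage 1",
    "Hypertension Stage 2",
    "Hypertension Stage 1",
    "Hypertension Stage 2",
    "Normal",
]

# Illustrative data: the five rows shown in the preview
toy = pd.DataFrame({"bp_category": labels, "bp_category_encoded": labels})

# Frequency of each category
counts = toy["bp_category"].value_counts()

# True only if every row has the same value in both columns
identical = (toy["bp_category"] == toy["bp_category_encoded"]).all()

print(counts)
print("Columns identical:", identical)
```

If the columns turn out to be identical on the full dataset, one of them could be dropped or re-encoded numerically during cleaning.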


---


# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python until a folder is created
except Exception as e:
  print(e)
