# **Diabetes Risk Analysis**

## Objectives

- Download the preprocessed cardiovascular disease dataset from Kaggle
- Load the dataset into a pandas dataframe
- Perform basic data exploration
- Create new features based on existing data for the analysis purpose
- Prepare the dataset for calculating diabetes risk using ChatGPT
- Reload the dataset with diabetes risk percentage
- Visualize the data using matplotlib and seaborn

## Inputs

- **Dataset:** `cardio_data_processed.csv`. The dataset is available on Kaggle at [Cardiovascular Disease](https://www.kaggle.com/datasets/colewelkins/cardiovascular-disease/data).
- **Diabetes Risk Percentage:** With the help of ChatGPT, I will estimate a diabetes risk percentage for each individual in a separate notebook and merge the results with the main dataset.
- **Python Version:** 3.12.8
- **Python Libraries:** pandas, numpy, matplotlib, seaborn, plotly, scikit-learn, feature_engine
- **Environment:** Jupyter Notebook or any Python IDE that supports data analysis
- **Columns of interest:**
    - Target column: `diab_risk_cat` or `diab_risk_percent`
    - Features: `weight`, `active`, `bmi_simp_cat`, `age_simp_group`, `gender`, `bp_category`


## Outputs

- **Cleaned dataset:** `cardio_data_with_diabetes_risk_clean.csv` is stored in the `dataset/cleaned` folder.

---

# Change working directory

* The notebooks are assumed to be stored in a subfolder, so when running a notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/Users/raihannasir/Documents/DA_AI/diabetes_risk/diabetes_risk/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` defines the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/Users/raihannasir/Documents/DA_AI/diabetes_risk/diabetes_risk'
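The same step can also be written with `pathlib`. This is a minimal sketch rather than part of the original notebook; the helper names `project_root` and `enter_project_root` and the example path are illustrative assumptions.

```python
import os
from pathlib import Path


def project_root(notebook_dir):
    """Return the parent folder of the directory that holds the notebooks."""
    return Path(notebook_dir).parent


def enter_project_root(notebook_dir=None):
    """Make the parent of notebook_dir (default: current directory) the working directory."""
    start = notebook_dir if notebook_dir is not None else Path.cwd()
    root = project_root(start)
    os.chdir(root)
    return root


# Hypothetical relative path, used only to show what the helper returns
example = project_root("project/jupyter_notebooks")
print(example.name)
```

Calling `enter_project_root()` once at the top of a notebook would have the same effect as cells 1 to 3 above.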

---

# Import necessary libraries and packages

**I will import the libraries and packages used for data analysis and visualization (pandas, numpy, matplotlib, seaborn, and plotly), together with the scikit-learn `Pipeline` and the feature_engine transformers used later for feature engineering.**

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Pipeline and feature-engineering transformers used in the Feature Engineering section
from sklearn.pipeline import Pipeline
from feature_engine.outliers import Winsorizer
from feature_engine import transformation as vt

## Load the raw dataset

In [5]:
raw_path = 'dataset/raw/cardio_data_processed.csv'

In [6]:
df = pd.read_csv(raw_path)
df.head()

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_years,bmi,bp_category,bp_category_encoded
0,0,18393,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1,Hypertension Stage 1
1,1,20228,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2,Hypertension Stage 2
2,2,18857,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1,Hypertension Stage 1
3,3,17623,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2,Hypertension Stage 2
4,4,17474,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal,Normal


**Using the `.info()` method, I will explore general information about the structure of the dataset, including the number of entries, column names, data types, and non-null counts. This will help identify any missing values or inconsistencies in the dataset.**

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68205 entries, 0 to 68204
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id                   68205 non-null  int64  
 1   age                  68205 non-null  int64  
 2   gender               68205 non-null  int64  
 3   height               68205 non-null  int64  
 4   weight               68205 non-null  float64
 5   ap_hi                68205 non-null  int64  
 6   ap_lo                68205 non-null  int64  
 7   cholesterol          68205 non-null  int64  
 8   gluc                 68205 non-null  int64  
 9   smoke                68205 non-null  int64  
 10  alco                 68205 non-null  int64  
 11  active               68205 non-null  int64  
 12  cardio               68205 non-null  int64  
 13  age_years            68205 non-null  int64  
 14  bmi                  68205 non-null  float64
 15  bp_category          68205 non-null 

## Initial data screening and exploration:

**Use `describe()` to summarize the numerical features of the dataset.**
- Check for missing values, outliers, and basic statistics like mean, median, and standard deviation.

In [8]:
df.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,68205.0,49972.410498,28852.13829,0.0,24991.0,50008.0,74878.0,99999.0
age,68205.0,19462.667737,2468.381854,10798.0,17656.0,19700.0,21323.0,23713.0
gender,68205.0,1.348625,0.476539,1.0,1.0,1.0,2.0,2.0
height,68205.0,164.372861,8.176756,55.0,159.0,165.0,170.0,250.0
weight,68205.0,74.100688,14.288862,11.0,65.0,72.0,82.0,200.0
ap_hi,68205.0,126.434924,15.961685,90.0,120.0,120.0,140.0,180.0
ap_lo,68205.0,81.263925,9.143985,60.0,80.0,80.0,90.0,120.0
cholesterol,68205.0,1.363243,0.67808,1.0,1.0,1.0,1.0,3.0
gluc,68205.0,1.225174,0.571288,1.0,1.0,1.0,1.0,3.0
smoke,68205.0,0.087662,0.282805,0.0,0.0,0.0,0.0,1.0


**Use describe(include='object') to summarize the dataset of categorical features.**
- Check for number of unique values, the most frequent value and number of times it appears.

In [9]:
df.describe(include='object').T

Unnamed: 0,count,unique,top,freq
bp_category,68205,4,Hypertension Stage 1,39750
bp_category_encoded,68205,4,Hypertension Stage 1,39750


## Drop unnecessary columns from the raw dataset
 - **id:** Unique identifier for each individual, not needed for analysis.
 - **age:** Age is already represented in years in the `age_years` column.
 - **bp_category_encoded:** This column duplicates `bp_category` and is not actually encoded, so I will drop it.

In [10]:
df.drop(columns= ['id', 'age', 'bp_category_encoded'], inplace=True)
df.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age_years,bmi,bp_category
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal


**After dropping the unnecessary columns, I will check the following:**
- the shape of the dataset after dropping the unnecessary columns.
- name of the columns in the dataset.
- presence of any missing values per column in the dataset.

In [11]:
print(df.shape)
print(df.columns)
print(df.isnull().sum())


(68205, 14)
Index(['gender', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc',
       'smoke', 'alco', 'active', 'cardio', 'age_years', 'bmi', 'bp_category'],
      dtype='object')
gender         0
height         0
weight         0
ap_hi          0
ap_lo          0
cholesterol    0
gluc           0
smoke          0
alco           0
active         0
cardio         0
age_years      0
bmi            0
bp_category    0
dtype: int64


**I will rename the `age_years` column to `age` for brevity, now that the original `age` column has been dropped.**

In [12]:
df.rename(columns = {'age_years': 'age',}, inplace=True)
df.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal


**To make the analysis more convenient, I will create two new categorical columns using functions:**
- `age_group`: Categorizing individuals into age groups.
- `bmi_category`: Categorizing individuals into BMI groups.

**These columns will be used during the dashboard creation to visualize the data more effectively.**

In [13]:
# Function to categorize age into groups

def age_group(age):
    if age < 30:
        return '18-29'
    elif age < 40:
        return '30-39'
    elif age < 50:
        return '40-49'
    elif age < 60:
        return '50-59'
    elif age >= 60:
        return '60+'

# Function to categorize BMI into groups
def bmi_category(bmi):
    if bmi < 18:
        return 'Underweight'
    elif 18 <= bmi < 25:
        return 'Normal weight'
    elif 25 <= bmi < 30:
        return 'Overweight'
    elif 30 <= bmi < 35:
        return 'Obesity I'
    elif 35 <= bmi < 40:
        return 'Obesity II'
    else:
        return 'Obesity III'

df = df.pipe(
    lambda x: x.assign(
        age_group=x['age'].apply(age_group),
        bmi_category=x['bmi'].apply(bmi_category)
    )
)

df.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category,age_group,bmi_category
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1,50-59,Normal weight
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2,50-59,Obesity I
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1,50-59,Normal weight
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2,40-49,Overweight
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal,40-49,Normal weight
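The row-by-row functions above can also be expressed as vectorized binning. The following is a sketch that assumes the same bin edges and labels as `age_group()` and `bmi_category()`; it uses `pd.cut` as an alternative technique, and the `sample` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Same bin edges and labels as age_group() and bmi_category() above
AGE_EDGES = [-np.inf, 30, 40, 50, 60, np.inf]
AGE_LABELS = ['18-29', '30-39', '40-49', '50-59', '60+']
BMI_EDGES = [-np.inf, 18, 25, 30, 35, 40, np.inf]
BMI_LABELS = ['Underweight', 'Normal weight', 'Overweight',
              'Obesity I', 'Obesity II', 'Obesity III']

# Small made-up sample, not taken from the dataset
sample = pd.DataFrame({'age': [29, 47, 50, 60], 'bmi': [17.9, 23.0, 34.9, 40.0]})

# right=False makes each bin closed on the left: [30, 40), [40, 50), ...
binned = sample.assign(
    age_group=pd.cut(sample['age'], bins=AGE_EDGES, labels=AGE_LABELS, right=False),
    bmi_category=pd.cut(sample['bmi'], bins=BMI_EDGES, labels=BMI_LABELS, right=False),
)
print(binned)
```

One difference from the functions above is that `pd.cut` returns ordered categorical columns, which keeps the groups in their natural order in plots and tables.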


**I will further simplify the `age_group` and `bmi_category` columns for better readability and for hypothesis testing.**

In [14]:
def simplify_age_group(group):
    if group in ['18-29', '30-39']:
        return 'Adult'
    elif group in ['40-49', '50-59']:
        return 'Middle-aged'
    else:
        return 'Senior'

def simplify_bmi_category(category):
    if category in ['Underweight', 'Normal weight']:
        return 'Low BMI'
    elif category in ['Overweight', 'Obesity I']:
        return 'Mid BMI'
    else:
        return 'High BMI'

df = df.pipe(
    lambda x: x.assign(
        age_simp_group=x['age_group'].apply(simplify_age_group),
        bmi_simp_cat=x['bmi_category'].apply(simplify_bmi_category)
    )
)

df.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category,age_group,bmi_category,age_simp_group,bmi_simp_cat
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1,50-59,Normal weight,Middle-aged,Low BMI
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2,50-59,Obesity I,Middle-aged,Mid BMI
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1,50-59,Normal weight,Middle-aged,Low BMI
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2,40-49,Overweight,Middle-aged,Mid BMI
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal,40-49,Normal weight,Middle-aged,Low BMI


**The cardiovascular disease dataset includes features that are relevant to diabetes risk analysis, but it contains no diabetes risk measure itself. Since my focus is to understand diabetes risk factors, I will add a new column called `diab_risk_percent`, calculated with the help of ChatGPT, to store the estimated diabetes risk percentage for each individual.**

**Save the cleaned dataset to the `dataset/cleaned` folder as a new CSV file, to be used for the diabetes risk percentage calculation with the help of ChatGPT.**

In [15]:
df.to_csv('dataset/cleaned/cardio_data_processed_clean.csv', index=False)

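**Load the cleaned dataset with the diabetes risk percentage added. Note that this file is read from `dataset/raw`, and that its `bmi_category` labels appear as `Normal`, `Obese I`, `Obese II` and `Obese III` rather than the `Normal weight`, `Obesity I`, `Obesity II` and `Obesity III` labels created above.**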

In [16]:
df_diab = pd.read_csv('dataset/raw/cardio_data_with_diabetes_risk.csv')
print(df_diab.shape)
print(df_diab.info())
print(df_diab.isnull().sum())
df_diab.head()

(68205, 19)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68205 entries, 0 to 68204
Data columns (total 19 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   gender             68205 non-null  int64  
 1   height             68205 non-null  int64  
 2   weight             68205 non-null  float64
 3   ap_hi              68205 non-null  int64  
 4   ap_lo              68205 non-null  int64  
 5   cholesterol        68205 non-null  int64  
 6   gluc               68205 non-null  int64  
 7   smoke              68205 non-null  int64  
 8   alco               68205 non-null  int64  
 9   active             68205 non-null  int64  
 10  cardio             68205 non-null  int64  
 11  age                68205 non-null  int64  
 12  bmi                68205 non-null  float64
 13  bp_category        68205 non-null  object 
 14  age_group          68205 non-null  object 
 15  bmi_category       68205 non-null  object 
 16  age_simp_g

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category,age_group,bmi_category,age_simp_group,bmi_simp_cat,diab_risk_percent
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,16.15
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2,50-59,Obese I,Middle-aged,Mid BMI,57.5
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,18.21
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2,40-49,Overweight,Middle-aged,Mid BMI,19.62
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal,40-49,Normal,Middle-aged,Low BMI,9.76
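Because the file was processed outside this notebook, it can be worth checking what changed between export and reload. The sketch below is an assumption about how such a check could look; `compare_frames` is a hypothetical helper, and the two small frames are made up to mimic the label change visible in the output above.

```python
import pandas as pd


def compare_frames(before, after, expected_new=('diab_risk_percent',)):
    """Summarise how a re-imported DataFrame differs from the one that was exported."""
    shared = [col for col in before.columns if col in after.columns]
    return {
        'same_row_count': len(before) == len(after),
        'missing_columns': [c for c in before.columns if c not in after.columns],
        'unexpected_columns': [c for c in after.columns
                               if c not in before.columns and c not in expected_new],
        # Columns present in both frames whose values are not identical
        'changed_columns': [c for c in shared
                            if not before[c].reset_index(drop=True)
                            .equals(after[c].reset_index(drop=True))],
    }


# Tiny made-up frames that mimic the situation in this notebook
before = pd.DataFrame({'age': [50, 55], 'bmi_category': ['Normal weight', 'Obesity I']})
after = pd.DataFrame({'age': [50, 55], 'bmi_category': ['Normal', 'Obese I'],
                      'diab_risk_percent': [16.15, 57.5]})

report = compare_frames(before, after)
print(report)
```

In this notebook the call would be `compare_frames(df, df_diab)`. Float columns can differ slightly after a CSV round trip, so a column listed under `changed_columns` is a prompt to look closer rather than proof of a problem.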


**With 68205 rows, the dataset may contain duplicate rows, which can distort the analysis. Therefore, I will check whether any duplicate rows exist.**

In [17]:
print(df_diab.duplicated().sum())
df_diab[df_diab.duplicated()]

9


Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category,age_group,bmi_category,age_simp_group,bmi_simp_cat,diab_risk_percent
4319,2,165,65.0,120,80,1,1,0,0,1,0,57,23.875115,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,15.95
12371,1,165,65.0,120,80,1,1,0,0,1,0,53,23.875115,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,15.64
19242,1,160,58.0,120,80,1,1,0,0,1,0,53,22.65625,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,15.08
26274,1,158,58.0,110,70,1,1,0,0,1,0,39,23.233456,Normal,30-39,Normal,Adult,Low BMI,5.47
32212,2,162,56.0,120,80,1,1,0,0,1,1,53,21.338211,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,14.51
42185,1,165,65.0,120,80,1,1,0,0,1,0,55,23.875115,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,16.03
51056,1,161,68.0,120,80,1,1,0,0,1,0,50,26.233556,Hypertension Stage 1,50-59,Overweight,Middle-aged,Mid BMI,32.71
57682,1,170,70.0,120,80,1,1,0,0,1,0,43,24.221453,Hypertension Stage 1,40-49,Normal,Middle-aged,Low BMI,10.33
65927,1,170,70.0,120,80,1,1,0,0,1,0,50,24.221453,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,18.6


**The output shows 9 rows flagged as duplicates. `duplicated()` flags a row only when it matches an earlier row in every column, so each flagged row is an exact copy of a previous record, even though the flagged rows differ from one another in `bmi`, `age` and `diab_risk_percent`. Because the `id` column was dropped earlier, identical measurements may still belong to different individuals. Therefore, I will keep them in the dataset for further analysis.**
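To see how `duplicated()` decides which rows to flag, here is a small sketch on made-up records (not rows from the dataset). By default only the later copy is flagged; passing `keep=False` flags every member of a duplicate group, which makes the matching pairs visible side by side.

```python
import pandas as pd

# Made-up records: rows 0 and 2 are identical in every column
records = pd.DataFrame({
    'gender': [1, 2, 1, 1],
    'height': [165, 170, 165, 165],
    'weight': [65.0, 70.0, 65.0, 66.0],
})

later_copies = records.duplicated()            # flags row 2 only
all_copies = records.duplicated(keep=False)    # flags rows 0 and 2

print(int(later_copies.sum()))
print(records[all_copies])
```

Applied to this notebook, `df_diab[df_diab.duplicated(keep=False)]` would list each flagged row together with the earlier row it matches.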

## Diabetes Risk Category

**Create a diabetes risk category based on the `diab_risk_percent` column, which will be used for hypothesis testing and visualization purposes.**

In [18]:
def diab_risk(percent):
    if percent < 20:
        return 'Low Risk'
    elif 20 <= percent < 50:
        return 'Moderate Risk'
    elif percent >= 50:
        return 'High Risk'
    
df_diab['diab_risk_cat'] = df_diab['diab_risk_percent'].apply(diab_risk)
df_diab.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,bp_category,age_group,bmi_category,age_simp_group,bmi_simp_cat,diab_risk_percent,diab_risk_cat
0,2,168,62.0,110,80,1,1,0,0,1,0,50,21.96712,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,16.15,Low Risk
1,1,156,85.0,140,90,3,1,0,0,1,1,55,34.927679,Hypertension Stage 2,50-59,Obese I,Middle-aged,Mid BMI,57.5,High Risk
2,1,165,64.0,130,70,3,1,0,0,0,1,51,23.507805,Hypertension Stage 1,50-59,Normal,Middle-aged,Low BMI,18.21,Low Risk
3,2,169,82.0,150,100,1,1,0,0,1,1,48,28.710479,Hypertension Stage 2,40-49,Overweight,Middle-aged,Mid BMI,19.62,Low Risk
4,1,156,56.0,100,60,1,1,0,0,0,0,47,23.011177,Normal,40-49,Normal,Middle-aged,Low BMI,9.76,Low Risk


**Take a quick look at the frequency of each category in every categorical column of the dataset using the `value_counts()` method.**

In [19]:
for col in df_diab.select_dtypes(include=['object']).columns:
    print(f"Total number of occurrences of '{col}' column:\n")
    print(df_diab[col].value_counts())
    print("\n")

Total number of occurrences of 'bp_category' column:

bp_category
Hypertension Stage 1    39750
Hypertension Stage 2    15937
Normal                   9417
Elevated                 3101
Name: count, dtype: int64


Total number of occurrences of 'age_group' column:

age_group
50-59    34596
40-49    19173
60+      12683
30-39     1750
18-29        3
Name: count, dtype: int64


Total number of occurrences of 'bmi_category' column:

bmi_category
Normal         25174
Overweight     24591
Obese I        11852
Obese II        4179
Obese III       1778
Underweight      631
Name: count, dtype: int64


Total number of occurrences of 'age_simp_group' column:

age_simp_group
Middle-aged    53769
Senior         12683
Adult           1753
Name: count, dtype: int64


Total number of occurrences of 'bmi_simp_cat' column:

bmi_simp_cat
Mid BMI     36443
Low BMI     25805
High BMI     5957
Name: count, dtype: int64


Total number of occurrences of 'diab_risk_cat' column:

diab_risk_cat
Moderate Risk   

**Encode `age_group`, `bmi_category`, `bp_category`, and `diab_risk_cat` as ordered integers to simplify further analysis.**

In [20]:
encodings = {
    'age_group': {
        '18-29': 0, '30-39': 1, '40-49': 2, '50-59': 3, '60+': 4
    },
    'bmi_category': {
        'Underweight': 0, 'Normal': 1, 'Overweight': 2,
        'Obese I': 3, 'Obese II': 4, 'Obese III': 5
    },
    'bp_category': {
        'Normal': 0, 'Elevated': 1,
        'Hypertension Stage 1': 2, 'Hypertension Stage 2': 3
    },
    'diab_risk_cat': {
        'Low Risk': 0, 'Moderate Risk': 1, 'High Risk': 2
    }
}

# Apply mappings to the DataFrame to create numerical columns
for col, val in encodings.items():
    df_diab[f'{col}_num'] = df_diab[col].map(val)

In [21]:
df_diab.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,...,age_group,bmi_category,age_simp_group,bmi_simp_cat,diab_risk_percent,diab_risk_cat,age_group_num,bmi_category_num,bp_category_num,diab_risk_cat_num
0,2,168,62.0,110,80,1,1,0,0,1,...,50-59,Normal,Middle-aged,Low BMI,16.15,Low Risk,3,1,2,0
1,1,156,85.0,140,90,3,1,0,0,1,...,50-59,Obese I,Middle-aged,Mid BMI,57.5,High Risk,3,3,3,2
2,1,165,64.0,130,70,3,1,0,0,0,...,50-59,Normal,Middle-aged,Low BMI,18.21,Low Risk,3,1,2,0
3,2,169,82.0,150,100,1,1,0,0,1,...,40-49,Overweight,Middle-aged,Mid BMI,19.62,Low Risk,2,2,3,0
4,1,156,56.0,100,60,1,1,0,0,0,...,40-49,Normal,Middle-aged,Low BMI,9.76,Low Risk,2,1,0,0
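With a dictionary and `.map()`, any label that is missing from the dictionary silently becomes `NaN`. The sketch below shows one possible safeguard using an ordered `pd.Categorical`; `encode_ordered` is a hypothetical helper, not part of the original notebook, and the sample values are made up.

```python
import pandas as pd


def encode_ordered(series, order):
    """Map labels to 0..n-1 following `order`; raise if a label is not listed."""
    codes = pd.Categorical(series, categories=order, ordered=True).codes
    # Labels outside `order` receive the code -1 (so do missing values)
    unknown = sorted(set(series[codes == -1].dropna()))
    if unknown:
        raise ValueError(f"Labels missing from the encoding: {unknown}")
    return pd.Series(codes, index=series.index, name=f"{series.name}_num")


# Made-up sample values
risk = pd.Series(['Low Risk', 'High Risk', 'Moderate Risk'], name='diab_risk_cat')
encoded = encode_ordered(risk, ['Low Risk', 'Moderate Risk', 'High Risk'])
print(encoded.tolist())
```

This kind of check would catch a label mismatch such as `Normal weight` versus `Normal` before it turns into missing codes.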


## Skewness and Kurtosis

***Skewness – Asymmetry of the Distribution***

**Definition:** Skewness measures the asymmetry of the probability distribution of a real-valued random variable.

**Types:**
- Zero skewness: Data is symmetrically distributed (like a normal distribution).
- Positive skew: Tail on the right side is longer or fatter; mean > median.
- Negative skew: Tail on the left side is longer or fatter; mean < median.


***Kurtosis – Tailedness of the Distribution***

**Definition:** Kurtosis measures the "tailedness" or the extremity of outliers in the data.

**Types:**
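- Mesokurtic (kurtosis ≈ 3): Normal distribution.
- Leptokurtic (kurtosis > 3): Heavy tails; more outliers.
- Platykurtic (kurtosis < 3): Light tails; fewer outliers.

Note that pandas' `.kurtosis()` reports *excess* kurtosis (kurtosis minus 3), so in the tables below a normal distribution corresponds to a value of about 0, positive values indicate heavy tails, and negative values indicate light tails.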
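The two kurtosis conventions can be compared on simulated data. This is a sketch on random samples, not the notebook's data: pandas' `.skew()` and `.kurtosis()` are applied to a normal and an exponential sample, and 3 is added to pandas' excess kurtosis to recover the "normal = 3" scale.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Simulated samples, not the notebook's data
normal_sample = pd.Series(rng.normal(size=100_000))
skewed_sample = pd.Series(rng.exponential(size=100_000))

summary = pd.DataFrame({
    'skewness': [normal_sample.skew(), skewed_sample.skew()],
    'excess_kurtosis': [normal_sample.kurtosis(), skewed_sample.kurtosis()],
}, index=['normal', 'exponential'])

# Adding 3 converts pandas' excess kurtosis to the "normal = 3" scale
summary['kurtosis_plus_3'] = summary['excess_kurtosis'] + 3
print(summary.round(2))
```

The normal sample should come out close to zero on both measures, while the exponential sample shows a clear right skew and heavy tails.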

**I will focus on certain numerical columns from the dataset for skewness and kurtosis analysis, which will help in understanding the distribution of the data and identifying potential outliers.**

In [22]:
num_cols = df_diab[['gender', 'weight', 'active', 'bmi']]

output = []
for col in num_cols.columns:
    skew = num_cols[col].skew()
    kurt = num_cols[col].kurtosis()
    output.append({
        'Column': col,
        'Skewness': skew,
        'Kurtosis': kurt
    })
output_df = pd.DataFrame(output)
output_df

Unnamed: 0,Column,Skewness,Kurtosis
0,gender,0.635327,-1.596406
1,weight,1.00581,2.557139
2,active,-1.528034,0.334898
3,bmi,7.818718,230.5905


Based on the skewness and kurtosis values, I can determine the following:

| Feature                  | Skewness | Kurtosis | Interpretation                                                                                             |
| ------------------------ | -------- | -------- | ---------------------------------------------------------------------------------------------------------- |
| **gender**               | 0.64     | -1.60    | Binary code (1 or 2); the positive skew reflects that code 1 is more frequent than code 2 (mean ≈ 1.35).  |
| **weight**               | 1.01     | 2.56     | Right-skewed; heavier-than-normal tails → most individuals sit near the centre, with some heavy outliers. |
| **active**               | -1.53    | 0.33     | Binary flag (0 or 1); the negative skew reflects that most individuals are recorded as active.            |
| **bmi**                  | 7.82     | 230.59   | **Extremely** right-skewed; **very high kurtosis** → majority with normal BMI, some extreme obesity cases. |

The `weight` column has moderate right skewness and positive kurtosis, indicating that while most individuals have a weight close to the mean, there are some heavier outliers. Hence, I will apply a **log transformation** to the `weight` column to reduce skewness and bring the distribution closer to normal.

The `active` column is left-skewed, indicating that most individuals are active. Since the data in this column is binary (0 or 1), I will not apply any transformation to it.

The `bmi` column shows extremely high skewness and kurtosis, indicating that most individuals have a BMI near the normal range while a few have extreme values. Therefore, I will apply **Winsorization** to the `bmi` column to limit the impact of extreme values and bring the distribution closer to normal.

***Log transformation*** is a technique used to reduce skewness in data by applying a logarithmic function to the values. It is particularly useful for right-skewed data, where a few extreme values can disproportionately affect the mean and variance.

***Winsorization*** is a statistical technique used to limit the impact of extreme values (outliers) in a dataset by replacing them with less extreme values. It is particularly useful for reducing the influence of outliers on statistical analyses.
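Both ideas can be illustrated with plain pandas and NumPy. This is a sketch on made-up values, not the feature_engine implementation: it assumes that Gaussian-style capping means clipping to the mean plus or minus `fold` standard deviations, which is my reading of the `capping_method='gaussian'` option.

```python
import numpy as np
import pandas as pd

# Made-up values with one extreme observation
bmi = pd.Series([20.0, 22.0, 24.0, 26.0, 28.0, 90.0])
weight = pd.Series([62.0, 85.0, 64.0])

# Log transformation: natural logarithm of each (strictly positive) value
log_weight = np.log(weight)

# Gaussian-style capping: clip to mean ± fold * standard deviation
fold = 1.5
lower = bmi.mean() - fold * bmi.std()
upper = bmi.mean() + fold * bmi.std()
capped_bmi = bmi.clip(lower=lower, upper=upper)

print(log_weight.round(6).tolist())
print(capped_bmi.round(2).tolist())
```

Only the extreme value of 90 is pulled back to the upper limit; the other values are left unchanged, which is why capping reduces skewness and kurtosis without affecting the bulk of the data.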

## Feature Engineering

**Applying LogTransformer and Winsorizer transformations to the `weight` and `bmi` columns respectively for the purpose of reducing skewness and making the distributions more normal.**

In [23]:
pipe = Pipeline([
    ('right_skewed', vt.LogTransformer(variables=['weight'])),
    ('handle_outliers', Winsorizer(
        capping_method='gaussian',
        tail='both',
        fold=1.5,
        variables=['bmi'])
    )
])

df_trans = pipe.fit_transform(df_diab)
df_trans.head()

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,...,age_group,bmi_category,age_simp_group,bmi_simp_cat,diab_risk_percent,diab_risk_cat,age_group_num,bmi_category_num,bp_category_num,diab_risk_cat_num
0,2,168,4.127134,110,80,1,1,0,0,1,...,50-59,Normal,Middle-aged,Low BMI,16.15,Low Risk,3,1,2,0
1,1,156,4.442651,140,90,3,1,0,0,1,...,50-59,Obese I,Middle-aged,Mid BMI,57.5,High Risk,3,3,3,2
2,1,165,4.158883,130,70,3,1,0,0,0,...,50-59,Normal,Middle-aged,Low BMI,18.21,Low Risk,3,1,2,0
3,2,169,4.406719,150,100,1,1,0,0,1,...,40-49,Overweight,Middle-aged,Mid BMI,19.62,Low Risk,2,2,3,0
4,1,156,4.025352,100,60,1,1,0,0,0,...,40-49,Normal,Middle-aged,Low BMI,9.76,Low Risk,2,1,0,0


**After applying the transformations, I will check the skewness and kurtosis of the `weight` and `bmi` columns again to confirm that the transformations have effectively reduced skewness and made the distributions more normal.**

In [24]:
num_cols = df_trans[['gender', 'weight', 'active', 'bmi']]

output = []
for col in num_cols.columns:
    skew = num_cols[col].skew()
    kurt = num_cols[col].kurtosis()
    output.append({
        'Column': col,
        'Skewness': skew,
        'Kurtosis': kurt
    })
output_df = pd.DataFrame(output)
output_df

Unnamed: 0,Column,Skewness,Kurtosis
0,gender,0.635327,-1.596406
1,weight,0.219132,0.789084
2,active,-1.528034,0.334898
3,bmi,0.500621,-0.534195


Output of the skewness and kurtosis after applying the transformations:

| Column     | Skewness | Kurtosis | Interpretation                                                                                                             |
| ---------- | -------- | -------- | -------------------------------------------------------------------------------------------------------------------------- |
| **weight** | 0.22     | 0.79     | Slight right skew; nearly normal with mild tail weight. Distribution is fairly symmetric with a few heavier outliers.      |
| **bmi**    | 0.50     | -0.53    | Mild right skew; light tails. BMI is somewhat concentrated around central values, with slightly more high values than low. |

Now, both `weight` and `bmi` columns have much lower skewness and kurtosis, indicating that the transformations have brought the distributions closer to normal.
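Since the same skewness and kurtosis table is built before and after the transformation, the loop could be wrapped in a small helper. The sketch below is one way to do that; `skew_kurt_table` is a hypothetical name and the `toy` frame is made up.

```python
import pandas as pd


def skew_kurt_table(frame, columns):
    """Return one row per column with its skewness and excess kurtosis."""
    table = frame[columns].agg(['skew', 'kurt']).T
    table.columns = ['Skewness', 'Kurtosis']
    return table.rename_axis('Column').reset_index()


# Made-up data: one symmetric column and one with a long right tail
toy = pd.DataFrame({
    'symmetric': [1, 2, 3, 4, 5],
    'right_tailed': [1, 1, 1, 1, 10],
})

result = skew_kurt_table(toy, ['symmetric', 'right_tailed'])
print(result)
```

In this notebook, `skew_kurt_table(df_diab, cols)` and `skew_kurt_table(df_trans, cols)` with `cols = ['gender', 'weight', 'active', 'bmi']` would reproduce the two tables.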

## General Correlation Analysis
**I will use the `.corr()` method to calculate the correlation matrix of the numerical columns, which will help identify relationships between different features. To make these relationships easier to read, I will display the matrix as a heatmap-style table with a color gradient.**

In [25]:
df_trans.select_dtypes(exclude=['object']).corr().style.background_gradient(cmap='coolwarm')

Unnamed: 0,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio,age,bmi,diab_risk_percent,age_group_num,bmi_category_num,bp_category_num,diab_risk_cat_num
gender,1.0,0.498323,0.165669,0.060722,0.066126,-0.037397,-0.021822,0.337758,0.170719,0.005206,0.006098,-0.024098,-0.104468,-0.184671,-0.023432,-0.105879,0.074075,-0.163071
height,0.498323,1.0,0.313465,0.018544,0.03555,-0.050957,-0.019317,0.187543,0.094354,-0.008241,-0.011276,-0.081974,-0.20296,-0.253454,-0.074919,-0.209343,0.053381,-0.224098
weight,0.165669,0.313465,1.0,0.270521,0.253794,0.136232,0.102411,0.068207,0.067159,-0.017951,0.180582,0.058426,0.84438,0.708358,0.047107,0.805683,0.241326,0.65973
ap_hi,0.060722,0.018544,0.270521,1.0,0.731812,0.19533,0.093151,0.026032,0.032536,-0.001409,0.433802,0.211314,0.27091,0.284786,0.189577,0.255769,0.729341,0.263127
ap_lo,0.066126,0.03555,0.253794,0.731812,1.0,0.161637,0.073319,0.023836,0.036212,-0.001234,0.3429,0.155777,0.243676,0.242577,0.139415,0.227546,0.826389,0.226184
cholesterol,-0.037397,-0.050957,0.136232,0.19533,0.161637,1.0,0.450452,0.00957,0.034184,0.008658,0.220778,0.154738,0.174839,0.203825,0.137851,0.169006,0.142204,0.183247
gluc,-0.021822,-0.019317,0.102411,0.093151,0.073319,0.450452,1.0,-0.006109,0.009379,-0.008003,0.088905,0.098212,0.117688,0.134125,0.086678,0.114247,0.071702,0.118581
smoke,0.337758,0.187543,0.068207,0.026032,0.023836,0.00957,-0.006109,1.0,0.338226,0.024999,-0.016567,-0.048089,-0.031789,-0.064884,-0.04546,-0.025225,0.018497,-0.058445
alco,0.170719,0.094354,0.067159,0.032536,0.036212,0.034184,0.009379,0.338226,1.0,0.024339,-0.009038,-0.029052,0.018704,-0.005956,-0.030775,0.020728,0.023674,-0.00377
active,0.005206,-0.008241,-0.017951,-0.001409,-0.001234,0.008658,-0.008003,0.024999,0.024339,1.0,-0.037944,-0.010385,-0.013297,-0.01488,-0.012073,-0.013238,-0.006803,-0.011105


Considering the target `diab_risk_percent` column, the correlation matrix shows a strong correlation with `weight` (0.71) and weaker but noticeable correlations with blood pressure (`ap_hi` 0.28, `ap_lo` 0.24), `height` (-0.25) and `cholesterol` (0.20). In contrast, the `active` column has a correlation close to zero (about -0.01), as does `alco`. Although this suggests that the level of physical activity has little linear association with diabetes risk in this dataset, it seems counterintuitive, as physical activity is generally considered a key factor in diabetes risk. This could be due to the nature of the dataset or to confounding factors not captured in the data.
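To read the target's correlations without scanning the whole matrix, the target column can be extracted and ranked by absolute value. This is a sketch on made-up data; the column names `target`, `a`, `b` and `c` are placeholders.

```python
import pandas as pd

# Made-up data: 'a' tracks the target exactly, 'c' is unrelated to it
toy = pd.DataFrame({
    'target': [1, 2, 3, 4, 5],
    'a': [2, 4, 6, 8, 10],
    'b': [5, 3, 4, 1, 2],
    'c': [1, -1, 0, -1, 1],
})

ranked = (
    toy.corr()['target']
    .drop('target')                                       # remove the self-correlation
    .sort_values(key=lambda s: s.abs(), ascending=False)  # strongest first, sign kept
)
print(ranked.round(2))
```

For this notebook the equivalent starting point would be `df_trans.select_dtypes(exclude=['object']).corr()['diab_risk_percent']`.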

Therefore, I will perform an **A/B test** to compare the diabetes risk percentage of active and inactive individuals. This will help determine whether there is a statistically significant difference in diabetes risk based on physical activity levels.
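The notebook does not state which test will be used, so the following is only a sketch of one common choice for comparing two group means, Welch's t-test, written with the standard library. The function name `welch_t` and the sample values are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    se_a = variance(group_a) / n_a   # squared standard error of each mean
    se_b = variance(group_b) / n_b
    t_stat = (mean(group_a) - mean(group_b)) / sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof


# Made-up risk percentages, NOT values from the dataset
active = [10.0, 12.0, 14.0, 16.0]
inactive = [20.0, 22.0, 24.0, 26.0]

t_stat, dof = welch_t(active, inactive)
print(round(t_stat, 3), round(dof, 1))
```

Turning the statistic into a p-value requires the t distribution; `scipy.stats.ttest_ind(active, inactive, equal_var=False)` performs the same test and returns the p-value directly.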

**Saving the final cleaned dataset with the transformations applied and the diabetes risk percentage calculated as a new CSV file in the cleaned folder.**

In [26]:
df_trans.to_csv('dataset/cleaned/cardio_data_with_diabetes_risk_clean.csv', index=False)