[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mobadara/cardiovascular-disease-risk-prediction/blob/main/notebooks/exploratory-data-analysis.ipynb)
[![Kaggle Notebook](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/new?source=https://github.com/mobadara/cardiovascular-disease-risk-prediction/blob/main/notebooks/exploratory-data-analysis.ipynb)
[![Python](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/)
[![Jupyter](https://img.shields.io/badge/Jupyter-%23F37626.svg?style=for-the-badge&logo=jupyter&logoColor=white)](https://jupyter.org/)

# **Exploratory Data Analysis (EDA) - Cardiovascular Disease Risk Prediction**

## **Introduction**
This notebook performs an in-depth exploratory data analysis on the Cardiovascular Disease Dataset to understand its structure, distributions, relationships between variables, and identify potential issues for machine learning model development.

## **Data Acquisition and Environment Setup**
The first cell below imports the required libraries and silences warnings. The second cell uses `kagglehub.dataset_download` to fetch the dataset and copies `cardio_train.csv` into the `../data` folder as `cardio-train.csv`. `kagglehub` must already be installed in the environment. Loading the file into a pandas DataFrame happens in the next section.

In [4]:
import os
import shutil
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import kagglehub
import warnings

warnings.filterwarnings('ignore')

In [6]:
# Download latest version
os.makedirs('../data', exist_ok=True)
path = kagglehub.dataset_download('sulianova/cardiovascular-disease-dataset')
# Copy the dataset into `../data` folder to match
# the project structure on GitHub, and rename the file to match
# my personal naming style.
for file_name in os.listdir(path):
    if file_name.endswith('cardio_train.csv'):
        shutil.copyfile(os.path.join(path, file_name), '../data/cardio-train.csv')
# Remove any stale copy saved under the original file name.
try:
    os.remove('../data/cardio_train.csv')
except FileNotFoundError:
    pass

## **Data Loading**
The subsequent code cell loads the acquired dataset into a pandas DataFrame (`df`). Displaying a few initial samples allows for a quick inspection of the data's structure and content.

In [9]:
df = pd.read_csv('../data/cardio-train.csv', sep=';')
df.sample(10)

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
56009,79897,18280,1,156,86.0,120,80,1,1,0,0,1,0
61144,87285,14363,2,168,64.0,120,75,1,1,1,0,1,0
18334,26182,18158,1,150,82.0,120,80,2,1,0,0,1,1
34362,49094,20282,2,174,75.0,120,80,2,1,1,1,1,0
44578,63672,15266,2,180,80.0,120,80,1,1,0,0,1,0
29365,41965,18227,1,162,101.0,160,100,1,1,0,0,0,0
65621,93658,18937,1,160,52.0,120,80,1,1,0,0,1,1
1555,2200,20583,2,170,70.0,120,80,1,1,0,0,1,0
17638,25201,18906,1,169,59.0,120,80,1,1,0,0,0,0
44881,64091,22132,1,166,57.0,120,80,1,1,0,0,1,0


The random sample of 10 rows displayed above gives a glimpse into the variety of patient data. Because `df.sample` draws rows at random, the rows shown will differ between runs. The features are `id`, `age` (in days), `gender`, `height`, `weight`, blood pressure (`ap_hi`, `ap_lo`), cholesterol and glucose levels (`cholesterol`, `gluc`), lifestyle factors (`smoke`, `alco`, `active`), and the target variable `cardio` (indicating the presence or absence of cardiovascular disease). Observing these samples helps to understand the scale and potential distribution of the different features.

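Because `age` is stored in days, it can be easier to read once converted to years. The following is a minimal sketch on a small hand-made frame. The `age_years` column name and the 365.25-day year are assumptions made for illustration, not part of the original dataset or analysis.

```python
import pandas as pd

# Toy frame using three `age` values (in days) taken from the sample above.
toy = pd.DataFrame({'age': [18280, 14363, 22132]})

# Assumption: an average year of 365.25 days.
# `age_years` is a hypothetical column name used only in this sketch.
DAYS_PER_YEAR = 365.25
toy['age_years'] = (toy['age'] / DAYS_PER_YEAR).round(1)

print(toy)
```

The same expression could be applied to `df['age']` if a years-based column is wanted for plots or summaries.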
## **Data Inspection**

The following cells give a concise overview of the dataset: the number of rows and columns, the data type of each column, and the count of missing values per column. This initial inspection is crucial for understanding the dataset's structure and identifying data quality issues that may need to be addressed during preprocessing.

In [10]:
print(df.shape)

(70000, 13)


In [11]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70000 entries, 0 to 69999
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           70000 non-null  int64  
 1   age          70000 non-null  int64  
 2   gender       70000 non-null  int64  
 3   height       70000 non-null  int64  
 4   weight       70000 non-null  float64
 5   ap_hi        70000 non-null  int64  
 6   ap_lo        70000 non-null  int64  
 7   cholesterol  70000 non-null  int64  
 8   gluc         70000 non-null  int64  
 9   smoke        70000 non-null  int64  
 10  alco         70000 non-null  int64  
 11  active       70000 non-null  int64  
 12  cardio       70000 non-null  int64  
dtypes: float64(1), int64(12)
memory usage: 6.9 MB
None


In [12]:
df.isna().sum()

Unnamed: 0,0
id,0
age,0
gender,0
height,0
weight,0
ap_hi,0
ap_lo,0
cholesterol,0
gluc,0
smoke,0


The dataset consists of **70,000 rows** and **13 columns**, as confirmed by the shape of the DataFrame. The output of `df.info()` reveals that there are **no missing values** in any of the columns, as the "Non-Null Count" is 70,000 for each. The data types consist primarily of integers (`int64`), with one column (`weight`) being a float (`float64`). This information indicates a relatively clean dataset in terms of missingness and provides the foundational data types for each feature.

Since there are no missing values, it is not necessary to inspect their distribution with a heatmap.

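The missing-value check can also be condensed into a single summary table. This is a minimal sketch, assuming a hypothetical helper named `missing_summary`, demonstrated on a toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    percent = (counts / len(frame) * 100).round(2)
    return pd.DataFrame({'missing': counts, 'percent': percent})

# Toy frame with one deliberately missing weight.
toy = pd.DataFrame({'height': [156, 168, 150, 174],
                    'weight': [86.0, np.nan, 82.0, 75.0]})
summary = missing_summary(toy)
print(summary)
```

Called as `missing_summary(df)` on the real data, the helper would be expected to report zeros throughout, in line with the `df.info()` output above.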
## **Summary Statistics**

In [16]:
df.describe(include='all')

Unnamed: 0,id,age,gender,height,weight,ap_hi,ap_lo,cholesterol,gluc,smoke,alco,active,cardio
count,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0,70000.0
mean,49972.4199,19468.865814,1.349571,164.359229,74.20569,128.817286,96.630414,1.366871,1.226457,0.088129,0.053771,0.803729,0.4997
std,28851.302323,2467.251667,0.476838,8.210126,14.395757,154.011419,188.47253,0.68025,0.57227,0.283484,0.225568,0.397179,0.500003
min,0.0,10798.0,1.0,55.0,10.0,-150.0,-70.0,1.0,1.0,0.0,0.0,0.0,0.0
25%,25006.75,17664.0,1.0,159.0,65.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
50%,50001.5,19703.0,1.0,165.0,72.0,120.0,80.0,1.0,1.0,0.0,0.0,1.0,0.0
75%,74889.25,21327.0,2.0,170.0,82.0,140.0,90.0,2.0,1.0,0.0,0.0,1.0,1.0
max,99999.0,23713.0,2.0,250.0,200.0,16020.0,11000.0,3.0,3.0,1.0,1.0,1.0,1.0


The `df.describe(include='all')` call returns descriptive statistics for every column. Because all 13 columns are numeric, the output contains only the numerical summaries: the count, mean, standard deviation, minimum, 25th percentile, 50th percentile (median), 75th percentile, and maximum of each feature.

Observations from the descriptive statistics:

* **Age:** The average age is approximately **19,469** days (about **53 years**), ranging from about **29.6** to **64.9** years (10,798 to 23,713 days).
* **Height:** The average height is about **164 cm**, but the minimum (**55 cm**) and maximum (**250 cm**) are implausible for adults and warrant further investigation.
* **Weight:** The average weight is about **74 kg**, with a range from **10 kg** to **200 kg**. The minimum in particular is implausible for an adult and points to outliers or entry errors.
* **Blood Pressure (`ap_hi`, `ap_lo`):** The mean systolic pressure (`ap_hi`) is about **129** and the mean diastolic pressure (`ap_lo`) is about **97**, while the medians are **120** and **80**. The standard deviations (about 154 and 188) are very large, and the negative minimums (**-150** and **-70**) and extreme maximums (**16,020** and **11,000**) point to data entry errors or outliers that inflate the means.
* **Cholesterol and Glucose (`cholesterol`, `gluc`):** These are categorical features encoded as numerical values **(1, 2, 3)**. The statistics show the distribution of these levels.
* **Lifestyle Factors (`smoke`, `alco`, `active`):** These are binary indicators (0 or 1), and the means indicate the proportion of individuals in each category. For example, the mean of 'smoke' being **~0.088** suggests that about **8.8%** of the individuals smoke.
* **Cardio:** The target variable has a mean of approximately **0.5**, indicating a roughly balanced dataset in terms of the presence (1) or absence (0) of cardiovascular disease.

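The suspicious blood pressure extremes noted above can be counted before deciding how to treat them. The sketch below is illustrative only: the plausibility bounds are assumptions chosen for demonstration, not clinical thresholds or values from the original analysis, and the toy frame mixes readings from the sample rows with the minimums and maximums reported by `df.describe()`.

```python
import pandas as pd

# Toy rows: typical readings plus the extreme values reported by df.describe().
toy = pd.DataFrame({
    'ap_hi': [120, 160, -150, 16020],
    'ap_lo': [80, 100, -70, 11000],
})

# Assumed plausibility bounds (inclusive), for illustration only.
bounds = {'ap_hi': (60, 250), 'ap_lo': (30, 200)}

# Count the rows that fall outside the assumed range for each column.
out_of_range = {
    col: int((~toy[col].between(low, high)).sum())
    for col, (low, high) in bounds.items()
}
print(out_of_range)

# Flag rows where diastolic pressure is not below systolic pressure.
inverted = toy['ap_lo'] >= toy['ap_hi']
print(int(inverted.sum()))
```

Counting first, rather than dropping rows immediately, keeps the decision about how to handle these values (removal, clipping, or correction) separate from their detection.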

## **Categorical Columns**

In [23]:
categorical_cols = ['gender', 'cholesterol', 'gluc', 'smoke', 'alco', 'active', 'cardio']
for col in categorical_cols:
    # Share of rows at each level of the column
    print(df[col].value_counts(normalize=True))
    print()

gender
1    0.650429
2    0.349571
Name: proportion, dtype: float64

cholesterol
1    0.748357
2    0.136414
3    0.115229
Name: proportion, dtype: float64

gluc
1    0.849700
3    0.076157
2    0.074143
Name: proportion, dtype: float64

smoke
0    0.911871
1    0.088129
Name: proportion, dtype: float64

alco
0    0.946229
1    0.053771
Name: proportion, dtype: float64

active
1    0.803729
0    0.196271
Name: proportion, dtype: float64

cardio
0    0.5003
1    0.4997
Name: proportion, dtype: float64



The value counts for the categorical columns encoded as integers are as follows:

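* **Gender:**
    Code 1 accounts for about 65.0% of the patients and code 2 for about 35.0%.

* **Cholesterol:**
    Level 1 is the most frequent (about 74.8%), followed by level 2 (about 13.6%) and level 3 (about 11.5%).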

* **Glucose:**
    Similar to cholesterol, level 1 is the most frequent, accounting for about 85.0% of the patients. Levels 3 and 2 follow at about 7.6% and 7.4%.

* **Smoke:**
    The majority of individuals in the dataset do not smoke (0). About 8.8% of the patients smoke.

* **Alcohol Consumption:**
    About 94.6% of the individuals do not consume alcohol (0); only about 5.4% do.

* **Physical Activity:**
    About 80.4% of the patients report being physically active (1), and about 19.6% do not.

* **Cardio (Target Variable):**
    The target variable, indicating the presence (1) or absence (0) of cardiovascular disease, is almost perfectly balanced: about 50.03% of the patients are in class 0 and 49.97% in class 1.

These value counts provide insights into the distribution of these categorical features within the dataset. For `cholesterol` and `gluc`, level 1 is notably more prevalent. For the lifestyle factors, the dataset shows a higher proportion of non-smokers, non-drinkers, and physically active individuals. The target variable `cardio` is almost perfectly balanced, which is beneficial for training a classification model.

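For reporting, the proportions can be gathered into one tidy table of percentages instead of being printed series by series. This is a minimal sketch on toy data, assuming a hypothetical helper named `proportion_table`:

```python
import pandas as pd

def proportion_table(frame: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Stack per-column value proportions (in percent) into one long table."""
    rows = []
    for col in columns:
        shares = frame[col].value_counts(normalize=True).sort_index()
        for level, share in shares.items():
            rows.append({'feature': col, 'level': level,
                         'percent': round(share * 100, 1)})
    return pd.DataFrame(rows)

# Toy frame with the same 0/1 encoding as `smoke` and `cardio`.
toy = pd.DataFrame({'smoke': [0, 0, 0, 1], 'cardio': [0, 1, 0, 1]})
table = proportion_table(toy, ['smoke', 'cardio'])
print(table)
```

A long table of this shape is easy to filter by feature or to pass to a plotting function.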
## **Univariate Analysis**