# CIS 3252.04 - Final Project - Cardiovascular Heart Disease
## by Jeremy Castro
### Dataset: [https://www.kaggle.com/datasets/thedevastator/exploring-risk-factors-for-cardiovascular-diseas](https://www.kaggle.com/datasets/thedevastator/exploring-risk-factors-for-cardiovascular-diseas)

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
df = pd.read_csv('heart_data.csv')

In [None]:
df.head()

In [None]:
df.info()

## Data Taxonomy and Variable Nature
- age: (int64), **ratio** because it is numerical and has a natural 0 value
- gender: (int64), **nominal** because it is categorical and does not have a natural order
- height: (int64), **ratio** because it is numerical and has a natural 0 value
- weight: (float64), **ratio** because it is numerical and has a natural 0 value
- ap_hi: (int64), **ratio** because it is numerical and has a natural 0 value
- ap_lo: (int64), **ratio** because it is numerical and has a natural 0 value
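- cholesterol: (int64), **ordinal** because it is coded as ordered categories (1 = normal, 2 = above normal, 3 = well above normal) rather than a measured quantity
- gluc: (int64), **ordinal** because it is coded as ordered categories (1 = normal, 2 = above normal, 3 = well above normal) rather than a measured quantity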
- smoke: (int64), **nominal** because it is categorical and does not have a natural order
- alco: (int64), **nominal** because it is categorical and does not have a natural order
- active: (int64), **nominal** because it is categorical and does not have a natural order
- cardio: (int64), **nominal** because it is categorical and does not have a natural order
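
One way to sanity-check these classifications is to count the distinct values each column takes: a column with only a handful of integer codes is more likely categorical than a true measurement. The sketch below uses a small made-up frame (the values are illustrative, not read from `heart_data.csv`), and `distinct_value_summary` is a hypothetical helper name.

```python
import pandas as pd

# Illustrative rows only; not real records from heart_data.csv
sample = pd.DataFrame({
    'age': [18393, 20228, 18857, 17623],
    'gender': [2, 1, 1, 2],
    'cholesterol': [1, 3, 3, 1],
    'smoke': [0, 0, 0, 1],
})

def distinct_value_summary(frame):
    """Return the dtype and number of distinct values for each column."""
    return pd.DataFrame({
        'dtype': frame.dtypes.astype(str),
        'n_unique': frame.nunique(),
    })

summary = distinct_value_summary(sample)
print(summary)
```

On the full dataset, the same call would be `distinct_value_summary(df)`; columns with very few distinct values are the ones worth checking against the dataset description.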

In [None]:
df.describe()

In [None]:
for col in df.columns:
    print(df[col].value_counts())

In [None]:
df.isnull().sum()

# Data Cleaning: Remove Unnecessary Columns

In [None]:
df.drop(['index', 'id'], axis=1, inplace=True)

# Data Cleaning: Clean `ap_hi` and `ap_lo`

In [None]:
# Print value counts high end
df['ap_hi'].value_counts().sort_index(ascending=False).head(60)

In [None]:
# Print value counts low end
df['ap_hi'].value_counts().sort_index(ascending=False).tail(20)

In [None]:
# Print value counts high end
df['ap_lo'].value_counts().sort_index(ascending=False).head(60)

In [None]:
# Print value counts low end
df['ap_lo'].value_counts().sort_index(ascending=False).tail(20)

In [None]:
# Drop rows containing ap_hi or ap_lo outside range of 0 to 300
df = df[(df['ap_hi'] > 0) & (df['ap_hi'] <= 300)]
df = df[(df['ap_lo'] > 0) & (df['ap_lo'] <= 300)]

In [None]:
# Verify data
df[['ap_hi', 'ap_lo']].describe()
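
The range filter above can also be wrapped in a small helper so that the same bounds are applied to both columns in one step. This is only a sketch under the same 0 to 300 assumption; `filter_blood_pressure` and the sample readings are hypothetical.

```python
import pandas as pd

def filter_blood_pressure(frame, low=0, high=300):
    """Keep rows where ap_hi and ap_lo both satisfy low < value <= high."""
    mask = (
        frame['ap_hi'].between(low, high, inclusive='right')
        & frame['ap_lo'].between(low, high, inclusive='right')
    )
    return frame[mask]

# Made-up readings, including implausible ones
sample = pd.DataFrame({
    'ap_hi': [120, -150, 14020, 130, 300],
    'ap_lo': [80, 90, 90, 0, 100],
})

cleaned = filter_blood_pressure(sample)
print(cleaned)
```

`inclusive='right'` excludes the lower bound and includes the upper bound, which matches the `> 0` and `<= 300` comparisons used in the cell above.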

# Data Wrangling: Change Age from Days to Years

In [None]:
df['age'] = df['age'] / 365

# Data Aggregation: Numerical Columns by Gender

In [None]:
# Copy df and map 1,2 to male,female
df_by_gender = df.copy()

df_by_gender['gender'] = df_by_gender['gender'].map({1: 'Male', 2: 'Female'})

In [None]:
# Group by gender
df_by_gender = df_by_gender.groupby(['gender'])

In [None]:
# Apply aggregate function for rounded mean of numerical columns
def mean_round(x):
    return round(x.mean(), 2)

df_by_gender[['age', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc']].apply(mean_round)

# Data Visualization: 5 Visualizations

In [None]:
# Set color palette for visualizations
sns.set_palette('Set2')

## Visualization 1: Bar Chart - Entries per Gender

In [None]:
# Plot the number of entries per gender (group sizes are indexed by 'Female', 'Male')
df_by_gender.size().plot(
    kind='bar', title='Gender Distribution', xlabel='Gender', ylabel='Count', figsize=(10, 7)
)

# Reset rotation to horizontal, readable format
plt.xticks(rotation=0)

# Change y-axis ticks to be comma-separated for readability
plt.yticks([0, 10000, 20000, 30000, 40000, 50000], ['0', '10,000', '20,000', '30,000', '40,000', '50,000'])

plt.tight_layout()
sns.despine()
plt.show()

### Meaning and Insights from Visualization 1
This visualization shows that the dataset is imbalanced toward male patients. The imbalance matters if we move on to predictive modeling, because a model trained on this data could perform worse for the under-represented female group. It is also a limitation of the analysis itself: results drawn from the full dataset are weighted toward male patients.
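
The size of the imbalance can be expressed as class shares rather than read off the bar heights. The following is a minimal sketch on made-up labels, not the real counts; in the notebook the input would be the mapped gender column.

```python
import pandas as pd

# Made-up labels for illustration only
gender = pd.Series(['Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female'])

# Share of each class as a fraction of all rows
shares = gender.value_counts(normalize=True)

# Ratio of the largest class to the smallest class
imbalance_ratio = shares.max() / shares.min()

print(shares.round(3))
print(round(imbalance_ratio, 2))
```

A ratio close to 1 would indicate balanced classes; larger values indicate a stronger imbalance.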

## Visualization 2: Box Plot - Age per Gender

In [None]:
# Box plot of age per gender
plt.figure(figsize=(10, 10))
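# Assign hue so the palette is applied without a seaborn deprecation warning
sns.boxplot(x='gender', y='age', data=df, hue='gender', palette='Set2', legend=False)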

# Rename x-ticks to gender names
plt.xticks([0, 1], ['Male', 'Female'])

# Add axes labels
plt.xlabel('Gender')
plt.ylabel('Age (in years)')

# Change range of y-axis
plt.ylim(20, 80)

# Add more ticks between y-axis
plt.yticks([20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80])

plt.tight_layout()
sns.despine()
plt.show()

### Meaning and Insights from Visualization 2
This visualization shows that the age range, median, and quartiles are similar for male and female patients. Age is therefore represented consistently across both genders, and the dataset shows no apparent age bias between them.

## Visualization 3: Histogram - Numerical Columns

In [None]:
fig, axes = plt.subplots(3, 1, figsize=(15, 10))

axes[0].hist(df['age'], bins=15)
axes[0].set_title('Age Distribution')
axes[0].set_xlabel('Age (in years)')
axes[0].set_ylabel('Count')
axes[0].set_yticks([0, 2000, 4000, 6000, 8000], ['0', '2,000', '4,000', '6,000', '8,000'])

axes[1].hist(df['height'], bins=30)
axes[1].set_title('Height Distribution')
axes[1].set_xlabel('Height (in cm)')
axes[1].set_ylabel('Count')
axes[1].set_yticks([0, 5000, 10000, 15000, 20000], ['0', '5,000', '10,000', '15,000', '20,000'])

axes[2].hist(df['weight'], bins=30)
axes[2].set_title('Weight Distribution')
axes[2].set_xlabel('Weight (in kg)')
axes[2].set_ylabel('Count')
axes[2].set_yticks(
    [0, 2500, 5000, 7500, 10000, 12500, 15000], ['0', '2,500', '5,000', '7,500', '10,000', '12,500', '15,000']
)

plt.tight_layout()
sns.despine()
plt.show()

### Meaning and Insights from Visualization 3
This visualization shows that age is left-skewed: older individuals are more heavily represented in this dataset than younger ones. Height and weight are each concentrated around a central range, so most patients fall within typical height and weight values.
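
Skew can also be checked numerically instead of by eye. The sketch below uses made-up ages with a tail toward younger values; a negative result from `Series.skew()` indicates a left-skewed distribution, and a mean below the median points the same way.

```python
import pandas as pd

# Made-up ages with a long tail toward younger values
ages = pd.Series([30, 45, 50, 52, 54, 55, 56, 58, 60, 62])

# Sample skewness: negative means left-skewed, positive means right-skewed
skewness = ages.skew()

print(round(skewness, 2))
print(ages.mean(), ages.median())
```

In the notebook, the equivalent check would be `df['age'].skew()` after the conversion to years.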

## Visualization 4: Box Plot - Blood Pressure in Cardiovascular Disease vs Healthy

In [None]:
fig, axes = plt.subplots(2, 1, figsize=(15, 10))

sns.boxplot(x='cardio', y='ap_hi', data=df, ax=axes[0], hue='cardio', legend=False)
axes[0].set_title('Systolic Blood Pressure (ap_hi) per Cardiovascular Disease Presence')

sns.boxplot(x='cardio', y='ap_lo', data=df, ax=axes[1], hue='cardio', legend=False)
axes[1].set_title('Diastolic Blood Pressure (ap_lo) per Cardiovascular Disease Presence')

for ax in axes:
    ax.set_ylabel('Blood Pressure (in mmHg)')
    ax.set_xlabel('Cardiovascular Disease')
    ax.set_xticks([0, 1], ['No', 'Yes'])

plt.tight_layout()
sns.despine()
plt.show()

### Meaning and Insights from Visualization 4
This visualization shows that patients with cardiovascular disease have higher median blood pressure, both systolic and diastolic, than patients without it. This is consistent with the claim that higher blood pressure contributes to cardiovascular disease risk. The systolic IQR is also wider for the disease group, indicating that systolic readings vary more, and reach higher values, among cardiovascular disease cases.
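
The median and IQR that a box plot draws can be computed directly, which makes the comparison between groups explicit. This is a sketch on made-up systolic readings; `iqr` is a hypothetical helper, not a pandas built-in.

```python
import pandas as pd

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

# Made-up systolic readings for illustration only
sample = pd.DataFrame({
    'cardio': [0, 0, 0, 0, 1, 1, 1, 1],
    'ap_hi': [110, 120, 120, 130, 120, 140, 150, 170],
})

# One row per cardio group, with the median and IQR of ap_hi
spread = sample.groupby('cardio')['ap_hi'].agg(['median', iqr])
print(spread)
```

Applied to the cleaned `df`, the same aggregation could be run for both `ap_hi` and `ap_lo` to put numbers on the differences seen in the plots.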

## Visualization 5: Bar Chart - Average Health Metrics Between Present vs Non-Present Cardiovascular Disease

In [None]:
# Group cardiovascular disease presence
df_by_cardio = df.groupby('cardio')

df_by_cardio.mean().head()

In [None]:
# Target columns
numerical_columns = ['age', 'height', 'weight', 'ap_hi', 'ap_lo', 'cholesterol', 'gluc']

fig, axes = plt.subplots(4, 2, figsize=(15, 20))

# Compute group means once; reset_index() makes 'cardio' a regular column for seaborn
cardio_means = df_by_cardio.mean().reset_index()

# Iterate on columns and generate a plot for each
for ax_current, column in zip(axes.flat, numerical_columns):
    # Plot bar plot with cardio as hue
    sns.barplot(x='cardio', y=column, hue='cardio', legend=False, data=cardio_means, ax=ax_current)
    # Clean up y label
    ax_current.set_ylabel(column.title())
    # Clean up x label
    ax_current.set_xlabel('Cardiovascular Disease')
    # Clean up x ticks
    ax_current.set_xticks([0, 1], ['No', 'Yes'])

# Hide the unused eighth subplot (4 x 2 grid, seven metrics)
axes.flat[-1].set_visible(False)

plt.tight_layout()
sns.despine()
plt.show()

### Meaning and Insights from Visualization 5
This visualization shows that cardiovascular disease is associated with higher average values of the clinical metrics provided (blood pressure, cholesterol, and glucose levels). Average weight and age are also higher among cardiovascular disease cases, suggesting that these may be contributing risk factors as well.
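
Because the metrics are on different scales, raw bar heights are hard to compare across panels. One option is to express each difference as a percent change relative to the no-disease group. The sketch below uses made-up values and only two metrics to keep it short.

```python
import pandas as pd

# Made-up values for illustration only
sample = pd.DataFrame({
    'cardio': [0, 0, 1, 1],
    'weight': [70.0, 74.0, 76.0, 80.0],
    'ap_hi': [118, 122, 130, 140],
})

group_means = sample.groupby('cardio').mean()

# Percent change from the no-disease mean (cardio=0) to the disease mean (cardio=1)
pct_diff = (group_means.loc[1] - group_means.loc[0]) / group_means.loc[0] * 100
print(pct_diff.round(1))
```

With the real data, `df_by_cardio.mean()` would take the place of `group_means`, and sorting `pct_diff` would rank the metrics by how strongly they differ between the two groups.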

# Findings & Conclusion

## Summary
Overall, the dataset appears reasonably representative in terms of age, height, and weight, although it contains noticeably more male entries than female. Blood pressure, both systolic and diastolic, is higher in cases where cardiovascular disease (CVD) is present. Age and weight are also higher on average in CVD cases. Patients with elevated blood pressure, cholesterol, and glucose levels should therefore be prioritized for preventive care.

## Four Takeaways or Insights
1. Cardiovascular disease risk factors are consistent across genders, as both display similar distributions of health metrics
2. Cardiovascular disease cases are older on average, suggesting that older populations are more at risk
3. Higher levels of blood pressure, cholesterol, and glucose are associated with confirmed cases of cardiovascular disease
4. Weight is a modifiable risk factor: lowering it in at-risk individuals could potentially reduce the risk of cardiovascular disease