# Heart Data Exploration

In this notebook we identify potentially harmful correlations within the dataset, as well as any imbalances that may affect our ML training procedure.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [None]:
# Load the heart disease CSV (no pre-processing is applied at this stage)

heart_data = pd.read_csv("Heart_Data/heart.csv")

In [None]:
heart_data.head()

# Check for label imbalance

Are the target labels imbalanced?

In [None]:
print((heart_data['HeartDisease'] == 0).sum())
print((heart_data['HeartDisease'] == 1).sum())
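Raw counts are easier to judge as proportions. The following is a minimal sketch, assuming the target column is `HeartDisease` as above; the helper name `label_balance` and the toy frame are illustrative only and are not part of the dataset.

```python
import pandas as pd


def label_balance(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return the count and proportion of each distinct value in `column`."""
    counts = df[column].value_counts().sort_index()
    return pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})


# Toy frame for illustration only (made-up values, not the heart dataset)
toy = pd.DataFrame({"HeartDisease": [0, 1, 1, 0, 1]})
balance = label_balance(toy, "HeartDisease")
print(balance)
```

The same helper should apply to any categorical column, for example `label_balance(heart_data, "Sex")`.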

Are key protected columns imbalanced? In this case, age and sex.

In [None]:
print(heart_data[heart_data.Sex=='F'].shape[0])
print(heart_data[heart_data.Sex=='M'].shape[0])

In [None]:
age_bins = heart_data['Age'].to_numpy()

In [None]:
bins = np.arange(0,120,5)
age_bins = np.digitize(age_bins, bins)

In [None]:
heart_data['age_bin'] = age_bins

heart_data.head()
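`np.digitize` returns an integer index rather than an age range: with 5-year edges starting at 0, index `k` covers ages from `5*(k-1)` up to but not including `5*k`. As a sketch of how the indices map to readable ranges, the snippet below compares `np.digitize` with `pd.cut` on the same edges. The toy ages are made up for illustration, and using `pd.cut` here is a suggested alternative, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Same 5-year edges as the notebook's np.arange(0, 120, 5)
edges = np.arange(0, 120, 5)

# Toy ages for illustration only (made-up values, not the heart dataset)
ages = pd.Series([28, 40, 44, 45, 77], name="Age")

# Integer bin index, as stored in the notebook's age_bin column
age_index = np.digitize(ages.to_numpy(), edges)

# right=False gives left-closed intervals [a, b), matching np.digitize's default
age_interval = pd.cut(ages, bins=edges, right=False)

print(pd.DataFrame({"Age": ages, "age_bin": age_index, "age_interval": age_interval}))
```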

In [None]:
age_dist = heart_data['Age'].to_numpy()

plt.hist(age_dist)

There is a noticeable imbalance across sex, and an imbalance across age, as you would expect.

There is no noticeable imbalance on the target; it could, however, be imbalanced across the protected variables.

How does the target distribution differ across <i>age & sex</i>?

In [None]:
print(heart_data[(heart_data.Sex=='F') & (heart_data.HeartDisease==1)].shape[0])
print(heart_data[(heart_data.Sex=='F') & (heart_data.HeartDisease==0)].shape[0])
print(heart_data[(heart_data.Sex=='M') & (heart_data.HeartDisease==1)].shape[0])
print(heart_data[(heart_data.Sex=='M') & (heart_data.HeartDisease==0)].shape[0])
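The four counts above can also be read as one table with a per-group rate, which makes groups of different sizes comparable. This is a minimal sketch assuming the `Sex` and `HeartDisease` columns used above; the helper `rate_by_group` and the toy values are hypothetical.

```python
import pandas as pd


def rate_by_group(df: pd.DataFrame, group: str, target: str) -> pd.DataFrame:
    """Cross-tabulate group vs. binary target and add the within-group positive rate."""
    table = pd.crosstab(df[group], df[target])
    # The mean of a 0/1 target is the fraction of positives in each group
    table["positive_rate"] = df.groupby(group)[target].mean()
    return table


# Toy frame for illustration only (made-up values, not the heart dataset)
toy = pd.DataFrame({
    "Sex": ["F", "F", "F", "M", "M", "M", "M", "M"],
    "HeartDisease": [0, 0, 1, 1, 1, 1, 0, 1],
})
table = rate_by_group(toy, "Sex", "HeartDisease")
print(table)
```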

In [None]:
age_dist_hd = heart_data[heart_data.HeartDisease==1]['age_bin']
age_dist_healthy = heart_data[heart_data.HeartDisease==0]['age_bin']

In [None]:
plt.hist(age_dist_hd)

In [None]:
plt.hist(age_dist_healthy)
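Two separate histograms are hard to compare because each one also reflects how many people fall in each age range overall. One way to separate the two effects is to compute the heart disease rate within each age bin. The sketch below assumes the `Age` and `HeartDisease` columns; the helper name, the 10-year bin width, and the toy values are illustrative choices.

```python
import numpy as np
import pandas as pd


def positive_rate_by_age(df: pd.DataFrame, age_col: str = "Age",
                         target: str = "HeartDisease", width: int = 10) -> pd.DataFrame:
    """Return group size and positive rate of a binary target per age bin."""
    edges = np.arange(0, 120 + width, width)
    intervals = pd.cut(df[age_col], bins=edges, right=False)
    # observed=True drops age bins that contain no rows
    grouped = df.groupby(intervals, observed=True)[target]
    return pd.DataFrame({"n": grouped.size(), "positive_rate": grouped.mean()})


# Toy frame for illustration only (made-up values, not the heart dataset)
toy = pd.DataFrame({
    "Age": [35, 38, 52, 55, 58, 61, 67],
    "HeartDisease": [0, 0, 0, 1, 1, 1, 1],
})
rates = positive_rate_by_age(toy)
print(rates)
```

Reporting `n` alongside the rate matters because rates in sparsely populated bins are noisy.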

Both sex and age are associated with the heart disease target, and both are potentially protected attributes. It is plausible that age and sex are <i>causally</i> related to heart disease, which would explain why they correlate with the target in this dataset.

It is equally plausible that elderly people are under-represented in this dataset because they may have died of heart disease before being recorded; this depends on how the data was collected and where it came from.

Bias like this can influence the results, so mitigating it during synthetic data generation could benefit downstream tasks. The <b>key question</b> is: when should this bias be handled?

- Is it reasonable for synthetic dataset providers to produce realistic (but still biased) datasets and simply state the limitations and bias present within their data?
- Or should synthetic dataset providers deliberately try to mitigate the bias within their data using ML techniques (such as causal modelling)?

This heart use case explores the latter option using causal modelling - <i>which is not to say this method is perfect or without limitations</i>

In [None]:
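# numeric_only=True skips string columns such as Sex; pandas >= 2.0 raises an error on them otherwise
heart_data.corr(method="spearman", numeric_only=True)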

In [None]:
from scipy.stats import spearmanr

correlations = spearmanr(heart_data)[0]
fig = plt.figure(figsize=(40,40))
ax = fig.add_subplot(111)
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ticks = np.arange(len(heart_data.columns))  # one tick per column, including the added age_bin
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(heart_data.columns)
ax.set_yticklabels(heart_data.columns)
plt.show()
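A large heatmap can make it hard to pick out which features track the target most closely. As a complementary sketch, the helper below ranks numeric columns by the absolute Spearman correlation with the target. It assumes a binary numeric target such as `HeartDisease`; the helper name and toy values are hypothetical, and string columns such as `Sex` are excluded here, so they would need separate treatment.

```python
import pandas as pd


def rank_target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Spearman correlation of each numeric column with `target`, strongest first."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr(method="spearman")[target].drop(target)
    # Order by magnitude but keep the sign so the direction stays visible
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy frame for illustration only (made-up values, not the heart dataset)
toy = pd.DataFrame({
    "Age": [30, 40, 50, 60, 70],
    "Noise": [3, 1, 2, 1, 3],
    "Sex": ["F", "M", "M", "F", "M"],  # non-numeric, so it is skipped
    "HeartDisease": [0, 0, 0, 1, 1],
})
ranked = rank_target_correlations(toy, "HeartDisease")
print(ranked)
```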