In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

os.makedirs("figures", exist_ok=True)

df = pd.read_csv("data/raw/healthcare-dataset-stroke-data.csv")


In [2]:
plt.figure(figsize=(6,4))
sns.histplot(df['age'], bins=30, kde=True)
plt.title("Distribution of Age")
plt.xlabel("Age")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig("figures/age_distribution.png")
plt.close()


The histogram shows that most patients are older adults, which is relevant because stroke risk increases with age.
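
The visual impression above could be backed by a couple of numbers. The sketch below is a minimal example, not part of the original analysis: the helper name `age_summary` and the cutoff of 60 years for "older adult" are assumptions chosen for illustration.

```python
import pandas as pd


def age_summary(ages: pd.Series, threshold: float = 60.0) -> dict:
    """Return the median age and the share of patients at or above `threshold`.

    `threshold` is an illustrative cutoff, not a value taken from the dataset.
    """
    ages = ages.dropna()
    return {
        "median_age": float(ages.median()),
        "share_at_or_above_threshold": float((ages >= threshold).mean()),
    }
```

In this notebook it would be called as `age_summary(df['age'])`.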

In [3]:
plt.figure(figsize=(6,4))
sns.histplot(df['bmi'], bins=30, kde=True)
plt.title("Distribution of BMI")
plt.xlabel("BMI")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig("figures/bmi_distribution.png")
plt.close()


The BMI distribution is right-skewed: most values fall between 20 and 35, with a few high outliers. Because a BMI of 30 or above is conventionally classed as obese, this range runs from healthy through overweight into the lower end of obesity.
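
To check the reading of the histogram, the BMI values can be binned into the conventional adult categories. This is a sketch under stated assumptions: the helper `bmi_category_shares`, the constants, and the label names are hypothetical, and the cutoffs (18.5, 25, 30) are the commonly used adult thresholds rather than anything defined in the dataset.

```python
import numpy as np
import pandas as pd

# Conventional adult BMI cutoffs; the label names are illustrative.
BMI_BINS = [0, 18.5, 25, 30, np.inf]
BMI_LABELS = ["underweight", "healthy", "overweight", "obese"]


def bmi_category_shares(bmi: pd.Series) -> pd.Series:
    """Return the share of non-missing BMI values in each category."""
    # right=False closes each bin on the left, so 25.0 counts as overweight.
    categories = pd.cut(bmi, bins=BMI_BINS, labels=BMI_LABELS, right=False)
    # value_counts ignores missing values by default.
    return categories.value_counts(normalize=True)
```

Calling `bmi_category_shares(df['bmi'])` would give the category shares, and `df['bmi'].skew()` returns a positive number when the distribution is right-skewed.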

In [4]:
plt.figure(figsize=(6,4))
sns.scatterplot(
    data=df,
    x='age',
    y='avg_glucose_level',
    hue='stroke',
    alpha=0.7
)
plt.title("Age vs. Avg Glucose Level by Stroke Outcome")
plt.xlabel("Age")
plt.ylabel("Average Glucose Level")
plt.legend(title="Stroke")
plt.tight_layout()
plt.savefig("figures/age_vs_glucose_by_stroke.png")
plt.close()


The scatter plot shows that stroke cases tend to occur more frequently among older individuals and those with higher average glucose levels. This suggests that both age and elevated glucose may be important risk factors for stroke.
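
A scatter plot with overlapping points can be hard to read, so a per-group summary may help confirm the pattern. The following is a minimal sketch; the helper name `medians_by_outcome` is an assumption, while the column names are the ones used in the plot above.

```python
import pandas as pd


def medians_by_outcome(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the median age and average glucose level for each stroke outcome."""
    return frame.groupby("stroke")[["age", "avg_glucose_level"]].median()
```

`medians_by_outcome(df)` returns one row per value of `stroke`. Medians are used here rather than means so that a few extreme glucose readings do not dominate the comparison.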

In [5]:
plt.figure(figsize=(5,4))
sns.countplot(x='stroke', data=df)
plt.title("Class Balance: Stroke vs. No Stroke")
plt.xlabel("Stroke (1 = Yes, 0 = No)")
plt.ylabel("Count")
plt.tight_layout()
plt.savefig("figures/stroke_class_balance.png")
plt.close()


The dataset is highly imbalanced, with far more non-stroke cases than stroke cases. This imbalance suggests that special techniques, such as resampling or class weighting, may be needed to build an effective predictive model.
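
As one illustration of class weighting, the sketch below computes a weight per class that is inversely proportional to its frequency, using the formula `n_samples / (n_classes * class_count)`. This is the same heuristic that scikit-learn applies for `class_weight="balanced"`, written here in plain pandas. The helper name `balanced_class_weights` is an assumption, and nothing in the notebook says which technique will actually be used.

```python
import pandas as pd


def balanced_class_weights(labels: pd.Series) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = labels.value_counts()
    n_samples = counts.sum()
    n_classes = len(counts)
    return {
        cls: float(n_samples / (n_classes * count))
        for cls, count in counts.items()
    }
```

`balanced_class_weights(df['stroke'])` would assign the rarer stroke class the larger weight. `df['stroke'].value_counts(normalize=True)` shows the class proportions themselves.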

In [6]:
plt.figure(figsize=(8,4))
sns.heatmap(df.isnull(), cbar=False)
plt.title("Missing Values Heatmap")
plt.tight_layout()
plt.savefig("figures/missing_values_heatmap.png")
plt.close()


The heatmap shows that the only missing values in the dataset are in the bmi column.
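
A heatmap can hide a small number of missing cells in a large table, so an exact count per column is a useful cross-check. This is a minimal sketch; the helper name `missing_by_column` is an assumption.

```python
import pandas as pd


def missing_by_column(frame: pd.DataFrame) -> pd.Series:
    """Return the number of missing values for each column that has any."""
    counts = frame.isnull().sum()
    return counts[counts > 0]
```

If the observation above holds, `missing_by_column(df)` should return a single entry for `bmi`.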