## Stroke Prediction Dataset

The dataset we will be using in this workshop is the [Stroke Prediction Dataset](https://www.kaggle.com/fedesoriano/stroke-prediction-dataset) from **Kaggle**. It contains health records of over 5000 individuals, some of whom have suffered a stroke.

According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset can be used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, various diseases, and smoking status. Each row provides the relevant information about one patient.

We will use this dataset to gain initial insights into the data and to get a *feel* for the work of a data scientist. We will learn how to use basic Python scripting, packages, and techniques to explore the data, and how to use this information to train a Machine Learning (ML) model in a subsequent workshop.

# Step 0: Imports and Reading Data

In [1]:
# Numpy and Pandas for data manipulation
import numpy as np
import pandas as pd

# Matplotlib and Seaborn for visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

In [2]:
# Read in data into a dataframe
df = pd.read_csv('data/stroke.csv')

# Step 1: Data Understanding

* Dataframe `shape`
* `head` and `tail`
* `dtypes`
* `describe`

In [None]:
# Display shape of dataframe
df.shape

In [None]:
# Display first 10 rows of dataframe
df.head(10)

In [None]:
# Display column names
df.columns

The current dataset contains the following features:

* `id`: unique identifier
* `gender`: "Male", "Female" or "Other"
* `age`: age of the patient
* `hypertension`: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
* `heart_disease`: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
* `ever_married`: "No" or "Yes"
* `work_type`: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
* `Residence_type`: "Rural" or "Urban"
* `avg_glucose_level`: average glucose level in blood
* `bmi`: body mass index
* `smoking_status`: "formerly smoked", "never smoked", "smokes" or "Unknown"*
* `stroke`: 1 if the patient had a stroke or 0 if not

###### **Note**: "Unknown" in `smoking_status` means that the information is unavailable for this patient.
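
One way to check the documented categories against the actual data is to list the distinct values of each text column. The sketch below is only an illustration: it uses a small made-up frame named `sample` (not the workshop CSV), and in the notebook the same loop would be applied to `df`.

```python
import pandas as pd

# Made-up rows that mimic a few columns of the stroke dataset
sample = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Other'],
    'smoking_status': ['smokes', 'Unknown', 'never smoked', 'Unknown'],
    'age': [67.0, 61.0, 80.0, 26.0],
})

# Sorted distinct values of every non-numeric column
categories = {
    col: sorted(sample[col].unique())
    for col in sample.select_dtypes(exclude='number').columns
}
for col, values in categories.items():
    print(f'{col}: {values}')

# Share of rows whose smoking status is unavailable
unknown_share = (sample['smoking_status'] == 'Unknown').mean()
print(f'Unknown smoking status: {unknown_share:.0%}')
```

A value that appears in the data but not in the feature list above (or the other way round) is worth investigating before any modelling.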

In [None]:
# Display information about types of data in dataframe
df.dtypes

In [None]:
# Describe basic statistics about dataframe
df.describe()

# Step 2: Data Preparation

* Dropping irrelevant columns
* Renaming columns
* Identifying and removing duplicated rows
* Handling missing values
* Feature creation

In [None]:
# Dropping the 'irrelevant_column' and 'duplicate_column'
df.drop(['irrelevant_column', 'duplicate_column'], axis=1, inplace=True)
df.head()

In [None]:
# Renaming columns back to original
df.rename(columns={'sex': 'gender', 'patient_age': 'age'}, inplace=True)
df.head()

In [None]:
# Identifying duplicate rows
df.duplicated()

In [None]:
df[df.duplicated()]

In [None]:
df = df.drop_duplicates()
df.shape
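
`duplicated()` returns one boolean per row, so summing it gives the number of repeated rows directly. The following is a minimal sketch on a made-up frame called `sample`; the column values are invented for illustration.

```python
import pandas as pd

# Made-up frame in which the second and third rows are identical
sample = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'age': [67.0, 61.0, 61.0, 80.0],
    'stroke': [1, 0, 0, 1],
})

# True marks a row that repeats an earlier row in every column
n_duplicates = int(sample.duplicated().sum())
print(f'Duplicated rows: {n_duplicates}')

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)
```

Resetting the index is optional; it only avoids gaps in the row labels after rows have been removed.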

In [None]:
# Checking the number of missing values
df.isnull().sum()

In [None]:
# Percentage of missing bmi values
df['bmi'].isnull().sum() / len(df) * 100

In [None]:
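# Replacing missing bmi values with the column mean
# (assigning the result back avoids chained in-place modification, which newer pandas versions warn about)
df['bmi'] = df['bmi'].fillna(df['bmi'].mean())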
df.isnull().sum()
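
The mean is pulled towards extreme values, so the median is a common alternative when a column contains outliers. The sketch below compares the two on a made-up `bmi` column; it is an illustration of the trade-off, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Made-up bmi values with two gaps and one extreme value (60.0)
sample = pd.DataFrame({'bmi': [22.0, np.nan, 28.0, 60.0, np.nan]})

# Both statistics ignore missing values by default
bmi_mean = sample['bmi'].mean()
bmi_median = sample['bmi'].median()
print(f'mean={bmi_mean:.2f}, median={bmi_median:.2f}')

# Fill the gaps into new columns so the two strategies can be compared
sample['bmi_mean_filled'] = sample['bmi'].fillna(bmi_mean)
sample['bmi_median_filled'] = sample['bmi'].fillna(bmi_median)
print(sample)
```

Here the single extreme value lifts the mean well above the median, so the two strategies fill the gaps with noticeably different numbers.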

In [None]:
# Creating a health risk score based on normalized values of certain health indicators
df['health_risk_score'] = (
    (df['age'] / 100) +
    df['hypertension'] +
    df['heart_disease'] +
    (df['avg_glucose_level'] / 200) +
    (df['bmi'] / 50))

df['health_risk_score'] = (df['health_risk_score'] - df['health_risk_score'].min()) / (df['health_risk_score'].max() - df['health_risk_score'].min())

df.head()

# Step 3: Feature Understanding - Univariate Analysis

* Plotting Feature Distributions
  * Histogram
  * KDE
  * Boxplot

In [None]:
df['smoking_status'].value_counts()

In [None]:
# Bar plot of smoking status (categorical variable)
ax = df['smoking_status'].value_counts() \
    .head(10) \
    .plot(kind='bar', title='Smoking Status', figsize=(15, 10), color='green', fontsize=20)
ax.set_xlabel('Smoking Status')
ax.set_ylabel('Count')

In [None]:
# Histograms of age and avg_glucose_level (continuous variables), each with its own y-axis

# Creating the figure and the first axis
fig, ax1 = plt.subplots(figsize=(15, 10))

# Plotting the histogram of 'age' on the first axis
ax1.hist(df['age'], bins=20, alpha=0.5, color='blue', label='Age')
ax1.set_xlabel('Age', fontsize=20)
ax1.set_ylabel('Count of People with Age', fontsize=20)
ax1.tick_params(axis='both', which='major', labelsize=20)

# Creating the second axis
ax2 = ax1.twinx()

# Plotting the histogram of 'avg_glucose_level' on the second axis
ax2.hist(df['avg_glucose_level'], bins=20, alpha=0.5, color='green', label='Avg Glucose Level')
ax2.set_ylabel('Count of People with Avg Glucose Level', fontsize=20)
ax2.tick_params(axis='y', which='major', labelsize=20)

# Adding legends
ax1.legend(loc='upper left', fontsize=15)
ax2.legend(loc='upper right', fontsize=15)

# Showing the plot
plt.show()

In [None]:
# Kernel Density Estimation (KDE) plot of health risk score (continuous variable)
plt.figure(figsize=(10, 6))
sns.kdeplot(df['health_risk_score'], fill=True, color="r")  # `fill` replaces the deprecated `shade` argument
plt.title('KDE of Health Risk Score')
plt.xlabel('Health Risk Score')
plt.ylabel('Density')
plt.show()
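
The outline of this step also lists a boxplot. The quantities a boxplot draws can be computed directly, which makes it easier to read the plot later. The sketch below uses made-up `bmi` values and the 1.5 × IQR whisker rule, which is the default in both Matplotlib and seaborn.

```python
import pandas as pd

# Made-up bmi values with one extreme observation
bmi = pd.Series([18.0, 22.0, 25.0, 27.0, 30.0, 33.0, 60.0], name='bmi')

# The box spans the first to the third quartile
q1 = bmi.quantile(0.25)
q3 = bmi.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bmi[(bmi < lower_fence) | (bmi > upper_fence)]

print(f'Q1={q1}, Q3={q3}, IQR={iqr}')
print(f'Fences: [{lower_fence}, {upper_fence}]')
print(f'Outliers: {outliers.tolist()}')
```

In the notebook, `sns.boxplot(x=df['bmi'])` would draw the corresponding plot for the real column.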

# Step 4: Feature Relationship - Bivariate Analysis

* Scatterplot
* Heatmap Correlation
* Pairplot
* Groupby comparisons

In [None]:
# Scatter plot of age and health risk score (continuous variables)
plt.figure(figsize=(15, 10))
sns.scatterplot(
    x=df['age'], 
    y=df['health_risk_score'],
    hue=df['stroke'],
    palette=['blue', 'red'],
    alpha=0.5)
plt.title('Scatter Plot of Age and Health Risk Score', fontsize=20)
plt.xlabel('Age', fontsize=20)
plt.ylabel('Health Risk Score', fontsize=20)
plt.show()

In [None]:
# Scatter plot of age and health risk score (continuous variables) with regression line and 95% confidence interval
plt.figure(figsize=(15, 10))
sns.regplot(
    x=df['age'], 
    y=df['health_risk_score'], 
    color='green',
    scatter_kws={'alpha': 0.5}, 
    line_kws={'color': 'red'},
    ci=95)
plt.title('Scatter Plot of Age and Health Risk Score', fontsize=20)
plt.xlabel('Age', fontsize=20)
plt.ylabel('Health Risk Score', fontsize=20)
plt.show()

In [None]:
# Pairplot of age, avg_glucose_level, bmi, hypertension, heart_disease
sns.pairplot(
    df,
    vars=['age', 'avg_glucose_level', 'bmi', 'hypertension', 'heart_disease'],
    hue='stroke',
    palette=['blue', 'red']
)
plt.show()


In [None]:
# Per-feature grid for age, avg_glucose_level and bmi:
# boxplot, distribution, and mean stroke value across the feature's range

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(16, 12))

sns.boxplot(x=df['age'], ax=axes[0, 0]).set_title("BoxPlot For Age")
sns.histplot(data=df, x='age', kde=True, ax=axes[0, 1]).set_title("Distribution Of Age")
sns.lineplot(data=df, x='age', y="stroke", ax=axes[0, 2]).set_title("Lineplot For Age With Stroke")

sns.boxplot(x=df['avg_glucose_level'], ax=axes[1, 0]).set_title("BoxPlot For Glucose")
sns.histplot(data=df, x='avg_glucose_level', kde=True, ax=axes[1, 1]).set_title("Distribution Of Glucose")
sns.lineplot(data=df, x='avg_glucose_level', y="stroke", ax=axes[1, 2]).set_title("Lineplot For Glucose Level With Stroke")

sns.boxplot(x=df['bmi'], ax=axes[2, 0]).set_title("BoxPlot For Bmi")
sns.histplot(data=df, x='bmi', kde=True, ax=axes[2, 1]).set_title("Distribution Of Bmi")
sns.lineplot(data=df, x='bmi', y="stroke", ax=axes[2, 2]).set_title("Lineplot For Bmi With Stroke")

plt.tight_layout()
plt.show()

In [None]:
# Correlation matrix of age, avg_glucose_level, bmi, hypertension, heart_disease
plt.figure(figsize=(15, 10))
sns.heatmap(
    df[['age', 'avg_glucose_level', 'bmi', 'hypertension', 'heart_disease']].corr(), 
    annot=True,
    cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
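
The outline of this step also mentions groupby comparisons. A minimal sketch, assuming a made-up frame `sample` with the same column names as the dataset, compares the average of two numeric features between patients with and without a stroke:

```python
import pandas as pd

# Made-up patients: two without a stroke (0) and two with a stroke (1)
sample = pd.DataFrame({
    'stroke': [0, 0, 1, 1],
    'age': [30.0, 50.0, 70.0, 80.0],
    'avg_glucose_level': [90.0, 110.0, 150.0, 210.0],
})

# One row per stroke value, one column per averaged feature
group_means = sample.groupby('stroke')[['age', 'avg_glucose_level']].mean()
print(group_means)
```

On the real data the same call on `df` gives a quick numeric counterpart to the plots above; differences between the groups describe the sample and do not by themselves establish a cause.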

# Step 5: Ask a Question about the data
* Try to answer a question you have about the data using a plot or statistic.

In [None]:
# Which combinations of residence type and work type account for the most strokes?

# Filter only rows where stroke occurred and group by 'Residence_type' and 'work_type', calculating the count for each group
stroke_susceptibility = df[df['stroke'] == 1].groupby(['Residence_type', 'work_type']).size().reset_index(name='count')

# Sort values by count in descending order to see which combination of 'Residence_type' and 'work_type' has the most strokes
stroke_susceptibility = stroke_susceptibility.sort_values('count', ascending=False)

# Concatenating 'Residence_type' and 'work_type' for better labeling
stroke_susceptibility['Residence_Work'] = stroke_susceptibility['Residence_type'] + ", " + stroke_susceptibility['work_type']

# Plotting the results with the combined label
ax = stroke_susceptibility.plot(
    kind='barh',
    x='Residence_Work',
    y='count',
    figsize=(20, 7),
    title='Number of Strokes by Residence and Work Type'
)
ax.set_xlabel('Number of Strokes')
ax.set_ylabel('Residence Type, Work Type')
plt.show()
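
Raw counts depend on how many patients fall into each group, so a large group can show many strokes even if its members are not at higher risk. A possible follow-up is to compare the stroke rate per group instead. The sketch below uses a made-up frame `sample` and groups by `work_type` only, to keep the example short.

```python
import pandas as pd

# Made-up patients: a large "Private" group and a small "Self-employed" group
sample = pd.DataFrame({
    'work_type': ['Private'] * 6 + ['Self-employed'] * 2,
    'stroke': [1, 1, 0, 0, 0, 0, 1, 0],
})

# Strokes, group size, and share of patients with a stroke for each group
rates = (
    sample.groupby('work_type')['stroke']
    .agg(strokes='sum', patients='count', stroke_rate='mean')
    .sort_values('stroke_rate', ascending=False)
)
print(rates)
```

In this made-up example "Private" has the most strokes (2 of 6 patients), while "Self-employed" has the higher rate (1 of 2). With groups this small a rate is very noisy, so the group size is worth reporting alongside it.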