REPORT

1. Data description

This is a binary classification dataset for stroke prediction, consisting of 5110 records and 12 columns: a record ID, 10 predictor features related to stroke risk factors, and the stroke target. The 10 features include:

Demographics: Gender, age, marital status, residence type

Medical History: Hypertension, heart disease

Lifestyle Factors: Work type, smoking status

Health Metrics: Average glucose level, BMI

The target variable is stroke occurrence (binary). The BMI column has missing values, which are filled by median imputation. Feature engineering adds derived features (age group, BMI category, glucose category, and a combined health risk score) to give the model more signal to learn from.
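As a minimal sketch of median imputation, the snippet below fills missing BMI values in a small toy frame. The values and the `bmi_imputed` column name are hypothetical and are not taken from the dataset; in the notebook itself the median is learned from the training split only, through a SimpleImputer inside the preprocessing pipeline.

```python
import numpy as np
import pandas as pd

# Toy BMI values (hypothetical, for illustration only); NaN marks a missing entry.
toy = pd.DataFrame({"bmi": [22.0, np.nan, 31.5, 27.0, np.nan]})

# Count the missing entries before imputing.
n_missing = int(toy["bmi"].isna().sum())

# The median ignores NaN by default, so it is computed from observed values only.
median_bmi = toy["bmi"].median()

# Fill the gaps with the median; the original column is left untouched.
toy["bmi_imputed"] = toy["bmi"].fillna(median_bmi)

print(n_missing, median_bmi)
print(toy)
```

The median is often preferred over the mean here because it is less sensitive to extreme BMI values.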

2. The objectives

The goal of the analysis is to build a supervised learning model that predicts whether an individual has a stroke based on various risk factors.
-The model should generalize well to new examples rather than overfit.
-The model should achieve high accuracy, which is crucial for health-related tasks.

3. Model selection

Two deep learning architectures were tested:

Complex model: A deep network with four layers of 16 neurons each. This model showed overfitting, with significantly lower validation accuracy compared to training accuracy.

Optimized / Simplified model: A simpler architecture with one hidden layer containing 8 neurons, ReLU activation, and L2 regularization. This model performed better in terms of validation accuracy and generalization.

I selected the simplified model, which achieved stable validation accuracy (~95%) within 20 epochs.
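To make the capacity difference between the two architectures concrete, the sketch below counts trainable parameters for stacks of fully connected layers. It assumes the complex model had four hidden 16-neuron layers followed by a single sigmoid output unit, and it uses a hypothetical input width of 30 columns; the real width depends on how many columns the one-hot encoder produces. The helper name `dense_param_count` is illustrative only.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        # Each unit has one weight per input plus one bias term.
        total += (fan_in + 1) * units
        fan_in = units
    return total


# Hypothetical input width after preprocessing (an assumption, not measured).
n_features = 30

# Four hidden layers of 16 units plus one output unit (assumed layout).
complex_params = dense_param_count(n_features, [16, 16, 16, 16, 1])

# One hidden layer of 8 units plus one output unit.
simple_params = dense_param_count(n_features, [8, 1])

print(complex_params, simple_params)
```

Under these assumptions the deeper network has roughly five times as many parameters to fit from the same training rows, which is consistent with the overfitting described above.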

4. Key Findings

-The optimized model achieved a validation accuracy of ~95%, with minimal overfitting.

-The model converged quickly, suggesting that the dataset is relatively small for deep learning applications.

-Feature importance analysis and correlation tests indicate that age and average glucose level have strong correlations with stroke occurrence.
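One possible way to run such a correlation check is sketched below, using the Pearson correlation between each numeric feature and the binary target. The data is synthetic and generated only for illustration, so the printed values say nothing about the real dataset; the column names mirror those used in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the real dataset), seeded for repeatability.
rng = np.random.default_rng(0)
n = 500
age = rng.uniform(1, 90, size=n)
glucose = rng.normal(105, 40, size=n)

# Assumed toy relationship: stroke probability rises with age.
p_stroke = 1.0 / (1.0 + np.exp(-(age - 60.0) / 10.0))
stroke = (rng.random(n) < p_stroke).astype(int)

toy = pd.DataFrame({"age": age, "avg_glucose_level": glucose, "stroke": stroke})

# Pearson correlation of each feature with the binary target
# (equivalent to the point-biserial correlation).
corr_with_target = (
    toy.corr()["stroke"].drop("stroke").sort_values(ascending=False)
)
print(corr_with_target)
```

Correlation only captures linear association, so it is best read alongside a model-based importance measure rather than on its own.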

5. Model Limitations and Future Improvements

-The dataset likely has an imbalance in stroke occurrences, which could affect model learning. Using SMOTE or class-weight adjustments may help.
-Trying ensemble methods, such as boosting or hybrid deep learning approaches, could be beneficial.
-The limited number of training examples may restrict the model's ability to generalize to new data. Collecting additional records or generating synthetic data might help.
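The class-weight adjustment mentioned above could be sketched as follows with scikit-learn's compute_class_weight. The label array is a toy example with an assumed 95/5 split, not the real class distribution of the dataset.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with an assumed 95/5 imbalance (hypothetical, for illustration).
y_toy = np.array([0] * 95 + [1] * 5)

classes = np.unique(y_toy)

# "balanced" weights are n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Keras expects a dict that maps each class index to its weight.
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

A dictionary like this can be passed to Keras through the class_weight argument of model.fit, so that errors on the rare stroke class contribute more to the loss.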




In [62]:
#Importing libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np
import pandas as pd

#Loading the data - Stroke Prediction Dataset from Kaggle
file_path = "Stroke Prediction Dataset - FEDESORIANO.csv"
df = pd.read_csv(file_path)

#Data processing and feature engineering

#ID column is not needed. Create new features before splitting into X and y
df.drop(columns=["id"], inplace=True)

#Create age groups
df['age_group'] = pd.cut(df['age'], bins=[0, 20, 40, 60, 80, 100], labels=['0-20', '21-40', '41-60', '61-80', '80+'])

#Create BMI categories based on WHO standards
df['bmi_category'] = pd.cut(df['bmi'], 
                           bins=[0, 18.5, 24.9, 29.9, float('inf')],
                           labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

#Create glucose level categories
df['glucose_category'] = pd.cut(df['avg_glucose_level'],
                               bins=[0, 70, 100, 125, float('inf')],
                               labels=['Low', 'Normal', 'Pre-diabetic', 'Diabetic'])

#Create health risk score combining hypertension and heart disease
df['health_risk'] = df['hypertension'] + df['heart_disease']

#Split features and target
X = df.drop(columns=["stroke"])
y = df["stroke"]

cat_cols = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status", 
            "age_group", "bmi_category", "glucose_category"]
num_cols = ["age", "hypertension", "heart_disease", "avg_glucose_level", "bmi", "health_risk"]

#Preprocess by variable type: one-hot encoding for categorical columns; median imputation and standardization for numerical columns
cat_transformer = OneHotEncoder(handle_unknown="ignore")
num_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")), 
    ("scaler", StandardScaler())  
])

preprocessor = ColumnTransformer(transformers=[
    ("num", num_transformer, num_cols),
    ("cat", cat_transformer, cat_cols)
])

#0.75-0.25 split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)


In [65]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

#Started with a deep network of four 16-neuron layers, which overfit:
#validation accuracy was noticeably lower than training accuracy.
#Switched to a simpler network with one 8-neuron hidden layer, ReLU activation, and L2 regularization.
#A sigmoid is applied in the final layer because this is a binary classification task.

model = keras.Sequential([
    
    layers.Dense(8, activation="relu", kernel_regularizer=regularizers.l2(0.001)),

    layers.Dense(1, activation="sigmoid")
])

#Adam optimizer and binary crossentropy loss function achieved the highest accuracy.
optimizer = keras.optimizers.Adam(learning_rate=0.0001)
model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])

#Train the model. It converges quickly, at around 20 epochs, with validation accuracy fluctuating around 0.95.
#The small gap between training and validation accuracy suggests the model generalizes well on this task.
#The small number of epochs needed for convergence may also indicate insufficient data.
#Larger datasets, further feature engineering, or synthetic data generation with SMOTE may be needed.
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test), 
                    verbose=1)




Epoch 1/20 through Epoch 20/20 (per-epoch metrics not preserved in this export)
