# Declaration of Originality

**Student Name:** [Your Name]  
**Student ID:** [Your ID]  
**Class:** [Your Class]  

I declare that this assignment is my own work and has been completed in accordance with the school's academic integrity policy.

**Use of Generative AI:**
- [ ] I did not use any Generative AI tools
- [ ] I used Generative AI tools (Claude, ChatGPT, etc.) for: ___________

**Signature:** ___________  
**Date:** ___________

# 1. Import Libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning - Model Selection
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score

# Machine Learning - Preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer

# Machine Learning - Models (ONLY scikit-learn allowed)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier

# Machine Learning - Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, 
    roc_auc_score, confusion_matrix, classification_report,
    roc_curve, auc
)

# Feature Selection
from sklearn.feature_selection import SelectKBest, f_classif, RFE

# Save model
import pickle

# Settings
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
np.random.seed(42)

# Plotting settings
plt.style.use('default')
sns.set_palette('husl')
%matplotlib inline

print("✅ All libraries imported successfully!")

# 2. Problem Statement & Dataset

## 2.1 Business Problem

**Problem:** Predicting stroke risk in patients to enable early intervention and prevention

**Why it matters:**
- Strokes are a leading cause of death and disability worldwide
- Early detection allows preventive measures (lifestyle changes, medication)
- Reduces healthcare costs through prevention vs treatment
- Improves patient quality of life through timely intervention

**Target Audience:** Healthcare providers, clinics, hospitals

## 2.2 Dataset Information

**Source:** Kaggle - Stroke Prediction Dataset  
**URL:** https://www.kaggle.com/datasets/jawairia123/stroke-prediction-dataset/data  
**Size:** 5,110 rows × 12 columns (`id`, 10 predictors, and the `stroke` target)  
**Target Variable:** `stroke` (0 = No stroke, 1 = Stroke)  
**Problem Type:** Binary Classification  

**Features:**
1. `id` - Unique identifier
2. `gender` - Male/Female/Other
3. `age` - Age of patient
4. `hypertension` - 0 = no hypertension, 1 = has hypertension
5. `heart_disease` - 0 = no heart disease, 1 = has heart disease
6. `ever_married` - Yes/No
7. `work_type` - Type of work (Private, Self-employed, Govt_job, children, Never_worked)
8. `Residence_type` - Urban/Rural
9. `avg_glucose_level` - Average glucose level in blood
10. `bmi` - Body mass index
11. `smoking_status` - formerly smoked, never smoked, smokes, Unknown
12. `stroke` - TARGET: 0 = no stroke, 1 = stroke

## 2.3 Load Dataset

In [None]:
# Load the dataset
FILE_PATH = 'healthcare-dataset-stroke-data.csv'
df = pd.read_csv(FILE_PATH)

# Display basic information
print("Dataset Shape:", df.shape)
print(f"\nNumber of samples: {df.shape[0]:,}")
print(f"Number of columns (including id and target): {df.shape[1]}")
print("\n" + "="*50)

# Display first few rows
display(df.head())

# Display data types and non-null counts
print("\n" + "="*50)
print("\nDataset Info:")
df.info()

In [None]:
# Statistical summary for numerical features
print("Numerical Features Summary:")
display(df.describe())

# Summary for categorical features
print("\n" + "="*50)
print("\nCategorical Features Summary:")
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    print(f"\n{col}:")
    print(df[col].value_counts())

# 3. Exploratory Data Analysis (EDA)

**Purpose:** Understand the data, identify patterns, detect outliers, and discover relationships between features and target variable.

## 3.1 Check for Missing Values

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percent = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percent
})
missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_df) > 0:
    print("Missing Values Summary:")
    display(missing_df)
    
    # Visualize missing values
    plt.figure(figsize=(10, 4))
    plt.bar(missing_df.index, missing_df['Missing Count'])
    plt.xlabel('Features')
    plt.ylabel('Number of Missing Values')
    plt.title('Missing Values by Feature')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
else:
    print("✅ No missing values found!")

# INTERPRETATION: 
# TODO: Write your interpretation here
# Example: "BMI has 201 missing values (3.9%). This represents a small portion of data.
# We will handle this in the data preparation phase by imputing with median value."

## 3.2 Target Variable Distribution

In [None]:
# Target variable distribution
target_counts = df['stroke'].value_counts().sort_index()  # order 0, 1 to match the plot labels below
target_percent = df['stroke'].value_counts(normalize=True).sort_index() * 100

print("Target Variable Distribution:")
print(f"No Stroke (0): {target_counts[0]:,} ({target_percent[0]:.2f}%)")
print(f"Stroke (1): {target_counts[1]:,} ({target_percent[1]:.2f}%)")
print(f"\nImbalance Ratio: {target_counts[0]/target_counts[1]:.2f}:1")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
axes[0].bar(['No Stroke', 'Stroke'], target_counts.values, color=['green', 'red'])
axes[0].set_ylabel('Count')
axes[0].set_title('Stroke Distribution (Count)')
for i, v in enumerate(target_counts.values):
    axes[0].text(i, v + 50, str(v), ha='center', va='bottom', fontweight='bold')

# Pie chart
axes[1].pie(target_counts.values, labels=['No Stroke', 'Stroke'], autopct='%1.1f%%', 
            colors=['green', 'red'], startangle=90)
axes[1].set_title('Stroke Distribution (Percentage)')

plt.tight_layout()
plt.show()

# INTERPRETATION:
# TODO: Write your interpretation here
# Example: "The dataset is highly imbalanced with only 4.9% stroke cases (249 out of 5,110).
# This imbalance suggests we should:
# 1. Focus on recall as primary metric (to minimize missing stroke cases)
# 2. Consider using class weights in our models
# 3. Be cautious about accuracy as a metric - a model predicting all 'No Stroke' 
#    would achieve 95% accuracy but be useless!"
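
The accuracy caveat in the example above can be checked with plain arithmetic. This is a minimal sketch that uses the class counts quoted in that example (249 stroke cases out of 5,110); the function name `always_negative_metrics` is made up for illustration.

```python
def always_negative_metrics(n_total, n_positive):
    """Accuracy and recall of a classifier that always predicts the negative class."""
    true_negatives = n_total - n_positive   # every negative row is predicted correctly
    accuracy = true_negatives / n_total
    recall = 0 / n_positive                 # no positive case is ever flagged
    return accuracy, recall


accuracy, recall = always_negative_metrics(n_total=5110, n_positive=249)
print(f"Accuracy: {accuracy:.4f}, Recall: {recall:.4f}")
```

High accuracy paired with zero recall is the reason accuracy alone is a poor metric for this dataset.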

## 3.3 Numerical Features Distribution

In [None]:
# Select numerical columns (exclude 'id' as it's just an identifier)
numerical_cols = ['age', 'avg_glucose_level', 'bmi']

# Distribution plots
fig, axes = plt.subplots(3, 2, figsize=(14, 12))

for i, col in enumerate(numerical_cols):
    # Histogram
    axes[i, 0].hist(df[col].dropna(), bins=30, edgecolor='black', alpha=0.7)
    axes[i, 0].set_xlabel(col)
    axes[i, 0].set_ylabel('Frequency')
    axes[i, 0].set_title(f'{col} Distribution')
    axes[i, 0].axvline(df[col].mean(), color='red', linestyle='--', label=f'Mean: {df[col].mean():.2f}')
    axes[i, 0].axvline(df[col].median(), color='green', linestyle='--', label=f'Median: {df[col].median():.2f}')
    axes[i, 0].legend()
    
    # Box plot
    axes[i, 1].boxplot(df[col].dropna(), vert=True)
    axes[i, 1].set_ylabel(col)
    axes[i, 1].set_title(f'{col} Box Plot (Outlier Detection)')
    axes[i, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

# Outlier detection using IQR method
print("\n" + "="*60)
print("Outlier Analysis (IQR Method):")
print("="*60)

for col in numerical_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    outlier_count = len(outliers)
    outlier_percent = (outlier_count / len(df)) * 100
    
    print(f"\n{col}:")
    print(f"  Q1: {Q1:.2f}, Q3: {Q3:.2f}, IQR: {IQR:.2f}")
    print(f"  Valid range: [{lower_bound:.2f}, {upper_bound:.2f}]")
    print(f"  Outliers detected: {outlier_count} ({outlier_percent:.2f}%)")
    if outlier_count > 0:
        print(f"  Outlier values range: [{outliers[col].min():.2f}, {outliers[col].max():.2f}]")

# INTERPRETATION:
# TODO: Write your interpretation here
# Example (illustrative; describe what your own plots show): "Age shows a right-skewed
# distribution with most patients between 40-80 years.
# BMI has some outliers above 50, representing extreme obesity cases which are medically
# relevant for stroke prediction - we should keep these.
# Glucose level is highly variable with a long right tail, which may reflect diabetic patients."

## 3.4 Correlation Analysis

In [None]:
# Select numerical columns including binary features for correlation
corr_cols = ['age', 'hypertension', 'heart_disease', 'avg_glucose_level', 'bmi', 'stroke']
correlation_matrix = df[corr_cols].corr()

# Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.3f', cmap='coolwarm', 
            center=0, square=True, linewidths=1)
plt.title('Feature Correlation Heatmap', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Feature correlation with target
print("\nCorrelation with Target Variable (Stroke):")
target_corr = correlation_matrix['stroke'].sort_values(ascending=False)
print(target_corr)

# Visualize correlation with target
# (drop the target's self-correlation of 1.0, which sorts to the top, not the bottom)
plt.figure(figsize=(10, 6))
target_corr.drop('stroke').plot(kind='barh', color='steelblue')
plt.xlabel('Correlation with Stroke')
plt.title('Feature Correlation with Target Variable')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

# INTERPRETATION:
# TODO: Write your interpretation here
# Example: "Age shows strongest positive correlation with stroke (0.25), indicating older
# patients have higher risk. Hypertension and heart disease also show positive correlation.
# All correlations are relatively weak (<0.3). Pearson correlation only measures linear
# association, so non-linear effects may still be present; tree-based models (Random Forest,
# Gradient Boosting) can capture those better than linear models."
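
A weak Pearson coefficient rules out a strong linear trend, not a relationship altogether. The toy example below is not drawn from the stroke data and is only a sketch of that point: `y` is fully determined by `x`, yet the linear correlation is essentially zero.

```python
import numpy as np

# Symmetric inputs with a purely quadratic (non-linear) dependence
x = np.linspace(-1.0, 1.0, 201)
y = x ** 2

# Pearson correlation only picks up the linear component of the relationship
pearson_r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r between x and x^2: {pearson_r:.4f}")
```

This is why the heatmap is best read together with the grouped box plots and category-level stroke rates in section 3.5.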

## 3.5 Target Variable vs Key Features Analysis

In [None]:
# Numerical features vs Target
numerical_features = ['age', 'avg_glucose_level', 'bmi']

fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for i, col in enumerate(numerical_features):
    # Box plot grouped by stroke
    df.boxplot(column=col, by='stroke', ax=axes[i])
    axes[i].set_xlabel('Stroke (0=No, 1=Yes)')
    axes[i].set_ylabel(col)
    axes[i].set_title(f'{col} vs Stroke')
    plt.sca(axes[i])
    plt.xticks([1, 2], ['No Stroke', 'Stroke'])

plt.suptitle('')  # Remove auto-generated title
plt.tight_layout()
plt.show()

# Statistical comparison
print("\nMean Values by Stroke Status:")
print("="*60)
for col in numerical_features:
    no_stroke_mean = df[df['stroke']==0][col].mean()
    stroke_mean = df[df['stroke']==1][col].mean()
    difference = stroke_mean - no_stroke_mean
    percent_diff = (difference / no_stroke_mean) * 100
    
    print(f"\n{col}:")
    print(f"  No Stroke: {no_stroke_mean:.2f}")
    print(f"  Stroke: {stroke_mean:.2f}")
    print(f"  Difference: {difference:+.2f} ({percent_diff:+.1f}%)")

# INTERPRETATION:
# TODO: Write your interpretation here

In [None]:
# Categorical features vs Target
categorical_features = ['gender', 'hypertension', 'heart_disease', 'ever_married', 
                        'work_type', 'Residence_type', 'smoking_status']

fig, axes = plt.subplots(3, 3, figsize=(16, 12))
axes = axes.flatten()

for i, col in enumerate(categorical_features):
    # Create crosstab
    ct = pd.crosstab(df[col], df['stroke'], normalize='index') * 100
    
    # Plot
    ct.plot(kind='bar', stacked=False, ax=axes[i], color=['green', 'red'])
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Percentage (%)')
    axes[i].set_title(f'Stroke Rate by {col}')
    axes[i].legend(['No Stroke', 'Stroke'])
    axes[i].tick_params(axis='x', rotation=45)

# Remove extra subplots
for j in range(len(categorical_features), len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

# Print stroke rates by category
print("\nStroke Rate by Category:")
print("="*60)
for col in categorical_features:
    print(f"\n{col}:")
    stroke_rate = df.groupby(col)['stroke'].mean() * 100
    counts = df.groupby(col)['stroke'].value_counts().unstack(fill_value=0)
    for category in stroke_rate.index:
        print(f"  {category}: {stroke_rate[category]:.2f}% ({counts.loc[category, 1] if 1 in counts.columns else 0} strokes out of {counts.loc[category].sum()} total)")

# INTERPRETATION:
# TODO: Write your interpretation here

## 3.6 EDA Summary & Key Insights

**TODO: Summarize your key findings from EDA**

Example structure:

### Key Findings:
1. **Class Imbalance:** Only 4.9% stroke cases - need to focus on recall metric
2. **Missing Data:** BMI has 201 missing values (3.9%) - will impute with median
3. **Strong Predictors:** Age, hypertension, heart disease show clear association with stroke
4. **Outliers:** BMI outliers represent medically relevant cases - will keep them
5. **Feature Relationships:** Weak linear correlations suggest tree-based models may perform better

### Implications for Modeling:
- Use recall as primary evaluation metric
- Consider class weights or resampling techniques
- Focus on Random Forest and Gradient Boosting (handle non-linear relationships)
- Feature engineering: create age groups, BMI categories, risk scores

# 4. Data Preparation

## 4.1 Handle Missing Values

In [None]:
# Create a copy for preprocessing
df_clean = df.copy()

# Handle missing BMI values
print("Before imputation:")
print(f"Missing BMI values: {df_clean['bmi'].isnull().sum()}")

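# TODO: Impute missing BMI values with median
# Justification: Using median instead of mean because BMI has outliers
bmi_median = df_clean['bmi'].median()
# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated in recent pandas versions and may not modify df_clean
df_clean['bmi'] = df_clean['bmi'].fillna(bmi_median)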

print(f"\nAfter imputation:")
print(f"Missing BMI values: {df_clean['bmi'].isnull().sum()}")
print(f"Imputed value (median): {bmi_median:.2f}")

# Justification: 
# "Using the median (value printed above) instead of the mean because the BMI distribution
# has outliers. The median is more robust and represents a typical BMI value better than
# the mean, which would be influenced by extreme obesity cases."

# Verify no missing values remain
print(f"\nTotal missing values in dataset: {df_clean.isnull().sum().sum()}")
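
One caveat worth noting: the median above is computed on the full dataset, before the train/test split in section 4.4, so a little information from the test rows leaks into preprocessing. A possible alternative, sketched here on made-up toy data and not the notebook's prescribed method, is to fit `SimpleImputer` on the training rows only. All variable names in the sketch are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for the stroke data (values are made up)
toy = pd.DataFrame({
    "bmi": [22.0, np.nan, 31.5, 27.0, np.nan, 45.0, 24.5, 29.0, 19.5, 33.0],
    "stroke": [0, 0, 1, 0, 0, 1, 0, 0, 0, 1],
})

X_toy = toy[["bmi"]]
y_toy = toy["stroke"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# Learn the median from the training rows only, then reuse it on the test rows
imputer = SimpleImputer(strategy="median")
X_tr_imputed = imputer.fit_transform(X_tr)
X_te_imputed = imputer.transform(X_te)

print("Median learned from training data:", imputer.statistics_[0])
```

With only about 4% of BMI values missing, the practical difference is probably small, but the split-then-fit order keeps the test set untouched.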

## 4.2 Handle Outliers

In [None]:
# Decision on outliers
print("Outlier Handling Decision:")
print("="*60)
print("\nDECISION: KEEP all outliers")
print("\nJustification:")
print("1. BMI outliers (>50) represent extreme obesity - medically relevant for stroke")
print("2. Age outliers represent very elderly patients - high stroke risk group")
print("3. Glucose outliers indicate diabetes/pre-diabetes - important stroke risk factor")
print("4. Removing these would lose valuable information about high-risk patients")
print("5. Tree-based models (our planned approach) are robust to outliers")

# No outlier removal needed - proceed with df_clean as is

## 4.3 Feature Encoding

In [None]:
# Drop ID column (not useful for prediction)
df_clean = df_clean.drop('id', axis=1)

# Encode categorical variables using one-hot encoding
categorical_cols = ['gender', 'ever_married', 'work_type', 'Residence_type', 'smoking_status']

print("Before encoding:")
print(f"Number of features: {df_clean.shape[1]}")
print(f"Categorical columns: {categorical_cols}")

# One-hot encoding (drop_first=True to avoid multicollinearity)
df_encoded = pd.get_dummies(df_clean, columns=categorical_cols, drop_first=True)

print(f"\nAfter encoding:")
print(f"Number of features: {df_encoded.shape[1]}")
print(f"\nNew feature names:")
print(df_encoded.columns.tolist())

# Justification:
# "Using one-hot encoding for categorical variables to convert them into numerical format.
# drop_first=True removes one category from each feature to prevent perfect multicollinearity
# (dummy variable trap), which can cause issues in some models."
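
To make the effect of `drop_first=True` concrete, here is a small sketch on a made-up three-row frame. The column names match the dataset, but the rows are invented.

```python
import pandas as pd

# Three invented rows using two of the dataset's categorical columns
toy = pd.DataFrame({
    "Residence_type": ["Urban", "Rural", "Urban"],
    "smoking_status": ["smokes", "never smoked", "Unknown"],
})

# The first category in sorted order is dropped and becomes the reference level
encoded = pd.get_dummies(
    toy, columns=["Residence_type", "smoking_status"], drop_first=True
)
print(encoded.columns.tolist())
```

Only categories that actually occur in the data produce columns, so any new data (for example, input collected in a web app) must be aligned to the training column list before prediction.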

## 4.4 Train-Test Split

In [None]:
# Separate features and target
X = df_encoded.drop('stroke', axis=1)
y = df_encoded['stroke']

print("Dataset split:")
print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")

# Split into train and test sets
# stratify=y ensures same proportion of stroke/no-stroke in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 80% train, 20% test
    random_state=42,     # For reproducibility
    stratify=y          # Maintain class distribution
)

print("\nTrain set:")
print(f"  X_train: {X_train.shape}")
print(f"  y_train: {y_train.shape}")
print(f"  Stroke distribution: {y_train.value_counts().to_dict()}")

print("\nTest set:")
print(f"  X_test: {X_test.shape}")
print(f"  y_test: {y_test.shape}")
print(f"  Stroke distribution: {y_test.value_counts().to_dict()}")

# Verify stratification worked
train_stroke_rate = y_train.mean() * 100
test_stroke_rate = y_test.mean() * 100
print(f"\nStroke rate in train set: {train_stroke_rate:.2f}%")
print(f"Stroke rate in test set: {test_stroke_rate:.2f}%")
print("✅ Stratification successful!" if abs(train_stroke_rate - test_stroke_rate) < 1 else "⚠️ Check stratification")

# 5. Model Development

## 5.1 Baseline Model

In [None]:
# Baseline model (predicts most frequent class)
baseline = DummyClassifier(strategy='most_frequent', random_state=42)
baseline.fit(X_train, y_train)
y_pred_baseline = baseline.predict(X_test)

print("BASELINE MODEL (Most Frequent Class)")
print("="*60)
print(f"Accuracy: {accuracy_score(y_test, y_pred_baseline):.4f}")
print(f"Recall: {recall_score(y_test, y_pred_baseline):.4f}")
print(f"F1-Score: {f1_score(y_test, y_pred_baseline, zero_division=0):.4f}")

print("\nNote: This baseline always predicts 'No Stroke'. We need to beat this!")

# This serves as our minimum benchmark - any real model must perform better

## 5.2 Train Multiple Models

We'll train 3 different algorithms and compare their performance.

In [None]:
# TODO: Model 1 - Random Forest Classifier
print("="*60)
print("MODEL 1: RANDOM FOREST CLASSIFIER")
print("="*60)

# Initialize and train
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    class_weight='balanced',  # Handle class imbalance
    n_jobs=-1  # Use all CPU cores
)

rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)
y_pred_rf_proba = rf_model.predict_proba(X_test)[:, 1]

# Evaluation
print("\nPerformance Metrics:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_rf):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_rf):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_rf_proba):.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_rf))

# TODO: Add your interpretation

In [None]:
# TODO: Model 2 - Gradient Boosting Classifier
print("="*60)
print("MODEL 2: GRADIENT BOOSTING CLASSIFIER")
print("="*60)

# Initialize and train
gb_model = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    random_state=42
)

gb_model.fit(X_train, y_train)

# Predictions
y_pred_gb = gb_model.predict(X_test)
y_pred_gb_proba = gb_model.predict_proba(X_test)[:, 1]

# Evaluation
print("\nPerformance Metrics:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_gb):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_gb):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_gb):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_gb):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_gb_proba):.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_gb))

# TODO: Add your interpretation

In [None]:
# TODO: Model 3 - Logistic Regression
print("="*60)
print("MODEL 3: LOGISTIC REGRESSION")
print("="*60)

# Initialize and train
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight='balanced'
)

lr_model.fit(X_train, y_train)

# Predictions
y_pred_lr = lr_model.predict(X_test)
y_pred_lr_proba = lr_model.predict_proba(X_test)[:, 1]

# Evaluation
print("\nPerformance Metrics:")
print(f"Accuracy:  {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"Precision: {precision_score(y_test, y_pred_lr):.4f}")
print(f"Recall:    {recall_score(y_test, y_pred_lr):.4f}")
print(f"F1-Score:  {f1_score(y_test, y_pred_lr):.4f}")
print(f"ROC-AUC:   {roc_auc_score(y_test, y_pred_lr_proba):.4f}")

print("\nClassification Report:")
print(classification_report(y_test, y_pred_lr))

# TODO: Add your interpretation
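
Logistic regression is sensitive to feature scale, and the cell above fits it on unscaled features even though `StandardScaler` is imported in section 1. One option, sketched here on synthetic data from `make_classification` and not on the stroke dataset, is to wrap the scaler and the model in a pipeline so that scaling is learned from the training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the stroke data (roughly 10% positives)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# The scaler is fit on the training rows only, then applied to the test rows
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)
scaled_lr.fit(X_tr, y_tr)
y_pred_scaled = scaled_lr.predict(X_te)

print(f"Recall on synthetic test set: {recall_score(y_te, y_pred_scaled):.4f}")
```

Tree-based models do not need this step, so the Random Forest and Gradient Boosting cells can stay as they are.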

## 5.3 Model Comparison

In [None]:
# Create comparison table
model_comparison = pd.DataFrame({
    'Model': ['Baseline', 'Random Forest', 'Gradient Boosting', 'Logistic Regression'],
    'Accuracy': [
        accuracy_score(y_test, y_pred_baseline),
        accuracy_score(y_test, y_pred_rf),
        accuracy_score(y_test, y_pred_gb),
        accuracy_score(y_test, y_pred_lr)
    ],
    'Precision': [
        precision_score(y_test, y_pred_baseline, zero_division=0),
        precision_score(y_test, y_pred_rf),
        precision_score(y_test, y_pred_gb),
        precision_score(y_test, y_pred_lr)
    ],
    'Recall': [
        recall_score(y_test, y_pred_baseline),
        recall_score(y_test, y_pred_rf),
        recall_score(y_test, y_pred_gb),
        recall_score(y_test, y_pred_lr)
    ],
    'F1-Score': [
        f1_score(y_test, y_pred_baseline, zero_division=0),
        f1_score(y_test, y_pred_rf),
        f1_score(y_test, y_pred_gb),
        f1_score(y_test, y_pred_lr)
    ],
    'ROC-AUC': [
        0.5,  # Baseline always predicts the same class, so its ROC-AUC is 0.5 (chance level)
        roc_auc_score(y_test, y_pred_rf_proba),
        roc_auc_score(y_test, y_pred_gb_proba),
        roc_auc_score(y_test, y_pred_lr_proba)
    ]
})

print("\nMODEL COMPARISON TABLE")
print("="*80)
display(model_comparison.round(4))

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Plot 1: All metrics
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
x = np.arange(len(metrics))
width = 0.25

# Select metric columns by name and centre the three bars on each tick
axes[0].bar(x - width, model_comparison.loc[1, metrics].astype(float), width, label='Random Forest')
axes[0].bar(x, model_comparison.loc[2, metrics].astype(float), width, label='Gradient Boosting')
axes[0].bar(x + width, model_comparison.loc[3, metrics].astype(float), width, label='Logistic Regression')
axes[0].set_xlabel('Metrics')
axes[0].set_ylabel('Score')
axes[0].set_title('Model Performance Comparison')
axes[0].set_xticks(x)
axes[0].set_xticklabels(metrics)
axes[0].legend()
axes[0].grid(axis='y', alpha=0.3)

# Plot 2: Recall comparison (most important metric)
recall_data = model_comparison[['Model', 'Recall']].sort_values('Recall', ascending=True)
colors = ['red' if name == 'Baseline' else 'steelblue' for name in recall_data['Model']]
axes[1].barh(recall_data['Model'], recall_data['Recall'], color=colors)
axes[1].set_xlabel('Recall Score')
axes[1].set_title('Recall Comparison (Primary Metric)')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

# TODO: Model Selection Rationale
print("\n" + "="*80)
print("MODEL SELECTION RATIONALE")
print("="*80)
print("\nTODO: Write your model selection rationale here")
print("\nExample:")
print("Chosen Model: Random Forest")
print("Reasons:")
print("1. Highest recall (0.XX) - critical for minimizing missed stroke cases")
print("2. Best F1-Score (0.XX) - good balance between precision and recall")
print("3. Strong ROC-AUC (0.XX) - excellent discrimination ability")
print("4. Handles non-linear relationships well (as seen in EDA)")
print("5. Provides feature importance for interpretability")
print("\nBusiness Impact: Missing a stroke case is much more costly than a false alarm,")
print("so recall is our primary metric. Random Forest achieves best recall while")
print("maintaining reasonable precision.")
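
A single 80/20 split gives one recall estimate per model, and with so few stroke cases in the test set that estimate can be noisy. `cross_val_score` is imported in section 1 but not yet used; the sketch below shows one way to apply it, using synthetic data and an illustrative 50-tree forest in place of the notebook's own variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training data
X_syn, y_syn = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Stratified folds keep the positive rate similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42)

recall_scores = cross_val_score(rf, X_syn, y_syn, cv=cv, scoring="recall")
print(f"Recall per fold: {recall_scores.round(3)}")
print(f"Mean recall: {recall_scores.mean():.3f} (std {recall_scores.std():.3f})")
```

In the notebook this would be run on `X_train` and `y_train` only, keeping the test set for the final evaluation.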

# 6. Iterative Model Development

## 6.1 Feature Engineering

In [None]:
# TODO: Create new features
# This is where you'll add engineered features to improve model performance

print("FEATURE ENGINEERING")
print("="*60)

# Create a copy of the encoded data
df_fe = df_encoded.copy()

# Feature 1: Age Groups
# TODO: Create age categories
# Example: Young (0-40), Middle (41-60), Senior (61-80), Elderly (81+)

# Feature 2: BMI Categories
# TODO: Create BMI categories
# Example: Underweight (<18.5), Normal (18.5-24.9), Overweight (25-29.9), Obese (>=30)

# Feature 3: Health Risk Score
# TODO: Combine risk factors
# Example: hypertension + heart_disease + (age > 60) + (bmi >= 30)

# Feature 4: Glucose Category
# TODO: Categorize glucose levels
# Example: Normal (<140), Prediabetic (140-199), Diabetic (>=200)

print("\nNew features created:")
print("TODO: List your new features here")
print(f"\nTotal features before: {df_encoded.shape[1]}")
print(f"Total features after: {df_fe.shape[1]}")

# Justification:
print("\nJustification:")
print("TODO: Explain why these features should improve the model")
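
The feature ideas listed in the cell above could be implemented with `pd.cut` and simple boolean arithmetic. The sketch below uses four made-up rows; the bin edges follow the examples in the TODO comments, and the names `age_group`, `bmi_category`, and `risk_score` are illustrative choices, not requirements.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for df_fe (values are made up)
toy = pd.DataFrame({
    "age": [35, 40, 61, 82],
    "bmi": [17.0, 24.9, 30.0, 41.2],
    "hypertension": [0, 0, 1, 1],
    "heart_disease": [0, 0, 0, 1],
})

# Age groups with right-closed bins; include_lowest keeps an age of exactly 0
toy["age_group"] = pd.cut(
    toy["age"],
    bins=[0, 40, 60, 80, np.inf],
    labels=["Young", "Middle", "Senior", "Elderly"],
    include_lowest=True,
)

# BMI categories with left-closed bins: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
toy["bmi_category"] = pd.cut(
    toy["bmi"],
    bins=[0, 18.5, 25, 30, np.inf],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)

# Additive risk score: one point per risk factor present
toy["risk_score"] = (
    toy["hypertension"]
    + toy["heart_disease"]
    + (toy["age"] > 60).astype(int)
    + (toy["bmi"] >= 30).astype(int)
)

print(toy[["age_group", "bmi_category", "risk_score"]])
```

The two categorical columns would still need one-hot encoding (as in section 4.3) before being passed to a scikit-learn model.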

In [None]:
# TODO: Train model with new features
# Split data
# X_fe = df_fe.drop('stroke', axis=1)
# y_fe = df_fe['stroke']
# X_train_fe, X_test_fe, y_train_fe, y_test_fe = train_test_split(
#     X_fe, y_fe, test_size=0.2, random_state=42, stratify=y_fe)

# Train Random Forest with new features
# rf_fe = RandomForestClassifier(...)
# rf_fe.fit(X_train_fe, y_train_fe)
# y_pred_fe = rf_fe.predict(X_test_fe)

# Compare results
# print("BEFORE Feature Engineering:")
# print(f"Recall: {recall_score(y_test, y_pred_rf):.4f}")
# print(f"F1-Score: {f1_score(y_test, y_pred_rf):.4f}")

# print("\nAFTER Feature Engineering:")
# print(f"Recall: {recall_score(y_test_fe, y_pred_fe):.4f}")
# print(f"F1-Score: {f1_score(y_test_fe, y_pred_fe):.4f}")

# print("\nImprovement:")
# print(f"Recall: {recall_score(y_test_fe, y_pred_fe) - recall_score(y_test, y_pred_rf):+.4f}")

print("TODO: Implement feature engineering and show improvement")

## 6.2 Hyperparameter Tuning

**IMPORTANT:** Must use RandomizedSearchCV (NOT GridSearchCV) with max 3 values per hyperparameter

In [None]:
# TODO: Hyperparameter tuning with RandomizedSearchCV
print("HYPERPARAMETER TUNING")
print("="*60)

# Define parameter distribution (max 3 values per parameter)
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# TODO: Implement RandomizedSearchCV
# rs_rf = RandomizedSearchCV(
#     RandomForestClassifier(random_state=42, class_weight='balanced'),
#     param_distributions=param_dist,
#     n_iter=20,
#     cv=5,
#     scoring='recall',
#     random_state=42,
#     verbose=1,
#     n_jobs=-1
# )

# rs_rf.fit(X_train_fe, y_train_fe)

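# Best parameters and results
# print("\nBest Parameters:", rs_rf.best_params_)
# print("Best CV Recall:", rs_rf.best_score_)

# Keep the refitted best estimator; sections 7.3 and 8 refer to it as best_model
# best_model = rs_rf.best_estimator_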

print("TODO: Implement hyperparameter tuning")
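
For reference, here is a scaled-down but runnable version of the search outlined above. It uses synthetic data and a deliberately small grid so it finishes quickly; the values in `param_dist_small` are illustrative and are not tuned for the stroke dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the training data
X_syn, y_syn = make_classification(
    n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Small search space and budget so the sketch runs quickly
param_dist_small = {
    "n_estimators": [20, 40, 60],
    "max_depth": [3, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=42),
    param_distributions=param_dist_small,
    n_iter=5,            # samples 5 of the 27 possible combinations
    cv=3,                # stratified 3-fold CV for classifiers
    scoring="recall",
    random_state=42,
)
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print(f"Best CV recall: {search.best_score_:.3f}")
best_model_demo = search.best_estimator_  # refit on all of X_syn by default
```

The four-parameter grid in the cell above has 3^4 = 81 combinations, so `n_iter=20` would sample roughly a quarter of them.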

# 7. Final Model Evaluation

## 7.1 Comprehensive Metrics

In [None]:
# TODO: Evaluate final tuned model
print("FINAL MODEL EVALUATION")
print("="*60)

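# Predictions from the tuned model (rs_rf comes from section 6.2)
# y_pred_tuned = rs_rf.best_estimator_.predict(X_test_fe)

# Classification report
# print(classification_report(y_test_fe, y_pred_tuned))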

# Confusion matrix
# cm = confusion_matrix(y_test_fe, y_pred_tuned)
# plt.figure(figsize=(8, 6))
# sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
# plt.xlabel('Predicted')
# plt.ylabel('Actual')
# plt.title('Confusion Matrix - Final Model')
# plt.show()

print("TODO: Show final model evaluation")
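
When reading the confusion matrix, it helps to unpack the four cells by name and derive recall and precision from them. The labels below are invented for illustration only.

```python
from sklearn.metrics import confusion_matrix

# Invented labels: 1 = stroke, 0 = no stroke
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall_demo = tp / (tp + fn)      # share of actual strokes that were flagged
precision_demo = tp / (tp + fp)   # share of flagged patients who had a stroke

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Recall={recall_demo:.3f}, Precision={precision_demo:.3f}")
```

In this setting the false negatives (`fn`) are the missed stroke cases that the primary metric is meant to minimize.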

## 7.2 Evaluation Metric Justification

**TODO: Justify your choice of evaluation metrics**

Example:

### Primary Metric: RECALL

**Business Rationale:**
- **Cost of False Negative (Missing Stroke):** Very High
  - Patient doesn't receive preventive care
  - Stroke occurs → disability, death, high treatment costs
  - Lost opportunity for lifestyle intervention

- **Cost of False Positive (False Alarm):** Low
  - Additional medical testing
  - Minor inconvenience
  - Potentially discovers other health issues

**Therefore:** Optimize for recall to minimize missed cases
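
One practical lever for trading precision for recall is the decision threshold applied to `predict_proba`. The sketch below uses synthetic data and an untuned logistic regression purely to illustrate the mechanics; the thresholds 0.5, 0.3, and 0.1 are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the stroke data
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Lowering the threshold flags more patients, so recall can only stay equal or rise
results = {}
for threshold in (0.5, 0.3, 0.1):
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = (
        recall_score(y_te, y_hat),
        precision_score(y_te, y_hat, zero_division=0),
    )
    print(f"threshold={threshold:.1f}  recall={results[threshold][0]:.3f}  "
          f"precision={results[threshold][1]:.3f}")
```

In a real workflow the threshold should be chosen on validation folds, not on the test set, so the final evaluation stays unbiased.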

### Secondary Metrics:
- **F1-Score:** Ensures we maintain reasonable precision
- **ROC-AUC:** Shows overall discrimination ability

### Business Impact:
- Final recall of 0.XX means we identify XX% of stroke cases
- In a population of 10,000, this prevents XX strokes from being missed
- Estimated cost savings: $XXX per prevented stroke × XX cases = $XXX,XXX
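
The business-impact figures can be derived with a short calculation. In this sketch the 4.9% prevalence is the dataset's stroke rate (which may not match a real patient population), the recall of 0.80 is a placeholder to replace with your final result, and `expected_missed_cases` is a hypothetical helper.

```python
def expected_missed_cases(population, prevalence, recall):
    """Expected number of true stroke cases the model fails to flag."""
    expected_cases = population * prevalence
    return expected_cases * (1.0 - recall)


# Prevalence from the dataset; recall is a placeholder value
missed = expected_missed_cases(population=10_000, prevalence=0.049, recall=0.80)
caught = 10_000 * 0.049 - missed

print(f"Expected stroke cases flagged per 10,000 patients: {caught:.0f}")
print(f"Expected stroke cases missed per 10,000 patients: {missed:.0f}")
```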

## 7.3 Feature Importance Analysis

In [None]:
# TODO: Show feature importance from final model
# importances = best_model.feature_importances_
# feature_names = X_train_fe.columns
# feature_importance_df = pd.DataFrame({
#     'Feature': feature_names,
#     'Importance': importances
# }).sort_values('Importance', ascending=False)

# plt.figure(figsize=(10, 8))
# plt.barh(feature_importance_df['Feature'][:15], feature_importance_df['Importance'][:15])
# plt.xlabel('Importance')
# plt.title('Top 15 Most Important Features')
# plt.gca().invert_yaxis()
# plt.tight_layout()
# plt.show()

print("TODO: Implement feature importance visualization")

# 8. Save Final Model

Save the model for deployment in the Streamlit app.

In [None]:
# TODO: Save your final model
# with open('stroke_prediction_model.pkl', 'wb') as f:
#     pickle.dump(best_model, f)

# # Also save feature names for consistency in Streamlit
# with open('feature_names.pkl', 'wb') as f:
#     pickle.dump(X_train_fe.columns.tolist(), f)

# print("✅ Model saved successfully!")
# print("Files created:")
# print("  - stroke_prediction_model.pkl")
# print("  - feature_names.pkl")

print("TODO: Save your final model")
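
Before relying on the saved file, it is worth confirming that the model survives a save/load round trip. The sketch below does this with a small stand-in model on synthetic data and a temporary directory; the file name `demo_model.pkl` is illustrative.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on synthetic data
X_syn, y_syn = make_classification(n_samples=200, n_features=6, random_state=42)
demo_model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_syn, y_syn)

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly
same_predictions = (loaded_model.predict(X_syn) == demo_model.predict(X_syn)).all()
print("Round-trip predictions identical:", same_predictions)
```

Pickled scikit-learn models should be loaded with the same library version that created them, and pickle files should only be loaded from sources you trust.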

# 9. Development Log

**TODO: Document your iterative development process**

Example:

## Iteration 1: Baseline Models (Date: XX/XX/2025)
- Trained 3 models: Random Forest, Gradient Boosting, Logistic Regression
- Best: Random Forest with recall=0.XX
- Issue: Low recall, many false negatives
- Decision: Focus on Random Forest, add feature engineering

## Iteration 2: Feature Engineering (Date: XX/XX/2025)
- Added: Age groups, BMI categories, Health risk score, Glucose categories
- Result: Recall improved to 0.XX (+0.XX improvement)
- Analysis: Age groups and risk score most impactful
- Decision: Keep all new features for final model

## Iteration 3: Hyperparameter Tuning (Date: XX/XX/2025)
- Used RandomizedSearchCV with 20 iterations
- Tuned: n_estimators, max_depth, min_samples_split, min_samples_leaf
- Best params: [list params]
- Result: Recall improved to 0.XX (+0.XX improvement)

## Final Model Performance
- Model: Random Forest with feature engineering and tuned hyperparameters
- Recall: 0.XX (primary metric)
- F1-Score: 0.XX
- ROC-AUC: 0.XX
- Total improvement from baseline: +0.XX recall

# Next Steps

1. [ ] Complete all TODO sections in this notebook
2. [ ] Build Streamlit web application
3. [ ] Deploy to Streamlit Cloud
4. [ ] Prepare presentation slides
5. [ ] Complete Word document with links and screenshots
6. [ ] Push code to GitHub for version control evidence