### ***PREDICTING STROKE RISK USING PATIENT HEALTH DATA - MODELING***

**Modeling Plan**

1. Select and Fit Models (Two to Three Models)

    Given that we're working on stroke prediction (a classification problem), a few models to consider include:
        Logistic Regression (baseline model)
        Random Forest Classifier
        Gradient Boosting

    Code Outline: For each model, we will:
        Fit the model to our training dataset.
        Evaluate its performance using metrics like accuracy, precision, recall, and F1 score.
        Optionally perform hyperparameter tuning (e.g., using GridSearchCV or cross-validation).
   
2. Hyperparameter Tuning

    Use GridSearchCV or RandomizedSearchCV to optimize the parameters of at least one model.

3. Compare Model Performance

    After fitting our models and tuning hyperparameters, compare them based on:
        Accuracy
        Precision
        Recall
        F1 Score
        Confusion Matrix (for a more detailed breakdown)
    Create a comparison table of these metrics for each model.

4. Select the Best Model

    Based on the performance metrics, determine which model best fits our criteria.
    Discuss why this model is the best for this project (consider not only accuracy but also scalability, computational efficiency, and interpretability).
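
The per-model steps in the plan (fit, predict, compute the four metrics) repeat for every classifier, so they could be wrapped in one helper. The sketch below is illustrative only and is not the code used in the cells that follow: `evaluate_model` is a hypothetical name, and the data is a synthetic stand-in generated with scikit-learn's `make_classification`, not the stroke dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split


def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit `model` on the training split and return the four metrics from the plan."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        'Accuracy': accuracy_score(y_test, y_pred),
        # zero_division=0 returns 0 instead of warning when no positives are predicted
        'Precision': precision_score(y_test, y_pred, zero_division=0),
        'Recall': recall_score(y_test, y_pred, zero_division=0),
        'F1 Score': f1_score(y_test, y_pred, zero_division=0),
    }


# Synthetic, imbalanced stand-in data (roughly 5% positives); NOT the stroke data
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

demo_metrics = evaluate_model(
    LogisticRegression(class_weight='balanced', max_iter=1000, random_state=42),
    X_tr, y_tr, X_te, y_te,
)
print(demo_metrics)
```

With a helper like this, each model would need a single call, and the returned dictionary could be stored directly under the model's name.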

In [74]:
# First we'll start by importing the necessary libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import warnings
warnings.filterwarnings('ignore')  # Ignore warnings

In [50]:
# Next, we'll load the scaled/preprocessed datasets that we created from the previous steps of this project
# Load the dataset
file_path = 'C:/Users/hecsa/Springboard/Springboard Github/Springboard/Data Science Capstone Two/dataset/stroke_data_scaled.csv'
stroke_data_scaled = pd.read_csv(file_path)

X_train = pd.read_csv('C:/Users/hecsa/Springboard/Springboard Github/Springboard/Data Science Capstone Two/dataset/X_train.csv')
X_test = pd.read_csv('C:/Users/hecsa/Springboard/Springboard Github/Springboard/Data Science Capstone Two/dataset/X_test.csv')
y_train = pd.read_csv('C:/Users/hecsa/Springboard/Springboard Github/Springboard/Data Science Capstone Two/dataset/y_train.csv')
y_test = pd.read_csv('C:/Users/hecsa/Springboard/Springboard Github/Springboard/Data Science Capstone Two/dataset/y_test.csv')

In [52]:
print(X_train.dtypes)

id                                       int64
age                                    float64
hypertension                             int64
heart_disease                            int64
avg_glucose_level                      float64
bmi                                    float64
age_group                               object
age_hypertension_interaction           float64
gender_Male_True                          bool
gender_Other_True                         bool
ever_married_Yes_True                     bool
work_type_Never_worked_True               bool
work_type_Private_True                    bool
work_type_Self-employed_True              bool
work_type_children_True                   bool
Residence_type_Urban_True                 bool
smoking_status_formerly smoked_True       bool
smoking_status_never smoked_True          bool
smoking_status_smokes_True                bool
dtype: object


In [54]:
# Apply one-hot encoding to categorical columns
X_train_encoded = pd.get_dummies(X_train, drop_first=True)  # One-hot encoding
X_test_encoded = pd.get_dummies(X_test, drop_first=True)

# Align columns in case one-hot encoding created different columns
X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join='left', axis=1, fill_value=0)
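
To make the alignment step concrete, here is a small sketch on toy data (the frames and their values are invented for illustration). It shows what `align(..., join='left', axis=1, fill_value=0)` does when a category appears in only one split.

```python
import pandas as pd

# Toy frames: the category 'Other' appears only in the training split
train = pd.DataFrame({'age': [30, 61, 45], 'gender': ['Male', 'Female', 'Other']})
test = pd.DataFrame({'age': [52, 70], 'gender': ['Female', 'Male']})

train_enc = pd.get_dummies(train, drop_first=True)
test_enc = pd.get_dummies(test, drop_first=True)

# join='left' keeps exactly the training columns;
# columns missing from the test frame are created and filled with 0
train_enc, test_enc = train_enc.align(test_enc, join='left', axis=1, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After alignment both frames expose the same columns in the same order, which is what a fitted model expects at prediction time.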

In [102]:
# Check class imbalance: value_counts() shows whether non-stroke cases greatly outnumber stroke cases.
print(y_train.value_counts())
print(y_test.value_counts())

stroke
0         3901
1          187
Name: count, dtype: int64
stroke
0         960
1          62
Name: count, dtype: int64


In [126]:
# Dictionary to store evaluation metrics for each model
evaluation_metrics = {}

In [128]:
# Logistic Regression Model
log_reg = LogisticRegression(random_state=42, class_weight='balanced')
log_reg.fit(X_train_encoded, y_train)
y_pred_log = log_reg.predict(X_test_encoded)

# Evaluation Metrics for Logistic Regression
log_reg_metrics = {
    'Accuracy': accuracy_score(y_test, y_pred_log),
    'Precision': precision_score(y_test, y_pred_log),
    'Recall': recall_score(y_test, y_pred_log),
    'F1 Score': f1_score(y_test, y_pred_log)
}
evaluation_metrics['Logistic Regression'] = log_reg_metrics

In [130]:
# Random Forest Classifier
rf_clf = RandomForestClassifier(random_state=42, class_weight='balanced')
rf_clf.fit(X_train_encoded, y_train)
y_pred_rf = rf_clf.predict(X_test_encoded)

# Evaluation Metrics for Random Forest
rf_clf_metrics = {
    'Accuracy': accuracy_score(y_test, y_pred_rf),
    'Precision': precision_score(y_test, y_pred_rf),
    'Recall': recall_score(y_test, y_pred_rf),
    'F1 Score': f1_score(y_test, y_pred_rf)
}
evaluation_metrics['Random Forest'] = rf_clf_metrics

In [132]:
# Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(random_state=42)
gb_clf.fit(X_train_encoded, y_train)
y_pred_gb = gb_clf.predict(X_test_encoded)

# Evaluation Metrics for Gradient Boosting
gb_clf_metrics = {
    'Accuracy': accuracy_score(y_test, y_pred_gb),
    'Precision': precision_score(y_test, y_pred_gb),
    'Recall': recall_score(y_test, y_pred_gb),
    'F1 Score': f1_score(y_test, y_pred_gb)
}
evaluation_metrics['Gradient Boosting'] = gb_clf_metrics

In [134]:
# XGBoost Model
# XGBClassifier and the metric functions were imported in the first cell.

# Calculate the class imbalance ratio:
# the number of negative examples (class 0) divided by the number of positive examples (class 1)
class_counts = y_train['stroke'].value_counts()
imbalance_ratio = class_counts[0] / class_counts[1]

# Initialize the XGBoost model with scale_pos_weight to address class imbalance.
# Distinct variable names keep the Gradient Boosting objects from the previous cell intact.
xgb_base = XGBClassifier(random_state=42, scale_pos_weight=imbalance_ratio)

# Fit the model on the training data
xgb_base.fit(X_train_encoded, y_train)

# Make predictions on the test data
y_pred_xgb = xgb_base.predict(X_test_encoded)

# Calculate the evaluation metrics for XGBoost
xgb_metrics = {
    'Accuracy': accuracy_score(y_test, y_pred_xgb),
    'Precision': precision_score(y_test, y_pred_xgb, zero_division=0),
    'Recall': recall_score(y_test, y_pred_xgb, zero_division=0),
    'F1 Score': f1_score(y_test, y_pred_xgb, zero_division=0)
}

# Add XGBoost metrics to the evaluation dictionary
evaluation_metrics['XGBoost'] = xgb_metrics


In [136]:
# Display the evaluation metrics for all models
for model_name, metrics in evaluation_metrics.items():
    print(f"\n{model_name} Model Performance:")
    for metric_name, metric_value in metrics.items():
        print(f"{metric_name}: {metric_value:.4f}")


Logistic Regression Model Performance:
Accuracy: 0.6947
Precision: 0.1469
Recall: 0.8387
F1 Score: 0.2500

Random Forest Model Performance:
Accuracy: 0.9393
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Gradient Boosting Model Performance:
Accuracy: 0.9384
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

XGBoost Model Performance:
Accuracy: 0.9168
Precision: 0.2195
Recall: 0.1452
F1 Score: 0.1748


In [138]:
# Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Define the parameter grid for XGBoost
param_grid = {
    'n_estimators': [100, 200, 300],        # Number of trees
    'learning_rate': [0.01, 0.1, 0.2],      # Step size shrinkage
    'max_depth': [3, 5, 7],                 # Maximum depth of trees
    'subsample': [0.8, 1.0],                # Fraction of samples used for training each tree
    'colsample_bytree': [0.8, 1.0],         # Fraction of features used for each tree
    'gamma': [0, 1, 5],                     # Minimum loss reduction for a split to happen
    'scale_pos_weight': [imbalance_ratio]    # Handling class imbalance
}

# Initialize XGBoost model
xgb_clf = XGBClassifier(random_state=42)

# Set up GridSearchCV for XGBoost
grid_search = GridSearchCV(estimator=xgb_clf, param_grid=param_grid, cv=3, scoring='f1', n_jobs=-1, verbose=1)

# Fit GridSearchCV on the training data
grid_search.fit(X_train_encoded, y_train)

# Tuned XGBoost Model
best_xgb = grid_search.best_estimator_
y_pred_best_xgb = best_xgb.predict(X_test_encoded)

# Evaluation Metrics for Tuned XGBoost
tuned_xgb_metrics = {
    'Accuracy': accuracy_score(y_test, y_pred_best_xgb),
    'Precision': precision_score(y_test, y_pred_best_xgb),
    'Recall': recall_score(y_test, y_pred_best_xgb),
    'F1 Score': f1_score(y_test, y_pred_best_xgb)
}

# Adding the tuned XGBoost metrics to the evaluation dictionary
evaluation_metrics['Tuned XGBoost'] = tuned_xgb_metrics

Fitting 3 folds for each of 324 candidates, totalling 972 fits
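
The "324 candidates, totalling 972 fits" message follows directly from the size of the parameter grid. A quick check, with the number of values per hyperparameter copied from `param_grid` above:

```python
from math import prod

# Number of values listed for each hyperparameter in param_grid
grid_sizes = {
    'n_estimators': 3,
    'learning_rate': 3,
    'max_depth': 3,
    'subsample': 2,
    'colsample_bytree': 2,
    'gamma': 3,
    'scale_pos_weight': 1,
}
cv_folds = 3  # cv=3 in GridSearchCV

n_candidates = prod(grid_sizes.values())  # every combination is tried
total_fits = n_candidates * cv_folds      # each combination is fit once per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {total_fits} fits")
```

Since the count grows multiplicatively, adding one more three-valued hyperparameter would triple the number of fits. With the default `refit=True`, GridSearchCV also fits the best combination once more on the full training set, which is not included in the reported total.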


In [140]:
# Display the evaluation metrics for all models
for model_name, metrics in evaluation_metrics.items():
    print(f"\n{model_name} Model Performance:")
    for metric_name, metric_value in metrics.items():
        print(f"{metric_name}: {metric_value:.4f}")


Logistic Regression Model Performance:
Accuracy: 0.6947
Precision: 0.1469
Recall: 0.8387
F1 Score: 0.2500

Random Forest Model Performance:
Accuracy: 0.9393
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

Gradient Boosting Model Performance:
Accuracy: 0.9384
Precision: 0.0000
Recall: 0.0000
F1 Score: 0.0000

XGBoost Model Performance:
Accuracy: 0.9168
Precision: 0.2195
Recall: 0.1452
F1 Score: 0.1748

Tuned XGBoost Model Performance:
Accuracy: 0.8004
Precision: 0.1858
Recall: 0.6774
F1 Score: 0.2917


In [142]:
# Create a comparison table for all models
models = list(evaluation_metrics.keys())
accuracy = [metrics['Accuracy'] for metrics in evaluation_metrics.values()]
precision = [metrics['Precision'] for metrics in evaluation_metrics.values()]
recall = [metrics['Recall'] for metrics in evaluation_metrics.values()]
f1_scores = [metrics['F1 Score'] for metrics in evaluation_metrics.values()]

# Create a DataFrame for comparison
comparison_df = pd.DataFrame({
    'Model': models,
    'Accuracy': accuracy,
    'Precision': precision,
    'Recall': recall,
    'F1 Score': f1_scores
})

# Display the comparison DataFrame
print(comparison_df)

                 Model  Accuracy  Precision    Recall  F1 Score
0  Logistic Regression  0.694716   0.146893  0.838710  0.250000
1        Random Forest  0.939335   0.000000  0.000000  0.000000
2    Gradient Boosting  0.938356   0.000000  0.000000  0.000000
3              XGBoost  0.916830   0.219512  0.145161  0.174757
4        Tuned XGBoost  0.800391   0.185841  0.677419  0.291667


**Evaluating the Metrics for Stroke Prediction:**

Accuracy: 
    This is the percentage of correct predictions. Higher accuracy is generally good, but it can be misleading on imbalanced datasets such as this one, 
    where stroke cases are far rarer than non-stroke cases.
    Random Forest and Gradient Boosting have very high accuracy (about 94%), but their precision, recall, and F1 score are all 0.0000, meaning they do 
    not correctly identify a single stroke case in the test set.

Precision: 
    This metric indicates how many predicted positive cases (stroke) are actually true positives. In healthcare, precision is important to avoid too many 
    false positives, which could lead to unnecessary interventions.
    XGBoost (0.2195) and Tuned XGBoost (0.1858) show better precision than the other models. Tuned XGBoost has slightly lower precision than untuned 
    XGBoost but still outperforms Logistic Regression (0.1469).

Recall: 
    This is critical in healthcare. Recall (sensitivity) measures the ability to correctly identify all actual stroke cases. Missing a stroke case (false 
    negative) could have severe consequences, so high recall is essential.
    Logistic Regression has the highest recall (0.839), followed by Tuned XGBoost (0.677); both are far ahead of untuned XGBoost (0.145), Random Forest, 
    and Gradient Boosting (both 0.000). Logistic Regression pays for its recall with weaker precision and F1 score.

F1 Score: 
    The F1 score is a balance between precision and recall. It is important for imbalanced datasets because it balances identifying true positives 
    (recall) with avoiding false positives (precision). In this case:
    Tuned XGBoost has the highest F1 score at 0.2917, indicating the best trade-off between precision and recall, which is important for stroke prediction.
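
As a check on the reported values, F1 can be recomputed from precision and recall with the general F-beta formula, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta = 1 gives F1. The sketch below also evaluates beta = 2, which weights recall more heavily; F2 is not used elsewhere in this notebook and is shown only as an assumption about how a recall-focused comparison might look.

```python
def f_beta(precision, recall, beta=1.0):
    """General F-beta score; beta=1 gives F1, beta>1 favours recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall copied from the comparison table
tuned_p, tuned_r = 0.185841, 0.677419
logreg_p, logreg_r = 0.146893, 0.838710

print(f"Tuned XGBoost       F1: {f_beta(tuned_p, tuned_r):.4f}")
print(f"Logistic Regression F1: {f_beta(logreg_p, logreg_r):.4f}")

# F2 is an illustrative extra, not a metric reported by the notebook
print(f"Tuned XGBoost       F2: {f_beta(tuned_p, tuned_r, beta=2):.4f}")
print(f"Logistic Regression F2: {f_beta(logreg_p, logreg_r, beta=2):.4f}")
```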

**Conclusion and Best Model Choice:**

Tuned XGBoost is the best choice for this stroke prediction task based on overall performance, especially its recall (0.677) and F1 score (0.2917). 

High recall ensures that the model identifies the majority of stroke cases, which is vital in healthcare to avoid missing high-risk patients.

Logistic Regression has even higher recall (0.839) but lower precision (0.147) and a lower F1 score (0.250), so it generates more false positives (patients incorrectly predicted to have a stroke).
    
XGBoost (untuned) performs worse than the tuned version, especially in recall.

**Why Tuned XGBoost is the Best Fit:**

Recall is high (0.677, second only to Logistic Regression's 0.839), which is critical in healthcare contexts (missing fewer stroke cases).

Its F1 score (0.2917) is the highest of the five models, so it offers the best balance of precision and recall on this imbalanced data.

Scalability: XGBoost is designed to handle large datasets and is efficient in terms of computational performance.

Interpretability: While not as interpretable as Logistic Regression, XGBoost offers feature importance insights, which can help explain what factors contribute most to stroke prediction.
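
One way to use the feature importance scores is to rank them and look at the top few. The helper below is a hypothetical sketch (`top_features` is not part of XGBoost or this notebook), and the numbers in the demo are made up for illustration rather than taken from the tuned model.

```python
def top_features(importances, feature_names, k=5):
    """Return the k (name, importance) pairs with the largest importance."""
    pairs = zip(feature_names, (float(v) for v in importances))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up names and importances for illustration only; NOT model results
example_names = ['feature_a', 'feature_b', 'feature_c', 'feature_d']
example_importances = [0.10, 0.40, 0.20, 0.30]

for name, score in top_features(example_importances, example_names, k=3):
    print(f"{name}: {score:.2f}")
```

In the notebook session this could be called as `top_features(best_xgb.feature_importances_, X_train_encoded.columns)`, assuming both objects are still in memory.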

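Tuning raised recall from 0.145 to 0.677 and F1 score from 0.175 to 0.292 compared with untuned XGBoost, at the cost of lower accuracy (0.917 to 0.800) and slightly lower precision (0.220 to 0.186).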

Thus, moving forward with Tuned XGBoost is the most reasonable choice.

In [148]:
# Saving the Tuned XGBoost Model:
import joblib

# Save the tuned XGBoost model to a file
joblib.dump(best_xgb, 'tuned_xgboost_model.pkl')

print("Tuned XGBoost model saved successfully!")

Tuned XGBoost model saved successfully!


**To load the saved model in a later session, run the following:**

    import joblib

    # Load the saved XGBoost model from the file
    loaded_xgb_model = joblib.load('tuned_xgboost_model.pkl')

    print("Tuned XGBoost model loaded successfully!")
