<h2>Predicting Heart Attack Risk</h2>
<h3>Notebook 6: Final Model Random Search</h3>
<p><b>Author: Nikhar Bhavsar</b></p>

<hr>

- Selected **Random Forest** as the final model.
- Trained on the full training set using the best parameters from the search.
- Saved the fitted pipeline and the feature importances:
  - `random_forest_heart_attack_model.pkl`
  - `top_10_features.pkl`
  - `all_feature_importances.csv`

### Table of Contents
1. [Importing Libraries](#importing-libraries)
2. [Loading Data](#loading-data)
3. [Train Model](#train-model)
4. [Conclusion](#conclusion)
5. [Limitations](#limitations)

### Importing Libraries

In [30]:
import sys
import os
sys.path.append(os.path.abspath('../utilities'))
import global_utils
import pre_processing_utils
import model_training_utils

In [31]:
import numpy as np
import pandas as pd

# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder, FunctionTransformer
)
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models and evaluation metrics
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, precision_recall_curve, auc, classification_report
)

# Resampling for class imbalance
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN

# Model persistence: load the previously saved preprocessor
import joblib as jb
preprocessor = jb.load('../models/preprocessor.pkl')

### Loading Data

In [33]:
patient_health_train = global_utils.import_csv('./../data/test_train/heart_attack_train.csv')
patient_health_test = global_utils.import_csv('./../data/test_train/heart_attack_test.csv')
heart_attack_status_train = global_utils.import_csv('./../data/test_train/heart_attack_train_target.csv')
heart_attack_status_test = global_utils.import_csv('./../data/test_train/heart_attack_test_target.csv')

global_utils.define_df_settings()

Let's look at the columns present in our dataset.

In [35]:
patient_health_train.head(10)

Unnamed: 0,State,Sex,GeneralHealth,PhysicalActivities,SleepHours,HadAngina,HadStroke,HadCOPD,HadKidneyDisease,HadArthritis,HadDiabetes,SmokerStatus,ChestScan,RaceEthnicityCategory,AgeCategory,HeightInMeters,WeightInKilograms,BMI,AlcoholDrinkers
0,Ohio,Male,Very good,No,7.0,No,No,No,No,Yes,No,Former smoker,Yes,White,45-49,1.88,95.25,26.96,Yes
1,Wisconsin,Female,Fair,Yes,8.0,No,No,No,Yes,No,Yes,Never smoked,Yes,White,55-59,1.68,79.38,28.25,Yes
2,South Dakota,Male,Good,Yes,7.0,No,No,No,No,Yes,No,Former smoker,Yes,White,65-69,1.8,95.25,29.29,Yes
3,Idaho,Male,Very good,Yes,7.0,No,No,No,No,No,No,Never smoked,No,White,18-24,1.73,63.5,21.29,No
4,Indiana,Male,Poor,No,7.0,No,No,No,No,Yes,Yes,Former smoker,No,Hispanic,60-64,1.5,53.07,23.63,Yes
5,Virginia,Female,Very good,Yes,7.0,No,No,No,No,No,No,Never smoked,Yes,White,60-64,1.68,56.7,20.18,Yes
6,Michigan,Male,Very good,Yes,7.0,No,No,No,No,Yes,No,Former smoker,Yes,White,45-49,1.85,108.86,31.66,Yes
7,Florida,Female,Good,Yes,7.0,No,No,No,No,No,Yes,Never smoked,Yes,White,75-79,1.57,41.28,16.64,No
8,Maryland,Male,Excellent,Yes,7.0,No,No,No,No,Yes,No,Never smoked,No,White,65-69,1.78,83.91,26.54,Yes
9,Colorado,Female,Fair,Yes,6.0,No,Yes,No,No,Yes,No,Former smoker,No,Hispanic,55-59,1.65,66.22,24.3,No
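
The `Unnamed: 0` column in the preview is what pandas typically produces when a CSV was written together with its index and then read back without declaring that index. How `global_utils.import_csv` reads the file is not shown in this notebook, so the sketch below uses plain `pd.read_csv` on a small in-memory sample (an assumption for illustration) to show the effect and one way to avoid it.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a CSV that was saved with its index (hypothetical sample).
csv_text = ",State,Sex,SleepHours\n0,Ohio,Male,7.0\n1,Wisconsin,Female,8.0\n"

# Default read: the unnamed first column turns into a feature called "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index keeps it out of the feature set.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Whether the extra column matters here depends on how the saved preprocessor selects its input columns; if it only picks named features, the stray index is simply ignored.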


In [36]:
heart_attack_status_train.shape

(196599, 1)

### Train Model

As we saw in the last notebook, GridSearch selected RandomForest as the best model, along with its best parameters. Let's train our RandomForest model with those parameters.

In [37]:
best_model = RandomForestClassifier(
    max_depth=5,
    min_samples_split=5,
    n_estimators=200,
    random_state=42
)

resampler = SMOTE(random_state=42)

# Final pipeline
pipeline = ImbPipeline(steps=[
    ('preprocess', preprocessor),
    ('resample', resampler),
    ('clf', best_model)
])

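# Train the model; flatten the single-column target to 1-D, the shape scikit-learn expects
pipeline.fit(patient_health_train, heart_attack_status_train.values.ravel())

# Predict on the training data itself (in-sample, since we train on the full training set)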
heart_attack_status_proba = pipeline.predict_proba(patient_health_train)[:, 1]
heart_attack_status_pred = pipeline.predict(patient_health_train)

# Metrics
accuracy = accuracy_score(heart_attack_status_train, heart_attack_status_pred)
precision = precision_score(heart_attack_status_train, heart_attack_status_pred)
recall = recall_score(heart_attack_status_train, heart_attack_status_pred)
f1 = f1_score(heart_attack_status_train, heart_attack_status_pred)
roc = roc_auc_score(heart_attack_status_train, heart_attack_status_proba)

precisions, recalls, _ = precision_recall_curve(heart_attack_status_train, heart_attack_status_proba)
pr_auc = auc(recalls, precisions)

print("\n========= Evaluation on Full Training Data =========")
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
print(f"ROC AUC:   {roc:.3f}")
print(f"PR AUC:    {pr_auc:.3f}")
print("\nClassification Report:\n")
print(classification_report(heart_attack_status_train, heart_attack_status_pred))

# Save the model
jb.dump(pipeline, "../models/random_forest_heart_attack_model.pkl")
print("\nModel saved as 'random_forest_heart_attack_model.pkl'")




Accuracy:  0.859
Precision: 0.236
Recall:    0.705
F1 Score:  0.354
ROC AUC:   0.879
PR AUC:    0.387

Classification Report:

              precision    recall  f1-score   support

           0       0.98      0.87      0.92    185861
           1       0.24      0.70      0.35     10738

    accuracy                           0.86    196599
   macro avg       0.61      0.79      0.64    196599
weighted avg       0.94      0.86      0.89    196599


Model saved as 'random_forest_heart_attack_model.pkl'
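
The scores above are computed on the same rows the model was fitted on. A held-out comparison could look like the sketch below. It is self-contained and runs on synthetic data from `make_classification`; the `evaluate` helper is a hypothetical name, and `class_weight="balanced"` is swapped in for SMOTE only to keep the example free of the `imblearn` dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, X, y):
    """Return recall, ROC AUC and average precision for a fitted binary classifier."""
    proba = model.predict_proba(X)[:, 1]
    pred = model.predict(X)
    return {
        "recall": recall_score(y, pred),
        "roc_auc": roc_auc_score(y, proba),
        "avg_precision": average_precision_score(y, proba),
    }


# Synthetic, imbalanced stand-in for the notebook's data (about 5% positives).
X, y = make_classification(
    n_samples=4000, n_features=20, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# class_weight="balanced" is an assumption standing in for the SMOTE step.
model = RandomForestClassifier(
    n_estimators=50, max_depth=5, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)

train_scores = evaluate(model, X_train, y_train)
test_scores = evaluate(model, X_test, y_test)
for name in train_scores:
    print(f"{name:>14}: train={train_scores[name]:.3f}  test={test_scores[name]:.3f}")
```

In this notebook the same helper could presumably be called as `evaluate(pipeline, patient_health_test, heart_attack_status_test.values.ravel())`, assuming the test target is a single-column DataFrame like the training target.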


In [38]:
# Read importances from the classifier step of the fitted pipeline
importances = pipeline.named_steps['clf'].feature_importances_
feature_names = model_training_utils.get_feature_names_from_column_transformer(preprocessor)

# Create a DataFrame and sort by importance
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Save top 10 features to a list
top_features = importance_df.head(10)['Feature'].tolist()

# Save the top 10 features to a .pkl file
jb.dump(top_features, "../models/top_10_features.pkl")

# Optional: Also save full importances
importance_df.to_csv("../models/all_feature_importances.csv", index=False)

print("Top 10 features saved to top_10_features.pkl")
print("Full feature importances saved to all_feature_importances.csv")

Top 10 features saved to top_10_features.pkl
Full feature importances saved to all_feature_importances.csv
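
Before relying on the saved artifacts, it can be worth checking that a reloaded pipeline behaves like the one in memory. The sketch below shows such a round trip on a small synthetic pipeline written to a temporary directory. The data, feature names and step contents are stand-ins, not the notebook's actual preprocessor or files.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in for the real training data (hypothetical feature names).
rng = np.random.default_rng(0)
feature_names = ["f0", "f1", "f2"]
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

pipe = Pipeline([
    ("preprocess", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=20, max_depth=3, random_state=42)),
]).fit(X, y)

# Save and reload inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The reloaded pipeline should reproduce the original probabilities.
same_predictions = np.allclose(pipe.predict_proba(X), restored.predict_proba(X))

# Importances are read from the classifier step and paired with feature names.
importances = restored.named_steps["clf"].feature_importances_
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)

print(same_predictions)
print(ranked)
```

Pickled scikit-learn objects are generally only safe to reload with the same library versions that wrote them, so recording those versions next to the `.pkl` files is a reasonable precaution.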


### Conclusion

Our final trained model is a Random Forest Classifier, trained on a highly imbalanced health dataset. The model was specifically optimized to achieve high recall for the positive class (heart attack), in order to minimize the risk of missing actual cases, which is critical in a healthcare setting.

On the full training data, the model achieved the following performance:

<b>Accuracy:</b> 85.9%

<b>Recall (Heart Attack):</b> 70.5%

<b>Precision (Heart Attack):</b> 23.6%

<b>F1 Score (Heart Attack):</b> 0.354

<b>ROC AUC:</b> 0.879

<b>PR AUC:</b> 0.387

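This indicates that the model captures most positive cases, at the cost of relatively low precision: many of the patients it flags are false positives. Because these figures are computed on the data the model was trained on, they are likely optimistic compared with performance on unseen patients.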

### Limitations

<b>Low Precision (23.6%):</b> Although recall is high, precision remains low: roughly three out of every four patients the model flags did not have a heart attack. In real-world use this could lead to over-testing or unnecessary follow-ups.
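
To make the false-alarm load concrete, approximate confusion-matrix counts can be backed out of the figures in the classification report. This is a rough reconstruction from rounded metrics, so the counts are estimates rather than the exact values the model produced.

```python
# Figures copied from the classification report for the training data.
positives = 10738    # support for class 1
negatives = 185861   # support for class 0
recall = 0.705
precision = 0.236

# recall = TP / positives and precision = TP / (TP + FP)
tp = recall * positives
fn = positives - tp
flagged = tp / precision   # everything predicted positive
fp = flagged - tp
tn = negatives - fp

print(f"TP ~ {tp:,.0f}, FN ~ {fn:,.0f}, FP ~ {fp:,.0f}, TN ~ {tn:,.0f}")
print(f"False alarms per correctly flagged case ~ {fp / tp:.1f}")
```

As a consistency check, the implied `tn / negatives` rounds to the 0.87 class-0 recall shown in the report.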

<b>Imbalanced Dataset:</b> The dataset is heavily skewed toward the negative class: only 10,738 of the 196,599 training records (about 5.5%) are positive. Although resampling and threshold tuning were applied, this imbalance inherently limits precision.
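
The link between imbalance and precision can be made explicit. For a fixed true positive rate and false positive rate, precision depends directly on how common the positive class is. The sketch below plugs in the rounded rates from the report (a false positive rate of 0.13 is inferred from the 0.87 class-0 recall) and compares the observed prevalence with two hypothetical, more balanced ones.

```python
def precision_at(prevalence, tpr, fpr):
    """Precision implied by class prevalence and a classifier's TPR and FPR."""
    true_pos = prevalence * tpr
    false_pos = (1.0 - prevalence) * fpr
    return true_pos / (true_pos + false_pos)


prevalence = 10738 / 196599   # share of positive records in the training data
tpr = 0.705                   # recall for class 1
fpr = 0.13                    # 1 - recall for class 0 (rounded, inferred)

# The 0.20 and 0.50 prevalences are hypothetical comparison points.
for p in (prevalence, 0.20, 0.50):
    print(f"prevalence={p:.3f} -> precision={precision_at(p, tpr, fpr):.3f}")
```

With the same error rates, precision at the observed prevalence lands close to the reported 23.6%, while a balanced population would yield a much higher value. This suggests the low precision reflects the rarity of positive cases at least as much as the classifier itself.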