# Title

Justin Lee

This notebook is prepared for the Ministry of Health and Family Welfare (MoHFW). The MoHFW aims to understand how ML models can identify characteristics associated with a high risk of heart attack, so that medical resources can be allocated accordingly. The findings can also inform public awareness campaigns and public health initiatives targeting key risk factors.

### Business Understanding

The dataset is based on Indian cardiovascular health statistics, medical research reports, and national surveys, incorporating data from:

- Indian Council of Medical Research (ICMR) – Reports on heart disease prevalence in India
- Ministry of Health & Family Welfare, Government of India – National health statistics
- World Health Organization (WHO) – India Reports – Cardiovascular disease risk factors
- National Family Health Survey (NFHS-5) – Demographic and health-related indicators
- Global Burden of Disease (GBD) Study – India-specific cardiovascular mortality rates
- Indian Heart Journal & AIIMS Research – Clinical insights on CVD trends in India

The dataset consists of 10,000 records. Each record corresponds to a patient ID and lists that patient's health characteristics, conditions, lifestyle choices, and related metrics, together with a label indicating whether the patient is at risk of a heart attack.

### Data Understanding

In [1]:
# Import relevant libraries
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [2]:
# Load in dataframe
df = pd.read_csv('heart_attack_prediction_india.csv')

df.head()

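| | Patient_ID | State_Name | Age | Gender | Diabetes | Hypertension | Obesity | Smoking | Alcohol_Consumption | Physical_Activity | ... | Diastolic_BP | Air_Pollution_Exposure | Family_History | Stress_Level | Healthcare_Access | Heart_Attack_History | Emergency_Response_Time | Annual_Income | Health_Insurance | Heart_Attack_Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Rajasthan | 42 | Female | 0 | 0 | 1 | 1 | 0 | 0 | ... | 119 | 1 | 0 | 4 | 0 | 0 | 157 | 611025 | 0 | 0 |
| 1 | 2 | Himachal Pradesh | 26 | Male | 0 | 0 | 0 | 0 | 1 | 1 | ... | 115 | 0 | 0 | 7 | 0 | 0 | 331 | 174527 | 0 | 0 |
| 2 | 3 | Assam | 78 | Male | 0 | 0 | 1 | 0 | 0 | 1 | ... | 117 | 0 | 1 | 10 | 1 | 0 | 186 | 1760112 | 1 | 0 |
| 3 | 4 | Odisha | 58 | Male | 1 | 0 | 1 | 0 | 0 | 1 | ... | 65 | 0 | 0 | 1 | 1 | 1 | 324 | 1398213 | 0 | 0 |
| 4 | 5 | Karnataka | 22 | Male | 0 | 0 | 0 | 0 | 0 | 1 | ... | 109 | 0 | 0 | 9 | 0 | 0 | 209 | 97987 | 0 | 1 |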


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 26 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Patient_ID               10000 non-null  int64 
 1   State_Name               10000 non-null  object
 2   Age                      10000 non-null  int64 
 3   Gender                   10000 non-null  object
 4   Diabetes                 10000 non-null  int64 
 5   Hypertension             10000 non-null  int64 
 6   Obesity                  10000 non-null  int64 
 7   Smoking                  10000 non-null  int64 
 8   Alcohol_Consumption      10000 non-null  int64 
 9   Physical_Activity        10000 non-null  int64 
 10  Diet_Score               10000 non-null  int64 
 11  Cholesterol_Level        10000 non-null  int64 
 12  Triglyceride_Level       10000 non-null  int64 
 13  LDL_Level                10000 non-null  int64 
 14  HDL_Level                10000 non-null

In [4]:
df.describe()

| | Patient_ID | Age | Diabetes | Hypertension | Obesity | Smoking | Alcohol_Consumption | Physical_Activity | Diet_Score | Cholesterol_Level | ... | Diastolic_BP | Air_Pollution_Exposure | Family_History | Stress_Level | Healthcare_Access | Heart_Attack_History | Emergency_Response_Time | Annual_Income | Health_Insurance | Heart_Attack_Risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | ... | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 5000.5 | 49.3949 | 0.0929 | 0.2469 | 0.3037 | 0.3014 | 0.3528 | 0.5958 | 5.0217 | 224.753 | ... | 89.312 | 0.4036 | 0.3113 | 5.5188 | 0.311 | 0.1525 | 206.3834 | 1022062.0 | 0.3447 | 0.3007 |
| std | 2886.89568 | 17.280301 | 0.290307 | 0.43123 | 0.459878 | 0.458889 | 0.477865 | 0.490761 | 3.156394 | 43.359172 | ... | 17.396486 | 0.490644 | 0.463048 | 2.866264 | 0.462926 | 0.359523 | 112.391711 | 560597.8 | 0.475294 | 0.458585 |
| min | 1.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 150.0 | ... | 60.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 10.0 | 50353.0 | 0.0 | 0.0 |
| 25% | 2500.75 | 35.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 187.0 | ... | 74.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 110.0 | 535783.8 | 0.0 | 0.0 |
| 50% | 5000.5 | 49.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 | 226.0 | ... | 89.0 | 0.0 | 0.0 | 6.0 | 0.0 | 0.0 | 206.0 | 1021383.0 | 0.0 | 0.0 |
| 75% | 7500.25 | 64.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 8.0 | 262.0 | ... | 104.0 | 1.0 | 1.0 | 8.0 | 1.0 | 0.0 | 304.0 | 1501670.0 | 1.0 | 1.0 |
| max | 10000.0 | 79.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 10.0 | 299.0 | ... | 119.0 | 1.0 | 1.0 | 10.0 | 1.0 | 1.0 | 399.0 | 1999714.0 | 1.0 | 1.0 |


In [5]:
# Check how many null values there are
df.isnull().sum()

Patient_ID                 0
State_Name                 0
Age                        0
Gender                     0
Diabetes                   0
Hypertension               0
Obesity                    0
Smoking                    0
Alcohol_Consumption        0
Physical_Activity          0
Diet_Score                 0
Cholesterol_Level          0
Triglyceride_Level         0
LDL_Level                  0
HDL_Level                  0
Systolic_BP                0
Diastolic_BP               0
Air_Pollution_Exposure     0
Family_History             0
Stress_Level               0
Healthcare_Access          0
Heart_Attack_History       0
Emergency_Response_Time    0
Annual_Income              0
Health_Insurance           0
Heart_Attack_Risk          0
dtype: int64

In [6]:
# Check class imbalance in our target column
df['Heart_Attack_Risk'].value_counts()

0    6993
1    3007
Name: Heart_Attack_Risk, dtype: int64

Our target variable is imbalanced, with roughly a 70/30 split (6,993 low-risk vs. 3,007 high-risk records). We will prepare the features, balance the classes with random undersampling, and then fit a logistic regression model.
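
As a quick check on the split quoted above, the class proportions can be recomputed from the `value_counts()` output. The sketch below also shows why accuracy alone is misleading on the unbalanced data: a model that always predicts the majority class would already score about 70%. The variable names are illustrative and not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 6993, 1: 3007}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most common class
majority_baseline_accuracy = max(proportions.values())

for label, share in proportions.items():
    print(f"class {label}: {share:.4f}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.4f}")
```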

### Data Preparation

In [7]:
# Define features (X) and the target variable (y)
# Dropping non-informative columns
X = df.drop(columns=["Heart_Attack_Risk", "Patient_ID", "State_Name"])
y = df["Heart_Attack_Risk"]

In [8]:
# 'Gender' is our only categorical variable left. We'll encode 'Gender' as binary (0 = Female, 1 = Male)
label_encoder = LabelEncoder()
X["Gender"] = label_encoder.fit_transform(X["Gender"])

# Verify encoding
X["Gender"].value_counts()

1    5516
0    4484
Name: Gender, dtype: int64

In [9]:
# Apply Random Undersampling to balance the dataset
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

# Check new class distribution
pd.Series(y_resampled).value_counts()

1    3007
0    3007
Name: Heart_Attack_Risk, dtype: int64
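
One point to consider, offered as a sketch rather than as the notebook's method: because undersampling is applied to the full dataset before the train/test split, the test set is also balanced and no longer reflects the original 70/30 distribution. A common alternative is to split first and undersample only the training portion. The example below illustrates that ordering on synthetic data. `undersample_majority` is a hypothetical NumPy helper standing in for `RandomUnderSampler`, and the arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def undersample_majority(X, y, seed=42):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]


# Synthetic stand-in with the same 70/30 imbalance as the notebook's target
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([0] * 700 + [1] * 300)

# Split first, so the test set keeps the original class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Balance only the training portion
X_train_bal, y_train_bal = undersample_majority(X_train, y_train)

print("balanced training counts:", np.bincount(y_train_bal))
print("untouched test counts:   ", np.bincount(y_test))
```

With this ordering, metrics computed on `X_test` describe performance on data distributed like the original population.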

### Logistic Regression Model

In [10]:
# Split into 70% training and 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    X_resampled, y_resampled, test_size=0.3, random_state=42, stratify=y_resampled)

# Check the shape of training and testing sets
X_train.shape, X_test.shape

((4209, 23), (1805, 23))

In [11]:
# Identify numeric columns that need scaling
numeric_columns = [
    "Age", "Diet_Score", "Cholesterol_Level", "Triglyceride_Level", "LDL_Level", "HDL_Level", 
    "Systolic_BP", "Diastolic_BP", "Stress_Level", "Emergency_Response_Time", "Annual_Income"
]

# Initialize the StandardScaler
scaler = StandardScaler()

# Copy the original data to keep the binary columns unchanged
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()

# Apply scaling only to the specified numeric columns
X_train_scaled[numeric_columns] = scaler.fit_transform(X_train[numeric_columns])
X_test_scaled[numeric_columns] = scaler.transform(X_test[numeric_columns])

# View summary of preprocessed data
X_train_scaled.head()

| | Age | Gender | Diabetes | Hypertension | Obesity | Smoking | Alcohol_Consumption | Physical_Activity | Diet_Score | Cholesterol_Level | ... | Systolic_BP | Diastolic_BP | Air_Pollution_Exposure | Family_History | Stress_Level | Healthcare_Access | Heart_Attack_History | Emergency_Response_Time | Annual_Income | Health_Insurance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1744 | -0.839303 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | -1.589428 | 1.415297 | ... | 1.62947 | 0.547908 | 0 | 1 | -0.175445 | 1 | 1 | -1.573181 | -0.870631 | 0 |
| 866 | 0.904802 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0.943187 | -0.331269 | ... | -0.872261 | 0.26028 | 0 | 0 | -0.522951 | 1 | 0 | -1.262888 | 0.949112 | 1 |
| 1143 | -0.548619 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0.62661 | 0.702882 | ... | 1.013659 | -0.545079 | 1 | 0 | 0.519566 | 0 | 0 | 0.315173 | -0.783374 | 0 |
| 2649 | -0.257935 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1.259764 | 0.197297 | ... | -1.218655 | 0.030177 | 0 | 1 | -0.522951 | 0 | 0 | -1.715029 | 1.480959 | 1 |
| 1040 | 1.13735 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0.943187 | 0.013448 | ... | -0.217962 | 0.030177 | 1 | 1 | 1.214578 | 0 | 1 | -0.402933 | -0.556504 | 0 |


In [12]:
# Initialize and train Logistic Regression model
log_reg = LogisticRegression(random_state=42)
log_reg.fit(X_train_scaled, y_train)

# Make Predictions
y_pred = log_reg.predict(X_test_scaled)

### Logistic Regression Evaluation

In [13]:
# Print evaluation results
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.4919667590027701
Confusion Matrix:
 [[419 484]
 [433 469]]
Classification Report:
               precision    recall  f1-score   support

           0       0.49      0.46      0.48       903
           1       0.49      0.52      0.51       902

    accuracy                           0.49      1805
   macro avg       0.49      0.49      0.49      1805
weighted avg       0.49      0.49      0.49      1805



Accuracy is around 49%, which is very close to random chance (50%) in a binary classification problem. This suggests that the model is not effectively distinguishing between high-risk and low-risk individuals.

We have 469 correctly predicted high-risk cases, 419 correctly predicted low-risk cases, 484 wrongly predicted high-risk cases that were actually low-risk, and 433 wrongly predicted low-risk cases that were actually high-risk.

For the high-risk class, precision is 0.49, meaning that about half of the patients the model flags as high-risk are actually low-risk. This is a problem if governments use the model for public health initiatives, since resources may be over-allocated to false high-risk cases. Recall is 0.52, meaning the model captures only 52% of actual high-risk cases and misses the other 48%. The F1-scores (0.48 for class 0 and 0.51 for class 1, macro average 0.49) are low for both classes, which reflects a poor balance between precision and recall.
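
The figures above can be re-derived by hand from the confusion matrix, which is a useful check on how each metric is defined. This is a minimal worked example using the four counts printed earlier, where rows are actual classes and columns are predicted classes. The variable names are illustrative.

```python
# Counts from the confusion matrix: [[TN, FP], [FN, TP]]
tn, fp = 419, 484
fn, tp = 433, 469

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)            # share of predicted high-risk that are truly high-risk
recall = tp / (tp + fn)               # share of actual high-risk cases that were caught
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)  # share of actual low-risk cases flagged as high-risk

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.2f}")
print(f"recall:    {recall:.2f}")
print(f"f1:        {f1:.2f}")
print(f"false positive rate: {false_positive_rate:.2f}")
```

Note that precision and the false positive rate are different quantities: the first is computed over predicted positives, the second over actual negatives.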

### Decision Tree Model

### Decision Tree Evaluation

### Conclusion

### Justin's Draft Notes
- started with logistic regression and random undersampling (10,000 rows reduced to 6,014, of which 4,209 are in the training set); only scaled the non-binary features
- pursue decision tree classifier
- am i on track in what i have so far?
- how to iteratively push to github from terminal
- additional tips on completing remainder of project - probing questions to think of?

Mark Notes
- Try doing one without RUS, and then one with (this is a baseline)
- Our accuracy would be overly confident without RUS
- Use F1 score and accuracy as north star, then precision and recall
- If there's no diff in F1 score, then use the model with RUS and then focus on accuracy. If not then stick with model with best F1 score
- Then can do decision tree with best model
- compare feature importance and coefficients between better baseline model and decision tree evaluation
- then it will give me best recs. if there are similarities between the two then it will tell me what features are the best to use

- correct on only scaling non binary features
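
The workflow in the notes above (a baseline with and without undersampling, then a decision tree, then a comparison of coefficients and feature importances) could be sketched roughly as follows. This is only an illustration under stated assumptions: the data come from `make_classification` rather than the heart attack CSV, `undersample` is a hypothetical NumPy helper standing in for `RandomUnderSampler`, and the feature names and `max_depth=4` are placeholders, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 8 features, roughly a 70/30 class split
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4, n_redundant=0,
    weights=[0.7, 0.3], random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)


def undersample(X, y, seed=42):
    """Keep a random subset of each class, sized to the rarest class."""
    rng = np.random.default_rng(seed)
    n_keep = np.bincount(y).min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in np.unique(y)
    ])
    return X[keep], y[keep]


def evaluate(model, X_fit, y_fit):
    """Fit on the given training data and score on the shared test set."""
    model.fit(X_fit, y_fit)
    pred = model.predict(X_test)
    return {"accuracy": accuracy_score(y_test, pred), "f1": f1_score(y_test, pred)}


X_bal, y_bal = undersample(X_train, y_train)

# Baselines: the same model with and without undersampling
results = {
    "logreg_no_rus": evaluate(LogisticRegression(max_iter=1000), X_train, y_train),
    "logreg_rus": evaluate(LogisticRegression(max_iter=1000), X_bal, y_bal),
}
for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})

# Rank features by coefficient magnitude and by tree importance
log_reg = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
tree = DecisionTreeClassifier(max_depth=4, random_state=42).fit(X_bal, y_bal)

coef_rank = np.argsort(-np.abs(log_reg.coef_[0]))
tree_rank = np.argsort(-tree.feature_importances_)
print("top by |coefficient|:   ", [feature_names[i] for i in coef_rank[:3]])
print("top by tree importance: ", [feature_names[i] for i in tree_rank[:3]])
```

Coefficient magnitudes are only comparable across features that share a scale, which is one reason the standardization step matters for this comparison. Features that rank highly in both lists are the more defensible candidates to report.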