# Abstract

Heart disease is a leading cause of death worldwide, and early detection and prevention can save lives. In this project, we analysed data from the 2020 annual CDC survey of about 400k adults on their health status, and applied several machine learning algorithms to predict the chance that an individual has heart disease. We first performed data cleaning and feature selection to prepare the data for modeling. We then applied several machine learning models, such as LDA, Random Forest, SVM, Decision Tree, XGBoost, and a Neural Network, to find the model with the highest sensitivity for the final prediction. The XGBoost model performed best in terms of sensitivity and was therefore selected as the final model for predicting heart disease. We also created a user interface linked to our XGBoost model, which allows users to input data about their personal health status and returns whether they require further diagnosis. This project highlights the importance of data preparation and model selection in achieving accurate results in the field of medical diagnosis.

# Introduction

## Background

Heart disease continues to be a major global health challenge, accounting for significant morbidity and mortality rates. In the United States, nearly half of the population is affected by at least one of the three primary risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Additional indicators, such as diabetic status, obesity (high BMI), inadequate physical activity, and excessive alcohol consumption, further contribute to the complexity of heart disease management. In recent years, advancements in computational power and the proliferation of large datasets have facilitated the application of machine learning techniques in healthcare. These techniques can process vast amounts of data to detect patterns and predict health outcomes, thus offering valuable insights for early identification and timely intervention of heart disease. With the potential to revolutionize clinical decision-making and patient care, machine learning applications in heart disease prediction are increasingly gaining traction among researchers and healthcare professionals. 

## Motivation

The motivation behind this study lies in the pressing need to improve early detection and prevention of heart disease, which can significantly reduce the associated health burden and improve patient outcomes. By harnessing the power of machine learning and utilizing the rich information available in large-scale health datasets, we aim to develop an effective predictive model that can identify individuals at high risk for heart disease. Such a model would enable healthcare providers to implement targeted interventions, promote healthier lifestyle choices, and facilitate more accurate decision-making in clinical settings. Furthermore, the comparative analysis of various classification algorithms in this study will provide valuable insights into the effectiveness of different machine learning techniques in heart disease prediction. Ultimately, the development of an accurate and robust predictive model has the potential to transform the landscape of heart disease management, enhancing the overall quality of healthcare and potentially saving countless lives.

# Data

Our dataset contains records from 401,958 participants, collected through telephone surveys about health conditions that may affect the heart. We inspected the data by finding the unique values of each variable. Variables with only two unique values are binary, and we encoded them as 1 and 0. For categorical variables such as age and race, we used one-hot encoding to turn each category into a binary dummy variable. For example, the original column "race" is encoded into per-category columns such as "Race_Asian", "Race_Black", etc. We randomly selected 20% of the dataset as a held-out test set and left it unmodified, so that the final evaluation is carried out on original data. We then applied the Synthetic Minority Over-sampling Technique (SMOTE) to the remaining data to address the class imbalance. SMOTE selects minority class samples and computes their k nearest neighbors; it then interpolates between a chosen minority class sample and one of its nearest neighbors, and adds the resulting synthetic samples to the original dataset. To limit overfitting, we chose the Borderline-SMOTE variant, which only oversamples minority samples that lie on the borderline between the minority and majority classes. This reduces the effect of noisy data and mitigates overfitting. After balancing the dataset, we fit a scaler on the resampled training data and applied the same scaling to the test data.
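The interpolation step described above can be illustrated with a minimal NumPy sketch. This is not the imbalanced-learn implementation; the helper `smote_like_sample`, its parameters, and the toy array `X_min` are hypothetical names introduced only for illustration, and the sketch ignores the borderline filtering step.

```python
import numpy as np

def smote_like_sample(X_minority, k=5, seed=None):
    """Create one synthetic point by interpolating between a random minority
    sample and one of its k nearest minority neighbours (illustrative helper)."""
    rng = np.random.default_rng(seed)
    base = X_minority[rng.integers(len(X_minority))]
    # Euclidean distance from the chosen sample to every minority sample
    dists = np.linalg.norm(X_minority - base, axis=1)
    # Position 0 of the sort is the sample itself (assuming no duplicate rows)
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = X_minority[rng.choice(neighbours)]
    gap = rng.random()  # interpolation weight in [0, 1)
    return base + gap * (neighbour - base), base, neighbour

# Toy minority-class samples with two features (made-up values)
X_min = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8],
                  [1.2, 1.9], [0.9, 1.4], [1.7, 1.6]])
synthetic, base, neighbour = smote_like_sample(X_min, k=3, seed=0)
print(base, neighbour, synthetic)
```

The synthetic point always lies on the line segment between the chosen sample and its neighbour, which is why SMOTE fills in the region already occupied by the minority class rather than creating points elsewhere.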

In [1]:
import numpy as np
import pandas as pd

data = pd.read_csv("heart_2020_cleaned.csv")

# Encode binary answers as 1/0; the two extra "Diabetic" answers are folded
# into 0 (borderline) and 1 (during pregnancy)
data_binary = data.replace({'Yes': 1, 'No': 0, 'Male': 1, 'Female': 0,
                            'No, borderline diabetes': 0,
                            'Yes (during pregnancy)': 1})
data_binary['Diabetic'] = data_binary['Diabetic'].astype(int)

# One-hot encode the remaining categorical (object-typed) columns
categorical_columns = [name for name in data_binary.columns
                       if data_binary[name].dtype == 'O']
data_dummy = pd.get_dummies(data=data_binary, columns=categorical_columns,
                            drop_first=False, dtype=int)
data_dummy.head(5)

Unnamed: 0,HeartDisease,BMI,Smoking,AlcoholDrinking,Stroke,PhysicalHealth,MentalHealth,DiffWalking,Sex,Diabetic,...,Race_Asian,Race_Black,Race_Hispanic,Race_Other,Race_White,GenHealth_Excellent,GenHealth_Fair,GenHealth_Good,GenHealth_Poor,GenHealth_Very good
0,0,16.6,1,0,0,3.0,30.0,0,0,1,...,0,0,0,0,1,0,0,0,0,1
1,0,20.34,0,0,1,0.0,0.0,0,0,0,...,0,0,0,0,1,0,0,0,0,1
2,0,26.58,1,0,0,20.0,30.0,0,1,1,...,0,0,0,0,1,0,1,0,0,0
3,0,24.21,0,0,0,0.0,0.0,0,0,0,...,0,0,0,0,1,0,0,1,0,0
4,0,23.71,0,0,0,28.0,0.0,1,0,0,...,0,0,0,0,1,0,0,0,0,1


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import BorderlineSMOTE

# Separate the features from the target (HeartDisease is the first column)
X_all = data_dummy.drop(columns=['HeartDisease'])
y_all = data_dummy['HeartDisease']

# Hold out 20% of the data as an untouched test set
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all,
                                                    test_size=0.2,
                                                    random_state=42)

train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

print('Before SMOTE:', X_train.shape, y_train.shape)

# Oversample the minority class on the training split only
smote = BorderlineSMOTE(random_state=1)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

# Training data after applying Borderline-SMOTE
train_resampled = pd.concat([X_resampled, y_resampled], axis=1)

print(y_resampled.value_counts())

# Fit the scaler on the resampled training data, then apply it to the test data
scaler = StandardScaler()
X_train_scale = pd.DataFrame(scaler.fit_transform(X_resampled),
                             columns=X_resampled.columns)
# Keep the original test index so the rows line up with y_test
X_test_scale = pd.DataFrame(scaler.transform(X_test),
                            columns=X_test.columns, index=X_test.index)
train_data_scale = pd.concat([y_resampled, X_train_scale], axis=1)
test_data_scale = pd.concat([y_test, X_test_scale], axis=1)


# Feature Selection

## Correlation Heatmap

We use a correlation heatmap as a tool for feature selection. Its primary goal is to visualize the pairwise relationships between the features in our dataset. This visualization allows us to identify highly correlated variables, which can lead to multicollinearity in the model and degrade its performance. With the heatmap we can easily spot highly correlated features and then remove or combine redundant ones, improving the efficiency of the model and reducing the risk of overfitting. In our dataset, none of the variables are highly correlated. All the correlation values, including the correlations with our feature variables, are below 0.3, indicating weak relationships. We therefore kept all the features, and hypothesized that no single variable would have a significant impact on the outcome.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(12, 10))
# numeric_only=True skips the categorical columns that are still strings here
sns.heatmap(data_binary.corr(numeric_only=True), cmap='coolwarm', center=0, annot=True)
ax.set_title('Correlation Heatmap', fontsize=20)
ax.tick_params(labelsize=12)
plt.xticks(rotation=45)
plt.show()
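As a complement to reading the heatmap by eye, the same check can be done programmatically by listing every feature pair whose absolute correlation reaches a chosen threshold. The sketch below is illustrative only: the helper `correlated_pairs`, the toy DataFrame, and the threshold value are assumptions, not part of the project code.

```python
import pandas as pd

def correlated_pairs(df, threshold=0.3):
    """Return (feature_a, feature_b, r) for every pair of numeric columns
    whose absolute Pearson correlation is at least `threshold`."""
    corr = df.select_dtypes("number").corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no self-pairs
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# Toy data: "b" is nearly 2 * "a", while "c" alternates and is weakly related
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9, 12.1],
    "c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})
pairs = correlated_pairs(toy, threshold=0.8)
print(pairs)
```

Applied to the project data, an empty list at a threshold of 0.3 would correspond to the observation that no pair of variables is strongly correlated.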

## Principal Component Analysis (PCA)

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_train_scale)

explained_variance_ratio = pca.explained_variance_ratio_

# Reuse the projection computed above instead of refitting PCA twice
PC1 = X_pca[:, 0]
PC2 = X_pca[:, 1]
ldngs = pca.components_

scalePC1 = 1.0/(PC1.max() - PC1.min())
scalePC2 = 1.0/(PC2.max() - PC2.min())
features = X_train.columns

fig, ax = plt.subplots(figsize=(14, 9))

for i, feature in enumerate(features):
    ax.arrow(0, 0, ldngs[0, i], 
             ldngs[1, i], 
             head_width=0.03, 
             head_length=0.03)
    ax.text(ldngs[0, i] * 1.15, 
            ldngs[1, i] * 1.15, 
            feature, fontsize = 18)

scatter = ax.scatter(PC1 * scalePC1, 
                     PC2 * scalePC2, 
                     c=y_resampled, 
                     cmap='viridis')

ax.set_xlabel('PC1', fontsize=20)
ax.set_ylabel('PC2', fontsize=20)
ax.set_title('Figure 1', fontsize=20)

legend1 = ax.legend(*scatter.legend_elements(),
                    loc="lower left", 
                    title="Groups")
ax.add_artist(legend1)

We then utilized PCA to analyse the data. The points from both classes (1 and 0) are widely spread along the first two principal components, indicating that the variables carry a large amount of variance. The loading plot shows that "Race_White" and "Physical Activity" have the longest vectors, indicating a strong influence on the principal components. We therefore pay closer attention to these two variables.
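The "longest vector" reading can also be checked numerically: the length of each feature's arrow in the PC1-PC2 plane is the norm of its two loadings. The following is a minimal sketch on synthetic data, using an SVD in place of scikit-learn's `PCA`; the feature names `f0`, `f1`, `f2` and the data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x0 = rng.normal(size=n)
# Two strongly related features plus one independent feature (synthetic)
X = np.column_stack([x0, x0 + 0.1 * rng.normal(size=n), rng.normal(size=n)])
feature_names = ["f0", "f1", "f2"]  # hypothetical names

# Standardise each column, as done before PCA in the notebook
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD: the rows of Vt are the principal axes
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)
loadings = Vt[:2]  # shape (2, n_features), analogous to pca.components_
explained_variance_ratio = S ** 2 / np.sum(S ** 2)

# Arrow length of each feature in the PC1-PC2 plane
arrow_lengths = np.linalg.norm(loadings, axis=0)
ranking = sorted(zip(feature_names, arrow_lengths), key=lambda p: -p[1])

print("Explained variance ratio:", explained_variance_ratio[:2])
for name, length in ranking:
    print(f"{name}: {length:.3f}")
```

Because the principal axes are orthonormal, each arrow length is at most 1; a value close to 1 means the feature is almost entirely captured by the first two components. The explained variance ratio indicates how much weight such a two-component reading deserves.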

## L1 Regularization

In [None]:
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error

# Fit a single Lasso model to the training data
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X_train, y_train)

# Mean squared error of the predictions on the test data
y_pred = lasso_reg.predict(X_test)
mse = mean_squared_error(y_test, y_pred)

# Refit over a grid of alpha values and store the coefficients for each one
alphas = np.logspace(-4, 1, 50)
coefs = []
for a in alphas:
    lasso_reg = Lasso(alpha=a)
    lasso_reg.fit(X_train, y_train)
    coefs.append(lasso_reg.coef_)

# Names of the features, used for the table and the legend
feature_names = X_train.columns

# Coefficient table: one row per feature, one column per alpha value
df = pd.DataFrame(coefs, index=alphas, columns=feature_names).transpose()

# Plot each feature's coefficient as a function of alpha
plt.figure(figsize=(10, 6))
ax = plt.gca()
for name in feature_names:
    ax.plot(alphas, df.loc[name], label=name)

ax.set_xscale('log')
ax.set_xlabel('alpha')
ax.set_ylabel('coefficients')
ax.set_title('L1 Regularization Path')

# Add a legend to the plot
plt.legend(loc='best')

plt.axis('tight')
plt.show()