Let's install UMAP, a dimension reduction package we want to use to build a scatter plot.

In [1]:
!pip install --quiet umap-learn
print('UMAP installed.')

UMAP installed.


Let's load the data and convert our target variable to a boolean.

In [2]:
import pandas as pd

HEART = '/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv'

df = pd.read_csv(filepath_or_buffer=HEART)
df['output'] = df['output'] == 1

columns = df.drop(columns=['output']).columns
df.head()

   age  sex  cp  trtbps  chol  fbs  restecg  thalachh  exng  oldpeak  slp  caa  thall  output
0   63    1   3     145   233    1        0       150     0      2.3    0    0      1    True
1   37    1   2     130   250    0        1       187     0      3.5    0    0      2    True
2   41    0   1     130   204    0        0       172     0      1.4    2    0      2    True
3   56    1   1     120   236    0        1       178     0      0.8    2    0      2    True
4   57    0   0     120   354    0        1       163     1      0.6    2    0      2    True


Are the classes in our target variable balanced?

In [3]:
df['output'].value_counts().to_dict()

{True: 165, False: 138}
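
To put a number on the balance, here's a quick sketch that turns those counts into shares. It rebuilds a stand-in series from the counts printed above instead of reloading the CSV, so the name `output` is just for illustration.

```python
import pandas as pd

# Stand-in for df['output'], rebuilt from the counts printed above.
output = pd.Series([True] * 165 + [False] * 138, name='output')

# normalize=True returns each class's share of the rows instead of raw counts.
shares = output.value_counts(normalize=True).to_dict()
print({label: round(share, 3) for label, share in shares.items()})
```

That works out to roughly 54% positive and 46% negative, so the classes are close to balanced.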

Let's use dimension reduction to make a scatter plot. If the two classes form separate clusters, a classifier will probably do well; if they look randomly mixed, it will probably struggle.

In [4]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
df[['x', 'y']] = umap.fit_transform(X=df[columns])
print('done with UMAP in {}'.format(arrow.now() - time_start))

done with UMAP in 0:00:10.444664


In [5]:
from plotly import express
express.scatter(data_frame=df, x='x', y='y', color='output')

What do we see? Some local clustering, but also a lot of mixing: many points have nearest neighbors in the other class. Let's build a model.
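
Before we do, we can put a rough number on that mixing. The sketch below counts how often a point's nearest neighbor in a 2-D embedding belongs to the other class. It runs on tiny made-up points so it stands on its own; the helper name `neighbor_mismatch_rate` is ours, not part of UMAP or scikit-learn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def neighbor_mismatch_rate(points, labels):
    """Fraction of points whose nearest neighbor has a different label.

    Assumes there are no duplicate points, so each point's closest match
    (distance 0) is itself and the second match is its nearest neighbor.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    nn = NearestNeighbors(n_neighbors=2).fit(points)
    _, idx = nn.kneighbors(points)
    return float(np.mean(labels[idx[:, 1]] != labels))


# Toy data: two well-separated clusters, one per class.
separated = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
separated_labels = [False, False, False, True, True, True]

# Toy data: classes alternate along a line, so every neighbor is in the other class.
mixed = [[0, 0], [1, 0], [2, 0], [3, 0]]
mixed_labels = [False, True, False, True]

rate_separated = neighbor_mismatch_rate(separated, separated_labels)
rate_mixed = neighbor_mismatch_rate(mixed, mixed_labels)
print(rate_separated, rate_mixed)
```

With the real data you would pass `df[['x', 'y']]` and `df['output']`; a rate near 0 means clean clusters, and a rate near 0.5 means the classes are about as mixed as chance.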

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


# Hold out 20% for testing, keeping the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    df[columns], df['output'], test_size=0.2, random_state=2024, stratify=df['output'])

logreg = LogisticRegression(max_iter=1000, tol=1e-12).fit(X_train, y_train)
y_pred = logreg.predict(X=X_test)  # predict once and reuse for every metric
print('model fit in {} iterations'.format(logreg.n_iter_[0]))
print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=y_pred)))
print('f1: {:5.4f}'.format(f1_score(average='weighted', y_true=y_test, y_pred=y_pred, zero_division=0)))
print(classification_report(y_true=y_test, y_pred=y_pred, zero_division=0))

model fit in 838 iterations
accuracy: 0.8689
f1: 0.8689
              precision    recall  f1-score   support

       False       0.86      0.86      0.86        28
        True       0.88      0.88      0.88        33

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61
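
The report's numbers can be traced back to four counts. The sketch below uses a confusion matrix that we inferred from the supports and recalls above (24 of 28 negatives and 29 of 33 positives predicted correctly). The notebook never prints those counts, so treat them as an assumption that happens to reproduce the rounded report.

```python
import numpy as np

# Rows are true classes (False, True); columns are predicted classes (False, True).
# Counts are inferred from the report: supports 28 and 33, recalls 0.86 and 0.88.
cm = np.array([[24, 4],
               [4, 29]])

recall = np.diag(cm) / cm.sum(axis=1)     # correct / all samples of that true class
precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predictions of that class
accuracy = np.trace(cm) / cm.sum()

print('recall:   ', recall.round(2))
print('precision:', precision.round(2))
print('accuracy: {:5.4f}'.format(accuracy))
```

To get the actual matrix rather than an inferred one, `sklearn.metrics.confusion_matrix` takes the same `y_true` and `y_pred` arguments as the metrics above.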



Before we go, let's look at the regression coefficients.

In [7]:
# One bar per feature; a bar chart shows each coefficient directly.
express.bar(x=columns, y=logreg.coef_[0])

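What are the most important features? By coefficient size: sex, chest pain type (cp), exercise-induced angina (exng), and thall, which appears to be the thallium stress test result. Keep in mind that the features weren't scaled, so coefficient size is only a rough guide to importance.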
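
Raw coefficients depend on each feature's units, so a common fix is to standardize the features before fitting. Here's a minimal sketch on synthetic data; the two features, their scales, and the variable names are invented for illustration and have nothing to do with the heart dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2024)
n = 500

# Two synthetic features with equal influence on the label but very different scales.
small = rng.normal(0, 1, n)
large = rng.normal(0, 100, n)
y = (small + large / 100 + rng.normal(0, 0.5, n)) > 0
X = np.column_stack([small, large])

# Same model twice: once on raw features, once on standardized features.
raw = LogisticRegression(max_iter=1000).fit(X, y)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

raw_coef = raw.coef_[0]
scaled_coef = scaled[-1].coef_[0]  # last pipeline step is the logistic regression
print('raw coefficients:   ', raw_coef)
print('scaled coefficients:', scaled_coef)
```

On the raw features the large-scale feature gets a much smaller coefficient even though it matters just as much; after standardizing, the two coefficients come out comparable. The same pipeline could be fit on `X_train` and `y_train` to get a fairer ranking.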