Let's install UMAP, a dimension reduction package we want to use to build a scatter plot.

In [1]:
!pip install --quiet umap-learn
print('UMAP installed.')

UMAP installed.


Let's load up all the data and change the type of our target variable. Let's also do a tiny bit of feature engineering; we tried introducing dummies for several variables, and only introducing dummies for the thall variable improved the accuracy of our model.

In [2]:
import pandas as pd

HEART = '/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv'

df = pd.read_csv(filepath_or_buffer=HEART)
df['output'] = df['output'] == 1
df = pd.get_dummies(data=df, columns=['thall'])
columns = df.drop(columns=['output']).columns
df.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,output,thall_0,thall_1,thall_2,thall_3
0,63,1,3,145,233,1,0,150,0,2.3,0,0,True,False,True,False,False
1,37,1,2,130,250,0,1,187,0,3.5,0,0,True,False,False,True,False
2,41,0,1,130,204,0,0,172,0,1.4,2,0,True,False,False,True,False
3,56,1,1,120,236,0,1,178,0,0.8,2,0,True,False,False,True,False
4,57,0,0,120,354,0,1,163,1,0.6,2,0,True,False,False,True,False


Are the classes balanced in our target variable?

In [3]:
df['output'].value_counts(dropna=False).to_dict()

{True: 165, False: 138}

Let's use dimension reduction to make a scatter plot. If we see a lot of clustering then our model will probably do well; if our scatter plot looks kind of random our model will probably not do well.

In [4]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
df[['x', 'y']] = umap.fit_transform(X=df[columns])
print('done with UMAP in {}'.format(arrow.now() - time_start))

done with UMAP in 0:00:10.922132


In [5]:
from plotly import express
express.scatter(data_frame=df, x='x', y='y', color='output')

What do we see? We see some local clustering but we also see a lot of mixing; there are a lot of points that have nearest neighbors that are in the other class. Let's build a model.

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(df[columns], df['output'], test_size=0.2, random_state=2024, stratify=df['output'])

logreg = LogisticRegression(max_iter=10000, tol=1e-12).fit(X_train, y_train)
print('model fit in {} iterations'.format(logreg.n_iter_[0]))
print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=logreg.predict(X=X_test))))
print('f1: {:5.4f}'.format(f1_score(average='weighted', y_true=y_test, y_pred=logreg.predict(X=X_test), zero_division=0)))
print(classification_report(y_true=y_test, y_pred=logreg.predict(X=X_test), zero_division=0))

model fit in 1057 iterations
accuracy: 0.8852
f1: 0.8851
              precision    recall  f1-score   support

       False       0.89      0.86      0.87        28
        True       0.88      0.91      0.90        33

    accuracy                           0.89        61
   macro avg       0.89      0.88      0.88        61
weighted avg       0.89      0.89      0.89        61



Before we go let's look at the regression coefficients.

In [7]:
express.histogram(x=columns, y=logreg.coef_[0])

What are the most important features? Sex, chest pain, exercise-induced angina, and thall, whatever that is. We have broken thall out into four variables, and only two of them have large regression coefficients.

Let's look at the prediction probabilities from our model.

In [8]:
probability_df = pd.DataFrame(data=logreg.predict_proba(X=X_test), columns=['pr(False)', 'pr(True)'])
probability_df['prediction'] = logreg.predict(X=X_test)
probability_df['actual'] = y_test
probability_df['actual'] = probability_df['actual'] == True
probability_df.head()

Unnamed: 0,pr(False),pr(True),prediction,actual
0,0.370054,0.629946,True,False
1,0.023078,0.976922,True,False
2,0.992583,0.007417,False,False
3,0.893072,0.106928,False,True
4,0.033502,0.966498,True,True


Obviously pr(False) + pr(True) = 1, so we expect our data to fall on a line.

In [9]:
express.scatter(data_frame=probability_df, x='pr(False)', y='pr(True)', color='prediction')

Of course our model probabilities are nicely separated. How do they look if we color them according to the actual values?

In [10]:
express.scatter(data_frame=probability_df, x='pr(False)', y='pr(True)', color='actual')