Our data is pretty clean, so let's load it up and convert the target variable to a Boolean.

In [1]:
import pandas as pd

DATA = '/kaggle/input/heart-diseae/heart-disease.csv'
df = pd.read_csv(filepath_or_buffer=DATA)
df['target'] = df['target'] == 1
df.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target
0   63    1   3       145   233    1        0      150      0      2.3      0   0     1    True
1   37    1   2       130   250    0        1      187      0      3.5      0   0     2    True
2   41    0   1       130   204    0        0      172      0      1.4      2   0     2    True
3   56    1   1       120   236    0        1      178      0      0.8      2   0     2    True
4   57    0   0       120   354    0        1      163      1      0.6      2   0     2    True


Let's use UMAP to give our data x/y coordinates so we can make a scatter plot.

In [2]:
import arrow
from umap import UMAP

time_start = arrow.now()
umap = UMAP(random_state=2024, verbose=False, n_jobs=1, low_memory=False, n_epochs=201)
df[['x', 'y']] = umap.fit_transform(X=df.drop(columns=['target']))
print('done with UMAP in {}'.format(arrow.now() - time_start))

done with UMAP in 0:00:10.646632


In [3]:
from plotly import express

express.scatter(data_frame=df, x='x', y='y', color='target')

Plotted in UMAP coordinates, the data shows some local clustering, but the two classes overlap rather than forming well-separated groups. Let's build a model.

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.drop(columns=['target', 'x', 'y']), df['target'], test_size=0.2, random_state=2024, stratify=df['target'])

logreg = LogisticRegression(max_iter=100000, tol=1e-12).fit(X_train, y_train)
print('model fit in {} iterations'.format(logreg.n_iter_[0]))
print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=logreg.predict(X=X_test))))
print('f1: {:5.4f}'.format(f1_score(average='weighted', y_true=y_test, y_pred=logreg.predict(X=X_test))))
print(classification_report(y_true=y_test, y_pred=logreg.predict(X=X_test)))

model fit in 838 iterations
accuracy: 0.8689
f1: 0.8689
              precision    recall  f1-score   support

       False       0.86      0.86      0.86        28
        True       0.88      0.88      0.88        33

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61



What do our regression coefficients look like?

In [5]:
# One bar per feature; the bar height is that feature's coefficient
express.bar(x=X_train.columns.tolist(), y=logreg.coef_[0])

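The coefficient for sex stands out. Keep in mind that the features are not scaled, so the size of a coefficient depends on the units of its feature and is not a direct measure of importance.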
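
Since the model above is fit on unscaled features, one way to make coefficients comparable is to standardize the inputs first. The sketch below illustrates the effect on synthetic data, not on the heart disease data: the names flag and measurement, the effect sizes, and the sample size are all made up for illustration. It fits the same kind of model twice, once on raw features and once inside a pipeline with StandardScaler.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2024)
n = 2000

# Hypothetical features (not the heart disease data): a 0/1 flag and a
# measurement whose values are in the hundreds.
flag = rng.integers(0, 2, size=n)
measurement = rng.normal(loc=240.0, scale=50.0, size=n)
X = np.column_stack([flag, measurement])

# Synthetic labels: per standard deviation, the measurement has the stronger effect.
logit = 0.8 * (flag - 0.5) + 0.03 * (measurement - 240.0)
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

# Same model twice: once on raw features, once on standardized features.
raw = LogisticRegression(max_iter=100000).fit(X, y)
scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=100000)).fit(X, y)

raw_coef = raw.coef_[0]
scaled_coef = scaled.named_steps['logisticregression'].coef_[0]
print('raw coefficients:    flag {:7.4f}, measurement {:7.4f}'.format(*raw_coef))
print('scaled coefficients: flag {:7.4f}, measurement {:7.4f}'.format(*scaled_coef))
```

On raw features the 0/1 flag gets the larger coefficient simply because its units are small. After standardization the ordering flips to reflect the effect per standard deviation. Whether the same happens for sex in the heart disease data would need to be checked on that data.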

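A linear support vector classifier performs about the same. Accuracy is unchanged, and the F1 score for the True class rises only from 0.88 to 0.89. Note that the f1 line in the previous cell uses average='weighted' while the one below uses the default binary average, so those two printed values are not directly comparable.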

In [6]:
from sklearn.svm import SVC

svc = SVC(kernel='linear', C=0.025, random_state=2024, probability=True).fit(X=X_train, y=y_train)
print('accuracy: {:5.4f}'.format(accuracy_score(y_true=y_test, y_pred=svc.predict(X=X_test))))
print('f1: {:5.4f}'.format(f1_score(y_true=y_test, y_pred=svc.predict(X=X_test))))
print(classification_report(y_true=y_test, y_pred=svc.predict(X=X_test)))

accuracy: 0.8689
f1: 0.8857
              precision    recall  f1-score   support

       False       0.92      0.79      0.85        28
        True       0.84      0.94      0.89        33

    accuracy                           0.87        61
   macro avg       0.88      0.86      0.87        61
weighted avg       0.87      0.87      0.87        61
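
The logistic regression cell prints F1 with average='weighted', while the SVC cell prints the default binary F1, so it helps to compute both variants for both models. The sketch below does that in plain Python from confusion-matrix counts. The counts are an assumption: they are inferred from the rounded precision, recall, and support values in the two classification reports, not printed by the notebook.

```python
def f1(tp, fp, fn):
    """F1 for one class from its true positives, false positives and false negatives."""
    return 2 * tp / (2 * tp + fp + fn)


def summarize(tn, fp, fn, tp):
    """Return (binary F1 for the True class, support-weighted F1 over both classes)."""
    f1_true = f1(tp=tp, fp=fp, fn=fn)
    # For the False class the roles swap: its hits are tn, its false alarms
    # are fn, and its misses are fp.
    f1_false = f1(tp=tn, fp=fn, fn=fp)
    n_false, n_true = tn + fp, fn + tp
    weighted = (n_false * f1_false + n_true * f1_true) / (n_false + n_true)
    return f1_true, weighted


# Counts inferred from the rounded reports (an assumption, not notebook output)
logreg_counts = dict(tn=24, fp=4, fn=4, tp=29)
svc_counts = dict(tn=22, fp=6, fn=2, tp=31)

for name, counts in [('logreg', logreg_counts), ('svc', svc_counts)]:
    binary, weighted = summarize(**counts)
    print('{}: binary f1 {:5.4f}, weighted f1 {:5.4f}'.format(name, binary, weighted))
```

With these counts, the weighted F1 for logistic regression and the binary F1 for the SVC reproduce the printed 0.8689 and 0.8857. On a like-for-like basis the gap is small in either direction: the binary F1 goes from 0.8788 to 0.8857, while the weighted F1 goes from 0.8689 to 0.8676.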



Let's take a look at the model probabilities from our support vector classifier. To plot them in the same coordinates as the scatter plot above, we reuse the UMAP model we already fit and use it to transform just the test data.

In [7]:
probability_df = pd.DataFrame(data=svc.predict_proba(X=X_test).max(axis=1), columns=['probability'])
probability_df['true'] = y_test.tolist()
probability_df['pred'] = svc.predict(X=X_test)
probability_df['correct'] = probability_df['true'] == probability_df['pred']
probability_df[['x', 'y']] = umap.transform(X=X_test)

probability_df.head()

   probability   true   pred  correct          x         y
0     0.537049   True   True     True   8.931486  2.469853
1     0.898172   True   True     True   9.944916  6.960524
2     0.968316  False  False     True    2.12659  1.054348
3     0.859208  False  False     True    8.78213  3.158736
4     0.703697   True   True     True  12.430548  3.757477


How do our model probabilities look conditioned on whether the model is correct? Recall that we keep the larger of the two class probabilities for each sample, and since the two sum to 1, the values must lie between 0.5 and 1.0.

In [8]:
probability_df[['probability', 'correct']].groupby(by=['correct']).mean().to_dict()

{'probability': {False: 0.6782530094803161, True: 0.7864846867997911}}

When our model is correct, its mean probability is higher (about 0.79) than when it is incorrect (about 0.68), but both are surprisingly low.
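
The 0.5 floor on these probabilities comes from taking the larger of two values that sum to 1. The sketch below checks this on made-up two-class probabilities. The array proba is a hypothetical stand-in for the output of predict_proba, filled with random numbers rather than anything from the fitted classifier.

```python
import numpy as np

rng = np.random.default_rng(2024)

# Hypothetical two-class probabilities: each row is (P(False), P(True)) and sums to 1
p_true = rng.random(1000)
proba = np.column_stack([1.0 - p_true, p_true])

# Same reduction as in probability_df: keep the larger probability per row
confidence = proba.max(axis=1)
print('min {:5.4f}, max {:5.4f}'.format(confidence.min(), confidence.max()))
```

One caveat for the real model: scikit-learn documents that, for SVC with probability=True, predict_proba may be inconsistent with predict on some samples, so the largest probability does not always belong to the predicted label.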

In [9]:
express.scatter(data_frame=probability_df, x='x', y='y', color='probability', facet_col='correct', hover_name='true')

In [10]:
express.histogram(data_frame=probability_df, x='probability', facet_col='correct', hover_name='true', nbins=10, range_x=[0, 1], range_y=[0, 10])