In [1]:
"""@author Okorie Ndubuisi February 2025"""
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
sns.set()

#### Dataset
- This dataset is available on Kaggle: [Heart Failure Prediction Dataset](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction)

#### Context
- Cardiovascular diseases (CVDs) are the number one cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs.
- People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management.  
- This dataset contains 11 features that can be used to predict possible heart disease.
- Let's train a machine learning model to assist with diagnosing this disease.

#### Attribute Information
- Age: age of the patient [years]
- Sex: sex of the patient [M: Male, F: Female]
- ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
- RestingBP: resting blood pressure [mm Hg]
- Cholesterol: serum cholesterol [mg/dl]
- FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]
- RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]
- MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]
- ExerciseAngina: exercise-induced angina [Y: Yes, N: No]
- Oldpeak: ST depression induced by exercise relative to rest [numeric value]
- ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]
- HeartDisease: output class [1: heart disease, 0: Normal]

In [2]:
heart = pd.read_csv('data/heart.csv')

In [3]:
heart.head()

,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [4]:
cat_variables = ['Sex',
'ChestPainType',
'RestingECG',
'ExerciseAngina',
'ST_Slope'
]

In [5]:
heart = pd.get_dummies(data = heart, prefix = cat_variables, columns = cat_variables, dtype=int)

In [6]:
heart.head()

,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease,Sex_F,Sex_M,ChestPainType_ASY,...,ChestPainType_NAP,ChestPainType_TA,RestingECG_LVH,RestingECG_Normal,RestingECG_ST,ExerciseAngina_N,ExerciseAngina_Y,ST_Slope_Down,ST_Slope_Flat,ST_Slope_Up
0,40,140,289,0,172,0.0,0,0,1,0,...,0,0,0,1,0,1,0,0,0,1
1,49,160,180,0,156,1.0,1,1,0,0,...,1,0,0,1,0,1,0,0,1,0
2,37,130,283,0,98,0.0,0,0,1,0,...,0,0,0,0,1,1,0,0,0,1
3,48,138,214,0,108,1.5,1,1,0,1,...,0,0,0,1,0,0,1,0,1,0
4,54,150,195,0,122,0.0,0,0,1,0,...,1,0,0,1,0,1,0,0,0,1
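
To make the one-hot encoding step concrete, here is a minimal sketch on a toy frame. The values are made up for illustration and the names `toy`, `encoded` and `reduced` are hypothetical. It also shows the optional `drop_first=True` argument, which the notebook does not use: it drops one level per variable, which some models prefer because the remaining columns already determine the dropped one.

```python
import pandas as pd

# Toy frame (made-up values) with one numeric and one categorical column.
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "ST_Slope": ["Up", "Flat", "Up"],
})

# Each category becomes its own 0/1 column named <column>_<category>.
encoded = pd.get_dummies(toy, columns=["ST_Slope"], dtype=int)
print(encoded.columns.tolist())

# Optional variant: drop the first level of each encoded variable.
reduced = pd.get_dummies(toy, columns=["ST_Slope"], drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

For a distance-based model such as k-nearest neighbours, keeping every level (as the notebook does) is a reasonable choice.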


In [7]:
heart.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 21 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Age                918 non-null    int64  
 1   RestingBP          918 non-null    int64  
 2   Cholesterol        918 non-null    int64  
 3   FastingBS          918 non-null    int64  
 4   MaxHR              918 non-null    int64  
 5   Oldpeak            918 non-null    float64
 6   HeartDisease       918 non-null    int64  
 7   Sex_F              918 non-null    int32  
 8   Sex_M              918 non-null    int32  
 9   ChestPainType_ASY  918 non-null    int32  
 10  ChestPainType_ATA  918 non-null    int32  
 11  ChestPainType_NAP  918 non-null    int32  
 12  ChestPainType_TA   918 non-null    int32  
 13  RestingECG_LVH     918 non-null    int32  
 14  RestingECG_Normal  918 non-null    int32  
 15  RestingECG_ST      918 non-null    int32  
 16  ExerciseAngina_N   918 non

Let us split the data into training and test sets: the first 800 rows for training and the remaining 118 rows for testing.

In [8]:
target = np.array(heart['HeartDisease'])
Y_train_target = target[:800]
target.shape

(918,)

In [10]:
Y_train_target.shape

(800,)

In [11]:
features = heart.drop('HeartDisease', axis=1)
features = np.array(features)
X_train_features = features[:800,:]
features.shape

(918, 20)

In [12]:
X_train_features.shape

(800, 20)

In [15]:
test_features = features[800:, :]
test_target = target[800:]
print(test_features.shape)
print(test_target.shape)

(118, 20)
(118,)
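
The slices above keep the rows in file order. If the file happens to be sorted in some way, the training and test parts may not look alike. As a hedged alternative (not what this notebook does), the sketch below uses `train_test_split` to shuffle the rows and keep the class ratio similar in both parts. The arrays `demo_features` and `demo_target` are random stand-ins with the same shapes as `features` and `target`, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same shapes as `features` and `target`
# in the notebook (918 rows, 20 columns); the values are random.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(918, 20))
demo_target = rng.integers(0, 2, size=918)

# Shuffle before splitting and stratify on the label.
# test_size=118 mirrors the 800/118 split used above.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_features,
    demo_target,
    test_size=118,
    stratify=demo_target,
    random_state=42,
)
print(X_tr.shape, X_te.shape)
```

Fixing `random_state` makes the split reproducible between runs.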


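Import the machine learning libraries. We start by standardising the data set, because k-nearest neighbours relies on distances between points and is therefore sensitive to the scale of each feature.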

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

In [16]:
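scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
scaled_X_train_features = scaler.fit_transform(X_train_features)
# Reuse the training statistics on the test data to avoid information leakage
scaled_X_test = scaler.transform(test_features)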
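
As a small illustration of what the scaling step computes, the sketch below standardises a toy array by hand and compares the result with `StandardScaler`. The numbers are made up, and `train`, `test` and `demo_scaler` are hypothetical names. The test row is scaled with the mean and standard deviation of the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up numbers standing in for two feature columns.
train = np.array([[40.0, 140.0], [49.0, 160.0], [37.0, 130.0], [54.0, 150.0]])
test = np.array([[48.0, 138.0]])

# Manual standardisation: statistics come from the training rows only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)
manual_train = (train - mean) / std
manual_test = (test - mean) / std  # test rows reuse the training mean and std

# The same computation via scikit-learn.
demo_scaler = StandardScaler().fit(train)
sk_train = demo_scaler.transform(train)
sk_test = demo_scaler.transform(test)

print(np.allclose(manual_train, sk_train), np.allclose(manual_test, sk_test))
```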

In [18]:
knn = KNeighborsClassifier(n_neighbors = 5)

In [19]:
knn.fit(scaled_X_train_features, Y_train_target)
knn_predictions = knn.predict(scaled_X_test)
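
The value `n_neighbors = 5` is also the scikit-learn default. One possible way to compare several values of k is cross-validation on the training data, sketched below. This is an illustrative assumption rather than part of the notebook: the data come from `make_classification` as a stand-in for the training set, and the grid `candidate_ks` is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training data (800 rows, 20 features).
X_demo, y_demo = make_classification(n_samples=800, n_features=20, random_state=0)

candidate_ks = [1, 3, 5, 7, 9, 11]  # arbitrary illustrative grid
cv_scores = {}
for k in candidate_ks:
    # The scaler sits inside the pipeline, so each fold is scaled
    # with statistics from its own training portion.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(cv_scores, key=cv_scores.get)
print(best_k, round(cv_scores[best_k], 3))
```

Selecting k this way keeps the held-out test rows untouched until the final evaluation.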

Let us check the accuracy of our model and the number of misclassified points

In [23]:
accuracy = (len(np.where(knn_predictions == test_target)[0]) / len(test_target))*100
print('Accuracy:', accuracy)
print('Number of misclassified points:', len(np.where(knn_predictions != test_target)[0]))

Accuracy: 80.50847457627118
Number of misclassified points: 23
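
Accuracy alone does not say which kind of mistake the model makes. As a minimal sketch, assuming made-up labels in place of `test_target` and `knn_predictions`, the same accuracy and misclassification count can be obtained with `accuracy_score`, and `confusion_matrix` separates false positives from false negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels standing in for `test_target` and `knn_predictions`.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Percentage of correct predictions and count of wrong ones.
acc = accuracy_score(y_true, y_pred) * 100
n_wrong = int((y_true != y_pred).sum())

# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print("Accuracy:", acc)
print("Number of misclassified points:", n_wrong)
print("TN, FP, FN, TP:", tn, fp, fn, tp)
```

In a diagnostic setting, false negatives (patients with heart disease predicted as normal) are often the more costly error, so it can be worth looking at them separately.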
