## 1. Introduction
Objective:
The heart is one of the most vital organs in the body. It beats about 2.5 billion times over an average lifetime, pushing millions of gallons of blood to every part of the body.
Heart disease is becoming more common, driven in part by modern lifestyle and diet, and diagnosing it remains a challenging task. The classification models in this notebook predict whether a patient has heart disease based on a set of clinical measurements and symptoms.

## Dataset
Goal: predict the presence or absence of heart disease based on the following health-related features:

- *age*: age in years 
- *sex*: (1 = male; 0 = female) 
- *cp*: chest pain type 
- *trestbps*: resting blood pressure (in mm Hg on admission to the hospital) 
- *chol*: serum cholesterol in mg/dl 
- *fbs*: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) 
- *restecg*: resting electrocardiographic results 
- *thalach*: maximum heart rate achieved 
- *exang*: exercise induced angina (1 = yes; 0 = no) 
- *oldpeak*: ST depression induced by exercise relative to rest 
- *slope*: the slope of the peak exercise ST segment 
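- *ca*: number of major vessels (0-3) colored by fluoroscopy 
- *thal*: 3 = normal; 6 = fixed defect; 7 = reversible defect in the original coding; this file stores the values 0-3, and the preprocessing below maps 1, 2, 3 to normal, fixed, reversible 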
- *target*: whether the patient has heart disease (1 = yes; 0 = no)
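
As a quick sanity check on the coding scheme listed above, the distinct values of each coded column can be compared against the data dictionary. The sketch below uses only the first five rows of `heart.csv` as they are displayed later in this notebook, so it is illustrative rather than a full audit; the names `sample`, `categorical`, and `observed_codes` are made up for this example.

```python
import io
import pandas as pd

# First five rows of heart.csv as displayed in this notebook (not the full file).
sample_csv = """age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
"""
sample = pd.read_csv(io.StringIO(sample_csv))

# Hypothetical check: list the distinct codes found in each coded column.
categorical = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "target"]
observed_codes = {col: sorted(sample[col].unique().tolist()) for col in categorical}
for col, codes in observed_codes.items():
    print(f"{col}: {codes}")
```

Running the same dictionary comprehension on the full `data` frame would show every code in use, which is one way to spot values that the feature descriptions do not cover.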

## Exploratory Data Analysis (EDA)
### Import Libraries


In [2]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

### Import Datasets

In [17]:
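data = pd.read_csv("heart.csv")
data

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |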


### Basic Analysis

In [20]:
# No-op: read_csv already returns a DataFrame
data = pd.DataFrame(data)

In [21]:
print(data)

      age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  \
0      52    1   0       125   212    0        1      168      0      1.0   
1      53    1   0       140   203    1        0      155      1      3.1   
2      70    1   0       145   174    0        1      125      1      2.6   
3      61    1   0       148   203    0        1      161      0      0.0   
4      62    0   0       138   294    1        1      106      0      1.9   
...   ...  ...  ..       ...   ...  ...      ...      ...    ...      ...   
1020   59    1   1       140   221    0        1      164      1      0.0   
1021   60    1   0       125   258    0        0      141      1      2.8   
1022   47    1   0       110   275    0        0      118      1      1.0   
1023   50    0   0       110   254    0        0      159      0      0.0   
1024   54    1   0       120   188    0        1      113      0      1.4   

      slope  ca  thal  target  
0         2   2     3       0  
1         0

In [22]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


In [23]:
data.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 | 1025.0 |
| mean | 54.434146 | 0.69561 | 0.942439 | 131.611707 | 246.0 | 0.149268 | 0.529756 | 149.114146 | 0.336585 | 1.071512 | 1.385366 | 0.754146 | 2.323902 | 0.513171 |
| std | 9.07229 | 0.460373 | 1.029641 | 17.516718 | 51.59251 | 0.356527 | 0.527878 | 23.005724 | 0.472772 | 1.175053 | 0.617755 | 1.030798 | 0.62066 | 0.50007 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 48.0 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 132.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 56.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 152.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 275.0 | 0.0 | 1.0 | 166.0 | 1.0 | 1.8 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


In [25]:
data.isnull().sum()

age         0
sex         0
cp          0
trestbps    0
chol        0
fbs         0
restecg     0
thalach     0
exang       0
oldpeak     0
slope       0
ca          0
thal        0
target      0
dtype: int64

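### Data Preprocessing
### One-Hot Encoding
One-hot encoding is used for categorical features. For example, the 'thal' feature is stored as integer codes (in this file, 1 = normal, 2 = fixed defect, 3 = reversible defect). Left as integers, the codes could lead a model to infer a false ordering such as reversible > fixed. Instead, we create one binary indicator column per category: is_normal, is_fixed, is_reversible. Rows with the unlabeled code 0 produce an additional is_0 column.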

In [28]:
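# Map the integer codes to readable labels (plain assignment avoids chained in-place modification)
data['thal'] = data['thal'].replace({1: 'normal', 2: 'fixed', 3: 'reversible'})
# Expand 'thal' into one indicator column per category, prefixed with 'is_'
data = pd.get_dummies(data, columns=["thal"], prefix=["is"])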
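
To see what the encoding step produces, here is a minimal sketch on a toy frame that contains each 'thal' code once. The frame `toy` and the `dtype=int` argument are assumptions made for this illustration; the notebook cell above does not pass `dtype`.

```python
import pandas as pd

# Toy frame with the four 'thal' codes that occur in this dataset (0-3).
toy = pd.DataFrame({"thal": [3, 2, 1, 0]})

# Same mapping as the notebook cell; code 0 has no label and is left as-is.
toy["thal"] = toy["thal"].replace({1: "normal", 2: "fixed", 3: "reversible"})

# dtype=int is an assumption made here so the indicators print as 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["thal"], prefix=["is"], dtype=int)
print(encoded)
```

Each row ends up with exactly one indicator set to 1, which is what removes the artificial ordering between the categories.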

### Feature Normalization

The ranges of different features often vary widely (for example, chol spans 126 to 564 while oldpeak spans 0 to 6.2). This can distort models that compare features by magnitude, such as distance-based methods. We therefore rescale every feature to the range [0, 1] using min-max normalization. For a feature x, the formula is: x' = (x - min(x)) / (max(x) - min(x)).


In [29]:
from sklearn.preprocessing import MinMaxScaler
# Rescale every column to [0, 1]; 0/1 columns such as target keep their values
scaler = MinMaxScaler()
data = pd.DataFrame(scaler.fit_transform(data.values), columns=data.columns, index=data.index)
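
The min-max formula can be checked by hand against `MinMaxScaler`. The sketch below uses a few 'chol' values from the rows shown earlier, together with the column minimum (126) and maximum (564) from `describe()`, so that the toy column spans the same range as the real one; the array `chol` is an illustrative subset, not the full column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A few 'chol' values from the displayed rows, plus the column min and max from describe().
chol = np.array([[212.0], [203.0], [174.0], [294.0], [126.0], [564.0]])

# Formula applied by hand: x' = (x - min(x)) / (max(x) - min(x))
manual = (chol - chol.min(axis=0)) / (chol.max(axis=0) - chol.min(axis=0))

# Same computation through scikit-learn
scaled = MinMaxScaler().fit_transform(chol)

print(np.round(manual.ravel(), 6))
```

The first value, (212 - 126) / (564 - 126), comes out at about 0.196347, which matches the scaled 'chol' entry for row 0 in the notebook output.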

In [30]:
data

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | target | is_0 | is_fixed | is_normal | is_reversible |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.479167 | 1.0 | 0.000000 | 0.292453 | 0.196347 | 0.0 | 0.5 | 0.740458 | 0.0 | 0.161290 | 1.0 | 0.50 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.500000 | 1.0 | 0.000000 | 0.433962 | 0.175799 | 1.0 | 0.0 | 0.641221 | 1.0 | 0.500000 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.854167 | 1.0 | 0.000000 | 0.481132 | 0.109589 | 0.0 | 0.5 | 0.412214 | 1.0 | 0.419355 | 0.0 | 0.00 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 0.666667 | 1.0 | 0.000000 | 0.509434 | 0.175799 | 0.0 | 0.5 | 0.687023 | 0.0 | 0.000000 | 1.0 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.687500 | 0.0 | 0.000000 | 0.415094 | 0.383562 | 1.0 | 0.5 | 0.267176 | 0.0 | 0.306452 | 0.5 | 0.75 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 0.625000 | 1.0 | 0.333333 | 0.433962 | 0.216895 | 0.0 | 0.5 | 0.709924 | 1.0 | 0.000000 | 1.0 | 0.00 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1021 | 0.645833 | 1.0 | 0.000000 | 0.292453 | 0.301370 | 0.0 | 0.0 | 0.534351 | 1.0 | 0.451613 | 0.5 | 0.25 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1022 | 0.375000 | 1.0 | 0.000000 | 0.150943 | 0.340183 | 0.0 | 0.0 | 0.358779 | 1.0 | 0.161290 | 0.5 | 0.25 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1023 | 0.437500 | 0.0 | 0.000000 | 0.150943 | 0.292237 | 0.0 | 0.0 | 0.671756 | 0.0 | 0.000000 | 1.0 | 0.00 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 |


### Train/test Split

- Training Data: examples used to train the model
- Testing Data: examples used to test the model, separate from training data
- We randomly choose 75% of all data for training, and the remaining 25% for testing. 

In [31]:
from sklearn.model_selection import train_test_split
# Separate the features from the label, then hold out 25% of the rows for testing
x = data.drop(columns="target")
y = data["target"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=42)
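
To make the 75/25 proportions concrete, the following sketch runs the same split on placeholder data with 1025 rows, the size of this dataset. The arrays `x_demo` and `y_demo` hold arbitrary values and are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (1025); values are arbitrary.
n_rows = 1025
x_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows) % 2

x_tr, x_te, y_tr, y_te = train_test_split(x_demo, y_demo, test_size=0.25, random_state=42)

# test_size=0.25 is rounded up to a whole number of rows: ceil(0.25 * 1025) = 257
print(len(x_tr), len(x_te))
```

The 257 test rows agree with the total support in the classification reports, and no row appears in both subsets.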

## Logistic Regression 

In [32]:
from sklearn.linear_model import LogisticRegression
# L2-regularized logistic regression; a smaller C means stronger regularization
clf = LogisticRegression(random_state=42, penalty='l2', solver="liblinear", C=0.1)
clf.fit(x_train, y_train)

In [52]:
pred = clf.predict(x_test)                 # hard 0/1 class predictions
scores = clf.predict_proba(x_test)[:, 1]  # predicted probability of class 1 (disease)

print('Accuracy: ', accuracy_score(y_test, pred))
print('AUROC: ', roc_auc_score(y_test, scores))
print(classification_report(y_test, pred))

Accuracy:  0.8171206225680934
AUROC:  0.8927878787878788
              precision    recall  f1-score   support

         0.0       0.84      0.80      0.82       132
         1.0       0.80      0.84      0.82       125

    accuracy                           0.82       257
   macro avg       0.82      0.82      0.82       257
weighted avg       0.82      0.82      0.82       257
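
The `pred` and `scores` values used above are two views of the same linear model: for a binary logistic regression, the class-1 probability is the sigmoid of the linear score, and the predicted label is 1 when that score is positive. The sketch below illustrates this on synthetic data from `make_classification`, which stands in for the heart dataset; all variable names are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; not the heart dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

model = LogisticRegression(solver="liblinear", C=0.1, random_state=42).fit(X_demo, y_demo)

z = model.decision_function(X_demo)        # linear score w.x + b
p_manual = 1.0 / (1.0 + np.exp(-z))        # logistic (sigmoid) function
p_model = model.predict_proba(X_demo)[:, 1]
labels = model.predict(X_demo)

print(np.allclose(p_manual, p_model))
```

Accuracy is computed from the hard labels, while AUROC is computed from the probabilities, which is why the two metrics take different inputs in the cell above.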



In [53]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

conf_matrix = confusion_matrix(y_test, pred)
print("Confusion Matrix:")
print(conf_matrix)

LS_accuracy = accuracy_score(y_test, pred)
print("Accuracy:", LS_accuracy)
print("Classification Report:")
print(classification_report(y_test, pred))

Confusion Matrix:
[[105  27]
 [ 20 105]]
Accuracy: 0.8171206225680934
Classification Report:
              precision    recall  f1-score   support

         0.0       0.84      0.80      0.82       132
         1.0       0.80      0.84      0.82       125

    accuracy                           0.82       257
   macro avg       0.82      0.82      0.82       257
weighted avg       0.82      0.82      0.82       257
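
The figures in the classification report can be recomputed from the confusion matrix alone. The sketch below does this for class 1, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predicted classes.
cm = np.array([[105, 27],
               [20, 105]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of patients predicted positive, the share that are positive
recall_1 = tp / (tp + fn)      # of truly positive patients, the share that were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The results (about 0.8171, 0.80, 0.84, and 0.82) agree with the accuracy and the class 1.0 row of the report.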



In [54]:
def rsme(actuals, predictions):
    # Root mean squared error (RMSE); np.sqrt works across scikit-learn versions
    from sklearn.metrics import mean_squared_error
    return np.sqrt(mean_squared_error(actuals, predictions))
LS_rsme = rsme(y_test, pred)
print('RMSE: ', LS_rsme)

RMSE:  0.42764398444489615
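
When both the labels and the predictions are 0 or 1, the root mean squared error printed above carries the same information as accuracy: each squared error is 0 for a correct prediction and 1 for a wrong one, so the mean squared error equals the misclassification rate. A small worked check, using the logistic regression counts from the confusion matrix (210 of 257 test rows correct):

```python
import numpy as np

# Accuracy reported above for logistic regression (210 of 257 test rows correct).
accuracy = 210 / 257

# For 0/1 labels each squared error is 0 (correct) or 1 (wrong), so MSE = 1 - accuracy.
rmse_from_accuracy = np.sqrt(1 - accuracy)
print(rmse_from_accuracy)
```

This reproduces the printed value of roughly 0.4276, so ranking the models by this error gives the same order as ranking them by accuracy.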


## Random Forest Classifier

In [56]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42)
rf.fit(x_train, y_train)
pred = rf.predict(x_test)
scores = rf.predict_proba(x_test)[:, 1]  # probability of class 1, as in the logistic regression cell
RFC_accuracy = accuracy_score(y_test, pred)
print("Accuracy:", RFC_accuracy)

Accuracy: 0.9883268482490273


In [57]:
RFC_rsme = rsme(y_test, pred)
print('RMSE: ', RFC_rsme)

RMSE:  0.10804236090984297


## K-Nearest Neighbors (KNN)

In [58]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)

In [59]:
KNN_accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", KNN_accuracy)

Accuracy: 0.9494163424124513


In [60]:
KNN_rsme = rsme(y_test, y_pred)
print('RMSE: ', KNN_rsme)

RMSE:  0.2249081092080689


In [61]:
# Build the model comparison table
compare = pd.DataFrame({'Model': ['Logistic Regression', 'K-Nearest Neighbour', 'Random Forest'],
                        'Accuracy (%)': [LS_accuracy*100, KNN_accuracy*100, RFC_accuracy*100],
                        'RMSE': [LS_rsme, KNN_rsme, RFC_rsme]})

# Display the table sorted by accuracy; Styler.hide(axis="index") replaces the deprecated hide_index()
compare.sort_values(by='Accuracy (%)', ascending=False).style.background_gradient(cmap='PuRd').hide(axis="index").set_properties(**{'font-family': 'Segoe UI'})

| Model | Accuracy (%) | RMSE |
|---|---|---|
| Random Forest | 98.832685 | 0.108042 |
| K-Nearest Neighbour | 94.941634 | 0.224908 |
| Logistic Regression | 81.712062 | 0.427644 |
