# **HEART DISEASE CLASSIFICATION**
## *About the dataset*

*This dataset contains patient records, and my job is to train a model that detects whether a patient has heart disease.*

### **Features** 

1. Age: age of the patient [years]

2. Sex: sex of the patient [M: Male, F: Female]

3. ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]

4. RestingBP: resting blood pressure [mm Hg]

5. Cholesterol: serum cholesterol [mg/dl]

6. FastingBS: fasting blood sugar [1: if FastingBS > 120 mg/dl, 0: otherwise]

7. RestingECG: resting electrocardiogram results [Normal: Normal, ST: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV), LVH: showing probable or definite left ventricular hypertrophy by Estes' criteria]

8. MaxHR: maximum heart rate achieved [Numeric value between 60 and 202]

9. ExerciseAngina: exercise-induced angina [Y: Yes, N: No]

10. Oldpeak: ST depression induced by exercise relative to rest [Numeric value]

11. ST_Slope: the slope of the peak exercise ST segment [Up: upsloping, Flat: flat, Down: downsloping]

12. HeartDisease: output class [1: heart disease, 0: Normal]

In [2]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import joblib


### **EXPLORATORY DATA ANALYSIS (EDA)**

In [3]:
# Read the data
data = pd.read_csv('Data/heart.csv')  # forward slash avoids the invalid '\h' escape sequence
data.head()

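| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up | 0 |
| 1 | 49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat | 1 |
| 2 | 37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up | 0 |
| 3 | 48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat | 1 |
| 4 | 54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up | 0 |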


In [4]:
data.describe(include= 'all')

| | Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope | HeartDisease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 918.0 | 918 | 918 | 918.0 | 918.0 | 918.0 | 918 | 918.0 | 918 | 918.0 | 918 | 918.0 |
| unique | | 2 | 4 | | | | 3 | | 2 | | 3 | |
| top | | M | ASY | | | | Normal | | N | | Flat | |
| freq | | 725 | 496 | | | | 552 | | 547 | | 460 | |
| mean | 53.510893 | | | 132.396514 | 198.799564 | 0.233115 | | 136.809368 | | 0.887364 | | 0.553377 |
| std | 9.432617 | | | 18.514154 | 109.384145 | 0.423046 | | 25.460334 | | 1.06657 | | 0.497414 |
| min | 28.0 | | | 0.0 | 0.0 | 0.0 | | 60.0 | | -2.6 | | 0.0 |
| 25% | 47.0 | | | 120.0 | 173.25 | 0.0 | | 120.0 | | 0.0 | | 0.0 |
| 50% | 54.0 | | | 130.0 | 223.0 | 0.0 | | 138.0 | | 0.6 | | 1.0 |
| 75% | 60.0 | | | 140.0 | 267.0 | 0.0 | | 156.0 | | 1.5 | | 1.0 |
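
The summary reports a minimum of 0.0 for both RestingBP and Cholesterol. Neither value is physiologically plausible, so these zeros probably stand for missing measurements. The sketch below shows one way to count such rows and impute them; the tiny `toy` frame is made up for illustration, and median imputation is an assumption, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `data`; only the column names come from the dataset.
toy = pd.DataFrame({
    "RestingBP": [140, 0, 130, 150],
    "Cholesterol": [289, 180, 0, 0],
})

cols = ["RestingBP", "Cholesterol"]

# Count the zero entries in each column.
zero_counts = (toy[cols] == 0).sum()
print(zero_counts)

# One possible treatment (an assumption): mark zeros as missing,
# then fill them with the median of the remaining values.
cleaned = toy.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)
cleaned[cols] = cleaned[cols].fillna(cleaned[cols].median())
print(cleaned)
```

In a real pipeline the medians would ideally be computed on the training split only and then reused for the test split.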


### **DATA PREPROCESSING**

In [5]:
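def preprocessing(data):
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()

    # Convert binary features into 0 and 1 (M/Y -> 1, F/N -> 0)
    binary_features = ['Sex', 'ExerciseAngina']
    for col in binary_features:
        data[col] = data[col].apply(lambda x: 1 if x in ['M', 'Y'] else 0)

    # Integer-encode features with more than two categories.
    # LabelEncoder assigns codes in alphabetical order, so the resulting
    # numbers carry an arbitrary ordering.
    cat_features = ['ChestPainType', 'RestingECG', 'ST_Slope']
    for col in cat_features:
        data[col] = LabelEncoder().fit_transform(data[col])

    return data

data = preprocessing(data)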
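
Integer codes make a linear model treat categories such as ASY, ATA, NAP and TA as points on a numeric scale. One-hot encoding is a common alternative that avoids this. The example below is a minimal sketch using `pd.get_dummies` on a made-up `toy` frame; it is not the encoding used in the rest of this notebook.

```python
import pandas as pd

# Made-up rows; the column names and category labels come from the dataset.
toy = pd.DataFrame({
    "ChestPainType": ["ATA", "NAP", "ASY", "TA"],
    "ST_Slope": ["Up", "Flat", "Down", "Flat"],
    "Age": [40, 49, 37, 48],
})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["ChestPainType", "ST_Slope"], dtype=int)
print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one indicator per feature, which removes the redundancy between the indicators and the bias term.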

In [6]:
# Separate features and target, scale the features, then split 70/30

X = data.drop('HeartDisease', axis= 1)
y = data.HeartDisease

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size= 0.3, random_state= 42)

y_train = np.array(y_train)
y_test = np.array(y_test)
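
In the cell above the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A variant that avoids this splits first and fits the scaler on the training part only. The sketch below uses made-up arrays (`X_demo`, `y_demo`) in place of the real data, and the `stratify` argument is an optional extra that the original cell does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up data standing in for X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, keeping the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it for the test rows.
scaler_demo = StandardScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)
```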

### **MODELLING**

In [8]:
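class LogisticRegression:
    """Binary logistic regression trained with batch gradient descent."""

    def __init__(self, learning_rate= 0.05, iter= 1000, print_cost= False):
        self.weight = None
        self.bias = None
        self.learning_rate = learning_rate
        self.iter = iter  # number of gradient-descent iterations
        self.print_cost = print_cost

    def fit(self, X, y):
        m, n = X.shape  # m samples, n features
        # Random initialisation in [0, 1); unseeded, so results vary slightly between runs
        self.weight = np.random.rand(n)
        self.bias = np.random.rand(1)

        for i in range(self.iter):
            # Forward pass: linear score followed by the sigmoid
            z = np.dot(X, self.weight) + self.bias
            g = 1 / (1 + np.exp(-z))

            if self.print_cost and i % 50 == 0:
                # Summed cross-entropy divided by 10, only to keep the printed numbers small
                cost = self.compute_cost(g, y) / 10
                print(f'Cost at iterations: {i} is {cost:.2f}')

            # Gradients of the mean cross-entropy with respect to weights and bias
            dw = np.dot(X.T, (g - y)) / m
            db = np.sum(g - y) / m

            self.weight = self.weight - (self.learning_rate * dw)
            self.bias = self.bias - (self.learning_rate * db)

    def predict(self, X):
        z = np.dot(X, self.weight) + self.bias
        y_preds = 1 / (1 + np.exp(-z))
        # Threshold the probability at 0.5 to get a class label
        return (y_preds > 0.5).astype(int)

    def score(self, X, y):
        # Accuracy as a percentage
        y_preds = self.predict(X)
        return np.mean(y_preds == y) * 100

    def compute_cost(self, y_preds, y):
        # np.dot already sums over samples, so this is the total (not the mean) cross-entropy
        cost = -(np.dot(y, np.log(y_preds)) + np.dot((1 - y), np.log(1 - y_preds)))
        return cost

    def coefficients(self):
        # Return the learned weights and bias
        return self.weight, self.bias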
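
The cost in the class is built from `np.dot`, which sums over all samples, so the value it reports is a total rather than a per-sample average, and `np.log` can return `-inf` when a predicted probability reaches exactly 0 or 1. The helpers below are a hedged sketch of a mean cross-entropy that guards against both issues. The names `sigmoid` and `mean_cross_entropy` and the `eps` parameter are my own additions, not part of the class above.

```python
import numpy as np

def sigmoid(z):
    """Logistic function written with tanh, which does not overflow for large |z|."""
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def mean_cross_entropy(probs, y, eps=1e-12):
    """Average binary cross-entropy; probabilities are clipped away from 0 and 1."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return -np.mean(y * np.log(probs) + (1 - y) * np.log(1 - probs))

# A model that always predicts 0.5 has a cost of log(2), about 0.693.
print(mean_cross_entropy(np.array([0.5, 0.5]), np.array([1, 0])))
```

Because this version averages over samples, its values stay comparable across datasets of different sizes.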

### **TRAIN AND EVALUATE THE MODEL**

In [9]:
# Model Training
model = LogisticRegression(print_cost= True)

model.fit(X_train, y_train)

Cost at iterations: 0 is 78.35
Cost at iterations: 50 is 39.08
Cost at iterations: 100 is 30.48
Cost at iterations: 150 is 27.35
Cost at iterations: 200 is 25.87
Cost at iterations: 250 is 25.08
Cost at iterations: 300 is 24.62
Cost at iterations: 350 is 24.34
Cost at iterations: 400 is 24.17
Cost at iterations: 450 is 24.05
Cost at iterations: 500 is 23.98
Cost at iterations: 550 is 23.93
Cost at iterations: 600 is 23.90
Cost at iterations: 650 is 23.88
Cost at iterations: 700 is 23.86
Cost at iterations: 750 is 23.85
Cost at iterations: 800 is 23.85
Cost at iterations: 850 is 23.84
Cost at iterations: 900 is 23.84
Cost at iterations: 950 is 23.83
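
The printed cost barely changes after roughly iteration 700, so a fixed budget of 1000 iterations spends some steps on very small improvements. The function below is a rough sketch of the same gradient-descent loop with two additions that are assumptions of mine: a seeded random generator for reproducible initial weights, and a tolerance `tol` that stops training once the mean cost improves by less than that amount. `fit_with_tolerance` is a hypothetical helper, not a method of the class above.

```python
import numpy as np

def fit_with_tolerance(X, y, learning_rate=0.05, max_iter=1000, tol=1e-6, seed=0):
    """Gradient descent for logistic regression with a simple stopping rule."""
    rng = np.random.default_rng(seed)  # seeded, so repeated runs match
    m, n = X.shape
    w = rng.random(n)
    b = 0.0
    prev_cost = np.inf
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        p = 0.5 * (1.0 + np.tanh(0.5 * (X @ w + b)))  # sigmoid via tanh
        p_safe = np.clip(p, 1e-12, 1.0 - 1e-12)
        cost = -np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
        if prev_cost - cost < tol:  # improvement too small: stop
            break
        prev_cost = cost
        w = w - learning_rate * (X.T @ (p - y)) / m
        b = b - learning_rate * np.sum(p - y) / m
    return w, b, n_iter

# Made-up data to exercise the function.
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng_demo.normal(size=200) > 0).astype(int)
w_demo, b_demo, steps = fit_with_tolerance(X_demo, y_demo)
print(steps)
```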


In [10]:
# Evaluate model performance

training_performance = model.score(X_train, y_train)
test_performance = model.score(X_test, y_test)

print(f'Model performance on training set: {training_performance:.2f}%')
print(f'Model performance on testing set: {test_performance:.2f}%')

Model performance on training set: 85.51%
Model performance on testing set: 87.68%
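
Accuracy alone does not show how the errors split between missed cases of heart disease and false alarms. A confusion matrix together with precision and recall gives that breakdown. The sketch below uses made-up label arrays (`y_true_demo`, `y_pred_demo`); with the real model they would be `y_test` and `model.predict(X_test)`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels and predictions for illustration only.
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true_demo, y_pred_demo)
precision = precision_score(y_true_demo, y_pred_demo)
recall = recall_score(y_true_demo, y_pred_demo)

print(cm)
print(f'Precision: {precision:.2f}, recall: {recall:.2f}')
```

For a screening task, recall on the positive class is often the number to watch, since a false negative means a patient with heart disease goes undetected.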


In [11]:
i = 24
predict = model.predict(X_test[i])
truth = y_test[i]

print(f'Model predicts that patient {i+1} is {predict}')
print(f'The ground truth is {truth}')

Model predicts that patient 25 is [1]
The ground truth is 1


In [241]:
joblib.dump(model, 'LogisticRegression_model.pkl')

['LogisticRegression_model.pkl']
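
Two practical points about the saved file: the model expects scaled inputs, so the fitted scaler is needed at prediction time as well, and a pickled instance of a class defined inside a notebook can only be loaded where that class definition is available. One way around both is to store the scaler together with the plain weight and bias arrays. The sketch below uses made-up values and a hypothetical file name (`heart_bundle.pkl`); it is one option, not the approach taken above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up stand-ins for the fitted scaler and the learned parameters.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
scaler_demo = StandardScaler().fit(X_demo)
weights_demo = np.array([0.5, -0.25, 1.0])

bundle = {"scaler": scaler_demo, "weight": weights_demo, "bias": 0.1}

# Save and reload in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "heart_bundle.pkl")
    joblib.dump(bundle, path)
    restored = joblib.load(path)

# Score one raw (unscaled) row with the restored pieces.
z = restored["scaler"].transform(X_demo[:1]) @ restored["weight"] + restored["bias"]
probability = 1 / (1 + np.exp(-z))
print(probability)
```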