# HEART DISEASE CLASSIFICATION


**We have a dataset that labels patients as having heart disease or not, based on the features listed below. We will use it to build a model that tries to predict whether a patient has the disease, using machine learning algorithms.**

The data contains:

- age - age in years

- sex - (1 = male; 0 = female)

- cp - chest pain type

- trestbps - resting blood pressure (in mm Hg on admission to the hospital)

- chol - serum cholesterol in mg/dl

- fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

- restecg - resting electrocardiographic results

- thalach - maximum heart rate achieved

- exang - exercise induced angina (1 = yes; 0 = no)

- oldpeak - ST depression induced by exercise relative to rest

- slope - the slope of the peak exercise ST segment

- ca - number of major vessels (0-3) colored by fluoroscopy

- thal - 3 = normal; 6 = fixed defect; 7 = reversible defect

- target - have disease or not (1=yes, 0=no)

### Importing Datasets and Libraries

In [None]:
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

### READ DATA

In [None]:
df = pd.read_csv('../input/heart-disease-uci/heart.csv')

In [None]:
df.head(10)

### Data Exploration

In [None]:
df.target.value_counts()

In [None]:
sns.countplot(x='target', data=df, palette='bwr')
plt.show()

In [None]:
countNoDisease = len(df[df.target==0])
countHaveDisease = len(df[df.target==1])
print("Percentage of Patients Without Heart Disease: {:.2f}%".format(countNoDisease / len(df.target) * 100))
print("Percentage of Patients With Heart Disease: {:.2f}%".format(countHaveDisease / len(df.target) * 100))
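
The same class balance can be read off in a single call. A minimal sketch, assuming a small made-up `target` column (the `toy` frame below is illustrative and is not the heart dataset):

```python
import pandas as pd

# Hypothetical stand-in for df['target']; the values are made up.
toy = pd.DataFrame({'target': [1, 1, 1, 0, 0]})

# normalize=True returns class fractions instead of raw counts.
class_share = toy['target'].value_counts(normalize=True) * 100
print(class_share.round(2))
```

On the real data, the same call on `df['target']` should give the two percentages printed above.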

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.groupby('target').mean()

In [None]:
# MALE vs FEMALE
pd.crosstab(df.sex, df.target).plot(kind='bar', figsize=(15,6),color=['green', 'red'])
plt.title('Heart Disease Frequency for Sex')
plt.xlabel('Sex(0 = Female, 1 = Male)')
plt.xticks(rotation=0)
plt.legend(["No Disease", "Have Disease"])
plt.ylabel('Frequency')
plt.show()

In [None]:
plt.scatter(x=df.age[df.target==1], y=df.thalach[(df.target==1)], c='red')
plt.scatter(x=df.age[df.target==0], y=df.thalach[(df.target==0)])
plt.legend(['Disease', 'Not Disease'])
plt.xlabel('Age')
plt.ylabel('Maximum Heart Rate')
plt.show()

**Creating Dummy Variable**

Since 'cp', 'thal' and 'slope' are categorical variables, we'll turn them into dummy variables.

In [None]:
a = pd.get_dummies(df['cp'], prefix='cp')
b = pd.get_dummies(df['thal'], prefix='thal')
c = pd.get_dummies(df['slope'], prefix='slope')

In [None]:
frames = [df,a,b,c]
df = pd.concat(frames, axis=1)
df.head()

In [None]:
df = df.drop(columns = ['cp', 'thal', 'slope'])
df.head()
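
The encode, concatenate and drop steps above can also be done in one call. A minimal sketch, assuming a made-up frame that only mimics a few of the column names (the rows are not taken from the dataset):

```python
import pandas as pd

# Made-up rows that only mimic the column names used above.
toy = pd.DataFrame({
    'age': [63, 37, 41],
    'cp': [3, 2, 1],
    'slope': [0, 0, 2],
})

# columns=... encodes the listed columns and drops the originals in one step.
encoded = pd.get_dummies(toy, columns=['cp', 'slope'])
print(encoded.columns.tolist())
```

Only categories that actually occur in the data get a dummy column, so the resulting column set depends on the rows passed in.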

# Splitting the Data

In [None]:
X = df.drop(['target'], axis=1)
y = df['target']

In [None]:
X.columns

In [None]:
y

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
print(X.shape, X_train.shape, X_test.shape)
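
With a dataset this small, a plain random split can leave the two classes in different proportions in the train and test parts. One optional variation is a stratified split. A minimal sketch on synthetic data (`X_toy` and `y_toy` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 30% positives (not the heart dataset).
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify=y_toy keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```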

# Machine Learning Model

**Let's try different machine learning algorithms. Since the target is a yes/no label, this is a classification problem.**

1. Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='liblinear')
lr.fit(X_train, y_train)
acc_lr = round(lr.score(X_train, y_train)*100, 2)
print(str(acc_lr)+ ' Percentage')

2. Support Vector Classifier

In [None]:
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train, y_train)
acc_svc = round(svc.score(X_train, y_train)*100, 2)
print(str(acc_svc)+' Percentage')

3. K-Nearest Neighbors Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
acc_knn = round(knn.score(X_train, y_train)*100, 2)
print(str(acc_knn)+' Percentage')

4. Decision Tree Classifier

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
acc_dt = round(dt.score(X_train, y_train)*100, 2)
print(str(acc_dt)+' Percentage')

5. Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train, y_train)
acc_rf = round(rf.score(X_train, y_train)*100, 2)
print(str(acc_rf)+' Percentage')

6. Naive Bayes

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)
acc_nb = round(nb.score(X_train, y_train)*100, 2)
print(str(acc_nb)+' Percentage')
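
SVC and KNN both work with distances between rows, so a feature measured on a large scale (such as `chol`) can outweigh one measured on a small scale (such as `oldpeak`). A common remedy is to standardize the features inside a pipeline. The sketch below is an optional variation, not part of the notebook's own workflow, and runs on synthetic data from `make_classification`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the heart features (assumption for illustration).
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

# The scaler is fitted on the training split only, then reused on the test split.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_tr, y_tr)
print(round(scaled_knn.score(X_te, y_te) * 100, 2))
```

Whether scaling helps on the heart data would have to be checked by running both versions.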

# Comparing Models
Let's compare the accuracy scores of all the models used above.

In [None]:
models = pd.DataFrame({
    'Models':['Logistic Regression', 'Support Vector', 'KNN', 'Decision Tree', 'Random Forest', 'Naive Bayes'],
    'Score':[acc_lr, acc_svc, acc_knn, acc_dt, acc_rf, acc_nb]
})

models.sort_values(by='Score', ascending=False)

From the table above we can see that Decision Tree and Random Forest reach 100% accuracy, but these scores are computed on the training data.

So our next task is to check the accuracy score on the test data.

In [None]:
from sklearn.metrics import accuracy_score

Apply the predict method to every model

In [None]:
lr_pred = lr.predict(X_test) #Logistic Regression
svm_pred = svc.predict(X_test) #Support Vector
knn_pred = knn.predict(X_test) #K-Nearest
dt_pred = dt.predict(X_test) #Decision Tree
rf_pred = rf.predict(X_test) #Random Forest
nb_pred = nb.predict(X_test) #Naive Bayes

Time to check accuracy_score on the test data

In [None]:
# accuracy_score expects (y_true, y_pred)
test_lr = round(accuracy_score(y_test, lr_pred)*100,2)
test_svm = round(accuracy_score(y_test, svm_pred)*100,2)
test_knn = round(accuracy_score(y_test, knn_pred)*100,2)
test_dt = round(accuracy_score(y_test, dt_pred)*100,2)
test_rf = round(accuracy_score(y_test, rf_pred)*100,2)
test_nb = round(accuracy_score(y_test, nb_pred)*100,2)

test_models = pd.DataFrame({
    'Models':['Logistic Regression', 'Support Vector', 'KNN', 'Decision Tree', 'Random Forest', 'Naive Bayes'],
    'Score(Test Data)':[test_lr, test_svm, test_knn, test_dt, test_rf, test_nb]
})

test_models.sort_values(by='Score(Test Data)', ascending=False)
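
Train and test accuracy are easier to compare side by side. A minimal sketch, assuming synthetic data and only two of the models (the `candidates` dict and the `Gap` column are illustrative names, not part of the notebook):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the sketch runs on its own.
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

candidates = {
    'Logistic Regression': LogisticRegression(solver='liblinear'),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    rows.append({
        'Models': name,
        'Train': round(model.score(X_tr, y_tr) * 100, 2),
        'Test': round(model.score(X_te, y_te) * 100, 2),
    })

# A large Train minus Test gap suggests the model memorized the training data.
comparison = pd.DataFrame(rows)
comparison['Gap'] = comparison['Train'] - comparison['Test']
print(comparison.sort_values(by='Test', ascending=False))
```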

Comparing the two tables, **Logistic Regression** holds up well on both the train and the test data, whereas the 100% train accuracy of Decision Tree and Random Forest says little about unseen data.
So we will evaluate the **Logistic Regression** model.

# Model Evaluations

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

**Classification Report**

In [None]:
print(classification_report(y_test, lr_pred))

**Confusion Matrix**

In [None]:
cm = confusion_matrix(y_test, lr_pred)

In [None]:
cm
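
The four cells of a binary confusion matrix can be unpacked and turned into precision and recall by hand, which helps when reading the classification report. A minimal sketch on made-up labels (`y_true` and `y_hat` are illustrative, not model output):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 1, 0, 1, 0]

# For binary labels, rows are actual classes and columns are predicted classes,
# so ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
print(tn, fp, fn, tp, precision, recall)
```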

# Visualize

In [None]:
plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d')  # fmt='d' shows the counts as integers
plt.title('Confusion Matrix')
plt.show()

# Building Predictive Model

In [None]:
input_data = (34, 1, 140,230,0,1,170,1, 3.2, 1, 1,0,0,0,0,1,0,0,0,0,1)

# change the input data to a numpy array
input_data_as_numpy_array = np.asarray(input_data)

# reshape the numpy array since we are predicting for only one instance
input_data_reshaped = input_data_as_numpy_array.reshape(1, -1)

# the values must follow the column order of X (see X.columns)
prediction = lr.predict(input_data_reshaped)
print(prediction)

if prediction[0] == 0:
    print('The Person does not have Heart Disease')
else:
    print('The Person has Heart Disease')
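
Passing a bare array relies on the values being in exactly the same order as the training columns. One alternative is to build the new row as a DataFrame with named columns and also look at the predicted probability. A minimal sketch, assuming a tiny made-up training set with only two features (`train`, `labels` and `new_patient` are illustrative):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set; the rows are not from the heart dataset.
train = pd.DataFrame({
    'age': [29, 45, 52, 61, 67, 70],
    'thalach': [190, 172, 160, 140, 125, 110],
})
labels = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression(solver='liblinear').fit(train, labels)

# Building the new row as a DataFrame with the training columns keeps the
# feature order explicit and avoids scikit-learn's feature-name warning.
new_patient = pd.DataFrame([[58, 150]], columns=train.columns)

print(clf.predict(new_patient))
print(clf.predict_proba(new_patient))  # columns follow clf.classes_
```

In the notebook itself, the equivalent would be a one-row DataFrame built with `columns=X.columns`.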

**DATA FOR A PERSON WHO DOES NOT HAVE HEART DISEASE**


34, 1, 140,230,0,1,170,1, 3.2, 1, 1,0,0,0,0,1,0,0,0,0,1

**DATA FOR A PERSON WHO HAS HEART DISEASE**

54, 0,132,200,1,0,220,0,4.2,0,0,1,1,1,1,0,1,1,1,1,0

# GREAT JOB!

**We are done with this project.**