# Heart disease prediction 


This data set dates from 1995 to 2021 and consists of four databases: Long Beach V, Hungary, Switzerland, and Cleveland. It contains 76 attributes, including the predicted attribute, but all published experiments use a subset of 14 of them.

The goal of this project is to predict whether a patient has heart disease. I will explore which parameters affect heart disease and visualize the relationships between them.

## Exploring dataset

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

In [None]:
df=pd.read_csv("/content/heart.csv")
# df=pd.read_csv("/content/data01.csv")
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [None]:
df.shape

(918, 12)

Attribute Information:
- age - age of patient
- sex - female(1) / male(0)
- cp - chest pain type (4 values)
- trestbps - resting blood pressure
- chol - serum cholesterol in mg/dl
- fbs - fasting blood sugar > 120 mg/dl
- restecg - resting electrocardiographic results (values 0,1,2)
- thalach - maximum heart rate achieved
- exang - exercise induced angina
- oldpeak - ST depression induced by exercise relative to rest
- slope - the slope of the peak exercise ST segment
- ca - number of major vessels (0-3) colored by fluoroscopy
- thal - 0 = normal; 1 = fixed defect; 2 = reversible defect
- target - the "target" field refers to the presence of heart disease in the patient. It is integer valued 0 = no disease and 1 = disease

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB


Everything looks fine: we don't have any missing values.

In [None]:
df.describe()

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
count,918.0,918.0,918.0,918.0,918.0,918.0,918.0
mean,53.510893,132.396514,198.799564,0.233115,136.809368,0.887364,0.553377
std,9.432617,18.514154,109.384145,0.423046,25.460334,1.06657,0.497414
min,28.0,0.0,0.0,0.0,60.0,-2.6,0.0
25%,47.0,120.0,173.25,0.0,120.0,0.0,0.0
50%,54.0,130.0,223.0,0.0,138.0,0.6,1.0
75%,60.0,140.0,267.0,0.0,156.0,1.5,1.0
max,77.0,200.0,603.0,1.0,202.0,6.2,1.0


In [None]:
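# Correlation between numeric columns only (pandas >= 2.0 raises an error on string columns otherwise)
df.corr(numeric_only=True)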

Unnamed: 0,Age,RestingBP,Cholesterol,FastingBS,MaxHR,Oldpeak,HeartDisease
Age,1.0,0.254399,-0.095282,0.198039,-0.382045,0.258612,0.282039
RestingBP,0.254399,1.0,0.100893,0.070193,-0.112135,0.164803,0.107589
Cholesterol,-0.095282,0.100893,1.0,-0.260974,0.235792,0.050148,-0.232741
FastingBS,0.198039,0.070193,-0.260974,1.0,-0.131438,0.052698,0.267291
MaxHR,-0.382045,-0.112135,0.235792,-0.131438,1.0,-0.160691,-0.400421
Oldpeak,0.258612,0.164803,0.050148,0.052698,-0.160691,1.0,0.403951
HeartDisease,0.282039,0.107589,-0.232741,0.267291,-0.400421,0.403951,1.0


### Making new columns

In [1]:
# New column with sex as "female"/"male" labels for readability
df["gender"]=df["sex"].replace(1,"female").replace(0,"male")
# New column with the patient's disease status as a text label
df["patient_stat"]=df["target"].replace(1,"disease").replace(0,"no disease")
df.head()

## Searching for outliers

In [None]:
# Shape of dataset before cleaning outliers
df.shape

(918, 12)

In [2]:
df['trestbps'].describe()

In [None]:
#visualize outliers with boxplot
plt.boxplot(df['trestbps'])

In [None]:
# Upper outlier threshold  Q3 + 1.5(IQR)
max_threshold=140 + 1.5*(140 - 120)
max_threshold
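The threshold above is typed in by hand from the `describe()` output. As a minimal sketch, the same fences can be computed directly from a column. The helper name `iqr_bounds` and the small series below are assumptions made for illustration; they are not part of the notebook or the real data.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up resting blood pressure values, for illustration only
bp = pd.Series([110, 120, 120, 130, 130, 140, 140, 150, 200])
lower, upper = iqr_bounds(bp)

# Keep only the values inside the fences
kept = bp[(bp >= lower) & (bp <= upper)]
```

Computing the quartiles from the data keeps the threshold in sync if rows are added or removed before this step.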


In [None]:
#how many outliers do we have (values greater than max_threshold)
outliers=df[df.trestbps>max_threshold]
outliers.shape

In [None]:
# Removing outliers
df2=df[df.trestbps<max_threshold]
# recalculate summary statistics
df2['trestbps'].describe()

In [None]:
#boxplot after removing outliers
plt.boxplot(df2['trestbps'])

In [None]:
df['chol'].describe()

In [None]:
#visualize outliers with boxplot
plt.boxplot(df['chol'])

In [None]:
# Upper outlier threshold  Q3 + 1.5(IQR)
max_threshold=275 + 1.5*(275 - 211)
max_threshold

In [None]:
#how many outliers do we have (values greater than max_threshold)
outliers=df[df.chol>max_threshold]
outliers.shape

In [None]:
# Removing outliers
df3=df2[df2.chol<max_threshold]
# recalculate summary statistics
df3['chol'].describe()

In [None]:
#boxplot after removing outliers
plt.boxplot(df3['chol'])

In [None]:
df['thalach'].describe()

In [None]:
#visualize outliers with boxplot
plt.boxplot(df['thalach'])

In [None]:
#Lower Outlier Threshold = Q1 – 1.5(IQR)
lower_threshold= 132 - 1.5*(166-132)
lower_threshold

In [None]:
# Removing outliers
df4=df3[df3.thalach>lower_threshold]
# recalculate summary statistics
df4['thalach'].describe()

In [None]:
#boxplot after removing outliers
plt.boxplot(df4['thalach'])

In [None]:
df4['oldpeak'].describe()

In [None]:
#visualize outliers with boxplot
plt.boxplot(df4['oldpeak'])

In [None]:
# Upper outlier threshold  Q3 + 1.5(IQR), with Q1 = 0 and Q3 = 1.6
max_threshold=1.6 + 1.5*(1.6 - 0)
max_threshold

In [None]:
# Removing outliers
df5=df4[df4.oldpeak<max_threshold]
# recalculate summary statistics
df5['oldpeak'].describe()

In [None]:
#visualize outliers with boxplot
plt.boxplot(df5['oldpeak'])

In [None]:
# shape of dataset after cleaning dataset
df5.shape

## Data visualization

In [None]:
# Share of female and male patients among those with heart disease
gender_counts=df5.loc[df5['patient_stat']=='disease','gender'].value_counts()
plt.pie(x=gender_counts, labels=gender_counts.index, autopct='%1.1f%%')
plt.show()

In [None]:
# Average cholesterol for female and male patients
df5.groupby('gender')['chol'].mean().plot.bar()

The bar chart above shows that males have slightly higher average cholesterol than females.

In [None]:
title = 'Chest pain by age'
plt.figure(figsize=(8,5))
# x and y must be passed as keywords in seaborn >= 0.12
sns.scatterplot(x=df5.age, y=df5.cp, hue=df5.patient_stat).set_title(title)
plt.show()

Patients without disease have the lowest chest pain values at almost every age. Patients with disease report chest pain even at younger ages, though the values are not high. Most patients with heart disease between the ages of 35 and 70 have a chest pain value of 1-2.

## Categorize "age" column

In [None]:
# Number of unique values in the age column
len(df5["age"].unique())

In [None]:
df5["age"].describe()

In [None]:
df6=df5.copy()

In [None]:
# Defining function that will categorize age column into three groups
def age (row):
    if row["age"]<=35:
        return "Young"
    if(35< row["age"]<=55):
        return "Mid_age"
    else:
        return "Old"

df6["old"]=df6.apply(age,axis=1)

In [None]:
df6.head()
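The row-wise function above can also be expressed as a vectorised binning step. The following is a minimal sketch using `pd.cut` with the same cut points (35 and 55) and the same labels; the `ages` series is made-up illustration data, not the notebook's dataframe.

```python
import numpy as np
import pandas as pd

# Made-up ages, chosen to sit on and around the cut points
ages = pd.Series([29, 35, 36, 55, 56, 77], name="age")

# Intervals are right-closed by default: (-inf, 35], (35, 55], (55, inf)
age_group = pd.cut(
    ages,
    bins=[-np.inf, 35, 55, np.inf],
    labels=["Young", "Mid_age", "Old"],
)
```

The same pattern would apply to the other columns binned in the following sections, with their own cut points and labels.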

## Categorize "trestbps" column 

In [None]:
# Number of unique values in the trestbps column
len(df6["trestbps"].unique())

In [None]:
df6["trestbps"].describe()

In [None]:
#Calculating average blood pressure
df6["trestbps"].mean()

In [None]:
df7=df6.copy()

In [None]:
# Defining function that will categorize blood_pressure into three groups
def blood_pressure (row):
    if row["trestbps"]<=115:
        return "Low_pressure"
    if(115< row["trestbps"]<=130):
        return "Normal_pressure"
    else:
        return "High"

df7["blood_pressure_stat"]=df7.apply(blood_pressure,axis=1)
df7.head()

## Categorize "chol" column

In [None]:
# Number of unique values in the chol column
len(df7["chol"].unique())

In [None]:
df7["chol"].describe()

In [None]:
#calculating average cholesterol 
df7["chol"].mean()

In [None]:
df8=df7.copy()

In [None]:
# Function to categorize the chol column into three groups
def chol_stat (row):
    if row["chol"]<=160:
        return "Low_chol"
    if(160< row["chol"]<=250):
        return "mid_chol"
    else:
        return "High_chol"

df8["chol_stat"]=df8.apply(chol_stat,axis=1)
df8.head()

## Categorize "thalach" column

In [None]:
df8["thalach"].describe()

In [None]:
# Calculating average value of heart rate
df8["thalach"].mean()

In [None]:
df9=df8.copy()

In [None]:
# Function to categorize the thalach column into three groups
def heart_rate (row):
    if row["thalach"]<=120:
        return "Low_rate"
    # lower bound matches the cut-off used above (was 110, which had no effect)
    if(120< row["thalach"]<=160):
        return "mid_rate"
    else:
        return "High_rate"

df9["heart_rate_stat"]=df9.apply(heart_rate,axis=1)
df9.head()

## Categorize "oldpeak" column

In [None]:
df9["oldpeak"].describe()

In [None]:
#Calculating average depression level
df9["oldpeak"].mean()

In [None]:
df10=df9.copy()

In [None]:
# Function to categorize the oldpeak column into three groups
def depression (row):
    if row["oldpeak"]<=1:
        return "Low_rate"
    if(1< row["oldpeak"]<=2):
        return "mid_rate"
    else:
        return "High_rate"

df10["depression_stat"]=df10.apply(depression,axis=1)
df10.head()

## Preparing dataset for model

In [None]:
# Dropping columns that we don't need
df11=df10.drop(["age","trestbps","chol","thalach","oldpeak","gender","patient_stat"],axis=1)
df11.head()

In [None]:
# Correlation between numeric columns (the categorized columns hold strings)
df11.corr(numeric_only=True)

In [None]:
# Showing correlation between numeric columns as a heatmap
plt.figure(figsize=(8,5))
sns.heatmap(df11.corr(numeric_only=True), annot=True, cmap='coolwarm')

Everything looks fine. We don't need to drop any column.

### Creating dummy columns

In [None]:
#function for creating dummies
def create_dummies(df,column_name):
    dummies=pd.get_dummies(df[column_name],prefix=column_name)
    df=pd.concat([df,dummies],axis=1)
    return df

In [None]:
#Using function on dataframe and columns

df12=create_dummies(df11,"old")
df13=df12.drop(["old"],axis=1)

df14=create_dummies(df13,"blood_pressure_stat")
df15=df14.drop(["blood_pressure_stat"],axis=1)

df16=create_dummies(df15,"chol_stat")
df17=df16.drop(["chol_stat"],axis=1)

df18=create_dummies(df17,"heart_rate_stat")
df19=df18.drop(["heart_rate_stat"],axis=1)

df20=create_dummies(df19,"depression_stat")
df21=df20.drop(["depression_stat"],axis=1)

df22=create_dummies(df21,"slope")
df23=df22.drop(["slope"],axis=1)

df24=create_dummies(df23,"ca")
df25=df24.drop(["ca"],axis=1)

df26=create_dummies(df25,"thal")
df27=df26.drop(["thal"],axis=1)



In [None]:
df27.head()
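The create-then-drop chain above can be written more compactly. As a minimal sketch, `pd.get_dummies` accepts a `columns` list, encodes each listed column with its name as the prefix, and removes the original columns in the same call. The `toy` frame below is made up for illustration and only mimics two of the categorized columns.

```python
import pandas as pd

# Made-up frame with two categorical columns and the target
toy = pd.DataFrame({
    "old": ["Young", "Old", "Mid_age"],
    "chol_stat": ["Low_chol", "High_chol", "mid_chol"],
    "target": [0, 1, 1],
})

# One call: dummy columns are added, the listed originals are dropped
encoded = pd.get_dummies(toy, columns=["old", "chol_stat"])
```

This avoids the long sequence of intermediate dataframes, at the cost of encoding all listed columns in one step.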

### Splitting the dataframe into train and test sets

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

In [None]:
# Defining features (X) and target column(y)
X=df27.drop("target",axis=1)
y=df27["target"]

In [None]:
from sklearn.model_selection import train_test_split
train_X,test_X,train_y,test_y=train_test_split(X,y,train_size=0.7,random_state=1)

# Predictive models

### K Neighbors classifier model

In [None]:
# Importing KNeighborsClassifier model
from sklearn.neighbors import KNeighborsClassifier

In [None]:
# Defining model
kn=KNeighborsClassifier()

In [None]:
#Fitting model
kn.fit(train_X, train_y)

In [None]:
#Predicting values on test set
predictions_kn=kn.predict(test_X)

### K Neighbors classifier accuracy

In [None]:
#KNeighborsClassifier accuracy with accuracy_score
accuracy_kn=accuracy_score(test_y, predictions_kn)
accuracy_kn

In [None]:
#KNeighborsClassifier accuracy with cross_val_score
accuracy_cross_val_kn=cross_val_score(kn, X, y, cv=10)
accuracy_cross_val_kn


In [None]:
#calculating cross_val_score mean
accuracy_cross_val_knc=np.mean(accuracy_cross_val_kn)
accuracy_cross_val_knc

### Logistic Regression model

In [None]:
# Importing LogisticRegression model
from sklearn.linear_model import LogisticRegression

In [None]:
# Defining model
lr=LogisticRegression()

In [None]:
# fitting the model
lr.fit(train_X, train_y)

In [None]:
# predicting values on test set
predictions_lr=lr.predict(test_X)

In [None]:
# predicting values on training set
predictions_l1r=lr.predict(train_X)

### Logistic Regression accuracy

In [None]:
# calculating accuracy with accuracy_score()
accuracy_lr=accuracy_score(test_y, predictions_lr)
accuracy_lr

In [None]:
accuracy_lr1=accuracy_score(train_y, predictions_l1r)
accuracy_lr1

In [None]:
# calculating accuracy result with cross_val_score()
accuracy_cross_val_lr=cross_val_score(lr, X, y, cv=10)
accuracy_cross_val_lr

In [None]:
#calculating cross_val_score mean
accuracy_cross_val_lr=np.mean(accuracy_cross_val_lr)
accuracy_cross_val_lr

## Finding log likelihood of the model

In [None]:
from sklearn.metrics import log_loss
# log_loss expects predicted probabilities, not hard 0/1 labels
proba_lr=lr.predict_proba(test_X)[:,1]
log_likelihood = -log_loss(test_y, proba_lr)*len(test_y)
log_likelihood
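For a binary classifier, the log-likelihood is the sum over samples of `y*log(p) + (1-y)*log(1-p)`, where `p` is the predicted probability of class 1. The sketch below computes it by hand with NumPy; the helper name `bernoulli_log_likelihood` and the demo values are assumptions for illustration only.

```python
import numpy as np

def bernoulli_log_likelihood(y_true, p_pos, eps=1e-15):
    """Sum of y*log(p) + (1-y)*log(1-p), with p clipped away from 0 and 1."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pos, dtype=float), eps, 1 - eps)
    return float(np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Made-up labels and predicted probabilities of class 1
y_demo = [1, 0, 1, 1]
p_demo = [0.9, 0.2, 0.6, 0.5]
ll = bernoulli_log_likelihood(y_demo, p_demo)

# Hard 0/1 predictions that are wrong get clipped and dominate the sum
ll_hard = bernoulli_log_likelihood([1, 0], [0, 1])
```

The second call illustrates why probabilities rather than hard labels are the natural input: every misclassified hard label contributes a very large negative term after clipping.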

###  Random Forest classifier model

In [None]:
#Importing RandomForestClassifier model
from sklearn.ensemble import RandomForestClassifier

In [None]:
#Defining model
rf=RandomForestClassifier(n_estimators=5, random_state=1, min_samples_leaf=2)

In [None]:
# fitting the model
rf.fit(train_X, train_y)

In [None]:
# predicting values on test set
predictions_rf=rf.predict(test_X)

###  Random Forest classifier accuracy


In [None]:
# calculating accuracy with accuracy_score()
accuracy_rf=accuracy_score(test_y, predictions_rf)
accuracy_rf

In [None]:
# calculating accuracy result with cross_val_score()
accuracy_cross_val_rf=cross_val_score(rf, X, y, cv=10)
accuracy_cross_val_rf

In [None]:
#calculating cross_val_score mean
accuracy_cross_val_rf=np.mean(accuracy_cross_val_rf)
accuracy_cross_val_rf

# Nested model of the above model (by feature selection)

In [None]:
df=pd.read_csv("C:\\Users\\supri\\Downloads\\Project_heart_disease-main\\Project_heart_disease-main\\heart.csv")
df.head()

In [None]:
corr_matrix = df.corr()
f, ax = plt.subplots(figsize=(13, 10))
sns.heatmap(corr_matrix, annot = True);

After inspecting the correlation heatmap, we choose a threshold of 0.2 on the absolute correlation with the target.

In [None]:
threshold=0.2
# Features whose absolute correlation with the target exceeds the threshold
a=abs(corr_matrix['target'])
result=a[a>threshold]
result

In [None]:
# Features whose absolute correlation with the target is below the threshold
a=abs(corr_matrix['target'])
result=a[a<threshold]
result
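The two cells above can be folded into one reusable step. The following is a minimal sketch of correlation-based feature selection; the helper name `select_by_correlation` and the `toy` frame are hypothetical and exist only to illustrate the idea.

```python
import pandas as pd

def select_by_correlation(frame, target, threshold=0.2):
    """Names of numeric features with |corr(feature, target)| > threshold."""
    corr = frame.select_dtypes("number").corr()[target].drop(target).abs()
    return corr[corr > threshold].index.tolist()

# Made-up data: "a" tracks the target, "b" is uncorrelated with it
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [1, 2, 3, 1, 2, 3],
    "target": [0, 0, 0, 1, 1, 1],
})
selected = select_by_correlation(toy, "target")
```

Note that this filter only looks at linear, one-feature-at-a-time association with the target, so it is a rough screen rather than a full feature-selection procedure.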

In [None]:
df1=df.drop(["trestbps","chol","fbs","restecg"],axis=1)
df1.head()

In [None]:
corr_matrix = df1.corr()
f, ax = plt.subplots(figsize=(13, 10))
sns.heatmap(corr_matrix, annot = True);

In [None]:
x = df1.drop('target', axis=1)
Y = df1["target"]

In [None]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score 
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, auc, log_loss
# 70% of the rows for training, as in the full model above
x_train, x_test, Y_train, Y_test = train_test_split(x, Y, train_size=0.7, random_state=2)

In [None]:
from sklearn.linear_model import LogisticRegression
regressor=LogisticRegression()
regressor.fit(x_train,Y_train)
Y_pred=regressor.predict(x_test)
Y_pred

In [None]:
import sklearn
from sklearn.metrics import accuracy_score
sklearn.metrics.accuracy_score(Y_test,Y_pred)

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, Y_pred)

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()

## Calculating log likelihood

In [None]:
# log_loss expects predicted probabilities, not hard 0/1 labels
log_likelihood = -log_loss(Y_test,regressor.predict_proba(x_test)[:,1])*len(Y_test)
log_likelihood

In [None]:
TN=cm[0,0]
TP=cm[1,1]
FN=cm[1,0]
FP=cm[0,1]

In [None]:
Accuracy = (TP+TN)/float(TP+TN+FP+FN)
Sensitivity = TP/float(TP+FN)
Specificity = TN/float(TN+FP)
Positive_Value = TP/float(TP+FP)
print('The accuracy of the model = ', Accuracy)
print('True Positive Rate = ', Sensitivity)
print('True Negative Rate = ', Specificity)
print('Positive Predictive Value = ', Positive_Value)
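Since these rates are computed from a confusion matrix more than once in this notebook, a small helper may be convenient. This is a sketch under the assumption that the matrix follows scikit-learn's layout (rows are actual classes, columns are predicted classes, class 0 first); the name `binary_rates` and the demo matrix are made up.

```python
import numpy as np

def binary_rates(cm):
    """Accuracy, sensitivity, specificity and PPV from a 2x2 confusion matrix."""
    # Row-major order of [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = np.asarray(cm).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
    }

# Made-up confusion matrix, for illustration only
cm_demo = [[50, 10], [5, 35]]
rates = binary_rates(cm_demo)
```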

In [None]:
# Predicted probabilities of 0 and 1 for the test data 
prob=regressor.predict_proba(x_test)
prob_data=pd.DataFrame(data=prob)
prob_data

In [None]:
# Predicted probabilities of 0 and 1 for the training data
probability=regressor.predict_proba(x_train)
prob_data=pd.DataFrame(data=probability)
prob_data

In [None]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(Y_test, prob[:,1])
plt.plot(fpr,tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title('ROC curve')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate ')
plt.grid(True)

In [None]:
sklearn.metrics.roc_auc_score(Y_test,prob[:,1])
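The area under the ROC curve has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it that way with NumPy; the helper name `auc_by_pairs` and the demo values are assumptions for illustration.

```python
import numpy as np

def auc_by_pairs(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y = np.asarray(y_true)
    s = np.asarray(scores, dtype=float)
    pos, neg = s[y == 1], s[y == 0]
    # Score difference for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Made-up labels and scores: 3 of the 4 pairs are ranked correctly
auc_demo = auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of samples, so it is meant for understanding the metric rather than for large datasets.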

In [None]:
from sklearn.metrics import precision_score
Precision = precision_score(Y_test, Y_pred)
Precision

In [None]:
from sklearn.metrics import recall_score
Recall = recall_score(Y_test, Y_pred)
Recall

In [None]:
from sklearn.metrics import f1_score
# Store the result under a different name so the f1_score function is not overwritten
F1 = f1_score(Y_test, Y_pred)
F1

In [None]:
Y_pred

## Calculating log likelihood for the full model

In [None]:
from sklearn.metrics import log_loss
import numpy as np

true_y = test_y
# log_loss expects predicted probabilities, not hard 0/1 labels
proba_full = lr.predict_proba(test_X)[:,1]

In [None]:
log_likelihood = -log_loss(true_y,proba_full)*len(true_y)
log_likelihood

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(test_y, predictions_lr)

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(cm)
ax.grid(False)
ax.xaxis.set(ticks=(0, 1), ticklabels=('Predicted 0s', 'Predicted 1s'))
ax.yaxis.set(ticks=(0, 1), ticklabels=('Actual 0s', 'Actual 1s'))
ax.set_ylim(1.5, -0.5)
for i in range(2):
    for j in range(2):
        ax.text(j, i, cm[i, j], ha='center', va='center', color='red')
plt.show()


In [None]:
Precision_Score = precision_score(test_y, predictions_lr)
Precision_Score

In [None]:
Recall_Score = recall_score(test_y, predictions_lr)
Recall_Score

In [None]:
from sklearn.metrics import f1_score
f1 = f1_score(test_y, predictions_lr)
f1

In [None]:
TN=cm[0,0]
TP=cm[1,1]
FN=cm[1,0]
FP=cm[0,1]

In [None]:
Accuracy = (TP+TN)/float(TP+TN+FP+FN)
Sensitivity = TP/float(TP+FN)
Specificity = TN/float(TN+FP)
Positive_Value = TP/float(TP+FP)
print('The accuracy of the model = ', Accuracy)
print('True Positive Rate = ', Sensitivity)
print('True Negative Rate = ', Specificity)
print('Positive Predictive Value = ', Positive_Value)