Predicting Heart Disease using Machine Learning

Problem Statement:
Given various clinical parameters of a patient, we have to predict whether the patient has heart disease or not.

age - age (in years),

sex - (1 = male; 0 = female),

cp - chest pain type
0: Typical angina: chest pain related to decreased blood supply to the heart
1: Atypical angina: chest pain not related to the heart
2: Non-anginal pain: typically esophageal spasms (non-heart related)
3: Asymptomatic: chest pain not showing signs of disease

trestbps - resting blood pressure (in mm Hg on admission to the hospital); anything above 130-140 is typically cause for concern,

chol - serum cholesterol in mg/dl
serum = LDL + HDL + .2 * triglycerides
above 200 is cause for concern

fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
'>126' mg/dL signals diabetes

restecg - resting electrocardiographic results
0: Nothing to note
1: ST-T Wave abnormality
can range from mild symptoms to severe problems
signals non-normal heart beat
2: Possible or definite left ventricular hypertrophy
Enlarged heart's main pumping chamber

thalach - maximum heart rate achieved

exang - exercise induced angina (1 = yes; 0 = no)

oldpeak - ST depression induced by exercise relative to rest; looks at the stress of the heart during exercise (an unhealthy heart will stress more)

slope - the slope of the peak exercise ST segment
0: Upsloping: better heart rate with exercise (uncommon)
1: Flatsloping: minimal change (typical healthy heart)
2: Downsloping: signs of an unhealthy heart

ca - number of major vessels (0-3) colored by fluoroscopy
colored vessel means the doctor can see the blood passing through
the more blood movement the better (no clots)

thal - thallium stress result
1,3: normal,
6: fixed defect: used to be defective but ok now,
7: reversible defect: no proper blood movement when exercising

target - have disease or not (1=yes, 0=no) (= the predicted attribute)
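
The chol entry above gives a formula for serum cholesterol. As a quick worked example, the sketch below applies that formula to made-up lab values. The function name `serum_cholesterol` and the input numbers are illustrative assumptions, not values taken from the dataset.

```python
def serum_cholesterol(ldl, hdl, triglycerides):
    """Total serum cholesterol (mg/dl) as LDL + HDL + 0.2 * triglycerides."""
    return ldl + hdl + 0.2 * triglycerides

# Hypothetical lab values in mg/dl (not taken from heart.csv)
example_total = serum_cholesterol(ldl=130, hdl=50, triglycerides=150)

# The data dictionary flags totals above 200 mg/dl as a cause for concern
is_concerning = example_total > 200
print(round(example_total, 1), is_concerning)
```

The 200 mg/dl threshold is the one quoted in the data dictionary; it is used here only to show how the flag would be derived.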


#https://www.kaggle.com/code/faressayah/predicting-heart-disease-using-machine-learning/notebook

In [152]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

In [153]:
import hvplot.pandas
#Machine learning libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score, confusion_matrix, precision_score, f1_score, classification_report
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
import fastparquet as fp


In [154]:
# read csv (comma separated value) into data
data = pd.read_csv('heart.csv')
# Write the dataframe to a Parquet file
fp.write('example.parquet', data)
# Read the Parquet file into a new dataframe
data = fp.ParquetFile('example.parquet').to_pandas()

# Print the reloaded dataframe
# print(data)

# EXPLORATORY DATA ANALYSIS (EDA)

In [155]:
data.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [156]:
%matplotlib inline
sns.set_style("whitegrid")
plt.style.use("fivethirtyeight")

In [157]:
# To find out whether there is any NaN value, the length of this data, and the data types
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [158]:
data.shape

(303, 14)

The dataset does not appear to contain any missing values, as the .info() output above shows.
However, variables such as exang, slope, and sex are categorical even though they
are stored as integers. The target variable is also categorical, and we have to predict the category of each patient.
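
A minimal sketch of the two checks described above: counting missing values and marking integer-coded columns as categorical. The small frame `sample` is made up for illustration and only mimics a few heart.csv columns; the choice of `category` dtype is one possible approach, not something the notebook itself does.

```python
import pandas as pd

# Tiny made-up frame that mimics a few heart.csv columns (values are illustrative)
sample = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "sex": [1, 1, 0, 1],
    "exang": [0, 0, 0, 1],
    "target": [1, 1, 1, 0],
})

# Count missing values per column; all zeros means nothing is missing
missing_per_column = sample.isna().sum()

# Mark the integer-coded columns as categorical so pandas treats them as labels
categorical_cols = ["sex", "exang", "target"]
typed = sample.astype({col: "category" for col in categorical_cols})

print(missing_per_column)
print(typed.dtypes)
```

After the conversion, `describe()` and plotting helpers treat these columns as labels rather than as quantities with a meaningful mean.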

In [159]:
data.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


In [160]:
data['target'].value_counts()

1    165
0    138
Name: target, dtype: int64

In [161]:
# Labeling the data fields to text values for convenience
data_eda = data.copy()

#sex: 0 female, 1 male
data_eda["sex"] = data["sex"].map({1: "Male", 0: "Female"})

#Chest pain type (0: Typical angina, 1: Atypical angina, 2: Non-anginal pain, 3: Asymptomatic)
data_eda["chest_pain_type"] = data["cp"].map({0:"Typical angina", 1: "Atypical angina", 2: "Non-anginal pain", 3: "Asymptomatic"})

#exercise induced angina (1 = yes; 0 = no)
data_eda["exercise_induced_angina"] = data["exang"].map({1:"Yes", 0: "No"})

#slope (0: Upsloping, 1: Flatsloping, 2: Downsloping)
data_eda["slope"] = data["slope"].map({0: "Upsloping", 1: "Flatsloping", 2: "Downsloping"})

#fasting blood sugar (0: No, 1: Yes)
data_eda["fasting_blood_sugar"] = data["fbs"].map({0: "No", 1: "Yes"})

#target (have disease or not) (1=yes, 0=no) (= the predicted attribute)
#data_eda["target"] = data["target"].map({1: "Yes", 0: "No"})

In [162]:
#Cohort analysis of age with output

def age_cohort(age):
    if age >= 0 and age <= 20:
        return "0-20"
    elif age > 20 and age <= 40:
        return "20-40"
    elif age > 40 and age <= 50:
        return "40-50"
    elif age > 50 and age <= 60:
        return "50-60"
    elif age > 60:
        return "60+"
    
data_eda['age_group'] = data_eda['age'].apply(age_cohort)
data_eda.sort_values('age_group', inplace = True)

In [163]:
age_group = ['0-20', '20-40', '40-50', '50-60', '60+']
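
The same cohorts can also be produced in one vectorized step. The sketch below uses `pd.cut` with bin edges chosen to mirror the `age_cohort` function above; the sample ages are illustrative and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

ages = pd.Series([29, 35, 40, 41, 50, 55, 60, 61, 77])  # illustrative ages

# Bin edges mirror age_cohort: [0, 20], (20, 40], (40, 50], (50, 60], (60, inf)
bins = [0, 20, 40, 50, 60, np.inf]
labels = ["0-20", "20-40", "40-50", "50-60", "60+"]

# right=True makes each bin include its upper edge; include_lowest keeps age 0
age_group_cut = pd.cut(ages, bins=bins, labels=labels, right=True, include_lowest=True)
print(age_group_cut.tolist())
```

`pd.cut` returns an ordered categorical, so sorting or plotting by the result follows the label order rather than alphabetical order.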

In [164]:
import plotly.graph_objects as go
from aquarel import load_theme

theme = load_theme("boxy_dark")
#https://github.com/lgienapp/aquarel

theme.apply()

  
theme.apply_transforms()

In [165]:

def pie_graph(df, title, column):
    # Count each category once and reuse the result for labels and values
    counts = df[column].value_counts()

    fig = go.Figure(data = [
        go.Pie(
        labels = counts.index,
        values = counts.values,
        hole = .5)  # hole > 0 draws the pie as a donut chart
    ])

    fig.update_layout(title_text = title)
    fig.show()

In [166]:
data_eda.drop(['cp','exang', 'fbs'] , axis = 1, inplace = True) 

In [167]:
data_eda.head()

Unnamed: 0,age,sex,trestbps,chol,restecg,thalach,oldpeak,slope,ca,thal,target,chest_pain_type,exercise_induced_angina,fasting_blood_sugar,age_group
65,35,Female,138,183,1,182,1.4,Downsloping,0,2,1,Typical angina,No,No,20-40
212,39,Male,118,219,1,140,1.2,Flatsloping,0,3,0,Typical angina,No,No,20-40
259,38,Male,120,231,1,182,3.8,Flatsloping,0,3,0,Asymptomatic,Yes,No,20-40
175,40,Male,110,167,0,114,2.0,Flatsloping,0,3,0,Typical angina,Yes,No,20-40
115,37,Female,120,215,1,170,0.0,Downsloping,0,2,1,Non-anginal pain,No,No,20-40


In [168]:
data_eda.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 303 entries, 65 to 151
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   age                      303 non-null    int64  
 1   sex                      303 non-null    object 
 2   trestbps                 303 non-null    int64  
 3   chol                     303 non-null    int64  
 4   restecg                  303 non-null    int64  
 5   thalach                  303 non-null    int64  
 6   oldpeak                  303 non-null    float64
 7   slope                    303 non-null    object 
 8   ca                       303 non-null    int64  
 9   thal                     303 non-null    int64  
 10  target                   303 non-null    int64  
 11  chest_pain_type          303 non-null    object 
 12  exercise_induced_angina  303 non-null    object 
 13  fasting_blood_sugar      303 non-null    object 
 14  age_group                

In [169]:
theme = load_theme("boxy_dark")
#https://github.com/lgienapp/aquarel

theme.apply()

pie_graph(data_eda,"Gender",'sex')
theme.apply_transforms()

In [170]:
pie_graph(data_eda,"Age Group Distribution",'age_group')

In [171]:
pie_graph(data_eda,"Slope",'slope')

In [172]:
pie_graph(data_eda,"Chest pain type",'chest_pain_type')

In [173]:
pie_graph(data_eda,"Exercise induced angina",'exercise_induced_angina')

In [174]:
pie_graph(data_eda,"Fasting blood sugar",'fasting_blood_sugar')

In [175]:
pie_graph(data_eda,"Thalium stress results",'thal')

In [176]:
pie_graph(data_eda,"Number of blood vessels",'ca')

In [177]:
# Cross tabulation between Gender and presence of Heart disease 
CrosstabResult=pd.crosstab(index=data_eda['sex'],columns=data_eda['target']).apply(lambda r: r/r.sum(), axis=1)
print(CrosstabResult)
 
# importing the required function
from scipy.stats import chi2_contingency
 
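# Performing Chi-sq test
# Note: chi2_contingency expects raw counts, not the row-normalized proportions computed above
#ChiSqResult = chi2_contingency(CrosstabResult)
 
# The P-Value is the probability of a result at least this extreme if H0 is true
# If P-Value > 0.05 we fail to reject the assumption (H0)
 
#print('The P-Value of the ChiSq Test is:', ChiSqResult[1])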

target         0         1
sex                       
Female  0.250000  0.750000
Male    0.550725  0.449275
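
The chi-square test in the cell above is commented out. A minimal sketch of how it could be run on counts is shown below. The counts in `observed` are reconstructed from the row proportions and the 303-row total (96 females, 207 males), so treat them as an assumption and recompute them from the data with `pd.crosstab` before relying on the result.

```python
from scipy.stats import chi2_contingency

# Raw counts reconstructed from the proportions above (an assumption, see lead-in)
#            target=0  target=1
observed = [[24, 72],    # Female
            [114, 93]]   # Male

# chi2_contingency expects counts, not row-normalized proportions
chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi2 =", round(chi2, 2), "p-value =", p_value, "dof =", dof)
```

A p-value below 0.05 would suggest that sex and target are not independent in this sample; it says nothing about why.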


In [178]:
pd.crosstab(index=data_eda['age_group'],columns=data_eda['target'], normalize = 'index')

target,0,1
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1
20-40,0.315789,0.684211
40-50,0.302632,0.697368
50-60,0.503876,0.496124
60+,0.556962,0.443038


The share of patients with heart disease is highest in the 20 to 40 and 40 to 50 age groups (about 68% and 70%), as we
can see above, although the absolute number of patients with heart disease is higher in the 50 to 60 age group.
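
The difference between a share and a count can be seen by printing both tables side by side. The sketch below uses a made-up miniature frame `toy`; only the `pd.crosstab` calls match what the notebook does.

```python
import pandas as pd

# Made-up miniature sample; the real notebook uses data_eda
toy = pd.DataFrame({
    "age_group": ["20-40", "20-40", "50-60", "50-60", "50-60", "50-60", "50-60", "50-60"],
    "target":    [1,       1,       1,       1,       1,       0,       0,       0],
})

# Absolute number of patients per age group and target class
counts = pd.crosstab(index=toy["age_group"], columns=toy["target"])
# Share of each target class within an age group (each row sums to 1)
shares = pd.crosstab(index=toy["age_group"], columns=toy["target"], normalize="index")

print(counts)
print(shares)
```

In this toy sample the 50-60 group has more patients with disease (3 versus 2) but a lower share (0.5 versus 1.0), which is the same pattern described above.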

In [179]:
theme = load_theme("boxy_dark")
#https://github.com/lgienapp/aquarel
theme.apply()

#Heart disease by age comparison

fig = sns.countplot(x='age_group', hue="target", data=data_eda)
fig.set_title("Heart Disease distributed by Age");

theme.apply_transforms()

The graph above shows that the 50 to 60 age group has the highest number of patients, with around 65 of them having heart disease. The count is lower for patients aged 60+, the 40 to 50 age group has around 53 patients with heart disease, and the 20 to 40 age group is
far smaller than the others.

In [180]:
fig = sns.countplot(x='sex', hue="target", data=data_eda)
fig.set_title("Heart Disease distributed by Sex");

In [181]:
pd.crosstab(index=data_eda['sex'],columns=data_eda['target'], normalize = 'index')

target,0,1
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Female,0.25,0.75
Male,0.550725,0.449275


In this dataset, a smaller share of males (about 45%) than females (75%) have heart disease.

In [182]:
pd.crosstab(index=data_eda['chest_pain_type'],columns=data_eda['target'], normalize = 'index')

target,0,1
chest_pain_type,Unnamed: 1_level_1,Unnamed: 2_level_1
Asymptomatic,0.304348,0.695652
Atypical angina,0.18,0.82
Non-anginal pain,0.206897,0.793103
Typical angina,0.727273,0.272727


In [183]:
fig = sns.countplot(x='chest_pain_type', hue="target", data=data_eda)
fig.set_title("Heart Disease distributed by Chest pain type");

From the crosstab table, atypical angina and non-anginal pain show the highest share of patients
with heart disease (82% and 79%), asymptomatic is somewhat lower (about 70%), and typical angina is
lowest (about 27%). The graph above likewise shows that typical angina is associated with less
heart disease than the other types.

In [184]:
pd.crosstab(index=data_eda['fasting_blood_sugar'],columns=data_eda['target'], normalize = 'index')

target,0,1
fasting_blood_sugar,Unnamed: 1_level_1,Unnamed: 2_level_1
No,0.449612,0.550388
Yes,0.488889,0.511111


In [185]:
fig = sns.countplot(x='fasting_blood_sugar', hue="target", data=data_eda)
fig.set_title("Heart Disease distributed by Fasting blood sugar");

The fasting blood sugar flag (fbs > 120 mg/dl) is an indicator of possible diabetes. Around 51% of patients with elevated
fasting blood sugar have heart disease, compared with 55% of patients without it, so in this dataset
the blood sugar factor does not contribute heavily to heart disease. The graph above shows the same pattern.

In [186]:
pd.crosstab(index=data_eda['exercise_induced_angina'],columns=data_eda['target'], normalize = 'index')

target,0,1
exercise_induced_angina,Unnamed: 1_level_1,Unnamed: 2_level_1
No,0.303922,0.696078
Yes,0.767677,0.232323


In [187]:
fig = sns.countplot(x='exercise_induced_angina', hue="target", data=data_eda)
fig.set_title("Heart Disease distributed by Exercise induced Angina");

Patients with exercise-induced angina are less likely to have heart disease in this dataset: about 23% of them
have heart disease and 77% do not. By contrast, about 70% of patients without exercise-induced angina have
heart disease. Note that exang records chest pain brought on by exercise, not whether a patient exercises.

In [189]:
pd.crosstab(index=data_eda['slope'],columns=data_eda['target'], normalize = 'index')

target,0,1
slope,Unnamed: 1_level_1,Unnamed: 2_level_1
Downsloping,0.246479,0.753521
Flatsloping,0.65,0.35
Upsloping,0.571429,0.428571


In [190]:
fig = sns.countplot(x='slope', hue="target", data=data_eda)
fig.set_title("Heart Disease distributed by Slope");

Downsloping indicates a bad situation for the heart (about 75% of these patients have heart disease). Upsloping is better, with approximately 43% of patients having heart disease, and flatsloping has only 35%.

In [191]:
data_eda.head()

Unnamed: 0,age,sex,trestbps,chol,restecg,thalach,oldpeak,slope,ca,thal,target,chest_pain_type,exercise_induced_angina,fasting_blood_sugar,age_group
65,35,Female,138,183,1,182,1.4,Downsloping,0,2,1,Typical angina,No,No,20-40
212,39,Male,118,219,1,140,1.2,Flatsloping,0,3,0,Typical angina,No,No,20-40
259,38,Male,120,231,1,182,3.8,Flatsloping,0,3,0,Asymptomatic,Yes,No,20-40
175,40,Male,110,167,0,114,2.0,Flatsloping,0,3,0,Typical angina,Yes,No,20-40
115,37,Female,120,215,1,170,0.0,Downsloping,0,2,1,Non-anginal pain,No,No,20-40


# Analysis of Numeric features (Regression Plot)

In [192]:
# Regression plot of cholesterol against resting blood pressure, split by target
sns.lmplot(x="trestbps", y="chol", hue="target", data=data_eda,
               markers=["o", "x"], palette="Set1");

In [193]:
sns.lmplot(x="chol", y="thalach", hue="target", data=data_eda,
               markers=["o", "x"], palette="Set1");

# Correlation Matrix (heatmap)

In [194]:
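f, ax = plt.subplots(figsize = (12,10))
# Correlate numeric columns only; text columns such as sex or age_group are skipped
sns.heatmap(data_eda.select_dtypes(include = "number").corr(),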
            annot = True,
            linecolor = 'r',
            linewidths = .5,
            fmt = '.1f',
            ax = ax);



In [195]:
#https://medium.com/swlh/regression-and-matrix-plots-in-seaborn-python-186864679534

In [196]:
data_eda.select_dtypes(include = "number").corr()['target'].sort_values(ascending = False)





target      1.000000
thalach     0.421741
restecg     0.137230
chol       -0.085239
trestbps   -0.144931
age        -0.225439
thal       -0.344029
ca         -0.391724
oldpeak    -0.430696
Name: target, dtype: float64
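
Only numeric columns appear in the output above because Pearson correlation is not defined for text labels such as "Male" or "Downsloping". The sketch below shows the same ranking step on a made-up frame; the values in `toy` are illustrative and only the column names follow the notebook.

```python
import pandas as pd

# Made-up values; column names follow the notebook
toy = pd.DataFrame({
    "thalach": [150, 187, 172, 120, 110, 125],
    "oldpeak": [0.5, 0.0, 0.2, 2.5, 3.0, 1.8],
    "sex":     ["Male", "Male", "Female", "Male", "Male", "Female"],
    "target":  [1, 1, 1, 0, 0, 0],
})

# Keep numeric columns only, then correlate each remaining feature with the target
numeric = toy.select_dtypes(include="number")
corr_with_target = numeric.corr()["target"].drop("target").sort_values(ascending=False)
print(corr_with_target)
```

A positive value means the feature tends to be higher for target = 1, a negative value means it tends to be lower; neither implies causation.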

In [197]:
object_col = ["sex", "slope" ,"chest_pain_type", "exercise_induced_angina", "fasting_blood_sugar"]
label_encoder = preprocessing.LabelEncoder()
for col in object_col:
    data_eda[col]=  label_encoder.fit_transform(data_eda[col])
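
One detail worth checking after this step is which integer each label receives. `LabelEncoder` assigns codes in sorted label order, so the codes need not match the original dataset coding. A small sketch with illustrative slope labels:

```python
from sklearn.preprocessing import LabelEncoder

slope_labels = ["Upsloping", "Flatsloping", "Downsloping", "Flatsloping"]

encoder = LabelEncoder()
codes = encoder.fit_transform(slope_labels)

# classes_ lists the labels in sorted order; a label's position is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Here "Downsloping" becomes 0 and "Upsloping" becomes 2, the reverse of the coding listed in the data dictionary, which matters when reading the encoded slope column later.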

In [198]:
data_eda = pd.get_dummies(data_eda)
data_eda.head()

Unnamed: 0,age,sex,trestbps,chol,restecg,thalach,oldpeak,slope,ca,thal,target,chest_pain_type,exercise_induced_angina,fasting_blood_sugar,age_group_20-40,age_group_40-50,age_group_50-60,age_group_60+
65,35,0,138,183,1,182,1.4,0,0,2,1,3,0,0,1,0,0,0
212,39,1,118,219,1,140,1.2,1,0,3,0,3,0,0,1,0,0,0
259,38,1,120,231,1,182,3.8,1,0,3,0,0,1,0,1,0,0,0
175,40,1,110,167,0,114,2.0,1,0,3,0,3,1,0,1,0,0,0
115,37,0,120,215,1,170,0.0,0,0,2,1,2,0,0,1,0,0,0


In [199]:
data_eda.shape

(303, 18)

# Pre Modelling steps

In [200]:
#Declare feature vector and target variable
X = data_eda.drop(columns = ['target'])
y = data_eda['target']

In [201]:
X.head()

Unnamed: 0,age,sex,trestbps,chol,restecg,thalach,oldpeak,slope,ca,thal,chest_pain_type,exercise_induced_angina,fasting_blood_sugar,age_group_20-40,age_group_40-50,age_group_50-60,age_group_60+
65,35,0,138,183,1,182,1.4,0,0,2,3,0,0,1,0,0,0
212,39,1,118,219,1,140,1.2,1,0,3,3,0,0,1,0,0,0
259,38,1,120,231,1,182,3.8,1,0,3,0,1,0,1,0,0,0
175,40,1,110,167,0,114,2.0,1,0,3,3,1,0,1,0,0,0
115,37,0,120,215,1,170,0.0,0,0,2,2,0,0,1,0,0,0


In [202]:
y.head()

65     1
212    0
259    0
175    0
115    1
Name: target, dtype: int64

In [203]:
#Split data into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size = .2,
    random_state = 777)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((242, 17), (242,), (61, 17), (61,))
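
The split above is purely random. If keeping the class ratio identical in the training and test sets is desired, `train_test_split` accepts a `stratify` argument. The sketch below demonstrates this on synthetic arrays (`X_demo`, `y_demo` are assumptions, not the heart data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 60% positive class (not the heart dataset)
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 60 + [0] * 40)

# stratify=y_demo keeps the 60/40 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=777, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```

With only 61 test rows in the real split, stratifying can make accuracy comparisons between runs a little less sensitive to how the classes happen to fall.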

In [204]:
X_train.head()

Unnamed: 0,age,sex,trestbps,chol,restecg,thalach,oldpeak,slope,ca,thal,chest_pain_type,exercise_induced_angina,fasting_blood_sugar,age_group_20-40,age_group_40-50,age_group_50-60,age_group_60+
29,53,1,130,197,0,152,1.2,2,0,2,2,0,1,0,0,1,0
158,58,1,125,220,1,144,0.4,1,4,3,1,0,0,0,0,1,0
198,62,1,120,267,1,99,1.8,1,2,3,3,1,0,0,0,0,1
134,41,0,126,306,1,163,0.0,0,0,2,1,0,0,0,1,0,0
185,44,1,112,290,0,153,0.0,0,1,2,3,0,0,0,1,0,0


In [205]:
#Feature scaling

scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
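
The scaler is fitted on the training rows only and then reused for the test rows. The sketch below reproduces that on synthetic arrays and checks the result against the formula z = (x - mean) / std computed by hand; the array values and variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic feature matrices (illustrative values only)
train_demo = np.array([[50.0, 120.0], [60.0, 140.0], [70.0, 160.0]])
test_demo = np.array([[65.0, 130.0]])

scaler_demo = StandardScaler().fit(train_demo)   # statistics come from training rows only
test_scaled = scaler_demo.transform(test_demo)   # test rows reuse those statistics

# Same result by hand: subtract the training mean, divide by the training std (ddof=0)
manual = (test_demo - train_demo.mean(axis=0)) / train_demo.std(axis=0)
print(test_scaled, manual)
```

Fitting on the training rows alone keeps information about the test set out of the preprocessing step.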

In [206]:
X_train

array([[-0.10686804,  0.67663234, -0.06892861, ..., -0.59325821,
         1.16168609, -0.56780663],
       [ 0.43874517,  0.67663234, -0.35652731, ..., -0.59325821,
         1.16168609, -0.56780663],
       [ 0.87523575,  0.67663234, -0.644126  , ..., -0.59325821,
        -0.86081775,  1.76116294],
       ...,
       [-0.65248126,  0.67663234, -0.06892861, ...,  1.68560667,
        -0.86081775, -0.56780663],
       [-0.43423597, -1.47790748, -0.644126  , ...,  1.68560667,
        -0.86081775, -0.56780663],
       [ 0.32962253,  0.67663234,  1.94426225, ..., -0.59325821,
         1.16168609, -0.56780663]])

In [207]:
type(X_test)

numpy.ndarray

# Applying Machine Learning models to classify

In [208]:
#Random forest model and evaluate
clf_rf = RandomForestClassifier(random_state=777)
clf_rf = clf_rf.fit(X_train,y_train)
y_pred_rf = clf_rf.predict(X_test)
acc = accuracy_score(y_test, y_pred_rf)
print('Testing-set Accuracy score is:', acc)
print('Training-set Accuracy score is:',accuracy_score(y_train,clf_rf.predict(X_train)))
report = classification_report(y_test, y_pred_rf)
print(report)
cm = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm, annot = True, fmt = "d");

Testing-set Accuracy score is: 0.819672131147541
Training-set Accuracy score is: 1.0
              precision    recall  f1-score   support

           0       0.79      0.81      0.80        27
           1       0.85      0.82      0.84        34

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61
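
The figures in the report follow directly from the confusion matrix. The sketch below recomputes the class 1 metrics from counts that are consistent with the report above; the counts are reconstructed from the printed precision, recall, and support values, so treat them as an assumption and read the actual values off the heatmap.

```python
# Confusion-matrix counts reconstructed from the report (an assumption, see lead-in)
true_neg, false_pos = 22, 5    # true class 0: predicted 0, predicted 1
false_neg, true_pos = 6, 28    # true class 1: predicted 0, predicted 1

precision_1 = true_pos / (true_pos + false_pos)   # of predicted positives, how many are right
recall_1 = true_pos / (true_pos + false_neg)      # of actual positives, how many are found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)
accuracy = (true_pos + true_neg) / (true_pos + true_neg + false_pos + false_neg)

print(round(precision_1, 2), round(recall_1, 2), round(f1_1, 2), round(accuracy, 2))
```

The training accuracy of 1.0 against a test accuracy of about 0.82 suggests the forest memorizes the training set, which is common for unconstrained trees.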



In [209]:
#AdaBoost classifier and evaluate
abc = AdaBoostClassifier(n_estimators = 50,
                         learning_rate = 1, 
                         random_state = 777)
abc.fit(X_train,y_train)
y_pred_abc = abc.predict(X_test)
acc = accuracy_score(y_test, y_pred_abc)
print('AdaBoost Classifier Model Accuracy is:',acc)
cm = confusion_matrix(y_test, y_pred_abc)
sns.heatmap(cm, annot = True, fmt="d");

AdaBoost Classifier Model Accuracy is: 0.7868852459016393


In [210]:
#Gradient Boost classifier and evaluate
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)
acc = accuracy_score(y_test, gb_pred)
print("Gradient Boosting Classifier Model Accuracy score is:", acc)
cm = confusion_matrix(y_test, gb_pred)
sns.heatmap(cm, annot = True, fmt="d");

Gradient Boosting Classifier Model Accuracy score is: 0.7868852459016393


In [211]:
#KNN classifier and evaluate
knn = KNeighborsClassifier(n_neighbors = 8)
knn.fit(X_train, y_train)
knn_pred = knn.predict(X_test)
acc = knn.score(X_test, y_test)
print("KNN Model Accuracy is:", acc)
report = classification_report(y_test, knn_pred)
print(report)
cm = confusion_matrix(y_test, knn_pred)
sns.heatmap(cm, annot = True, fmt="d");

KNN Model Accuracy is: 0.8852459016393442
              precision    recall  f1-score   support

           0       0.81      0.96      0.88        27
           1       0.97      0.82      0.89        34

    accuracy                           0.89        61
   macro avg       0.89      0.89      0.89        61
weighted avg       0.90      0.89      0.89        61
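
The cell above fixes n_neighbors at 8. One way to choose that value without touching the test set is cross-validation on the training data. The sketch below does this with `GridSearchCV` on synthetic data from `make_classification`; the candidate list and the data are assumptions, and the notebook itself does not tune k this way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data (not the heart dataset)
X_syn, y_syn = make_classification(n_samples=240, n_features=10, random_state=777)

# Try several neighbor counts with 5-fold cross-validation on the training data
candidate_k = [3, 5, 8, 11, 15]
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": candidate_k},
    cv=5,
    scoring="accuracy",
)
search.fit(X_syn, y_syn)
print(search.best_params_, round(search.best_score_, 3))
```

On the real data, `X_train` and `y_train` would take the place of the synthetic arrays, and the test set would be used once at the end.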



In [212]:
#SVC model classifier and evaluate

svc = SVC(random_state = 777)
svc.fit(X_train, y_train)
svc_pred = svc.predict(X_test)
acc = svc.score(X_test, y_test)
print("SVC Accuracy score is:", acc)
cm = confusion_matrix(y_test, svc_pred)
sns.heatmap(cm, annot = True, fmt = "d");

SVC Accuracy score is: 0.8524590163934426



In [213]:
#Logistic regression classifier and evaluation
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_pred = lr.predict(X_test)
acc = lr.score(X_test, y_test)
print("LogisticRegression accuracy score is:",acc)
report = classification_report(y_test, lr_pred)
print(report)
cm = confusion_matrix(y_test, lr_pred)
sns.heatmap(cm, annot = True, fmt = "d");

LogisticRegression accuracy score is: 0.8688524590163934
              precision    recall  f1-score   support

           0       0.85      0.85      0.85        27
           1       0.88      0.88      0.88        34

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61
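
The precision, recall, and F1 values in the report follow directly from the confusion-matrix counts. The sketch below recomputes them in plain Python. The counts in `cm` are an assumption: they are back-calculated from the rounded scores and the support column (27 and 34), not read from the heatmap, and `per_class_metrics` is a hypothetical helper name.

```python
# Confusion-matrix counts inferred from the report above (an assumption:
# back-calculated from the rounded scores, not read from the heatmap).
# Rows are true classes, columns are predicted classes.
cm = [[23, 4],
      [4, 30]]


def per_class_metrics(cm, cls):
    """Return (precision, recall, f1, support) for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: everything predicted as cls
    support = sum(cm[cls])                   # row sum: everything that truly is cls
    precision = tp / predicted
    recall = tp / support
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, support


total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(cm))) / total

for cls in range(len(cm)):
    p, r, f1, n = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={n}")
print(f"accuracy={accuracy:.4f} ({total} samples)")
```

With these counts, 53 of the 61 test samples are classified correctly, which matches the reported accuracy of 0.8688524590163934.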



In [214]:
#Decision Tree classifier and evaluation

dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
dt_pred = dt.predict(X_test)
acc = accuracy_score(y_test, dt_pred)
print("Decision Tree accuracy score is :",acc)
cm = confusion_matrix(y_test, dt_pred)
sns.heatmap(cm, annot = True, fmt = "d");

Decision Tree accuracy score is : 0.7868852459016393


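#Build Voting classifier and evaluate the model
We will combine three of the models trained above. Their test accuracy values are:

- LogisticRegression accuracy score: 0.8688
- KNN model accuracy: 0.8852
- Random Forest Classifier model accuracy: 0.8196

Note that the SVC (0.8525) scored higher than the Random Forest on this split, so these are not strictly the top three models by accuracy.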



In [215]:
# Base estimators for the ensemble
clf1 = LogisticRegression()
clf2 = KNeighborsClassifier(n_neighbors=8)
clf3 = RandomForestClassifier(random_state=777)

# Soft voting averages the predicted class probabilities of the three models
eclf1 = VotingClassifier(estimators=[('lr', clf1), ('knn', clf2), ('rfc', clf3)], voting='soft')
eclf1.fit(X_train, y_train)
predictions = eclf1.predict(X_test)

# Evaluate on the held-out test set
print("Voting Classifier Accuracy Score is: ")
print(accuracy_score(y_test, predictions))
cm = confusion_matrix(y_test, predictions)
sns.heatmap(cm, annot=True, fmt="d");

Voting Classifier Accuracy Score is: 
0.9016393442622951
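
The cell above uses `voting='soft'`, which picks the class with the highest average predicted probability across the base models, rather than counting one vote per model. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation. The probabilities are made up for illustration, and `soft_vote` and `hard_vote` are hypothetical helper names.

```python
# Hypothetical class probabilities [P(no disease), P(disease)] from three
# classifiers for two patients. These numbers are made up for illustration.
probas = {
    "lr":  [[0.40, 0.60], [0.55, 0.45]],
    "knn": [[0.25, 0.75], [0.52, 0.48]],
    "rfc": [[0.70, 0.30], [0.30, 0.70]],
}


def soft_vote(probas_by_model):
    """Average the per-class probabilities across models, then take the argmax."""
    models = list(probas_by_model.values())
    n_samples = len(models[0])
    n_classes = len(models[0][0])
    labels = []
    for i in range(n_samples):
        avg = [sum(m[i][c] for m in models) / len(models) for c in range(n_classes)]
        labels.append(max(range(n_classes), key=lambda c: avg[c]))
    return labels


def hard_vote(probas_by_model):
    """Each model casts one vote for its most probable class; the majority wins."""
    models = list(probas_by_model.values())
    n_samples = len(models[0])
    n_classes = len(models[0][0])
    labels = []
    for i in range(n_samples):
        votes = [max(range(n_classes), key=lambda c: m[i][c]) for m in models]
        labels.append(max(range(n_classes), key=votes.count))
    return labels


soft_labels = soft_vote(probas)
hard_labels = hard_vote(probas)
print("soft voting:", soft_labels)
print("hard voting:", hard_labels)
```

For the second patient the two schemes disagree: two models lean slightly towards class 0, but the third is confident in class 1, so the averaged probability favours class 1 while the majority vote picks class 0.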


The Voting Classifier, which combines Logistic Regression, K-Nearest Neighbors, and Random Forest, reaches an accuracy of about 0.90 (90%) on the test set, the best among all the classifiers evaluated.
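
To make the comparison concrete, the sketch below ranks the accuracy values reported in this notebook and converts each one back into a count of correct predictions. It assumes a test set of 61 samples, as shown in the support column of the Logistic Regression report. The names `accuracies` and `N_TEST` are introduced here for illustration only.

```python
# Test accuracies as reported in this notebook (KNN and Random Forest are
# quoted to four decimals in the text above).
accuracies = {
    "Voting Classifier": 0.9016393442622951,
    "KNN": 0.8852,
    "Logistic Regression": 0.8688524590163934,
    "SVC": 0.8524590163934426,
    "Random Forest": 0.8196,
    "Decision Tree": 0.7868852459016393,
}

N_TEST = 61  # test-set size, taken from the "support" column of the report

# Sort from best to worst accuracy
ranked = sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    correct = round(acc * N_TEST)  # accuracy = correct / N_TEST
    print(f"{name:<20} {acc:.4f}  ({correct}/{N_TEST} correct)")

best_name, best_acc = ranked[0]
```

On this split the Voting Classifier gets 55 of 61 test samples right, against 54 of 61 for KNN, so its lead over the best single model is one sample.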