The Data

We shall be using the Heart Disease Cleveland UCI Dataset.

The columns are:

1. age-   age in years
2. sex-   sex (1: male; 0: female)
3. cp-   chest pain type (0: typical angina; 1: atypical angina; 2: non-anginal pain; 3: asymptomatic)
4. trestbps-   resting blood pressure (in mm Hg)
5. chol-   serum cholesterol in mg/dl
6. fbs-   fasting blood sugar > 120 mg/dl (1: true; 0: false)
7. restecg-   resting electrocardiographic results (0: normal; 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV); 2: showing probable or definite left ventricular hypertrophy by Estes' criteria)

8. thalach-   maximum heart rate achieved
9. exang-   exercise induced angina (1: yes; 0: no)
10. oldpeak-   ST depression induced by exercise relative to rest
11. slope-   the slope of the peak exercise ST segment (0: upsloping; 1: flat; 2: downsloping)
12. ca-   number of major vessels (0-3) colored by fluoroscopy
13. thal-   thal (0: normal; 1: fixed defect; 2: reversible defect)
14. target-   disease condition (0: No disease; 1: disease)
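
The integer codes above are easy to misread in tables and plots, so it can help to decode them into labels. The following is a minimal sketch for the cp column only; the dictionary name CP_LABELS is made up for illustration, and the five sample values are the cp entries from the df.head() output shown further down.

```python
import pandas as pd

# Hypothetical lookup table built from the cp codes in the column list.
CP_LABELS = {
    0: "typical angina",
    1: "atypical angina",
    2: "non-anginal pain",
    3: "asymptomatic",
}

# cp values of the first five rows, copied from the df.head() output.
cp = pd.Series([3, 2, 1, 1, 0], name="cp")

# Series.map replaces each integer code with its label.
cp_label = cp.map(CP_LABELS)
print(cp_label.tolist())
```

On the full data the same call would be df["cp"].map(CP_LABELS). The model itself still needs the numeric codes, so a decoded column like this is only for inspection.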

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df = pd.read_csv('heart.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
age         303 non-null int64
sex         303 non-null int64
cp          303 non-null int64
trestbps    303 non-null int64
chol        303 non-null int64
fbs         303 non-null int64
restecg     303 non-null int64
thalach     303 non-null int64
exang       303 non-null int64
oldpeak     303 non-null float64
slope       303 non-null int64
ca          303 non-null int64
thal        303 non-null int64
target      303 non-null int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


In [4]:
# All 14 columns are numeric with no missing values, so they can be used as they are

In [5]:
#value normalization

In [6]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [7]:
# Rescale age by dividing by the column maximum, so the largest value becomes 1
# (the smallest value is not mapped to 0)

max_age = df["age"].max()
print("Maximum value of age=", max_age)
df["age"] = df["age"] / max_age

Maximum value of age= 77


In [8]:
# Rescale resting blood pressure by dividing by the column maximum, so the largest value becomes 1

max_trestbps = df["trestbps"].max()
print("Maximum value of resting blood pressure=", max_trestbps)
df["trestbps"] = df["trestbps"] / max_trestbps

Maximum value of resting blood pressure= 200


In [9]:
# Rescale cholesterol by dividing by the column maximum, so the largest value becomes 1

max_chol = df["chol"].max()
print("Maximum value of cholesterol in mg/dl=", max_chol)
df["chol"] = df["chol"] / max_chol

Maximum value of cholesterol in mg/dl= 564


In [10]:
# Rescale maximum heart rate by dividing by the column maximum, so the largest value becomes 1

max_thalach = df["thalach"].max()
print("Maximum value of maximum heart rate achieved=", max_thalach)
df["thalach"] = df["thalach"] / max_thalach

Maximum value of maximum heart rate achieved= 202


In [11]:
#data after values have been normalized
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,0.818182,1,3,0.725,0.413121,1,0,0.742574,0,2.3,0,0,1,1
1,0.480519,1,2,0.65,0.443262,0,1,0.925743,0,3.5,0,0,2,1
2,0.532468,0,1,0.65,0.361702,0,0,0.851485,0,1.4,2,0,2,1
3,0.727273,1,1,0.6,0.41844,0,1,0.881188,0,0.8,2,0,2,1
4,0.74026,0,0,0.6,0.62766,0,1,0.806931,1,0.6,2,0,2,1
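
The four scaling cells repeat the same divide-by-maximum step, which can also be written as a single operation. The sketch below is one possible way to do it, not the notebook's own code: it rebuilds the first five raw rows from the earlier df.head() output and divides them by the four maxima printed above (77, 200, 564, 202). The names sample and column_max are made up for illustration.

```python
import pandas as pd

# Raw values of the four scaled columns for the first five rows,
# copied from the df.head() output before normalization.
sample = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],
    "trestbps": [145, 130, 130, 120, 120],
    "chol": [233, 250, 204, 236, 354],
    "thalach": [150, 187, 172, 178, 163],
})

# Column maxima over the full dataset, as printed by the scaling cells.
column_max = pd.Series({"age": 77, "trestbps": 200, "chol": 564, "thalach": 202})

# Dividing a DataFrame by a Series aligns the Series index with the column names,
# so each column is divided by its own maximum.
scaled = sample / column_max
print(scaled.round(6))
```

The rounded result matches the age, trestbps, chol and thalach values in the normalized table. On the full DataFrame the equivalent would be cols = ["age", "trestbps", "chol", "thalach"] followed by df[cols] = df[cols] / df[cols].max(). Note that oldpeak is left unscaled in this notebook.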


In [12]:
# Features X: every column except the target

X = df.drop(columns=["target"])

In [13]:
# Labels y: the target column

y = df["target"]

In [14]:
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,0.818182,1,3,0.725,0.413121,1,0,0.742574,0,2.3,0,0,1
1,0.480519,1,2,0.65,0.443262,0,1,0.925743,0,3.5,0,0,2
2,0.532468,0,1,0.65,0.361702,0,0,0.851485,0,1.4,2,0,2
3,0.727273,1,1,0.6,0.41844,0,1,0.881188,0,0.8,2,0,2
4,0.74026,0,0,0.6,0.62766,0,1,0.806931,1,0.6,2,0,2


In [15]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [16]:
#splitting the data into train and test groups

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=66)
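
To see how many rows a 30% split leaves in each group, the sketch below runs the same call on placeholder arrays with the dataset's shape (303 rows, 13 feature columns). The arrays X_demo and y_demo are stand-ins invented for illustration and carry no real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same shape as X and y (303 rows, 13 features).
X_demo = np.zeros((303, 13))
y_demo = np.arange(303) % 2

# Same split parameters as the cell above.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=66
)

print(X_tr.shape, X_te.shape)  # (212, 13) (91, 13)
```

scikit-learn rounds the test size up, so 0.30 * 303 = 90.9 becomes 91 test rows and the remaining 212 rows are used for training.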

In [17]:
#Random Forest Classifier Model

from sklearn.ensemble import RandomForestClassifier

# Create the model with 500 trees
model = RandomForestClassifier(n_estimators=500, criterion="entropy",
                               min_samples_split=40, min_samples_leaf=7,
                               bootstrap=True)
# Fit on training data
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=7, min_samples_split=40,
                       min_weight_fraction_leaf=0.0, n_estimators=500,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [18]:
# Mean accuracy of the model on the held-out test set

model.score(X_test,y_test)

0.8901098901098901
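
The score is the fraction of test rows classified correctly. As a small worked check, and assuming the test set has 91 rows (a 30% split of 303 rows, with the test size rounded up), the score can be turned back into a count:

```python
n_test = 91                  # assumed test rows: 30% of 303, rounded up
score = 0.8901098901098901   # value returned by model.score(X_test, y_test)

# Accuracy = correct predictions / test rows, so the count is score * n_test.
n_correct = round(score * n_test)
n_wrong = n_test - n_correct

print(n_correct)  # 81
print(n_wrong)    # 10
```

So the model got 81 of the 91 test patients right and 10 wrong. Because the forest was built without a fixed random_state, rerunning the notebook may give a slightly different score.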

In [19]:
#saving the model to disk
import pickle

filename = 'finalized_model.sav'
# Use a context manager so the file is closed once the model is written
with open(filename, 'wb') as f:
    pickle.dump(model, f)
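
A saved model is only useful if it can be read back. The sketch below shows a save and load round trip with pickle on a tiny synthetic model; the data, the demo_model name and the demo_model.sav file are stand-ins invented for illustration, not the heart-disease model.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in for the trained model (not the heart-disease data).
rng = np.random.default_rng(0)
X_demo = rng.random((60, 4))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
demo_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch does not overwrite finalized_model.sav.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.sav")
    with open(path, "wb") as f:
        pickle.dump(demo_model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should give the same predictions as the original.
same = bool((restored.predict(X_demo) == demo_model.predict(X_demo)).all())
print(same)
```

For the real model, pickle.load on finalized_model.sav would be used the same way. New inputs would need to be divided by the same maxima used in training (77 for age, 200 for trestbps, 564 for chol, 202 for thalach) before calling predict, and pickle files should only be loaded from sources you trust.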