# Predict Heart Disease

### For this assignment we are given numerical data describing patients with and without

### heart disease. Because every column is numeric, we do not need to encode or

### discard non-numeric data. There are also no missing or NaN values. Since we do not expect a

### simple (e.g. linear) relationship between the features and the target, we will build a Random Forest model.

## First, import libraries and read in data

In [48]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

In [49]:
# Since zeros and ones are grouped, we use .sample to randomize (mix) records
df = pd.read_csv('heart_disease.csv').sample(frac=1)
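
The introduction states that every column is numeric and that nothing is missing. A check along the following lines could confirm both claims. This is only a sketch: `demo` is a hypothetical stand-in holding a few columns and values taken from rows shown in this notebook's output, not the full `heart_disease.csv` file.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for the real file: three rows, four of the columns
demo = pd.DataFrame({
    "age": [58, 61, 41],
    "chol": [197, 330, 172],
    "oldpeak": [0.6, 0.0, 0.0],
    "target": [1, 0, 0],
})

# True only if every column has a numeric dtype
all_numeric = all(is_numeric_dtype(dtype) for dtype in demo.dtypes)

# Total number of missing (NaN) cells across the whole frame
n_missing = int(demo.isna().sum().sum())

print("all numeric:", all_numeric)
print("missing values:", n_missing)
```

Running the same two checks on `df` would show whether the assumptions hold for the real data.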

## Select Data

In [50]:
# Select the target, which tells us whether the person has heart disease (1) or not (0)
y = df.loc[:, ['target']]
y.head()

Unnamed: 0,target
155,1
182,0
189,0
288,0
124,1
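
Selecting with a list, as in `df.loc[:, ['target']]`, returns a one-column DataFrame rather than a Series, which is why the labels are later flattened with `.values.ravel()` before fitting. scikit-learn expects a 1-D label array and warns when it receives a column vector. The sketch below illustrates the shapes involved, using a hypothetical three-row frame `demo` in place of the real data.

```python
import pandas as pd

# Three labels standing in for the real target column
demo = pd.DataFrame({"target": [1, 0, 0]})

y_frame = demo.loc[:, ["target"]]   # list selector: DataFrame with one column
y_series = demo["target"]           # single label: Series
y_flat = y_frame.values.ravel()     # DataFrame flattened to a 1-D NumPy array

print(y_frame.shape, y_series.shape, y_flat.shape)
```

Selecting `df['target']` directly would give a 1-D Series from the start, at the cost of a less table-like `head()` display.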


In [51]:
# Select the feature columns; all columns are numeric, so we keep everything except the target
x = df.drop(columns=['target'])
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
155,58,0,0,130,197,0,1,131,0,0.6,1,0,2
182,61,0,0,130,330,0,0,169,0,0.0,2,0,2
189,41,1,0,110,172,0,0,158,0,0.0,2,0,3
288,57,1,0,110,335,0,1,143,1,3.0,1,1,3
124,39,0,2,94,199,0,1,179,0,0.0,2,0,2


## Build Random Forest Model

In [57]:
# Split data into train and test, 80% train 20% test
xtr, xtst, ytr, ytst = train_test_split(x, y, test_size = 0.2, random_state=1)
max_correct = 0
params = [0, 0, 0, 0]
ytst = ytst.values.ravel()

# Use for loops to test out various values for these 4 parameters
for n_est in range(1, 100, 10):
    for depth in range(1, 10):
        for min_samples_s in range(2, 10):
            for min_samples_l in range(1, 10):

                # Use random forest to make the model
                model = RandomForestClassifier(
                    random_state=1,
                    n_estimators=n_est,
                    max_depth=depth,
                    min_samples_split=min_samples_s,
                    min_samples_leaf=min_samples_l,
                )
    
                # Fit the RF model; y-train must be converted to a 1-D array
                model.fit(xtr, ytr.values.ravel())
    
                # Predict y-values - Heart Disease Yes(1) or No(0)
                ypred = model.predict(xtst)
                
                # Count how many predictions match the true labels
                correct_pred = int((ypred == ytst).sum())
                
                # Keep the parameters that give the highest number of correct predictions
                if correct_pred > max_correct:
                    max_correct = correct_pred
                    params = [n_est, depth, min_samples_s, min_samples_l]

# Report the best accuracy found (max_correct), not the count from the final loop iteration,
# so that the percentage matches the parameters printed below
print(str(max_correct / len(ytst) * 100) + '% of predictions are correct')
print('Parameters used:', params)
    

90.1639344262295% of predictions are correct
Parameters used: [41, 4, 2, 1]
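
The loops above choose parameters by their score on the test split, so the reported accuracy is likely to be optimistic. One common alternative, sketched here under assumptions, is to search with cross-validation on the training split only and then score the held-out test split once. Synthetic data from `make_classification` stands in for the heart-disease file, and the grid is a reduced, illustrative one rather than the full set of combinations tried above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in: 300 rows, 13 numeric features, binary target
X, y = make_classification(n_samples=300, n_features=13, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Reduced grid for illustration only
param_grid = {
    "n_estimators": [11, 41],
    "max_depth": [2, 4],
    "min_samples_leaf": [1, 3],
}

# 3-fold cross-validation on the training split selects the parameters
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The test split is used once, after the parameters are fixed
test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", search.best_score_)
print("Held-out test accuracy:", test_accuracy)
```

With the real data, `X` and `y` would be replaced by `x` and `y.values.ravel()` from the cells above.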
