The following notebook uses pandas, numpy, and sklearn to examine the public dataset available from http://archive.ics.uci.edu/ml/datasets/heart+disease.


We will examine this dataset for predictive ML possibilities (classification) using
* Logistic Regression (for classification)
* Random Forest
* Decision Trees
* Support Vector Machines


The dataset contains the following:

CONTINUOUS FEATURES
 * age 
 * trestbps – resting blood pressure (in mm Hg on admission to the hospital)
 * chol – serum cholesterol in mg/dl
 * thalach – maximum heart rate achieved
 * oldpeak – ST depression induced by exercise relative to rest
 * ca – (I guess) number of major vessels (0-3) colored by fluoroscopy

CATEGORICAL FEATURES
 * sex – 1/0 for male/female
 * cp – chest pain type, Value 1: typical angina, Value 2: atypical angina, Value 3: non-anginal pain, Value 4: asymptomatic
 * fbs – fasting blood sugar > 120 mg/dl  (1 = true; 0 = false)
 * restecg – resting electrocardiographic results, Value 0: normal, Value 1: having ST-T wave abnormality – T wave inversions and/or ST elevation or depression of > 0.05 mV, Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
 * exang  – exercise induced angina (1 = yes; 0 = no)
 * slope – the slope of the peak exercise ST segment, Value 1: upsloping, Value 2: flat, Value 3: downsloping
 * thal – 3 = normal; 6 = fixed defect; 7 = reversable defect
 * num – diagnosis of heart disease (angiographic disease status), Value 0: < 50% diameter narrowing, Value 1: > 50% diameter narrowing, in any major vessel: attributes 59 through 68 are vessels (0 is absent, 1 is present); this is our RESPONSE

## Pre-Processing

We will need the following libraries:

In [1]:
import pandas as pd
import numpy as np

Below we ingest our cleaned dataset, removing any unnecessary features. We also "categorize" our nonbinary features (one hot encode).

In [2]:
df = pd.read_csv("view_processed_cleveland.txt")
df = df.drop(["num"], axis=1)  # use num_binary not num

# create one-hot encoding on needed categorical
df_old = df  # save old df
cats = pd.get_dummies(df[['cp','restecg','slope','thal']].astype('category'))
df = df.drop(['cp','restecg','slope','thal'], axis=1)  # drop categorical cols
df = df.join(cats)  # join to add cats
df.head()


Unnamed: 0,age,sex,trestbps,chol,fbs,thalach,exang,oldpeak,ca,num_binary,...,cp_4,restecg_0,restecg_1,restecg_2,slope_1,slope_2,slope_3,thal_3,thal_6,thal_7
0,63,1,145,233,1,150,0,2.3,0,0,...,0,0,0,1,0,0,1,0,1,0
1,67,1,160,286,0,108,1,1.5,3,1,...,1,0,0,1,0,1,0,1,0,0
2,67,1,120,229,0,129,1,2.6,2,1,...,1,0,0,1,0,1,0,0,0,1
3,37,1,130,250,0,187,0,3.5,0,0,...,0,1,0,0,0,0,1,1,0,0
4,41,0,130,204,0,172,0,1.4,0,0,...,0,0,0,1,1,0,0,1,0,0


In [3]:
df.describe()

Unnamed: 0,age,sex,trestbps,chol,fbs,thalach,exang,oldpeak,ca,num_binary,...,cp_4,restecg_0,restecg_1,restecg_2,slope_1,slope_2,slope_3,thal_3,thal_6,thal_7
count,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,...,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0,297.0
mean,54.542088,0.676768,131.693603,247.350168,0.144781,149.599327,0.326599,1.055556,0.676768,0.461279,...,0.478114,0.494949,0.013468,0.491582,0.468013,0.461279,0.070707,0.552189,0.060606,0.387205
std,9.049736,0.4685,17.762806,51.997583,0.352474,22.941562,0.469761,1.166123,0.938965,0.49934,...,0.500364,0.500818,0.115462,0.500773,0.499818,0.49934,0.256768,0.498108,0.239009,0.487933
min,29.0,0.0,94.0,126.0,0.0,71.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,48.0,0.0,120.0,211.0,0.0,133.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,56.0,1.0,130.0,243.0,0.0,153.0,0.0,0.8,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
75%,61.0,1.0,140.0,276.0,0.0,166.0,1.0,1.6,1.0,1.0,...,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0
max,77.0,1.0,200.0,564.0,1.0,202.0,1.0,6.2,3.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Pandas Profiling

In [4]:
#!pip install pandas_profiling

In [5]:
import pandas_profiling as pp

pp.ProfileReport(df_old)

0,1
Number of variables,14
Number of observations,297
Total Missing (%),0.0%
Total size in memory,32.6 KiB
Average record size in memory,112.3 B

0,1
Numeric,10
Categorical,0
Boolean,4
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

0,1
Distinct count,41
Unique (%),13.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,54.542
Minimum,29
Maximum,77
Zeros (%),0.0%

0,1
Minimum,29
5-th percentile,40
Q1,48
Median,56
Q3,61
95-th percentile,68
Maximum,77
Range,48
Interquartile range,13

0,1
Standard deviation,9.0497
Coef of variation,0.16592
Kurtosis,-0.52175
Mean,54.542
MAD,7.4331
Skewness,-0.21977
Sum,16199
Variance,81.898
Memory size,2.4 KiB

Value,Count,Frequency (%),Unnamed: 3
58,18,6.1%,
57,17,5.7%,
54,16,5.4%,
59,14,4.7%,
51,12,4.0%,
60,12,4.0%,
62,11,3.7%,
44,11,3.7%,
52,11,3.7%,
56,11,3.7%,

Value,Count,Frequency (%),Unnamed: 3
29,1,0.3%,
34,2,0.7%,
35,4,1.3%,
37,2,0.7%,
38,1,0.3%,

Value,Count,Frequency (%),Unnamed: 3
70,4,1.3%,
71,3,1.0%,
74,1,0.3%,
76,1,0.3%,
77,1,0.3%,

0,1
Distinct count,4
Unique (%),1.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.67677
Minimum,0
Maximum,3
Zeros (%),58.6%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,3
Maximum,3
Range,3
Interquartile range,1

0,1
Standard deviation,0.93896
Coef of variation,1.3874
Kurtosis,0.23523
Mean,0.67677
MAD,0.79298
Skewness,1.1795
Sum,201
Variance,0.88165
Memory size,2.4 KiB

Value,Count,Frequency (%),Unnamed: 3
0,174,58.6%,
1,65,21.9%,
2,38,12.8%,
3,20,6.7%,

Value,Count,Frequency (%),Unnamed: 3
0,174,58.6%,
1,65,21.9%,
2,38,12.8%,
3,20,6.7%,

Value,Count,Frequency (%),Unnamed: 3
0,174,58.6%,
1,65,21.9%,
2,38,12.8%,
3,20,6.7%,

0,1
Distinct count,152
Unique (%),51.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,247.35
Minimum,126
Maximum,564
Zeros (%),0.0%

0,1
Minimum,126.0
5-th percentile,175.8
Q1,211.0
Median,243.0
Q3,276.0
95-th percentile,327.6
Maximum,564.0
Range,438.0
Interquartile range,65.0

0,1
Standard deviation,51.998
Coef of variation,0.21022
Kurtosis,4.4441
Mean,247.35
MAD,39.507
Skewness,1.1181
Sum,73463
Variance,2703.7
Memory size,2.4 KiB

Value,Count,Frequency (%),Unnamed: 3
234,6,2.0%,
197,6,2.0%,
212,5,1.7%,
269,5,1.7%,
254,5,1.7%,
204,5,1.7%,
177,4,1.3%,
239,4,1.3%,
240,4,1.3%,
226,4,1.3%,

Value,Count,Frequency (%),Unnamed: 3
126,1,0.3%,
131,1,0.3%,
141,1,0.3%,
149,2,0.7%,
157,1,0.3%,

Value,Count,Frequency (%),Unnamed: 3
394,1,0.3%,
407,1,0.3%,
409,1,0.3%,
417,1,0.3%,
564,1,0.3%,

0,1
Distinct count,4
Unique (%),1.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.1582
Minimum,1
Maximum,4
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,3
Median,3
Q3,4
95-th percentile,4
Maximum,4
Range,3
Interquartile range,1

0,1
Standard deviation,0.96486
Coef of variation,0.3055
Kurtosis,-0.41092
Mean,3.1582
MAD,0.80491
Skewness,-0.84441
Sum,938
Variance,0.93095
Memory size,2.4 KiB

Value,Count,Frequency (%),Unnamed: 3
4,142,47.8%,
3,83,27.9%,
2,49,16.5%,
1,23,7.7%,

Value,Count,Frequency (%),Unnamed: 3
1,23,7.7%,
2,49,16.5%,
3,83,27.9%,
4,142,47.8%,

Value,Count,Frequency (%),Unnamed: 3
1,23,7.7%,
2,49,16.5%,
3,83,27.9%,
4,142,47.8%,

0,1
Distinct count,2
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.3266

0,1
0,200
1,97

Value,Count,Frequency (%),Unnamed: 3
0,200,67.3%,
1,97,32.7%,

0,1
Distinct count,2
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.14478

0,1
0,254
1,43

Value,Count,Frequency (%),Unnamed: 3
0,254,85.5%,
1,43,14.5%,

0,1
Distinct count,2
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.46128

0,1
0,160
1,137

Value,Count,Frequency (%),Unnamed: 3
0,160,53.9%,
1,137,46.1%,

0,1
Distinct count,40
Unique (%),13.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.0556
Minimum,0
Maximum,6.2
Zeros (%),32.3%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.8
Q3,1.6
95-th percentile,3.4
Maximum,6.2
Range,6.2
Interquartile range,1.6

0,1
Standard deviation,1.1661
Coef of variation,1.1047
Kurtosis,1.511
Mean,1.0556
MAD,0.93513
Skewness,1.2471
Sum,313.5
Variance,1.3598
Memory size,2.4 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,96,32.3%,
1.2,17,5.7%,
0.6,14,4.7%,
1.0,13,4.4%,
0.8,13,4.4%,
1.4,13,4.4%,
0.2,12,4.0%,
1.6,11,3.7%,
1.8,10,3.4%,
2.0,9,3.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,96,32.3%,
0.1,6,2.0%,
0.2,12,4.0%,
0.3,3,1.0%,
0.4,8,2.7%,

Value,Count,Frequency (%),Unnamed: 3
4.0,3,1.0%,
4.2,2,0.7%,
4.4,1,0.3%,
5.6,1,0.3%,
6.2,1,0.3%,

0,1
Distinct count,3
Unique (%),1.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.99663
Minimum,0
Maximum,2
Zeros (%),49.5%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,1
Q3,2
95-th percentile,2
Maximum,2
Range,2
Interquartile range,2

0,1
Standard deviation,0.99491
Coef of variation,0.99828
Kurtosis,-1.9997
Mean,0.99663
MAD,0.98657
Skewness,0.0067678
Sum,296
Variance,0.98985
Memory size,2.4 KiB

Value,Count,Frequency (%),Unnamed: 3
0,147,49.5%,
2,146,49.2%,
1,4,1.3%,

Value,Count,Frequency (%),Unnamed: 3
0,147,49.5%,
1,4,1.3%,
2,146,49.2%,

Value,Count,Frequency (%),Unnamed: 3
0,147,49.5%,
1,4,1.3%,
2,146,49.2%,

0,1
Distinct count,2
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0

0,1
Mean,0.67677

0,1
1,201
0,96

Value,Count,Frequency (%),Unnamed: 3
1,201,67.7%,
0,96,32.3%,

0,1
Distinct count,3
Unique (%),1.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.6027
Minimum,1
Maximum,3
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,1
Median,2
Q3,2
95-th percentile,3
Maximum,3
Range,2
Interquartile range,1

0,1
Standard deviation,0.61819
Coef of variation,0.38572
Kurtosis,-0.6273
Mean,1.6027
MAD,0.56414
Skewness,0.51044
Sum,476
Variance,0.38215
Memory size,2.4 KiB

Value,Count,Frequency (%),Unnamed: 3
1,139,46.8%,
2,137,46.1%,
3,21,7.1%,

Value,Count,Frequency (%),Unnamed: 3
1,139,46.8%,
2,137,46.1%,
3,21,7.1%,

Value,Count,Frequency (%),Unnamed: 3
1,139,46.8%,
2,137,46.1%,
3,21,7.1%,

0,1
Distinct count,3
Unique (%),1.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.7306
Minimum,3
Maximum,7
Zeros (%),0.0%

0,1
Minimum,3
5-th percentile,3
Q1,3
Median,3
Q3,7
95-th percentile,7
Maximum,7
Range,4
Interquartile range,4

0,1
Standard deviation,1.9386
Coef of variation,0.4098
Kurtosis,-1.9157
Mean,4.7306
MAD,1.9113
Skewness,0.24777
Sum,1405
Variance,3.7583
Memory size,2.4 KiB

Value,Count,Frequency (%),Unnamed: 3
3,164,55.2%,
7,115,38.7%,
6,18,6.1%,

Value,Count,Frequency (%),Unnamed: 3
3,164,55.2%,
6,18,6.1%,
7,115,38.7%,

Value,Count,Frequency (%),Unnamed: 3
3,164,55.2%,
6,18,6.1%,
7,115,38.7%,

0,1
Distinct count,91
Unique (%),30.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,149.6
Minimum,71
Maximum,202
Zeros (%),0.0%

0,1
Minimum,71
5-th percentile,108
Q1,133
Median,153
Q3,166
95-th percentile,182
Maximum,202
Range,131
Interquartile range,33

0,1
Standard deviation,22.942
Coef of variation,0.15335
Kurtosis,-0.051849
Mean,149.6
MAD,18.5
Skewness,-0.53654
Sum,44431
Variance,526.32
Memory size,2.4 KiB

Value,Count,Frequency (%),Unnamed: 3
162,11,3.7%,
160,9,3.0%,
163,9,3.0%,
152,8,2.7%,
172,7,2.4%,
125,7,2.4%,
132,7,2.4%,
150,7,2.4%,
143,6,2.0%,
173,6,2.0%,

Value,Count,Frequency (%),Unnamed: 3
71,1,0.3%,
88,1,0.3%,
90,1,0.3%,
95,1,0.3%,
96,2,0.7%,

Value,Count,Frequency (%),Unnamed: 3
190,1,0.3%,
192,1,0.3%,
194,1,0.3%,
195,1,0.3%,
202,1,0.3%,

0,1
Distinct count,50
Unique (%),16.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,131.69
Minimum,94
Maximum,200
Zeros (%),0.0%

0,1
Minimum,94.0
5-th percentile,108.0
Q1,120.0
Median,130.0
Q3,140.0
95-th percentile,160.8
Maximum,200.0
Range,106.0
Interquartile range,20.0

0,1
Standard deviation,17.763
Coef of variation,0.13488
Kurtosis,0.81498
Mean,131.69
MAD,13.781
Skewness,0.70007
Sum,39113
Variance,315.52
Memory size,2.4 KiB

Value,Count,Frequency (%),Unnamed: 3
120,37,12.5%,
130,36,12.1%,
140,32,10.8%,
110,19,6.4%,
150,17,5.7%,
160,11,3.7%,
138,10,3.4%,
125,10,3.4%,
128,10,3.4%,
112,9,3.0%,

Value,Count,Frequency (%),Unnamed: 3
94,2,0.7%,
100,4,1.3%,
101,1,0.3%,
102,2,0.7%,
104,1,0.3%,

Value,Count,Frequency (%),Unnamed: 3
174,1,0.3%,
178,2,0.7%,
180,3,1.0%,
192,1,0.3%,
200,1,0.3%,

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,num_binary
0,63,1,1,145,233,1,2,150,0,2.3,3,0,6,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,3,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2,7,1
3,37,1,3,130,250,0,0,187,0,3.5,3,0,3,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,3,0


## Train/Test Split

Below we create our train/test data w/ a 70%/30% split.

In [6]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.3)
train.shape, test.shape

((207, 23), (90, 23))

In [7]:
TRAIN_X = train.loc[:,df.columns!="num_binary"]
TRAIN_Y = train["num_binary"]
TEST_X = test.loc[:,df.columns!="num_binary"]
TEST_Y = test["num_binary"]

In [8]:
type(TRAIN_X), type(TRAIN_Y), type(TEST_X), type(TEST_Y)

(pandas.core.frame.DataFrame,
 pandas.core.series.Series,
 pandas.core.frame.DataFrame,
 pandas.core.series.Series)

In [9]:
len(TRAIN_X), len(TRAIN_Y), len(TEST_X), len(TEST_Y)

(207, 207, 90, 90)

In [10]:
len(TRAIN_X.columns), len(TEST_X.columns)  # note: Series do not have .columns (only 1 col)

(22, 22)

In [None]:

# df.plot.bar(x='Start (year)', y='Sleep quality (dec)')

## Feature Selection

https://machinelearningmastery.com/feature-selection-machine-learning-python/

univeriate feature selection w/ SelectKBest:

In [11]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [12]:
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(TRAIN_X, TRAIN_Y)

In [13]:
# summarize scores
np.set_printoptions(precision=3)
print(dict(zip(list(TRAIN_X.columns), fit.scores_)))  # maybe???
features = fit.transform(TRAIN_X)

{'age': 19.754853147337002, 'sex': 5.263277263277266, 'trestbps': 12.951950628942736, 'chol': 21.81618649853428, 'fbs': 0.15836148648648682, 'thalach': 128.8940593114351, 'exang': 18.40674844926894, 'oldpeak': 42.749856072940446, 'ca': 64.92192778716218, 'cp_1': 4.1964527027027, 'cp_2': 6.008711389961389, 'cp_3': 10.290118243243244, 'cp_4': 21.35406074391223, 'restecg_0': 3.9914004914004897, 'restecg_1': 0.49662162162162166, 'restecg_2': 3.31534749034749, 'slope_1': 15.52126999448428, 'slope_2': 13.625924608819345, 'slope_3': 0.6525096525096524, 'thal_3': 22.550854615208422, 'thal_6': 3.533059845559846, 'thal_7': 26.46496621621622}


In [14]:
# summarize selected features
print(features[0:5,:])

[[150.    1.9   2.    1. ]
 [123.    0.6   0.    0. ]
 [145.    4.2   0.    1. ]
 [178.    0.8   0.    0. ]
 [144.    1.2   1.    0. ]]


In [15]:
# yeah, so they didn't actually explain any of this in the example...

## Logistic Regression

In [16]:
# lbfgs
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='lbfgs', max_iter=10000)  # tune max_iter hyperparameter to achieve convergence
model = model.fit(TRAIN_X, TRAIN_Y)

Basic prediction, confusion matrix, and model score.

In [17]:
# prediction, confusion matrix, model score ("model score" here is calc of confusion matrix)
from sklearn.metrics import confusion_matrix
pred = model.predict(TEST_X)  # calc predictions
print(confusion_matrix(pred, TEST_Y))  # confusion matrix
score = model.score(TEST_X, TEST_Y)
print(f'{score*100:.5}%')


[[45  8]
 [ 4 33]]
86.667%


In [18]:
# sag
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='sag', max_iter=10000)  # tune max_iter hyperparameter to achieve convergence
model = model.fit(TRAIN_X, TRAIN_Y)

Basic prediction, confusion matrix, and model score.

In [19]:
# prediction, confusion matrix, model score ("model score" here is calc of confusion matrix)
from sklearn.metrics import confusion_matrix
pred = model.predict(TEST_X)  # calc predictions
print(confusion_matrix(pred, TEST_Y))  # confusion matrix
score = model.score(TEST_X,TEST_Y)
print(f'{score*100:.5}%')

[[45  8]
 [ 4 33]]
86.667%


# Random Forest

## feature selection

In [20]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score as acc
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

In [21]:
import datetime

print(f"START: {datetime.datetime.now()}")

results = {}

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)

for i in range(5,15):
    # Build step forward feature selection
    sfs1 = sfs(clf,
               k_features=i,
               forward=True,
               floating=False,
               verbose=2,
               scoring='accuracy',
               cv=5) # 5 fold cross-val

    # Perform SFFS
    sfs1 = sfs1.fit(TRAIN_X, TRAIN_Y)
    
    print(list(sfs1.k_feature_idx_))
    
    # save results in dict
#    results[str(i)] = sfs1.k_feature_idx_
    results[str(i)] = sfs1

print(f"END: {datetime.datetime.now()}")

print(results)

START: 2019-06-11 19:06:55.189931


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.8s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   24.9s finished

[2019-06-11 19:07:20] Features: 1/5 -- score: 0.7676687464549065[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   21.0s finished

[2019-06-11 19:07:41] Features: 2/5 -- score: 0.7867271695972773[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   20.1s finished

[2019-06-11 19:08:01] Features: 3/5 -- score: 0.820873511060692[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 o

[8, 12, 14, 20, 21]


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   27.5s finished

[2019-06-11 19:09:16] Features: 1/6 -- score: 0.7676687464549065[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   22.7s finished

[2019-06-11 19:09:38] Features: 2/6 -- score: 0.7867271695972773[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   22.9s finished

[2019-06-11 19:10:01] Features: 3/6 -- score: 0.8009075439591605[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 o

[1, 8, 9, 16, 17, 21]


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   25.2s finished

[2019-06-11 19:11:29] Features: 1/7 -- score: 0.7676687464549065[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   26.3s finished

[2019-06-11 19:11:55] Features: 2/7 -- score: 0.7916052183777651[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   27.2s finished

[2019-06-11 19:12:22] Features: 3/7 -- score: 0.8013613159387407[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.5s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 o

[1, 8, 9, 13, 15, 18, 21]


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   24.5s finished

[2019-06-11 19:14:11] Features: 1/8 -- score: 0.7676687464549065[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.6s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   28.6s finished

[2019-06-11 19:14:39] Features: 2/8 -- score: 0.7867271695972773[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   20.6s finished

[2019-06-11 19:15:00] Features: 3/8 -- score: 0.8159954622802041[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 o

[1, 8, 12, 14, 18, 19, 20, 21]


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   21.5s finished

[2019-06-11 19:16:49] Features: 1/9 -- score: 0.7676687464549065[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   20.6s finished

[2019-06-11 19:17:09] Features: 2/9 -- score: 0.7867271695972773[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   19.8s finished

[2019-06-11 19:17:29] Features: 3/9 -- score: 0.820873511060692[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19 ou

[1, 4, 7, 8, 12, 14, 18, 19, 21]


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   22.3s finished

[2019-06-11 19:19:30] Features: 1/10 -- score: 0.7676687464549065[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   23.0s finished

[2019-06-11 19:19:53] Features: 2/10 -- score: 0.7867271695972773[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   21.1s finished

[2019-06-11 19:20:14] Features: 3/10 -- score: 0.8111174134997163[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  1

[1, 4, 7, 8, 12, 14, 18, 19, 20, 21]


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   23.4s finished

[2019-06-11 19:22:38] Features: 1/11 -- score: 0.7676687464549065[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   24.7s finished

[2019-06-11 19:23:03] Features: 2/11 -- score: 0.7867271695972773[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   22.6s finished

[2019-06-11 19:23:25] Features: 3/11 -- score: 0.8159954622802041[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  1

[1, 4, 8, 9, 11, 12, 14, 16, 19, 20, 21]


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   23.1s finished

[2019-06-11 19:26:03] Features: 1/12 -- score: 0.7676687464549065[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   22.9s finished

[2019-06-11 19:26:26] Features: 2/12 -- score: 0.7869540555870674[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   25.3s finished

[2019-06-11 19:26:51] Features: 3/12 -- score: 0.8013613159387407[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.4s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  1

[1, 4, 6, 8, 9, 11, 13, 14, 15, 16, 17, 21]


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   21.6s finished

[2019-06-11 19:29:31] Features: 1/13 -- score: 0.7676687464549065[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   20.6s finished

[2019-06-11 19:29:52] Features: 2/13 -- score: 0.7867271695972773[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   19.6s finished

[2019-06-11 19:30:12] Features: 3/13 -- score: 0.820873511060692[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  19

[1, 2, 3, 4, 8, 9, 12, 14, 16, 18, 19, 20, 21]


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  22 out of  22 | elapsed:   22.6s finished

[2019-06-11 19:33:08] Features: 1/14 -- score: 0.7676687464549065[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  21 out of  21 | elapsed:   21.5s finished

[2019-06-11 19:33:30] Features: 2/14 -- score: 0.7867271695972773[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  20 out of  20 | elapsed:   21.3s finished

[2019-06-11 19:33:51] Features: 3/14 -- score: 0.7964832671582529[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  1

[1, 2, 5, 8, 9, 11, 13, 14, 15, 16, 17, 18, 20, 21]
END: 2019-06-11 19:36:40.114296
{'5': SequentialFeatureSelector(clone_estimator=True, cv=5,
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
             floating=False, forward=True, k_features=5, n_jobs=1,
             pre_dispatch='2*n_jobs', scoring='accuracy', verbose=2), '6': SequentialFeatureSelector(clone_estimator=True, cv=5,
             estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_i

[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:   11.1s finished

[2019-06-11 19:36:40] Features: 14/14 -- score: 0.8306296086216676

In [22]:
help(results['5'])
# k_feature_idx_, k_feature_names_,  k_score_, subsets_

Help on SequentialFeatureSelector in module mlxtend.feature_selection.sequential_feature_selector object:

class SequentialFeatureSelector(mlxtend.utils.base_compostion._BaseXComposition, sklearn.base.MetaEstimatorMixin)
 |  SequentialFeatureSelector(estimator, k_features=1, forward=True, floating=False, verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True)
 |  
 |  Sequential Feature Selection for Classification and Regression.
 |  
 |  Parameters
 |  ----------
 |  estimator : scikit-learn classifier or regressor
 |  k_features : int or tuple or str (default: 1)
 |      Number of features to select,
 |      where k_features < the full feature set.
 |      New in 0.4.2: A tuple containing a min and max value can be provided,
 |          and the SFS will consider return any feature combination between
 |          min and max that scored highest in cross-validtion. For example,
 |          the tuple (1, 4) will return any combination from
 |          1 

In [23]:
r = results['8']
r.k_feature_idx_, r.k_feature_names_, r.k_score_, r.subsets_

((1, 8, 12, 14, 18, 19, 20, 21),
 ('sex', 'ca', 'cp_4', 'restecg_1', 'slope_3', 'thal_3', 'thal_6', 'thal_7'),
 0.8304027226318775,
 {1: {'feature_idx': (8,),
   'cv_scores': array([0.814, 0.732, 0.805, 0.829, 0.659]),
   'avg_score': 0.7676687464549065,
   'feature_names': ('ca',)},
  2: {'feature_idx': (8, 21),
   'cv_scores': array([0.86 , 0.78 , 0.805, 0.756, 0.732]),
   'avg_score': 0.7867271695972773,
   'feature_names': ('ca', 'thal_7')},
  3: {'feature_idx': (8, 12, 21),
   'cv_scores': array([0.86 , 0.829, 0.78 , 0.756, 0.854]),
   'avg_score': 0.8159954622802041,
   'feature_names': ('ca', 'cp_4', 'thal_7')},
  4: {'feature_idx': (8, 12, 18, 21),
   'cv_scores': array([0.86 , 0.829, 0.78 , 0.756, 0.878]),
   'avg_score': 0.820873511060692,
   'feature_names': ('ca', 'cp_4', 'slope_3', 'thal_7')},
  5: {'feature_idx': (1, 8, 12, 18, 21),
   'cv_scores': array([0.93 , 0.829, 0.756, 0.78 , 0.878]),
   'avg_score': 0.834826999432785,
   'feature_names': ('sex', 'ca', 'cp_4', 'slo

In [24]:
out = []
for i in range(5,15):
    r = results[str(i)]
    out.append((r.k_score_, r.k_feature_names_))

sorted(out, key=lambda x: x[0], reverse=True)

[(0.8445830969937607, ('sex', 'ca', 'cp_1', 'slope_1', 'slope_2', 'thal_7')),
 (0.8306296086216676,
  ('sex',
   'trestbps',
   'thalach',
   'ca',
   'cp_1',
   'cp_3',
   'restecg_0',
   'restecg_1',
   'restecg_2',
   'slope_1',
   'slope_2',
   'slope_3',
   'thal_6',
   'thal_7')),
 (0.8304027226318775,
  ('sex', 'ca', 'cp_4', 'restecg_1', 'slope_3', 'thal_3', 'thal_6', 'thal_7')),
 (0.8301758366420874,
  ('sex',
   'fbs',
   'exang',
   'ca',
   'cp_1',
   'cp_3',
   'restecg_0',
   'restecg_1',
   'restecg_2',
   'slope_1',
   'slope_2',
   'thal_7')),
 (0.8255246738513897, ('ca', 'cp_4', 'restecg_1', 'thal_6', 'thal_7')),
 (0.8252977878615996,
  ('sex', 'ca', 'cp_1', 'restecg_0', 'restecg_2', 'slope_3', 'thal_7')),
 (0.8206466250709019,
  ('sex',
   'fbs',
   'ca',
   'cp_1',
   'cp_3',
   'cp_4',
   'restecg_1',
   'slope_1',
   'thal_3',
   'thal_6',
   'thal_7')),
 (0.8171298922291548,
  ('sex',
   'fbs',
   'oldpeak',
   'ca',
   'cp_4',
   'restecg_1',
   'slope_3',
   'th

## model

In [25]:
model = RandomForestClassifier(n_estimators = 5000, random_state = 51)
model = model.fit(TRAIN_X, TRAIN_Y)

Basic prediction, confusion matrix, and model score.

In [26]:
# prediction, confusion matrix, model score ("model score" here is calc of confusion matrix)
from sklearn.metrics import confusion_matrix
pred = model.predict(TEST_X)  # calc predictions
print(confusion_matrix(pred, TEST_Y))  # confusion matrix
score = model.score(TEST_X,TEST_Y)
print(f'{score*100:.5}%')

[[42  7]
 [ 7 34]]
84.444%


model w/ feature selection

In [27]:
TRAIN_X2 = TRAIN_X[['age',
   'trestbps',
   'fbs',
   'exang',
   'oldpeak',
   'ca',
   'cp_1',
   'cp_2',
   'cp_4',
   'restecg_0',
   'restecg_1',
   'restecg_2',
   'thal_3']]

TEST_X2 = TEST_X[['age',
   'trestbps',
   'fbs',
   'exang',
   'oldpeak',
   'ca',
   'cp_1',
   'cp_2',
   'cp_4',
   'restecg_0',
   'restecg_1',
   'restecg_2',
   'thal_3']]

In [28]:
model = RandomForestClassifier(n_estimators = 5000, random_state = 51)
model = model.fit(TRAIN_X2, TRAIN_Y)

In [29]:
# prediction, confusion matrix, model score ("model score" here is calc of confusion matrix)
from sklearn.metrics import confusion_matrix
pred = model.predict(TEST_X2)  # calc predictions
print(confusion_matrix(pred, TEST_Y))  # confusion matrix
score = model.score(TEST_X2,TEST_Y)
print(f'{score*100:.5}%')

[[43  7]
 [ 6 34]]
85.556%


gives the same... ???

## SVM

In [30]:
from sklearn import svm
model = svm.SVC(gamma='scale', kernel='linear')  # need linear
model = model.fit(train.loc[:,df.columns!="num_binary"], train["num_binary"])

Basic prediction, confusion matrix, and model score.

In [32]:
# prediction, confusion matrix, model score ("model score" here is calc of confusion matrix)
from sklearn.metrics import confusion_matrix
pred = model.predict(test.loc[:,df.columns!="num_binary"])  # calc predictions
print(confusion_matrix(pred, test["num_binary"]))  # confusion matrix
score = model.score(test.loc[:,df.columns!="num_binary"],test["num_binary"])
print(f'{score*100:.5}%')

AttributeError: 'SelectKBest' object has no attribute 'loc'

## Decision Trees

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion = "gini", random_state = 101,
                               max_depth=10, min_samples_leaf=10)
model = model.fit(train.loc[:,df.columns!="num_binary"], train["num_binary"])

Basic prediction, confusion matrix, and model score.

In [None]:
# prediction, confusion matrix, model score ("model score" here is calc of confusion matrix)
from sklearn.metrics import confusion_matrix
pred = model.predict(test.loc[:,df.columns!="num_binary"])  # calc predictions
print(confusion_matrix(pred, test["num_binary"]))  # confusion matrix
score = model.score(test.loc[:,df.columns!="num_binary"],test["num_binary"])
print(f'{score*100:.5}%')