# Predicting heart disease

## Introduction

Relevance. According to the American Heart Association Statistics report (2016), heart disease is the leading cause of death for both men and women and is responsible for 1 in every 4 deaths. Even modest improvements in prognostic models of heart events and complications could save hundreds of lives and help significantly reduce the cost of health care services, medications, and lost productivity.


http://inpressco.com/wp-content/uploads/2017/10/Paper271842-1853.pdf

## Methods 

Deep neural networks (DNNs) are a set of modern machine learning (ML) models that have gained widespread recognition because they were behind the first FDA (US Food and Drug Administration) approved machine learning application in healthcare; to be approved, it had to pass tests showing that it can produce results at least as accurate as those of humans. Recently, such ML models were also used to detect, with cardiologist-level accuracy, 14 types of arrhythmias (sometimes life-threatening heartbeats) from electrocardiogram (ECG) signals generated by wearable monitors.

## Original contribution

Studies exploring the potential of this technology for the prognosis of cardiovascular events and complications from risk factors have been limited. Events and complications include, for example, coronary artery disease, stroke, and congestive heart failure. Risk factors include those established by the American College of Cardiology/American Heart Association (ACC/AHA), such as age, high blood pressure, high LDL cholesterol, and smoking, as well as others such as systolic blood pressure variability, kidney disease, and ethnicity.

Most previous studies have used either logistic regression or classical machine learning algorithms such as random forests, gradient boosting, and (non-deep) neural networks; in addition, studies comparing these algorithms with deep learning models in the specific prognosis context under consideration are not readily available.

## Research objectives

Establish the relative performance of deep learning models, such as deep belief networks and convolutional neural networks, and of ensembles with respect to classical machine learning algorithms (including logistic regression), using case studies built from well-known heart disease data sets such as the Cleveland set available from the UCI repository. Research questions of interest include, for example: What is the sample-size threshold in heart disease studies above which the more complex but potentially more effective deep learning models would be recommended? Would ensembles of machine learning models provide more robust predictions, as has been the case in other knowledge domains? Should the ACC/AHA list of eight risk factors be updated with other genetic or lifestyle factors? The deep learning models will be implemented in TensorFlow (originally from Google, now open source) and healthcare.ai, an open-source package that facilitates the development of machine learning in healthcare, with the provision that they can handle so-called big data by using the Hadoop/Spark platform.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

from imblearn.over_sampling import SMOTE #for SMOTE -> install package using: conda install -c conda-forge imbalanced-learn 


In [1]:
from scipy import stats, integrate
import matplotlib.pyplot as plt
import ggplot
import scipy
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
from sklearn import metrics
from sklearn.svm import LinearSVC
from sklearn.decomposition import PCA, NMF

from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, chi2

import pylab as pl
from itertools import cycle
from sklearn import model_selection  # replaces the removed sklearn.cross_validation module
from sklearn.svm import SVC



In [2]:
features_list = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','class']
dataset1=pd.read_csv("PCA Dataset original.csv")
dataset2=pd.read_csv("PCA Dataset original.csv")

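# Coerce every column to numeric; entries that cannot be parsed become NaN
dataset1 = dataset1.apply(pd.to_numeric, errors='coerce')
# Keep the result of the cast (astype returns a new frame)
dataset1 = dataset1.astype('float')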



# SVM requires that each data instance is represented as a vector of real numbers


In [None]:
#### count missing values per column #######
#dataset.shape[0] - dataset.count()
dataset1.isnull()
dataset1.isnull().any()

In [None]:
dataset1.duplicated().any()

There are some missing values and no duplicated rows in this dataset.
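
The two checks above return boolean flags only. As a minimal sketch of how the counts themselves could be obtained, the block below uses a small hypothetical frame (`toy`, not the notebook's `dataset1`) with invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for dataset1 (values are made up)
toy = pd.DataFrame({
    "age": [63, 67, 67, 37],
    "ca": [0.0, 3.0, np.nan, 0.0],
    "thal": [6.0, 3.0, 7.0, np.nan],
})

missing_per_column = toy.isnull().sum()      # number of NaN entries per column
n_duplicates = int(toy.duplicated().sum())   # number of fully duplicated rows

print(missing_per_column)
print("duplicated rows:", n_duplicates)
```

The same two calls should work unchanged on `dataset1`, assuming it is a regular pandas DataFrame.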

In [None]:
def checkforoutlier(df):
    """Count values lying more than one IQR below Q1 or above Q3, over all columns."""
    outliersnumbers = 0
    for column in df:
        # Compute the quartiles once per column instead of once per value
        q1, q3 = np.percentile(df[column], [25, 75])
        iqr = q3 - q1
        for number in df[column]:
            if number < q1 - iqr or number > q3 + iqr:
                print("outlier: ", number)
                outliersnumbers += 1
    # Divide by the total number of values (df.size), since outliers are counted over all columns
    return outliersnumbers, 'outliers. That is', round(float(outliersnumbers)/float(df.size)*100, 0), 'percent of all values'

print(checkforoutlier(dataset1))
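
The same rule can also be written without explicit loops. The following is a sketch only: `count_outliers`, its parameter `k`, and the `toy` frame are hypothetical names introduced here. `k=1.0` mirrors the one-IQR fence used in the function above; `k=1.5` is the more common Tukey convention.

```python
import pandas as pd

def count_outliers(df, k=1.0):
    """Count, per column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # Boolean mask of outliers; the Series of bounds is aligned on column names
    mask = (df < q1 - k * iqr) | (df > q3 + k * iqr)
    return mask.sum()

# Hypothetical toy data with one obvious outlier in "chol"
toy = pd.DataFrame({
    "chol": [200, 210, 220, 230, 240, 600],
    "age": [40, 45, 50, 55, 60, 65],
})
counts = count_outliers(toy)
print(counts)
```

Reporting the counts per column makes it easier to see which variable drives the total.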

In [None]:
dataset1 = dataset1.fillna(value=0)

# 1. Preliminary description of the data.
Box plots and histograms were used for continuous and categorical variables.
Basic statistics are also available.

# Univariate analysis

# Continuous variables 
basic statistics + Box plots + histograms 

In [None]:
## basic statistic descriptions
continuas=["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]
dataset1[continuas].describe()

In [None]:
## age histogram
ag= np.array(dataset1['age']) 
plt.hist(ag, bins = 6) 
plt.title("age") 
plt.show()

In [None]:
## trestbps histogram
tp= np.array(dataset1['trestbps']) 
plt.hist(tp, bins=6) 
plt.title("Trestbps:blood pressure") 
plt.show()

In [None]:
## chol histogram
ch= np.array(dataset1['chol'])
sns.distplot(ch, rug=True, hist=True)


In [None]:
## thalach histogram and distribution 
tl= np.array(dataset1['thalach'])
sns.distplot(tl, rug=True, hist=True)

In [None]:
## oldpeak histogram and distribution 
op= np.array(dataset1['oldpeak'])
sns.distplot(op, rug=True, hist=True)

In [None]:
## ca histogram
ca= np.array(dataset1['ca']) 
plt.hist(ca, bins=4) 
plt.title("histogram of ca") 
plt.show()

### Categorical variables 
##### Histograms + basic statistics

In [None]:
#Sex: sex (1 = male; 0 = female) 
tempo5 = dataset1['sex']
tempo5.value_counts().plot(kind="bar")

In [None]:
#Fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
tempo6 = dataset1['fbs']
tempo6.value_counts().plot(kind="bar")

In [None]:
#Slope: the slope of the peak exercise ST segment  
#Value 1: upsloping 
#Value 2: flat 
#Value 3: downsloping
tempo7 = dataset1['slop']
tempo7.value_counts().plot(kind="bar")

In [None]:
#Cp: chest pain type
#Value 1: typical angina 
#Value 2: atypical angina 
#Value 3: non-anginal pain 
#Value 4: asymptomatic 
tempo8 = dataset1['cp']
tempo8.value_counts().plot(kind="bar")

In [None]:
#Exang: exercise induced angina (1 = yes; 0 = no) 
tempo9 = dataset1['exang']
tempo9.value_counts().plot(kind="bar")

In [None]:
#Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect 
tempo10 = dataset1['thal']
tempo10.value_counts().plot(kind="bar")

In [None]:
#Restecg: resting electrocardiographic results 
#Value 0: normal 
#Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) 
#Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria 
tempo11 = dataset1['restecg']
tempo11.value_counts().plot(kind="bar")

In [None]:
#Class: diagnosis of heart disease (angiographic disease status) 
#Value 0: < 50% diameter narrowing (Healthy)
#Value 1: > 50% diameter narrowing (Sick)
tempo12 = dataset1['pred_attribute']
tempo12.value_counts().plot(kind="bar")

# Multivariate analysis

In [None]:
from ggplot import *
p = ggplot(dataset1,aes(x='pred_attribute'))
p + geom_bar()+facet_wrap('sex')  #### relationship between pred_attribute and sex

In [None]:
##### age distribution for each pred_attribute class
p = ggplot(dataset1,aes(x='age'))
p+geom_histogram()+facet_wrap('pred_attribute')


In [None]:
p=ggplot(dataset1,aes(x='age',y='trestbps'))
p+geom_line()+facet_wrap('pred_attribute')

In [None]:
p=ggplot(dataset1,aes(x='age',y='thalach'))
p+geom_line()+facet_wrap('pred_attribute')

In [None]:
p = ggplot(dataset1,aes(x='pred_attribute'))
p + geom_bar()+facet_wrap('cp') 

In [None]:
# p = ggplot(dataset1,aes(x='pred_attribute'))
# p + geom_bar()+facet_wrap('fbs') 
# ggplot(data)+geom_histogram(aes(x=price, fill=cut), position="dodge")

# Data processing
Outliers and class balance

In [None]:
# ## Boxplots of all continuous variable
# continuas=["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]
# dataset[continuas].boxplot(return_type='axes', figsize=(12,8))
# plt.show()

In [None]:
# ### show the patients whose trestbps above 180
# print(dataset[dataset['trestbps']>=180])


In [None]:
# ### show the patients whose chol is above 400
# print(dataset[dataset['chol']>400])

In [None]:
# ### show the patients whose thalach is below 90
# print(dataset[dataset['thalach']<90])

In [None]:
# ### show the patients whose oldpeak is above 5
# print(dataset[dataset['oldpeak']>5])


In [None]:
# #### delete outliers by mean
# dataset1=dataset1.drop([83,126,188,201,231,48,121,152,181,175,245,91,123])


In [None]:
#import matplotlib.pyplot as plt
#import numpy as np
#from sklearn.cluster import KMeans
#data=dataset1
#estimator = KMeans(n_clusters=3)
#estimator.fit(data)  # clustering
#label_pred = estimator.labels_  # get the cluster labels
#centroids = estimator.cluster_centers_ 
#inertia = estimator.inertia_ 
#mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr'] 
#color = 0
#j = 0 
#for i in label_pred:
    #plt.plot([data[j:j+1,0]], [data[j:j+1,1]], mark[i], markersize = 5)
    #j +=1
#plt.show()

# correlation between variables

In [None]:
dataset2.corr()

### 2. Define training and test samples. 

The Cleveland data set available from the UCI repository has 303 samples; the training and test data sets were randomly selected, with 20% of the original data set held out as the test data set. The relative proportions of the classes of interest (disease/no disease) in both sets were checked to be similar.

In [None]:
### Extract features and labels from dataset for local testing:

dataset1.dropna(inplace=True, axis=0, how="any")
X=dataset1.loc[:, "age":"thal" ]
Y=dataset1["pred_attribute"]

In [None]:
# evaluate the model by splitting into train and test sets  #Edit by ryan, we aim to do 3 traditional sets in the end, this first split is 80/20
features_train, features_test, labels_train, labels_test = train_test_split(X, Y, test_size=0.2, random_state=0)


In [None]:
import collections

# Class counts in the training labels
print(collections.Counter(labels_train))

# Class counts in the test labels
print(collections.Counter(labels_test))

In [None]:
# Check
print(len(features_train)/(len(features_train)+ len(features_test)))
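
Instead of checking the class proportions after a purely random split, the split itself can be stratified. The sketch below uses synthetic data (`X_toy`, `y_toy` are hypothetical stand-ins for `X` and `Y`) and the `stratify` argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 70 of class 0 and 30 of class 1
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 70 + [1] * 30)

# stratify=y_toy keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy)

train_share = y_tr.mean()  # share of class 1 in the training part
test_share = y_te.mean()   # share of class 1 in the test part
print(train_share, test_share)
```

With several rare classes, stratification may fail if a class has fewer than two samples, so it is worth checking the class counts first.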

We have a relatively small dataset. Therefore, we should do our feature selection based on a cross-validated set. We will check this assumption by comparing the scores on a cross-validated set vs. the simple split.
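
One way to make that comparison concrete is sketched below on synthetic data. `make_classification` and the names `X_toy`, `y_toy`, `model` are assumptions for illustration, not the notebook's data or final model.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 13 features, roughly the size of the Cleveland set
X_toy, y_toy = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Score from one simple 85/15 split
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.15, random_state=0, stratify=y_toy)
single_score = model.fit(X_tr, y_tr).score(X_te, y_te)

# Scores from 5-fold stratified cross-validation on the same data
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_toy, y_toy, cv=cv)

print("single split:", single_score)
print("cross-validated: mean", cv_scores.mean(), "std", cv_scores.std())
```

The spread of the fold scores gives a rough idea of how much a single split can mislead on a dataset this small.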

In [None]:
features_train_cross, features_test_cross, labels_train_cross, labels_test_cross = train_test_split(X, Y, test_size=0.15, random_state=0)



### SMOTE for SVM - Balancing only on the training set, not the validation set  [This is for the traditional training -not the cross validated one]

In [None]:
#further divide the 'traditional' non-cross set 85/15 into a pure training set and a validation set
features_train_notoversampled, features_validate, labels_train_notoversampled, labels_validate = train_test_split(features_train, labels_train, test_size = .15, random_state=0)

sm = SMOTE(random_state=0, ratio = 1.0, kind= 'svm' )
#x_train_res, y_train_res = sm.fit_sample(x_train, y_train)
features_train_oversampled, labels_train_oversampled = sm.fit_sample(features_train_notoversampled, labels_train_notoversampled)

#re-enter into original variables
##features_train = features_train_oversampled
##labels_train = labels_train_oversampled

#Below 2 lines if we want to want to force the array back into dataframe    
##features_train = pd.DataFrame(features_train_oversampled,columns=["age","sex","cp","trestbps","chol","fbs","restecg","thalach","exang","oldpeak","slop","ca","thal"])
##labels_train = pd.DataFrame(labels_train_oversampled,columns=["pred_attribute"])
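
To illustrate what oversampling does, here is a minimal NumPy sketch of the basic SMOTE interpolation step. It is not the SVM variant selected with `kind='svm'` above and not the imbalanced-learn implementation; `smote_like_samples` and its parameters are hypothetical names.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a minority
    sample and one of its k nearest minority neighbours (basic SMOTE idea)."""
    rng = np.random.RandomState(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    # Pairwise squared distances within the minority class
    diff = X_minority[:, None, :] - X_minority[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.randint(len(X_minority))        # random minority sample
        neigh = neighbours[base, rng.randint(k)]   # one of its k neighbours
        gap = rng.uniform(0.0, 1.0)                # position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neigh] - X_minority[base])
    return synthetic

# Hypothetical minority class with four two-dimensional samples
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_points = smote_like_samples(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies between two real minority samples, oversampling adds no new information; it only changes how much weight the classifier gives to the minority region.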

### 3. Make models

#### 3.1 make pipeline

##### scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import QuantileTransformer

SVMs need feature scaling: https://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

MinMaxScaler rescales the data set such that all feature values are in the range [0, 1]. However, this scaling compresses all inliers into a narrow range when large outliers are present, so MinMaxScaler is sensitive to outliers.

In [None]:
scaler = preprocessing.MinMaxScaler()

Unlike the previous scaler, the centering and scaling statistics of RobustScaler are based on percentiles and are therefore not influenced by a small number of very large marginal outliers. Consequently, the resulting range of the transformed feature values is larger than for the previous scaler and, more importantly, the ranges are approximately similar across features. Note that the outliers themselves are still present in the transformed data. If separate outlier clipping is desirable, a non-linear transformation is required (such as the QuantileTransformer described next).

In [None]:
Robust_scaler = preprocessing.RobustScaler(quantile_range=(25, 75))
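
The difference between the two scalers is easiest to see on a tiny column with one large outlier. The values below are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# One feature, four inliers and one large outlier (made-up values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Min-max scaling: (x - min) / (max - min), so the outlier defines the range
minmax = MinMaxScaler().fit_transform(values).ravel()

# Robust scaling: (x - median) / IQR, so the outlier does not affect the scale
robust = RobustScaler(quantile_range=(25, 75)).fit_transform(values).ravel()

print("min-max:", minmax)
print("robust: ", robust)
```

With min-max scaling the four inliers end up squeezed close to 0, while robust scaling keeps them spread around 0 and leaves the outlier far away.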

QuantileTransformer applies a non-linear transformation such that the probability density function of each feature is mapped to a uniform distribution. In this case, all the data are mapped to the range [0, 1], including the outliers, which can no longer be distinguished from the inliers. QuantileTransformer has an additional output_distribution parameter that allows matching a Gaussian distribution instead of a uniform distribution.


In [None]:
Quantile_scalar = preprocessing.QuantileTransformer(output_distribution='normal')
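
As a small sketch of the uniform mapping described above, the block below transforms synthetic data with a single extreme value. `n_quantiles=100` is an arbitrary choice, kept below the number of samples.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# 199 standard-normal values plus one extreme outlier (synthetic data)
rng = np.random.RandomState(0)
values = np.concatenate([rng.normal(size=199), [1000.0]]).reshape(-1, 1)

qt = QuantileTransformer(n_quantiles=100, output_distribution="uniform",
                         random_state=0)
uniform = qt.fit_transform(values).ravel()

print("range after transform:", uniform.min(), uniform.max())
print("transformed outlier:", uniform[-1])
```

The outlier is mapped to the upper end of the [0, 1] range, so its distance from the rest of the data is lost.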

##### Selecting features

SelectKBest is a way to select the most powerful features. It is also possible to do this manually; in my experience this may improve the results drastically.

In [None]:
from sklearn.feature_selection import SelectKBest

skb = SelectKBest(k = 2)
skb3 = SelectKBest(k = 3)
skb4 = SelectKBest(k = 4)
skb7 = SelectKBest(k = 7)
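
A short sketch of what SelectKBest does with its default score function (`f_classif`, the ANOVA F-value), using synthetic data in which only one column carries class information. All names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
y_toy = np.array([0] * 50 + [1] * 50)

# Column 1 depends on the class; the other three columns are pure noise
informative = y_toy + rng.normal(scale=0.3, size=100)
noise = rng.normal(size=(100, 3))
X_toy = np.column_stack([noise[:, 0], informative, noise[:, 1], noise[:, 2]])

selector = SelectKBest(score_func=f_classif, k=1).fit(X_toy, y_toy)
selected = selector.get_support(indices=True)  # indices of the kept columns

print("F-scores:", selector.scores_)
print("selected column indices:", selected)
```

Inspecting `scores_` on the real training data shows how sharply the features are separated, which can help when choosing between the `k` values defined above.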

#### 3.2 Selecting a kernel & its parameters

##### Start with RBF, because we don't have a lot of features.

In [None]:
# Nice visualisations:
    
# http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html 

import itertools
import numpy as np
import matplotlib.pyplot as plt

from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# import data
X = np.array(features_test)
y = np.array(labels_test)
class_names = np.array(['No disease', 'Disease 1', 'Disease 2', 'Disease 3', 'Disease 4'],
      dtype='<U10')

# # Split the data into a training set and a test set
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# # Run classifier, using a model that is too regularized (C too low) to see
# # the impact on the results
# classifier = svm.SVC(kernel='linear', C=0.01)
# y_pred = classifier.fit(X_train, y_train).predict(X_test)


def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
#     plt.ylabel('True label')
#     plt.xlabel('Predicted label')
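
The normalization used in `plot_confusion_matrix` divides each row by its sum, so every row shows the fraction of a true class assigned to each predicted class. A small worked example with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for three classes
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
cm_norm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]  # each row sums to 1

print(cm)
print(cm_norm)
```

The diagonal of the normalized matrix is the per-class recall, which is more informative than raw counts when the classes are imbalanced.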

In [None]:
from sklearn.metrics import accuracy_score
# sklearn.cross_validation and sklearn.grid_search were replaced by sklearn.model_selection
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn import svm

def checkmetrics(pred, labels_test, name):
    # scikit-learn metrics expect the true labels first, then the predictions
    print('The accuracy of', name, 'is: ', accuracy_score(labels_test, pred))
    # print 'if everyone had 0 score: ', float(float(len(pred))-float(numberpoi))/float(len(pred))
#     matrix = confusion_matrix(labels_test, pred)
#     print('There are', matrix[0][0], 'healthy people correctly identified vs', matrix[2][2] +matrix[3][3] +matrix[4][4] +matrix[1][1], 'sick ones. See:\n', matrix)
    print(classification_report(labels_test, pred))
    
   

#     print('precision score:', precision_score( pred, labels_test))
#     if precision_score(pred, labels_test) < recall_score(pred, labels_test):
#         print('precision < recall, so higher chance on POIs get identified, but also more false positives')
#     if precision_score(pred, labels_test) > recall_score(pred, labels_test):
#         print('precision > recall, so lower chance on POIs get identified, but also less false positives')
#     print('f1 score: ', f1_score(pred, labels_test), '\n\n')
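
The argument order matters for these metrics: scikit-learn expects `(y_true, y_pred)`. Accuracy is symmetric, but precision and recall swap roles when the arguments are swapped, as this small example with made-up binary labels shows.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up binary labels: 4 sick (1) and 4 healthy (0) patients
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)          # TP / (TP + FP) = 2 / 2
recall = recall_score(y_true, y_pred)                # TP / (TP + FN) = 2 / 4
swapped_precision = precision_score(y_pred, y_true)  # equals the recall above

print(precision, recall, swapped_precision)
```

For a screening task, recall on the sick classes is usually the number to watch, so the order is worth double-checking whenever a report looks surprising.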

In [None]:
### DONE

clf = SVC(kernel="rbf")
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'Without SMOTE for training- Validation - support vector machine, Radial Basis Function')

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_validate, pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Validation Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Validation Normalized confusion matrix')
plt.show()

In [None]:
#-----------------------------------------

pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'Without SMOTE for training - Test - support vector machine, Radial Basis Function')

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
plt.show()


#-----------------------------------------

In [None]:
### DONE

clf = SVC(kernel="rbf")
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'With SMOTE for training- Validation - support vector machine, Radial Basis Function')

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_validate, pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Validation Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Validation Normalized confusion matrix')
plt.show()

In [None]:
#-----------------------------------------

pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'With SMOTE for training - Test - support vector machine, Radial Basis Function')

# Compute confusion matrix
cnf_matrix = confusion_matrix(labels_test, pred)
np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')
# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')
plt.show()


#-----------------------------------------

The scaled and feature-selected variants seem to be just as bad:

In [None]:
clf =  Pipeline(steps=[('scaling', scaler), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'Without SMOTE - Validate support vector machine, Radial Basis Function scaled')

clf =  Pipeline(steps=[('scaling',Robust_scaler), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'Without SMOTE - Validate support vector machine, Radial Basis Function Robust scaled')

clf =  Pipeline(steps=[('scaling', Quantile_scalar), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'Without SMOTE - Validate support vector machine, Radial Basis Function Normal scaled')

clf =  Pipeline(steps=[('scaling',scaler),("SKB", skb), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'Without SMOTE - Validate support vector machine, Radial Basis Function scaled & selected')

In [None]:
clf =  Pipeline(steps=[('scaling', scaler), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'Without SMOTE - Test support vector machine, Radial Basis Function scaled')

clf =  Pipeline(steps=[('scaling',Robust_scaler), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'Without SMOTE - Test support vector machine, Radial Basis Function Robust scaled')

clf =  Pipeline(steps=[('scaling', Quantile_scalar), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'Without SMOTE - Test support vector machine, Radial Basis Function Normal scaled')

clf =  Pipeline(steps=[('scaling',scaler),("SKB", skb), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'Without SMOTE - Test support vector machine, Radial Basis Function scaled & selected')

In [None]:
clf =  Pipeline(steps=[('scaling', scaler), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate support vector machine, Radial Basis Function scaled')

clf =  Pipeline(steps=[('scaling',Robust_scaler), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate support vector machine, Radial Basis Function Robust scaled')

clf =  Pipeline(steps=[('scaling', Quantile_scalar), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate support vector machine, Radial Basis Function Normal scaled')

clf =  Pipeline(steps=[('scaling',scaler),("SKB", skb), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate support vector machine, Radial Basis Function scaled & selected')

In [None]:
clf =  Pipeline(steps=[('scaling', scaler), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test support vector machine, Radial Basis Function scaled')

clf =  Pipeline(steps=[('scaling',Robust_scaler), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test support vector machine, Radial Basis Function Robust scaled')

clf =  Pipeline(steps=[('scaling', Quantile_scalar), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test support vector machine, Radial Basis Function Normal scaled')

clf =  Pipeline(steps=[('scaling',scaler),("SKB", skb), ('support vector machine', SVC(kernel="rbf"))])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test support vector machine, Radial Basis Function scaled & selected')

In [None]:
# trying poly
clf =  Pipeline(steps=[('scaling',scaler),("SKB", skb), ('support vector machine', SVC(kernel="poly"))])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'Without SMOTE - Validate -support vector machine, Poly, scaled & selected')

In [None]:
# trying poly
clf =  Pipeline(steps=[('scaling',scaler),("SKB", skb), ('support vector machine', SVC(kernel="poly"))])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'Without SMOTE - Test -support vector machine, Poly, scaled & selected')

In [None]:
# trying poly
clf =  Pipeline(steps=[('scaling',scaler),("SKB", skb), ('support vector machine', SVC(kernel="poly"))])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate -support vector machine, Poly, scaled & selected')

In [None]:
# trying poly
clf =  Pipeline(steps=[('scaling',scaler),("SKB", skb), ('support vector machine', SVC(kernel="poly"))])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test -support vector machine, Poly, scaled & selected')

The classifiers seem to put far too much weight on the first category because of its high number of samples.
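
Besides oversampling, one option that could be tried against this imbalance is the `class_weight="balanced"` setting of `SVC`, which scales the penalty `C` per class inversely to the class frequency. The sketch below uses synthetic two-class data (hypothetical names) and is not a result from this notebook.

```python
import numpy as np
from sklearn.metrics import recall_score
from sklearn.svm import SVC

# Synthetic imbalanced data: 180 samples of class 0, 20 of class 1
rng = np.random.RandomState(0)
X_major = rng.normal(loc=0.0, size=(180, 2))
X_minor = rng.normal(loc=1.5, size=(20, 2))
X_toy = np.vstack([X_major, X_minor])
y_toy = np.array([0] * 180 + [1] * 20)

plain = SVC(kernel="rbf").fit(X_toy, y_toy)
weighted = SVC(kernel="rbf", class_weight="balanced").fit(X_toy, y_toy)

# Recall on the minority class, measured on the training data for illustration only
recall_plain = recall_score(y_toy, plain.predict(X_toy))
recall_weighted = recall_score(y_toy, weighted.predict(X_toy))

print("per-class multipliers of C:", weighted.class_weight_)
print("minority recall, plain vs weighted:", recall_plain, recall_weighted)
```

Here the "balanced" weights are n_samples / (n_classes * class count), that is 200 / (2 * 180) for the majority class and 200 / (2 * 20) for the minority class.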

##### Try other people's classifiers:

In [None]:
#used some ideas from http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

#linear kernel
def svm_linear(features_train, features_test, labels_train, labels_test):

    clf = svm.SVC(
        C=1.0, 
        kernel='linear', 
        probability=False, 
        shrinking=True, 
        tol=1e-3, 
        verbose=False, 
        max_iter=-1, 
        decision_function_shape='ovr',
        random_state=None)

    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    print("Kernel: Linear")
    print("Performance: "  + str(clf.score(features_test, labels_test)))
    print("")
    return pred

#polynomial kernel from degrees 2 to 5
def svm_poly(features_train, features_test, labels_train, labels_test):

    for d in [2, 3, 4, 5]:

        clf = svm.SVC(
            C=1.0,
            kernel='poly', 
            degree=d,
            gamma='auto',
            coef0=0.0,
            probability=False,
            shrinking=True,
            tol=1e-3,
            verbose=False,
            max_iter=400000,
            decision_function_shape='ovr',
            random_state=None)
        clf.fit(features_train, labels_train)
        pred = clf.predict(features_test)
        print("Kernel: Polynomial")
        print("Degree: " + str(d))
        print("Performance: "  + str(clf.score(features_test, labels_test)))
        print("")
    return pred  # note: only the predictions of the last degree tried (5) are returned

#radial basis function kernel
def svm_rbf(features_train, features_test, labels_train, labels_test):

    clf = svm.SVC(
        C=1.0,
        kernel='rbf',
        gamma='auto',
        probability=False,
        shrinking=True,
        tol=1e-3,
        verbose=False,
        max_iter=-1,
        decision_function_shape='ovr',
        random_state=None)

    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    print("Kernel: Radial Basis Function")
    print("Performance: "  + str(clf.score(features_test, labels_test)))
    print("")
    return pred

#sigmoid function kernel
def svm_sigmoid(features_train, features_test, labels_train, labels_test):

    clf = svm.SVC(
        C=1.0,
        kernel='sigmoid',
        gamma='auto',
        coef0=0.0,
        probability=False,
        shrinking=True,
        tol=1e-3,
        verbose=False,
        max_iter=-1,
        decision_function_shape='ovr',
        random_state=None)
    clf.fit(features_train, labels_train)
    pred = clf.predict(features_test)
    print("Kernel: Sigmoid")
    print("Performance: "  + str(clf.score(features_test, labels_test)))
    print("")
    return pred
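
The kernel comparison done by the four helper functions can also be expressed as a single cross-validated grid search. The following is a sketch on synthetic data; the grid values and the step names `scaling` and `svm` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(
    n_samples=200, n_features=13, n_informative=5, random_state=0)

pipe = Pipeline(steps=[("scaling", MinMaxScaler()), ("svm", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svm__kernel": ["linear", "poly", "rbf", "sigmoid"],
    "svm__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_toy, y_toy)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

Because the scaler sits inside the pipeline, it is refit on the training part of every fold, which avoids leaking information from the held-out fold.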

In [None]:
#Run SVM with a linear kernel
pred = svm_linear(features_train_notoversampled, features_validate, labels_train_notoversampled, labels_validate)
checkmetrics(pred, labels_validate, 'Without SMOTE - Validate - linearSVM, from function')

#Run SVM with a polynomial kernel
pred = svm_poly(features_train_notoversampled, features_validate, labels_train_notoversampled, labels_validate)
checkmetrics(pred, labels_validate, 'Without SMOTE - Validate - polySVM, from function')

#Run SVM with a radial basis function kernel
pred = svm_rbf(features_train_notoversampled, features_validate, labels_train_notoversampled, labels_validate)
checkmetrics(pred, labels_validate, 'Without SMOTE - Validate - RBFSVM, from function')

#Run SVM with a sigmoid kernel
pred = svm_sigmoid(features_train_notoversampled, features_validate, labels_train_notoversampled, labels_validate)
checkmetrics(pred, labels_validate, 'Without SMOTE - Validate - SIGSVM, from function')

In [None]:
#Run SVM with a linear kernel
pred = svm_linear(features_train_notoversampled, features_test, labels_train_notoversampled, labels_test)
checkmetrics(pred, labels_test, 'Without SMOTE - Test - linearSVM, from function')

#Run SVM with a polynomial kernel
pred = svm_poly(features_train_notoversampled, features_test, labels_train_notoversampled, labels_test)
checkmetrics(pred, labels_test, 'Without SMOTE - Test - polySVM, from function')

#Run SVM with a radial basis function kernel
pred = svm_rbf(features_train_notoversampled, features_test, labels_train_notoversampled, labels_test)
checkmetrics(pred, labels_test, 'Without SMOTE - Test - RBFSVM, from function')

#Run SVM with a sigmoid kernel
pred = svm_sigmoid(features_train_notoversampled, features_test, labels_train_notoversampled, labels_test)
checkmetrics(pred, labels_test, 'Without SMOTE - Test - SIGSVM, from function')

In [None]:
#Run SVM with a linear kernel
pred = svm_linear(features_train_oversampled, features_validate, labels_train_oversampled, labels_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate - linearSVM, from function')

#Run SVM with a polynomial kernel
pred = svm_poly(features_train_oversampled, features_validate, labels_train_oversampled, labels_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate - polySVM, from function')

#Run SVM with a radial basis function kernel
pred = svm_rbf(features_train_oversampled, features_validate, labels_train_oversampled, labels_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate - RBFSVM, from function')

#Run SVM with a sigmoid kernel
pred = svm_sigmoid(features_train_oversampled, features_validate, labels_train_oversampled, labels_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate - SIGSVM, from function')

In [None]:
#Run SVM with a linear kernel
pred = svm_linear(features_train_oversampled, features_test, labels_train_oversampled, labels_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test - linearSVM, from function')

#Run SVM with a polynomial kernel
pred = svm_poly(features_train_oversampled, features_test, labels_train_oversampled, labels_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test - polySVM, from function')

#Run SVM with a radial basis function kernel
pred = svm_rbf(features_train_oversampled, features_test, labels_train_oversampled, labels_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test - RBFSVM, from function')

#Run SVM with a sigmoid kernel
pred = svm_sigmoid(features_train_oversampled, features_test, labels_train_oversampled, labels_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test - SIGSVM, from function')

Although the linear kernel performs slightly better here, the predictions still seem to focus heavily on the non-disease class. That is still not good enough, so the next step is automatic hyperparameter tuning with grid search.
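As a self-contained illustration of grid search that also watches the minority class, here is a minimal sketch. It does not use the notebook's data: the synthetic dataset, the 75/25 class ratio, the step names and the choice of `balanced_accuracy` as the scoring metric are all assumptions made for the example. Unlike the cells in this notebook, it wraps the whole pipeline in `GridSearchCV`, so the scaler is re-fitted on each training fold.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the heart data (assumption: about 25% positives)
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           weights=[0.75, 0.25], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Scaling sits inside the pipeline, so it is re-fitted on every CV training fold
pipe = Pipeline(steps=[("scaling", StandardScaler()), ("svc", SVC())])
param_grid = {"svc__kernel": ["poly", "rbf"], "svc__C": [1, 10]}

# balanced_accuracy is the mean of the per-class recalls, so a model that
# only predicts the majority class scores about 0.5 instead of about 0.75
search = GridSearchCV(pipe, param_grid, scoring="balanced_accuracy", cv=5)
search.fit(X_train, y_train)

pred = search.predict(X_val)
print("Best parameters:", search.best_params_)
print("Recall per class:", recall_score(y_val, pred, average=None))
```

If plain accuracy is preferred, dropping the `scoring` argument falls back to the estimator's default score.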

In [None]:
#### Can we use gridsearch for feature selection?

# Still not good: try some automatic tuning with grid search.
# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed in 0.20).
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':('poly', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()

# All features
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("Grid", grid)])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'No SMOTE - Validate - support vector machine, with gridsearch')

# Only the best 2 features (skb)
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("SKB", skb), ("Grid", grid)])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'No SMOTE - Validate - support vector machine, with gridsearch & only the best 2 features')

# Only the best 7 features (skb7)
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("SKB", skb7), ("Grid", grid)])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'No SMOTE - Validate - support vector machine, with gridsearch & only the best 7 features')

In [None]:
#### Can we use gridsearch for feature selection?

# Still not good: try some automatic tuning with grid search.
# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed in 0.20).
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':('poly', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()

# All features
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("Grid", grid)])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'No SMOTE - Test - support vector machine, with gridsearch')

# Only the best 2 features (skb)
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("SKB", skb), ("Grid", grid)])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'No SMOTE - Test - support vector machine, with gridsearch & only the best 2 features')

# Only the best 7 features (skb7)
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("SKB", skb7), ("Grid", grid)])
clf.fit(features_train_notoversampled, labels_train_notoversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'No SMOTE - Test - support vector machine, with gridsearch & only the best 7 features')

In [None]:
#### Can we use gridsearch for feature selection?

# Still not good: try some automatic tuning with grid search.
# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed in 0.20).
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':('poly', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()

# All features
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("Grid", grid)])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate - support vector machine, with gridsearch')

# Only the best 2 features (skb)
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("SKB", skb), ("Grid", grid)])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate - support vector machine, with gridsearch & only the best 2 features')

# Only the best 7 features (skb7)
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("SKB", skb7), ("Grid", grid)])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_validate)
checkmetrics(pred, labels_validate, 'With SMOTE - Validate - support vector machine, with gridsearch & only the best 7 features')

In [None]:
#### Can we use gridsearch for feature selection?

# Still not good: try some automatic tuning with grid search.
# GridSearchCV lives in sklearn.model_selection (the old sklearn.grid_search module was removed in 0.20).
from sklearn.model_selection import GridSearchCV

parameters = {'kernel':('poly', 'rbf'), 'C':[1, 10]}
svr = svm.SVC()

# All features
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("Grid", grid)])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test - support vector machine, with gridsearch')

# Only the best 2 features (skb)
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("SKB", skb), ("Grid", grid)])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test - support vector machine, with gridsearch & only the best 2 features')

# Only the best 7 features (skb7)
grid = GridSearchCV(svr, parameters)
clf = Pipeline(steps=[('scaling', scaler), ("SKB", skb7), ("Grid", grid)])
clf.fit(features_train_oversampled, labels_train_oversampled)
pred = clf.predict(features_test)
checkmetrics(pred, labels_test, 'With SMOTE - Test - support vector machine, with gridsearch & only the best 7 features')

Maybe we were just unlucky and got a difficult split? The dataset is relatively small, so feature selection should be based on cross-validation rather than on a single split. Let's check whether the score stays the same under cross-validation.
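One way to base feature selection on cross-validation is to put the selector inside a pipeline and cross-validate the pipeline as a whole, so the features are re-selected on each training fold. The sketch below is only an illustration under stated assumptions: it uses synthetic data rather than the heart dataset, and the `SelectKBest` settings (`f_classif`, `k=7`) and the shuffled `StratifiedKFold` are hypothetical choices, not necessarily what `skb7` and the cells in this notebook use.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the heart data (assumption: 13 features, binary label)
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           random_state=0)

# Scaling and feature selection are pipeline steps, so both are fitted
# only on the training part of each fold and never see the held-out part
pipe = Pipeline(steps=[
    ("scaling", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif, k=7)),
    ("svc", SVC(kernel="linear", C=1)),
])

# Stratified folds keep the class ratio roughly equal across the 5 splits
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, y, cv=cv)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

Selecting features once on the full training set and cross-validating afterwards would let information from the held-out folds leak into the selection, which tends to make the scores look better than they are.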

In [None]:
# We have a relatively small dataset. Therefore, we should do our feature selection based on a
# cross-validated set. Let's check if the scoring is the same on a cross-validated set.

from sklearn.model_selection import cross_val_score

# Try a few values of the regularization parameter C with a linear kernel
for C in (1, 0.2, 0.1):
    clf = svm.SVC(kernel='linear', C=C)
    # 5-fold cross-validation
    scores = cross_val_score(clf, features_train_cross, labels_train_cross, cv=5)
    print("C=%s Accuracy: %0.2f (+/- %0.2f)" % (C, scores.mean(), scores.std() * 2))

In [None]:
# That accuracy seems a bit higher. We have to find a way to get the confusion matrix from the
# cross-validated predictions and run it on all the above methods (using gridsearch).

# If that does not work, we need to somehow balance the data.

from sklearn.model_selection import cross_val_predict

clf = svm.SVC(kernel='linear', C=1)
pred = cross_val_predict(clf, features_train_cross, labels_train_cross, cv=5)

len(pred)
# clf.fit(features_train, labels_train)
# pred = clf.predict(features_test)
# checkmetrics(pred, labels_test_cross, 'support vector machine, with gridsearch')

# pred = cross_val_predict(clf, features_train, .target, cv=10)
# # split into 5
# scores = cross_val_score(clf, features_train_cross, labels_train_cross, cv=5)
# clf.fit(features_train, labels_train)                                      
# print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

# pred = clf.predict(features_test_cross)
# checkmetrics(pred, labels_test_cross, 'using cross validation')
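A possible way to get a confusion matrix out of cross-validation is to pass the out-of-fold predictions from `cross_val_predict` to `confusion_matrix`. The following is a minimal sketch on synthetic, imbalanced data; the dataset and its 75/25 class ratio are assumptions, and the notebook's own `checkmetrics` helper is not used because its definition is not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the heart data (assumption)
X, y = make_classification(n_samples=300, n_features=13, n_informative=6,
                           weights=[0.75, 0.25], random_state=0)

clf = SVC(kernel="linear", C=1)

# Each sample is predicted by the model trained on the folds that exclude it,
# so pred has one out-of-fold prediction per row of X
pred = cross_val_predict(clf, X, y, cv=5)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y, pred)
print(cm)
print(classification_report(y, pred))
```

Because every sample appears exactly once in `pred`, the matrix entries sum to the number of samples, and the per-class recall shows whether the model still favours the non-disease class.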

In [None]:
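# caret provides createDataPartition, trainControl, train and confusionMatrix;
# pROC provides roc
library(caret)
library(pROC)

feature.names <- names(heart.data)

# classProbs = TRUE in trainControl requires factor levels that are valid R names,
# so relabel every factor column with make.names()
for (f in feature.names) {
  if (class(heart.data[[f]])=="factor") {
    levels <- unique(c(heart.data[[f]]))
    heart.data[[f]] <- factor(heart.data[[f]],
                   labels=make.names(levels))
  }
}
# 70/30 train/test split, stratified on the outcome column num
set.seed(10)
inTrainRows <- createDataPartition(heart.data$num,p=0.7,list=FALSE)
trainData2 <- heart.data[inTrainRows,]
testData2 <-  heart.data[-inTrainRows,]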


fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)

set.seed(10)
gbmModel <- train(num ~ ., data = trainData2,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE,
                 tuneGrid = gbmGrid,
                 ## Specify which metric to optimize
                 metric = "ROC")
gbmPrediction <- predict(gbmModel, testData2)
gbmPredictionprob <- predict(gbmModel, testData2, type='prob')[2]
gbmConfMat <- confusionMatrix(gbmPrediction, testData2[,"num"])
# Test-set AUC (from the ROC curve) and accuracy
AUC$gbm <- roc(as.numeric(testData2$num),as.numeric(as.matrix((gbmPredictionprob))))$auc
Accuracy$gbm <- gbmConfMat$overall['Accuracy']

In [None]:
set.seed(10)
svmModel <- train(num ~ ., data = trainData2,
                 method = "svmRadial",
                 trControl = fitControl,
                 preProcess = c("center", "scale"),
                 tuneLength = 8,
                 metric = "ROC")
svmPrediction <- predict(svmModel, testData2)
svmPredictionprob <- predict(svmModel, testData2, type='prob')[2]
svmConfMat <- confusionMatrix(svmPrediction, testData2[,"num"])
# Test-set AUC (from the ROC curve) and accuracy
AUC$svm <- roc(as.numeric(testData2$num),as.numeric(as.matrix((svmPredictionprob))))$auc
Accuracy$svm <- svmConfMat$overall['Accuracy'] 