# Question 1. Predicting Heart Disease

### a. Read the dataset file “Heart_s.csv” (you should download it from CSNS), and assign it to a Pandas DataFrame.

In [1]:
import pandas as pd

file = pd.read_csv('Heart_s.csv')

### b. Check out the dataset. As you see, the dataset contains a number of features including both contextual and biological factors (e.g. age, gender, vital signs, ...). The last column “AHD” is the label with “Yes” meaning that a human subject has Heart Disease, and “No” meaning that the subject does not have Heart Disease.

In [2]:
print(file)

     Age Gender     ChestPain  RestBP  Chol  RestECG  MaxHR  Oldpeak  \
0     63      f       typical     145   233        2    150      2.3   
1     67      f  asymptomatic     160   286        2    108      1.5   
2     67      f  asymptomatic     120   229        2    129      2.6   
3     37      f    nonanginal     130   250        0    187      3.5   
4     41      m    nontypical     130   204        2    172      1.4   
5     56      f    nontypical     120   236        0    178      0.8   
6     62      m  asymptomatic     140   268        2    160      3.6   
7     57      m  asymptomatic     120   354        0    163      0.6   
8     63      f  asymptomatic     130   254        2    147      1.4   
9     53      f  asymptomatic     140   203        2    155      3.1   
10    57      f  asymptomatic     140   192        0    148      0.4   
11    56      m    nontypical     140   294        2    153      1.3   
12    56      f    nonanginal     130   256        2    142     

### c. As you see, there are at least 3 categorical features in the dataset (Gender, ChestPain, Thal). Let’s ignore these categorical features for now, only keep the numerical features and build your feature matrix and label vector.
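
As an alternative to listing the numerical columns by hand, pandas can pick them by dtype. The following is a minimal sketch on a made-up toy frame (the rows and the name `toy` are assumptions, not the real dataset). It relies on the label column holding strings, as “AHD” does here, so that the label is not picked up as a feature.

```python
import pandas as pd

# Hypothetical toy frame that mimics a few of the dataset's columns.
toy = pd.DataFrame({
    "Age": [63, 67, 41],
    "Gender": ["f", "f", "m"],
    "Oldpeak": [2.3, 1.5, 1.4],
    "AHD": ["No", "Yes", "No"],
})

# Keep only the numeric columns as features; string columns are dropped.
features = toy.select_dtypes(include="number")
labels = toy["AHD"]

print(list(features.columns))
```

Listing the columns explicitly, as the answer cell does, is more verbose but makes the chosen features obvious to a reader.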

In [3]:
feature_col = ['Age', 'RestBP', 'Chol', 'RestECG', 'MaxHR', 'Oldpeak' ]

x = file[feature_col]

y = file['AHD']
#print(x.head())
print(x)
print(y)





     Age  RestBP  Chol  RestECG  MaxHR  Oldpeak
0     63     145   233        2    150      2.3
1     67     160   286        2    108      1.5
2     67     120   229        2    129      2.6
3     37     130   250        0    187      3.5
4     41     130   204        2    172      1.4
5     56     120   236        0    178      0.8
6     62     140   268        2    160      3.6
7     57     120   354        0    163      0.6
8     63     130   254        2    147      1.4
9     53     140   203        2    155      3.1
10    57     140   192        0    148      0.4
11    56     140   294        2    153      1.3
12    56     130   256        2    142      0.6
13    44     120   263        0    173      0.0
14    52     172   199        0    162      0.5
15    57     150   168        0    174      1.6
16    48     110   229        0    168      1.0
17    54     140   239        0    160      1.2
18    48     130   275        0    139      0.2
19    49     130   266        0    171  

### d. Split the dataset into testing and training sets with the following parameters: test_size=0.3, random_state=3.

In [4]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 3)
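
To see what `test_size` and `random_state` do, here is a small sketch on ten made-up rows (the arrays are illustrative assumptions, not the heart data). With `test_size=0.3`, three of the ten rows go to the test set, and reusing the same `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, plus alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3)

# Splitting again with the same random_state gives an identical partition.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=3)

print(X_train.shape, X_test.shape)
print(np.array_equal(X_test, X_test2))
```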

### e. Use KNN (with k=5), Decision Tree, and Logistic Regression Classifiers to predict Heart Disease based on the training/testing datasets that you built in part (d). Then check, compare, and report the accuracy of these 3 classifiers. Which one is the best? Which one is the worst?

In [67]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(x_train, y_train)
y_predict = knn.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Knn: ", accuracy)

logreg = LogisticRegression()
logreg.fit(x_train,y_train)
y_predict = logreg.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Log Reg: ", accuracy)

decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train,y_train)
y_predict = decisiontree.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Decision Tree: ", accuracy)

# Logistic Regression was the best and KNN was the worst. The Decision Tree could
# occasionally beat both, since its accuracy fluctuates between runs and with the train/test split.


Knn:  0.626373626374
Log Reg:  0.846153846154
Decision Tree:  0.725274725275
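
The note above says the Decision Tree score can fluctuate. One common way to make it repeatable is to fix the tree's `random_state`. The sketch below illustrates this on synthetic data from `make_classification` (the data and the seed values are assumptions for illustration, not the heart dataset).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with six numeric features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3)

scores = []
for _ in range(3):
    # A fixed random_state makes the tree construction deterministic.
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, tree.predict(X_test)))

print(scores)  # the three accuracies are identical
```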


### f. Now, we want to use the categorical features as well! To this end, we have to perform a feature engineering process called OneHotEncoding for the categorical features. To do this, each categorical feature should be replaced with dummy columns in the feature table (one column for each possible value of the categorical feature), and then encoded in a binary manner such that only one of the dummy columns can take “1” at a time (and zero for the rest). For example, “Gender” can take two values, “m” and “f”. Thus, we need to replace this feature (in the feature table) with 2 columns titled “m” and “f”. Wherever we have a male subject, we put “1” and “0” in the columns “m” and “f”. Wherever we have a female subject, we put “0” and “1” in the columns “m” and “f”. (Hint: you will need 4 columns to encode “ChestPain” and 3 columns to encode “Thal”).
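
To make the encoding concrete, here is a minimal sketch using `pd.get_dummies` on a made-up three-row frame (the rows and the name `toy` are assumptions, not the real dataset). Passing `dtype=int` is a choice made here so that the dummy columns hold 0/1, since newer pandas versions produce True/False by default.

```python
import pandas as pd

# Hypothetical toy frame, just to show how one categorical column is encoded.
toy = pd.DataFrame({
    "Age": [63, 41, 57],
    "Gender": ["f", "m", "m"],
})

# Replace "Gender" with one 0/1 column per category: Gender_f and Gender_m.
encoded = pd.get_dummies(toy, columns=["Gender"], dtype=int)

print(encoded)
```

Each row has exactly one “1” across the two dummy columns, which is the one-hot property described above.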

In [68]:
new_data_frame = pd.get_dummies(file, columns = ['Gender', 'ChestPain', 'Thal'])
print(file.head())
print(new_data_frame.head())

   Age Gender     ChestPain  RestBP  Chol  RestECG  MaxHR  Oldpeak  \
0   63      f       typical     145   233        2    150      2.3   
1   67      f  asymptomatic     160   286        2    108      1.5   
2   67      f  asymptomatic     120   229        2    129      2.6   
3   37      f    nonanginal     130   250        0    187      3.5   
4   41      m    nontypical     130   204        2    172      1.4   

         Thal  AHD  
0       fixed   No  
1      normal  Yes  
2  reversable  Yes  
3      normal   No  
4      normal   No  
   Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  AHD  Gender_f  Gender_m  \
0   63     145   233        2    150      2.3   No         1         0   
1   67     160   286        2    108      1.5  Yes         1         0   
2   67     120   229        2    129      2.6  Yes         1         0   
3   37     130   250        0    187      3.5   No         1         0   
4   41     130   204        2    172      1.4   No         0         1   

   Chest

### g. Repeat parts (d) and (e) with the new dataset that you built in part (f). How does the prediction accuracy change for each method?

In [73]:
feature_cols = ['Age', 'RestBP', 'Chol', 'RestECG', 'MaxHR', 'Oldpeak', 'Gender_f', 'Gender_m', 'ChestPain_asymptomatic', 'ChestPain_nonanginal', 'ChestPain_nontypical', 'ChestPain_typical', 'Thal_fixed', 'Thal_normal', 'Thal_reversable']

x = new_data_frame[feature_cols]
y = new_data_frame['AHD']

print(x.head())
print(y.head())

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 3)

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(x_train, y_train)
y_predict = knn.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Knn: ", accuracy)

logreg = LogisticRegression()
logreg.fit(x_train,y_train)
y_predict = logreg.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Log Reg: ", accuracy)

decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train,y_train)
y_predict = decisiontree.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Decision Tree: ", accuracy)

# The accuracies stay more or less the same as in part (e).

   Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  Gender_f  Gender_m  \
0   63     145   233        2    150      2.3         1         0   
1   67     160   286        2    108      1.5         1         0   
2   67     120   229        2    129      2.6         1         0   
3   37     130   250        0    187      3.5         1         0   
4   41     130   204        2    172      1.4         0         1   

   ChestPain_asymptomatic  ChestPain_nonanginal  ChestPain_nontypical  \
0                       0                     0                     0   
1                       1                     0                     0   
2                       1                     0                     0   
3                       0                     1                     0   
4                       0                     0                     1   

   ChestPain_typical  Thal_fixed  Thal_normal  Thal_reversable  
0                  1           1            0                0  
1               

### h. Now, repeat part (e) with the new dataset that you built in part (f), this time using Cross-Validation. Thus, rather than splitting the dataset into testing and training sets, use 10-fold Cross-Validation (as we learned in Lab4) to evaluate the classification methods and report the final prediction accuracy.

In [74]:
from sklearn.model_selection import cross_val_score

feature_cols = ['Age', 'RestBP', 'Chol', 'RestECG', 'MaxHR', 'Oldpeak', 'Gender_f', 'Gender_m', 'ChestPain_asymptomatic', 'ChestPain_nonanginal', 'ChestPain_nontypical', 'ChestPain_typical', 'Thal_fixed', 'Thal_normal', 'Thal_reversable']

x = new_data_frame[feature_cols]
y = new_data_frame['AHD']


knn = KNeighborsClassifier(n_neighbors = 5)

accuracy = cross_val_score(knn, x, y, cv = 10)
print("knn: ", accuracy.mean())

logreg = LogisticRegression()
accuracy = cross_val_score(logreg, x, y, cv = 10)
print("Logistic Regression: ", accuracy.mean())

decision_tree = DecisionTreeClassifier()
accuracy = cross_val_score(decision_tree, x, y, cv = 10)
print("Decision tree: ", accuracy.mean())




knn:  0.643926585095
Logistic Regression:  0.811568409344
Decision tree:  0.719410456062
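
For intuition about what `cross_val_score(..., cv=10)` computes, the sketch below reproduces it by hand on synthetic data (the data from `make_classification` is an assumption, not the heart dataset). For a classifier and an integer `cv`, scikit-learn uses stratified folds, so a manual `StratifiedKFold` loop should give the same ten scores.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data with six numeric features and a binary label.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)

# Manual 10-fold loop: train on nine folds, score on the held-out fold.
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in folds.split(X, y):
    knn.fit(X[train_idx], y[train_idx])
    manual_scores.append(knn.score(X[test_idx], y[test_idx]))

# The one-line equivalent used in the answer cell above.
auto_scores = cross_val_score(knn, X, y, cv=10)

print(np.mean(manual_scores), auto_scores.mean())
```

Every sample is used for testing exactly once, which is why the averaged score is usually a steadier estimate than a single train/test split.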
