# Question 1. Predicting Heart Disease

### a. Read the dataset file “Heart_s.csv” (you should download it from CSNS), and assign it to a Pandas DataFrame.

In [32]:
import pandas as pd

file = pd.read_csv('Heart_s.csv')

### b. Check out the dataset. As you see, the dataset contains a number of features including both contextual and biological factors (e.g. age, gender, vital signs, ...). The last column “AHD” is the label with “Yes” meaning that a human subject has Heart Disease, and “No” meaning that the subject does not have Heart Disease.

In [33]:
file[::20]

Unnamed: 0,Age,Gender,ChestPain,RestBP,Chol,RestECG,MaxHR,Oldpeak,Thal,AHD
0,63,f,typical,145,233,2,150,2.3,fixed,No
20,64,f,typical,110,211,2,144,1.8,normal,No
40,65,m,asymptomatic,150,225,2,114,1.0,reversable,Yes
60,51,m,asymptomatic,130,305,0,142,1.2,reversable,Yes
80,45,f,asymptomatic,104,208,2,148,3.0,normal,No
100,45,f,asymptomatic,115,260,2,185,0.0,normal,No
120,48,f,asymptomatic,130,256,2,150,0.0,reversable,Yes
140,59,f,nontypical,140,221,0,164,0.0,normal,No
160,46,f,nontypical,101,197,0,156,0.0,reversable,No
180,48,f,asymptomatic,124,274,2,166,0.5,reversable,Yes


### c. As you see, there are at least 3 categorical features in the dataset (Gender, ChestPain, Thal). Let’s ignore these categorical features for now, only keep the numerical features and build your feature matrix and label vector.

In [34]:
# [[]] returns column(s) as a new DataFrame
new_df = file[ ['Age', 'RestBP', 'Chol', 'RestECG', 'MaxHR', 'Oldpeak', 'AHD'] ] # just for cleaner visual
new_df[::20]

Unnamed: 0,Age,RestBP,Chol,RestECG,MaxHR,Oldpeak,AHD
0,63,145,233,2,150,2.3,No
20,64,110,211,2,144,1.8,No
40,65,150,225,2,114,1.0,Yes
60,51,130,305,0,142,1.2,Yes
80,45,104,208,2,148,3.0,No
100,45,115,260,2,185,0.0,No
120,48,130,256,2,150,0.0,Yes
140,59,140,221,0,164,0.0,No
160,46,101,197,0,156,0.0,No
180,48,124,274,2,166,0.5,Yes


### d. Split the dataset into testing and training sets with the following parameters: test_size=0.3, random_state=3.

In [35]:
from sklearn.model_selection import train_test_split

feature_col = ['Age', 'RestBP', 'Chol', 'RestECG', 'MaxHR', 'Oldpeak' ]
x = new_df[feature_col]
y = new_df['AHD']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 3)

### e. Use KNN (with k=5), Decision Tree, and Logistic Regression Classifiers to predict Heart Disease based on the training/testing datasets that you built in part (d). Then check, compare, and report the accuracy of these 3 classifiers. Which one is the best? Which one is the worst?

In [36]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(x_train, y_train)
y_predict = knn.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Knn: ", accuracy)

logreg = LogisticRegression()
logreg.fit(x_train,y_train)
y_predict = logreg.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Log Reg: ", accuracy)

decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train,y_train)
y_predict = decisiontree.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Decision Tree: ", accuracy)

print("\nThe best one was Logistic Regression.\n \
    The worst in this run was the Decision Tree, but it could beat KNN on another run\n \
    because its accuracy fluctuates when no random_state is set.\n")


Knn:  0.626373626374
Log Reg:  0.725274725275
Decision Tree:  0.582417582418

The best one was Logistic Regression.
     The worst in this run was the Decision Tree, but it could beat KNN on another run
     because its accuracy fluctuates when no random_state is set.
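
The run-to-run fluctuation of the decision tree can be removed by fixing its seed. The following is a minimal sketch on synthetic data (the `make_classification` data and the helper `tree_accuracy` are stand-ins for illustration, not the heart-disease dataset): with the same `random_state`, the tree and therefore its accuracy are identical on every run.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the six numerical features (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=3
)


def tree_accuracy(seed):
    """Fit a decision tree with a fixed seed and return its test accuracy."""
    tree = DecisionTreeClassifier(random_state=seed)
    tree.fit(X_train, y_train)
    return accuracy_score(y_test, tree.predict(X_test))


# The same seed always builds the same tree, hence the same accuracy.
first_run = tree_accuracy(seed=0)
second_run = tree_accuracy(seed=0)
print(first_run == second_run)
```

Passing `random_state` to `DecisionTreeClassifier` in the cell above would make the three-way comparison repeatable.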



### f. Now, we want to use the categorical features as well! To this end, we have to perform a feature engineering process called OneHotEncoding for the categorical features. To do this, each categorical feature should be replaced with dummy columns in the feature table (one column for each possible value of a categorical feature), and then encoded in a binary manner such that only one of the dummy columns can take “1” at a time (and zero for the rest). For example, “Gender” can take two values “m” and “f”. Thus, we need to replace this feature (in the feature table) by 2 columns titled “m” and “f”. Wherever we have a male subject, we can put “1” and “0” in the columns “m” and “f”. Wherever we have a female subject, we can put “0” and “1” in the columns “m” and “f”. (Hint: you will need 4 columns to encode “ChestPain” and 3 columns to encode “Thal”).
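
As a small illustration of the encoding described above, here is a sketch on a toy frame (the three rows are made up for the example). Note that `pd.get_dummies` names the new columns `Gender_f` and `Gender_m` by default rather than plain “f” and “m”.

```python
import pandas as pd

# Toy frame with made-up rows, only to show the shape of the encoding.
toy = pd.DataFrame({"Gender": ["m", "f", "f"], "Age": [41, 63, 67]})

# Replace "Gender" with one 0/1 column per category value.
encoded = pd.get_dummies(toy, columns=["Gender"], dtype=int)
print(encoded)
```

Exactly one of the dummy columns is 1 in each row, which is the "one-hot" property.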

In [37]:
new_data_frame = pd.get_dummies(file, columns = ['Gender', 'ChestPain', 'Thal'])
print(file.head())
print("\n\n")
print(new_data_frame.head())

   Age Gender     ChestPain  RestBP  Chol  RestECG  MaxHR  Oldpeak  \
0   63      f       typical     145   233        2    150      2.3   
1   67      f  asymptomatic     160   286        2    108      1.5   
2   67      f  asymptomatic     120   229        2    129      2.6   
3   37      f    nonanginal     130   250        0    187      3.5   
4   41      m    nontypical     130   204        2    172      1.4   

         Thal  AHD  
0       fixed   No  
1      normal  Yes  
2  reversable  Yes  
3      normal   No  
4      normal   No  



   Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  AHD  Gender_f  Gender_m  \
0   63     145   233        2    150      2.3   No         1         0   
1   67     160   286        2    108      1.5  Yes         1         0   
2   67     120   229        2    129      2.6  Yes         1         0   
3   37     130   250        0    187      3.5   No         1         0   
4   41     130   204        2    172      1.4   No         0         1   

   Ch

### g. Repeat parts (d) and (e) with the new dataset that you built in part (f). How does the prediction accuracy change for each method?

In [38]:
feature_cols = ['Age', 'RestBP', 'Chol', 'RestECG', 'MaxHR', 'Oldpeak', 'Gender_f', 'Gender_m', 'ChestPain_asymptomatic', 'ChestPain_nonanginal', 'ChestPain_nontypical', 'ChestPain_typical', 'Thal_fixed', 'Thal_normal', 'Thal_reversable']

x = new_data_frame[feature_cols]
y = new_data_frame['AHD']

print(x.head())
print(y.head())

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 3)

knn = KNeighborsClassifier(n_neighbors = 5)
knn.fit(x_train, y_train)
y_predict = knn.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Knn: ", accuracy)

logreg = LogisticRegression()
logreg.fit(x_train,y_train)
y_predict = logreg.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Log Reg: ", accuracy)

decisiontree = DecisionTreeClassifier()
decisiontree.fit(x_train,y_train)
y_predict = decisiontree.predict(x_test)
accuracy = accuracy_score(y_test, y_predict)
print("Decision Tree: ", accuracy)

# The accuracies stay more or less the same as in part (e).

   Age  RestBP  Chol  RestECG  MaxHR  Oldpeak  Gender_f  Gender_m  \
0   63     145   233        2    150      2.3         1         0   
1   67     160   286        2    108      1.5         1         0   
2   67     120   229        2    129      2.6         1         0   
3   37     130   250        0    187      3.5         1         0   
4   41     130   204        2    172      1.4         0         1   

   ChestPain_asymptomatic  ChestPain_nonanginal  ChestPain_nontypical  \
0                       0                     0                     0   
1                       1                     0                     0   
2                       1                     0                     0   
3                       0                     1                     0   
4                       0                     0                     1   

   ChestPain_typical  Thal_fixed  Thal_normal  Thal_reversable  
0                  1           1            0                0  
1               

### h. Now, repeat part (e) with the new dataset that you built in part (f), this time using Cross-Validation. Thus, rather than splitting the dataset into testing and training, use 10-fold Cross-Validation (as we learned in Lab4) to evaluate the classification methods and report the final prediction accuracy.

In [39]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score

#feature_cols = ['Age', 'RestBP', 'Chol', 'RestECG', 'MaxHR', 'Oldpeak', 'Gender_f', 'Gender_m', 'ChestPain_asymptomatic', 'ChestPain_nonanginal', 'ChestPain_nontypical', 'ChestPain_typical', 'Thal_fixed', 'Thal_normal', 'Thal_reversable']

#x = new_data_frame[feature_cols]
#y = new_data_frame['AHD']


knn = KNeighborsClassifier(n_neighbors = 5)

accuracy = cross_val_score(knn, x, y, cv = 10)
print("knn: ", accuracy.mean())

logreg = LogisticRegression()
accuracy = cross_val_score(logreg, x, y, cv = 10)
print("Logistic Regression: ", accuracy.mean())

decision_tree = DecisionTreeClassifier()
accuracy = cross_val_score(decision_tree, x, y, cv = 10)
print("Decision tree: ", accuracy.mean())




knn:  0.643926585095
Logistic Regression:  0.811568409344
Decision tree:  0.712076381164
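
KNN scores noticeably lower than the other two methods here. One possible reason, not verified on this dataset, is that KNN relies on raw distances while the features are on very different scales (Chol is in the hundreds, Oldpeak in single digits). The sketch below uses synthetic data (an assumption, not the heart-disease data) to show how wrapping KNN in a pipeline with `StandardScaler` can be compared against plain KNN under the same 10-fold cross-validation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: one informative small-scale feature, one noisy large-scale feature.
rng = np.random.default_rng(0)
informative = rng.normal(size=200)
noise = rng.normal(scale=100.0, size=200)
X = np.column_stack([informative, noise])
y = (informative > 0).astype(int)

raw_knn = KNeighborsClassifier(n_neighbors=5)
# The scaler is refit inside every fold, so no test-fold statistics leak into training.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

raw_scores = cross_val_score(raw_knn, X, y, cv=10)
scaled_scores = cross_val_score(scaled_knn, X, y, cv=10)
print("raw KNN:   ", raw_scores.mean())
print("scaled KNN:", scaled_scores.mean())
```

Whether scaling helps on the actual heart-disease features would need to be checked by running the same comparison on `x` and `y` from part (g).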




# Question 2: Debt Prediction

### a. Read the dataset file “Credit.csv” (you should download it from CSNS), and assign it to a Pandas DataFrame.

In [66]:
import pandas as pd

credits_df = pd.read_csv('Credit.csv')

### b. Check out the dataset. The “Credit” dataset includes a “balance” column (average credit card debt for a number of individuals) as the target, as well as several features: age, cards (number of credit cards), education (years of education), income (in thousands of dollars), limit (credit limit), marital status, and rating (credit rating).

In [67]:
credits_df[::30]

Unnamed: 0,Income,Limit,Rating,Cards,Age,Education,Married,Balance
0,14.891,3606,283,2,34,11,1,333
30,34.142,5666,413,4,47,5,1,863
60,35.51,5198,364,2,35,20,0,631
90,20.191,5767,431,4,42,16,1,1023
120,27.241,1402,128,2,67,15,1,0
150,63.931,5728,435,3,28,14,1,581
180,10.635,3584,294,5,69,16,1,423
210,24.543,3206,243,2,62,12,1,95
240,29.705,3351,262,5,71,14,1,148
270,15.866,3085,217,1,39,13,0,136


### c. Generate the feature matrix and target vector (target is “balance” in this dataset). Then, normalize (scale) the features (note: don’t normalize the target vector!).
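
One common way to normalize features is z-scoring, for example with scikit-learn's `preprocessing.scale`. The sketch below uses the five Income values visible in the preview from part (b) to show one detail that is easy to miss when checking the result by hand: `preprocessing.scale` divides by the population standard deviation (`ddof=0`), whereas pandas' `Series.std()` defaults to the sample version (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Five Income values taken from the preview in part (b).
income = pd.Series([14.891, 34.142, 35.51, 20.191, 27.241], name="Income")

scaled = preprocessing.scale(income.to_frame())[:, 0]

# Manual z-scores with the two standard-deviation conventions.
manual_population = (income - income.mean()) / income.std(ddof=0)
manual_sample = (income - income.mean()) / income.std()  # pandas default: ddof=1

print(np.allclose(scaled, manual_population))
print(np.allclose(scaled, manual_sample))
```

A manual check against `preprocessing.scale` should therefore pass `ddof=0` to `std()`.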

In [79]:
from sklearn import preprocessing

feature_cols = ['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Married']

feat_matrix = credits_df[feature_cols]
print(feat_matrix.head())
print()

target = credits_df['Balance']
print(target.head())
print()

scale = preprocessing.scale(feat_matrix)

#print (scale.mean(axis = 0) )

# Sanity check for the first scaled Income value:
# print(credits_df['Income'][0])
# print(credits_df['Income'].mean())
# print(credits_df['Income'].std())
# print((credits_df['Income'][0]-credits_df['Income'].mean())/credits_df['Income'].std())

    Income  Limit  Rating  Cards  Age  Education  Married
0   14.891   3606     283      2   34         11        1
1  106.025   6645     483      3   82         15        1
2  104.593   7075     514      4   71         11        0
3  148.924   9504     681      3   36         11        0
4   55.882   4897     357      2   68         16        1

0    333
1    903
2    580
3    964
4    331
Name: Balance, dtype: int64


### d. Split the dataset into testing and training sets with the following parameters: test_size=0.2, random_state=2.

In [81]:
from sklearn.model_selection import train_test_split

x = scale
y = target

print()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 2)




### e. Use Linear Regression to train a linear model on the training set. Check the coefficients of the linear regression model. Which feature is the most important? Which feature is the least important?

In [86]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression(n_jobs=4)
lin_reg.fit(x_train, y_train)


print("Coefficients:\n", lin_reg.coef_)

# The features were standardized, so coefficient magnitudes are comparable.
print("\nMost important feature (largest |coefficient|): Rating")
print("\nLeast important feature (smallest |coefficient|): Education")


Coefficients:
 [-266.58857043  224.9727995   384.17073326   15.75381497  -17.93570224
    6.29572687  -23.60638685]

Most important feature (largest |coefficient|): Rating

Least important feature (smallest |coefficient|): Education
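
Because the features were standardized before fitting, one reasonable way to read the coefficients is to rank them by absolute value. The sketch below does this with the coefficients copied from the output above; the variable names `order` and `ranking` are introduced only for this example.

```python
import numpy as np

feature_cols = ['Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education', 'Married']
# Coefficients copied from the output above, in the same order as feature_cols.
coef = np.array([-266.58857043, 224.9727995, 384.17073326, 15.75381497,
                 -17.93570224, 6.29572687, -23.60638685])

# Sort features by absolute coefficient, largest first.
order = np.argsort(-np.abs(coef))
ranking = [(feature_cols[i], coef[i]) for i in order]
for name, value in ranking:
    print(f"{name:>10}: {value:9.2f}")
```

By this measure Rating has the largest magnitude and Education the smallest. Limit and Rating appear to move together in the rows shown in part (b), so their individual coefficients should be interpreted with some caution.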


### f. Predict “balance” for the users in testing set. Then, compare the predicted balance with the actual balance by calculating and reporting the RMSE (as we saw in lab tutorial 4).
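
No cell is given for this part. The following is a minimal sketch of the RMSE calculation on synthetic data (the `make_regression` data is a stand-in, so the printed number says nothing about the Credit dataset). With the real data, `x_train`, `x_test`, `y_train`, and `y_test` from part (d) would take the place of the synthetic split.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled Credit features and Balance target (assumption).
X, y = make_regression(n_samples=400, n_features=7, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE = sqrt(mean((actual - predicted)^2)), in the same units as the target.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("RMSE:", rmse)
```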

### g. Now, use 10-fold Cross-Validation to evaluate the performance of a linear regression in predicting the balance. Thus, rather than splitting the dataset into testing and training, use Cross-Validation to evaluate the regression performance. What is the RMSE when you use cross validation?
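
No cell is given for this part either. A sketch of cross-validated RMSE, again on synthetic stand-in data rather than the Credit dataset, could look like the following. scikit-learn scorers follow a "higher is better" convention, so the mean squared error is requested as `neg_mean_squared_error` and the sign is flipped before taking the square root.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the scaled Credit features and Balance target (assumption).
X, y = make_regression(n_samples=400, n_features=7, noise=10.0, random_state=0)

# Each of the 10 folds yields one negated MSE value.
neg_mse = cross_val_score(
    LinearRegression(), X, y, cv=10, scoring="neg_mean_squared_error"
)
rmse_per_fold = np.sqrt(-neg_mse)
cv_rmse = rmse_per_fold.mean()
print("10-fold CV RMSE:", cv_rmse)
```

With the real data, the scaled feature matrix and `target` from part (c) would replace `X` and `y`.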