# Assignment 1: Intro to ML classifiers using scikit and Jupyter Notebook

The following is the remainder of A1. Please edit this notebook so you complete the assigned tasks and answer the assigned questions.

If you haven't used Jupyter Notebooks before, a good first step is to take the User Interface Tour in the Help menu. You can put code in any cell in this notebook and run it. Some of the cells are currently set to be "Markdown" cells, which means they hold text, headings, and so on. This cell is a Markdown cell. Cells can be changed from code cells to Markdown cells in the Cell menu -> Cell Type. You won't need to do anything with Markdown for this assignment, but if you want to learn about it, [here](https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet) is a cheat sheet for the types of things you can type in a Markdown cell and what they will output.

You can run a cell by typing Shift+Enter, and you can rerun individual cells whenever you like. Try running the following cell:

In [1]:
print ("Hello World!")

Hello World!


Unlike in many other environments, Jupyter Notebooks will remember what you set variables to even after the code has finished running (until you restart the kernel or change the variable in some way). The following code will find the current time when you run it:

In [2]:
import time
current_time = time.gmtime()
print (current_time)

time.struct_time(tm_year=2018, tm_mon=9, tm_mday=19, tm_hour=3, tm_min=50, tm_sec=5, tm_wday=2, tm_yday=262, tm_isdst=0)


If you print current_time in another cell later on, it will keep the value from when you originally set it. If you go back and rerun the above code, it will update.

In [3]:
# For example
print (current_time)

time.struct_time(tm_year=2018, tm_mon=9, tm_mday=19, tm_hour=3, tm_min=50, tm_sec=5, tm_wday=2, tm_yday=262, tm_isdst=0)


You do have to be careful about this if you run something later on in your code and then come back to an earlier part because variables may not have the values you expect at that point in the code. The following three blocks of code would always print out 1 if you ran them consecutively in regular code, but if you run them backwards in this notebook, you'll get an output of 2. Feel free to try:

In [4]:
a = 1

In [5]:
print (a)

1


In [6]:
a = 2

You'll be working with all of the files in the assignment folder. It's easiest to keep all of your files for this assignment in the working directory. You can show the working directory with the "%pwd" command and list the files in it with the "%ls" command.

In [7]:
%pwd

'/Users/jessli/Documents/05-499/a1'

In [8]:
%ls

HAIIF18 A1.docx*            loantrainingdata.csv*
HAIIF18 Assignment 1.ipynb* predictiondata.csv*
jli6_haiif18a1.html         variables.rtf*
loantestset.csv*


The "%ls" command should show you the three .csv data files, the variable explanations file (variables.rtf in the listing above), and this Jupyter notebook. If it doesn't, you may need to move those files to the working directory manually or change the working directory.
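
If you'd rather check for the data files from Python instead of reading the "%ls" output by eye, here is a minimal sketch. The file names come from the listing above; the helper name `missing_files` is made up for this example.

```python
from pathlib import Path

# File names taken from the %ls output above.
REQUIRED_FILES = ["loantrainingdata.csv", "loantestset.csv", "predictiondata.csv"]

def missing_files(directory, names=REQUIRED_FILES):
    """Return the names in `names` that are not present in `directory`."""
    folder = Path(directory)
    return [name for name in names if not (folder / name).is_file()]

# Check the current working directory (the folder that %pwd reports).
print("Missing:", missing_files(Path.cwd()))
```

An empty list means all three data files are where the rest of the notebook expects them.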

Next we have to actually import the training data. We'll do this using pandas and numpy.

In [9]:
import pandas
import numpy
import sklearn
import scipy
import matplotlib

The code below will import the .csv as a pandas DataFrame, which makes it easier to clean. It will then drop the non-numeric columns (in this case the "Loan_ID" field) because the classifier needs numeric data.

In [10]:
trainingdata = pandas.read_csv("loantrainingdata.csv", header = 0)
trainingdata = trainingdata._get_numeric_data()
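
Note that `_get_numeric_data()` starts with an underscore, which marks it as a private pandas helper. As a sketch of a public alternative, `select_dtypes(include="number")` keeps the numeric columns of a frame like this one (it would skip boolean columns, which this dataset doesn't appear to have). The toy values and IDs below are made up.

```python
import pandas

# Toy frame (made-up values) with one text column, like "Loan_ID" in the real file.
toy = pandas.DataFrame({
    "Loan_ID": ["ID_A", "ID_B", "ID_C"],
    "ApplicantIncome": [5849, 4583, 3000],
    "Credit_History": [1.0, None, 1.0],
})

# Keep only the numeric columns using the public pandas API.
numeric_only = toy.select_dtypes(include="number")
print(list(numeric_only.columns))
```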

In [11]:
#If you want to see your data
trainingdata

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Urban,Semiurban,Rural,Loan_Status
0,0.0,1.0,0.0,1,1.0,5849,0.0,,360.0,1.0,1,0,0,1
1,0.0,1.0,1.0,1,1.0,4583,1508.0,128.0,360.0,1.0,0,0,1,0
2,0.0,1.0,0.0,1,1.0,3000,0.0,66.0,360.0,1.0,1,0,0,1
3,0.0,1.0,0.0,0,1.0,2583,2358.0,120.0,360.0,1.0,1,0,0,1
4,0.0,1.0,0.0,1,1.0,6000,0.0,141.0,360.0,1.0,1,0,0,1
5,0.0,1.0,2.0,1,1.0,5417,4196.0,267.0,360.0,1.0,1,0,0,1
6,0.0,1.0,0.0,0,1.0,2333,1516.0,95.0,360.0,1.0,1,0,0,1
7,0.0,1.0,3.5,1,1.0,3036,2504.0,158.0,360.0,0.0,0,1,0,0
8,0.0,1.0,2.0,1,1.0,4006,1526.0,168.0,360.0,1.0,1,0,0,1
9,0.0,1.0,1.0,1,1.0,12841,10968.0,349.0,360.0,1.0,0,1,0,0


You probably noticed that there are a bunch of missing values in the dataset. The following tells you how many are missing in each variable:

In [12]:
#If you want to see how many values are missing
trainingdata.isnull().sum()

Gender                9
Married               3
Dependents           12
Education             0
Self_Employed        27
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           18
Loan_Amount_Term     14
Credit_History       38
Urban                 0
Semiurban             0
Rural                 0
Loan_Status           0
dtype: int64

You're going to need to figure out how to deal with these missing values. [Hint](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html)

**(1) Remove all the rows that have missing values. For simplicity, also call your new dataframe 'trainingdata'.** (5 pts)

In [13]:
#Enter code here to remove the rows with missing values
trainingdata = trainingdata.dropna()

If you have correctly removed all rows with missing values, the following code should return "True":

In [14]:
#DO NOT EDIT
not trainingdata.isnull().values.any()

True
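
Dropping rows is simple, but it does throw data away, so it's worth checking how many rows you lose. Here is a small sketch on a made-up frame (not the real loan data) that shows the row count before and after `dropna()`:

```python
import numpy
import pandas

# Made-up frame with two incomplete rows.
toy = pandas.DataFrame({
    "LoanAmount": [numpy.nan, 128.0, 66.0, 120.0],
    "Credit_History": [1.0, 1.0, numpy.nan, 1.0],
    "Loan_Status": [1, 0, 1, 1],
})

cleaned = toy.dropna()                 # drops any row with at least one NaN
rows_dropped = len(toy) - len(cleaned)

print("Rows before:", len(toy), "after:", len(cleaned), "dropped:", rows_dropped)
print("Any missing left?", cleaned.isnull().values.any())
```

You can run the same two `len` calls on the real `trainingdata` before and after part (1) to see how much of the training set survives.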

The next two lines split the data into two chunks. Train_X holds the features and their values from the training dataset. Train_y holds the class value that we want to predict.

In [15]:
#This creates an array of features that includes all features except 'Loan_Status' which is what we want to predict
Train_X = numpy.array(trainingdata.drop(['Loan_Status'], axis=1))

#This creates an array that includes ONLY 'Loan_Status'
Train_y = numpy.array(trainingdata['Loan_Status'])

Ideally we would preprocess the data to make sure that the different distributions of values don't skew the relative importance of the features. For example, 'ApplicantIncome' ranges into the tens of thousands while binary variables are just zero or one. We're not going to do this here, but we will in the future. Just know that it puts the values of most variables in the general range of -1 to 1. If you want to read more about this, go [here](http://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py).
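
To give a feel for what that preprocessing looks like, here is a minimal sketch using sklearn's `StandardScaler` on made-up numbers. This is one common choice, not necessarily the one we'll use later. It rescales each column to mean 0 and standard deviation 1, so most values end up within a few units of zero rather than strictly between -1 and 1.

```python
import numpy
from sklearn.preprocessing import StandardScaler

# Two made-up columns: an income-like feature and a binary feature.
X = numpy.array([[5849.0, 1.0],
                 [4583.0, 0.0],
                 [3000.0, 1.0],
                 [12841.0, 0.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # learn mean/std per column, then rescale

print("Column means:", X_scaled.mean(axis=0).round(6))
print("Column stds: ", X_scaled.std(axis=0).round(6))
```

In a real workflow you would fit the scaler on the training data only and then call `scaler.transform` on the test data, so both are rescaled the same way.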

Next we actually make the classifier. We're using a Gaussian Naive Bayes classifier ([documentation if you're interested](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)) within sklearn, and there are a ton of parameters you can tune, but we'll go simple to start.

In [16]:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(Train_X,Train_y)

GaussianNB(priors=None)

Let's test it out. The following code takes a hand-entered set of data and predicts whether the loan will be approved:

In [17]:
#You can play with this
prediction = classifier.predict(numpy.asarray([0,1,2,0,1,5000,1255,100,360,1,0,0,1]).reshape(1,-1))

print('Prediction:', prediction)

Prediction: [1]


The classifier predicts that the above person would be approved, as the output is a 1. If the output were 0, that would mean it predicts that they would be declined.
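
If you're curious how confident the classifier is, `predict_proba` returns one probability per class instead of a single label. The sketch below uses a tiny made-up dataset with only two features (not the real loan data), so the numbers themselves mean nothing; it only shows the calls.

```python
import numpy
from sklearn.naive_bayes import GaussianNB

# Made-up training data: [income, credit_history] -> approved (1) or declined (0).
toy_X = numpy.array([[5000, 1], [6000, 1], [5500, 1], [4800, 0],
                     [1000, 0], [1500, 0], [1200, 1], [900, 0]], dtype=float)
toy_y = numpy.array([1, 1, 1, 1, 0, 0, 0, 0])

toy_classifier = GaussianNB().fit(toy_X, toy_y)

applicant = numpy.array([[5200, 1]], dtype=float)
label = toy_classifier.predict(applicant)
probabilities = toy_classifier.predict_proba(applicant)  # columns follow classes_

print("Classes:", toy_classifier.classes_)
print("Prediction:", label)
print("Probabilities:", probabilities.round(3))
```

With the real `classifier`, the same call would be `classifier.predict_proba(...)` on a 13-value row shaped with `reshape(1,-1)`.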

**(2) Switch the zeros in the following code for your own values to find three cases where changing a single variable makes the difference between approved and declined. See Case0a and Case0b below for an example: an applicant might be approved if they have a monthly income of \$5000 in Case0a, but an otherwise identical applicant would be declined if they only had a monthly income of \$1000, as in Case0b. Use a different variable (e.g., Gender, ApplicantIncome, etc.) for each of the three cases.** (10 pts)

In [18]:
#Example:

prediction0a = classifier.predict(numpy.asarray([1,1,0,1,0,5000,0,300,360,1,0,0,1]).reshape(1,-1))
print('Case0a:', prediction0a)

prediction0b = classifier.predict(numpy.asarray([1,1,0,1,0,1000,0,300,360,1,0,0,1]).reshape(1,-1))
print('Case0b:', prediction0b)


#Switch the zeros with your own values below:

#change loan amount
prediction1a = classifier.predict(numpy.asarray([1,1,0,1,0,5000,0,300,360,1,0,0,1]).reshape(1,-1))
print('Case1a:', prediction1a)

prediction1b = classifier.predict(numpy.asarray([1,1,0,1,0,5000,0,1000,360,1,0,0,1]).reshape(1,-1))
print('Case1b:', prediction1b)

#change credit history (the value at index 9 goes from 1 to 5)
prediction2a = classifier.predict(numpy.asarray([1,1,0,1,0,5000,0,300,360,1,0,0,1]).reshape(1,-1))
print('Case2a:', prediction2a)

prediction2b = classifier.predict(numpy.asarray([1,1,0,1,0,5000,0,300,360,5,0,0,1]).reshape(1,-1))
print('Case2b:', prediction2b)

#change the Urban flag (the value at index 10 goes from 0 to 1)
prediction3a = classifier.predict(numpy.asarray([1,1,0,1,0,5000,0,300,360,1,0,0,1]).reshape(1,-1))
print('Case3a:', prediction3a)

prediction3b = classifier.predict(numpy.asarray([1,1,0,1,0,5000,0,300,360,1,1,0,1]).reshape(1,-1))
print('Case3b:', prediction3b)


#output below should be 
#Case0a: [1]
#Case0b: [0]
#Case1a: [1]
#Case1b: [0]
#Case2a: [1]
#Case2b: [0]
#Case3a: [1]
#Case3b: [0]

Case0a: [1]
Case0b: [0]
Case1a: [1]
Case1b: [0]
Case2a: [1]
Case2b: [0]
Case3a: [1]
Case3b: [0]
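
It's easy to lose track of which position in a 13-value list is which feature. As a sketch, the hypothetical helper `make_applicant` below builds a row by feature name, using the column order of the dataframe printed earlier (with Loan_Status left out). It then confirms that two cases differ in exactly one variable.

```python
import numpy

# Column order of Train_X, taken from the dataframe printed earlier (Loan_Status excluded).
FEATURES = ["Gender", "Married", "Dependents", "Education", "Self_Employed",
            "ApplicantIncome", "CoapplicantIncome", "LoanAmount",
            "Loan_Amount_Term", "Credit_History", "Urban", "Semiurban", "Rural"]

def make_applicant(**values):
    """Build a (1, 13) array in FEATURES order; unspecified features default to 0."""
    unknown = set(values) - set(FEATURES)
    if unknown:
        raise KeyError(f"Unknown feature(s): {sorted(unknown)}")
    return numpy.array([[values.get(name, 0) for name in FEATURES]], dtype=float)

# Same values as Case1a and Case1b above.
base = dict(Gender=1, Married=1, Education=1, ApplicantIncome=5000,
            LoanAmount=300, Loan_Amount_Term=360, Credit_History=1, Rural=1)
case_a = make_applicant(**base)
case_b = make_applicant(**{**base, "LoanAmount": 1000})

# List the feature names whose values differ between the two cases.
changed = [name for name, a, b in zip(FEATURES, case_a[0], case_b[0]) if a != b]
print("Features that differ:", changed)
```

Each array can be passed straight to `classifier.predict`, since it already has the (1, 13) shape that `reshape(1,-1)` produces.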


**(3) The next thing we need to do is import the test set (loantestset.csv) and see how good our classifier is at predicting the values there. Do this, following the same steps as above to import and convert to arrays. Put the features in an array called 'Test_X' and the class (Loan_Status) in an array called 'Test_y'.** (5 pts)

In [19]:
# Write code here to import the test data and convert into arrays
loantestdata = pandas.read_csv("loantestset.csv", header = 0)
loantestdata = loantestdata._get_numeric_data().dropna()

Test_X = numpy.array(loantestdata.drop(['Loan_Status'], axis=1))
Test_y = numpy.array(loantestdata['Loan_Status'])

If you've imported them correctly, the following code should report the arrays as having shapes (100, 13) and (100,):

In [20]:
# DO NOT EDIT THIS
# Should return (100, 13) (100,)
print (Test_X.shape, Test_y.shape)

(100, 13) (100,)


Once you've properly imported the test set data, cleaned it, and put it into arrays, the following code should predict whether loans were approved using the classifier you trained on the training data as applied to the attributes in the testing data. It will output your predictions and then the actual values.

In [21]:
#DO NOT EDIT THIS
Test_y_predicted = classifier.predict(Test_X)

Test_y_predicted, Test_y

(array([0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
        1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0,
        1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
        1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0]),
 array([0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0,
        1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,
        1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
        1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]))

**(3a) A "false positive" is when the classifier predicts a 1 but the real value is a 0. A "false negative" is when the classifier predicts a 0 but the real value is a 1. In this case, a false positive means the classifier predicts that the person would be approved but they were actually denied, and a false negative means the classifier predicts that the person would be denied but they were actually approved.**

**You could count up by hand how many true negatives, false negatives, false positives, and true positives you had with this classifier, but the [confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) function in sklearn creates a handy table of this for you. Use the confusion_matrix function in the box below to see how many of each you had.** (3 pts)

In [22]:
# Use a confusion matrix here for Test_y and Test_y_predicted to see how many true neg, false neg, false pos, 
# and true pos you had.
from sklearn.metrics import confusion_matrix

confusion_matrix(Test_y,Test_y_predicted)

array([[15, 14],
       [ 9, 62]])

In [23]:
# Write the numbers of each in below. MAKE SURE you understand which cell in the confusion matrix has each of these.
false_negative = 9     # predict:n, actual:y
false_positive = 14    # predict:y, actual:n
true_negative = 15     # predict:n, actual:n
true_positive = 62     # predict:y, actual:y

print (false_negative, false_positive, true_negative, true_positive)

9 14 15 62
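
Instead of reading the four cells off by hand, you can unpack them with `ravel()`. In sklearn's confusion matrix the rows are the actual classes and the columns are the predicted classes. The sketch below uses synthetic labels built to reproduce the counts above; it is not the real test set.

```python
import numpy
from sklearn.metrics import confusion_matrix

# Synthetic labels built to reproduce the counts reported above:
# 15 true negatives, 14 false positives, 9 false negatives, 62 true positives.
y_true = numpy.array([0] * 15 + [0] * 14 + [1] * 9 + [1] * 62)
y_pred = numpy.array([0] * 15 + [1] * 14 + [0] * 9 + [1] * 62)

# Rows are actual classes, columns are predicted classes, so for labels [0, 1]
# the flattened order is: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```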


**(3b) How does a false positive in this situation hurt** ***the bank?*** (no more than 20 words). (3 pts)

Note that we're using the len function here for grading purposes, just to quickly make sure you haven't gone over the word limit for each.

In [24]:
len(("The bank would have granted the loan but would possibly not get money back.").split())

14

**(3c) How does a false negative in this situation hurt** ***the bank?*** (no more than 20 words). (3 pts)

In [25]:
len(("Fewer people getting loans from the bank means less profit from interest.").split())

12

**(3d) How does a false positive in this situation hurt** ***the customer?*** (no more than 20 words). (3 pts)

In [26]:
len(("The person could fall into debt if they cannot pay back the loan in time.").split())

15

**(3e) How does a false negative in this situation hurt** ***the customer?*** (no more than 20 words). (3 pts)

In [27]:
len(("The customer would not get the loan when they could have gotten it and needed it.").split())

16

**(4) In the following box, write code to calculate and print out your "accuracy" for the model. This is just the proportion of how many the classifier got correct. You could do it manually if you want, but there's an easy [function](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) built into sklearn to do it.** (5 pts)

In [28]:
#CALCULATE AND PRINT OUT YOUR ACCURACY SCORE
from sklearn.metrics import accuracy_score

accuracy_score(Test_y, Test_y_predicted)

0.77

Having decent accuracy is nice, but it doesn't mean much on its own. **(5) Go back to the test data spreadsheet - counting only rows with no missing values, what would your accuracy have been if you just guessed that every loan would be approved?** (5 pts)

In [29]:
#Your response
# Guess "approved" (1) for every row in the test set
Test_all_y = numpy.ones_like(Test_y)

accuracy_score(Test_y, Test_all_y)


0.71

*Sanity check - if your classifier is getting lower accuracy than you got just by guessing everything would be approved, you probably made a mistake in your classifier somewhere. If it's a little bit better, you're on the right track.*
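
sklearn also has a built-in way to get this kind of baseline: `DummyClassifier` with `strategy="most_frequent"` always predicts the most common class. The sketch below uses synthetic labels with the same class balance as the cleaned test set above (71 approved, 29 declined), and for simplicity it fits and scores on the same labels.

```python
import numpy
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with the same class balance as the cleaned test set above
# (71 approved, 29 declined); the feature values are placeholders and are ignored.
y = numpy.array([1] * 71 + [0] * 29)
X = numpy.zeros((len(y), 1))

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))
print("Baseline accuracy:", baseline_accuracy)
```

In a real comparison you would fit the baseline on the training labels and score it on the test labels, exactly as you do for the actual classifier.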

**(6) In the box below, create a classifier for the test data based on the training data using the exact same steps, except this time use a decision tree classifier from sklearn.tree instead of a GaussianNB classifier. Print out your predicted values, the real values, and your overall accuracy.** (15 pts)

In [30]:
# This code should output predicted values, real values, and overall accuracy using a decision tree classifier

from sklearn.tree import DecisionTreeClassifier
tree_classifier = DecisionTreeClassifier()
tree_classifier.fit(Train_X,Train_y)

Test_y_predicted_tree = tree_classifier.predict(Test_X)

print("Predicted:\n", Test_y_predicted_tree, end="\n")
print("Actual:\n", Test_y, end="\n")
print("Accuracy: ", accuracy_score(Test_y, Test_y_predicted_tree))


Predicted:
 [0 1 0 1 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 0 0
 1 1 1 1 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 1 1 1 0 1 0 1 0 1 0 0 1 1 1 1 1 1 1
 0 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 0 1 1]
Actual:
 [0 1 0 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 1 1 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 0 1 1 0 1 1 0 1 0 1 1 0 0 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1 1 1
 0 1 0 0 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0]
Accuracy:  0.72
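
Don't be surprised if your decision tree accuracy changes slightly between runs: sklearn's tree can break ties between equally good splits at random. Passing a fixed `random_state` makes the result repeatable. The sketch below uses synthetic data from `make_classification` as a stand-in for Train_X, Train_y, and Test_X.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the real notebook would use Train_X, Train_y and Test_X.
X, y = make_classification(n_samples=200, n_features=13, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

# Two separately trained trees with the same seed should agree on every prediction.
first = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict(X_test)
second = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).predict(X_test)

print("Identical predictions:", (first == second).all())
```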


(OPTIONAL, 5 bonus pts) **A "Pipeline" in sklearn is a simpler way to create and organize classifiers. If you'd like to learn more about what these advantages are, [this](https://www.youtube.com/watch?v=URdnFlZnlaE) tutorial does a decent job explaining them. We will be using Pipelines in future assignments to simplify code. If you'd like to get a head start, try replacing your code from part (6) above with a Pipeline. There are tons of different ways to format a Pipeline, but for simplicity please use the second format on the slide presented in the above tutorial at 14:30 (the one that starts with `Pipeline([` ). Note that this Pipeline should have very few steps; you just have to select the numeric data, deal with missing values, separate it into features and class variables, and then train your classifier.**
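
To show the general shape of the `Pipeline([` format, here is a minimal sketch on made-up data. One caveat: the steps in a Pipeline transform the features but can't drop rows, so this sketch swaps in `SimpleImputer` to fill missing values instead of removing them. That is a different choice from `dropna()`, so its predictions would not necessarily match part (6).

```python
import numpy
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up numeric data with one missing value.
X = numpy.array([[5000.0, 1.0], [6000.0, numpy.nan], [1000.0, 0.0], [1200.0, 0.0]])
y = numpy.array([1, 1, 0, 0])

# Each step is a (name, estimator) pair; the last step is the classifier.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
pipeline.fit(X, y)
print(pipeline.predict(numpy.array([[5500.0, 1.0]])))
```

The step names "impute" and "tree" are arbitrary labels chosen for this example.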

In [31]:
# This is for the Pipeline code for the OPTIONAL part above
# Write your Pipeline code here

In [32]:
# This is for the Pipeline code for the OPTIONAL part above

# In order to prove that your Pipeline does the same thing as your code from part (6),
# print out your predictions from the code in part (6) in this box followed by the predictions from
# your pipeline. The predictions should be identical.


**(7) Using the structure from part (6) and whichever type of classifier you like, write code to...**

**(a) develop a classifier based on the training data csv** (5 pts)

**(b) import the 'predictiondata.csv' file** (5 pts)

**(c) make predictions for these new instances and print them out here** (10 pts)

Note that you can't calculate accuracy for the predictiondata.csv predictions because we don't actually know whether these applications were approved or denied. The purpose of this exercise is just to have a script that can automatically approve or deny applications without a human even having to open the file.

In [33]:
### Your code goes here
from sklearn.gaussian_process import GaussianProcessClassifier

## a
my_classifier = GaussianProcessClassifier()
my_classifier.fit(Train_X, Train_y)


## b
predictiondata = pandas.read_csv("predictiondata.csv", header = 0)
predictiondata = predictiondata._get_numeric_data().dropna()

## c
Test_prediction = my_classifier.predict(predictiondata)
print("Predicted:\n", Test_prediction)

# for clarity on pt. d
#for pred, (_, row) in zip(Test_prediction, predictiondata.iterrows()):
#    print("PREDICTION: ", pred, "\n", row, "\n")


Predicted:
 [1 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
 0 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]


**(d) Look at the predictions you've made for the predictiondata.csv file. Find three where you're pretty sure your classifier got it wrong. List their IDs below, and explain why you think these loans really should or shouldn't get approved, contrary to the prediction of your classifier.** Use no more than 25 words per loan to explain. (10 pts)
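
Because the numeric-only step throws away the Loan_ID column, it can be hard to tell which prediction belongs to which loan. One possible approach, sketched below on made-up data with a toy classifier, is to drop incomplete rows while the ID is still attached and then pair the IDs with the predictions. The IDs, values, and the name `toy_classifier` are all illustrative, and `dropna()` here would also remove rows with a missing non-numeric value.

```python
import pandas
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in for predictiondata.csv; the IDs and values are illustrative only.
raw = pandas.DataFrame({
    "Loan_ID": ["ID_A", "ID_B", "ID_C", "ID_D"],
    "ApplicantIncome": [5000, 1200, None, 6100],
    "Credit_History": [1.0, 0.0, 1.0, 1.0],
})

# Drop incomplete rows first, while Loan_ID is still attached to each row.
complete = raw.dropna()
features = complete.select_dtypes(include="number")

# Toy classifier trained on made-up data, standing in for the real one.
toy_classifier = DecisionTreeClassifier(random_state=0).fit(
    [[5000, 1.0], [6000, 1.0], [1000, 0.0], [1500, 0.0]], [1, 1, 0, 0])

# Pair each surviving ID with its prediction, in the same row order.
results = pandas.DataFrame({
    "Loan_ID": complete["Loan_ID"].to_numpy(),
    "prediction": toy_classifier.predict(features.to_numpy()),
})
print(results)
```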

In [34]:
#1. LP001203
print (len(("They SHOULD: The person's income to loan ratio is very high, has education, good credit history, and lives in semiurban place.").split()))
#2. LP001450
print (len(("They SHOULD: The person has good credit history, high applicant income, semiurban area, and loan amount is low.").split()))
#3. LP001607
print (len(("They SHOULD NOT. The person has bad credit history, no personal income [coapplicant income is necessary], not married, and not educated.").split()))


21
18
21


**(e) In this assignment, we have tried to predict whether a loan** ***would*** **be approved or denied, but this is not the same thing as whether a loan** ***should*** **be approved or denied. If we wanted to look at whether a loan** ***should be approved or denied***, **what might we want to build a classifier to predict?** (no more than 20 words) (5 pts)

In [1]:
len(("They need a classifier that will measure whether the customer will be able to pay back the loan.").split())

18

**(f) What factors do you think might be worthwhile to consider when approving or denying loans, but weren't in this dataset?** (20 words or less) (5 pts)

In [36]:
len(("Age. Too young and the person might be financially unstable; too old and the time left for repayment is short.").split())

20

Once you've completed all of the above, you're done with assignment 1! You might want to double check that your code works like you expect. You can do this by choosing "Restart & Run All" in the Kernel menu. If it outputs errors, you may want to go back and check what you've done.

Once you think everything is set, please download your final notebook as HTML, and submit it to the A1 folder on the Canvas site with the name `[yourandrewid]_haiif18a[assignmentnumber]`, e.g., `jseering_haiif18a1`.