# Identifying safe loans with decision trees

The [LendingClub](https://www.lendingclub.com/) is a peer-to-peer lending company that directly connects borrowers and potential lenders/investors. In this notebook, you will build a classification model to predict whether or not a loan provided by LendingClub is likely to [default](https://en.wikipedia.org/wiki/Default_(finance)).

In this notebook you will use data from the LendingClub to predict whether a loan will be paid off in full or the loan will be [charged off](https://en.wikipedia.org/wiki/Charge-off) and possibly go into default. In this assignment you will:

* Use SFrames to do some feature engineering.
* Train a decision-tree on the LendingClub dataset.
* Visualize the tree.
* Predict whether a loan will default along with prediction probabilities (on a validation set).
* Train a complex tree model and compare it to a simple tree model.


## Fire up GraphLab Create

Make sure you have the latest version of GraphLab Create. If you can't find the decision tree module, you will need to upgrade GraphLab Create using:

```
   pip install graphlab-create --upgrade
```

In [59]:
from __future__ import division  # ensures floating-point division; kept first, as required in a plain script
import graphlab as gl
import numpy as np
gl.canvas.set_target('ipynb')

# Load LendingClub dataset

**1.** We will be using a dataset from the [LendingClub](https://www.lendingclub.com/). A parsed and cleaned form of the dataset is available [here](https://github.com/learnml/machine-learning-specialization-private). Make sure you **download the dataset** before running the following command.

In [60]:
loans = gl.SFrame('lending-club-data.gl/')

## Exploring some features

**2.** Let's quickly explore what the dataset looks like. First, let's print out the column names to see what features we have in this dataset.

In [61]:
print loans.column_names()

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'is_inc_v', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'not_compliant', 'status', 'inactive_loans', 'bad_loans', 'emp_length_num', 'grade_num', 'sub_grade_num', 'delinq_2yrs_zero', 'pub_rec_zero', 'collections_12_mths_zero', 'short_emp', 'payment_in

Here, we see that we have some feature columns that have to do with grade of the loan, annual income, home ownership status, etc. Let's take a look at the distribution of loan grades in the dataset.

In [62]:
loans['grade'].show()

We can see that over half of the loan grades are assigned values `B` or `C`. Each loan is assigned one of these grades, along with a more finely discretized feature called `sub_grade` (feel free to explore that feature column as well!). These values depend on the loan application and credit report, and determine the interest rate of the loan. More information can be found [here](https://www.lendingclub.com/public/rates-and-fees.action).

Now, let's look at a different feature.

In [63]:
loans['home_ownership'].show()

This feature describes whether the loanee is mortgaging, renting, or owns a home. We can see that a small percentage of the loanees own a home.

## Exploring the target column

The target column (response label column) of the dataset that we are interested in is called `bad_loans`. In this column, **1** means a risky (bad) loan and **0** means a safe loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:
* **+1** as a safe  loan, 
* **-1** as a risky (bad) loan. 

**3.** We create a new column called `safe_loans`, and store loan outcome there.

In [64]:
loans[['id', 'member_id', 'loan_amnt', 'bad_loans']].head(2)

id,member_id,loan_amnt,bad_loans
1077501,1296599,5000,0
1077430,1314167,2500,1


In [65]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] = loans['bad_loans'].apply(lambda x : +1 if x==0 else -1)

# just check if conversion done properly, before removing the 'bad_loans' col
print loans[['id', 'member_id', 'loan_amnt', 'bad_loans', 'safe_loans']].head(2)

loans = loans.remove_column('bad_loans')

+---------+-----------+-----------+-----------+------------+
|    id   | member_id | loan_amnt | bad_loans | safe_loans |
+---------+-----------+-----------+-----------+------------+
| 1077501 |  1296599  |    5000   |     0     |     1      |
| 1077430 |  1314167  |    2500   |     1     |     -1     |
+---------+-----------+-----------+-----------+------------+
[2 rows x 5 columns]
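
To make the relabelling concrete, here is a minimal plain-Python sketch of the same mapping. It does not use GraphLab, and the `bad_loans` list is made-up sample data rather than rows from the dataset.

```python
# Hypothetical sample of the original 0/1 labels (1 = risky, 0 = safe).
bad_loans = [0, 1, 0, 0, 1]

def to_safe_label(bad):
    """Map bad_loans (0 = safe, 1 = risky) to safe_loans (+1 = safe, -1 = risky)."""
    return +1 if bad == 0 else -1

safe_loans = [to_safe_label(b) for b in bad_loans]
print(safe_loans)  # [1, -1, 1, 1, -1]
```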



**4.** Now, let us explore the distribution of the column `safe_loans`. This gives us a sense of how many safe and risky loans are present in the dataset. Print out the percentage of safe loans and risky loans in the data frame.

See also [frequency tables](http://hamelg.blogspot.com.au/2015/11/python-for-data-analysis-part-19_17.html). A frequency table is a data table that shows the counts of one or more categorical variables. In pandas, you can create frequency tables (also known as crosstabs) with the `pd.crosstab()` function. The function takes one or more array-like objects as indexes or columns and then constructs a new DataFrame of variable counts based on the supplied arrays.

[This tutorial](https://www.analyticsvidhya.com/blog/2016/01/12-pandas-techniques-python-data-manipulation/) is based on the same dataset and covers a good set of tools for data manipulation.

In [66]:
import pandas as pd
def percConvert(ser):
    return ser/float(ser[-1])

pd.crosstab(index=loans['safe_loans'],  # Make a crosstab
                  columns="count",      # Name the count column
                        margins=True)   # Include row and column totals

#pd.crosstab(index=loans['safe_loans'],columns="count").apply(percConvert, axis=1)

col_0,count,All
row_0,Unnamed: 1_level_1,Unnamed: 2_level_1
-1,23150,23150
1,99457,99457
All,122607,122607


In [67]:
loans['safe_loans'].show(view = 'Categorical')

You should have:
* Around 81% safe loans
* Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.

## Features for the classification algorithm

**5.** In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are **described in the code comments** below. If you are a finance geek, the [LendingClub](https://www.lendingclub.com/) website has a lot more details about these features.

In [68]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquency
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to date
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]] #wrap target inside [] to create list on the fly

This **subset of features** and the response, or **target**, are what we will use for the rest of this notebook.

If you are using SFrame, proceed to the section "Sample data to balance classes".

If you are NOT using SFrame, download the list of indices for the training and validation sets: module-5-assignment-1-train-idx.json, module-5-assignment-1-validation-idx.json. Then follow these steps:

*    Apply one-hot encoding to loans. Your tool may have a function for one-hot encoding. Alternatively, see #7 for implementation hints.
*    Load the JSON files into the lists train_idx and validation_idx.
*    Perform train/validation split using train_idx and validation_idx. In Pandas, for instance:

In [69]:
#import json
#with open('module-5-assignment-1-train-idx.json', 'r') as file1: 
#    train_idx = json.load(file1)   #reads entire file in one go 

#with open('module-5-assignment-1-validation-idx.json', 'r') as file2: 
#    valid_idx = json.load(file2)    #reads entire file in one go 
 
#train_data      =  loans.iloc[train_idx, :]
#validation_data =  loans.iloc[valid_idx, :]

#print 'Training set   : %d data points' % len(train_data)
#print 'Validation set : %d data points' % len(validation_data)
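
The index-based split described above can be sketched in plain Python as follows. This is only an illustration: the JSON strings stand in for the contents of the two index files, and `rows` is a made-up list rather than the real one-hot encoded loans.

```python
import json

# Stand-ins for the contents of the two index files (the real files hold many more indices).
train_idx = json.loads('[0, 2, 3]')
validation_idx = json.loads('[1, 4]')

# Hypothetical rows; in the assignment these would be the one-hot encoded loans.
rows = ['loan_a', 'loan_b', 'loan_c', 'loan_d', 'loan_e']

# Select rows by position, exactly as the index lists dictate.
train_rows = [rows[i] for i in train_idx]
validation_rows = [rows[i] for i in validation_idx]

print('Training set   : %d data points' % len(train_rows))
print('Validation set : %d data points' % len(validation_rows))
```

Because the indices are fixed in advance, everyone who uses them gets the same split regardless of the tool.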

## Sample data to balance classes

**6.** As we explored above, our data is disproportionately full of safe loans.  Let's create two datasets: one with just the safe loans (`safe_loans_raw`) and one with just the risky loans (`risky_loans_raw`).

In [70]:
safe_loans_raw = loans[loans[target] == +1]
risky_loans_raw = loans[loans[target] == -1]
print "Number of safe loans  : %s" % len(safe_loans_raw)
print "Number of risky loans : %s" % len(risky_loans_raw)
print "Total number of loans : %s" % len(loans)

Number of safe loans  : 99457
Number of risky loans : 23150
Total number of loans : 122607


Now, write some code below to compute the percentage of safe and risky loans in the dataset, and validate these numbers against what `.show` reported earlier in the assignment:

In [71]:
print "Percentage of safe loans  :", len(safe_loans_raw)/len(loans) *100
print "Percentage of risky loans :", len(risky_loans_raw)/len(loans) *100

Percentage of safe loans  : 81.1185331996
Percentage of risky loans : 18.8814668004


One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used `seed=1` so everyone gets the same results.

In [72]:
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))

risky_loans = risky_loans_raw
safe_loans = safe_loans_raw.sample(percentage, seed=1)

# Append the risky_loans with the downsampled version of safe_loans
loans_data = risky_loans.append(safe_loans)
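
For intuition, the undersampling step can be sketched in plain Python. This is not GraphLab's implementation: keeping each majority row independently with a fixed probability is an assumption about how a fractional sample behaves, and the toy labels are made up. Under that assumption, the number of rows kept is only approximately equal to the size of the minority class, so the result is close to, but not exactly, balanced.

```python
import random

def undersample(majority, minority, seed=1):
    """Keep each majority row with probability len(minority) / len(majority)."""
    rng = random.Random(seed)
    keep_prob = len(minority) / float(len(majority))
    kept = [row for row in majority if rng.random() < keep_prob]
    return kept + list(minority)

# Hypothetical toy data: 80 safe (+1) and 20 risky (-1) labels.
safe = [+1] * 80
risky = [-1] * 20
balanced = undersample(safe, risky)
print('kept %d safe and %d risky rows' % (balanced.count(+1), balanced.count(-1)))

# With the counts from this notebook, the sampling fraction is about 0.23.
fraction = 23150 / float(99457)
print('sampling fraction: %.4f' % fraction)
```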

Now, let's verify that the resulting percentage of safe and risky loans are each nearly 50%.

In [73]:
print "Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data))
print "Percentage of risky loans                :", len(risky_loans) / float(len(loans_data))
print "Total number of loans in our new dataset :", len(loans_data)

Percentage of safe loans                 : 0.502236174422
Percentage of risky loans                : 0.497763825578
Total number of loans in our new dataset : 46508


**Note:** There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this [paper](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5128907&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F69%2F5173046%2F05128907.pdf%3Farnumber%3D5128907). For this assignment, we use the simplest possible approach, where we subsample the over-represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.

## One-hot encoding

**7.** scikit-learn's decision tree implementation requires numerical values for its data matrix. This means you will have to turn categorical variables into binary features via one-hot encoding. The next assignment has more details about this.

If you are using SFrame, feel free to use this piece of code as is. Refer to the SFrame API documentation for a deeper understanding. If you are using different machine learning software, make sure you prepare the data to be passed to the learning software.

In [74]:
'''  we are doing this lab using GraphLab, so there is no need for this

loans_data = risky_loans.append(safe_loans)

categorical_variables = []
for feature_name, feature_type in zip(loans_data.column_names(), loans_data.column_types()):
    if feature_type == str:
        categorical_variables.append(feature_name)

for feature in categorical_variables:
    loans_data_one_hot_encoded = loans_data[feature].apply(lambda x: {x: 1})
    loans_data_unpacked = loans_data_one_hot_encoded.unpack(column_name_prefix=feature)

    # Change None's to 0's
    for column in loans_data_unpacked.column_names():
        loans_data_unpacked[column] = loans_data_unpacked[column].fillna(0)

    loans_data.remove_column(feature)
    loans_data.add_columns(loans_data_unpacked)
'''
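
As a tool-independent illustration of one-hot encoding, here is a minimal plain-Python sketch. The helper `one_hot_encode`, the `feature.category` column naming, and the sample rows are all assumptions made for this example; they are not part of the SFrame or scikit-learn APIs.

```python
def one_hot_encode(rows, feature):
    """Replace `feature` in each row dict with one 0/1 column per observed category."""
    categories = sorted(set(row[feature] for row in rows))
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being encoded.
        new_row = dict((k, v) for k, v in row.items() if k != feature)
        for category in categories:
            new_row['%s.%s' % (feature, category)] = 1 if row[feature] == category else 0
        encoded.append(new_row)
    return encoded

# Hypothetical rows using category values that appear in this dataset.
rows = [
    {'home_ownership': 'RENT', 'dti': 1.0},
    {'home_ownership': 'OWN', 'dti': 5.55},
    {'home_ownership': 'MORTGAGE', 'dti': 16.33},
]
encoded_rows = one_hot_encode(rows, 'home_ownership')
for encoded_row in encoded_rows:
    print(sorted(encoded_row.items()))
```

Each row ends up with exactly one of the new binary columns set to 1, while the numeric columns are left unchanged.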

## Split data into training and validation sets

**8.** We split the data into training and validation sets using an 80/20 split and specifying `seed=1` so everyone gets the same results.

**Note**: In previous assignments, we have called this a **train-test split**. However, the portion of data that we don't train on will be used to help **select model parameters** (this is known as model selection). Thus, this portion of data should be called a **validation set**. Recall that the performance of various candidate models (i.e. models with different parameters) should be examined on the validation set, while the final selected model should always be evaluated on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on, or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.

In [116]:
train_data, validation_data = loans_data.random_split(.8, seed=1)

print 'Training set   : %d data points' % len(train_data)
print 'Validation set : %d data points' % len(validation_data)

Training set   : 37224 data points
Validation set : 9284 data points
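
As a quick sanity check on the sizes printed above, the short sketch below confirms that the two sets add up to the 46508 loans in the balanced dataset and that the split is approximately, not exactly, 80/20. A fraction that is only approximate is what we would expect from a randomized split.

```python
train_size = 37224
validation_size = 9284

total = train_size + validation_size
train_fraction = train_size / float(total)

print('Total rows     : %d' % total)
print('Train fraction : %.4f' % train_fraction)
```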


In [117]:
train_data.head(2)

grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none
C,C4,1,1,RENT,1.0,car,60 months,1
F,F2,0,5,OWN,5.55,small_business,60 months,1

last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
1,9.4,0.0,-1
1,32.6,0.0,-1


In [118]:
validation_data.head(2)

grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none
D,D2,0,3,RENT,13.97,other,60 months,0
A,A5,0,11,MORTGAGE,16.33,debt_consolidation,36 months,1

last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
1,59.5,0.0,-1
1,62.1,0.0,-1



# Use decision tree to build a classifier

**9.** Now, let's use the built-in GraphLab Create decision tree learner to create a loan prediction model on the training data. (In the next assignment, you will implement your own decision tree learning algorithm.)  Our feature columns and target column have already been decided above. Use `validation_set=None` to get the same results as everyone else.

If you are using sklearn, use the built-in scikit-learn decision tree learner ([sklearn.tree.DecisionTreeClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)) to create a loan prediction model on the training data. To do this, you will need to import sklearn, sklearn.tree, and numpy.

Note: You will have to first convert the SFrame into a numpy data matrix, and extract the target labels as a numpy array (Hint: you can use the .to_numpy() method call on SFrame to turn SFrames into numpy arrays). See the [API](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) for more information. Make sure to set max_depth=6.

Call this model decision_tree_model.

In [119]:
decision_tree_model = gl.decision_tree_classifier.create(train_data, validation_set=None,
                                target = target, features = features,max_depth=6)

## Visualizing a learned model

**10.** As noted in the [documentation](https://dato.com/products/create/docs/generated/gl.boosted_trees_classifier.create.html#gl.boosted_trees_classifier.create), typically the max depth of the tree is capped at 6. However, such a tree can be hard to visualize graphically.  Here, we instead learn a smaller model with **max depth of 2** to gain some intuition by visualizing the learned tree.

In [77]:
small_model = gl.decision_tree_classifier.create(train_data, validation_set=None,
                   target = target, features = features, max_depth = 2)

In the view that is provided by GraphLab Create, you can see each node, and each split at each node. This visualization is great for considering what happens when this model predicts the target of a new data point.  (If you are using scikit-learn and have the package [Graphviz](http://graphviz.readthedocs.org/en/latest/#), then you will be able to perform this section.)

**Note:** To better understand this visual:
* The root node is represented using pink. 
* Intermediate nodes are in green. 
* Leaf nodes are in blue and orange.

In [113]:
small_model.show(view="Tree")

# Making predictions

Let's consider two positive and two negative examples **from the validation set** and see what the model predicts. We will do the following:
* Predict whether or not a loan is safe.
* Predict the probability that a loan is safe.

**11.** First, let's grab 2 positive examples and 2 negative examples. In SFrame, that would be:

In [79]:
validation_safe_loans = validation_data[validation_data[target] == 1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)
sample_validation_data

grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none
B,B3,0,11,OWN,11.18,credit_card,36 months,1
D,D1,0,10,RENT,16.85,debt_consolidation,36 months,1
D,D2,0,3,RENT,13.97,other,60 months,0
A,A5,0,11,MORTGAGE,16.33,debt_consolidation,36 months,1

last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
1,82.4,0.0,1
1,96.4,0.0,1
1,59.5,0.0,-1
1,62.1,0.0,-1


In [114]:
sample_validation_data['grade','total_rec_late_fee','safe_loans']

grade,total_rec_late_fee,safe_loans
B,0.0,1
D,0.0,1
D,0.0,-1
A,0.0,-1


## Explore label predictions

**12.** Now, we will use our model  to predict whether or not a loan is likely to default. For each row in the **sample_validation_data**, use the **decision_tree_model** to predict whether or not the loan is classified as a **safe loan**. 

**Hint:** Be sure to use the `.predict()` method.

In [80]:
predictions = decision_tree_model.predict(sample_validation_data, output_type='class')
print predictions
print sample_validation_data['safe_loans']

[1, -1, -1, 1]
[1, 1, -1, -1]


**Quiz Question:** What percentage of the predictions on `sample_validation_data` did `decision_tree_model` get correct?

In [81]:
print "num of predictions +ve class:%d" %(predictions == +1).sum()
print "num of predictions -ve class:%d" %(predictions == -1).sum()
actuals = sample_validation_data['safe_loans']
num_correct = sum( np.equal(predictions, actuals))
num_mistakes = len(sample_validation_data) - num_correct
accuracy = num_correct/len(sample_validation_data)
print "-----------------------------------------------------"
print '# Loans   correctly classified =', num_correct
print '# Loans incorrectly classified =', num_mistakes
print '# Loans total                  =', len(sample_validation_data)
print "-----------------------------------------------------"
print 'Accuracy = %.2f' % accuracy

num of predictions +ve class:2
num of predictions -ve class:2
-----------------------------------------------------
# Loans   correctly classified = 2
# Loans incorrectly classified = 2
# Loans total                  = 4
-----------------------------------------------------
Accuracy = 0.50
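
The same accuracy can be reproduced without GraphLab or NumPy. The sketch below is a plain-Python version that uses the predicted and true labels printed above; the helper name `accuracy` is just for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / float(len(actual))

predicted = [1, -1, -1, 1]   # model output printed above
actual = [1, 1, -1, -1]      # true safe_loans labels printed above
print('Accuracy = %.2f' % accuracy(predicted, actual))  # Accuracy = 0.50
```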


## Explore probability predictions

**13.** For each row in the **sample_validation_data**, what is the probability (according to **decision_tree_model**) of a loan being classified as **safe**?


**Hint:** Set `output_type='probability'` to make **probability** predictions using **decision_tree_model** on `sample_validation_data`. If you are using scikit-learn, you can use the .predict_proba() method.

In [82]:
prob_predictions = decision_tree_model.predict(sample_validation_data, output_type='probability')
#prob_predictions = prob_predictions.to_numpy()
#print "Probs %0.4f" %prob_predictions
print prob_predictions

[0.5473502278327942, 0.4891221821308136, 0.4559234082698822, 0.5864479541778564]


In [83]:
max_prob, index = max([(v,i) for i,v in enumerate(prob_predictions)])
print "Highest probability = %0.4f for item at index %d" %(max_prob, index)

sample_validation_data[index]

Highest probability = 0.5864 for item at index 3


{'dti': 16.33,
 'emp_length_num': 11,
 'grade': 'A',
 'home_ownership': 'MORTGAGE',
 'last_delinq_none': 1,
 'last_major_derog_none': 1,
 'purpose': 'debt_consolidation',
 'revol_util': 62.1,
 'safe_loans': -1,
 'short_emp': 0,
 'sub_grade': 'A5',
 'term': ' 36 months',
 'total_rec_late_fee': 0.0}

In [84]:
#prob_predictions = prob_predictions.to_numpy()
#max_value = max(prob_predictions)
#max_index = prob_predictions.index(max_value)
#print "Highest probability = %0.2f for item at index %d" %(max_value, max_index)
# Fails: 'numpy.ndarray' and 'SArray' objects have no attribute 'index'

In [85]:
#max([(v,i) for i,v in enumerate(prob_predictions)])
#print "Highest probability = %0.2f for item at index %d" %(v, prob_predictions.index(i))
# Fails: 'numpy.ndarray' and 'SArray' objects have no attribute 'index'

**Quiz Question:** Which loan has the highest probability of being classified as a **safe loan**?

The 4th loan, even though it is actually a risky loan.

**Checkpoint:** Can you verify that for all the predictions with `probability >= 0.5`, the model predicted the label **+1**?

In [86]:
prob_2_class = [+1 if x >= 0.5 else -1 for x in prob_predictions]  
print "Number matching ", sum( np.equal(predictions, prob_2_class))

Number matching  4
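
The checkpoint and the previous quiz question can both be worked through in plain Python using the four probabilities printed earlier. This is a small sketch, not GraphLab code: it thresholds each probability at 0.5 and finds the index of the largest one.

```python
# Probabilities of the +1 (safe) class, copied from the output above.
probabilities = [0.5473502278327942, 0.4891221821308136,
                 0.4559234082698822, 0.5864479541778564]

# Threshold at 0.5 to turn probabilities into class labels.
labels = [+1 if p >= 0.5 else -1 for p in probabilities]

# Index of the loan with the highest probability of being safe.
best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])

print(labels)       # [1, -1, -1, 1]
print(best_index)   # 3
```

The thresholded labels agree with the class predictions printed earlier, and index 3 corresponds to the 4th loan.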


### Tricky predictions!

**14.** Now, we will explore something pretty interesting. For each row in the **sample_validation_data**, what is the probability (according to **small_model**) of a loan being classified as **safe**?

**Hint:** Set `output_type='probability'` to make **probability** predictions using **small_model** on `sample_validation_data`:

In [87]:
prob_predictions_small = small_model.predict(sample_validation_data, output_type='probability')
print prob_predictions_small
print sample_validation_data.head()

[0.5242817997932434, 0.47226759791374207, 0.47226759791374207, 0.5798847675323486]
+-------+-----------+-----------+----------------+----------------+-------+
| grade | sub_grade | short_emp | emp_length_num | home_ownership |  dti  |
+-------+-----------+-----------+----------------+----------------+-------+
|   B   |     B3    |     0     |       11       |      OWN       | 11.18 |
|   D   |     D1    |     0     |       10       |      RENT      | 16.85 |
|   D   |     D2    |     0     |       3        |      RENT      | 13.97 |
|   A   |     A5    |     0     |       11       |    MORTGAGE    | 16.33 |
+-------+-----------+-----------+----------------+----------------+-------+
+--------------------+------------+------------------+-----------------------+
|      purpose       |    term    | last_delinq_none | last_major_derog_none |
+--------------------+------------+------------------+-----------------------+
|    credit_card     |  36 months |        1         |           1      

**Quiz Question:** Notice that the probability predictions are **exactly the same** for the 2nd and 3rd loans. Why would this happen?

Follow the tree! The tree-growing algorithm usually decides which variable to split on by checking which split produces the purest nodes. Here it picked `grade == A` for the first split at the root of the tree, and at the second level it again chose a grade feature, `grade == B`. Since the 2nd and 3rd loans are both grade D, they end up in the same leaf and receive the same prediction (risky), even though the 2nd loan is actually a safe loan.

## Visualize the prediction on a tree


**14a.** Note that you should be able to look at the small tree (of depth 2), traverse it yourself, and visualize the prediction being made. Consider the following point in the **sample_validation_data**:

In [88]:
sample_validation_data[1]

{'dti': 16.85,
 'emp_length_num': 10,
 'grade': 'D',
 'home_ownership': 'RENT',
 'last_delinq_none': 1,
 'last_major_derog_none': 1,
 'purpose': 'debt_consolidation',
 'revol_util': 96.4,
 'safe_loans': 1,
 'short_emp': 0,
 'sub_grade': 'D1',
 'term': ' 36 months',
 'total_rec_late_fee': 0.0}

Let's visualize the small tree here to do the traversing for this data point.

In [89]:
small_model.show(view="Tree")

**Note:** In the tree visualization above, the values at the leaf nodes are not class predictions but scores (a slightly advanced concept that is out of the scope of this course). You can read more about this [here](https://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf).  If the score is $\geq$ 0, the class +1 is predicted.  Otherwise, if the score < 0, we predict class -1.


**Quiz Question:** Based on the visualized tree, what prediction would you make for this data point?

The grade of the loan is D. Now follow the tree from the top:
* 1st split decision `grade == A`: No, so take the left branch.
* 2nd split decision `grade == B`: No, so take the left branch, which leads to a leaf node with score -0.11.
* Since the score < 0, we predict class -1.
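
The traversal above can be written out as a small plain-Python sketch. Only the leaf reached by answering "No" twice has a score quoted in this notebook, so the other branches return `None` here. The `sigmoid` step is an assumption: it treats the leaf score as a log-odds value, which happens to agree closely with the probability of about 0.4723 that `small_model` reported for this loan.

```python
import math

def small_tree_score(loan):
    """Follow the two splits described above; return the leaf score if it is known."""
    if loan['grade'] == 'A':
        return None   # leaf score for this branch is not quoted in the text
    if loan['grade'] == 'B':
        return None   # leaf score for this branch is not quoted in the text
    return -0.11      # leaf reached by answering "No" twice

def sigmoid(score):
    # Assumption: the leaf score is a log-odds margin mapped through a logistic function.
    return 1.0 / (1.0 + math.exp(-score))

loan = {'grade': 'D', 'sub_grade': 'D1'}
score = small_tree_score(loan)
label = +1 if score >= 0 else -1
print('score=%.2f label=%d probability=%.3f' % (score, label, sigmoid(score)))
```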

**15.** Now, let's verify your prediction by examining the prediction made using GraphLab Create.  Use the `.predict` function on `small_model`.

In [90]:
prob_predictions_small = small_model.predict(sample_validation_data[1],output_type='class')
print prob_predictions_small
print sample_validation_data[1]['safe_loans']

[-1]
1


# Evaluating accuracy of the decision tree model

Recall that the accuracy is defined as follows:
$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

**16.** Let us start by evaluating the accuracy of the `small_model` and `decision_tree_model` on the training data:

In [91]:
print round(small_model.evaluate(train_data)['accuracy'],2)
print round(decision_tree_model.evaluate(train_data)['accuracy'],2)

0.61
0.64


**Checkpoint:** You should see that the **small_model** performs worse than the **decision_tree_model** on the training data.


**17.** Now, let us evaluate the accuracy of the **small_model** and **decision_tree_model** on the entire **validation_data**, not just the subsample considered above.

In [92]:
print round(small_model.evaluate(validation_data)['accuracy'],2)
print round(decision_tree_model.evaluate(validation_data)['accuracy'], 2)

0.62
0.64


**Quiz Question:** What is the accuracy of `decision_tree_model` on the validation set, rounded to the nearest .01?

0.64

## Evaluating accuracy of a complex decision tree model

Here, we will train a large decision tree with `max_depth=10`. This will allow the learned tree to become very deep, and result in a very complex model. Recall that in lecture, we prefer simpler models with similar predictive power. This will be an example of a more complicated model which has similar predictive power, i.e. something we don't want.

**18.** If using sklearn, use sklearn.tree.DecisionTreeClassifier to train a decision tree with maximum depth = 10. Call this model big_model.

In [93]:
big_model = gl.decision_tree_classifier.create(train_data, validation_set=None,
                   target = target, features = features, max_depth = 10)

**19.** Now, let us evaluate **big_model** on the training set and validation set.

In [94]:
print round(big_model.evaluate(train_data)['accuracy'],2)
print round(big_model.evaluate(validation_data)['accuracy'],2)

0.67
0.63


**Checkpoint:** We should see that **big_model** has even better performance on the training set than **decision_tree_model** did on the training set.

**Quiz Question:** How does the performance of **big_model** on the validation set compare to **decision_tree_model** on the validation set? Is this a sign of overfitting?

**big_model** performs worse on the validation set (accuracy 63%) than the simpler **decision_tree_model** (64%), even though it has the highest training accuracy (67%). This is a sign of overfitting: a complex model with low bias fits the training data closely but does not generalize as well, and it can perform worse than simpler models on validation or test data.
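
One simple way to look at this is to compare the gap between training and validation accuracy for each model. The sketch below only restates the rounded accuracies printed earlier in this notebook; the dictionary layout and the `gaps` name are illustrative.

```python
# (training accuracy, validation accuracy) as printed earlier in this notebook
results = {
    'small_model (max_depth=2)': (0.61, 0.62),
    'decision_tree_model (max_depth=6)': (0.64, 0.64),
    'big_model (max_depth=10)': (0.67, 0.63),
}

# A large positive gap (training much better than validation) suggests overfitting.
gaps = {}
for name in sorted(results):
    train_acc, valid_acc = results[name]
    gaps[name] = round(train_acc - valid_acc, 2)
    print('%-35s train=%.2f validation=%.2f gap=%+.2f'
          % (name, train_acc, valid_acc, gaps[name]))
```

Only the deepest tree shows a clearly positive gap, which matches the overfitting interpretation.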

### Quantifying the cost of mistakes

Every mistake the model makes costs money. In this section, we will try and quantify the cost of each mistake made by the model.

Assume the following:

* **False negatives**: Loans that were actually safe but were predicted to be risky. This results in an opportunity cost of losing a loan that would have otherwise been accepted.
* **False positives**: Loans that were actually risky but were predicted to be safe. These are much more expensive because they result in a risky loan being given.
* **Correct predictions**: All correct predictions don't typically incur any cost.


Let's write code that can compute the cost of mistakes made by the model. Complete the following 4 steps:
1. First, let us compute the predictions made by the model.
2. Second, compute the number of false positives.
3. Third, compute the number of false negatives.
4. Finally, compute the cost of mistakes made by the model by adding up the costs of false negatives and false positives.

First, let us make predictions on `validation_data` using the `decision_tree_model`:

In [95]:
predictions = decision_tree_model.predict(validation_data)
true_labels = validation_data['safe_loans']

**False positives** are predictions where the model predicts +1 but the true label is -1. Complete the following code block for the number of false positives:

In [96]:
#[1 if (x == 1 and y == -1) else 0 for x,y in enumerate(predictions, actuals)]

#TypeError: 'SArray' object cannot be interpreted as an index

false_positives = sum((predictions == +1) & (true_labels == -1))
false_positives

1656

**False negatives** are predictions where the model predicts -1 but the true label is +1. Complete the following code block for the number of false negatives:

In [97]:
false_negatives = sum((predictions == -1) & (true_labels == +1))
false_negatives

1716
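
For readers not using SFrame, the two counts can be computed with ordinary Python lists. The sketch below applies the definitions above to the four sample loans from earlier in this notebook; `count_errors` is a hypothetical helper name.

```python
def count_errors(predicted, actual):
    """Return (false_positives, false_negatives) for +1/-1 labels."""
    false_positives = sum(1 for p, a in zip(predicted, actual) if p == +1 and a == -1)
    false_negatives = sum(1 for p, a in zip(predicted, actual) if p == -1 and a == +1)
    return false_positives, false_negatives

# The four sample loans from earlier: predicted vs. true labels.
predicted = [1, -1, -1, 1]
actual = [1, 1, -1, -1]
print(count_errors(predicted, actual))  # (1, 1)
```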

**Quiz Question:** Let us assume that each mistake costs money:
* Assume a cost of \$10,000 per false negative.
* Assume a cost of \$20,000 per false positive.

What is the total cost of mistakes made by `decision_tree_model` on `validation_data`?

In [98]:
total_cost = 10000 * false_negatives + 20000 * false_positives
print "Total_cost for wrong predictions: $%.2f" %total_cost 

Total_cost for wrong predictions: $50280000.00
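
The arithmetic can be double-checked with a small standalone sketch. It uses the error counts and validation set size printed in this notebook, and the helper name `cost_of_mistakes` is just for illustration. As a cross-check, the same counts also reproduce the validation accuracy reported earlier.

```python
def cost_of_mistakes(false_negatives, false_positives,
                     cost_fn=10000, cost_fp=20000):
    """Total dollar cost of mistakes under the per-error costs assumed above."""
    return cost_fn * false_negatives + cost_fp * false_positives

false_positives = 1656
false_negatives = 1716
validation_size = 9284

cost = cost_of_mistakes(false_negatives, false_positives)
accuracy = 1.0 - (false_positives + false_negatives) / float(validation_size)

print('Total cost: $%d' % cost)        # Total cost: $50280000
print('Accuracy  : %.2f' % accuracy)   # Accuracy  : 0.64
```

Although the two error counts are similar, the false positives account for roughly two thirds of the total cost because each one is assumed to cost twice as much.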
