In [141]:
import numpy as np 
import pandas as pd
pd.set_option('display.max_colwidth', -1)


## Load LendingClub dataset

In [142]:
loans =pd.read_csv('lending-club-data.csv')

In [143]:
loans.head(2)

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,sub_grade_num,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,0.4,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,0.8,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1


## Exploring some features

Let's quickly explore what the dataset looks like. 

First, let's print out the column names to see what features we have in this dataset.




In [144]:
loans.columns

Index([u'id', u'member_id', u'loan_amnt', u'funded_amnt', u'funded_amnt_inv',
       u'term', u'int_rate', u'installment', u'grade', u'sub_grade',
       u'emp_title', u'emp_length', u'home_ownership', u'annual_inc',
       u'is_inc_v', u'issue_d', u'loan_status', u'pymnt_plan', u'url', u'desc',
       u'purpose', u'title', u'zip_code', u'addr_state', u'dti',
       u'delinq_2yrs', u'earliest_cr_line', u'inq_last_6mths',
       u'mths_since_last_delinq', u'mths_since_last_record', u'open_acc',
       u'pub_rec', u'revol_bal', u'revol_util', u'total_acc',
       u'initial_list_status', u'out_prncp', u'out_prncp_inv', u'total_pymnt',
       u'total_pymnt_inv', u'total_rec_prncp', u'total_rec_int',
       u'total_rec_late_fee', u'recoveries', u'collection_recovery_fee',
       u'last_pymnt_d', u'last_pymnt_amnt', u'next_pymnt_d',
       u'last_credit_pull_d', u'collections_12_mths_ex_med',
       u'mths_since_last_major_derog', u'policy_code', u'not_compliant',
       u'status', u'inactiv

Here, we see that we have some feature columns that have to do with grade of the loan, annual income, home ownership status, etc. Let's take a look at the distribution of loan grades in the dataset.

In [145]:
loans.groupby(['grade'])['id'].count()

grade
A    22314
B    37172
C    29950
D    19175
E    8990 
F    3932 
G    1074 
Name: id, dtype: int64

We can see that over half of the loan grades are assigned values B or C. Each loan is assigned one of these grades, along with a more finely discretized feature called sub_grade (feel free to explore that feature column as well!). These values depend on the loan application and credit report, and determine the interest rate of the loan.

In [146]:
loans.groupby(['home_ownership'])['id'].count()

home_ownership
MORTGAGE    59240
OTHER       179  
OWN         9943 
RENT        53245
Name: id, dtype: int64

This feature describes whether the loanee is mortaging, renting, or owns a home. We can see that a small percentage of the loanees own a home.



## Exploring the target column

The target column (label column) of the dataset that we are interested in is called `bad_loans`. In this column **1** means a risky (bad) loan **0** means a safe  loan.

In order to make this more intuitive and consistent with the lectures, we reassign the target to be:
* **+1** as a safe  loan, 
* **-1** as a risky (bad) loan. 

We put this in a new column called `safe_loans`.

In [147]:
loans['bad_loans'][0:3]

0    0
1    1
2    0
Name: bad_loans, dtype: int64

In [148]:
# safe_loans =  1 => safe
# safe_loans = -1 => risky
loans['safe_loans'] =loans['bad_loans'].apply(lambda x:+1 if x==0 else -1)
del loans['bad_loans']

In [149]:
print len(loans)
loans.head(2)

122607


Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,delinq_2yrs_zero,pub_rec_zero,collections_12_mths_zero,short_emp,payment_inc_ratio,final_d,last_delinq_none,last_record_none,last_major_derog_none,safe_loans
0,1077501,1296599,5000,5000,4975,36 months,10.65,162.87,B,B2,...,1.0,1.0,1.0,0,8.1435,20141201T000000,1,1,1,1
1,1077430,1314167,2500,2500,2500,60 months,15.27,59.83,C,C4,...,1.0,1.0,1.0,1,2.3932,20161201T000000,1,1,1,-1


Now, let us explore the distribution of the column safe_loans. This gives us a sense of how many safe and risky loans are present in the dataset.

In [150]:
#pd.DataFrame({'Percentage': df.groupby(('ID', 'Feature')).size() / len(df)})
loans.groupby(['safe_loans'])['id'].count()

safe_loans
-1    23150
 1    99457
Name: id, dtype: int64

You should have:

Around 81% safe loans

Around 19% risky loans

It looks like most of these loans are safe loans (thankfully). But this does make our problem of identifying risky loans challenging.

## Features for the classification algorithm

In this assignment, we will be using a subset of features (categorical and numeric). The features we will be using are described in the code comments below. 

In [151]:
features = ['grade',                     # grade of the loan
            'sub_grade',                 # sub-grade of the loan
            'short_emp',                 # one year or less of employment
            'emp_length_num',            # number of years of employment
            'home_ownership',            # home_ownership status: own, mortgage or rent
            'dti',                       # debt to income ratio
            'purpose',                   # the purpose of the loan
            'term',                      # the term of the loan
            'last_delinq_none',          # has borrower had a delinquincy
            'last_major_derog_none',     # has borrower had 90 day or worse rating
            'revol_util',                # percent of available credit being used
            'total_rec_late_fee',        # total late fees received to day
           ]

target = 'safe_loans'                   # prediction target (y) (+1 means safe, -1 is risky)

# Extract the feature columns and target column
loans = loans[features + [target]]

In [152]:
loans.head(2)

Unnamed: 0,grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
0,B,B2,0,11,RENT,27.65,credit_card,36 months,1,1,83.7,0.0,1
1,C,C4,1,1,RENT,1.0,car,60 months,1,1,9.4,0.0,-1


What remains now is a **subset of features** and the **target** that we will use for the rest of this notebook. 

## Split data into training and validation sets


**If you are NOT using SFrame, download the list of indices for the training and validation sets:**

module-5-assignment-1-train-idx.json.zip

module-5-assignment-1-validation-idx.json.zip

Then follow the following steps:

Apply one-hot encoding to loans. Your tool may have a function for one-hot encoding. 

Alternatively, see #7 for implementation hints.

Load the JSON files into the lists train_idx and validation_idx.

Perform train/validation split using train_idx and validation_idx. 
In Pandas, for instance:


In [153]:
# Apply one-hot encoding to loans. Your tool may have a function for one-hot encoding.


categorical_variables =[]
for feat_name,feat_type in zip(loans.columns,loans.dtypes):
    if feat_type==object: # In pandas dataframe string types shows as object 
        categorical_variables.append(feat_name)

#df['list_from_dict'] = [[x['name'] for x in list_dict] for list_dict in df['list_dicts']]

for feature in categorical_variables:
    loans_one_hot = loans[feature].apply(lambda x:{x:1})
    # the above o/p will give like :  1 {u' 60 months': 1}- so need to convert it like {' 60 months': 1}which is list of dicts
    loans_one_hot_encoded =loans_one_hot.values.tolist() # gives list of dict 
    loans_unpacked = pd.DataFrame(loans_one_hot_encoded) # gives a dataframe 
    
    # Change NaN's to 0's
    for columns in loans_unpacked.columns:
        loans_unpacked[columns]=loans_unpacked[columns].fillna(0)
        loans[columns] = loans_unpacked[columns].values
    del loans[feature]  # removing cols ['grade', 'sub_grade', 'home_ownership', 'purpose', 'term']
    


In [154]:
loans.head(2)

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,A,B,...,house,major_purchase,medical,moving,other,small_business,vacation,wedding,36 months,60 months
0,0,11,27.65,1,1,83.7,0.0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1,1,1.0,1,1,9.4,0.0,-1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [155]:
# 1st read the indexes in a json file 
train_val=pd.read_json('module-5-assignment-1-train-idx.json')
valid_val=pd.read_json('module-5-assignment-1-validation-idx.json')

# list out the values which is ndarray
lst_train =train_val.values.tolist()
lst_valid = valid_val.values.tolist()
# flattening the list of list to single list 
flat_list_train = [item for sublist in lst_train for item in sublist]
flat_list_validation = [item  for sublist in lst_valid  for item in sublist]

In [156]:
train_data = loans.iloc[flat_list_train]
validation_data = loans.iloc[flat_list_validation]
print len(train_data) , len(validation_data)

37224 9284


In [157]:
train_data.head(2)

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,A,B,...,house,major_purchase,medical,moving,other,small_business,vacation,wedding,36 months,60 months
1,1,1,1.0,1,1,9.4,0.0,-1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,0,5,5.55,1,1,32.6,0.0,-1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


## Now proceed to the section "Build a decision tree classifier", skipping three sections below.



## Sample data to balance classes

As we explored above, our data is disproportionally full of safe loans.  Let's create two datasets: one with just the safe loans (`safe_loans_raw`) and one with just the risky loans (`risky_loans_raw`).

In [49]:
safe_loans_raw=loans[loans[target]==+1]
risky_loans_raw = loans[loans[target] == -1]
print "Number of safe loans  : %s" % len(safe_loans_raw)
print "Number of risky loans : %s" % len(risky_loans_raw)

Number of safe loans  : 99457
Number of risky loans : 23150


Now, write some code to compute below the percentage of safe and risky loans in the dataset and validate these numbers against what was given   earlier in the assignment:

In [50]:
print "Percentage of safe loans  :", float(len(safe_loans_raw))*100/len(loans)
print "Percentage of risky loans :", float(len(risky_loans_raw))*100/len(loans)

Percentage of safe loans  : 81.1185331996
Percentage of risky loans : 18.8814668004


One way to combat class imbalance is to undersample the larger class until the class distribution is approximately half and half. Here, we will undersample the larger class (safe loans) in order to balance out our dataset. This means we are throwing away many data points. We used seed=1 so everyone gets the same results.



In [51]:
# Since there are fewer risky loans than safe loans, find the ratio of the sizes
# and use that percentage to undersample the safe loans.
percentage = len(risky_loans_raw)/float(len(safe_loans_raw))
print percentage
risky_loans =risky_loans_raw
safe_loans = safe_loans_raw.sample(frac=percentage,replace=True)

# Append the risky_loans with the downsampled version of safe_loans
loans_data =risky_loans.append(safe_loans)

0.232763908021


In [52]:
print len(loans_data)
loans_data.head(2)

46300


Unnamed: 0,grade,sub_grade,short_emp,emp_length_num,home_ownership,dti,purpose,term,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans
1,C,C4,1,1,RENT,1.0,car,60 months,1,1,9.4,0.0,-1
6,F,F2,0,5,OWN,5.55,small_business,60 months,1,1,32.6,0.0,-1


Now, let's verify that the resulting percentage of safe and risky loans are each nearly 50%.

In [53]:
print "Percentage of safe loans                 :", len(safe_loans) / float(len(loans_data))
print "Percentage of risky loans                :", len(risky_loans) / float(len(loans_data))
print "Total number of loans in our new dataset :", len(loans_data)

Percentage of safe loans                 : 0.5
Percentage of risky loans                : 0.5
Total number of loans in our new dataset : 46300


**Note:** There are many approaches for dealing with imbalanced data, including some where we modify the learning algorithm. These approaches are beyond the scope of this course, but some of them are reviewed in this [paper](http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=5128907&url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F69%2F5173046%2F05128907.pdf%3Farnumber%3D5128907 ). For this assignment, we use the simplest possible approach, where we subsample the overly represented class to get a more balanced dataset. In general, and especially when the data is highly imbalanced, we recommend using more advanced methods.

## One-hot encoding 

#7. For scikit-learn's decision tree implementation, it requires numerical values for it's data matrix. This means you will have to turn categorical variables into binary features via one-hot encoding.

In [21]:
loans_data.dtypes

grade                    object 
sub_grade                object 
short_emp                int64  
emp_length_num           int64  
home_ownership           object 
dti                      float64
purpose                  object 
term                     object 
last_delinq_none         int64  
last_major_derog_none    int64  
revol_util               float64
total_rec_late_fee       float64
safe_loans               int64  
dtype: object

In [123]:
loans_data =risky_loans.append(safe_loans)

categorical_variables =[]
for feat_name,feat_type in zip(loans_data.columns,loans_data.dtypes):
    if feat_type==object: # In pandas dataframe string types shows as object 
        categorical_variables.append(feat_name)

#df['list_from_dict'] = [[x['name'] for x in list_dict] for list_dict in df['list_dicts']]

for feature in categorical_variables:
    loans_data_one_hot = loans_data[feature].apply(lambda x:{x:1})
    # the above o/p will give like :  1 {u' 60 months': 1}- so need to convert it like {' 60 months': 1}which is list of dicts
    loans_data_one_hot_encoded =loans_data_one_hot.values.tolist() # gives list of dict 
    loans_data_unpacked = pd.DataFrame(loans_data_one_hot_encoded) # gives a dataframe 
    
    # Change NaN's to 0's
    for columns in loans_data_unpacked.columns:
        loans_data_unpacked[columns]=loans_data_unpacked[columns].fillna(0)
        loans_data[columns] = loans_data_unpacked[columns].values
    del loans_data[feature]  # removing cols ['grade', 'sub_grade', 'home_ownership', 'purpose', 'term']
    


In [124]:
loans_data.head(2)
#del loans_data

Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,A,B,...,house,major_purchase,medical,moving,other,small_business,vacation,wedding,36 months,60 months
1,1,1,1.0,1,1,9.4,0.0,-1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6,0,5,5.55,1,1,32.6,0.0,-1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0


## Split data into training and validation


#8. We split the data into training and validation sets using an 80/20 split and specifying seed=1 
so everyone gets the same results. Call the training and validation sets train_data and validation_data, respectively.

**Note: **
    
In previous assignments, we have called this a train-test split. However, the portion of data that we don't train on will be used to help select model parameters (this is known as model selection). Thus, this portion of data should be called a validation set. Recall that examining performance of various potential models (i.e. models with different parameters) should be on validation set, while evaluation of the final selected model should always be on test data. Typically, we would also save a portion of the data (a real test set) to test our final model on or use cross-validation on the training set to select our final model. But for the learning purposes of this assignment, we won't do that.



In [125]:
print target
train_data = loans_data.sample(frac=0.8,random_state=200)
validation_data = loans_data.drop(train_data.index)
train_data_target = train_data[target]
validation_data_target = validation_data[target]

safe_loans


In [126]:
len(train_data) , len(validation_data) , len(loans_data)

(37040, 8450, 46300)

## Build a decision tree classifier


#9. Now, let's use the built-in scikit learn decision tree learner 
(sklearn.tree.DecisionTreeClassifier) to create a loan prediction model on the training data. 

To do this, you will need to import sklearn, sklearn.tree, and numpy.

Note: You will have to first convert the SFrame into a numpy data matrix, and extract the target labels as a numpy array (Hint: you can use the .to_numpy() method call on SFrame to turn SFrames into numpy arrays). See the API for more information. Make sure to set max_depth=6.

Call this model decision_tree_model.


#10. Also train a tree using with max_depth=2. Call this model small_model.



In [136]:
import sklearn, sklearn.tree

In [158]:
type(train_data[target].as_matrix())


numpy.ndarray

In [159]:
train_data.as_matrix(columns=features)
train_data.dtypes


short_emp                int64  
emp_length_num           int64  
dti                      float64
last_delinq_none         int64  
last_major_derog_none    int64  
revol_util               float64
total_rec_late_fee       float64
safe_loans               int64  
A                        float64
B                        float64
C                        float64
D                        float64
E                        float64
F                        float64
G                        float64
A1                       float64
A2                       float64
A3                       float64
A4                       float64
A5                       float64
B1                       float64
B2                       float64
B3                       float64
B4                       float64
B5                       float64
C1                       float64
C2                       float64
C3                       float64
C4                       float64
C5                       float64
          

In [160]:
decision_tree_model= sklearn.tree.DecisionTreeClassifier( max_depth=6,random_state=0 )
decision_tree_model.fit(train_data.as_matrix(),train_data['safe_loans'])
#decision_tree_model.fit(train_data,train_data['safe_loans'])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=0, splitter='best')

In [161]:
decision_tree_model.classes_  , decision_tree_model.class_weight , decision_tree_model.get_params 

(array([-1,  1]),
 None,
 <bound method DecisionTreeClassifier.get_params of DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=6,
             max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
             min_samples_split=2, min_weight_fraction_leaf=0.0,
             presort=False, random_state=0, splitter='best')>)

In [162]:
decision_tree_model.predict(validation_data)

array([-1, -1, -1, ...,  1,  1,  1])

#10. Also train a tree using with max_depth=2. 

Call this model small_model.

In [170]:
small_model= sklearn.tree.DecisionTreeClassifier( max_depth=2 )
small_model.fit(train_data.as_matrix(),train_data['safe_loans'])

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=2,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

## Visualizing a learned model

10a. For this optional section, we would like to see what the small learned tree looks like. 

If you are using scikit-learn and have the package Graphviz, then you will be able to perform this section. 

If you are using a different software, try your best to follow along.

In [171]:
'''
from IPython.core.magic import register_line_magic

@register_line_magic
def pip(args):
    """Use pip from the current kernel"""
    from pip import main
    main(args.split())
    
%pip install graphviz

Successfully installed graphviz-0.8.4 - o/p
the above code is used to install graphviz
'''


'\nfrom IPython.core.magic import register_line_magic\n\n@register_line_magic\ndef pip(args):\n    """Use pip from the current kernel"""\n    from pip import main\n    main(args.split())\n    \n%pip install graphviz\n\nSuccessfully installed graphviz-0.8.4 - o/p\nthe above code is used to install graphviz\n'

In [164]:
from graphviz import Digraph

In [165]:
sklearn.tree.export_graphviz(small_model,out_file='small_model.dot')

In [166]:
from subprocess import check_call
check_call(['dot','-Tpng','small_model.dot','-o','small_model.png'])

# o/p 0 means it creates a png file in the path 

0

# Making predictions

Let's consider two positive and two negative examples **from the validation set** and see what the model predicts. We will do the following:
* Predict whether or not a loan is safe.
* Predict the probability that a loan is safe.

In [171]:
validation_safe_loans=validation_data[validation_data[target]==1]
validation_risky_loans = validation_data[validation_data[target] == -1]

sample_validation_data_risky = validation_risky_loans[0:2]
sample_validation_data_safe = validation_safe_loans[0:2]

sample_validation_data = sample_validation_data_safe.append(sample_validation_data_risky)

print len(sample_validation_data)
sample_validation_data

4


Unnamed: 0,short_emp,emp_length_num,dti,last_delinq_none,last_major_derog_none,revol_util,total_rec_late_fee,safe_loans,A,B,...,house,major_purchase,medical,moving,other,small_business,vacation,wedding,36 months,60 months
19,0,11,11.18,1,1,82.4,0.0,1,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
79,0,10,16.85,1,1,96.4,0.0,1,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
24,0,3,13.97,0,1,59.5,0.0,-1,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
41,0,11,16.33,1,1,62.1,0.0,-1,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


#12. Now, we will use our model to predict whether or not a loan is likely to default. 

For each row in the sample_validation_data, use the decision_tree_model to predict whether or not 

the loan is classified as a safe loan. 

**(Hint: if you are using scikit-learn**, you can use the .predict() method)



**Quiz Question**: What percentage of the predictions on sample_validation_data did decision_tree_model get correct?



In [172]:
sample_validation_data['safe_loans']

19    1
79    1
24   -1
41   -1
Name: safe_loans, dtype: int64

In [173]:
decision_tree_model.predict(sample_validation_data)

array([ 1,  1, -1, -1])

In [45]:
#Ans predicts all correct i.e 100%,here it shows 100% but graphlab is 50% may be the decison model features is differs

**Explore probability predictions**
#13. For each row in the sample_validation_data, 

what is the probability (according decision_tree_model) of a loan being classified as safe? 

(Hint: if you are using scikit-learn, you can use the .predict_proba() method)



In [174]:

'''
returns p : array of shape = [n_samples, n_classes], or a list of n_outputs  
such arrays if n_outputs > 1. 
The class probabilities of the input samples. The order of the classes corresponds to that in the attribute classes_.
'''
decision_tree_model.predict_proba(sample_validation_data)


array([[ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.]])

**Quiz Question:** Which loan has the highest probability of being classified as a safe loan?

**Checkpoint:** Can you verify that for all the predictions with probability >= 0.5, the model predicted the label +1?

In [50]:
# this question i have doubt as i am unable to know interpret the predict_proba function 

### Tricky predictions!

Now, we will explore something pretty interesting. For each row in the **sample_validation_data**, what is the probability (according to **small_model**) of a loan being classified as **safe**?


**Quiz Question**: Notice that the probability preditions are the exact same for the 2nd and 3rd loans. Why would this happen?



In [175]:
small_model.predict_proba(sample_validation_data)

array([[ 0.,  1.],
       [ 0.,  1.],
       [ 1.,  0.],
       [ 1.,  0.]])

In [53]:
# these things i am getting wrong . heavy doubt , i didnt get exact same 

## Visualize the prediction on a tree

14a. Note that you should be able to look at the small tree (of depth 2), traverse it yourself, and visualize the prediction being made. Consider the following point in the sample_validation_data:



If you have Graphviz, go ahead and re-visualize small_model here to do the traversing for this data point.



In [69]:
## unable to understand how to do 

## Evaluating accuracy of the decision tree model

Recall that the accuracy is defined as follows:
$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$
​
Let us start by evaluating the accuracy of the `small_model` and `decision_tree_model` on the training data

#16. Evaluate the accuracy of small_model and decision_tree_model on the training data. 


(Hint: if you are using scikit-learn, you can use the .score() method)

Checkpoint: You should see that the small_model performs worse than the decision_tree_model on the training data.



In [176]:
print small_model.score(train_data,train_data['safe_loans'])
print decision_tree_model.score(train_data,train_data['safe_loans'])
# both results i am getting wrong . so need to check how this is done 

1.0
1.0


#17. Now, evaluate the accuracy of the small_model and decision_tree_model on the entire validation_data, not just the subsample considered above.

Quiz Question: What is the accuracy of decision_tree_model on the validation set, rounded to the nearest .01?



In [177]:
print small_model.score(validation_data,validation_data['safe_loans'])
print decision_tree_model.score(validation_data,validation_data['safe_loans'])
# both results i am getting wrong . so need to check how this is done 

1.0
1.0
