# Modeling loan approval using demo data
Using [this data](https://github.com/IBMDevConnect/RBSHack2018/raw/master/hackdata/hack_data_v1.zip), this notebook outlines the modeling process to predict whether an applicant will repay their loan, and then makes this prediction for new applications. If you are working in Watson Studio, find instructions for loading the data into your project in the `getting-data-clean.ipynb` notebook. Otherwise, download the data from the link and load it with `pd.read_csv()` in your Jupyter environment.

Problem statement:
"I need some base models using GTB , RF , SVM or whatever algo of your choice..we have the logistic regression one..
This is for first time model developers, I need reference models on different  algorithmns to give them visibility 
on the impact of the associated hyperparamters...if you could pick any, it helps...preferably python notebook since 
I will attach WML part to it later..if not , any model is fine.."

## Workflow
1. Data access
2. Data understanding and shaping
3. Modeling
4. Evaluation
5. WML deployment

## Data access

If you are using Watson Studio, use the _Insert To Code_ function on the left to get your data. Otherwise, load the CSV into a pandas DataFrame with `pd.read_csv()`.
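For readers working outside Watson Studio, the following is a minimal sketch of the `pd.read_csv()` route. The helper name `load_loans` is hypothetical, and the in-memory sample stands in for the extracted CSV file, whose name is not given here; only a few of the dataset's columns are included.

```python
import io

import pandas as pd


def load_loans(source):
    """Read a loan CSV from a file path or file-like object into a DataFrame."""
    return pd.read_csv(source)


# Stand-in for the real file: two rows with a handful of the dataset's columns.
sample_csv = io.StringIO(
    "member_id,loan_amnt,term,loan_status\n"
    "58189336,14350,36 months,0\n"
    "70011223,4800,36 months,0\n"
)
sample_df = load_loans(sample_csv)
print(sample_df.shape)
```

With the real data, `source` would be the path to the CSV extracted from the downloaded zip.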

In [1]:
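# Use Insert to Code to get data using Watson Studio, or load into a pandas DataFrame from the filesystem
import pandas as pd

# `body` is the file-like object created by Insert to Code; outside Watson Studio, pass a file path instead
train_df = pd.read_csv(body)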
train_df.head()

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,batch_enrolled,int_rate,grade,sub_grade,emp_title,...,collections_12_mths_ex_med,mths_since_last_major_derog,application_type,verification_status_joint,last_week_pay,acc_now_delinq,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,loan_status
0,58189336,14350,14350,14350.0,36 months,,19.19,E,E3,clerk,...,0.0,74.0,INDIVIDUAL,,26th week,0.0,0.0,28699.0,30800.0,0
1,70011223,4800,4800,4800.0,36 months,BAT1586599,10.99,B,B4,Human Resources Specialist,...,0.0,,INDIVIDUAL,,9th week,0.0,0.0,9974.0,32900.0,0
2,70255675,10000,10000,10000.0,36 months,BAT1586599,7.26,A,A4,Driver,...,0.0,,INDIVIDUAL,,9th week,0.0,65.0,38295.0,34900.0,0
3,1893936,15000,15000,15000.0,36 months,BAT4808022,19.72,D,D5,Us office of Personnel Management,...,0.0,,INDIVIDUAL,,135th week,0.0,0.0,55564.0,24700.0,0
4,7652106,16000,16000,16000.0,36 months,BAT2833642,10.64,B,B2,LAUSD-HOLLYWOOD HIGH SCHOOL,...,0.0,,INDIVIDUAL,,96th week,0.0,0.0,47159.0,47033.0,0


In [2]:
# Use Insert to Code to get data using Watson Studio, or load into a pandas DataFrame from the filesystem

test_df = pd.read_csv(body)
test_df.shape

(354951, 44)

## Data understanding and shaping

In [3]:
train_df.shape

(532428, 45)

In [4]:
train_df.columns

Index(['member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term',
       'batch_enrolled', 'int_rate', 'grade', 'sub_grade', 'emp_title',
       'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
       'pymnt_plan', 'desc', 'purpose', 'title', 'zip_code', 'addr_state',
       'dti', 'delinq_2yrs', 'inq_last_6mths', 'mths_since_last_delinq',
       'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal',
       'revol_util', 'total_acc', 'initial_list_status', 'total_rec_int',
       'total_rec_late_fee', 'recoveries', 'collection_recovery_fee',
       'collections_12_mths_ex_med', 'mths_since_last_major_derog',
       'application_type', 'verification_status_joint', 'last_week_pay',
       'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim',
       'loan_status'],
      dtype='object')

In [5]:
# train_df.info()

Some information about the columns was collected from the web, for example: https://www.kaggle.com/c/to-loan-or-not-to-loan-that-is-the-question/data, https://www.kaggle.com/sundarn/sundar-narasimhan

`member_id` is a personal identifier, do not want to include this in modeling.  
`loan_amnt` is the listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.  
`funded_amnt` is the total amount committed to that loan at that point in time.  
`funded_amnt_inv` is not the same as `funded_amnt`: it is the total amount committed by investors for that loan at that point in time.  
`term` is the length of time the loan was given `['36 months', '60 months']`.  
`batch_enrolled` classes a loan in one of 105 batches. Need to explore further.  
`int_rate` is the interest rate.  
`grade` is a categorical variable, A through G.  
`sub_grade` is another categorical variable, as above but with groups 1-5 for each letter.  
`emp_title` appears to be the applicant's job title. 190125 unique values in the training data.  
`emp_length` is number of years employed, categorical, < 1 year to 10 + years.  
`home_ownership` is status of housing, one from `['OWN', 'MORTGAGE', 'RENT', 'OTHER', 'NONE', 'ANY']`.  
`annual_inc` is the amount the applicant earns. Continuous variable.  
`verification_status` is categorical, one of `['Source Verified', 'Not Verified', 'Verified']`. It is unclear what is being verified.  
`pymnt_plan` is `y` or `n`.  
`desc` is free-text description of why the loan has been applied for, e.g. "My goal is to obtain a loan to pay off my high credit cards and get out of debt within 3 years."  
`purpose` is a categorical variable with reason for taking the loan, one from `['debt_consolidation', 'home_improvement', 'credit_card', 'other','major_purchase', 'small_business', 'vacation', 'car', 'moving', 'medical', 'wedding', 'renewable_energy', 'house', 'educational']`  
`title` is another category with info as to why the loan was taken out. 39694 unique values.  
`zip_code` is first 3 chars of zip code, 917 uniques  
`addr_state` are state codes  
`dti` is the borrower's debt-to-income ratio, calculated using the monthly payments on the total debt obligations, excluding mortgage, divided by self-reported monthly income.  
`delinq_2yrs` is integers, 0 through 30 and NaNs. Probably the number of delinquencies (late payments) in the last 2 years.  
`inq_last_6mths` is integers, 0 through 31 and NaNs. Going by the name, probably the number of credit inquiries in the last 6 months.  
`mths_since_last_delinq` is an int, the number of months since the last late payment.  
`mths_since_last_record` is another int; probably the number of months since the last public record (compare `pub_rec`).  
`open_acc` is another int. The number of open credit lines in the borrower's credit file.  
`pub_rec` is another int. Number of derogatory public records.  
`revol_bal` is another int. Total credit revolving balance.  
`revol_util` is a 1 decimal place float. Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.  
`total_acc` is another int. The total number of credit lines currently in the borrower's credit file.  
`initial_list_status` values are 'w' or 'f'. Meaning unclear.  
`total_rec_int` are float values, interest received to date.  
`total_rec_late_fee` are float values, total received in late fees.  
`recoveries` could be amount recovered. Float values. Mainly 0.00.  
`collection_recovery_fee` could be fee charged for collecting the recovery. Mainly 0.00.  
`collections_12_mths_ex_med` - Number of collections in 12 months excluding medical.  
`mths_since_last_major_derog` is the number of months since the most recent 90-day or worse rating.  
`application_type` is whether the application is `['INDIVIDUAL', 'JOINT']`  
`verification_status_joint` is `[nan, 'Verified', 'Not Verified', 'Source Verified']` - verification status where application is joint. `train_df[['verification_status_joint', 'application_type']].where(train_df['application_type'] == 'JOINT').dropna()`  
`last_week_pay` is a category, 98 values.  
`acc_now_delinq` is the number of accounts on which the borrower is now delinquent. Int.  
`tot_coll_amt` is the total collection amounts ever owed.  
`tot_cur_bal` is the total current balance of all accounts.  
`total_rev_hi_lim` is the total revolving high credit/credit limit.  

`loan_status` is the current status of the loan.
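Notes like the ones above (unique-value counts, presence of NaNs) can be gathered in one pass. The helper below is a sketch, not the method used to write these notes; `summarize_columns` is a hypothetical name and the toy frame stands in for `train_df`.

```python
import numpy as np
import pandas as pd


def summarize_columns(df):
    """Return dtype, missing-value count and unique-value count per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "n_unique": df.nunique(dropna=True),
    })


# Toy stand-in for train_df with one categorical and one numeric column.
toy = pd.DataFrame({
    "term": ["36 months", "60 months", "36 months"],
    "mths_since_last_delinq": [20.0, np.nan, 23.0],
})
summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(train_df)` would give one row per column of the training data.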

### Actions
```
* Remove: ['member_id', 'batch_enrolled', 'desc']
* Integer encode: ['emp_length', 'delinq_2yrs', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'pub_rec', 'total_acc', 'open_acc', 'collections_12_mths_ex_med', 'acc_now_delinq', 'last_week_pay']
* One hot encode: ['term', 'grade', 'sub_grade', 'home_ownership', 'verification_status', 'pymnt_plan', 'purpose', 'addr_state', 'initial_list_status', 'application_type', 'verification_status_joint']
* Leave as-is: ['annual_inc', 'dti', 'revol_bal', 'revol_util', 'total_acc', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim', 'open_acc', 'mths_since_last_major_derog']
* Unsure: ['emp_title', 'title', 'zip_code']
```
[Encoding help](https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/)
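To make the difference between the two encodings in the action list concrete, here is a small sketch on toy data. The partial `emp_order` mapping is illustrative only (its codes match the scale used for `emp_length` in this notebook), and the toy frame is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "emp_length": ["< 1 year", "10+ years", "3 years"],
    "home_ownership": ["RENT", "OWN", "RENT"],
})

# Integer (ordinal) encoding: ordered categories map to increasing integers,
# so the model can use the ordering.
emp_order = {"< 1 year": 1, "3 years": 4, "10+ years": 11}
toy["emp_length_encoded"] = toy["emp_length"].map(emp_order)

# One-hot encoding: one indicator column per category, so no ordering
# is implied between 'OWN' and 'RENT'.
one_hot = pd.get_dummies(toy["home_ownership"], prefix="home_ownership")
print(toy["emp_length_encoded"].tolist())
print(one_hot.columns.tolist())
```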


In [7]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.impute import SimpleImputer  # replaces preprocessing.Imputer, removed in scikit-learn 0.22

def prep_data(WORKING_DF):
    encoded = pd.DataFrame()

    # drop() returns a new frame, so the result has to be assigned
    WORKING_DF = WORKING_DF.drop(columns=['batch_enrolled', 'desc', 'zip_code'])

    # Mapping and encoding emp_length values
    scale_mapper = {np.nan:0, '< 1 year':1, '1 year':2, '2 years':3, '3 years':4, '4 years':5, '5 years':6, '6 years':7, '7 years':8, '8 years':9, '9 years':10, '10+ years':11}
    encoded['emp_length_encoded'] = WORKING_DF['emp_length'].replace(scale_mapper)

    # Encoding remaining ordinal variables
    # LabelEncoder maps each distinct value to an integer code; NaNs are encoded rather than imputed,
    # which appears to explain the very large codes in the output below
    grouped = WORKING_DF[['delinq_2yrs', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'pub_rec', 'total_acc', 'open_acc', 'collections_12_mths_ex_med', 'acc_now_delinq', 'last_week_pay']]
    grouped = grouped.apply(preprocessing.LabelEncoder().fit_transform)
    encoded = pd.concat([encoded, grouped], axis=1)

    # One-hot encode nominal variables
    # Note: member_id is numeric, so get_dummies passes it through unchanged as a feature
    grouped = WORKING_DF[['term', 'grade', 'sub_grade', 'home_ownership', 'verification_status', 'pymnt_plan', 'purpose', 'addr_state', 'initial_list_status', 'application_type', 'verification_status_joint', 'member_id']]
    grouped = pd.get_dummies(grouped)
    encoded = pd.concat([encoded, grouped], axis=1)

    # Append float columns to encoded df
    # (total_acc and open_acc were also label-encoded above, so they appear twice in the result)
    grouped = WORKING_DF[['annual_inc', 'dti', 'revol_bal', 'revol_util', 'total_acc', 'total_rec_int', 'total_rec_late_fee', 'recoveries',
                          'collection_recovery_fee', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim', 'open_acc', 'mths_since_last_major_derog']]

    # Fill NaN values with mean of each column
    fill_NaN = SimpleImputer(missing_values=np.nan, strategy='mean')
    imputed_df = pd.DataFrame(fill_NaN.fit_transform(grouped))
    imputed_df.columns = grouped.columns
    imputed_df.index = grouped.index

    encoded = pd.concat([encoded, imputed_df], axis=1)
    return encoded

In [8]:
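# Note: the encoders and the imputer are refit on each frame, so codes and fill values may differ between train and test
encoded = prep_data(train_df)
encoded_test_df = prep_data(test_df)
# Drop the dummy column that is presumably absent from the test data, so both frames have the same columns
encoded = encoded.drop(columns=['home_ownership_ANY'])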

In [9]:
encoded_test_df.head()

Unnamed: 0,emp_length_encoded,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,pub_rec,total_acc,open_acc,collections_12_mths_ex_med,acc_now_delinq,...,total_acc.1,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,open_acc.1,mths_since_last_major_derog
0,5,1,1,20,57397,0,52,16,0,0,...,53.0,3915.61,0.0,0.0,0.0,0.0,85230.0,45700.0,16.0,44.079923
1,6,0,0,101777,191677,0,62,8,0,0,...,63.0,1495.06,0.0,0.0,0.0,0.0,444991.0,21400.0,8.0,44.079923
2,11,1,0,23,115,1,19,11,0,0,...,20.0,2096.21,0.0,0.0,0.0,0.0,105737.0,16300.0,11.0,26.0
3,11,0,0,101778,191676,0,25,21,0,0,...,26.0,1756.31,0.0,0.0,0.0,0.0,287022.0,72400.0,21.0,44.079923
4,6,0,0,101779,191675,0,35,16,0,0,...,36.0,172.21,0.0,0.0,0.0,0.0,234278.0,26700.0,16.0,44.079923
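One-hot encoding the training and test frames separately can leave them with different columns when a category occurs in only one of them. A more general way to line them up than dropping individual columns by hand is `DataFrame.reindex`. This is a sketch on toy frames with hypothetical values; it assumes unique column names, which is not the case for the `encoded` frame as built in this notebook (`total_acc` and `open_acc` appear twice), so those duplicates would need renaming first.

```python
import pandas as pd

# Toy stand-ins for the encoded training and test frames.
train_encoded = pd.DataFrame({
    "home_ownership_ANY": [0, 1],
    "home_ownership_RENT": [1, 0],
    "dti": [12.5, 30.1],
})
test_encoded = pd.DataFrame({"dti": [8.0], "home_ownership_RENT": [1]})

# Give the test frame exactly the training columns, in the same order;
# categories never seen in the test data become all-zero columns.
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_aligned)
```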


## Modeling
### SGD Classifier

In [10]:
from sklearn.model_selection import train_test_split

X = encoded
y = train_df['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

In [11]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=150, tol=0.31)
sgd_clf.fit(X_train, y_train)



SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=150, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=0.31, verbose=0, warm_start=False)

In [12]:
sgd_clf.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

In [13]:
sgd_clf.score(X_test, y_test)

0.81780071671662646

In [14]:
# Cross-fold validation example

# from sklearn.metrics import accuracy_score
# from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20

# numFolds = 10
# kf = KFold(n_splits=numFolds, shuffle=True)

# params = [{'max_iter': 150, 'tol': 0.31}]
# Models = [SGDClassifier]

# for param, Model in zip(params, Models):
#     total = 0
#     for train_indices, test_indices in kf.split(X_train):
#         reg = Model(**param)
#         # Fit on this fold's training rows and score on its held-out rows
#         reg.fit(X_train.iloc[train_indices], y_train.iloc[train_indices])
#         predictions = reg.predict(X_train.iloc[test_indices])
#         total += accuracy_score(y_train.iloc[test_indices], predictions)

#     accuracy = total / numFolds
#     print("Accuracy score of {0}: {1}".format(Model.__name__, accuracy))



Accuracy score of SGDClassifier: 0.7618832968964817
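The same k-fold idea can be written more compactly with scikit-learn's `cross_val_score`, which handles the splitting, fitting and scoring per fold. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan features, so its score is not comparable to the figure above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y_train.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

clf = SGDClassifier(max_iter=150, tol=0.31, random_state=0)
# One accuracy value per fold; each fold is held out once.
scores = cross_val_score(clf, X_demo, y_demo, cv=10, scoring="accuracy")
print("Mean accuracy over %d folds: %.3f" % (len(scores), scores.mean()))
```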


### Random Forest Classifier

In [15]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=1000, max_depth=2, random_state=0)
rf_clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=1000, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [16]:
importances = rf_clf.feature_importances_
indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

1. feature 11 (0.164468)
2. feature 131 (0.107282)
3. feature 132 (0.104154)
4. feature 147 (0.090419)
5. feature 10 (0.065221)
6. feature 148 (0.057849)
7. feature 12 (0.056771)
8. feature 13 (0.052084)
9. feature 149 (0.045234)
10. feature 139 (0.032300)
11. feature 143 (0.028521)
12. feature 145 (0.027919)
13. feature 146 (0.025348)
14. feature 62 (0.024167)
15. feature 2 (0.016117)
16. feature 141 (0.010369)
17. feature 79 (0.009622)
18. feature 61 (0.009314)
19. feature 140 (0.008980)
20. feature 4 (0.008588)
21. feature 151 (0.006584)
22. feature 150 (0.006151)
23. feature 3 (0.004844)
24. feature 5 (0.004689)
25. feature 7 (0.004240)
26. feature 60 (0.003182)
27. feature 0 (0.002830)
28. feature 15 (0.002696)
29. feature 1 (0.002041)
30. feature 67 (0.001655)
31. feature 14 (0.001598)
32. feature 18 (0.001518)
33. feature 6 (0.001437)
34. feature 84 (0.001389)
35. feature 63 (0.001292)
36. feature 69 (0.001271)
37. feature 8 (0.001009)
38. feature 142 (0.000676)
39. feature 138 
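The ranking above identifies features only by position. Pairing the importances with the column names makes it easier to read. This is a sketch on synthetic data; the column names `f_a` to `f_e` are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded loan features.
X_arr, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f_a", "f_b", "f_c", "f_d", "f_e"])

forest = RandomForestClassifier(n_estimators=50, max_depth=2, random_state=0)
forest.fit(X_demo, y_demo)

# Index the importances by column name and sort from most to least important.
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(ranked)
```

In this notebook the equivalent would be `pd.Series(rf_clf.feature_importances_, index=X_train.columns)`.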

In [17]:
rf_clf.score(X_test, y_test)

0.76272472522106272
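`score` reports plain accuracy, which can look good even for a model that always predicts the majority class. The toy sketch below (hypothetical labels, not the loan data) shows how a confusion matrix and recall expose that case.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels: eight rows of class 0 and two rows of class 1.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A "model" that always predicts class 0.
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)      # share of true class-1 rows that were found
cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
print(acc, rec)
print(cm)
```

Here accuracy is high while recall for class 1 is zero, so it is worth checking `confusion_matrix(y_test, rf_clf.predict(X_test))` alongside the score.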

## Score test data

In [18]:
encoded_test_df.head()

Unnamed: 0,emp_length_encoded,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,pub_rec,total_acc,open_acc,collections_12_mths_ex_med,acc_now_delinq,...,total_acc.1,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,tot_coll_amt,tot_cur_bal,total_rev_hi_lim,open_acc.1,mths_since_last_major_derog
0,5,1,1,20,57397,0,52,16,0,0,...,53.0,3915.61,0.0,0.0,0.0,0.0,85230.0,45700.0,16.0,44.079923
1,6,0,0,101777,191677,0,62,8,0,0,...,63.0,1495.06,0.0,0.0,0.0,0.0,444991.0,21400.0,8.0,44.079923
2,11,1,0,23,115,1,19,11,0,0,...,20.0,2096.21,0.0,0.0,0.0,0.0,105737.0,16300.0,11.0,26.0
3,11,0,0,101778,191676,0,25,21,0,0,...,26.0,1756.31,0.0,0.0,0.0,0.0,287022.0,72400.0,21.0,44.079923
4,6,0,0,101779,191675,0,35,16,0,0,...,36.0,172.21,0.0,0.0,0.0,0.0,234278.0,26700.0,16.0,44.079923


In [20]:
sgd_scored = sgd_clf.predict(encoded_test_df[:20])
sgd_scored

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

In [21]:
rf_scored = rf_clf.predict(encoded_test_df[:20])
rf_scored

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

I chose the [`SGDClassifier`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) because of the flowchart below, which I find quite useful for a first pass at an ML problem.  
There are hyperparameters we can adjust to see better or worse performance (`max_iter`, `tol`, etc.).  

<img src="http://1.bp.blogspot.com/-ME24ePzpzIM/UQLWTwurfXI/AAAAAAAAANw/W3EETIroA80/s1600/drop_shadows_background.png" alt="scikit-learn algorithm selection flowchart" style="width:500px;"/>
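One way to see the impact of hyperparameters such as `max_iter` and `tol` is a small grid search. The sketch below is illustrative only: it runs on synthetic data, and the grid values (including the extra `alpha` parameter) are assumptions, not tuned settings from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the loan features.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Every combination in the grid is fit and scored with 3-fold cross-validation.
param_grid = {"max_iter": [50, 150], "tol": [0.31, 1e-3], "alpha": [1e-4, 1e-2]}
search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

`search.cv_results_` holds the mean score for each combination, which is what gives visibility of how each setting affects performance.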

### Further work
- Could do better pre-processing of the float values, with scaling/normalization.  
- Could deploy the model/models in WML.  
- Could look at more evaluation statistics of the models.  
- Could integrate with AI OpenScale.
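As a sketch of the first item, a `StandardScaler` can be chained in front of the classifier with `make_pipeline`, so the scaling learned on the training data is reused at prediction time. The data is synthetic; the two columns only loosely imitate the different scales of features such as `dti` and `tot_cur_bal`.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two features on very different scales.
X_demo = np.column_stack([rng.normal(15, 5, 200), rng.normal(100000, 50000, 200)])
y_demo = (X_demo[:, 0] > 15).astype(int)

# The scaler is fit inside the pipeline, then the classifier sees standardized features.
scaled_sgd = make_pipeline(
    StandardScaler(),
    SGDClassifier(max_iter=150, tol=0.31, random_state=0),
)
scaled_sgd.fit(X_demo, y_demo)

# Inspect the standardized features: each column has mean 0 and unit variance.
X_scaled = scaled_sgd.named_steps["standardscaler"].transform(X_demo)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```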

---
_Notebook author: Joe Plumb_  
_Notebook version: 1.0_