## Predicting Probability of Loan Delinquency

#### In this assignment, we will explore historical loan data from [Lending Club](https://www.kaggle.com/wendykan/lending-club-loan-data/data) and build a very short machine learning pipeline to predict the likelihood of loan delinquency.

#### We will start by inspecting and describing the data, then prepare the dataset for a K-Nearest Neighbor algorithm and evaluate the algorithm's performance. This exercise walks you through _most_ steps of a machine learning project in simplified form, with a focus on data manipulation rather than algorithm training. You are encouraged to check out the [sklearn library](http://scikit-learn.org/stable/index.html) and this week's recommended reading materials to learn more about ML!

#### There are 17 questions in total. Each question is worth 5 points. Proper documentation and readability of your code are worth 15 points.

In [1]:
import pandas as pd
import numpy as np
pd.options.display.max_columns = 999
pd.set_option('display.float_format', lambda x: '%.3f' % x)

### Inspect the data 

In [5]:
loan = pd.read_csv('loan.csv')





In [6]:
print(loan.shape)

(887379, 74)


In [7]:
print(list(loan))

['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_il_6m', 'open_il_12m', 'open_il

In [23]:
print(loan['member_id'][0:5])
print(loan['id'][0:5])

# count the distinct member ids
print(loan['member_id'].nunique())

0    1296599
1    1314167
2    1313524
3    1277178
4    1311748
Name: member_id, dtype: int64
0    1077501
1    1077430
2    1077175
3    1076863
4    1075358
Name: id, dtype: int64
887379


In [18]:
u_mid = loan.member_id.unique()
print(len(u_mid))
print(max(u_mid), min(u_mid))

887379
73544841 70473


In [19]:
uid = loan.id.unique()
print(len(uid))
print(max(uid), min(uid))

887379
68617057 54734


#### Question 1: Browse the data dictionary and use pandas functions to explore the data, then answer the following questions in a paragraph. You don't need to show your work:
    - How many columns are there? What general categories of information do they provide? 
    - What’s the time frame of the dataset based on the issue date of the loans? (You will have to parse the date) 
    - What’s the difference between “Id” and “Member Id”? 
    - How many unique borrowers and loans are in the dataset?
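One way to approach the date question is sketched below on a toy Series rather than the real file. The `'Mon-YYYY'` string format (for example `'Dec-2011'`) is an assumption about how `issue_d` is stored; check a few raw values before relying on it.

```python
import pandas as pd

# Toy stand-in for loan['issue_d']; the 'Mon-YYYY' format is an assumption.
issue_d = pd.Series(['Dec-2011', 'Jan-2015', 'Jun-2007'])

# Parse the strings into datetimes, then read off the earliest and latest dates.
issue_dt = pd.to_datetime(issue_d, format='%b-%Y')
first, last = issue_dt.min(), issue_dt.max()
print(first.year, last.year)
```

The same `min()`/`max()` pattern applied to the parsed column of the full dataframe would give the time frame of the dataset.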

#### write a summary paragraph here: 

The "loan.csv" file has 74 columns. The general categories are loan amount, borrower location, past loan performance, and where the loan was obtained.

The number of unique Id and Member Id values is the same (887379). Both the maximum and the minimum Member Id (73544841 and 70473) are higher than the corresponding Id values (68617057 and 54734).

There are 887379 borrowers and 887379 loans. 

### Descriptive statistics - About the clients

####  What's the average annual individual income of the borrowers?

#### What are the top ten purposes people borrow money for? 

#### Which 10 states have the most loans? Sort in descending order.
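A minimal sketch of the pandas calls that fit these three questions, run on a small made-up frame. The column names come from the dataset's column list; the values are invented for illustration.

```python
import pandas as pd

# Invented rows; only the column names match the real data.
toy = pd.DataFrame({
    'annual_inc': [50000.0, 70000.0, 90000.0, 30000.0],
    'purpose': ['car', 'debt_consolidation', 'debt_consolidation', 'debt_consolidation'],
    'addr_state': ['CA', 'NY', 'CA', 'CA'],
})

# Average annual income.
mean_income = toy['annual_inc'].mean()

# value_counts() sorts by frequency in descending order, so head(10) gives the top ten.
top_purposes = toy['purpose'].value_counts().head(10)
top_states = toy['addr_state'].value_counts().head(10)
```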

### Descriptive statistics - About the loans

#### How many loans were issued per year?

#### What’s the average interest rate for loans in each grade and within that for each sub_grade? 

#### What's the percentage breakdown of loan status? In particular, we want to see the percentage of loans that defaulted and the percentage that were fully paid.

#### How does the default rate vary from year to year? Create one dataframe showing year vs. default rate.
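The loan-level questions above mostly reduce to `groupby` and `value_counts`. The sketch below uses an invented four-row frame; the `'Mon-YYYY'` date format, the helper columns `issue_yr` and `is_default`, and the exact status spelling `'Default'` are assumptions to verify against the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    'issue_d': ['Jan-2014', 'Mar-2014', 'Feb-2015', 'Jul-2015'],
    'grade': ['A', 'A', 'B', 'B'],
    'sub_grade': ['A1', 'A2', 'B1', 'B1'],
    'int_rate': [5.0, 6.0, 9.0, 11.0],
    'loan_status': ['Fully Paid', 'Default', 'Fully Paid', 'Current'],
})
# Hypothetical helper column holding the issue year.
toy['issue_yr'] = pd.to_datetime(toy['issue_d'], format='%b-%Y').dt.year

# Loans issued per year.
loans_per_year = toy.groupby('issue_yr').size()

# Average interest rate per grade, and within each grade per sub_grade.
rate_by_grade = toy.groupby(['grade', 'sub_grade'])['int_rate'].mean()

# Percentage breakdown of loan status.
status_pct = toy['loan_status'].value_counts(normalize=True) * 100

# Default rate per year: the mean of a boolean flag is the share of True values.
default_rate = (
    toy.assign(is_default=toy['loan_status'].eq('Default'))
       .groupby('issue_yr')['is_default'].mean()
       .reset_index(name='default_rate')
)
```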

### Case Study

#### You are considering approval of a loan of 100,000 dollars to a couple whose joint income is 200,000 dollars. You've verified their information, and now you are considering whether to approve this loan and what interest rate and term you would offer based on historical data. Write a short paragraph about how you would decide whether or not to approve the loan based on available data on borrowers such as income, employment history, and so on. You are encouraged, but not required, to include numbers you obtained from analyzing historical data. You may also discuss other uncertainties you would like to consider to justify your decision.

### Machine Learning

#### Machine learning algorithms allow us to take in large quantities of information, identify patterns, and make data-driven decisions. Now we want to use them to aid our decision-making by predicting a client's likelihood of delinquency.

#### Some terminology before we start:
__features__: also called regressors, predictors, or right-hand-side variables; the variables that you use to predict your outcome

__target variable__: also called the outcome variable; the variable whose value you are trying to predict.



### A machine learning pipeline generally consists of :

__Data preprocessing__: clean the data by getting rid of outliers and idiosyncrasies, standardizing the data to the same scale, and creating the target variable if it's not clearly defined by the problem.


#### Then an iterative process of:

- __Feature generation and selection__: create predictors, ie features, to provide "hints" to the model and select the most informative features.


- __Cross-validation__: repeatedly divide the dataset into training and testing folds to obtain performance metrics for the algorithms. Common partition schemes are 10-fold and 5-fold cross-validation. Read more [here](https://genome.tugraz.at/proclassify/help/pages/XV.html)


- __Algorithm selection/Grid search__: find the best-performing algorithm and the best hyper-parameters for that algorithm. No single algorithm works best on every problem (the so-called "no free lunch theorem"), so this search can be time consuming, but the process is generally quite automated.



- __Evaluation__: choose your loss function and metrics you want to maximize based on your real concerns and tradeoffs between false positive/false negative rate, training time/latency, and so on.


- __Implementation__: In a real project, the algorithm you choose is also informed by the accessibility of the data in real time, the data literacy required from users relative to the complexity of your algorithm, transparency/ethics, maintenance costs, and so on. You will also build in checks for abnormal prediction results.
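The cross-validation and grid-search steps in the list above can be sketched with scikit-learn as follows. This is an illustration on synthetic data from `make_classification`, not part of the assignment; the candidate values for `n_neighbors` are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a cleaned feature matrix and binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo)

# Scale the features, then classify with KNN.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Try a few values of K, scoring each with 5-fold cross-validation on the training set.
grid = GridSearchCV(
    pipe,
    param_grid={'kneighborsclassifier__n_neighbors': [3, 5, 11]},
    cv=5,
    scoring='accuracy',
)
grid.fit(X_tr, y_tr)

best_k = grid.best_params_['kneighborsclassifier__n_neighbors']
test_acc = grid.score(X_te, y_te)  # accuracy of the refit best model on held-out data
```

The held-out test split is only touched once, after the grid search has picked a value of K, so `test_acc` is not influenced by the tuning.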

#### In the following section you will get the data ready to feed it through a machine learning pipeline.

In [None]:
# read in your data frame here
# warning: this will update the df global variable 
# and erase the columns you created previously for descriptive analysis! we want to start fresh here.

#### First of all, we will subset our dataframe. To keep the training time short, we will only use data from 2015. Then let's divide it into two parts: historical records with known payment outcomes vs. loans that are currently active or just issued. We will use the historical records (with known outcomes) to train, test, and evaluate our model. In practice, we would then apply the trained model to predict outcomes for loans that are still active (outcomes unknown).

__Task__: Create one df that contains all loans issued in 2015, then split it into two dataframes: one whose status is either 'Current' or 'Issued' (call it _application_), and one that contains the rest (call it _development_).
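A possible shape for this split, shown on an invented four-row frame. The `'Mon-YYYY'` date format is an assumption about the raw `issue_d` values.

```python
import pandas as pd

toy = pd.DataFrame({
    'issue_d': ['Dec-2014', 'Jan-2015', 'Feb-2015', 'Mar-2015'],
    'loan_status': ['Fully Paid', 'Current', 'Issued', 'Charged Off'],
})

# Keep only loans issued in 2015.
issue_year = pd.to_datetime(toy['issue_d'], format='%b-%Y').dt.year
loans_2015 = toy[issue_year == 2015]

# Loans still open go to `application`; everything else goes to `development`.
is_open = loans_2015['loan_status'].isin(['Current', 'Issued'])
application = loans_2015[is_open].copy()
development = loans_2015[~is_open].copy()
```

Calling `.copy()` makes both pieces independent dataframes, so adding columns to `development` later does not trigger chained-assignment warnings.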

#### Now we'll work with our _development set_. 

#### We will adjust our outcome variable slightly -- instead of predicting default, which is quite rare and therefore hard to predict accurately, we will predict "delinquency", which includes loans that are charged off, late, or defaulted. This increases the number of positive outcomes and also simplifies the problem into a binary classification, meaning we have essentially two outcomes: delinquent (1) / not delinquent (0).

__Task:__ Create a new column called "delinquency" that consolidates the values from "loan_status" into a binary variable that is 1 for a delinquent loan and 0 otherwise. A delinquent loan is one whose status is any of the following: charged off, late, in grace period, does not meet the credit policy - charged off, or default. Drop the "loan_status" variable once you are done creating "delinquency", since it's no longer needed.
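A sketch of the consolidation using `isin`. The exact status strings below are assumptions; list the real ones with `development['loan_status'].unique()` and adjust before using this.

```python
import pandas as pd

development = pd.DataFrame({
    'loan_status': ['Fully Paid', 'Charged Off', 'Late (31-120 days)',
                    'In Grace Period', 'Default'],
})

# Assumed spellings of the delinquent statuses; verify against the data.
delinquent_statuses = [
    'Charged Off',
    'Late (16-30 days)',
    'Late (31-120 days)',
    'In Grace Period',
    'Does not meet the credit policy. Status:Charged Off',
    'Default',
]

# isin() returns booleans; astype(int) turns them into 1/0.
development['delinquency'] = development['loan_status'].isin(delinquent_statuses).astype(int)
development = development.drop(columns='loan_status')
```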

#### Feature generation: Once we have our target variable ready, we can move on to generate some features. The model does not learn if you do not provide it with informative hints! 

#### First, we will drop the id columns since they serve only for bookkeeping in this case. To simplify the task, we will also drop the 'url', 'desc', 'title', 'zip_code', and 'emp_title' columns because they contain text that requires more tedious cleaning and natural language processing. Finally, we will drop 'grade' because its information is contained in 'sub_grade'.

__Task__: drop 'id', 'member_id', 'url', 'desc', 'title', 'zip_code', 'emp_title', 'grade' columns

#### Browse through a few rows and you'll notice we have a mix of categorical, date, and numerical variables. Turn all categorical variables into dummy variables using pd.get_dummies().

__Task__: Turn all categorical variables into dummy variables and join them back to the dataframe. Drop the original variables that were used to create the dummy variables. *Hint: There are 11 categorical variables*
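The dummy-variable step might look like the sketch below, shown for two categorical columns on invented rows. Listing the categorical columns explicitly in `cat_cols` (a hypothetical name) is one design choice; it avoids accidentally encoding the date columns, which are also stored as strings.

```python
import pandas as pd

toy = pd.DataFrame({
    'loan_amnt': [1000, 2000, 1500],
    'term': ['36 months', '60 months', '36 months'],
    'home_ownership': ['RENT', 'OWN', 'RENT'],
})

# Explicit list of the categorical columns to encode.
cat_cols = ['term', 'home_ownership']

# One indicator column per category, named '<column>_<value>'.
dummies = pd.get_dummies(toy[cat_cols])

# Drop the originals and join the indicator columns back on the index.
toy = toy.drop(columns=cat_cols).join(dummies)
```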

#### Now let's clean the date variables. Date variables generally don't provide much information unless we extract something useful from them. For example, from the entry/exit dates of a project we can infer the project duration. We can also extract the month/year from the dates, especially when there's seasonality in the data. For instance, many borrowers may have failed to pay off their loans after the 2008 financial crisis. In this data, we will extract the year from the dates.

__Task__: 
        - 1) Make sure date columns contain datetime types 
        - 2) Create 'issued_yr' by extracting year from 'issue_d'
        - 3) Create 'earliest_cr_line_yr' by extracting year from 'earliest_cr_line' 
        - 4) Create 'payment_period' by subtracting 'last_pymnt_d' from 'next_pymnt_d' (make sure you convert the resulting timedelta to an integer number of days: 54 days => 54)
        - 5) Create 'last_credit_pull_yr' by extracting year from 'last_credit_pull_d'
        - 6) Drop all original date variables
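The steps above can be sketched as follows for three of the date columns, on invented rows. The `'Mon-YYYY'` format is an assumption about the raw strings, and a missing date is included to show what happens to the difference.

```python
import pandas as pd

toy = pd.DataFrame({
    'issue_d': ['Jan-2015', 'Mar-2015'],
    'last_pymnt_d': ['Dec-2015', 'Nov-2015'],
    'next_pymnt_d': ['Jan-2016', None],
})
date_cols = ['issue_d', 'last_pymnt_d', 'next_pymnt_d']

# 1) Convert the strings to datetimes; missing values become NaT.
for col in date_cols:
    toy[col] = pd.to_datetime(toy[col], format='%b-%Y')

# 2) Extract the year.
toy['issued_yr'] = toy['issue_d'].dt.year

# 4) Subtracting datetimes gives timedeltas; .dt.days converts them to a number of days.
toy['payment_period'] = (toy['next_pymnt_d'] - toy['last_pymnt_d']).dt.days

# 6) Drop the original date columns.
toy = toy.drop(columns=date_cols)
```

Rows with a missing date end up with a missing `payment_period`, which then has to be handled in the missing-value step.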

#### As a last step, we will fill in the missing values for each column. For numerical columns such as annual income, it's reasonable to assume a missing value is close to the median or mean income. For others, such as payments or accounts opened, a missing value may simply indicate that no activity happened, so the value can be 0. You can also use interpolation to fill in the missing values.

__Task__: Fill in missing values at your discretion. Include justification of your methods in the comments.
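The two filling strategies described above can be sketched like this. Which columns get which treatment is a judgment call; the pairing below is only an example on invented values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'annual_inc': [40000.0, np.nan, 60000.0, 100000.0],
    'tot_coll_amt': [0.0, 250.0, np.nan, np.nan],
})

# Income: assume a missing value is close to the typical (median) income.
toy['annual_inc'] = toy['annual_inc'].fillna(toy['annual_inc'].median())

# Collection amount: assume a missing value means nothing was ever collected.
toy['tot_coll_amt'] = toy['tot_coll_amt'].fillna(0)
```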

### Now let's feed the cleaned dataset through the pipeline! 

#### X represents all the predictors - features; y represents the target variable - whether the borrower has been delinquent on the loan

__Task__: Divide the development dataframe into X and y. X will be a dataframe that contains all predictors. y will be a single column, and therefore a Series, that contains the target variable 'delinquency'.

#### For this pipeline, we will experiment with the K-Nearest Neighbor algorithm. Much like what you did in the case study, where you found comparable examples to the case at hand and decided whether or not to lend money, this algorithm calculates the distance from each sample to the others in the feature space, finds the K nearest neighbors of that sample, and assigns it the most common class among those neighbors.

#### There are several variations of the algorithm, some more efficient than others -- calculating the distance between one sample and all the rest can get computationally expensive really fast! Read more about KNN [here](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/)
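To make the idea concrete, here is a brute-force version of KNN written with NumPy. It is a teaching sketch, not what scikit-learn does internally: the function name `knn_predict` is made up, it uses plain Euclidean distance, and it recomputes every distance for each query, which is exactly the cost that the more efficient variants avoid.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training sample.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Count the class labels among the neighbors and return the most common one.
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Two small clusters: class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_toy = np.array([0, 0, 0, 1, 1])

pred_near_origin = knn_predict(X_toy, y_toy, np.array([0.3, 0.3]))
pred_far = knn_predict(X_toy, y_toy, np.array([4.8, 5.2]))
```

With `k=3`, the second query has two class-1 neighbors and one class-0 neighbor, so the vote still goes to class 1.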



__Task__: Call the pipeline function and store the returned scores in a variable. You do NOT need to modify the pipeline function.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import train_test_split, KFold
from sklearn.model_selection import cross_val_score
import time

def pipeline(X, y):
    
    begin = time.time()
    
    # we will split the development set into training(80%) and testing (20%)
    # 80-20 is a convention. 70-30 is also used. The choice really depends on the base rate of your outcome
    X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=123, stratify=y)

    # here we use the make_pipeline function to build a pipeline that standardizes
    # the data with StandardScaler and feeds it through a K-nearest neighbor algorithm
    pipe_kn = make_pipeline(StandardScaler(), 
                            KNN(n_neighbors=30))

    # we will set up a 5-fold cross validation scheme to evaluate the model
    scores = cross_val_score(estimator=pipe_kn,
                             X=X_train,
                             y=y_train,
                             cv=5,
                             scoring='accuracy',
                             n_jobs=1)
    end = time.time()

    print("The model took %d seconds to train." % (end-begin))
    
    return scores

In [None]:
scores = pipeline(X, y)

#### The model has now been trained, so let's see how it performs. There are many metrics you can use to evaluate an algorithm, including accuracy, precision, recall, ROC AUC, and so on. You can read more about them [here](https://machinelearningmastery.com/classification-accuracy-is-not-enough-more-performance-measures-you-can-use/). For simplicity, we will use accuracy: of all the predictions we made, what fraction were correct?

#### To make a meaningful comparison, we need a baseline accuracy. Accuracy is often not a good enough metric by itself. Imagine a very skewed outcome that only happens 1% of the time. A model that always gives negative predictions would reach an accuracy of 99% without learning anything. So we want our model to have a higher accuracy than we would get by always predicting the majority class, that is, 1 minus the rate of positive outcomes.

__Task__: Identify the baseline accuracy. That is, 1 - (# of delinquent loans) / (total # of loans).
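As a small worked example of the formula, take an invented target with one delinquent loan out of four. Because the labels are 0/1, the mean of the column is the delinquency rate.

```python
import pandas as pd

# Invented target: 1 = delinquent, 0 = not delinquent.
y_toy = pd.Series([0, 0, 0, 1])

delinquency_rate = y_toy.mean()       # 1 delinquent out of 4 loans
baseline_toy = 1 - delinquency_rate   # accuracy of always predicting "not delinquent"
print(baseline_toy)
```

Here a model that never predicts delinquency is right three times out of four, so any useful model has to beat 0.75 on this toy data.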

In [None]:
baseline = 1 - y.mean()  # share of non-delinquent loans; assumes y is the 0/1 'delinquency' Series

#### Now let's print out our accuracy obtained from cross-validation by running the following code

In [None]:
print('CV accuracy scores: %s' % scores)
print('CV accuracy: %.3f +/- %.3f' % (np.mean(scores), np.std(scores)))

### As you can see the average accuracy is way above the baseline rate. Congrats!

#### Bonus step: Now that we are happy with the algorithm, you can train the model on the entire development dataset and use it to predict on the application dataset, where the actual outcomes are not yet observed. Since it can take quite some time to train a model, you can "pickle" the trained model to avoid retraining it each time you use it. See more about the pickle library [here](https://docs.python.org/3/library/pickle.html).
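A minimal sketch of the pickling round trip, using a small KNN model fit on synthetic data as a stand-in for the real one. It serializes to an in-memory bytes object with `pickle.dumps`; to keep the model between sessions you would write to a file with `pickle.dump` instead.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the development data.
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X_demo, y_demo)

# Serialize the fitted model to bytes, then restore it.
blob = pickle.dumps(model)
restored = pickle.loads(blob)

# The restored model should make the same predictions as the original.
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
```

Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.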