# Project Idea 2:
- A good loan (from the perspective of an investor) pays 
the interest and the scheduled principal on time and terminates at loan maturity.
- An investor often loses money when a loan goes into default, settlement,
or is 'written off' (called **charged off** in this data set).
- Build a supervised model to make **multi-label** predictions on 3 dimensions:
"charged off + default", "settlement involved", "hardship".
- This can be used either by **Lending Club** itself or by a third-party investing firm
for loan-grade design or accurate portfolio selection.
- Depending on the scope of your project, you may 
    - tackle a single-label prediction.
    - restrict yourself to the pooled models.
    - focus on the time series models.
- This is a **multi-label**, binary, imbalanced classification task.
- If you train a **pooled** model, you have to deal with $2M+$ samples, often too
many for a typical ML algorithm to handle.
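The three targets named above could be derived from existing columns. The sketch below is one possible construction, not a prescribed one: the column names `loan_status`, `debt_settlement_flag` and `hardship_flag` appear in the data set, but the status strings `"Charged Off"` and `"Default"`, the label names, and the helper `build_labels` are assumptions to check against `df['loan_status'].unique()`.

```python
import pandas as pd

# Toy rows mimicking three columns of the accepted-loans file (values are made up).
loans = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Default", "Charged Off"],
    "debt_settlement_flag": ["N", "Y", "N", "N", "N"],
    "hardship_flag": ["N", "N", "Y", "N", "N"],
})

# Assumed spelling of the "bad" statuses; verify against the real data.
BAD_STATUSES = {"Charged Off", "Default"}


def build_labels(df):
    """Return one 0/1 column per prediction target (hypothetical helper)."""
    return pd.DataFrame(
        {
            "bad_loan": df["loan_status"].isin(BAD_STATUSES).astype(int),
            "settlement": (df["debt_settlement_flag"] == "Y").astype(int),
            "hardship": (df["hardship_flag"] == "Y").astype(int),
        },
        index=df.index,
    )


labels = build_labels(loans)
print(labels)
print(labels.mean())  # positive rate per label, a quick look at the imbalance
```

Each column can then be modelled on its own (single-label) or all three together (multi-label).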

- Try several imbalanced-classification techniques and evaluate their performance.
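As one possible starting point, the sketch below compares an unweighted logistic regression with a class-weighted one on synthetic data with roughly 5% positives. The data, the choice of model and the metrics are illustrative assumptions; resampling methods such as SMOTE (listed in the outline) need the separate `imbalanced-learn` package and are not shown here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: about 5% of rows are positives.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

results = {}
for name, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    clf = LogisticRegression(class_weight=class_weight, max_iter=1000)
    clf.fit(X_train, y_train)
    proba = clf.predict_proba(X_test)[:, 1]
    results[name] = {
        # Recall on the rare class at the default 0.5 threshold.
        "recall": recall_score(y_test, clf.predict(X_test)),
        # Threshold-free summary that is more informative than accuracy here.
        "avg_precision": average_precision_score(y_test, proba),
    }

for name, scores in results.items():
    print(name, scores)
```

Accuracy is deliberately left out: a model that predicts "good loan" for every row would already score about 95% on this data.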

- Based on your business case, discuss the negative impact of type I (false
positive) and type II (false negative) errors in your predictions.
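To make the two error types concrete, the sketch below counts them with a confusion matrix and attaches a cost to each. It assumes the positive class is "bad loan", so a false positive is a good loan that was turned down and a false negative is a bad loan that was funded. The two dollar figures are invented placeholders, not estimates from the data.

```python
from sklearn.metrics import confusion_matrix

# 1 = bad loan (positive class), 0 = good loan. Toy labels and predictions.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

# Hypothetical per-error costs for an investor.
COST_FALSE_POSITIVE = 300    # interest missed by skipping a good loan
COST_FALSE_NEGATIVE = 2000   # principal lost by funding a bad loan

total_cost = fp * COST_FALSE_POSITIVE + fn * COST_FALSE_NEGATIVE
print(f"type I errors (FP): {fp}, type II errors (FN): {fn}, total cost: {total_cost}")
```

A lender and a third-party investor would likely plug in different costs, which in turn shifts the preferred decision threshold.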

- If you decide to train a time series model, make sure that you have some
basic background in hyper-parameter tuning in a time series context.
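One common way to tune hyper-parameters without leaking future information is to replace shuffled k-fold with forward-chaining splits. The sketch below uses scikit-learn's `TimeSeriesSplit` inside `GridSearchCV`; the synthetic data, the model and the grid over `C` are assumptions, and the rows are assumed to be sorted by issue date already.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
n = 600
X = rng.normal(size=(n, 4))  # rows assumed to be in chronological order
y = (X[:, 0] + rng.normal(size=n) > 1.0).astype(int)

tscv = TimeSeriesSplit(n_splits=4)

# Every training fold ends strictly before its validation fold starts.
folds_ordered = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in tscv.split(X)
)
print("train always precedes validation:", folds_ordered)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=tscv,
    scoring="average_precision",
)
search.fit(X, y)
print(search.best_params_)
```

With real loans, sorting by `issue_d` before splitting plays the role of the "chronological order" assumption above.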

- **MUST**: A defaulted loan with a loan amount of $\$1000$ has a totally different 
impact on the final profit than a defaulted $\$50000$ loan. 
   - Discuss whether the **classroom-taught** machine learning techniques 
    address this issue. How would you modify the classifier to take into account 
         - your business objectives?
         - the profit and loss focus?
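A small sketch of why count-based metrics miss this point, and one possible adjustment. The helper `missed_default_dollars` is hypothetical: it sums the loan amounts of defaults the model failed to flag. The second half weights each training row by its loan amount through `sample_weight`, which is only one of several ways to make a classifier cost-aware; the data there is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def missed_default_dollars(y_true, y_pred, loan_amnt):
    """Total loan amount of defaults (1) that were predicted as good (0)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    loan_amnt = np.asarray(loan_amnt, dtype=float)
    missed = (y_true == 1) & (y_pred == 0)
    return float(loan_amnt[missed].sum())


# Two classifiers with the same recall (one default missed each) ...
y_true = [1, 1, 0]
loan_amnt = [1000, 50000, 2000]
loss_a = missed_default_dollars(y_true, [0, 1, 0], loan_amnt)  # misses the $1000 loan
loss_b = missed_default_dollars(y_true, [1, 0, 0], loan_amnt)  # misses the $50000 loan
print(loss_a, loss_b)  # ... but very different dollar exposure

# Cost-aware training on synthetic data: larger loans get larger weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(size=500) > 1.0).astype(int)
amounts = rng.uniform(1000, 40000, size=500)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y, sample_weight=amounts / amounts.mean())
```

A fuller treatment would also account for interest already collected and recoveries, which the data set records in columns such as `total_rec_int` and `recoveries`.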

- Can you use **NLP** techniques to extract insights from the loan descriptions
that help your predictive task?
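A minimal sketch of turning free-text descriptions into numeric features with TF-IDF. The example strings are invented; in the real file the text lives in the `desc` column, which is empty in many rows, so missing values are replaced by an empty string first.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented descriptions; None stands for a loan without any text.
desc = pd.Series(
    [
        "Need to consolidate credit card debt",
        None,
        "Loan for home improvement and roof repair",
        "Want to consolidate several credit cards",
    ]
)

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(desc.fillna(""))  # sparse matrix, one row per loan

print(tfidf.shape)
print(sorted(vectorizer.vocabulary_))  # the learned terms
```

The resulting sparse matrix can be stacked next to the numeric features (for example with `scipy.sparse.hstack`) before fitting a classifier. Rows without text simply become all-zero vectors.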
# Structure: 
- <a href="#preprocessing">Preprocessing</a><br>
- <a href="#function">Function</a><br>
- <a href="#ml">Machine Learning</a><br>
    - Unsupervised Machine Learning  
        - <a href="#kmeans">K Means</a><br> 

    - Supervised Machine Learning      
        - <a href="#decision">Decision Tree</a><br>
        - <a href="#rf">Random Forest</a><br>
        - <a href="#svm">SVM</a><br>
        - <a href="#xgboost">XGBoost</a><br>
        - <a href="#logistic">Logistic Regression</a><br>
        - <a href="#naive">Naive Bayes Classifier</a><br>
        - <a href="#neighbor">Nearest Neighbor</a><br>
- <a href="#imbalance">Handling Imbalanced Data</a><br>
    - <a href="#smote">SMOTE</a><br>

In [2]:
import numpy as np
import pandas as pd

pd.options.display.max_columns = None
pd.options.display.max_rows = 100

# from sklearn.preprocessing import OrdinalEncoder
# from sklearn.preprocessing import OneHotEncoder
# from sklearn.preprocessing import LabelEncoder

# from sklearn.model_selection import train_test_split
# from sklearn.linear_model import LinearRegression 
# from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler 


In [None]:
df_raw_accepted = pd.read_csv('accepted_2007_to_2018Q4.csv')
df_raw_rejected = pd.read_csv('rejected_2007_to_2018Q4.csv')

In [None]:
df_raw_accepted.sample(frac=0.001).to_csv('sample_accepted.csv')
df_raw_rejected.sample(frac=0.001).to_csv('sample_rejected.csv')

In [3]:
sample_accepted = pd.read_csv('sample_accepted.csv')
sample_rejected = pd.read_csv('sample_rejected.csv')

In [None]:
# Show both shapes; a notebook cell only displays its last expression
sample_accepted.shape, sample_rejected.shape

In [5]:
sample_accepted = sample_accepted.sample(2000)
sample_rejected = sample_rejected.sample(2000)

 <p><a name="preprocessing"></a></p>
 
 ## Preprocessing Test

In [8]:
df_processed = sample_accepted.copy()

In [None]:
# Drop irrelevant columns 
drop_list = ['Unnamed: 0','id','member_id','funded_amnt','url','desc','title']

drop_for_grade_list = ['funded_amnt_inv','int_rate','installment','issue_d','loan_status','pymnt_plan','out_prncp','out_prncp_inv']

df_processed = df_processed.drop(drop_list, axis=1)
df_processed = df_processed.drop(drop_for_grade_list, axis=1)

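# Convert categorical to numerical 
df_processed['term'] = df_processed['term'].apply(lambda x: int(x.split()[0]))
# Keep only the digits of emp_length and store them as numbers:
# '10+ years' -> 10 (10 or more years), '< 1 year' -> 1
df_processed['emp_length'] = df_processed['emp_length'].str.extract(r'(\d+)', expand=False).astype(float)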

# Convert to Datetime
df_processed['earliest_cr_line'] = pd.to_datetime(df_processed['earliest_cr_line'])

# Missing Values 

df_processed.mths_since_last_record = df_processed.mths_since_last_record.fillna(0)
df_processed.mths_since_last_delinq = df_processed.mths_since_last_delinq.fillna(0)

df_processed.emp_title = df_processed.emp_title.fillna('None')
df_processed.emp_length = df_processed.emp_length.fillna(0)

df_processed.revol_util = df_processed.revol_util.fillna(0)

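# Rough proxy for missing dti: revol_bal / annual_inc is a plain ratio while dti is
# reported in percent, and annual_inc == 0 yields inf, so treat imputed values with care
df_processed.dti = df_processed.dti.fillna(df_processed.revol_bal / df_processed.annual_inc)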
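To see what the `term` and `emp_length` conversions above do to individual values, the sketch below runs the same parsing on a few toy strings. The input values are illustrative. Note that `'< 1 year'` becomes 1, so it is indistinguishable from `'1 year'`, and missing values end up as 0 after `fillna(0)`.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "term": [" 36 months", " 60 months", " 36 months", " 36 months"],
        "emp_length": ["10+ years", "< 1 year", "7 years", None],
    }
)

# Same parsing as in the preprocessing cell, applied to the toy frame.
toy["term"] = toy["term"].apply(lambda x: int(x.split()[0]))
toy["emp_length"] = (
    toy["emp_length"].str.extract(r"(\d+)", expand=False).astype(float).fillna(0)
)

print(toy)
```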

In [9]:
df_processed.head()

Unnamed: 0.1,Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,total_cu_tl,inq_last_12m,acc_open_past_24mths,avg_cur_bal,bc_open_to_buy,bc_util,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_bc_dlq,mths_since_recent_inq,mths_since_recent_revol_delinq,num_accts_ever_120_pd,num_actv_bc_tl,num_actv_rev_tl,num_bc_sats,num_bc_tl,num_il_tl,num_op_rev_tl,num_rev_accts,num_rev_tl_bal_gt_0,num_sats,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit,revol_bal_joint,sec_app_fico_range_low,sec_app_fico_range_high,sec_app_earliest_cr_line,sec_app_inq_last_6mths,sec_app_mort_acc,sec_app_open_acc,sec_app_revol_util,sec_app_open_act_il,sec_app_num_rev_accts,sec_app_chargeoff_within_12_mths,sec_app_collections_12_mths_ex_med,sec_app_mths_since_last_major_derog,hardship_flag,hardship_type,hardship_reason,hardship_status,deferral_term,hardship_amount,hardship_start_date,hardship_end_date,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
654,1997069,86022540,,10000.0,10000.0,10000.0,36 months,8.99,317.96,B,B1,Low Voltage Electrician,7 years,RENT,40000.0,Not Verified,Aug-2016,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,credit_card,Credit card refinancing,598xx,MT,9.93,0.0,Mar-2003,670.0,674.0,1.0,33.0,,5.0,0.0,12555.0,75.0,7.0,f,1554.33,1554.33,9849.2,9849.2,8445.67,1403.53,0.0,0.0,0.0,Mar-2019,317.96,Apr-2019,Mar-2019,714.0,710.0,0.0,,1.0,Individual,,,,0.0,0.0,12555.0,1.0,0.0,0.0,0.0,,0.0,0.0,1.0,1.0,6558.0,75.0,16805.0,0.0,0.0,1.0,1.0,3138.0,4201.0,75.0,0.0,0.0,,148.0,6.0,6.0,1.0,6.0,33.0,6.0,33.0,0.0,5.0,4.0,4.0,5.0,0.0,4.0,6.0,4.0,4.0,,0.0,0.0,1.0,86.0,50.0,0.0,0.0,16805.0,12555.0,16805.0,0.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
2182,1474130,141324118,,16500.0,16500.0,16500.0,36 months,7.21,511.06,A,A3,,< 1 year,RENT,60000.0,Not Verified,Oct-2018,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,credit_card,Credit card refinancing,212xx,MD,22.54,0.0,Jan-2005,685.0,689.0,0.0,57.0,,10.0,0.0,16296.0,74.8,16.0,w,14415.49,14415.49,2548.69,2548.69,2084.51,464.18,0.0,0.0,0.0,Mar-2019,511.06,Apr-2019,Mar-2019,729.0,725.0,0.0,57.0,1.0,Individual,,,,0.0,0.0,48964.0,0.0,2.0,0.0,1.0,18.0,32668.0,4.0,1.0,4.0,6776.0,66.0,21800.0,2.0,0.0,1.0,5.0,4896.0,4604.0,78.0,0.0,0.0,165.0,73.0,9.0,9.0,0.0,9.0,,9.0,,2.0,6.0,6.0,6.0,6.0,7.0,8.0,8.0,6.0,10.0,0.0,0.0,0.0,1.0,93.3,66.7,0.0,0.0,64072.0,48964.0,20900.0,42272.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,DirectPay,N,,,,,,
1951,1535241,134030357,,5000.0,5000.0,5000.0,36 months,11.05,163.82,B,B4,Unit Secretary,10+ years,RENT,43700.0,Not Verified,May-2018,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,credit_card,Credit card refinancing,125xx,NY,6.7,5.0,Oct-2005,680.0,684.0,0.0,14.0,,9.0,0.0,3397.0,48.5,15.0,w,3772.19,3772.19,1635.13,1635.13,1227.81,407.32,0.0,0.0,0.0,Mar-2019,163.82,Apr-2019,Mar-2019,654.0,650.0,0.0,14.0,1.0,Individual,,,,0.0,0.0,3397.0,0.0,0.0,0.0,0.0,89.0,0.0,,0.0,1.0,1384.0,49.0,7000.0,0.0,1.0,1.0,1.0,377.0,2029.0,59.4,0.0,0.0,89.0,151.0,21.0,21.0,0.0,21.0,14.0,12.0,14.0,3.0,6.0,7.0,7.0,11.0,1.0,9.0,14.0,7.0,9.0,0.0,0.0,3.0,0.0,40.0,28.6,0.0,0.0,7000.0,3397.0,5000.0,0.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
711,1188291,28012092,,11200.0,11200.0,11200.0,36 months,6.49,343.22,A,A2,Jr. Designer,6 years,RENT,50000.0,Not Verified,Oct-2014,Fully Paid,n,https://lendingclub.com/browse/loanDetail.acti...,,debt_consolidation,Debt consolidation,088xx,NJ,15.75,0.0,Oct-2002,715.0,719.0,0.0,,,15.0,0.0,14037.0,74.7,24.0,f,0.0,0.0,12180.030002,12180.03,11200.0,980.03,0.0,0.0,0.0,Jul-2016,5298.48,,Sep-2018,734.0,730.0,0.0,,1.0,Individual,,,,0.0,0.0,19562.0,,,,,,,,,,,,18800.0,,,,1.0,1304.0,458.0,95.7,0.0,0.0,143.0,116.0,22.0,22.0,0.0,22.0,,,,0.0,6.0,10.0,6.0,6.0,12.0,10.0,12.0,10.0,15.0,0.0,0.0,0.0,0.0,100.0,100.0,0.0,0.0,38425.0,19562.0,10700.0,19625.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
154,2256801,89989529,,8000.0,8000.0,8000.0,36 months,13.49,271.45,C,C2,,,RENT,40931.0,Not Verified,Oct-2016,Current,n,https://lendingclub.com/browse/loanDetail.acti...,,credit_card,Credit card refinancing,310xx,GA,11.79,0.0,Feb-1979,670.0,674.0,0.0,53.0,,16.0,0.0,5852.0,83.6,38.0,w,1817.26,1817.26,7881.03,7881.03,6182.74,1698.29,0.0,0.0,0.0,Mar-2019,271.45,Apr-2019,Mar-2019,784.0,780.0,0.0,,1.0,Individual,,,,0.0,110.0,72381.0,1.0,13.0,1.0,4.0,8.0,66529.0,95.0,1.0,2.0,2838.0,94.0,7000.0,0.0,0.0,0.0,6.0,4524.0,1148.0,83.6,0.0,0.0,183.0,452.0,4.0,4.0,1.0,4.0,,,,1.0,3.0,3.0,3.0,4.0,32.0,3.0,5.0,3.0,16.0,0.0,0.0,0.0,2.0,92.1,66.7,0.0,0.0,77215.0,72381.0,7000.0,70215.0,,,,,,,,,,,,,,N,,,,,,,,,,,,,,,Cash,N,,,,,,
