# Lending Club Case Study

**Problem Statement**

The company wants to understand the driving factors (driver variables) behind loan default, i.e. the variables that are strong indicators of default. It can use this knowledge for portfolio and risk assessment.

In [12]:
#Import the libraries
import numpy as np
import pandas as pd

In [13]:
#read the dataset and check the first five rows
df0 = pd.read_csv('loan.csv')
df0.head()



Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,percent_bc_gt_75,pub_rec_bankruptcies,tax_liens,tot_hi_cred_lim,total_bal_ex_mort,total_bc_limit,total_il_high_credit_limit
0,1077501,1296599,5000,5000,4975.0,36 months,10.65%,162.87,B,B2,...,,,,,0.0,0.0,,,,
1,1077430,1314167,2500,2500,2500.0,60 months,15.27%,59.83,C,C4,...,,,,,0.0,0.0,,,,
2,1077175,1313524,2400,2400,2400.0,36 months,15.96%,84.33,C,C5,...,,,,,0.0,0.0,,,,
3,1076863,1277178,10000,10000,10000.0,36 months,13.49%,339.31,C,C1,...,,,,,0.0,0.0,,,,
4,1075358,1311748,3000,3000,3000.0,60 months,12.69%,67.79,B,B5,...,,,,,0.0,0.0,,,,


In [15]:
#Check the shape of the dataframe
df0.shape

(39717, 111)

### Data Cleaning and Manipulation
Data quality issues can be treated as follows:
- Missing values:
    - Drop columns that contain only null values
    - Drop rows that contain missing values
    - Impute the missing values
    - Keep the missing values if they do not affect the analysis
- Incorrect data types:
    - Clean individual values
    - Clean and convert an entire column
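
The treatments listed above can be sketched on a small toy frame. This is only an illustration: the `toy` DataFrame and its values are made up for the example and are not taken from `loan.csv`, and using the most frequent value for imputation is one possible choice rather than the approach used in this case study.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from loan.csv)
toy = pd.DataFrame({
    "int_rate": ["10.65%", "15.27%", None, "13.49%"],
    "emp_length": ["10+ years", None, "10+ years", "3 years"],
    "all_missing": [np.nan, np.nan, np.nan, np.nan],
})

# Missing values: drop columns in which every value is null
toy = toy.dropna(axis=1, how="all")

# Missing values: drop rows that lack a value in a key column
rows_kept = toy.dropna(subset=["int_rate"])

# Missing values: impute emp_length with its most frequent value
mode_value = toy["emp_length"].mode()[0]
imputed = toy["emp_length"].fillna(mode_value)

# Incorrect data types: strip the "%" sign and convert the column to float
int_rate_num = rows_kept["int_rate"].str.rstrip("%").astype(float)

print(toy.columns.tolist())
print(imputed.tolist())
print(int_rate_num.tolist())
```

Which treatment is appropriate depends on the column: a column with only a handful of missing rows is usually a candidate for row removal or imputation, while a column that is entirely empty carries no information and can be dropped.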


In [16]:

df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39717 entries, 0 to 39716
Columns: 111 entries, id to total_il_high_credit_limit
dtypes: float64(74), int64(13), object(24)
memory usage: 33.6+ MB


#### Missing Value Treatment

In [18]:
#print null value count for all columns
df0.isnull().sum()

id                                0
member_id                         0
loan_amnt                         0
funded_amnt                       0
funded_amnt_inv                   0
                              ...  
tax_liens                        39
tot_hi_cred_lim               39717
total_bal_ex_mort             39717
total_bc_limit                39717
total_il_high_credit_limit    39717
Length: 111, dtype: int64

**Not all columns are visible in the output above, so write a custom function that lists every column with its null count**

In [23]:
def print_null(df):
    #print every column name followed by its null value count
    for col in df:
        print(col, "             ", df[col].isnull().sum())
print_null(df0)

id               0
member_id               0
loan_amnt               0
funded_amnt               0
funded_amnt_inv               0
term               0
int_rate               0
installment               0
grade               0
sub_grade               0
emp_title               2459
emp_length               1075
home_ownership               0
annual_inc               0
verification_status               0
issue_d               0
loan_status               0
pymnt_plan               0
url               0
desc               12940
purpose               0
title               11
zip_code               0
addr_state               0
dti               0
delinq_2yrs               0
earliest_cr_line               0
inq_last_6mths               0
mths_since_last_delinq               25682
mths_since_last_record               36931
open_acc               0
pub_rec               0
revol_bal               0
revol_util               50
total_acc               0
initial_list_status               0
out_prnc
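
As an alternative to looping over the columns, the null counts can be kept as a Series and rendered in full with `Series.to_string()`, which does not truncate long output. The sketch below uses a hypothetical helper `null_counts` and a made-up `toy` frame standing in for `df0`.

```python
import pandas as pd

def null_counts(frame: pd.DataFrame) -> pd.Series:
    """Return the number of null values per column, largest first."""
    return frame.isnull().sum().sort_values(ascending=False)

# Toy frame standing in for df0 (hypothetical values)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "emp_title": ["Analyst", None, None],
    "desc": [None, "text", "text"],
})

counts = null_counts(toy)

# to_string() renders every row, so long results are not cut off with "..."
print(counts.to_string())
```

Keeping the result as a Series also makes it easy to filter or sort the counts later, which a function that only prints does not allow.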

**Drop columns with all null values**

In [36]:
#drop columns with all null values
df1 = pd.DataFrame()
print(df1.dtypes)
for col in df0:
    #keep the column unless every one of its rows is null
    if df0[col].isnull().sum() != df0.shape[0]:
        df1[col] = df0[col]
print(df1.head())
print(df1.shape)

Series([], dtype: object)
        id  member_id  loan_amnt  funded_amnt  funded_amnt_inv        term  \
0  1077501    1296599       5000         5000           4975.0   36 months   
1  1077430    1314167       2500         2500           2500.0   60 months   
2  1077175    1313524       2400         2400           2400.0   36 months   
3  1076863    1277178      10000        10000          10000.0   36 months   
4  1075358    1311748       3000         3000           3000.0   60 months   

  int_rate  installment grade sub_grade  ... next_pymnt_d last_credit_pull_d  \
0   10.65%       162.87     B        B2  ...          NaN             May-16   
1   15.27%        59.83     C        C4  ...          NaN             Sep-13   
2   15.96%        84.33     C        C5  ...          NaN             May-16   
3   13.49%       339.31     C        C1  ...          NaN             Apr-16   
4   12.69%        67.79     B        B5  ...       Jun-16             May-16   

  collections_12_mths_ex
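
The same column filter can be expressed with pandas' built-in `DataFrame.dropna`, where `axis=1` selects columns and `how="all"` drops only those in which every value is null. The sketch below runs on a made-up `toy` frame (hypothetical values) rather than on `loan.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df0 (hypothetical values)
toy = pd.DataFrame({
    "loan_amnt": [5000, 2500, 2400],
    "next_pymnt_d": [None, None, "Jun-16"],           # partially null: kept
    "tot_hi_cred_lim": [np.nan, np.nan, np.nan],      # entirely null: dropped
})

# Drop only the columns in which every value is null
cleaned = toy.dropna(axis=1, how="all")

print(cleaned.columns.tolist())
print(cleaned.shape)
```

Partially null columns such as `next_pymnt_d` survive this step, so they still need a separate decision on whether to drop, impute, or keep them.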

In [24]:
#list the columns that are only partially null (some rows missing, but not all)
for col in df0:
    null_count = df0[col].isnull().sum()
    if 0 < null_count < df0.shape[0]:
        print(col, "  ", null_count)

emp_title    2459
emp_length    1075
desc    12940
title    11
mths_since_last_delinq    25682
mths_since_last_record    36931
revol_util    50
last_pymnt_d    71
next_pymnt_d    38577
last_credit_pull_d    2
collections_12_mths_ex_med    56
chargeoff_within_12_mths    56
pub_rec_bankruptcies    697
tax_liens    39
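
To help choose between dropping and imputing, the counts above can be converted to a percentage of the 39717 rows. The sketch below copies the counts from the output above; the 50% cutoff (`THRESHOLD_PCT`) is an assumption made for illustration, not a rule stated in this case study.

```python
import pandas as pd

TOTAL_ROWS = 39717  # row count reported by df0.shape

# Null counts copied from the output above
null_counts = pd.Series({
    "emp_title": 2459,
    "emp_length": 1075,
    "desc": 12940,
    "title": 11,
    "mths_since_last_delinq": 25682,
    "mths_since_last_record": 36931,
    "revol_util": 50,
    "last_pymnt_d": 71,
    "next_pymnt_d": 38577,
    "last_credit_pull_d": 2,
    "collections_12_mths_ex_med": 56,
    "chargeoff_within_12_mths": 56,
    "pub_rec_bankruptcies": 697,
    "tax_liens": 39,
})

# Share of missing rows per column, in percent, largest first
null_pct = (null_counts / TOTAL_ROWS * 100).round(2).sort_values(ascending=False)

# Hypothetical cutoff: flag columns with more than half of their rows missing
THRESHOLD_PCT = 50
mostly_missing = null_pct[null_pct > THRESHOLD_PCT]
print(mostly_missing)
```

Columns flagged this way are candidates for removal, while columns with only a small share of missing rows are more naturally handled by imputation or row removal.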
