<a href="https://colab.research.google.com/github/lailasaummi/Virtual-Internship-id-x-partners/blob/main/Laila_Awalia_Saummi_VIX_ID_X_Partners.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Credit Risk Prediction for a Lending Company**

The company provided a dataset of loans issued from 2007 to 2014, each of which can be labelled as good or bad credit.

Objectives:
1.   Build a model, using the provided technology solution, that predicts the probability of a borrower defaulting on a loan
2.   Use WOE, IV and a train-test split to get the best model
3.   Deploy the machine learning model using pickle



### Data Resources and References

*   Loan Dataset 2007-2014 (format .csv) https://drive.google.com/file/d/1r17UjbuxkcCwGbXOUkr3wcG8UmjvEzCD/view?usp=sharing
*   Loan Dataset Dictionary (sheet: LoanStats) https://docs.google.com/spreadsheets/d/1iT1JNOBwU4l616_rnJpo0iny7blZvNBs/edit#gid=1001272030
*   Credit Risk Modelling in Python https://medium.com/analytics-vidhya/credit-risk-modelling-in-python-3ab4b00f6505
*   Credit Risk Modelling Git https://github.com/yineme/Credit_Risk_modelling.git
*   Uji Multikolinearitas pada Analisis Regresi (Multicollinearity Testing in Regression Analysis) https://lab_adrk.ub.ac.id/id/uji-multikolinearitas-pada-analisis-regresi/




## **Data Preparation**

### Import Libraries to Process the Dataset

In [1]:
import pandas as pd
import numpy as np

### Load Dataset Dictionary
Review what each column in the dataset represents.

In [None]:
# Import dataset from Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
Description = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/LCDataDictionary.xlsx', 'LoanStats').dropna()
Description.style.set_properties(subset=['Description'], **{'width' :'850px'})

### Load Dataset

In [None]:
data = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/loan_data_2007_2014.csv', low_memory = False)

In [None]:
# Check dimension of dataset
data.shape

## **Data Cleaning**

### Drop Inconsistent Columns

The ``Unnamed: 0``, ``id``, ``member_id``, ``url``, ``title``, ``zip_code``, ``emp_title`` and ``policy_code`` columns are treated as identifiers and cannot be used in this model. The ``sub_grade`` column carries the same information as the ``grade`` column.

Columns that hold future information, such as ``next_pymnt_d``, ``recoveries``, ``collection_recovery_fee``, ``total_rec_prncp`` and ``total_rec_late_fee``, cannot be used either, because those events have not yet occurred when a loan is assessed. We therefore drop all of these columns.

In [None]:
data.drop(['Unnamed: 0','id','member_id', 'sub_grade', 'url', 'title','zip_code', 'emp_title', 'policy_code'], axis = 1, inplace = True)

In [None]:
data.shape

In [None]:
data.drop(['next_pymnt_d', 'recoveries', 'collection_recovery_fee', 'total_rec_prncp', 'total_rec_late_fee'], axis = 1, inplace = True)

In [None]:
data.shape

### Fill Columns That Contain Missing Values

In [None]:
# Expand the output display of columns that have missing values
pd.options.display.max_rows = None
data.isnull().sum()

According to the Loan Dataset Dictionary, ``total_rev_hi_lim`` is the total revolving high credit/credit limit. The column has missing values, but we need it for the credit risk model, so we fill the gaps with the ``funded_amnt`` column (the total amount committed to that loan at that point in time).

In [None]:
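# Assign the result back instead of using a chained in-place fill, which newer pandas may not apply
data['total_rev_hi_lim'] = data['total_rev_hi_lim'].fillna(data['funded_amnt'])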

In [None]:
data['total_rev_hi_lim'].isnull().sum()

According to the Loan Dataset Dictionary, ``annual_inc`` is the self-reported annual income provided by the borrower during registration. The column has missing values, but we need it for the credit risk model, so we fill them with the mean of ``annual_inc``.

In [None]:
data['annual_inc'] = data['annual_inc'].fillna(data['annual_inc'].mean())

In [None]:
data.annual_inc.isnull().sum()

The other columns that have missing values and that we need for the credit risk model are filled with zero.

In [None]:
for col in ['acc_now_delinq','total_acc','pub_rec','open_acc','inq_last_6mths','delinq_2yrs','emp_length']:
    data[col] = data[col].fillna(0)

In [None]:
data.loc[:,['acc_now_delinq','total_acc','pub_rec','open_acc','inq_last_6mths','delinq_2yrs','emp_length']].isnull().sum()

### Drop Columns That Contain Missing Values

In [None]:
# Display columns that have greater than 70% of missing values
missing_values = data.isnull().mean()
missing_values[missing_values>0.7]

In [None]:
missing_values = ['desc', 'mths_since_last_record', 'mths_since_last_major_derog', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'open_acc_6m', 'open_il_6m', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'inq_fi', 'total_cu_tl', 'inq_last_12m']

In [None]:
data.drop(columns=missing_values, inplace=True)

In [None]:
data.shape

Create a new variable called ``good_bad`` that labels each loan as good or bad:

good --> 1<br>
bad --> 0

In [None]:
# Create a new column based on 'loan_status' column that will be our target variable
data['good_bad'] = np.where(data.loc[:, 'loan_status'].isin(['Charged Off', 'Default', 'Late (31-120 days)',
                                                                       'Does not meet the credit policy. Status:Charged Off']), 0, 1)
# Drop the original 'loan_status' column
data.drop(columns = ['loan_status'], inplace = True)

In [None]:
data.head()

In [None]:
data.shape

### Import Plotting Libraries

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

### Drop Multicollinear Features from Dataset

If the dataset contains multicollinearity, the estimated effect of a variable that is strongly correlated with other variables becomes unreliable and unstable, so we drop such variables.
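As a complement to reading the heatmap by eye, the strongly correlated pairs can also be listed programmatically. The following is a minimal sketch on toy data; the helper name `correlated_pairs` and the 0.8 threshold are illustrative assumptions, not values used in this notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds `threshold`.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle (without the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Masked entries are NaN and never pass the comparison below
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: 'b' is an exact multiple of 'a', while 'c' is only weakly related
toy = pd.DataFrame({"a": [1, 2, 3, 4, 5],
                    "b": [2, 4, 6, 8, 10],
                    "c": [5, 1, 4, 2, 3]})
high_corr = correlated_pairs(toy, threshold=0.8)
print(high_corr)
```

For each reported pair, one of the two columns is a candidate for removal; which one to keep remains a modelling judgment.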

In [None]:
# Drop rows that still contain missing values
data.dropna(inplace=True)

In [None]:
data.shape

In [None]:
# Correlation matrix showing correlation coefficients (numeric columns only)
corr_matrix = data.corr(numeric_only=True)
heatMap=sns.heatmap(corr_matrix, annot=True,  cmap="BrBG", annot_kws={'size':12})
heatmap=plt.gcf()
heatmap.set_size_inches(20,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)


In [None]:
# Drop multicollinear features 
data.drop(columns=['loan_amnt', 'revol_bal', 'funded_amnt', 'funded_amnt_inv', 'installment',  'total_pymnt_inv',  'out_prncp_inv',  'total_acc'], inplace=True)

In [None]:
corr_matrix = data.corr(numeric_only=True)
heatMap=sns.heatmap(corr_matrix, annot=True,  cmap="BrBG", annot_kws={'size':12})
heatmap=plt.gcf()
heatmap.set_size_inches(20,15)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

In [None]:
data.shape

In [None]:
data.describe()

In [None]:
data.info()

### Converting Data Types of Continuous Variables

In [None]:
# Display unique value of 'emp_length' column
data['emp_length'].unique()

In [None]:
# Convert 'emp_length' to a numeric column and set missing values to zero
def emp_length_convert(df, column):
    # regex=False matches the text literally, so '+' needs no escaping
    df[column] = df[column].str.replace('+ years', '', regex=False)
    df[column] = df[column].str.replace('< 1 year', str(0), regex=False)
    df[column] = df[column].str.replace(' years', '', regex=False)
    df[column] = df[column].str.replace(' year', '', regex=False)
    df[column] = pd.to_numeric(df[column])
    df[column] = df[column].fillna(value = 0)
    
emp_length_convert(data, 'emp_length')

data['emp_length'].unique()

In [None]:
data['emp_length'].dtype

In [None]:
# Convert 'term' datatype to numerical column
def term_numeric(df, column):
    df[column] = pd.to_numeric(df[column].str.replace(' months', ''))
    
term_numeric(data, 'term')

In [None]:
data['term'].dtype

In [None]:
# Convert date columns to "months since" features

def date_columns(df, column):
    # Reference date used to measure elapsed months
    today_date = pd.to_datetime('2020-08-01')
    # Convert to datetime format
    df[column] = pd.to_datetime(df[column], format = "%b-%y")
    # Calculate the difference in whole calendar months
    months = (today_date.year - df[column].dt.year) * 12 + (today_date.month - df[column].dt.month)
    # Two-digit years can be parsed as future dates, giving negative values; replace those with the maximum
    df['mths_since_' + column] = months.mask(months < 0, months.max()).astype(float)
    # Drop the original date column
    df.drop(columns = [column], inplace = True)
    

date_columns(data, 'issue_d')
date_columns(data, 'last_pymnt_d')
date_columns(data, 'last_credit_pull_d')
date_columns(data, 'earliest_cr_line')

### Concatenate Discrete Variables

In [None]:
data.info()

In [None]:
# Create dummy variables for categorical columns
pd.get_dummies(data['grade'],prefix = 'grade', prefix_sep = ":")

In [None]:
data_dummies = [pd.get_dummies(data['grade'],prefix = 'grade', prefix_sep = ":"),
                    pd.get_dummies(data['home_ownership'],prefix = 'home_ownership', prefix_sep = ":"),
                    pd.get_dummies(data['verification_status'],prefix = 'verification_status', prefix_sep = ":"),
                    pd.get_dummies(data['good_bad'],prefix = 'good_bad', prefix_sep = ":"),
                    pd.get_dummies(data['purpose'],prefix = 'purpose', prefix_sep = ":"),
                    pd.get_dummies(data['addr_state'],prefix = 'addr_state', prefix_sep = ":"),
                    pd.get_dummies(data['initial_list_status'],prefix = 'initial_list_status', prefix_sep = ":")]

In [None]:
data_dummies = pd.concat(data_dummies,axis = 1)

In [None]:
type(data_dummies)

In [None]:
data = pd.concat([data,data_dummies],axis = 1)

In [None]:
data.columns.values

In [None]:
# Check for missing values columns again 
missing_values = data.isnull().sum()
missing_values[missing_values>0]/len(data)

In [None]:
preprocess_data = data

In [None]:
# Check for any missing values
missing = preprocess_data.isnull().sum()
missing[missing>0]

## **Supervised Learning**

### Weight of Evidence (WOE) and Information Value (IV)

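Weight of evidence (WOE) measures how strongly each category of a variable separates good loans from bad ones, and helps decide which categories should be binned together. Information value (IV) summarizes the predictive power of a whole variable, and helps decide which variables are useful in logistic regression, the supervised learning algorithm used here.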
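To make the two quantities concrete, here is a small worked sketch on toy data (not the loan dataset). It follows the same convention as the functions in this notebook: for each category, WoE is the natural log of the share of all good loans divided by the share of all bad loans, and IV is the sum over categories of (share of good minus share of bad) times WoE. The toy column values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: category 'A' is mostly good loans, 'B' is mostly bad (1 = good, 0 = bad)
toy = pd.DataFrame({
    "grade":    ["A"] * 10 + ["B"] * 10,
    "good_bad": [1] * 8 + [0] * 2 + [1] * 4 + [0] * 6,
})

# Count observations and good loans per category
stats = toy.groupby("grade")["good_bad"].agg(n_obs="count", n_good="sum")
stats["n_bad"] = stats["n_obs"] - stats["n_good"]

# Share of all good loans and of all bad loans that fall in each category
stats["prop_n_good"] = stats["n_good"] / stats["n_good"].sum()
stats["prop_n_bad"] = stats["n_bad"] / stats["n_bad"].sum()

# WoE per category, then IV for the whole variable
stats["WoE"] = np.log(stats["prop_n_good"] / stats["prop_n_bad"])
iv_total = ((stats["prop_n_good"] - stats["prop_n_bad"]) * stats["WoE"]).sum()

print(stats)
print("IV:", iv_total)
```

A positive WoE means the category holds a larger share of the good loans than of the bad loans, and a negative WoE means the opposite.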

In [None]:
# Calculate WOE and IV

def iv_woe(data, target, bins=10, show_woe=False):
    
    # Empty Dataframe
    newDF,woeDF = pd.DataFrame(), pd.DataFrame()
    
    # Extract Column Names
    cols = data.columns
    
    # Run WOE and IV on all the independent variables
    for ivars in cols[~cols.isin([target])]:
        if (data[ivars].dtype.kind in 'bifc') and (len(np.unique(data[ivars]))>10):
            binned_x = pd.qcut(data[ivars], bins,  duplicates='drop')
            d0 = pd.DataFrame({'x': binned_x, 'y': data[target]})
        else:
            d0 = pd.DataFrame({'x': data[ivars], 'y': data[target]})
        d = d0.groupby("x", as_index=False).agg({"y": ["count", "sum"]})
        d.columns = ['Cutoff', 'N', 'Events']
        d['% of Events'] = np.maximum(d['Events'], 0.5) / d['Events'].sum()
        d['Non-Events'] = d['N'] - d['Events']
        d['% of Non-Events'] = np.maximum(d['Non-Events'], 0.5) / d['Non-Events'].sum()
        d['WoE'] = np.log(d['% of Events']/d['% of Non-Events'])
        d['IV'] = d['WoE'] * (d['% of Events'] - d['% of Non-Events'])
        d.insert(loc=0, column='Variable', value=ivars)
        print("Information value of " + ivars + " is " + str(round(d['IV'].sum(),6)))
        temp =pd.DataFrame({"Variable" : [ivars], "IV" : [d['IV'].sum()]}, columns = ["Variable", "IV"])
        newDF=pd.concat([newDF,temp], axis=0)
        woeDF=pd.concat([woeDF,d], axis=0)

        # Show WOE Table
        if show_woe:
            print(d)
    return newDF, woeDF
iv, woe = iv_woe(preprocess_data, target='good_bad', bins=20)

In [None]:
print(iv)

The rule of thumb is that variables with an information value below 0.02 are not useful for prediction, while those above 0.5 have suspiciously high predictive power.

Therefore, the following variables will not be included: 
``pymnt_plan``, ``last_pymnt_amnt``, ``revol_util``, ``delinq_2yrs``, ``mths_since_last_delinq``, ``open_acc``, ``pub_rec``, ``collections_12_mths_ex_med``, ``acc_now_delinq``, ``tot_coll_amt``, ``mths_since_last_pymnt_d``, ``emp_length`` and ``application_type``

In [None]:
# Drop columns with low information value
preprocess_data.drop(columns=[ 'pymnt_plan', 'last_pymnt_amnt', 'revol_util', 'delinq_2yrs', 'mths_since_last_delinq', 'open_acc', 'pub_rec',  'collections_12_mths_ex_med', 'acc_now_delinq',
                              'tot_coll_amt', 'mths_since_last_pymnt_d', 'emp_length', 'application_type'], axis=1, inplace=True)

In [None]:
preprocess_data.shape

In [None]:
# Create dummy variables for categorical columns
data_dummies1 = [pd.get_dummies(preprocess_data['grade'], prefix='grade', prefix_sep=':'),
               pd.get_dummies(preprocess_data['home_ownership'], prefix='home_ownership', prefix_sep=':'),
               pd.get_dummies(preprocess_data['verification_status'], prefix='verification_status', prefix_sep=':'),
                pd.get_dummies(preprocess_data['purpose'], prefix='purpose', prefix_sep=':'),
                pd.get_dummies(preprocess_data['addr_state'], prefix='addr_state', prefix_sep=':'),
                pd.get_dummies(preprocess_data['initial_list_status'], prefix='initial_list_status', prefix_sep=':')
                               
               ]


In [None]:
# Turn 'data_dummies' into dataframe

categorical_dummies = pd.concat(data_dummies1, axis=1)

In [None]:
categorical_dummies.head()

In [None]:
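# Concatenate preprocess_data with categorical_dummies, keeping a single copy of
# any dummy column that was already added in the data preparation step
preprocess_data = pd.concat([preprocess_data, categorical_dummies], axis=1)
preprocess_data = preprocess_data.loc[:, ~preprocess_data.columns.duplicated()]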

In [None]:
preprocess_data.shape

In [None]:
preprocess_data.columns

In [None]:
# Calculate WOE of categorical features

def woe_categorical(df, cat_feature, good_bad_df):
    df = pd.concat([df[cat_feature], good_bad_df], axis=1)
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
    df = df.iloc[:, [0, 1, 3]]
    df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
    df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
    df['n_good'] = df['prop_good'] * df['n_obs']
    df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
    df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
    df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
    df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
    df = df.sort_values(['WoE'])
    df = df.reset_index(drop = True)
    df['diff_prop_good'] = df['prop_good'].diff().abs()
    df['diff_WoE'] = df['WoE'].diff().abs()
    df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
    df['IV'] = df['IV'].sum()
    return df


# Plot WOE values 
# Set seaborn as default style of graphs
sns.set()
# Plot WoE across categories that takes 2 arguments: a dataframe and a number.
def plot_by_woe(df_WoE, rotation_of_x_axis_labels = 0):
    x = np.array(df_WoE.iloc[:, 0].apply(str))
    y = df_WoE['WoE']
    plt.figure(figsize=(18, 6))
    plt.plot(x, y, marker = 'o', linestyle = '--', color = 'k')
    plt.xlabel(df_WoE.columns[0])
    plt.ylabel('Weight of Evidence')
    plt.title(str('Weight of Evidence by ' + df_WoE.columns[0]))
    plt.xticks(rotation = rotation_of_x_axis_labels)    


In [None]:
# Separate data into target and features
X= preprocess_data.drop(columns='good_bad', axis=1)
y=preprocess_data['good_bad']
df_grade = woe_categorical(X, 'grade', y)
df_grade


In [None]:
plot_by_woe(df_grade)

The graph shows that each grade has a clearly different WOE, so we keep each grade as a separate feature.

In [None]:
# Analyze 'home_ownership' variable

df_home = woe_categorical(X, 'home_ownership', y)
df_home

In [None]:
# Plot df_home WOE
plot_by_woe(df_home)

OTHER, NONE and ANY have very few observations and should be combined with RENT, the category with a high risk of default.
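Combining categories amounts to adding their dummy columns together, since each row belongs to exactly one category. The sketch below illustrates this on toy data; `combine_dummies` is a hypothetical helper, not part of this notebook, and it skips any category that does not occur in the data so that a missing dummy column does not raise an error.

```python
import pandas as pd

def combine_dummies(df, prefix, categories, sep=":"):
    """Sum the dummy columns of `categories` into a single 0/1 Series.

    Hypothetical helper for illustration only.
    """
    cols = [f"{prefix}{sep}{cat}" for cat in categories]
    # Skip categories that never occur in the data (no dummy column was created)
    present = [c for c in cols if c in df.columns]
    return df[present].sum(axis=1).astype(int)

# Toy data: 'ANY' never occurs, so it has no dummy column
toy = pd.DataFrame({"home_ownership": ["RENT", "OWN", "OTHER", "MORTGAGE", "NONE", "RENT"]})
dummies = pd.get_dummies(toy["home_ownership"], prefix="home_ownership", prefix_sep=":")
merged = combine_dummies(dummies, "home_ownership", ["OTHER", "NONE", "RENT", "ANY"])
print(merged.tolist())
```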

In [None]:
# Analyze 'verification_status'

veri_df = woe_categorical(X, 'verification_status', y)
veri_df

In [None]:
plot_by_woe(veri_df)

The categories of this variable have clearly different WOE values, so each is kept as a separate variable.

In [None]:
# Analyze 'purpose'  variable
pur_df = woe_categorical(X, 'purpose', y)
pur_df

In [None]:
plot_by_woe(pur_df, 90)

The following categories will be combined:
1. small_business, educational, renewable_energy, moving
2. other, house, medical
3. wedding, vacation
4. debt_consolidation
5. home_improvement, major_purchase
6. car, credit_card


In [None]:
# Analyze 'addr_state' WOE

addr_df = woe_categorical(X, 'addr_state', y)
addr_df

In [None]:
plot_by_woe(addr_df)

The states NE, IA, ME and ID have few observations, which may explain their extreme WOE values. We plot the graph again without these categories to see whether anything changes.

In [None]:
# Dataframe excluding low observations for 'addr_state' column
data1 =addr_df.iloc[2:44, :]
data2 =addr_df.iloc[45:49, :]
low_data_woe = pd.concat([data1, data2], axis=0)

In [None]:
low_data_woe

In [None]:
# Plot 'addr_state' excluding states with low observations
plot_by_woe(low_data_woe)

To decide which categories to combine, we use both the WOE and the number of observations. Categories with similar WOE but significantly different numbers of observations are not combined, because the number of observations can influence the WOE value. Categories with similar WOE whose observations exceed 5% of the total can be combined to form a new category. 

The categories are combined as follows:

1. NE, IA, NV, HI, FL, AL
2. NY
3. LA, NM, OK, NC, MO, MD, NJ, VA
4. CA
5. AZ, MI, UT, TN, AR, PA
6. RI, OH, KY, DE, MN, SD, MA, IN
7. GA, WA
8. WI, OR
9. TX
10. IL, CT, MT
11. CO, SC
12. KS, VT, AK, MS
13. NH, WV, WY, DC
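The grouping above was chosen by inspecting the WOE table and plot. As a rough aid, the same idea can be sketched in code: walk through the categories in WOE order and merge neighbours whose WOE is close and whose observation counts are comparable. The function name, the tolerance `woe_tol`, the ratio limit `max_obs_ratio` and the state codes in the toy table are all assumptions for illustration; they are not the values behind the list above.

```python
import pandas as pd

def suggest_groups(woe_table, cat_col, woe_tol=0.05, max_obs_ratio=3.0):
    """Greedily group neighbouring categories (sorted by WoE) that have
    close WoE values and comparable numbers of observations.

    Hypothetical helper; both thresholds are illustrative assumptions.
    """
    table = woe_table.sort_values("WoE").reset_index(drop=True)
    groups = [[table.loc[0, cat_col]]]
    for i in range(1, len(table)):
        prev, cur = table.loc[i - 1], table.loc[i]
        close_woe = abs(cur["WoE"] - prev["WoE"]) <= woe_tol
        obs_ratio = max(cur["n_obs"], prev["n_obs"]) / min(cur["n_obs"], prev["n_obs"])
        if close_woe and obs_ratio <= max_obs_ratio:
            groups[-1].append(cur[cat_col])   # extend the current group
        else:
            groups.append([cur[cat_col]])     # start a new group
    return groups

# Toy WOE table with made-up category codes
toy = pd.DataFrame({
    "addr_state": ["AA", "BB", "CC", "DD"],
    "n_obs": [1000, 1200, 50, 900],
    "WoE": [-0.10, -0.08, -0.07, 0.20],
})
groups = suggest_groups(toy, "addr_state")
print(groups)
```

In the toy table, CC has a WOE close to BB but far fewer observations, so it is kept apart. A suggestion like this is only a starting point and still needs to be reviewed against the plot.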




In [None]:
# Analyze 'initial_list_status' WOE 

init_list_df = woe_categorical( X, 'initial_list_status', y)
init_list_df

In [None]:
plot_by_woe(init_list_df)

This variable has significantly different WOE values and categories should be kept as separate variables.

### Analyze Continuous Variables

In [None]:
# Function to calculate WOE for continuous variables
def woe_continous(df, cat_feature, good_bad_df):
    df = pd.concat([df[cat_feature], good_bad_df], axis=1)
    df = pd.concat([df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].count(),
                    df.groupby(df.columns.values[0], as_index = False)[df.columns.values[1]].mean()], axis = 1)
    df = df.iloc[:, [0, 1, 3]]
    df.columns = [df.columns.values[0], 'n_obs', 'prop_good']
    df['prop_n_obs'] = df['n_obs'] / df['n_obs'].sum()
    df['n_good'] = df['prop_good'] * df['n_obs']
    df['n_bad'] = (1 - df['prop_good']) * df['n_obs']
    df['prop_n_good'] = df['n_good'] / df['n_good'].sum()
    df['prop_n_bad'] = df['n_bad'] / df['n_bad'].sum()
    df['WoE'] = np.log(df['prop_n_good'] / df['prop_n_bad'])
#     df = df.sort_values(['WoE'])
#     df = df.reset_index(drop = True)
    df['diff_prop_good'] = df['prop_good'].diff().abs()
    df['diff_WoE'] = df['WoE'].diff().abs()
    df['IV'] = (df['prop_n_good'] - df['prop_n_bad']) * df['WoE']
    df['IV'] = df['IV'].sum()
    return df


In [None]:
# Analyze 'term' WOE
plot_by_woe(woe_continous(X,'term', y ))

In [None]:
X['mths_since_issue_d'].unique()

In [None]:
# Fine classing: create a new binned variable

X['mths_since_issue_d_factor'] = pd.cut(X['mths_since_issue_d'], 10)


In [None]:
# Analyze 'mths_since_issue_d' WOE
mths_since_iss_df = woe_continous(X, 'mths_since_issue_d_factor', y)
mths_since_iss_df

In [None]:
plot_by_woe(mths_since_iss_df)

The following categories will be created based on their WOE and number of observations:
1. 67.97-70.8
2. 70.8-73.6
3. 73.6-76.4
4. 76.4-79.2
5. 79.2-82
6. 82-84
7. 84-90.4
8. 90.4-96

In [None]:
# Analyze interest rate WOE
X['int_rate_factor'] = pd.cut(X['int_rate'], 10)

In [None]:
int_rate_df = woe_continous(X, 'int_rate_factor',y)
int_rate_df

In [None]:
plot_by_woe(int_rate_df)

This graph shows that only the last two categories will be combined. That is:
1. (22.048, 26) 

In [None]:
# Analyze 'total_rec_int' WOE
X['total_rec_int_factor'] = pd.cut(X['total_rec_int'], 20)
rec_int_df = woe_continous(X, 'total_rec_int_factor', y)
rec_int_df

In [None]:
plot_by_woe(rec_int_df, 90)

In [None]:
# Analyze 'total_rev_hi_lim' WOE
X['total_rev_hi_lim_factor'] = pd.cut(X['total_rev_hi_lim'], 100)
revol_hi_df = woe_continous(X, 'total_rev_hi_lim_factor', y)
revol_hi_df

In [None]:
# Analyze 'total_rev_hi_lim' values at or below 100000
X_train_prepr_temp = X[X['total_rev_hi_lim'] <= 100000].copy()
# Fine-classing again
X_train_prepr_temp['total_rev_hi_lim_factor'] = pd.cut(X_train_prepr_temp['total_rev_hi_lim'],10)
# Select only the relevant indexes in the target column
df_temp = woe_continous(X_train_prepr_temp, 'total_rev_hi_lim_factor', y[X_train_prepr_temp.index])
df_temp

In [None]:
plot_by_woe(df_temp, 90)

In [None]:
# Analyze 'total_pymnt' WOE
X['total_pymnt_factor'] = pd.cut(X['total_pymnt'], 10)
total_pym_df = woe_continous(X, 'total_pymnt_factor', y)
total_pym_df


In [None]:
# Analyze 'dti' WOE
X['dti_factor'] = pd.cut(X['dti'], 10)
dti_df = woe_continous(X, 'dti_factor', y)
dti_df

In [None]:
plot_by_woe(dti_df)

The following categories will  be combined together:
1. (27.993, 31.992), (31.992, 35.991), (35.991, 39.99)

In [None]:
# Analyze annual income WOE
X['annual_inc_factor'] = pd.cut(X['annual_inc'], 50)
ann_inc_df = woe_continous(X, 'annual_inc_factor', y)
ann_inc_df


In [None]:
plot_by_woe(ann_inc_df, 90)

We separate this variable into borrowers with higher and lower income. The WOE table shows that the number of observations decreases as annual income increases, because only a few people earn a high income. We therefore analyze borrowers with income above and below 150000 dollars separately.

In [None]:
# Analyze income below 150000
X_train_prepr_temp = X[X['annual_inc'] <= 150000].copy()
# Fine-classing again
X_train_prepr_temp['annual_inc_factor'] = pd.cut(X_train_prepr_temp['annual_inc'], 10)
# Select only the relevant indexes in the target column
df_temp = woe_continous(X_train_prepr_temp, 'annual_inc_factor', y[X_train_prepr_temp.index])
df_temp

In [None]:
plot_by_woe(df_temp, 90)

Based on WOE and the number of observations, we combine the categories as follows: 
 (<=32000), (>32000 <= 50000), (>50000 <= 60000), (>60000 <=75000), (>75000 <=90000), (>90000 <=120000), (>120000 <=135000), (>135000 <=150000), (>150000)
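Bands like these can also be produced in one step with `pd.cut` and explicit edges, as an alternative to writing one `np.where` line per band. This is a minimal sketch on toy incomes; the label strings are illustrative assumptions and differ from the column names used later in this notebook.

```python
import numpy as np
import pandas as pd

# Band edges taken from the list above; the first and last bands are open-ended
edges = [-np.inf, 32000, 50000, 60000, 75000, 90000, 120000, 135000, 150000, np.inf]
labels = ["<=32000", "32000-50000", "50000-60000", "60000-75000", "75000-90000",
          "90000-120000", "120000-135000", "135000-150000", ">150000"]

toy = pd.DataFrame({"annual_inc": [25000, 32000, 58000, 150000, 400000]})

# right=True (the default) makes each band include its upper edge: "> a and <= b"
bands = pd.cut(toy["annual_inc"], bins=edges, labels=labels)

# One 0/1 column per band, including bands with no observations in the toy data
income_dummies = pd.get_dummies(bands, prefix="annual_inc", prefix_sep=":").astype(int)
print(bands.tolist())
```

Because the bands cover the whole number line and do not overlap, every row ends up with exactly one dummy set to 1.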



In [None]:
# Analyze 'inq_last_6mths' WOE
X['inq_last_6mths_factor'] = pd.cut(X['inq_last_6mths'], 7)
inq_fact_df = woe_continous(X, 'inq_last_6mths_factor', y)
inq_fact_df

In [None]:
plot_by_woe(inq_fact_df)

The following new categories will be created, based on the number of inquiries:
1. <1
2. 1-2
3. 2-4
4. 4-7

In [None]:
# Analyze total current balance WOE
X['tot_cur_bal_factor'] = pd.cut(X['tot_cur_bal'], 20)
curr_bal_df = woe_continous(X, 'tot_cur_bal_factor', y)
curr_bal_df

In [None]:
# Analyze total current balance at or below 400000
X_train_prepr_temp = X[X['tot_cur_bal'] <= 400000].copy()
# Fine-classing again
X_train_prepr_temp['tot_cur_bal_factor'] = pd.cut(X_train_prepr_temp['tot_cur_bal'], 10)
# Select only the relevant indexes in the target column
df_temp = woe_continous(X_train_prepr_temp, 'tot_cur_bal_factor', y[X_train_prepr_temp.index])
df_temp

In [None]:
plot_by_woe(df_temp, 90)

The following categories will be created:
1. <40000
2. 40000-80000
3. 80000-120000
4. 120000-160000
5. 160000-200000
6. 200000-240000
7. 240000-320000
8. 320000-400000

In [None]:
# Analyze 'mths_since_credit_pull' WOE 
X['mths_since_last_credit_pull_d_factor'] = pd.cut(X['mths_since_last_credit_pull_d'], 10)
mths_cr_pull_df = woe_continous(X, 'mths_since_last_credit_pull_d_factor', y)
mths_cr_pull_df

In [None]:
# Analyze 'mths_since_credit_pull' below 60
X_train_prepr_temp = X[X['mths_since_last_credit_pull_d'] <= 60].copy()
# Fine-classing again
X_train_prepr_temp['mths_since_last_credit_pull_d'] = pd.cut(X_train_prepr_temp['mths_since_last_credit_pull_d'], 5)
# Select only the relevant indexes in the target column
df_temp = woe_continous(X_train_prepr_temp, 'mths_since_last_credit_pull_d', y[X_train_prepr_temp.index])
df_temp

In [None]:
plot_by_woe(mths_cr_pull_df)

The following categories will be grouped together: 54-65, 65-76, greater than 76







In [None]:
# Analyze 'out_prncp' WOE 
X['out_prncp_factor'] = pd.cut(X['out_prncp'], 10)
out_df = woe_continous(X, 'out_prncp_factor', y)
out_df

In [None]:
plot_by_woe(out_df, 90)

In [None]:
# Analyze 'mths_since_issue_date' WOE 
X['mths_since_issue_d'] = pd.cut(X['mths_since_issue_d'], 10)
iss_df = woe_continous(X, 'mths_since_issue_d', y)
iss_df

In [None]:
plot_by_woe(iss_df)

### Creating New Features Based on WOE

In [None]:
# Create a new dataframe and start with 'grade' variable

# Copy the slice so that adding columns below does not write to a view of preprocess_data
new_df = preprocess_data.loc[:, 'grade:A':'grade:G'].copy()

In [None]:
new_df.head()

In [None]:
# home_ownership 

new_df['home_ownership:OWN'] = preprocess_data.loc[:, 'home_ownership:OWN']
new_df['home_ownership:OTHER_NONE_RENT_ANY'] = sum([preprocess_data['home_ownership:OTHER'], preprocess_data['home_ownership:NONE'],
                                                 preprocess_data['home_ownership:RENT'], preprocess_data['home_ownership:ANY']])
new_df['home_ownership:MORTGAGE'] = preprocess_data.loc[:, 'home_ownership:MORTGAGE']

# verification_status
new_df['verification_status:NOT_VERIFIED'] = preprocess_data.loc[:, 'verification_status:Not Verified']
new_df['verification_status:SOURCE_VERIFIED'] = preprocess_data.loc[:, 'verification_status:Source Verified']
new_df['verification_status:VERIFIED'] = preprocess_data.loc[:, 'verification_status:Verified']

# purpose of loan
new_df['purpose:SMALL_BUSINESS_EDUCATIONAL_RENEWABLE_ENERGY_MOVING'] = sum([preprocess_data['purpose:small_business'],  preprocess_data['purpose:renewable_energy'], preprocess_data['purpose:moving']])

new_df['purpose:OTHER_HOUSE_MEDICAL'] =sum([preprocess_data['purpose:other'], preprocess_data['purpose:house'], preprocess_data['purpose:medical']])
new_df ['purpose:WEDDING_VACATION'] = sum([preprocess_data['purpose:wedding'], preprocess_data['purpose:vacation']])
new_df ['purpose:HOME_IMPROVEMENT_MAJOR_PURCHASE'] = sum([preprocess_data['purpose:home_improvement'], preprocess_data['purpose:major_purchase']])
new_df ['purpose:CAR_CREDIT_CARD'] = sum([preprocess_data['purpose:car'], preprocess_data['purpose:credit_card']])


# addr_state
new_df['addr_state:NE_IA_NV_HI_FL_AL'] =sum([preprocess_data['addr_state:IA'],preprocess_data['addr_state:NV'],
                                           preprocess_data['addr_state:HI'],preprocess_data['addr_state:FL'],preprocess_data['addr_state:AL']])
new_df['addr_state:NY'] = preprocess_data.loc[:, 'addr_state:NY']
new_df['addr_state:LA_NM_OK_NC_MO_MD_NJ_VA'] = sum([preprocess_data['addr_state:LA'],preprocess_data['addr_state:NM'],preprocess_data['addr_state:OK'],
                     preprocess_data['addr_state:NC'],preprocess_data['addr_state:MO'],preprocess_data['addr_state:MD'], preprocess_data['addr_state:NJ'],
                                                  preprocess_data['addr_state:VA']])
new_df['addr_state:CA'] = preprocess_data.loc[:,'addr_state:CA']
new_df['addr_state:AZ_MI_UT_TN_AR_PA'] =sum([preprocess_data['addr_state:AZ'],preprocess_data['addr_state:MI'],preprocess_data['addr_state:UT'],
preprocess_data['addr_state:TN'],preprocess_data['addr_state:AR'],preprocess_data['addr_state:PA']])

new_df['addr_state:RI_OH_KY_DE_MN_SD_MA_IN'] =sum([preprocess_data['addr_state:RI'],preprocess_data['addr_state:OH'],preprocess_data['addr_state:KY'],
 preprocess_data['addr_state:DE'],preprocess_data['addr_state:MN'],preprocess_data['addr_state:SD'],preprocess_data['addr_state:MA'],
                    preprocess_data['addr_state:IN']])

new_df['addr_state:GA_WA'] = sum([preprocess_data['addr_state:GA'], preprocess_data['addr_state:WA']])
new_df['addr_state:WI_OR'] = sum([preprocess_data['addr_state:WI'], preprocess_data['addr_state:OR']])
new_df['addr_state:TX'] = preprocess_data.loc[:,'addr_state:TX']
new_df['addr_state:IL_CT_MT'] =sum([preprocess_data['addr_state:IL'],preprocess_data['addr_state:CT'],preprocess_data['addr_state:MT']])
new_df['addr_state:CO_SC'] = sum([preprocess_data['addr_state:CO'], preprocess_data['addr_state:SC']])
new_df['addr_state:KS_VT_AK_NS'] =sum([preprocess_data['addr_state:KS'],preprocess_data['addr_state:VT'],preprocess_data['addr_state:AK'],
                                           preprocess_data['addr_state:MS']])
new_df['addr_state:NH_WV_WY_DC'] =sum([preprocess_data['addr_state:NH'],preprocess_data['addr_state:WV'],preprocess_data['addr_state:WY'],
                                           preprocess_data['addr_state:DC']])
# initial_list_status
new_df['initial_list_status:F'] = preprocess_data.loc[:, 'initial_list_status:f']
new_df['initial_list_status:W'] = preprocess_data.loc[:, 'initial_list_status:w']

# term 
new_df['term:36'] = np.where((preprocess_data['term'] == 36), 1, 0)
new_df['term:60'] = np.where((preprocess_data['term']==60), 1,0)

# total_rec_int 
new_df['total_rec_int:<1000'] = np.where((preprocess_data['total_rec_int']<=1000), 1,0)
new_df['total_rec_int:1000-2000'] = np.where((preprocess_data['total_rec_int']>1000) &(preprocess_data['total_rec_int']<=2000), 1,0)
new_df['total_rec_int:2000-9000'] = np.where((preprocess_data['total_rec_int']>2000) &(preprocess_data['total_rec_int']<=9000), 1,0)
new_df['total_rec_int:>9000'] = np.where((preprocess_data['total_rec_int']>9000), 1,0)


# total_revol_hi_lim
new_df['total_rev_hi_lim:<10000'] =np.where((preprocess_data['total_rev_hi_lim']<=10000),1,0)
new_df['total_rev_hi_lim:10000-20000'] =np.where((preprocess_data['total_rev_hi_lim']>10000)&(preprocess_data['total_rev_hi_lim']<=20000),1,0)
new_df['total_rev_hi_lim:20000-40000'] =np.where((preprocess_data['total_rev_hi_lim']>20000)&(preprocess_data['total_rev_hi_lim']<=40000),1,0)
new_df['total_rev_hi_lim:40000-60000'] =np.where((preprocess_data['total_rev_hi_lim']>40000)&(preprocess_data['total_rev_hi_lim']<=60000),1,0)
new_df['total_rev_hi_lim:60000-80000'] =np.where((preprocess_data['total_rev_hi_lim']>60000)&(preprocess_data['total_rev_hi_lim']<=80000),1,0)
new_df['total_rev_hi_lim:80000-100000'] =np.where((preprocess_data['total_rev_hi_lim']>80000)&(preprocess_data['total_rev_hi_lim']<=100000),1,0)
new_df['total_rev_hi_lim:<100000'] =np.where((preprocess_data['total_rev_hi_lim']>100000),1,0)


# total_pymnt
new_df['total_pymnt:<5000'] = np.where((preprocess_data['total_pymnt']<=5000), 1,0)
new_df['total_pymnt:5000-11000'] = np.where((preprocess_data['total_pymnt']>5000)&(preprocess_data['total_pymnt']<=11000),1,0)
new_df['total_pymnt:11000-16000'] = np.where((preprocess_data['total_pymnt']>11000)&(preprocess_data['total_pymnt']<=16000),1,0)
new_df['total_pymnt:16000-22000'] = np.where((preprocess_data['total_pymnt']>16000)&(preprocess_data['total_pymnt']<=22000),1,0)
new_df['total_pymnt:>22000'] = np.where((preprocess_data['total_pymnt']>22000), 1,0)
#int_Rate

new_df['int_rate:<7.484'] = np.where((preprocess_data['int_rate'] <= 7.484), 1, 0)
new_df['int_rate:7.484-9.548'] = np.where((preprocess_data['int_rate'] > 7.484) & (preprocess_data['int_rate'] <= 9.548), 1, 0)
new_df['int_rate:9.548-11.612'] = np.where((preprocess_data['int_rate'] > 9.548) & (preprocess_data['int_rate'] <= 11.612), 1, 0)
new_df['int_rate:11.612-13.676'] = np.where((preprocess_data['int_rate'] > 11.612) & (preprocess_data['int_rate'] <= 13.676), 1, 0)
new_df['int_rate:13.676-15.74'] = np.where((preprocess_data['int_rate'] > 13.676) & (preprocess_data['int_rate'] <= 15.74), 1, 0)
new_df['int_rate:15.74-17.804'] = np.where((preprocess_data['int_rate'] > 15.74) & (preprocess_data['int_rate'] <= 17.804), 1, 0)
new_df['int_rate:17.804-19.868'] = np.where((preprocess_data['int_rate'] > 17.804) & (preprocess_data['int_rate'] <= 19.868), 1, 0)
new_df['int_rate:19.868-21.932'] = np.where((preprocess_data['int_rate'] > 19.868) & (preprocess_data['int_rate'] <= 21.932), 1, 0)
new_df['int_rate:21.932-26.06'] = np.where((preprocess_data['int_rate'] > 21.932) & (preprocess_data['int_rate'] <= 26.06), 1, 0)


# dti 
new_df['dti:<4'] = np.where((preprocess_data['dti'] <=4), 1, 0)
new_df['dti:4-8'] = np.where((preprocess_data['dti'] > 4) & (preprocess_data['dti'] <= 8), 1, 0)
new_df['dti:8-12'] = np.where((preprocess_data['dti'] > 8) & (preprocess_data['dti'] <= 12), 1, 0)
new_df['dti:12-16'] = np.where((preprocess_data['dti'] > 12) & (preprocess_data['dti'] <= 16), 1, 0)
new_df['dti:16-20'] = np.where((preprocess_data['dti'] > 16) & (preprocess_data['dti'] <= 20), 1, 0)
new_df['dti:20-23'] = np.where((preprocess_data['dti'] > 20) & (preprocess_data['dti'] <= 23), 1, 0)
new_df['dti:23-27'] = np.where((preprocess_data['dti'] > 23) & (preprocess_data['dti'] <= 27), 1, 0)
new_df['dti:27-40'] = np.where((preprocess_data['dti'] > 27) & (preprocess_data['dti'] <= 40), 1, 0)

# annual income 
new_df['annual_inc:<32000'] = np.where((preprocess_data['annual_inc'] <= 32000), 1, 0)
new_df['annual_inc:32000-50000'] = np.where((preprocess_data['annual_inc'] > 32000) & (preprocess_data['annual_inc'] <= 50000),1, 0)
new_df['annual_inc:50000-60000'] = np.where((preprocess_data['annual_inc'] > 50000) & (preprocess_data['annual_inc'] <= 60000), 1, 0)
new_df['annual_inc:60000-75000'] = np.where((preprocess_data['annual_inc'] > 60000) & (preprocess_data['annual_inc'] <= 75000), 1, 0)
new_df['annual_inc:75000-90000'] = np.where((preprocess_data['annual_inc'] > 75000) & (preprocess_data['annual_inc'] <= 90000), 1, 0)
new_df['annual_inc:90000-120000'] = np.where((preprocess_data['annual_inc'] > 90000) & (preprocess_data['annual_inc'] <= 120000), 1, 0)
new_df['annual_inc:120000-135000'] = np.where((preprocess_data['annual_inc'] > 120000) & (preprocess_data['annual_inc'] <= 135000), 1, 0)
new_df['annual_inc:135000-150000'] = np.where((preprocess_data['annual_inc'] > 135000) & (preprocess_data['annual_inc'] <= 150000), 1, 0)
new_df['annual_inc:>150000'] = np.where((preprocess_data['annual_inc'] > 150000), 1, 0)

# inq_last_6mths
new_df['inq_last_6mths:<1'] = np.where((preprocess_data['inq_last_6mths'] <=1), 1, 0)
new_df['inq_last_6mths:1-2'] = np.where((preprocess_data['inq_last_6mths'] >1)& (preprocess_data['inq_last_6mths']<=2),  1, 0)
new_df['inq_last_6mths:2-4'] = np.where((preprocess_data['inq_last_6mths'] >2)& (preprocess_data['inq_last_6mths']<=4),  1, 0)
new_df['inq_last_6mths:4-7'] = np.where((preprocess_data['inq_last_6mths'] >4)& (preprocess_data['inq_last_6mths']<=7),  1, 0)

# revol_util
# new_df['revol_util:<44'] = np.where((preprocess_data['revol_util'] <=44), 1,0)
# new_df['revol_util:44-89'] =np.where((preprocess_data['revol_util'] > 44) & (preprocess_data['revol_util'] <= 89),1, 0)
# new_df['revol_util:>89'] = np.where((preprocess_data['revol_util'] >89), 1,0)

# tot_cur_balance
new_df['tot_cur_bal:<40000'] = np.where((preprocess_data['tot_cur_bal'] <= 40000), 1, 0)
new_df['tot_cur_bal:40000-80000'] = np.where((preprocess_data['tot_cur_bal'] > 40000) & (preprocess_data['tot_cur_bal'] <= 80000), 1, 0)
new_df['tot_cur_bal:80000-120000'] = np.where((preprocess_data['tot_cur_bal'] > 80000) & (preprocess_data['tot_cur_bal'] <= 120000), 1, 0)
new_df['tot_cur_bal:120000-160000'] = np.where((preprocess_data['tot_cur_bal'] > 120000) & (preprocess_data['tot_cur_bal'] <= 160000), 1, 0)
new_df['tot_cur_bal:160000-200000'] = np.where((preprocess_data['tot_cur_bal'] > 160000) & (preprocess_data['tot_cur_bal'] <= 200000), 1, 0)
new_df['tot_cur_bal:200000-240000'] = np.where((preprocess_data['tot_cur_bal'] > 200000) & (preprocess_data['tot_cur_bal'] <= 240000), 1, 0)
new_df['tot_cur_bal:240000-320000'] = np.where((preprocess_data['tot_cur_bal'] > 240000) & (preprocess_data['tot_cur_bal'] <= 320000), 1, 0)
new_df['tot_cur_bal:320000-400000'] = np.where((preprocess_data['tot_cur_bal'] > 320000) & (preprocess_data['tot_cur_bal'] <= 400000), 1, 0)
new_df['tot_cur_bal:>400000'] = np.where((preprocess_data['tot_cur_bal'] > 400000), 1, 0)

# mths_since_last_credit_pull_d
new_df['mths_since_last_credit_pull_d:<65'] = np.where((preprocess_data['mths_since_last_credit_pull_d']<=65), 1,0)
new_df['mths_since_last_credit_pull_d:65-76'] = np.where((preprocess_data['mths_since_last_credit_pull_d']>65)&(preprocess_data['mths_since_last_credit_pull_d']<=76),1,0)
new_df['mths_since_last_credit_pull_d:>76'] = np.where((preprocess_data['mths_since_last_credit_pull_d']>76), 1,0)

# mths_since_issue_d_factor
new_df['mths_since_issue_d_:<70.8'] = np.where((preprocess_data['mths_since_issue_d']<=70.8), 1,0)
new_df['mths_since_issue_d_:>70.8-73.6'] = np.where((preprocess_data['mths_since_issue_d'] >70.8) & (preprocess_data['mths_since_issue_d']<=73.6), 1,0)
new_df['mths_since_issue_d_:73.6-76.4'] = np.where((preprocess_data['mths_since_issue_d']>73.6) & (preprocess_data['mths_since_issue_d']<=76.4), 1,0)
new_df['mths_since_issue_d_:>76.4-79.2'] = np.where((preprocess_data['mths_since_issue_d'] >76.4) & (preprocess_data['mths_since_issue_d']<=79.2), 1,0)
new_df['mths_since_issue_d_:>79.2-82'] = np.where((preprocess_data['mths_since_issue_d'] >79.2) & (preprocess_data['mths_since_issue_d']<=82), 1,0)
new_df['mths_since_issue_d_:>82-84'] = np.where((preprocess_data['mths_since_issue_d'] >82) & (preprocess_data['mths_since_issue_d']<=84), 1,0)
new_df['mths_since_issue_d_:>84-90.4'] = np.where((preprocess_data['mths_since_issue_d'] >84) & (preprocess_data['mths_since_issue_d']<=90.4), 1,0)
new_df['mths_since_issue_d_:>90.4-96'] = np.where((preprocess_data['mths_since_issue_d'] >90.4) & (preprocess_data['mths_since_issue_d']<=96), 1,0)

# out_prncp
new_df['out_prncp:<3000'] = np.where((preprocess_data['out_prncp']<=3000), 1,0)
new_df['out_prncp:3000-6000'] = np.where((preprocess_data['out_prncp']>3000)&(preprocess_data['out_prncp']<=6000), 1,0)
new_df['out_prncp:6000-10000'] = np.where((preprocess_data['out_prncp']>6000)&(preprocess_data['out_prncp']<=10000), 1,0)
new_df['out_prncp:10000-12000'] = np.where((preprocess_data['out_prncp']>10000)&(preprocess_data['out_prncp']<=12000), 1,0)
new_df['out_prncp:>12000'] = np.where((preprocess_data['out_prncp']>12000), 1,0)



new_df['good_bad'] = preprocess_data.loc[:, 'good_bad']
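
The interval dummies above all follow one pattern: right-closed bins on a numeric column, with one 0/1 indicator per bin. As a hedged sketch (not the method used in this notebook), the same columns could be generated with `pd.cut` and `pd.get_dummies`. The helper name `bin_dummies` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def bin_dummies(series, edges, labels, prefix):
    """Return 0/1 dummy columns for right-closed bins (lower, upper]."""
    # -inf and +inf make the first and last bins open-ended
    bins = [-np.inf] + list(edges) + [np.inf]
    binned = pd.cut(series, bins=bins, labels=labels, right=True)
    return pd.get_dummies(binned, prefix=prefix, prefix_sep=':').astype(int)

# Toy stand-in for preprocess_data['out_prncp'] (illustrative values only)
toy = pd.Series([1000, 3000, 3001, 12000, 12001], name='out_prncp')
out_prncp_dummies = bin_dummies(
    toy,
    edges=[3000, 6000, 10000, 12000],
    labels=['<3000', '3000-6000', '6000-10000', '10000-12000', '>12000'],
    prefix='out_prncp',
)
print(out_prncp_dummies)
```

Because each bin edge appears only once, a sketch like this cannot produce the overlapping or mislabelled intervals that are easy to introduce when every `np.where` condition is typed by hand.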



In [None]:
# Display first 10 rows of new_df
pd.options.display.max_columns = None
new_df.head(10)


In [None]:
new_df.shape

In [None]:
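# Keep an independent backup; plain assignment would only create a second name for the same DataFrame
new_df1 = new_df.copy()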

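Remove one dummy variable for each original variable to avoid the dummy variable trap: if every category keeps its own dummy, the dummies of a variable sum to 1 in every row and are perfectly collinear with the model intercept. For each variable, the dummy with the lowest WOE is dropped and serves as the reference category.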

In [None]:
# Dummy categories dropped
ref_categories = ['home_ownership:OTHER_NONE_RENT_ANY', 'total_rec_int:<1000', 'total_pymnt:<5000','total_rev_hi_lim:<10000', 'grade:G', 'verification_status:VERIFIED', 'purpose:SMALL_BUSINESS_EDUCATIONAL_RENEWABLE_ENERGY_MOVING',
                 'addr_state:NE_IA_NV_HI_FL_AL', 'initial_list_status:F', 'term:60', 'mths_since_issue_d_:>90.4-96','int_rate:21.932-26.06', 'dti:27-40',
                  'annual_inc:<32000', 'inq_last_6mths:4-7', 'tot_cur_bal:<40000', 'mths_since_last_credit_pull_d:>76', 'out_prncp:>12000']

In [None]:
new_df.drop(columns=ref_categories, inplace=True)

In [None]:
new_df.shape
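
To illustrate the dummy variable trap on something small, the toy example below (made-up data, not the loan dataset) compares a design matrix that keeps all dummies plus an intercept with one where a reference category is dropped. The variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy categorical variable with three levels (illustrative values only)
grade = pd.Series(['A', 'B', 'C', 'A', 'B', 'C'])
full = pd.get_dummies(grade, prefix='grade', prefix_sep=':').astype(float)

# Design matrix with an intercept column and ALL dummies
intercept = np.ones((len(full), 1))
X_full = np.hstack([intercept, full.to_numpy()])

# Same matrix after dropping one reference category
reduced = full.drop(columns=['grade:C'])
X_reduced = np.hstack([intercept, reduced.to_numpy()])

# Rank below the column count means the columns are linearly dependent
rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

The full matrix has four columns but only rank three, because the intercept equals the sum of the three dummies. After dropping `grade:C`, the three remaining columns are linearly independent.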

In [None]:
# Check whether the class labels are balanced

from yellowbrick.target import ClassBalance
X = new_df.drop(columns='good_bad')
y = new_df['good_bad']
visualizer = ClassBalance()
visualizer.fit(y)
visualizer.show()

The graph shows that bad borrowers make up only a small share of the observations. This class imbalance can bias the model toward the majority class during training. To address it, we will randomly oversample the minority class in the training set only and leave the test set untouched.

In [None]:
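# Split data into train and test sets
# stratify=y keeps the good/bad ratio the same in both splits; random_state (arbitrary seed) makes the split reproducible
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)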

In [None]:
# Check imbalance data for training dataset
y_train.value_counts()

### Import Libraries for Train Test Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, precision_recall_curve
from imblearn.over_sampling import RandomOverSampler
from imblearn.combine import SMOTETomek

In [None]:
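# Deal with imbalanced data by randomly oversampling the minority class
# (named `oversampler` rather than `os` to avoid shadowing the standard-library module name)
oversampler = RandomOverSampler()
X_train_o, y_train_o = oversampler.fit_resample(X_train, y_train)
y_train_series = pd.Series(y_train_o)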

In [None]:
# Check value counts after oversampling
y_train_series.value_counts()
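
For intuition about what random oversampling does, here is a minimal NumPy/pandas sketch on made-up labels. It is not the imbalanced-learn implementation; the function name `random_oversample` and the seed are assumptions for illustration. Minority rows are resampled with replacement until both classes have the same count.

```python
import numpy as np
import pandas as pd

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = np.random.default_rng(seed)
    counts = y.value_counts()
    majority_size = counts.max()
    parts = [np.arange(len(y))]  # keep every original row (by position)
    for label, n in counts.items():
        if n < majority_size:
            positions = np.flatnonzero((y == label).to_numpy())
            extra = rng.choice(positions, size=majority_size - n, replace=True)
            parts.append(extra)
    idx = np.concatenate(parts)
    return X.iloc[idx].reset_index(drop=True), y.iloc[idx].reset_index(drop=True)

# Toy data: 8 good borrowers (1) and 2 bad borrowers (0)
X_toy = pd.DataFrame({'feature': range(10)})
y_toy = pd.Series([1] * 8 + [0] * 2)

X_bal, y_bal = random_oversample(X_toy, y_toy)
print(y_bal.value_counts())
```

The duplicated rows carry no new information, which is why resampling is applied to the training split only: copies of the same borrower in both train and test data would inflate the evaluation metrics.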

In [None]:
# Build the model
model = LogisticRegression()
model.fit(X_train_o, y_train_o)

In [None]:
# Predict class labels for the test set
y_preds = model.predict(X_test)

In [None]:
# Classification report
print(classification_report(y_test, y_preds))

In [None]:
# Keep the second column of predict_proba: the predicted probability of class 1
y_hat_test_proba = model.predict_proba(X_test)
y_hat_test_proba = y_hat_test_proba[:, 1]
y_test_temp = y_test.copy()
y_test_temp.reset_index(drop = True, inplace = True)
y_test_proba = pd.concat([y_test_temp, pd.DataFrame(y_hat_test_proba), pd.DataFrame(y_preds)], axis = 1)
y_test_proba.columns = ['y_test_class_actual', 'y_hat_test_proba', 'y_hat_test']
y_test_proba.index = X_test.index
y_test_proba.head()

In [None]:
import matplotlib.pyplot as plt
# Extract the values required to plot a ROC curve
fpr, tpr, thresholds = roc_curve(y_test_proba['y_test_class_actual'], y_test_proba['y_hat_test_proba'])
# Plot the ROC curve
plt.plot(fpr, tpr)
# Plot a secondary diagonal line, to show randomness of model
plt.plot(fpr, fpr, linestyle = '--', color = 'k')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve');

In [None]:
# Area under the receiver operating characteristic curve (AUROC)
AUROC = roc_auc_score(y_test_proba['y_test_class_actual'], y_test_proba['y_hat_test_proba'])
AUROC

In [None]:
Gini = AUROC * 2 - 1
Gini
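
The Gini coefficient computed above is a linear rescaling of AUROC, `Gini = 2 * AUROC - 1`, so a random model (AUROC 0.5) maps to 0 and a perfect one (AUROC 1) maps to 1. The sketch below checks this on made-up scores, computing AUROC as the share of (positive, negative) pairs that are ranked correctly. `pairwise_auroc` is a hypothetical helper written for this illustration, not part of scikit-learn, and it is only practical for small inputs because it compares every pair.

```python
import numpy as np

def pairwise_auroc(y_true, y_score):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Score difference for every positive/negative pair
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

# Made-up labels and scores for illustration
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]

auroc_toy = pairwise_auroc(y_true_toy, y_score_toy)
gini_toy = 2 * auroc_toy - 1
print(auroc_toy, gini_toy)
```

Three of the four positive/negative pairs are ordered correctly here, giving an AUROC of 0.75 and a Gini of 0.5.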

In [None]:
from sklearn.metrics import precision_recall_curve, auc
# Plot a PR curve
# Calculate the no-skill line as the proportion of the positive class in the test set
no_skill = len(y_test[y_test == 1]) / len(y_test)
# Plot the no-skill precision-recall curve
plt.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')

# Calculate inputs for the PR curve
precision, recall, thresholds = precision_recall_curve(y_test_proba['y_test_class_actual'], y_test_proba['y_hat_test_proba'])
# Plot PR curve
plt.plot(recall, precision, marker='.', label='Logistic')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.legend()
plt.title('PR curve');

In [None]:
# Area under the precision-recall curve
auc_pr = auc(recall, precision)
auc_pr

In [None]:
# Calculate the KS statistic: start by sorting test observations by predicted probability
actual_predicted_probs_df = y_test_proba.sort_values('y_hat_test_proba')

In [None]:
actual_predicted_probs_df.head()

In [None]:
actual_predicted_probs_df.tail()

In [None]:
actual_predicted_probs_df = actual_predicted_probs_df.reset_index()

In [None]:
actual_predicted_probs_df['cum_n_pop'] = actual_predicted_probs_df.index +1
actual_predicted_probs_df['cum_good'] = actual_predicted_probs_df['y_test_class_actual'].cumsum()
actual_predicted_probs_df['cum_bad'] = actual_predicted_probs_df['cum_n_pop'] - actual_predicted_probs_df['y_test_class_actual'].cumsum()


In [None]:
actual_predicted_probs_df.head()

In [None]:
actual_predicted_probs_df['cum_n_%'] = actual_predicted_probs_df['cum_n_pop']/(actual_predicted_probs_df.shape[0])
actual_predicted_probs_df['cum_good_%'] = actual_predicted_probs_df['cum_good']/actual_predicted_probs_df['y_test_class_actual'].sum()
actual_predicted_probs_df['cum_bad_%'] = actual_predicted_probs_df['cum_bad']/ (actual_predicted_probs_df.shape[0]-actual_predicted_probs_df['y_test_class_actual'].sum())

In [None]:
actual_predicted_probs_df.head()

In [None]:
plt.plot(actual_predicted_probs_df['cum_n_%'], actual_predicted_probs_df['cum_bad_%'])
plt.plot(actual_predicted_probs_df['cum_n_%'], actual_predicted_probs_df['cum_n_%'], linestyle='--', c='k')

In [None]:
plt.plot(actual_predicted_probs_df['y_hat_test_proba'], actual_predicted_probs_df['cum_bad_%'], c='r')
plt.plot(actual_predicted_probs_df['y_hat_test_proba'], actual_predicted_probs_df['cum_good_%'], c='g')


In [None]:
ks = max(actual_predicted_probs_df['cum_bad_%'] - actual_predicted_probs_df['cum_good_%'])
print('The KS score is ',ks)
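
The cells above build the KS statistic step by step. A compact sketch of the same computation, the maximum gap between the cumulative share of bad and good borrowers when observations are sorted by predicted probability, could look like this. The function name `ks_statistic` and the toy values are assumptions; as in the notebook, label 1 is taken to mean a good borrower.

```python
import numpy as np

def ks_statistic(y_true, y_score):
    """Max gap between cumulative bad % and cumulative good %, sorted by score."""
    y_true = np.asarray(y_true)
    # Sort observations from lowest to highest predicted probability
    order = np.argsort(y_score, kind='stable')
    sorted_true = y_true[order]
    cum_good = np.cumsum(sorted_true) / sorted_true.sum()
    cum_bad = np.cumsum(1 - sorted_true) / (1 - sorted_true).sum()
    return np.max(cum_bad - cum_good)

# Made-up labels and scores for illustration
y_true_toy = [0, 0, 1, 0, 1, 1]
y_score_toy = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]

ks_toy = ks_statistic(y_true_toy, y_score_toy)
print(ks_toy)
```

A value near 0 means the score distributions of good and bad borrowers overlap almost completely, while a value near 1 means the model separates them almost perfectly.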

## **Model Deployment**

In [None]:
# Save the model
import pickle
filename = 'credit_risk_model.sav'
# The with-statement closes the file even if pickling fails
with open(filename, 'wb') as f:
    pickle.dump(model, f)
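
Once saved, the model can be restored in another session with `pickle.load`. The round trip below is a self-contained sketch that uses a small stand-in object instead of the fitted `LogisticRegression`, and a temporary file instead of `credit_risk_model.sav`; both are assumptions so the snippet runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model (hypothetical; the notebook pickles `model`)
stand_in_model = {'name': 'credit_risk_model', 'coef': [0.5, -1.2]}

# Write to a temporary directory so nothing in the working directory is overwritten
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'credit_risk_model.sav')

with open(path, 'wb') as f:
    pickle.dump(stand_in_model, f)

# Only unpickle files from a trusted source: loading can execute arbitrary code
with open(path, 'rb') as f:
    loaded_model = pickle.load(f)

print(loaded_model == stand_in_model)
```

With the real model, the loaded object would expose the same `predict` and `predict_proba` methods as `model`, ideally under the same scikit-learn version that was used when saving.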
