### Context

##### DRS Bank is facing challenging times. Its NPAs (Non-Performing Assets) have been on the rise recently, and a large part of these are due to loans given to individual customers (borrowers). The Chief Risk Officer of the bank decides to put in place a scientifically robust framework for approving loans to individual customers, to minimize the risk of loans converting into NPAs, and initiates a project for the bank's data science team. You, as a senior member of the team, are assigned this project.

### Objective
#### To identify the criteria for approving loans to individual customers such that the likelihood of loan delinquency is minimized

### Key questions to be answered
##### What are the factors that drive the behavior of loan delinquency?

### Dataset
- ID: Customer ID
- isDelinquent: Indicates whether the customer is delinquent (1 => Yes, 0 => No)
- term: Loan term in months
- gender: Gender of the borrower
- age: Age of the borrower
- purpose: Purpose of Loan
- home_ownership: Status of borrower's home
- FICO: FICO (i.e. the bureau score) of the borrower

### Domain Information
- Transactor – A person who pays the full balance due, on time.
- Revolver – A person who pays the minimum amount due but carries the remaining balance forward instead of paying it in full.
- Delinquent – A person who is behind on payments, failing to pay even the minimum amount due.
- Defaulter – A person who has been delinquent for a certain period, after which the lender declares the loan to be in default.
- Risk Analytics – A broad domain in the financial and banking industry concerned with analyzing the risk posed by a customer.

### Import the necessary packages

In [34]:
# suppress warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

# data manipulation
import numpy as np
import pandas as pd

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# show all columns and up to 200 rows when displaying dataframes
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)

# to split the data into train and test sets
from sklearn.model_selection import train_test_split

# to build decision tree classifier model
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# to tune different models
from sklearn.model_selection import GridSearchCV

# to evaluate model performance
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    # plot_confusion_matrix,  # removed in scikit-learn 1.2
    make_scorer,
)


In [35]:
data = pd.read_csv('Loan_Delinquent_Dataset.csv')

In [36]:
df = data.copy()

In [37]:
df.head()

|   | ID | isDelinquent | term | gender | purpose | home_ownership | age | FICO |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 36 months | Female | House | Mortgage | >25 | 300-500 |
| 1 | 2 | 0 | 36 months | Female | House | Rent | 20-25 | >500 |
| 2 | 3 | 1 | 36 months | Female | House | Rent | >25 | 300-500 |
| 3 | 4 | 1 | 36 months | Female | Car | Mortgage | >25 | 300-500 |
| 4 | 5 | 1 | 36 months | Female | House | Rent | >25 | 300-500 |


In [38]:
df.tail()

|   | ID | isDelinquent | term | gender | purpose | home_ownership | age | FICO |
|---|---|---|---|---|---|---|---|---|
| 11543 | 11544 | 0 | 60 months | Male | other | Mortgage | >25 | 300-500 |
| 11544 | 11545 | 1 | 36 months | Male | House | Rent | 20-25 | 300-500 |
| 11545 | 11546 | 0 | 36 months | Female | Personal | Mortgage | 20-25 | >500 |
| 11546 | 11547 | 1 | 36 months | Female | House | Rent | 20-25 | 300-500 |
| 11547 | 11548 | 1 | 36 months | Male | Personal | Mortgage | 20-25 | 300-500 |


In [39]:
df.isnull().sum()

ID                0
isDelinquent      0
term              0
gender            0
purpose           0
home_ownership    0
age               0
FICO              0
dtype: int64

In [40]:
df.duplicated().sum()

0

In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11548 entries, 0 to 11547
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   ID              11548 non-null  int64 
 1   isDelinquent    11548 non-null  int64 
 2   term            11548 non-null  object
 3   gender          11548 non-null  object
 4   purpose         11548 non-null  object
 5   home_ownership  11548 non-null  object
 6   age             11548 non-null  object
 7   FICO            11548 non-null  object
dtypes: int64(2), object(6)
memory usage: 721.9+ KB


In [42]:
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 11548.0 | 5774.5 | 3333.764789 | 1.0 | 2887.75 | 5774.5 | 8661.25 | 11548.0 |
| isDelinquent | 11548.0 | 0.668601 | 0.470737 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
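
Because `isDelinquent` is coded 0/1, its mean of 0.668601 is the share of delinquent customers, roughly 67% of the 11,548 rows. A minimal sketch of reading the class balance directly, using a small made-up series (`toy_target` is hypothetical, not the bank's data):

```python
import pandas as pd

# Hypothetical toy target, not the real dataset: 1 = delinquent, 0 = not delinquent.
toy_target = pd.Series([1, 1, 0, 1, 0, 1], name="isDelinquent")

# For a 0/1 column, the mean equals the proportion of 1s.
share_delinquent = toy_target.mean()

# value_counts(normalize=True) reports the same information per class.
class_shares = toy_target.value_counts(normalize=True)

print(share_delinquent)
print(class_shares)
```

On the real data the equivalent call would be `df["isDelinquent"].value_counts(normalize=True)`.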


In [43]:
df.shape

(11548, 8)

In [44]:
df.drop(["ID"], axis= 1, inplace=True)

In [45]:
df.head()

|   | isDelinquent | term | gender | purpose | home_ownership | age | FICO |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 36 months | Female | House | Mortgage | >25 | 300-500 |
| 1 | 0 | 36 months | Female | House | Rent | 20-25 | >500 |
| 2 | 1 | 36 months | Female | House | Rent | >25 | 300-500 |
| 3 | 1 | 36 months | Female | Car | Mortgage | >25 | 300-500 |
| 4 | 1 | 36 months | Female | House | Rent | >25 | 300-500 |


#### Model building approach

- Data Preparation
- Partition the data into train and test sets
- Build a CART model on the train data
- Tune the model and prune the tree, if required
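
The build-and-prune steps above can be sketched as follows. This is only an illustration on synthetic numeric data from `make_classification`, standing in for an already-prepared feature matrix; the grid values for `max_depth` and `min_samples_leaf` are assumptions chosen for the example, not settings taken from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), already numeric.
X_toy, y_toy = make_classification(n_samples=500, n_features=6, random_state=1)

# Partition into train and test sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1
)

# Build an unpruned CART model on the train data.
full_tree = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)

# Tune and pre-prune: search over tree-size limits, scoring by recall.
param_grid = {"max_depth": [2, 3, 4], "min_samples_leaf": [5, 10, 20]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",
    cv=5,
)
search.fit(X_tr, y_tr)
pruned_tree = search.best_estimator_

print("full tree depth:", full_tree.get_depth())
print("pruned tree depth:", pruned_tree.get_depth())
print("best parameters:", search.best_params_)
```

Limiting depth and leaf size is one way to prune; cost-complexity pruning through `ccp_alpha` is another option that scikit-learn supports.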


#### Split data

In [46]:
X = df.drop(['isDelinquent'], axis=1)
y = df['isDelinquent']

In [47]:
#X= pd.get_dummies(X, drop_first=True)
X.head()

|   | term | gender | purpose | home_ownership | age | FICO |
|---|---|---|---|---|---|---|
| 0 | 36 months | Female | House | Mortgage | >25 | 300-500 |
| 1 | 36 months | Female | House | Rent | 20-25 | >500 |
| 2 | 36 months | Female | House | Rent | >25 | 300-500 |
| 3 | 36 months | Female | Car | Mortgage | >25 | 300-500 |
| 4 | 36 months | Female | House | Rent | >25 | 300-500 |
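
The `pd.get_dummies` call is commented out in the cell above, so `X` still holds string categories. scikit-learn's `DecisionTreeClassifier` expects numeric input, so some encoding step is needed before fitting. A minimal sketch of one-hot encoding, using a hand-made frame that mimics a few of the columns shown above (the frame `sample` is an assumption for illustration, not the real dataset):

```python
import pandas as pd

# Hypothetical mini-frame with the same kind of string categories as X.
sample = pd.DataFrame(
    {
        "term": ["36 months", "36 months", "60 months"],
        "gender": ["Female", "Female", "Male"],
        "FICO": ["300-500", ">500", "300-500"],
    }
)

# One-hot encode every object column; drop_first=True drops the first
# category of each column to avoid a redundant indicator.
encoded = pd.get_dummies(sample, drop_first=True)

print(encoded.columns.tolist())
print(encoded.shape)
```

Each two-level column collapses to a single indicator, so the three original columns stay three columns here; a column with more levels, such as `purpose`, would expand into several.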


In [49]:
# train_test_split returns the splits in the order X_train, X_test, y_train, y_test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
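
About 67% of the rows are delinquent, so one optional variation is a stratified split, which keeps that proportion the same in both partitions. The notebook itself does not stratify; the sketch below only shows the effect of the `stratify` argument on made-up data (`X_toy` and `y_toy` are hypothetical).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 70 of them in the positive class.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 70 + [0] * 30)

# stratify=y_toy preserves the 70/30 class ratio in train and test.
X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=1, stratify=y_toy
)

print(X_tr_toy.shape, X_te_toy.shape)
print(y_tr_toy.mean(), y_te_toy.mean())
```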

### Build Decision Tree Model

##### Model evaluation criterion

The model can make two kinds of wrong prediction:

- Predicting that a customer will not be behind on payments (Non-Delinquent) when in reality the customer would be behind on payments (a false negative).

- Predicting that a customer will be behind on payments (Delinquent) when in reality the customer would not be behind on payments (a false positive).

##### Which case is more important?

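False negatives (delinquent customers predicted as non-delinquent) are the costlier error for the bank, because a loan approved for such a customer can turn into an NPA. Recall should therefore be maximized: the greater the recall, the fewer the false negatives.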
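
Recall is the fraction of actual delinquent customers that the model flags, TP / (TP + FN), so every false negative lowers it. A small worked example with made-up labels (`y_true` and `y_pred` below are hypothetical, with 1 meaning delinquent):

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 5 delinquent (1) and 3 non-delinquent (0) customers.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Recall computed by hand from the confusion matrix.
manual_recall = tp / (tp + fn)

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("manual recall:", manual_recall)
print("sklearn recall:", recall_score(y_true, y_pred))
```

Here 3 of the 5 delinquent customers are caught and 2 are missed, so recall is 0.6.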

##### First, let's create functions to calculate the metrics and plot the confusion matrix, so that we don't have to repeat the same code for each model.
- The model_performance_classification_sklearn function will be used to check the performance of models.
- The make_confusion_matrix function will be used to plot the confusion matrix.

In [None]:
# function to compute different metrics to check the performance of a classification model built using a decision tree
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    