# AllLife Bank Personal Loan Campaign

## Context
AllLife Bank has a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying size. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through the interest on loans. In particular, management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan. This will increase the success ratio while at the same time reducing the cost of the campaign.

## Objective
* Predict whether a liability customer will buy a personal loan.
* Identify which variables are most significant.
* Identify which segment of customers should be targeted more.

### Data Dictionary
* ID: Customer ID
* Age: Customer’s age in completed years
* Experience: Number of years of professional experience
* Income: Annual income of the customer (in thousand dollars)
* ZIP Code: Home address ZIP code
* Family: Family size of the customer
* CCAvg: Average spending on credit cards per month (in thousand dollars)
* Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
* Mortgage: Value of house mortgage if any. (in thousand dollars)
* Personal_Loan: Did this customer accept the personal loan offered in the last campaign?
* Securities_Account: Does the customer have securities account with the bank?
* CD_Account: Does the customer have a certificate of deposit (CD) account with the bank?
* Online: Do customers use internet banking facilities?
* CreditCard: Does the customer use a credit card issued by Bank?

## Loading libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

import matplotlib.pyplot as plt
import seaborn as sns

# To build linear model for statistical analysis and prediction
import statsmodels.stats.api as sms

## Import Dataset

In [None]:
df = pd.read_csv("Loan_Modelling.csv")
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")  # f-string

np.random.seed(1)  # To get the same random results every time
df.sample(n=10)

In [None]:
print(df.head(10))

### Let's explore data to get more insights

In [None]:
df.columns

* The variable ID carries no useful information: there is no association between a customer's ID and loan acceptance, and it does not generalize to future potential loan customers. We can drop it before modelling.


In [None]:
df = df.drop(['ID'], axis=1)

In [None]:
df.info()

* No columns have null data in the file

In [None]:
df.describe().T

* The mean age of the customers is 45 with standard deviation of 11.5. 
* The mean of Avg. spending on credit cards per month is 1.93 with standard deviation of 1.75. 
* The mean annual income of the customer is 73.77 with standard deviation of 46. 
* The mean value of house mortgage is 56.5 with standard deviation of 101.71. 

In [None]:
df.Age.unique()

In [None]:
df.Income.unique()

In [None]:
df.Experience.unique() 

In [None]:
# There are a few negative values in Experience, which is not possible in practice.
# We assume the negative signs are mistakes, so we replace them with their absolute values.
# Assigning back to the column avoids chained `inplace=True` calls, which may not modify `df` in recent pandas.
df["Experience"] = df["Experience"].abs()

In [None]:
df.Family.unique() 

In [None]:
df.Securities_Account.unique() 

In [None]:
df.CCAvg.unique() 

In [None]:
df.Education.unique() 

In [None]:
df.Mortgage.unique() 

In [None]:
df.CD_Account.unique()

In [None]:
df.Online.unique()

In [None]:
df.CreditCard.unique()

In [None]:
df.ZIPCode.unique()

### Let's try to group them on the basis of first 2 digits

In [None]:
# In a US ZIP code the first digit identifies a broad group of states and the following digits narrow down the region,
# so the first 2 digits give a coarse geographic grouping that is enough for our model
df['ZIPCode'] = df['ZIPCode'].astype(str)
print(df['ZIPCode'].str[0:2].nunique())
df['ZIPCode'] = df['ZIPCode'].str[0:2]

### Nothing unusual seen in the values of any of the variables

In [None]:
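# Let's look at correlation values
# ZIPCode is now a string column, so restrict the correlation to numeric columns
corr = df.corr(numeric_only=True)
fig, ax = plt.subplots(figsize=(12, 12))
ax = sns.heatmap(corr, annot=True, square=True, fmt=".2f", cmap="YlGnBu")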

* Age and Experience seem to be highly correlated, which is expected: as age increases, experience also increases
* Age, Experience, Online and CreditCard seem to have very little correlation with Personal_Loan
* Income and CCAvg seem to be correlated, which is also expected: a person with a higher income will spend more on average

### Let's Look at outliers in each column

In [None]:
# outlier detection using boxplot
numerical_col = df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,30))

for i, variable in enumerate(numerical_col):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Income, CCAvg, Mortgage, Personal_Loan, Securities_Account and CD_Account show points beyond the whiskers, but these are not true outliers: they are variables for which only some customers have a large or non-zero value, so there is nothing wrong with keeping such values.

## EDA

### Let's look at details of dataset using pandas_profiling 

In [None]:
# pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
import pandas_profiling
df.profile_report()

## Observations
* The data set has 0 missing cells.
* It has 7 numeric variables: ‘Age’, ‘CC_Avg’, ‘ID’, ‘Income’, ‘Mortgage’, ‘Zip_Code’, ‘Experience’
* It has 2 categorical variables: ‘Education’, ‘Family’
* It has 5 Boolean variables: ‘CD_Account’, ‘Credit_Card’, ‘Online’, ‘Personal_Loan’, ‘Securities Account’
* Personal Loan is highly correlated with Income, average spending on Credit cards, mortgage & if the customer has a certificate of deposit (CD) account with the bank.
* Also, Experience is highly correlated with Age (ρ = 0.994214857)

### Categorical
* 42% of the customers are graduates, 30% are advanced/professional and 28% are undergraduates.
* Around 29% of the customers have a family size of 1.

### Boolean
* 94% of the customers don’t have a certificate of deposit (CD) account with the bank.
* Around 71% of the customers don’t use a credit card issued by the bank.
* Around 60% of the customers use internet banking facilities.
* Around 90% of the customers didn’t accept the personal loan offered in the last campaign.
* Around 90% of the customers don’t have a securities account with the bank.

### Numeric
* The mean age of the customers is 45 with a standard deviation of 11.5. This is consistent with our earlier estimate that the average age lies between 30 and 50. The distribution is very slightly negatively skewed (Skewness = -0.02934068151), so it is fairly symmetrical
* The mean of average spending on credit cards per month is 1.93 with a standard deviation of 1.75. The distribution is highly positively skewed (Skewness = 1.598443337)
* The mean annual income of the customers is 73.77 with a standard deviation of 46. The distribution is moderately positively skewed (Skewness = 0.8413386073)
* The mean value of house mortgage is 56.5 with a standard deviation of 101.71. The distribution is highly positively skewed (Skewness = 2.104002319) and there are a lot of outliers present (Kurtosis = 4.756796669)

In [None]:
sns.pairplot(df)

* Income is positively skewed, which the mean being greater than the median confirms. The majority of the customers have an income between 45K and 55K.
* CCAvg is also positively skewed: average spending ranges from 0K to 10K, and the majority spend less than 2.5K
* Experience is approximately normally distributed, with more customers having experience starting from 8 years. Here the mean is equal to the median.
* Family and Education are ordinal variables. Family sizes are fairly evenly distributed

In [None]:
def stacked_plot(x):
    sns.set(palette='nipy_spectral')
    tab1 = pd.crosstab(x,df['Personal_Loan'],margins=True)
    print(tab1)
    print('-'*120)
    tab = pd.crosstab(x,df['Personal_Loan'],normalize='index')
    tab.plot(kind='bar',stacked=True,figsize=(10,5))
    # Place the legend outside the plot area (a second plt.legend call would replace the first anyway)
    plt.legend(loc="upper left", bbox_to_anchor=(1,1))
    plt.show()
    

In [None]:
categorical_val = ['Family','Education','Personal_Loan','Securities_Account','CD_Account','Online','CreditCard']

# stacked_plot creates and shows its own figure (with the column name as the x-axis label),
# so no outer figure or extra xlabel call is needed
for column in categorical_val:
    stacked_plot(df[column])

### Influence of income and education on personal loan

In [None]:
sns.boxplot(x='Education',y='Income',hue='Personal_Loan',data=df)

* Customers with education level 1 appear to have higher incomes. However, customers who have taken the personal loan have similar income levels across education levels

### Influence of Securities_account on Personal loan

In [None]:
sns.countplot(x="Securities_Account", data=df,hue="Personal_Loan")

* Most customers who have a securities account do not have a loan, but this is probably just because the majority of customers don't have a loan.

### Influence of Family size on Personal loan

In [None]:
sns.countplot(x='Family',data=df,hue='Personal_Loan',palette='Set1')

* Family size does not appear to have a strong impact on personal loan, although families of size 3 and 4 seem more likely to take a loan. This might be a useful association when planning future campaigns.

### Influence of CDAccount on Personal loan

In [None]:
sns.countplot(x='CD_Account',data=df,hue='Personal_Loan')

* Customers who do not have a CD account mostly do not have a loan either, and they are the majority. But almost all customers who have a CD account have a loan as well

### Influence of Age on Personal loan

In [None]:
sns.countplot(x='Age',data=df,hue='Personal_Loan')

* Age doesn't have impact on Personal Loan

### Model Evaluation Criteria

In [None]:
print('No. of 0s' , (df.Personal_Loan == 0).sum(axis=0))
print('No. of 1s' , (df.Personal_Loan == 1).sum(axis=0))

### Here the bank can face 2 types of losses
 1. False negative - the person would take a loan but the model says they won't - loss of opportunity
 2. False positive - the model says the person will take a loan, but in reality they won't - increased marketing cost

### Which loss is bigger?
 - False negatives, i.e. loss of opportunity (marketing cost is typically small)
 - So we want to reduce false negatives, and for that we have to maximize Recall while keeping Accuracy in balance
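
To make the link between the two error types and the metrics concrete, here is a minimal sketch that computes recall, precision and accuracy from raw confusion-matrix counts. The counts are made-up illustrative numbers, not results from this dataset, and the function names are hypothetical helpers.

```python
def recall(tp, fn):
    """Share of actual loan takers the model finds; false negatives pull it down."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Share of predicted loan takers who really take the loan; false positives pull it down."""
    return tp / (tp + fp)


def accuracy(tp, tn, fp, fn):
    """Share of all customers classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts for a test set of 1500 customers (assumed for illustration only)
tp, fn, fp, tn = 90, 60, 10, 1340

print("Recall   :", recall(tp, fn))
print("Precision:", precision(tp, fp))
print("Accuracy :", accuracy(tp, tn, fp, fn))
```

With an imbalanced target like this one, accuracy can stay high even when many loan takers are missed, which is why recall is the metric to watch here.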

### Split the data

In [None]:
## Defining X and Y variables
# Here we had 6 categorical variables but 4 of them are binary, so we'll have same results with them even after creating dummies
# education and family have order within them, so we won't convert them to dummies
# so let's not change them and make dummies only for Zipcode

df["ZIPCode"] = df["ZIPCode"].astype('category')

X = df.drop(['Personal_Loan'], axis=1)
Y = df[['Personal_Loan']] 

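# dtype=int keeps the dummies numeric (recent pandas returns booleans by default), which the VIF and statsmodels steps below expect
X = pd.get_dummies(X, drop_first=True, dtype=int)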

#Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.30)

In [None]:
print('Original dataset', Y.Personal_Loan.value_counts(normalize=True))
print('Train dataset', y_train.Personal_Loan.value_counts(normalize=True))
print('Test dataset', y_test.Personal_Loan.value_counts(normalize=True))

* Data split looks uniform

In [None]:
X.head()

In [None]:
## Defining a function for better visualization of confusion matrix

from sklearn.metrics import classification_report, confusion_matrix


def make_confusion_matrix(y_actual, y_predict, labels=(1, 0)):
    '''
    y_actual : ground truth
    y_predict: prediction of class
    '''
    # confusion_matrix expects (y_true, y_pred): rows are true labels, columns are predictions
    cm = confusion_matrix(y_actual, y_predict, labels=list(labels))
    tick_labels = [str(label) for label in labels]
    df_cm = pd.DataFrame(cm, index=tick_labels, columns=tick_labels)
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annotations = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    annotations = np.asarray(annotations).reshape(cm.shape)
    plt.figure(figsize=(7, 5))
    sns.heatmap(df_cm, annot=annotations, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')



# Logistic Regression

### Let's build model using Statsmodels

### Before making the model, first let's check if our variables has multicollinearity

* There are different ways of detecting (or testing for) multicollinearity; one such way is the Variance Inflation Factor (VIF).

* **Variance Inflation Factor**: Variance inflation factors measure the inflation in the variances of the regression coefficient estimates due to collinearities that exist among the predictors. It is a measure of how much the variance of the estimated regression coefficient βk is “inflated” by the existence of correlation among the predictor variables in the model.

* General rule of thumb: if VIF is 1, there is no correlation between the kth predictor and the remaining predictor variables, and hence the variance of β̂k is not inflated at all. If VIF exceeds 5, we say there is moderate multicollinearity, and if it is 10 or more, it shows signs of high multicollinearity. But the purpose of the analysis should dictate which threshold to use.
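
As a rough illustration of the definition, VIF for predictor k is 1 / (1 - R²), where R² comes from regressing predictor k on the other predictors. With only two predictors, that R² is simply their squared Pearson correlation, which the sketch below uses. The helper names and the small data lists are assumptions made up for the example; they are not taken from the loan data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def vif_two_predictors(x, y):
    """VIF = 1 / (1 - R^2); with exactly two predictors R^2 is the squared correlation."""
    r_squared = pearson_r(x, y) ** 2
    return 1.0 / (1.0 - r_squared)


# Made-up values: experience tracks age closely, family size does not
age = [25, 30, 35, 40, 45, 50]
experience = [1, 5, 11, 14, 21, 24]
family = [3, 1, 4, 1, 2, 4]

print("VIF (age vs experience):", vif_two_predictors(age, experience))
print("VIF (age vs family)    :", vif_two_predictors(age, family))
```

The strongly related pair gives a VIF far above 10, while the weakly related pair stays close to 1, matching the rule of thumb above.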

In [None]:
# dataframe with numerical column only
num_feature_set = X.copy()
from statsmodels.tools.tools import add_constant
num_feature_set = add_constant(num_feature_set)

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif_series1 = pd.Series([variance_inflation_factor(num_feature_set.values,i) for i in range(num_feature_set.shape[1])],index=num_feature_set.columns)
print('Series before feature selection: \n\n{}\n'.format(vif_series1))

* Age and Experience seem to be highly correlated, so we will drop one of them, depending on which has less effect on the predictions

In [None]:
X_train, X_test, y_train, y_test = train_test_split(num_feature_set, Y, test_size=0.30)

In [None]:
import statsmodels.api as sm
logit = sm.Logit(y_train, X_train)
lg = logit.fit()

print(lg.summary())

In [None]:
X_train1 = X_train.drop('Age', axis = 1)
X_test1 = X_test.drop('Age', axis = 1)

logit1 = sm.Logit(y_train, X_train1)
lg1 = logit1.fit()

In [None]:
# Let's check accuracy and recall for this model

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

pred_train1 = lg1.predict(X_train1)
pred_test1 = lg1.predict(X_test1)

pred_train1 = np.round(pred_train1)
pred_test1 = np.round(pred_test1)

print('Accuracy on train data:',accuracy_score(y_train, pred_train1) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test1))

print('Recall on train data:',recall_score(y_train, pred_train1) )
print('Recall on test data:',recall_score(y_test, pred_test1))

print('Precision on train data:',precision_score(y_train, pred_train1) )
print('Precision on test data:',precision_score(y_test, pred_test1))

print('f1 score on train data:',f1_score(y_train, pred_train1) )
print('f1 score on test data:',f1_score(y_test, pred_test1))

In [None]:
X_train2 = X_train.drop('Experience', axis = 1)
X_test2 = X_test.drop('Experience', axis = 1)

logit2 = sm.Logit(y_train, X_train2)
lg2 = logit2.fit()


In [None]:
# Let's check accuracy and recall for this model

pred_train2 = lg2.predict(X_train2)
pred_test2 = lg2.predict(X_test2)

pred_train2 = np.round(pred_train2)
pred_test2 = np.round(pred_test2)

print('Accuracy on train data:',accuracy_score(y_train, pred_train2) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test2))

print('Recall on train data:',recall_score(y_train, pred_train2) )
print('Recall on test data:',recall_score(y_test, pred_test2))

print('Precision on train data:',precision_score(y_train, pred_train2) )
print('Precision on test data:',precision_score(y_test, pred_test2))

print('f1 score on train data:',f1_score(y_train, pred_train2) )
print('f1 score on test data:',f1_score(y_test, pred_test2))

* The accuracy of lg1 and lg2 is the same
* Let's proceed with lg1, i.e. drop the Age variable and keep the Experience variable

### Let's check VIF score again

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
num_feature_set = num_feature_set.drop(['Age'], axis = 1)
vif_series1 = pd.Series([variance_inflation_factor(num_feature_set.values,i) for i in range(num_feature_set.shape[1])],index=num_feature_set.columns)
print('Series after dropping Age: \n\n{}\n'.format(vif_series1))

### Now none of the variables have high VIF score

Let's Look at summary of lg1 and make interpretations

In [None]:
print(lg1.summary())

* Observe that the p-values of Experience, Securities_Account and all 6 classes of ZIP code are greater than 0.05, so they seem to be insignificant
* Let's drop them one by one and observe how our model changes
* This is something we also observed during EDA: Experience and Securities_Account didn't show any specific pattern with personal loan

In [None]:
X_train3 = X_train1.drop(['ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96'], axis = 1)
X_test3 = X_test1.drop(['ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96'], axis = 1)


logit3 = sm.Logit(y_train, X_train3)
lg3 = logit3.fit()

# Let's check accuracy and recall for this model
pred_train3 = lg3.predict(X_train3)
pred_test3 = lg3.predict(X_test3)

pred_train3 = np.round(pred_train3)
pred_test3 = np.round(pred_test3)

print('Accuracy on train data:',accuracy_score(y_train, pred_train3) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test3))

print('Recall on train data:',recall_score(y_train, pred_train3) )
print('Recall on test data:',recall_score(y_test, pred_test3))

print('Precision on train data:',precision_score(y_train, pred_train3) )
print('Precision on test data:',precision_score(y_test, pred_test3))

print('f1 score on train data:',f1_score(y_train, pred_train3) )
print('f1 score on test data:',f1_score(y_test, pred_test3))

print(lg3.summary())

* Accuracy increased from .954 to .957
* Now let's drop Experience

In [None]:
X_train4 = X_train3.drop(['Experience'], axis = 1)
X_test4 = X_test3.drop(['Experience'], axis = 1)


logit4 = sm.Logit(y_train, X_train4)
lg4 = logit4.fit()

# Let's check accuracy and recall for this model
pred_train4 = lg4.predict(X_train4)
pred_test4 = lg4.predict(X_test4)

pred_train4 = np.round(pred_train4)
pred_test4 = np.round(pred_test4)

print('Accuracy on train data:',accuracy_score(y_train, pred_train4) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test4))

print('Recall on train data:',recall_score(y_train, pred_train4) )
print('Recall on test data:',recall_score(y_test, pred_test4))

print('Precision on train data:',precision_score(y_train, pred_train4) )
print('Precision on test data:',precision_score(y_test, pred_test4))

print('f1 score on train data:',f1_score(y_train, pred_train4))
print('f1 score on test data:',f1_score(y_test, pred_test4))

print(lg4.summary())

* No change in Accuracy, only the Recall on test data decreased from .67 to .66
* Now let's drop Securities_Account

In [None]:
X_train5 = X_train4.drop(['Securities_Account'], axis = 1)
X_test5 = X_test4.drop(['Securities_Account'], axis = 1)


logit5 = sm.Logit(y_train, X_train5)
lg5 = logit5.fit()

# Let's check accuracy and precision for this model
pred_train5 = lg5.predict(X_train5)
pred_test5 = lg5.predict(X_test5)

pred_train5 = np.round(pred_train5)
pred_test5 = np.round(pred_test5)

print('Accuracy on train data:',accuracy_score(y_train, pred_train5) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test5))

print('Recall on train data:',recall_score(y_train, pred_train5) )
print('Recall on test data:',recall_score(y_test, pred_test5))

print('Precision on train data:',precision_score(y_train, pred_train5) )
print('Precision on test data:',precision_score(y_test, pred_test5))

print('f1 score on train data:',f1_score(y_train, pred_train5))
print('f1 score on test data:',f1_score(y_test, pred_test5))

print(lg5.summary())

* Accuracy and Recall are approximately the same
* The difference isn't much, so we can say that lg5 is the best model for making inferences

* The coefficients of Income, Family, CCAvg, Education, Mortgage and CD_Account are positive; an increase in these will lead to an increase in the chances of taking a personal loan
* The coefficients of Online and CreditCard are negative; an increase in these will lead to a decrease in the chances of taking a personal loan
* A 1 unit change in CCAvg will change the odds of taking a loan by 15.77%
* Similarly, a 1 unit change in Income will change the odds of taking a loan by 5.77%
* Family, Education and CD_Account have greater coefficients, so small changes in their values will produce a bigger change in the chances of taking a personal loan

* Please note that when the coefficient is b, then the change in odds is (exp(b)-1)*100 %
* Probability = odds/(1+odds)
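
The two formulas above can be sketched as small helper functions. The function names and the example coefficient are assumptions chosen for illustration; the coefficient is not read from the fitted `lg5` summary.

```python
import math


def pct_change_in_odds(b):
    """Percentage change in the odds for a 1 unit increase in a predictor with coefficient b."""
    return (math.exp(b) - 1) * 100


def odds_to_probability(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1 + odds)


b_example = 0.05  # hypothetical logistic regression coefficient
print("Change in odds (%):", pct_change_in_odds(b_example))

# Odds of 1 mean the event is as likely as not
print("Probability at odds 1:", odds_to_probability(1.0))
```

A coefficient of zero leaves the odds unchanged, and a negative coefficient gives a negative percentage change, which matches the reading of the Online and CreditCard coefficients.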

* It seems the recall score can be improved further, so let's try to change the model threshold using the ROC curve
* There are no signs of overfitting
* In some places the metric performs better on the test set; that depends entirely on the data distribution - if we change the random state, this would change too

### We can see that Recall can be improved further, let's try to do that using optimal threshold

### Optimal threshold using AUC-ROC curve

In [None]:
# Optimal threshold as per AUC-ROC curve
# The optimal cut off would be where tpr is high and fpr is low
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(y_test, lg5.predict(X_test5))

optimal_idx = np.argmax(tpr - fpr)
optimal_threshold = thresholds[optimal_idx]
print(optimal_threshold)
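
The rule used above, picking the threshold where `tpr - fpr` is largest, is often called Youden's J statistic. Below is a minimal pure-Python sketch of the same idea on made-up labels and scores; `youden_threshold` is a hypothetical helper, and the data is invented for illustration rather than taken from the model.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, J) maximising J = TPR - FPR, treating score >= threshold as positive."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    best_j, best_t = -1.0, None
    # Every distinct score is a candidate threshold, from strictest to loosest
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j


# Made-up labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

threshold, j_stat = youden_threshold(y_true, scores)
print("Best threshold:", threshold, "with J =", j_stat)
```

Note that the chosen threshold can sit well below 0.5 when positives are rare, which is the reason recall improves after this step.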

In [None]:
# Model prediction with optimal threshold
# roc_curve counts scores >= threshold as positive, so use >= here to match
pred_train_opt = (lg5.predict(X_train5)>=optimal_threshold).astype(int)
pred_test_opt = (lg5.predict(X_test5)>=optimal_threshold).astype(int)

print('Accuracy on train data:',accuracy_score(y_train, pred_train_opt) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test_opt))

print('Recall on train data:',recall_score(y_train, pred_train_opt) )
print('Recall on test data:',recall_score(y_test, pred_test_opt))

print('Precision on train data:',precision_score(y_train, pred_train_opt) )
print('Precision on test data:',precision_score(y_test, pred_test_opt))

print('f1 score on train data:',f1_score(y_train, pred_train_opt))
print('f1 score on test data:',f1_score(y_test, pred_test_opt))

### Using AUC-ROC curve to get optimal threshold
* Accuracy decreased from .95 to .88
* Recall increased from .66 to .91

### As we decrease the threshold value, Recall will keep increasing but Precision will fall, which leads to high marketing cost, so we need to choose an optimal balance between recall and accuracy

In [None]:
# let us make confusion matrix on train set
make_confusion_matrix(y_train,pred_train_opt)

In [None]:
# let us make confusion matrix on test set
make_confusion_matrix(y_test,pred_test_opt)

In [None]:
#AUC ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Use the final model (lg5) with its matching feature set
logit_roc_auc = roc_auc_score(y_test, lg5.predict(X_test5))
fpr, tpr, thresholds = roc_curve(y_test, lg5.predict(X_test5))
plt.figure(figsize=(13,8))
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

* Area under the curve is 0.95
* Recall is .91 on train and .87 on test, which is quite good

# Decision Trees

## Split data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=.30, random_state=1)

## Build Decision Tree Model

We will build our model using `DecisionTreeClassifier`, with the default 'gini' criterion to split. Other options include 'entropy'.
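
To show what the two criteria measure, here is a minimal sketch that computes Gini impurity and entropy from the class counts in a node. The helper names and the example counts are assumptions for illustration only.

```python
import math


def gini(counts):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical nodes: [customers without loan, customers with loan]
for node in ([50, 50], [90, 10], [100, 0]):
    print(node, "gini =", gini(node), "entropy =", entropy(node))
```

Both measures are largest for an evenly mixed node and zero for a pure node, so in practice they tend to pick similar splits.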

In [None]:
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, y_train)

In [None]:
prob_train = dTree.predict_proba(X_train)
pred_train = dTree.predict(X_train)

prob_test = dTree.predict_proba(X_test)
pred_test = dTree.predict(X_test)

In [None]:
# let us make confusion matrix on train set
make_confusion_matrix(y_train,pred_train)

* 0 errors on train data, each sample has been classified correctly

In [None]:
# let us make confusion matrix on test set
make_confusion_matrix(y_test,pred_test)

In [None]:
print('Accuracy on train data:',accuracy_score(y_train, pred_train) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test))

print('Recall on train data:',recall_score(y_train, pred_train) )
print('Recall on test data:',recall_score(y_test, pred_test))

print('Precision on train data:',precision_score(y_train, pred_train) )
print('Precision on test data:',precision_score(y_test, pred_test))

print('f1 score on train data:',f1_score(y_train, pred_train) )
print('f1 score on test data:',f1_score(y_test, pred_test))

In [None]:
#AUC ROC curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

dtree_roc_auc = roc_auc_score(y_test, dTree.predict_proba(X_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, dTree.predict_proba(X_test)[:,1])
plt.figure(figsize=(13,8))
plt.plot(fpr, tpr, label='Decision tree (area = %0.2f)' % dtree_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('DTree_ROC')  # separate file name so the logistic regression plot is not overwritten
plt.show()

* Overall the output looks good and accuracy shows no sign of overfitting, but recall is 1 on the train set and .89 on the test set, so we'll use pruning to try to reduce this difference
* Area under the curve is 0.94, which is quite good

## Visualizing the Decision Tree

In [None]:
feature_names = list(X.columns)
print(feature_names)

In [None]:
plt.figure(figsize=(20,30))
tree.plot_tree(dTree,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()

In [None]:
# importance of features in the tree building ( The importance of a feature is computed as the 
#(normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print (pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))

In [None]:
importances = dTree.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(8,8))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

In [None]:
# Text report showing the rules of a decision tree -

print(tree.export_text(dTree,feature_names=feature_names,show_weights=True))

### Observations

* Online, CreditCard, Securities_Account and ZIPCode have 0 importance; Education is the most important feature, followed by Income and Family size
* People with Income less than 116.5, CCAvg less than 2.95 and Income less than 106.5 (thousand dollars) have a lower chance of buying a loan
* But people with Income more than 106.5, Family size other than 4, Age less than 28.50 and Experience greater than 3.50 have a higher chance of taking a loan
* People with Income greater than 116.5, only an undergraduate education and a family size less than 2 have a lower chance of buying a loan, while people with a family size greater than 2 and an education level above undergraduate have a higher chance of buying a loan

* So the bank should campaign more to people with higher income, more education and larger family sizes

## Reducing over fitting (Regularization)

* In general, the deeper you allow your tree to grow, the more complex your model becomes: it has more splits and captures more information about the training data, which is one of the root causes of overfitting

### Let's try Grid search
* Hyperparameter tuning is tricky in the sense that there is no direct way to calculate how a change in a hyperparameter value will reduce the loss of your model, so we usually resort to experimentation, i.e. we'll use grid search
* Grid search is a tuning technique that attempts to compute the optimum values of hyperparameters.
* It is an exhaustive search that is performed over the specified parameter values of a model.
* The parameters of the estimator/model are optimized by cross-validated grid search over a parameter grid.

In [None]:
from sklearn.model_selection import GridSearchCV

# Choose the type of classifier. 
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(6,15), 
              'min_samples_leaf': [1, 2, 5, 7, 10],
              'max_leaf_nodes' : [2, 3, 5, 10],
             }

# Type of scoring used to compare parameter combinations (recall, per the evaluation criteria above)
recall_scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
estimator.fit(X_train, y_train)
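
Because grid search is exhaustive, its cost grows with the product of the grid sizes. The sketch below counts the combinations for the grid used above; it only mirrors the parameter values from the cell and does not fit any model. The variable names are chosen for the example.

```python
from itertools import product

# Same values as the grid above; range(6, 15) matches np.arange(6, 15)
param_grid = {
    "max_depth": list(range(6, 15)),
    "min_samples_leaf": [1, 2, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10],
}
cv_folds = 5

# Every combination of one value per parameter is tried once per fold
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * cv_folds

print("Parameter combinations:", len(combinations))
print("Model fits with 5-fold CV:", n_fits)
```

One thing to keep in mind when reading the results: `max_leaf_nodes` never exceeds 10 in this grid, so every candidate tree is quite small, which may limit how much recall the tuned tree can reach.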

In [None]:
prob_train = estimator.predict_proba(X_train)
pred_train = estimator.predict(X_train)

prob_test = estimator.predict_proba(X_test)
pred_test = estimator.predict(X_test)

In [None]:

print('Accuracy on train data:',accuracy_score(y_train, pred_train) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test))

print('Recall on train data:',recall_score(y_train, pred_train) )
print('Recall on test data:',recall_score(y_test, pred_test))

print('Precision on train data:',precision_score(y_train, pred_train) )
print('Precision on test data:',precision_score(y_test, pred_test))

print('f1 score on train data:',f1_score(y_train, pred_train) )
print('f1 score on test data:',f1_score(y_test, pred_test))

* This doesn't seem to provide good outputs, recall decreased for both train and test data

## Cost Complexity Pruning

The `DecisionTreeClassifier` provides parameters such as
``min_samples_leaf`` and ``max_depth`` to prevent a tree from overfitting. Cost
complexity pruning provides another option to control the size of a tree. In
`DecisionTreeClassifier`, this pruning technique is parameterized by the
cost complexity parameter, ``ccp_alpha``. Greater values of ``ccp_alpha``
increase the number of nodes pruned. Here we only show the effect of
``ccp_alpha`` on regularizing the trees and how to choose a ``ccp_alpha``
based on validation scores.
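
As a rough sketch of the idea behind ``ccp_alpha``: cost complexity pruning scores a tree T as R(T) + alpha * |T|, where R(T) is the total impurity of its leaves and |T| is the number of leaves. A subtree is worth collapsing into a single leaf once alpha reaches its effective alpha. The helper functions and the numbers below are assumptions for illustration, not values from the fitted tree.

```python
def cost_complexity(leaf_impurity_sum, n_leaves, alpha):
    """R_alpha(T) = R(T) + alpha * |T|, with |T| the number of leaves."""
    return leaf_impurity_sum + alpha * n_leaves


def effective_alpha(node_impurity, subtree_impurity, subtree_leaves):
    """Alpha at which collapsing a subtree into one leaf costs the same as keeping it."""
    return (node_impurity - subtree_impurity) / (subtree_leaves - 1)


# Hypothetical subtree: impurity 0.10 if collapsed to one leaf,
# total leaf impurity 0.04 if kept with its 4 leaves
alpha_eff = effective_alpha(0.10, 0.04, 4)
print("Effective alpha:", alpha_eff)

# At that alpha, keeping the subtree and collapsing it cost the same
print("Cost if kept     :", cost_complexity(0.04, 4, alpha_eff))
print("Cost if collapsed:", cost_complexity(0.10, 1, alpha_eff))
```

For any alpha above the effective alpha, the collapsed version is cheaper, which is why larger ``ccp_alpha`` values prune more nodes.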

In [None]:
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

In [None]:
pd.DataFrame(path)

In [None]:
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()

Next, we train a decision tree using the effective alphas. The last value
in ``ccp_alphas`` is the alpha value that prunes the whole tree,
leaving the tree, ``clfs[-1]``, with one node.

In [None]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1]))


For the remainder, we remove the last element in
``clfs`` and ``ccp_alphas``, because it is the trivial tree with only one
node. Here we show that the number of nodes and tree depth decrease as alpha
increases.

In [None]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()

### Recall vs alpha for training and testing sets
When ``ccp_alpha`` is set to zero and the other parameters of
`DecisionTreeClassifier` are kept at their defaults, the tree overfits, leading to
a 100% training recall and 90% testing recall. As alpha increases, more
of the tree is pruned, which is expected to give a decision tree that generalizes better.

In [None]:
recall_train=[]
for clf in clfs:
    pred_train3=clf.predict(X_train)
    values_train=metrics.recall_score(y_train,pred_train3)
    recall_train.append(values_train)
    
recall_test=[]
for clf in clfs:
    pred_test3=clf.predict(X_test)
    values_test=metrics.recall_score(y_test,pred_test3)
    recall_test.append(values_test)

In [None]:
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()

In [None]:
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)

In [None]:
prob_train = best_model.predict_proba(X_train)
pred_train = best_model.predict(X_train)

prob_test = best_model.predict_proba(X_test)
pred_test = best_model.predict(X_test)


print('Accuracy on train data:',accuracy_score(y_train, pred_train) )
print('Accuracy on test data:',accuracy_score(y_test, pred_test))

print('Recall on train data:',recall_score(y_train, pred_train) )
print('Recall on test data:',recall_score(y_test, pred_test))

print('Precision on train data:',precision_score(y_train, pred_train) )
print('Precision on test data:',precision_score(y_test, pred_test))

print('f1 score on train data:',f1_score(y_train, pred_train) )
print('f1 score on test data:',f1_score(y_test, pred_test))

* Post-pruning with `ccp_alpha` narrows the gap between train and test performance, but only by lowering performance on the train set, so let's proceed with the basic model we built earlier
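
Picking the alpha with the highest test recall uses the test set for model selection, so the reported test scores may be optimistic. One possible alternative is to choose `ccp_alpha` by cross-validated recall on the training data only. The sketch below runs on synthetic data; `X_demo`, `y_demo`, `candidate_alphas`, and the 5-fold setting are assumptions for illustration, not part of the analysis above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (assumption)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=1
)

# Candidate alphas from the pruning path; drop the last one (single-node tree).
# Clipping guards against tiny negative values caused by floating-point error.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X_tr, y_tr)
candidate_alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))

# Mean 5-fold recall on the training split for each candidate alpha
cv_recall = [
    cross_val_score(
        DecisionTreeClassifier(random_state=1, ccp_alpha=alpha),
        X_tr, y_tr, scoring="recall", cv=5,
    ).mean()
    for alpha in candidate_alphas
]

# Refit on the full training split with the selected alpha
best_alpha = candidate_alphas[int(np.argmax(cv_recall))]
pruned_tree = DecisionTreeClassifier(random_state=1, ccp_alpha=best_alpha).fit(X_tr, y_tr)
print(best_alpha, pruned_tree.tree_.node_count)
```

With this approach the test split is only touched once, when the final pruned tree is evaluated.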

### Compare outputs from Logistic Regression and Decision Tree
* Overall, the decision tree performs better on the given dataset
* Looking at the important variables, based on p-values in logistic regression and feature importance in the decision tree:
    * Income, CCAvg, CD_Account, Family, Education, and Mortgage are important in both
    * Their logistic regression coefficients show that an increase in these variables increases the chances of buying a loan
    
### Recommendation
**The bank should spend more on campaigning to people with income above 116 (thousand dollars), higher education (graduate/professional), a family size of 3 or more, and mortgage values greater than 284 (thousand dollars).**

In [None]:
Measures = {'Logistic Regression': [0.95, 0.94, 0.66, 0.59, 0.83, 0.74, 0.73, 0.66],
        'LR with opt threshold': [0.88, 0.88, 0.91, 0.87, 0.45, 0.44, 0.60, 0.59],
        'Decision Tree': [1, 0.98, 1, 0.89, 1, 0.91, 1, 0.90]
        }

df = pd.DataFrame(Measures, columns = ['Logistic Regression','LR with opt threshold', 'Decision Tree'],
                  index=['Train_accuracy','Test_accuracy','Train_Recall','Test_Recall','Train_precision','Test_precision','Train_f1_score','Test_f1_score'])

df.T

## Now let's do misclassification analysis
* Is there any pattern among the samples that our model (dTree) classifies incorrectly?

In [None]:
# Predict on the full dataset (train + test) and reshape the result to a single column
Y1 = dTree.predict(X)
Y1 = Y1.reshape(-1, 1)

Y2 = np.subtract(Y, Y1)

# Most of the values in Y2 are 0, only 33 values are either 1 or -1
#  1: the person bought the loan, but the model predicted they would not
# -1: the person did not buy the loan, but the model predicted they would

# Let's concatenate this Y2 with X
df1 = pd.DataFrame(Y2)
df2 = pd.concat([X, df1], axis=1)

In [None]:
df2

In [None]:
incorrect_df = df2[df2['Personal_Loan'] != 0] 

In [None]:
incorrect_df.shape

* There are 28 misclassifications, and all of them are in the test data
* `incorrect_df` contains all the misclassified samples
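
The same kind of error table can also be built by comparing predictions to labels row by row and naming each error type, which keeps the two kinds of mistake easy to tell apart. This is a minimal sketch on a tiny hand-made frame; `demo`, `actual`, `predicted`, and `error_type` are hypothetical names and values, not columns or rows of the loan data.

```python
import numpy as np
import pandas as pd

# Hand-made example rows (assumption: values are illustrative only)
demo = pd.DataFrame({
    "Income": [40, 120, 98, 150, 60],
    "Family": [1, 3, 2, 4, 1],
    "actual": [0, 1, 1, 0, 0],
    "predicted": [0, 1, 0, 1, 0],
})

# actual - predicted:  1 means a missed buyer, -1 means a wrongly predicted buyer
diff = demo["actual"] - demo["predicted"]
demo["error_type"] = np.select(
    [diff == 1, diff == -1],
    ["false_negative", "false_positive"],
    default="correct",
)

# Keep only the misclassified rows; the original index is preserved
misclassified = demo[demo["error_type"] != "correct"]
print(misclassified)
```

Counting `misclassified["error_type"].value_counts()` then shows how many errors are missed buyers versus wrongly targeted customers.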

In [None]:
# ZIP code was not an important variable in either Logistic Regression or the Decision Tree,
# so let's drop those columns to make visualization easier

incorrect_df = incorrect_df.drop(['ZIPCode_91','ZIPCode_92','ZIPCode_93','ZIPCode_94','ZIPCode_95','ZIPCode_96'], axis = 1)

incorrect_df

* Let's try to see if there is any specific pattern in these samples

In [None]:
incorrect_df.profile_report()

### Looking at the above profile, we see that incorrectly classified people:
* Are usually between 28 and 65 years old and have between 4 and 41 years of experience, with 18 and 20 unique values respectively
* Have income between 64 and 115 (thousand dollars), while income overall ranges from 8 to 224 (thousand dollars)
* Mostly have 0 mortgage, no Securities Account, and no CD_Account, a family size of 1 or 2, and professional/advanced education

* The business rule we derived says that people with income below 116, a smaller mortgage, and a family size below 3 usually don't buy a loan. There are always special cases, so some people with lower income and a smaller family may still buy a loan.