# Business Problem:
A group of customers was offered, in person, a loan at a discounted rate with the processing fee waived.
A pilot campaign was conducted to find out whether each customer was interested in taking out a loan.
The responses were recorded and the data was collected.
Based on the data given, we need to:

- [x] Build a model to predict whether customers will be interested in taking out a loan or not.
- [ ] Identify the most important features.
- [ ] For black-box models (e.g. Random Forest), use SHAP or LIME to work out which features affect the target variable.
- [ ] Approaching a customer has a cost, so find the profitable segments to allow more customized marketing.
- [ ] The model will be needed on a monthly basis because the data is updated each month.
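
The cost argument in the list above can be sketched as a simple expected-profit rule: a segment is worth contacting only if its response rate times the profit per loan exceeds the contact cost. The figures, the helper names `expected_profit` and `profitable_segments`, and the per-segment response rates below are all hypothetical and are not taken from the campaign data.

```python
# Hypothetical figures, for illustration only (not taken from the campaign data).
CONTACT_COST = 50.0        # assumed cost of approaching one customer
PROFIT_PER_LOAN = 1000.0   # assumed profit when a customer takes the loan


def expected_profit(p_response, contact_cost=CONTACT_COST,
                    profit_per_loan=PROFIT_PER_LOAN):
    """Expected profit of contacting one customer with response probability p_response."""
    return p_response * profit_per_loan - contact_cost


def profitable_segments(segment_rates):
    """Return (segment, expected profit) pairs with positive profit, best first."""
    scored = {name: expected_profit(rate) for name, rate in segment_rates.items()}
    keep = [(name, profit) for name, profit in scored.items() if profit > 0]
    return sorted(keep, key=lambda item: item[1], reverse=True)


# Assumed response rates per occupation segment (made-up values).
rates = {'SELF-EMP': 0.12, 'SAL': 0.04, 'PROF': 0.06}
for name, profit in profitable_segments(rates):
    print(f'{name}: expected profit per contact = {profit:.2f}')
```

In practice the response rate per segment would come from the model's predicted probabilities, and the two constants from the business side.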

Variables involved: `Customer_id`, `Age`, `Gender`, `Balance`, `Occupation`, `No of Credit transaction`, `SCR`, `Holding period`

> ## Questions for the External Mentor


- [ ] `Holding Period`: what is the unit of measurement (months or years)?
- [ ] `Balance`: what is the unit of measurement, and is it the current balance, a quarterly figure, etc.?
- [ ] `SCR` (Solvency Capital Requirement): explain in detail.
- [ ] `No. of Credit Transactions`: what does it mean?
- [ ] `O` in the `Gender` column: does it mean a null value?
- [ ] What do the values in `Occupation` (`SELF-EMP`, `SAL`, `SENP`, `PROF`) stand for?
- [ ] Need a summary of what is going on

`SCR`: propensity of a customer to respond to digital marketing.

##### Changes v6:
1. All models now measure recall on the same test data

2. Fixed a sampling mistake

3. Renamed `print_classification_report` to `classification_report` for better clarity and ease of use

4. Visualized the decision trees

5. Implemented SVC

6. Implemented KNN, which gave great results with default parameters

##### Changes v7:
1. Fit Random Forest models

2. Fit XGBoost models

In [2]:
%pip install xgboost

Note: you may need to restart the kernel to use updated packages.



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, cross_validate
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, recall_score, f1_score, precision_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import SMOTE

import xgboost as xgb

ModuleNotFoundError: No module named 'xgboost'

In [None]:
data = pd.read_csv('Model_data.csv')
data.head()

In [None]:
data.Balance = data.Balance.astype('int32') #Truncating decimals

In [None]:
data[data.Balance<0]

In [None]:
data.head()

In [None]:
data.Balance.describe()

In [None]:
data.shape

In [None]:
data.info()

`Gender` and `Occupation` are categorical variables stored as object type.

**EDA**

No strong correlations were measured, apart from mild ones between `Holding_Period` and the other variables.

In [None]:
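# Correlate numeric columns only; `Gender` and `Occupation` are object-typed
sns.heatmap(data.select_dtypes('number').corr(), annot=True, square=True)  # No strong correlations seen overall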
plt.show()

In [None]:
# sns.pairplot(data)
# plt.show()

In [None]:
sns.countplot(x = data.Gender)
plt.show()

In [None]:
data.Gender.unique()

In [None]:
data.Occupation.unique()

In [None]:
data.Gender.value_counts()

In [None]:
data.drop(data.Gender[data.Gender== 'O'].index, axis = 0, inplace= True) # Removed 196 rows with `Gender` = 'O'

In [None]:
data.shape

In [None]:
data.Balance.describe()

In [None]:
# sns.histplot(data.Age)

In [None]:
sns.countplot(x = data.Occupation)

In [None]:
sns.countplot(x=data.Target, hue=data.Occupation) ## Self employed are much more likely to take loans

In [None]:
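g = sns.FacetGrid(data, col='Occupation', hue="Gender")
plt.grid(True)
# Pass an explicit order so every facet places the categories consistently
g.map(sns.countplot, "Gender", order=sorted(data.Gender.unique()), alpha=1)
g.add_legend()
plt.grid(False)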

--------------------------

In [None]:
# sns.histplot(data.No_OF_CR_TXNS)

In [None]:
data.No_OF_CR_TXNS.describe()

In [None]:
sns.violinplot(x=data.No_OF_CR_TXNS)
plt.grid(True)

In [None]:
# len(data[data.No_OF_CR_TXNS==0].index)

In [None]:
# data.drop(index=data[data.No_OF_CR_TXNS==0].index, axis=0)

------------------

In [None]:
# sns.displot(data.SCR, kind = 'kde')
sns.histplot(data.SCR, kde=True)  # replaces the deprecated sns.distplot
plt.show()

In [None]:
data.SCR.describe()

In [None]:
# sns.histplot(data.Holding_Period)

##### End of Exploratory Data Analysis
-----------
----------

> ### Create a function for easy report printing

In [None]:
# A class for pretty printing
class color:
    PURPLE = '\033[95m'
    CYAN = '\033[96m'
    DARKCYAN = '\033[36m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'
    
# function for validation on test data   
def classification_report(y_true, y_prediction, type_of_data='Enter Over/Under/Original sampled', type_of_classifier='ClassifierName'):
    """Print Classification report"""
    
    accuracy = accuracy_score(y_true, y_prediction)
    precision = precision_score(y_true, y_prediction)
    recall = recall_score(y_true, y_prediction)
    f1 = f1_score(y_true, y_prediction)
    
    print('Classification Report on Testing Data:\n'+ color.BOLD + type_of_data, 'data\n'+color.END+color.RED+color.BOLD+type_of_classifier,'Classifier'+color.END+color.END)
    print()
    print('---------------------------------------')
    print(color.BOLD + 'Recall: %s' %recall + color.END)
    print('Precision: %s' %precision)
    print('F1 score: %s' %f1)
    print('Accuracy: %s' %accuracy)
    print('---------------------------------------')
    print()


# A function for cross-validation report    
def cross_val_report(classifier, train_data, train_label, cv=10, scoring=['recall','precision', 'f1','accuracy']):
    
    score = cross_validate(classifier, train_data, train_label, cv=cv, scoring= scoring)
    recall = np.mean(score['test_recall'])
    precision = np.mean(score['test_precision'])
    f1 = np.mean(score['test_f1'])
    accuracy= np.mean(score['test_accuracy'])
    print('Cross Validation Report')
    print(color.BOLD + 'Recall: %s' %recall + color.END)
    print('Precision: %s' %precision)
    print('F1: %s' %f1)
    print('Accuracy: %s' %accuracy)
    print()
    print("*Mean values presented")
    print('---------------------------------------')
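
The report functions above highlight recall. A minimal pure-Python sketch of how the four metrics follow from the confusion counts shows why: on imbalanced data a model can score high accuracy while finding no responders at all. The helper `binary_metrics` and the toy labels are hypothetical, and this is not how scikit-learn implements the metrics.

```python
def binary_metrics(y_true, y_pred):
    """Recall, precision, F1 and accuracy for 0/1 labels, from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # responders found
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # responders missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct rejections

    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {'recall': recall, 'precision': precision, 'f1': f1, 'accuracy': accuracy}


# Toy imbalanced sample: 2 responders among 20 customers (made-up labels).
y_true = [1, 1] + [0] * 18
always_no = [0] * 20                 # a model that never predicts a responder
one_hit = [1, 0, 1] + [0] * 17       # finds one responder, raises one false alarm

print(binary_metrics(y_true, always_no))
print(binary_metrics(y_true, one_hit))
```

Both toy models reach the same accuracy, but only the second has non-zero recall, which is the quantity the campaign cares about.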

**Create the first set of training and test data from the imbalanced data**

In [None]:
df = pd.get_dummies(data, columns=['Gender','Occupation'], drop_first = True)
df.head()

>**Creating a model with Original Unbalanced data and measuring metrics**

In [None]:
# Select the label by name rather than by column position
y_original = df['Target']
X_original = df.drop(columns=['Target'])

In [None]:
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(X_original, y_original, shuffle=True, stratify=y_original)
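
The `stratify=y_original` argument in the split above keeps the share of responders the same in the training and test parts. The following is a pure-Python sketch of that idea only, not scikit-learn's implementation; the helper `stratified_split` and the toy labels are hypothetical, and the 25% test fraction mirrors the `train_test_split` default.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split row indices so each class keeps the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 10% responders (made-up data).
labels = [1] * 20 + [0] * 180
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f'train responder rate: {train_rate:.2f}, test responder rate: {test_rate:.2f}')
```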

In [None]:
clf = DecisionTreeClassifier(max_depth = 5)
clf.fit(X_train_orig, y_train_orig)
y_prediction_orig = clf.predict(X_test_orig)
classification_report(y_test_orig, y_prediction_orig, 'Original', 'Decision Tree')
# cross_val_report(clf, y_test_orig,y_under_prediction.reshape(1,-1))

In [None]:
fig = plt.figure(figsize=(50,20))
_ = plot_tree(clf, 
                   feature_names=list(X_original.columns),  
                   class_names=['0','1'],
                   filled=True, fontsize=10)

-----------

>**Create undersampled data and fit a model**

In [None]:
X_under_train, y_under_train = NearMiss().fit_resample(X_train_orig, y_train_orig)

In [None]:
data[data.Target==1].shape

In [None]:
X_under_train.shape, y_under_train.shape

In [None]:
clf_under_sampled = DecisionTreeClassifier(max_depth = 5)
clf_under_sampled.fit(X_under_train, y_under_train)
y_under_prediction = clf_under_sampled.predict(X_test_orig)
classification_report(y_test_orig,y_under_prediction, 'Undersampled', 'Decision Tree')

In [None]:
# cross_val_report(clf_under_sampled, y_test_orig,y_under_prediction)

*Cross-validation here causes an unbalanced split*

In [None]:
fig = plt.figure(figsize=(100,100))
_ = plot_tree(clf_under_sampled, 
                   feature_names=list(X_original.columns),  
                   class_names=['0','1'],
                   filled=True, fontsize=10)

--------------------

>**Model on an oversampled dataset**

In [None]:
# Oversample only the training split so synthetic points are never built from test rows
X_over_train, y_over_train = SMOTE().fit_resample(X_train_orig, y_train_orig)

In [None]:
clf_over_sampled = DecisionTreeClassifier(max_depth = 5)
clf_over_sampled.fit(X_over_train, y_over_train)
y_over_predict = clf_over_sampled.predict(X_test_orig)
classification_report(y_test_orig, y_over_predict, 'Oversampled', 'Decision Tree')

In [None]:
fig = plt.figure(figsize=(100,100))
_ = plot_tree(clf_over_sampled, 
                   feature_names=list(X_original.columns),  
                   class_names=['0','1'],
                   filled=True, fontsize=10)

---------------

In [None]:
print("Original:     "+color.BOLD+ "X_original,y_original"+color.END+"::  X_train_orig, X_test_orig, y_train_orig, y_test_orig")
print()
print("Undersampled:"+color.BOLD+ " X_under, y_under"+color.END+"     ::  X_under_train, y_under_train")
print()
print("Oversampled:"+color.BOLD+ "  X_over, y_over"+color.END+"       ::  X_over_train, y_over_train")

The datasets above could be sampled better by adjusting the hyperparameters of NearMiss and SMOTE, or by using other sampling methods.
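
To make the SMOTE hyperparameters more concrete, here is a pure-Python sketch of the interpolation step behind SMOTE-style oversampling: each synthetic point lies on the segment between a minority point and one of its `k_neighbors` nearest minority neighbours. It is a simplified illustration, not imbalanced-learn's implementation; the helper `smote_like_samples` and the toy points are hypothetical. In imbalanced-learn itself, `SMOTE` exposes `k_neighbors` and `sampling_strategy`, and `NearMiss` exposes `version` and `n_neighbors`.

```python
import math
import random


def smote_like_samples(minority, n_new, k_neighbors=2, seed=0):
    """Create n_new synthetic points by interpolating towards near minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the chosen point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(p, base),
        )[:k_neighbors]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority-class points in two features (made-up data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
new_points = smote_like_samples(minority, n_new=3, k_neighbors=2)
print(new_points)
```

A smaller `k_neighbors` keeps synthetic points close to existing clusters, while a larger value lets them spread further across the minority region.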

-----------
-----------

### SVM Classifiers applied

*SVC fails to fit on the original dataset, possibly because of the class imbalance*

In [None]:
clf_svc0 = SVC()

clf_svc0.fit(X_under_train, y_under_train)
y_predict = clf_svc0.predict(X_test_orig)
classification_report(y_test_orig, y_predict, 'Undersampled', 'SVM')
cross_val_report(clf_svc0, X_under_train, y_under_train)
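
SVC (with its default RBF kernel) and KNN both work on distances between rows, so a wide-ranging column such as `Balance` can dominate narrower ones. Whether this contributes to the fitting problem noted above is an assumption that this notebook does not test. The sketch below shows the standardization step in pure Python. The helper `standardize` and the sample values are hypothetical, and a real pipeline would more likely use a scaler fitted on the training split only.

```python
import statistics


def standardize(column):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        return [0.0 for _ in column]   # constant column: nothing to scale
    return [(value - mean) / std for value in column]


# Made-up values on very different scales.
balance = [1000.0, 2000.0, 3000.0]
age = [25.0, 35.0, 45.0]

print(standardize(balance))
print(standardize(age))
```

After scaling, both columns contribute on a comparable scale to any distance computation.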

In [None]:
# %%time
# # Will take LONG Time for Training
# clf_svc1 = SVC()
# clf_svc1.fit(X_over_train, y_over_train)
# y_predict = clf_svc1.predict(X_test_orig)
# classification_report(y_test_orig, y_predict, 'Oversampled', 'SVM')
# cross_val_report(clf_svc1, X_under_train, y_under_train)

-------------
--------------------

### KNN Classifier Models

In [None]:
clf_KNN0 = KNeighborsClassifier()
clf_KNN0.fit(X_train_orig, y_train_orig)
y_predict= clf_KNN0.predict(X_test_orig)
classification_report(y_test_orig, y_predict, 'Original', 'KNN')
cross_val_report(clf_KNN0, X_train_orig, y_train_orig)  # cross-validate on the data this model was fit on

-----------

In [None]:
clf_KNN1 = KNeighborsClassifier()
clf_KNN1.fit(X_under_train, y_under_train)
y_predict= clf_KNN1.predict(X_test_orig)
classification_report(y_test_orig, y_predict, 'Undersampled', 'KNN')
cross_val_report(clf_KNN1, X_under_train, y_under_train)

--------------

In [None]:
clf_KNN2 = KNeighborsClassifier()
clf_KNN2.fit(X_over_train, y_over_train)
y_predict= clf_KNN2.predict(X_test_orig)
classification_report(y_test_orig, y_predict, 'Oversampled', 'KNN')
cross_val_report(clf_KNN2, X_over_train, y_over_train)  # cross-validate on the data this model was fit on

----------
----------

### Random Forest Classifier Models

In [None]:
clf_rf0 = RandomForestClassifier()
clf_rf0.fit(X_train_orig, y_train_orig)
y_predict= clf_rf0.predict(X_test_orig)
classification_report(y_test_orig, y_predict, 'Original', 'Random Forest')
cross_val_report(clf_rf0, X_train_orig, y_train_orig)  # cross-validate on the data this model was fit on

-------------

In [None]:
clf_rf1 = RandomForestClassifier()
clf_rf1.fit(X_under_train, y_under_train)
y_predict= clf_rf1.predict(X_test_orig)
classification_report(y_test_orig, y_predict, 'Undersampled', 'Random Forest')
cross_val_report(clf_rf1, X_under_train, y_under_train)

-------------

In [None]:
clf_rf2 = RandomForestClassifier()

In [None]:
# Fit on the oversampled training data, matching the 'Oversampled' label below
clf_rf2.fit(X_over_train, y_over_train)
y_predict= clf_rf2.predict(X_test_orig)
classification_report(y_test_orig, y_predict, 'Oversampled', 'Random Forest')
cross_val_report(clf_rf2, X_over_train, y_over_train)

----------
----------------

# XGBoost Models

### Converting training and test data to DMatrix types

In [None]:
train_orig_DM = xgb.DMatrix(X_train_orig, label= y_train_orig)
train_under_DM = xgb.DMatrix(X_under_train, label= y_under_train)
train_over_DM = xgb.DMatrix(X_over_train, label= y_over_train)

test_DM = xgb.DMatrix(X_test_orig, label= y_test_orig)

In [None]:
param = {
    'max_depth': 5,
    'eta': 0.1,
    'objective': 'multi:softmax',
    'num_class': 2  # Target is binary (0/1)
}

epochs = 10

In [None]:
xgb_cl0 = xgb.train(param, train_orig_DM, epochs)
predictions = xgb_cl0.predict(test_DM)
classification_report(y_test_orig, predictions, 'Original', 'XgBoost')

In [None]:
xgb_cl1 = xgb.train(param, train_under_DM, epochs)
predictions = xgb_cl1.predict(test_DM)
classification_report(y_test_orig, predictions, 'Undersampled', 'XgBoost')

In [None]:
xgb_cl2 = xgb.train(param, train_over_DM, epochs)
predictions = xgb_cl2.predict(test_DM)
classification_report(y_test_orig, predictions, 'Oversampled', 'XgBoost')

-------
-------