## Bank Loan EDA and Classification

![](https://cdn.outsource2india.com/financial/images/banking-financial-analysis-services.jpg)

<h2>Data Analysis to identify the potential customers who have a higher probability of purchasing the loan.</h2>

The case: The Bank holds customer data covering various customer characteristics. Management built a new product, Personal Loan, and ran a small campaign to sell it to existing clients. 
After some time, 9% of customers had a Personal Loan from The Bank.


### The GOAL IS!
> - To sell more Personal Loan products to Bank customers.
> - To devise campaigns to better target marketing to increase the success ratio with a minimal budget.
> - To identify the potential customers who have a higher probability of purchasing the loan. 

In short: increase the success ratio of the advertising campaign while reducing its cost.


### The Questions for Analysis
Once 9% of customers had bought the product, the following questions arose:

> - Are there associations between customers' personal characteristics and whether they bought the product? If so:
>
> - Which main characteristics are associated with the product, and how strong is each association?
> - Which segments of those main characteristics have the strongest association with the product?
> - What does the sample of customers from those main segments look like?
 

### Approach

We carried out a simple step-by-step analysis of customer characteristics to identify patterns that help select the subset of customers with a higher probability of buying the new product, "Personal Loan", from The Bank. 
<br><br>
We performed the following steps:
> - We checked whether each of the twelve characteristics is associated with the product having been sold.
> - We found FIVE main characteristics with a stronger-than-moderate association with the product.
> - We analyzed the main characteristics and identified segments within each that differ in their strength of association with the product.
> - We tried to build a subset of customers with ideal characteristics, who would have the highest probability of buying the product. Unfortunately, our dataset does not contain such information. So...
> - We built a simple algorithm that selects a subset of the data and returns the IDs of customers with a high probability of buying the product.
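
The last step above can be sketched as a simple rule-based filter. This is a minimal sketch, not the notebook's actual algorithm: the toy DataFrame, the `select_likely_buyers` helper, and the thresholds (income of at least 100, CCAvg of at least 2.5) are illustrative assumptions; only the column names come from the dataset.

```python
import pandas as pd

# Toy data using the dataset's column names; the values are made up for illustration.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "Income": [45, 150, 120, 30, 180, 95],
    "CCAvg": [1.0, 4.5, 3.0, 0.5, 6.0, 2.8],
    "CD Account": [0, 1, 0, 0, 1, 0],
})


def select_likely_buyers(data, min_income=100, min_ccavg=2.5):
    """Return IDs of customers who pass every rule (hypothetical thresholds)."""
    mask = (data["Income"] >= min_income) & (data["CCAvg"] >= min_ccavg)
    return data.loc[mask, "ID"].tolist()


likely_ids = select_likely_buyers(toy)
print(likely_ids)
```

In practice the thresholds would come from the segment analysis rather than being fixed by hand.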

### Technologies

- Python
- Pandas
- Numpy
- Seaborn
- Matplotlib

## Importing Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
from rich import print
from rich.console import Console
from rich.table import Table
from rich.progress import track
from time import sleep
import os
import sys
from rich.columns import Columns
from rich.markdown import Markdown
from rich.syntax import Syntax
console = Console()

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# from bubbly.bubbly import bubbleplot
import plotly.offline as py
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import plotly.express as px

## Data Load

In [None]:
# Load the data with pandas.
df = pd.read_csv("/kaggle/input/bank-loan-classification/UniversalBank.csv")

In [None]:
# Count the missing values (NaNs) in each column.
df.isna().sum()

In [None]:
df.info()

There are no missing values.

### Variables definition


> - **ID** - Customer ID 
> - **Age** - Customer's age in completed years 
> - **Experience** - Years of professional experience 
> - **Income** - Annual income of the customer, in thousands of USD 
> - **ZIPCode** - Home address ZIP code 
> - **Family** - Family size of the customer 
> - **CCAvg** - Average monthly spending on credit cards, in thousands of USD 
> - **Education** - Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional 
> - **Mortgage** - Value of house mortgage, if any, in thousands of USD  
> - **Personal Loan**  - Did this customer accept the personal loan offered in the last campaign? 
> - **Securities Account** - Does the customer have a securities account with the bank? 
> - **CD Account** - Does the customer have a certificate of deposit (CD) account with the bank? 
> - **Online** - Does the customer use internet banking facilities? 
> - **CreditCard** - Does the customer use a credit card issued by UniversalBank?

In [None]:
df.nunique()

## Data Visualizations

In [None]:
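# Count customers with (1) and without (0) a personal loan and show the shares as a pie chart.
loan_counts = df['Personal Loan'].value_counts()
loan_counts.plot.pie(autopct='%1.1f%%', figsize=(15, 15))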

***The target column contains ones (loan accepted) and zeroes (loan not accepted).***
****

In [None]:
cor_df = df.corr(method='pearson')
plt.figure(figsize=(13,15))
sns.heatmap(cor_df, cmap='RdYlGn',annot=True,cbar_kws={"orientation": "horizontal"})

***In the "Personal Loan" (target) column of the heatmap, "Income", "CCAvg", and "CD Account" show the strongest correlations.
We use them in machine learning.***
****
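
Reading the strongest correlations off a heatmap can also be done numerically. The following is a minimal sketch on made-up toy data (only the column names match the dataset) that ranks features by the absolute value of their Pearson correlation with the target.

```python
import pandas as pd

# Toy data with the dataset's column names; values are invented for illustration.
toy = pd.DataFrame({
    "Income": [40, 150, 60, 170, 55, 130, 35, 160],
    "CCAvg": [1.0, 5.0, 1.5, 6.0, 2.5, 3.0, 0.8, 4.0],
    "Age": [30, 45, 52, 38, 41, 29, 60, 35],
    "Personal Loan": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Pearson correlation of every feature with the target, strongest first.
target_corr = (
    toy.corr(method="pearson")["Personal Loan"]
    .drop("Personal Loan")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative correlations near the top as well, which a color scale can make easy to overlook.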

In [None]:
fig, [ax0, ax1, ax2] = plt.subplots(1,3, figsize = (14,4))

ax0.hist(df.Mortgage)
ax0.set_xlabel('Mortgage distribution')
ax0.axvline(df.Mortgage.mean(), color = "black")

ax1.hist(df.Experience)
ax1.set_xlabel('Experience distribution')
ax1.axvline(0, color = "black");

ax2.hist(df.Income)
ax2.set_xlabel('Income distribution')
ax2.axvline(df.Income.mean(), color = "black");

In [None]:
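plt.figure(figsize = (20,15))
# sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve.
median_no_loan = df[df["Personal Loan"] == 0]["Income"].median()
median_loan = df[df["Personal Loan"] == 1]["Income"].median()
sns.histplot(df[df["Personal Loan"] == 0]['Income'], kde=True, stat="density", label="Personal Loan = 0")
sns.histplot(df[df["Personal Loan"] == 1]['Income'], kde=True, stat="density", color="orange", label="Personal Loan = 1")
plt.title(f'Median income when Personal Loan = 0: {median_no_loan}\nMedian income when Personal Loan = 1: {median_loan}')
plt.legend()
plt.show()
print(median_no_loan)
print(median_loan)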

In [None]:
plt.figure(figsize=(15,5))
sns.scatterplot(x = "ID", y = "Income", data=df, hue = "Personal Loan", palette="YlGnBu", alpha = 1);

***In this visualization, I plotted the income of customers who took the loan and of those who did not.
The median income of borrowers is 142.5,
while for those who did not take the loan it is around 59, less than half as much.
So it makes sense to target the loan offer at customers with high incomes.*** 🤠
*******
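
The same comparison can be made without a plot. This is a minimal sketch on invented toy data: it computes the median income per target class and the share of loan takers in each income band. The band boundaries (50 and 100) are hypothetical cut points, not values taken from the analysis.

```python
import pandas as pd

# Toy data with the dataset's column names; values are invented for illustration.
toy = pd.DataFrame({
    "Income": [40, 150, 60, 170, 55, 130, 35, 160, 110, 90],
    "Personal Loan": [0, 1, 0, 1, 0, 1, 0, 1, 0, 0],
})

# Median income for each value of the target.
median_income = toy.groupby("Personal Loan")["Income"].median()

# Share of customers who accepted the loan within each income band (hypothetical cut points).
bands = pd.cut(toy["Income"], bins=[0, 50, 100, 200], labels=["low", "mid", "high"])
acceptance_rate = toy.groupby(bands, observed=False)["Personal Loan"].mean()

print(median_income)
print(acceptance_rate)
```

A per-band acceptance rate is one simple way to express how strongly a segment is associated with the product.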

In [None]:
plt.rcParams['figure.figsize'] = (12, 9)
# Recent seaborn versions require keyword arguments for x and y.
sns.violinplot(x='Personal Loan', y='CCAvg', data=df, palette='colorblind')
plt.title('Relation of CCAvg with Target', fontsize = 20, fontweight = 30)
plt.show()


## Classification

In [None]:
X = df.drop("Personal Loan",axis=1)
y = df['Personal Loan']

In [None]:
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2,stratify = y,random_state = 0)

***I prepared the data for machine learning (separating the features from the target and standardizing them) and split it into training and test sets.***
***
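
The cells above scale the full dataset before splitting, which lets the test rows influence the scaler's mean and variance. One common alternative, sketched here on synthetic data, is to split first and let a pipeline fit the scaler on the training rows only. `make_classification`, the `_demo` names, and `max_iter=1000` are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the bank data (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Split first so the test rows stay unseen during preprocessing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# The pipeline fits the scaler on the training rows only, then reuses it on the test rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.3f}")
```

The effect on this dataset may well be small, but the pattern keeps the test score an honest estimate.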

### LogisticRegression

In [None]:
LR_model = LogisticRegression()
LR_model.fit(X_train ,y_train)

y_pred = LR_model.predict(X_test)
print(metrics.classification_report(y_test,y_pred))
print(f"Model accuracy: [bold red]{metrics.accuracy_score(y_test, y_pred)}[/bold red]!")
console.print(":smiley:",":smiley:")

conf_mat = metrics.confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, annot=True,fmt = "g")
plt.show()

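# Use predicted probabilities, not hard 0/1 labels, so the ROC curve covers all thresholds.
fpr, tpr , thresholds = metrics.roc_curve(y_test, LR_model.predict_proba(X_test)[:, 1])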
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr = fpr,tpr = tpr,roc_auc = roc_auc,estimator_name="ROC curve")
display.plot()
plt.show()

***LogisticRegression did not give very good results. It fits a single S-shaped (sigmoid) curve to a linear combination of the features, which may be too simple to separate the two target classes well.*** 🤠
***
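
To make the S-shaped curve concrete, here is a minimal sketch of the logistic (sigmoid) function that logistic regression applies to its linear score. The scores below are hypothetical numbers, not outputs of `LR_model`.

```python
import numpy as np


def sigmoid(z):
    """Logistic function: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical linear scores (w·x + b) for five customers.
scores = np.array([-4.0, -1.0, 0.5, 2.0, 4.0])
probabilities = sigmoid(scores)

# The usual decision rule predicts class 1 when the probability exceeds 0.5.
predictions = (probabilities > 0.5).astype(int)
print(probabilities.round(3), predictions)
```

Because the score is linear in the features, the resulting decision boundary is a straight line (or hyperplane) in feature space.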

### SVC

In [None]:
SVC_model = SVC()
SVC_model.fit(X_train ,y_train)

y_pred = SVC_model.predict(X_test)
print(metrics.classification_report(y_test,y_pred))
print(f"Model accuracy: [bold red]{metrics.accuracy_score(y_test, y_pred)}[/bold red]!")
console.print(":smiley:",":smiley:",":smiley:")

conf_mat = metrics.confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, annot=True,fmt = "g")
plt.show()

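# Use the SVC decision scores, not hard 0/1 labels, so the ROC curve covers all thresholds.
fpr, tpr , thresholds = metrics.roc_curve(y_test, SVC_model.decision_function(X_test))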
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr = fpr,tpr = tpr,roc_auc = roc_auc,estimator_name="ROC curve")
display.plot()
plt.show()

### DecisionTreeClassifier

In [None]:
tree_model = DecisionTreeClassifier()
tree_model.fit(X_train ,y_train)

y_pred = tree_model.predict(X_test)
print(metrics.classification_report(y_test,y_pred))
print(f"Model accuracy: [bold red]{metrics.accuracy_score(y_test, y_pred)}[/bold red]!")
console.print(":smiley:",":smiley:",":smiley:",":smiley:")

conf_mat = metrics.confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, annot=True,fmt = "g")
plt.show()

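# Use predicted probabilities, not hard 0/1 labels, so the ROC curve covers all thresholds.
fpr, tpr , thresholds = metrics.roc_curve(y_test, tree_model.predict_proba(X_test)[:, 1])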
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr = fpr,tpr = tpr,roc_auc = roc_auc,estimator_name="ROC curve")
display.plot()
plt.show()

In [None]:
from sklearn.tree import plot_tree

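# Convert to a plain list; some scikit-learn versions reject a pandas Index for feature_names.
cols = list(df.drop("Personal Loan",axis = 1).columns)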

plt.figure(figsize = (30,20))
plot_tree(tree_model,feature_names = cols, filled=True)
plt.show()

### RandomForestClassifier

In [None]:
RFC_model = RandomForestClassifier()
RFC_model.fit(X_train ,y_train)

y_pred = RFC_model.predict(X_test)
print(metrics.classification_report(y_test,y_pred))
print(f"Model accuracy: [bold red]{metrics.accuracy_score(y_test, y_pred)}[/bold red]!")
console.print(":smiley:",":smiley:",":smiley:",":smiley:")

conf_mat = metrics.confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, annot=True,fmt = "g")
plt.show()

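# Use predicted probabilities, not hard 0/1 labels, so the ROC curve covers all thresholds.
fpr, tpr , thresholds = metrics.roc_curve(y_test, RFC_model.predict_proba(X_test)[:, 1])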
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr = fpr,tpr = tpr,roc_auc = roc_auc,estimator_name="ROC curve")
display.plot()
plt.show()

### XGBClassifier

In [None]:
XGBC_model = XGBClassifier()
XGBC_model.fit(X_train ,y_train)

y_pred = XGBC_model.predict(X_test)
print(metrics.classification_report(y_test,y_pred))
print(f"Model accuracy: [bold red]{metrics.accuracy_score(y_test, y_pred)}[/bold red]!")
console.print(":smiley:",":smiley:",":smiley:")

conf_mat = metrics.confusion_matrix(y_test, y_pred)
sns.heatmap(conf_mat, annot=True,fmt = "g")
plt.show()

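# Use predicted probabilities, not hard 0/1 labels, so the ROC curve covers all thresholds.
fpr, tpr , thresholds = metrics.roc_curve(y_test, XGBC_model.predict_proba(X_test)[:, 1])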
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr = fpr,tpr = tpr,roc_auc = roc_auc,estimator_name="ROC curve")
display.plot()
plt.show()

## Classification summary:
***We prefer the RandomForestClassifier model for this project because it reaches 98.7% accuracy, which is better than the rest.*** 
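
Accuracy alone can flatter a model when only about 9% of customers took the loan: a model that always predicts "no loan" would already score roughly 91%. The sketch below, on synthetic data with a similar class balance, compares such a baseline with a random forest on both accuracy and recall. `make_classification`, `DummyClassifier`, and the `_demo` names are assumptions for illustration and do not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 9% positives, similar to the share of loan takers.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.91, 0.09], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# Baseline that always predicts the majority class ("no loan").
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

results = {}
for name, model in [("baseline", baseline), ("random forest", forest)]:
    y_hat = model.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, y_hat),
        "recall": recall_score(y_te, y_hat),
    }
    print(name, results[name])
```

Recall on the positive class shows how many actual loan takers a model finds, which is closer to the campaign's goal than overall accuracy.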
### 🤠