# Machine Learning: AllLife Bank Personal Loan Campaign

## Problem Statement

### Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

### Objective

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which customer segments to target.

### Data Dictionary
* `ID`: Customer ID
* `Age`: Customer’s age in completed years
* `Experience`: Years of professional experience
* `Income`: Annual income of the customer (in thousand dollars)
* `ZIPCode`: Home address ZIP code
* `Family`: Family size of the customer
* `CCAvg`: Average spending on credit cards per month (in thousand dollars)
* `Education`: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
* `Mortgage`: Value of house mortgage if any. (in thousand dollars)
* `Personal_Loan`: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
* `Securities_Account`: Does the customer have a securities account with the bank? (0: No, 1: Yes)
* `CD_Account`: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
* `Online`: Does the customer use internet banking facilities? (0: No, 1: Yes)
* `CreditCard`: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

## Importing necessary libraries

In [2]:
# Installing the libraries with the specified version.
%pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user

Note: you may need to restart the kernel to use updated packages.


ERROR: Exception:
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__qbz5n2kfra8p0\Lib\site-packages\pip\_internal\cli\base_command.py", line 180, in exc_logging_wrapper
    status = run_func(*args)
             ^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__qbz5n2kfra8p0\Lib\site-packages\pip\_internal\cli\req_command.py", line 245, in wrapper
    return func(self, options, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__qbz5n2kfra8p0\Lib\site-packages\pip\_internal\commands\install.py", line 377, in run
    requirement_set = resolver.resolve(
                      ^^^^^^^^^^^^^^^^^
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.1008.0_x64__qbz5n2kfra8p0\Lib\site-packages\pip\_internal\resolution\resolvelib\resolver.py", line 95, in res

In [5]:
%pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Collecting scikit-learn
  Downloading scikit_learn-1.5.0-cp312-cp312-win_amd64.whl.metadata (11 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Downloading scipy-1.13.1-cp312-cp312-win_amd64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Downloading joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Downloading threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.5.0-cp312-cp312-win_amd64.whl (10.9 MB)

**Note**: *After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.*

In [6]:
import pandas as pd
import numpy as np
from scipy.stats import zscore

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as plx

from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split


## Loading the dataset

In [7]:
df = pd.read_csv('./Loan_Modelling.csv')

## Data Overview

* Observations
* Sanity checks

In [9]:
df.head()

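|   | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |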


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   int64  
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   int64  
 4   ZIPCode             5000 non-null   int64  
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   int64  
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 547.0 KB


All of the columns were read in as either integers or floating-point numbers. However, several of them are really identifiers or categorical variables. Based on the data dictionary, I suggest the following changes:

0. `ID` is an integer, but the number is only a label, NOT a quantitative measure of the customer. It should therefore be treated as a string (`string` or `object` dtype).
1. `Age` (years) stays an integer, and could be downcast from `int64` to `int8` (-128 to 127) to save RAM. Age cannot be negative.
2. `Experience` (years of professional experience) stays an integer, and could likewise be downcast to `int8` (-128 to 127). Experience cannot be negative.
3. `Income` (in thousands of dollars) should become a `float64`, since it could in principle be recorded down to a fraction of a dollar, and `int64` saves no memory over `float64`.
4. `ZIPCode` should become a string. If ZIP codes are ever used numerically, we should use a geographic representation instead (e.g. the latitude/longitude of the center of the ZIP code area).
5. `Family` (number of people) stays an integer. Family cannot be negative.
6. `CCAvg` is thousands of dollars spent per month on credit cards, and should stay a float.
7. `Education` should be converted to a categorical variable with the mapping "`1: Undergrad; 2: Graduate; 3: Advanced/Professional`". The categories should be ordered, since we may want to compare them by rank.
8. `Mortgage` should also become a `float64`, since it is a dollar amount (in thousands). Mortgage can be zero, but cannot be negative. A NaN could mean zero, and a zero could stand for a missing value.
9. `Personal_Loan` is a binary indicator (1 for Yes, 0 for No). I will leave it as an `int64`, since 0/1 integer columns are standard input for machine learning algorithms.
10. `Securities_Account` is a binary indicator (1 for Yes, 0 for No). I will leave it as an `int64`, since 0/1 integer columns are standard input for machine learning algorithms.
11. `CD_Account` is a binary indicator (1 for Yes, 0 for No). I will leave it as an `int64`, since 0/1 integer columns are standard input for machine learning algorithms.
12. `Online` is a binary indicator (1 for Yes, 0 for No). I will leave it as an `int64`, since 0/1 integer columns are standard input for machine learning algorithms.
13. `CreditCard` is a binary indicator (1 for Yes, 0 for No). I will leave it as an `int64`, since 0/1 integer columns are standard input for machine learning algorithms.
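
The `int8` downcast and the ordered `Education` categorical suggested in the list could be sketched as follows. This is only an illustration: the toy frame and the name `education_labels` are assumptions, and only the column names and the label mapping come from the data dictionary.

```python
import pandas as pd

# Toy rows standing in for Loan_Modelling.csv (values are made up for illustration).
sample = pd.DataFrame({
    "Age": [25, 45, 39],
    "Experience": [1, 19, 15],
    "Education": [1, 3, 2],
})

# int8 covers -128..127, which is enough for ages and years of experience.
sample["Age"] = sample["Age"].astype("int8")
sample["Experience"] = sample["Experience"].astype("int8")

# Map the integer codes to labels, then make the column an ordered categorical
# so that comparisons respect Undergrad < Graduate < Advanced/Professional.
education_labels = {1: "Undergrad", 2: "Graduate", 3: "Advanced/Professional"}
sample["Education"] = pd.Categorical(
    sample["Education"].map(education_labels),
    categories=list(education_labels.values()),
    ordered=True,
)

print(sample.dtypes)
print(sample["Education"] > "Undergrad")
```

Many scikit-learn estimators expect numeric input, so the original integer codes may still be the more convenient form at modeling time; the categorical version is mainly useful for plots and grouped summaries.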

In [40]:
# Cast identifier-like columns to strings and dollar amounts to floats.
df["ID"] = df["ID"].astype("string")
df["Income"] = df["Income"].astype("float")
df["ZIPCode"] = df["ZIPCode"].astype("string")
df["Mortgage"] = df["Mortgage"].astype("float")
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   ID                  5000 non-null   string 
 1   Age                 5000 non-null   int64  
 2   Experience          5000 non-null   int64  
 3   Income              5000 non-null   float64
 4   ZIPCode             5000 non-null   string 
 5   Family              5000 non-null   int64  
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64  
 8   Mortgage            5000 non-null   float64
 9   Personal_Loan       5000 non-null   int64  
 10  Securities_Account  5000 non-null   int64  
 11  CD_Account          5000 non-null   int64  
 12  Online              5000 non-null   int64  
 13  CreditCard          5000 non-null   int64  
dtypes: float64(3), int64(9), string(2)
memory usage: 547.0 KB


## Exploratory Data Analysis

- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.

**Questions**:

1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
2. How many customers have credit cards?
3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
4. How does a customer's interest in purchasing a loan vary with their age?
5. How does a customer's interest in purchasing a loan vary with their education?

Now I will check for inappropriate values, e.g. a negative number of family members.

In [42]:
df.describe()

|   | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 | 5000.0 |
| mean | 45.3384 | 20.1046 | 73.7742 | 2.3964 | 1.937938 | 1.881 | 56.4988 | 0.096 | 0.1044 | 0.0604 | 0.5968 | 0.294 |
| std | 11.463166 | 11.467954 | 46.033729 | 1.147663 | 1.747659 | 0.839869 | 101.713802 | 0.294621 | 0.305809 | 0.23825 | 0.490589 | 0.455637 |
| min | 23.0 | -3.0 | 8.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 35.0 | 10.0 | 39.0 | 1.0 | 0.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 45.0 | 20.0 | 64.0 | 2.0 | 1.5 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 55.0 | 30.0 | 98.0 | 3.0 | 2.5 | 3.0 | 101.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 67.0 | 43.0 | 224.0 | 4.0 | 10.0 | 3.0 | 635.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


The minimum experience is -3, so there must be errors. Let's take a look at the rows where Experience is < 0.

It also seems odd that the lowest Education level is 1 (Undergrad), since not every customer of the bank has attended university. We should consider renaming that category "Undergrad or lower." We will ignore this for now.
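
The same kind of sanity check can be run on every numeric column at once instead of reading the minima off the summary table. A minimal sketch, assuming a hypothetical helper `count_negatives` and a made-up frame whose column names mirror the data set:

```python
import pandas as pd

def count_negatives(frame: pd.DataFrame) -> pd.Series:
    """Return the number of negative entries in each numeric column."""
    numeric = frame.select_dtypes(include="number")
    return (numeric < 0).sum()

# Made-up rows; only the column names mirror the data set.
toy = pd.DataFrame({
    "Experience": [-1, 5, -3, 20],
    "Family": [1, 4, 2, 3],
    "ZIPCode": ["91107", "90089", "94720", "94112"],
})

negatives = count_negatives(toy)
# Show only the columns that actually contain negative values.
print(negatives[negatives > 0])
```

String columns such as `ZIPCode` are skipped by `select_dtypes`, so the helper can be applied to the whole frame without filtering columns first.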

In [47]:
df[df.Experience < 0].describe()

|   | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 | 52.0 |
| mean | 24.519231 | -1.442308 | 69.942308 | 2.865385 | 2.129423 | 2.076923 | 43.596154 | 0.0 | 0.115385 | 0.0 | 0.576923 | 0.288462 |
| std | 1.475159 | 0.639039 | 37.955295 | 0.970725 | 1.750562 | 0.83657 | 90.027068 | 0.0 | 0.322603 | 0.0 | 0.498867 | 0.457467 |
| min | 23.0 | -3.0 | 12.0 | 1.0 | 0.2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 24.0 | -2.0 | 40.75 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 24.0 | -1.0 | 65.5 | 3.0 | 1.8 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 25.0 | -1.0 | 86.75 | 4.0 | 2.325 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 29.0 | -1.0 | 150.0 | 4.0 | 7.2 | 3.0 | 314.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 |


### Should we drop these values? 
Q: Do the values represent something valid?

A: No (Not that we know of). 


Q: Is there a good way to tell what the values should have been? 

A: No. We're not sure how the data was entered, or whether the collection method made it easy to type a negative number by accident. Perhaps some users entered the value intentionally and it means something to them. Perhaps the data was corrupted. Perhaps the experience value was too high for the form (e.g. 50 years) and wrapped around, although that is unlikely since these customers are all between 23 and 29 years old. Perhaps these users left experience blank, and the parser accidentally pulled in a hyphen and a number from a neighboring column that was dropped from this data set.


Q: Can we safely drop these without skewing the data?

A: 52/5000 is about 1% of the data, which is small enough that the rows can be dropped. Even if these customers had particularly high Mortgage, CCAvg, etc., they could not shift the means much. I can run a more thorough outlier test later to double-check.
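
The arithmetic behind this answer can be checked directly from the two summary tables. The sketch below uses only figures reported above (5000 rows, 52 negative-Experience rows, a mean `Personal_Loan` of 0.096 overall and 0.0 within the 52 rows); the variable names are illustrative.

```python
total_rows = 5000
negative_rows = 52
acceptance_rate = 0.096       # mean of Personal_Loan over all rows
negative_acceptances = 0      # mean of Personal_Loan is 0.0 in the negative subset

# Fraction of the data set that would be removed.
share_dropped = negative_rows / total_rows

# Number of customers who accepted the loan, recovered from the overall mean.
acceptances = round(acceptance_rate * total_rows)

# Acceptance rate once the negative-Experience rows are gone.
rate_after_drop = (acceptances - negative_acceptances) / (total_rows - negative_rows)

print(f"share dropped: {share_dropped:.2%}")
print(f"acceptance rate after drop: {rate_after_drop:.6f}")
```

Because none of the 52 customers accepted the loan, dropping them nudges the acceptance rate up slightly (from 9.6% to roughly 9.7%). The shift is small, but it is worth keeping in mind that the removed rows are all young non-acceptors rather than a random sample.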

### Yes, we should drop these values. 

In [50]:
# Keep only rows with non-negative experience; copy so later edits don't touch df.
df_clean = df[df.Experience >= 0].copy()
df_clean.describe()

|   | Age | Experience | Income | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4948.0 | 4948.0 | 4948.0 | 4948.0 | 4948.0 | 4948.0 | 4948.0 | 4948.0 | 4948.0 | 4948.0 | 4948.0 | 4948.0 |
| mean | 45.557195 | 20.331043 | 73.81447 | 2.391471 | 1.935926 | 1.878941 | 56.634398 | 0.097009 | 0.104285 | 0.061035 | 0.597009 | 0.294058 |
| std | 11.320735 | 11.311973 | 46.112596 | 1.148444 | 1.747694 | 0.839745 | 101.828885 | 0.296 | 0.30566 | 0.239418 | 0.490549 | 0.455664 |
| min | 24.0 | 0.0 | 8.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 36.0 | 10.75 | 39.0 | 1.0 | 0.7 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 46.0 | 20.0 | 64.0 | 2.0 | 1.5 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 75% | 55.0 | 30.0 | 98.0 | 3.0 | 2.6 | 3.0 | 101.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| max | 67.0 | 43.0 | 224.0 | 4.0 | 10.0 | 3.0 | 635.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Data Preprocessing

* Missing value treatment
* Feature engineering (if needed)
* Outlier detection and treatment (if needed)
* Preparing data for modeling
* Any other preprocessing steps (if needed)
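
For the outlier step, one common choice (an assumption here, not something this notebook has settled on) is Tukey's 1.5 × IQR rule. A minimal sketch with a hypothetical helper `iqr_fences`, applied to made-up mortgage values:

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for a numeric series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up mortgage amounts in thousands of dollars (not taken from the data set).
mortgage = pd.Series([0.0, 0.0, 0.0, 0.0, 80.0, 100.0, 120.0, 635.0])

lower, upper = iqr_fences(mortgage)
# Values outside the fences are flagged as candidate outliers.
outliers = mortgage[(mortgage < lower) | (mortgage > upper)]
print(lower, upper)
print(outliers.tolist())
```

With the quartiles from the summary table (Q1 = 0, Q3 = 101), the upper fence for `Mortgage` would be 101 + 1.5 × 101 = 252.5, so the maximum of 635 lies well beyond it. Whether such values should be treated or kept is a separate judgment, since at least half of the customers have no mortgage at all and the distribution is heavily zero-inflated.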

## Model Building

### Model Evaluation Criterion

*


### Model Building

### Model Performance Improvement

## Model Comparison and Final Model Selection

## Actionable Insights and Business Recommendations


* What recommendations would you suggest to the bank?

___