# Bank data analysis

The goal of this project is to build a robust classifier that predicts whether a specific client will leave the bank (unsubscribe from its services).
Perform feature engineering and try different models to get the best accuracy possible.




    

## Dataset Info

* CLIENTNUM
  - Client number. Unique identifier for the customer holding the account

* Attrition_Flag
  - Internal event (customer activity) variable - if the account is closed then 1 else 0

* Customer_Age
  - Demographic variable - Customer's Age in Years

* Gender
  - Demographic variable - M=Male, F=Female

* Dependent_count
  - Demographic variable - Number of dependents

* Education_Level
  - Demographic variable - Educational Qualification of the account holder (example: high school, college graduate, etc.)


* Marital_Status
  - Demographic variable - Married, Single, Divorced, Unknown

* Income_Category
  - Demographic variable - Annual Income Category of the account holder (Less than $40K, $40K - $60K, $60K - $80K, $80K - $120K, $120K +, Unknown)


* Card_Category
  - Product Variable - Type of Card (Blue, Silver, Gold, Platinum)

* Months_on_book
  - Period of relationship with the bank, in months


* Total_Relationship_Count
  - Total no. of products held by the customer

* Months_Inactive_12_mon
  - No. of months inactive in the last 12 months

* Contacts_Count_12_mon
  - No. of Contacts in the last 12 months

* Credit_Limit
  - Credit Limit on the Credit Card

* Total_Revolving_Bal
  - Total Revolving Balance on the Credit Card

* Avg_Open_To_Buy
  - Open to Buy Credit Line (Average of last 12 months)

* Total_Amt_Chng_Q4_Q1
  - Change in Transaction Amount (Q4 over Q1)

* Total_Trans_Amt
  - Total Transaction Amount (Last 12 months)

* Total_Trans_Ct
  - Total Transaction Count (Last 12 months)

* Total_Ct_Chng_Q4_Q1
  - Change in Transaction Count (Q4 over Q1)

* Avg_Utilization_Ratio
  - Average Card Utilization Ratio




## 0. Import data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

In [2]:
path = 'dataset_bank.csv'
dataset = pd.read_csv(path)
pd.set_option('display.max_columns', None)

## 1. Feature Analysis, Extraction & Selection
(You may need to revisit feature selection after building the default models and comparing against them.)

In [6]:
#dataset['Months_on_book'].unique().tolist()   # the values -2 and 3.21e+11 show up

In [7]:
dataset = dataset[dataset['Months_on_book'] >0]

In [8]:
dataset = dataset[dataset['Months_on_book'] <500]
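
The two filters above keep only rows whose `Months_on_book` lies strictly between 0 and 500. As a minimal sketch of the same idea, the snippet below applies both bounds in one boolean mask and reports how many rows are dropped. The toy values are made up for illustration and are not taken from `dataset_bank.csv`.

```python
import pandas as pd

# Toy data (hypothetical values, used only to illustrate the filter)
toy = pd.DataFrame({"Months_on_book": [39.0, -2.0, 44.0, 3.21e11, 36.0]})

# Keep rows with 0 < Months_on_book < 500, mirroring the two filter cells
months = toy["Months_on_book"]
mask = (months > 0) & (months < 500)
toy_clean = toy[mask]

print(f"dropped {len(toy) - len(toy_clean)} of {len(toy)} rows")
print(toy_clean["Months_on_book"].tolist())
```

Rows with a missing `Months_on_book` fail both comparisons, so a filter like this drops them as well.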

In [9]:
dataset = dataset.drop(['CLIENTNUM'], axis = 1)

In [10]:
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
# Categorical encoding. LabelEncoder assigns integer codes in sorted label order,
# and every fit_transform call refits the encoder, so afterwards
# labelencoder.classes_ only holds the mapping of the last encoded column.
dataset['Attrition_Flag'] = labelencoder.fit_transform(dataset['Attrition_Flag'])
dataset['Gender'] = labelencoder.fit_transform(dataset['Gender'])
dataset['Education_Level'] = labelencoder.fit_transform(dataset['Education_Level'])
dataset['Card_Category'] = labelencoder.fit_transform(dataset['Card_Category'])
dataset['Marital_Status'] = labelencoder.fit_transform(dataset['Marital_Status'])
#dataset.head(10)

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Income_Category,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio
0,1,45,1,3,3,1,$60K - $80K,0,39.0,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061
1,1,49,0,5,2,2,Less than $40K,0,44.0,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105
2,1,51,1,3,2,1,$80K - $120K,0,36.0,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.0
3,1,40,0,4,3,3,Less than $40K,0,34.0,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.76
4,1,40,1,3,5,1,$60K - $80K,0,21.0,5,1,0,4716.0,0,4716.0,2.175,816,28,2.5,0.0
5,1,44,1,2,2,1,$40K - $60K,0,36.0,3,1,2,4010.0,1247,2763.0,1.376,1088,24,0.846,0.311
6,1,51,1,4,6,1,$120K +,1,46.0,6,1,3,34516.0,2264,32252.0,1.975,1330,31,0.722,0.066
7,1,32,1,0,3,3,$60K - $80K,3,27.0,2,2,2,29081.0,1396,27685.0,2.204,1538,36,0.714,0.048
8,1,37,1,3,5,2,$60K - $80K,0,36.0,5,2,0,22352.0,2517,19835.0,3.355,1350,24,1.182,0.113
9,1,48,1,2,2,2,$80K - $120K,0,36.0,6,3,3,11656.0,1677,9979.0,1.524,1441,32,0.882,0.144
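
Because `LabelEncoder` sorts the labels before numbering them, which class ends up as 1 depends on the label strings, not on their meaning. The sketch below shows one way to inspect the mapping and, if needed, fix it explicitly with `Series.map`. The label strings `'Existing Customer'` and `'Attrited Customer'` are an assumption about the raw file and should be checked with `dataset['Attrition_Flag'].unique()` before encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed raw labels (hypothetical; verify them against the real CSV)
flags = pd.Series(["Existing Customer", "Attrited Customer", "Existing Customer"])

encoder = LabelEncoder()
codes = encoder.fit_transform(flags)

# classes_ is sorted, and the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# An explicit mapping guarantees that a closed account is encoded as 1
explicit = flags.map({"Existing Customer": 0, "Attrited Customer": 1})
print(explicit.tolist())
```

If the raw labels are as assumed, the sorted order gives the closed accounts the code 0, which is the opposite of the description in the Dataset Info section, so it is worth checking before interpreting predictions.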


In [11]:
# One-hot encoding
dataset_dummies = pd.get_dummies(dataset[['Income_Category']])
dataset = pd.concat([dataset, dataset_dummies], axis='columns')
dataset = dataset.drop(['Income_Category'], axis = 1)

dataset.head(10)

Unnamed: 0,Attrition_Flag,Customer_Age,Gender,Dependent_count,Education_Level,Marital_Status,Card_Category,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,Total_Trans_Amt,Total_Trans_Ct,Total_Ct_Chng_Q4_Q1,Avg_Utilization_Ratio,Income_Category_$120K +,Income_Category_$40K - $60K,Income_Category_$60K - $80K,Income_Category_$80K - $120K,Income_Category_Less than $40K,Income_Category_Unknown
0,1,45,1,3,3,1,0,39.0,5,1,3,12691.0,777,11914.0,1.335,1144,42,1.625,0.061,0,0,1,0,0,0
1,1,49,0,5,2,2,0,44.0,6,1,2,8256.0,864,7392.0,1.541,1291,33,3.714,0.105,0,0,0,0,1,0
2,1,51,1,3,2,1,0,36.0,4,1,0,3418.0,0,3418.0,2.594,1887,20,2.333,0.0,0,0,0,1,0,0
3,1,40,0,4,3,3,0,34.0,3,4,1,3313.0,2517,796.0,1.405,1171,20,2.333,0.76,0,0,0,0,1,0
4,1,40,1,3,5,1,0,21.0,5,1,0,4716.0,0,4716.0,2.175,816,28,2.5,0.0,0,0,1,0,0,0
5,1,44,1,2,2,1,0,36.0,3,1,2,4010.0,1247,2763.0,1.376,1088,24,0.846,0.311,0,1,0,0,0,0
6,1,51,1,4,6,1,1,46.0,6,1,3,34516.0,2264,32252.0,1.975,1330,31,0.722,0.066,1,0,0,0,0,0
7,1,32,1,0,3,3,3,27.0,2,2,2,29081.0,1396,27685.0,2.204,1538,36,0.714,0.048,0,0,1,0,0,0
8,1,37,1,3,5,2,0,36.0,5,2,0,22352.0,2517,19835.0,3.355,1350,24,1.182,0.113,0,0,1,0,0,0
9,1,48,1,2,2,2,0,36.0,6,3,3,11656.0,1677,9979.0,1.524,1441,32,0.882,0.144,0,0,0,1,0,0


In [12]:
# Scaling of numerical data
from sklearn.preprocessing import RobustScaler
columns = ['Total_Amt_Chng_Q4_Q1','Total_Ct_Chng_Q4_Q1']
transformer = RobustScaler()
dataset[columns] = transformer.fit_transform(dataset[columns])

#dataset.head(10)

In [13]:
X = dataset.drop("Attrition_Flag", axis=1)
y = dataset['Attrition_Flag']

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
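
The scaling cell fits `RobustScaler` on the full dataset before the split, so statistics of the test rows leak into the training features. A possible alternative, sketched here on synthetic data, is to split first (optionally with `stratify` to keep the class ratio in both parts) and fit the scaler on the training rows only. The frame `demo` and all values in it are made up for illustration. Only the column names come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in for the real features (values are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Total_Amt_Chng_Q4_Q1": rng.normal(0.8, 0.2, 200),
    "Total_Ct_Chng_Q4_Q1": rng.normal(0.7, 0.2, 200),
    "Attrition_Flag": np.r_[np.ones(160, dtype=int), np.zeros(40, dtype=int)],
})

X_demo = demo.drop("Attrition_Flag", axis=1)
y_demo = demo["Attrition_Flag"]

# stratify keeps the class ratio the same in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo
)

cols = ["Total_Amt_Chng_Q4_Q1", "Total_Ct_Chng_Q4_Q1"]
scaler = RobustScaler()
X_tr = X_tr.copy()
X_te = X_te.copy()
X_tr[cols] = scaler.fit_transform(X_tr[cols])  # statistics from training rows only
X_te[cols] = scaler.transform(X_te[cols])      # reuse those statistics on test rows

print(y_tr.mean(), y_te.mean())
```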

## 2. Prediction with the best model


In [16]:
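# Load the previously saved model; only unpickle files from a trusted source
with open('modelxg_re.bin', 'rb') as model_file:
    modelxg_re = pickle.load(model_file)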


In [17]:
print(type(modelxg_re))

<class 'xgboost.sklearn.XGBClassifier'>


In [18]:
# Refit the loaded model on the training split, then predict on the test split
modelxg_re.fit(X_train, y_train)
y_predict_XGB_re = modelxg_re.predict(X_test)

Parameters: { "bootstrap", "max_features", "min_samples_leaf", "min_samples_split" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




In [19]:
y_predict_XGB_re

array([1, 0, 0, ..., 1, 0, 1])

In [20]:
type(y_predict_XGB_re)

numpy.ndarray

In [21]:
predictions = pd.DataFrame(y_predict_XGB_re, columns = ['PredictedValue'])

In [22]:
predictions

Unnamed: 0,PredictedValue
0,1
1,0
2,0
3,1
4,1
...,...
2010,1
2011,1
2012,1
2013,0


In [23]:
y_test_frame = y_test.to_frame()

In [24]:
y_test_frame.reset_index()

Unnamed: 0,index,Attrition_Flag
0,1090,1
1,9112,0
2,8123,0
3,7712,1
4,2091,1
...,...,...
2010,635,1
2011,9116,1
2012,1643,1
2013,3922,0


In [25]:
FinalComparison = pd.concat([y_test_frame.reset_index(drop=True),
                             predictions.reset_index(drop=True)], axis='columns', ignore_index=False)

In [26]:
FinalComparison

Unnamed: 0,Attrition_Flag,PredictedValue
0,1,1
1,0,0
2,0,0
3,1,1
4,1,1
...,...,...
2010,1,1
2011,1,1
2012,1,1
2013,0,0
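
The side-by-side table is easier to judge once it is reduced to a few numbers. The sketch below computes accuracy, a confusion matrix, and a per-class report with `sklearn.metrics`. The small `comparison` frame is a hypothetical stand-in with made-up values. On the real data, `FinalComparison` with the same two columns would be used instead.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical stand-in for FinalComparison (values are made up)
comparison = pd.DataFrame({
    "Attrition_Flag": [1, 0, 0, 1, 1, 1, 0, 1],
    "PredictedValue": [1, 0, 1, 1, 1, 0, 0, 1],
})

y_true = comparison["Attrition_Flag"]
y_pred = comparison["PredictedValue"]

accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, both ordered [0, 1]
matrix = confusion_matrix(y_true, y_pred, labels=[0, 1])

print(f"accuracy: {accuracy:.3f}")
print(matrix)
print(classification_report(y_true, y_pred, labels=[0, 1]))
```

If the classes are imbalanced, accuracy alone can look good even when the minority class is predicted poorly, so the per-class precision and recall in the report are worth reading as well.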


## 3. Discussion, Conclusions, Future Improvements
- Which features are the most important?
- How will you explain the model to the management of the bank?
- How much benefit/improvement should the bank expect?
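
One way to approach the first question is to rank the features by the importances of the fitted model. The scikit-learn wrapper of XGBoost exposes a `feature_importances_` attribute, as scikit-learn's own tree ensembles do. The sketch below uses a hypothetical helper, `rank_features`, and demonstrates it on a scikit-learn `GradientBoostingClassifier` trained on synthetic data. The stand-in model is used only to keep the example self-contained, not because the notebook uses it.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


def rank_features(model, feature_names):
    """Return the importances of a fitted tree model, largest first."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False)


# Synthetic data: only "signal" carries information about the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y_demo = (X_demo["signal"] > 0).astype(int)

# Stand-in model (assumption: any fitted model with feature_importances_ works)
stand_in = GradientBoostingClassifier(random_state=0).fit(X_demo, y_demo)
ranking = rank_features(stand_in, X_demo.columns)
print(ranking)
```

With the notebook's objects, the equivalent call would be `rank_features(modelxg_re, X_train.columns)` after fitting. A bar chart of the top entries is usually enough for a management audience.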