# Data-driven marketing of a new banking product


### Business Case 
A bank has launched a special product offer and wants to optimise its marketing strategy based on its customer information.

Based on customer data and past marketing outcomes, the marketing unit needs to know whether an existing client is a good candidate for the special product offer.

*The bank identified John, a single divorcee who works in management. He is not in credit default and has no home loan or personal loans. Based on this information, we need to recommend whether to include John in the product marketing.*


### Translating business problem to analytics solution
Build a recommendation engine that predicts whether marketing the new product to a customer will succeed.

Recommendations are based on similarities between customers in the bank's database. For this example, the recommender relies on whether clients who are similar to John liked or did not like offers A, B, or C in the past. This is referred to as user-based collaborative filtering.
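
The similarity idea can be illustrated with a small sketch. The attribute vectors, the past outcomes, and the choice of cosine similarity with `k = 2` neighbours are all assumptions made for illustration. They are not taken from the bank dataset, and the notebook itself goes on to use logistic regression rather than a neighbour-based recommender.

```python
import numpy as np

# Hypothetical binary attribute vectors
# (columns: job_management, married, single, housing_loan)
customers = np.array([
    [1, 0, 1, 0],  # customer 0
    [1, 0, 1, 1],  # customer 1
    [0, 1, 0, 1],  # customer 2
    [0, 1, 0, 0],  # customer 3
])
subscribed = np.array([1, 1, 0, 0])  # hypothetical past marketing outcomes
john = np.array([1, 0, 1, 0])        # hypothetical profile for the new customer


def cosine_similarity(matrix, vector):
    """Cosine similarity between each row of `matrix` and `vector`."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector)
    return matrix @ vector / norms


sims = cosine_similarity(customers, john)

# Take the k most similar customers and average their past outcomes
k = 2
nearest = np.argsort(sims)[::-1][:k]
score = subscribed[nearest].mean()

print("most similar customers:", nearest)
print("share of similar customers who subscribed:", score)
```

A score close to 1 would suggest that customers resembling John tended to accept past offers.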


### Data
You have access to data that represents both transaction history and user attributes.
The transaction history provides simple attributes on customers who bought or did not buy a product in past marketing efforts.

The user attribute dataset provides a rich description or annotation of customers who bought or did not buy a product in the past.

#### Dataset description
The sample dataset for this case study was downloaded from the UCI Machine Learning Repository.

Dataset size: about 45,200 customers and 19 transaction and user attributes, e.g. age, gender, employment, marital status, loans, credits, defaults, investments, deposits and balance, plus past marketing results (binary, 0 or 1).

Class labels: about 39,900 negative records and 5,300 positive records.

New Customer John - 


### Problem formulation (Machine Learning / Deep Learning)
Based on the business problem and the information at hand, I identify this as a classification problem: decide whether or not to recommend a product to the customer.

Depending on the data size, this can be formulated as a machine learning problem (thousands of records) or a deep learning problem (millions of records).


### Modeling algorithms - Classification problem
Based on this data, I'd use logistic regression as a classifier to recommend whether to include a customer in the marketing initiative. Set up the baseline model and then iterate to optimise. Since we are using user information only, it's a user-based collaborative filtering method.


If the model predicts that a customer will subscribe to the product following marketing efforts, they should be included. If the model predicts that they won't subscribe, they should not be contacted.


### Evaluation criteria (what's important to you)

The dataset is highly imbalanced, with a positive-to-negative ratio of roughly 1:7.5, so it's very important to choose the evaluation criteria carefully.
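
To see why the choice of metric matters, it helps to work out what a model that always predicts the majority class would score. The sketch below uses the class counts that appear in the classification report later in this notebook (39,921 negative and 5,289 positive records); the variable names are illustrative.

```python
# Class counts taken from the classification report later in this notebook
n_negative = 39921
n_positive = 5289
n_total = n_negative + n_positive

# Number of negative records per positive record
imbalance_ratio = n_negative / n_positive

# Accuracy of a trivial model that predicts 0 for every customer
majority_baseline_accuracy = n_negative / n_total

print(f"negatives per positive: {imbalance_ratio:.2f}")
print(f"majority-class accuracy: {majority_baseline_accuracy:.3f}")
```

Such a model reaches about 88% accuracy without identifying a single buyer, which is why accuracy on its own says little here.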

Figure out which key performance measures you want to focus on. In this case, we want to identify potential candidates who are likely to buy the product, which translates to precision (relevance) and accuracy.
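
As a reminder of how these measures are computed, here is a minimal sketch on a made-up set of ten labels. The labels and the helper `binary_metrics` are hypothetical and serve only to show the arithmetic.

```python
def binary_metrics(y_true, y_pred):
    """Return (precision, recall, accuracy) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # buyers we contacted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-buyers we contacted
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # buyers we missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-buyers we skipped
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy


# Made-up labels: 3 buyers among 10 customers, 2 customers flagged by the model
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]

precision_demo, recall_demo, accuracy_demo = binary_metrics(y_true_demo, y_pred_demo)
print(precision_demo, recall_demo, accuracy_demo)
```

Precision answers "of the customers we contact, how many buy?", while recall answers "of the customers who would buy, how many do we contact?".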


In [2]:
# import libraries
import os
import re
import string
import time
import warnings

import joblib  # replaces sklearn.externals.joblib, which newer scikit-learn releases no longer provide
import matplotlib.pyplot as plt  # plotting functions
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

from sklearn import svm
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, classification_report, make_scorer, roc_auc_score
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

%matplotlib inline
pd.set_option('display.max_colwidth', 50)  # display setting




In [3]:
# load data

bank_full_df = pd.read_csv('bank_data_with_dummy_variables_UCIML.csv')
bank_full_df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,...,job_unknown,job_retired,job_services,job_self_employed,job_unemployed,job_maid,job_student,married,single,divorced
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,...,0,0,0,0,0,0,0,1,0,0
1,44,technician,single,secondary,no,29,yes,no,unknown,5,...,0,0,0,0,0,0,0,0,1,1
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,...,0,0,0,0,0,0,0,1,0,0
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,...,0,0,0,0,0,0,0,1,0,0
4,33,unknown,single,unknown,no,1,no,no,unknown,5,...,1,0,0,0,0,0,0,0,1,1


In [5]:
bank_full_df.iloc[:,18:36].columns.tolist()

['housing_loan                ',
 'credit_in_default',
 'personal_loans',
 'prev_failed_to_subscribe    ',
 'prev_subscribed             ',
 'job_management              ',
 'job_tech                    ',
 'job_entrepreneur            ',
 'job_bluecollar              ',
 'job_unknown                 ',
 'job_retired                 ',
 'job_services                ',
 'job_self_employed           ',
 'job_unemployed              ',
 'job_maid                    ',
 'job_student                 ',
 'married                     ',
 'single                      ']

In [8]:
bank_full_df.columns[17]

'y_binary                    '

In [9]:
bank_full_df.iloc[1:6,17:36]

Unnamed: 0,y_binary,housing_loan,credit_in_default,personal_loans,prev_failed_to_subscribe,prev_subscribed,job_management,job_tech,job_entrepreneur,job_bluecollar,job_unknown,job_retired,job_services,job_self_employed,job_unemployed,job_maid,job_student,married,single
1,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
2,0,1,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0
3,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
4,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1
5,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0


#### Prepare training dataset

In [10]:
# prepare training dataset
# features: columns 18-35, the 18 indicator columns listed above
X = bank_full_df.iloc[:,18:36].values
# target: column 17 (y_binary)
y = bank_full_df.iloc[:,17].values
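
The column names printed above carry trailing whitespace, and positional slices such as `iloc[:, 18:36]` break silently if the column order changes. A possible alternative, sketched here on a tiny made-up frame, is to strip the names and select columns by name. The frame `demo_df` and its values are hypothetical; only the column names come from the output above.

```python
import pandas as pd

# Tiny stand-in for bank_full_df; values are made up, names mimic the padded headers
demo_df = pd.DataFrame({
    'y_binary                    ': [0, 1, 0],
    'housing_loan                ': [1, 0, 1],
    'job_management              ': [1, 0, 0],
    'single                      ': [0, 1, 0],
})

# Remove the padding so columns can be addressed by their plain names
demo_df.columns = demo_df.columns.str.strip()

# Select features and target by name instead of by position
feature_cols = ['housing_loan', 'job_management', 'single']
X_demo = demo_df[feature_cols].values
y_demo = demo_df['y_binary'].values

print(X_demo.shape, y_demo.shape)
```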

#### Build baseline classification model

In [11]:
# build a baseline logistic regression model, default parameters
model_lr = LogisticRegression()
model_lr.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [12]:
# predict on the training data; y_train holds the predicted labels, so the scores below measure training performance
y_train = model_lr.predict(X)
y_train

array([0, 0, 0, ..., 0, 1, 0])

#### Check performance

In [13]:
print(classification_report(y, y_train))

             precision    recall  f1-score   support

          0       0.90      0.99      0.94     39921
          1       0.67      0.17      0.27      5289

avg / total       0.87      0.89      0.86     45210



In [14]:
precision, recall, fscore, support = score(y, y_train, pos_label=1, average='binary')

print('Precision: {} / Recall: {} / F1-Score: {} / Accuracy: {} '.format(round(precision, 3),
                                                                          round(recall, 3),
                                                                          round(fscore, 3),
                                                                          round(accuracy_score(y, y_train), 3)))


Precision: 0.671 / Recall: 0.167 / F1-Score: 0.268 / Accuracy: 0.893 


In [18]:
# Performance with the baseline model indicates that results are biased towards the majority class, which is '0' in this case: recall for class 1 is only 0.167.
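
One way to probe and counter this bias is to evaluate on a held-out split and to reweight the classes. The sketch below does both on synthetic data, since the bank dataset is not bundled here. The single Gaussian feature, the 88/12 class mix, the 25% test split and `class_weight='balanced'` are all assumptions chosen for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced data: 4400 negatives centred at 0, 600 positives centred at 1
X_neg = rng.normal(loc=0.0, scale=1.0, size=(4400, 1))
X_pos = rng.normal(loc=1.0, scale=1.0, size=(600, 1))
X_syn = np.vstack([X_neg, X_pos])
y_syn = np.concatenate([np.zeros(4400, dtype=int), np.ones(600, dtype=int)])

# Stratify so that both splits keep the same class mix
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=0
)

# Fit once with default weights and once with balanced class weights,
# then score both on the held-out split
results = {}
for name, weight in [('default', None), ('balanced', 'balanced')]:
    clf = LogisticRegression(class_weight=weight)
    clf.fit(X_tr, y_tr)
    y_hat = clf.predict(X_te)
    results[name] = {
        'precision': precision_score(y_te, y_hat, zero_division=0),
        'recall': recall_score(y_te, y_hat, zero_division=0),
    }
    print(name, results[name])
```

Reweighting typically trades precision for recall, so the right setting depends on how costly a wasted contact is relative to a missed buyer.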

### Predict on new user

In [15]:
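# target customer profile; positions follow the 18 feature columns listed above
# note: as written, the 1s sit at index 4 (prev_subscribed), 16 (married) and 17 (single);
# job_management is index 5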
john_profile = np.array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
john_profile = john_profile.reshape(1, -1)
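
Hand-written positional vectors are easy to get wrong, so one option is to build the profile from feature names. The sketch below assumes the 18 feature names listed earlier (with whitespace stripped), in the same order as the training matrix; the helper `make_profile` is hypothetical. The flags follow the business case: John works in management and is single, with no default and no loans. The `divorced` column lies outside the 18 selected features, so it is not represented.

```python
import numpy as np

# Feature order as printed by bank_full_df.iloc[:, 18:36].columns, whitespace stripped
feature_names = [
    'housing_loan', 'credit_in_default', 'personal_loans',
    'prev_failed_to_subscribe', 'prev_subscribed',
    'job_management', 'job_tech', 'job_entrepreneur', 'job_bluecollar',
    'job_unknown', 'job_retired', 'job_services', 'job_self_employed',
    'job_unemployed', 'job_maid', 'job_student', 'married', 'single',
]


def make_profile(active_features, names=feature_names):
    """Return a (1, n_features) 0/1 array with the named features set to 1."""
    unknown = set(active_features) - set(names)
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    row = np.array([1 if name in active_features else 0 for name in names])
    return row.reshape(1, -1)


# Hypothetical encoding of the business-case description of John
john_by_name = make_profile({'job_management', 'single'})
print(dict(zip(feature_names, john_by_name[0].tolist())))
```

This vector differs from `john_profile` above, so the prediction and probabilities shown for `john_profile` should not be assumed to carry over to it.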

In [16]:
# recommendation for John's profile: 1 means recommend the product, 0 means don't recommend
y_pred = model_lr.predict(john_profile)
y_pred

array([1])

In [17]:
# probabilities of recommendation for John's profile
model_lr.predict_proba(john_profile)

array([[0.29046994, 0.70953006]])
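
For a binary problem, `predict` is equivalent to a 0.5 cut-off on the positive-class probability. If precision matters more than reach, a stricter cut-off can be applied to the `predict_proba` output instead. The sketch below reuses the probabilities printed above; the 0.8 threshold and the helper `recommend` are hypothetical choices for illustration.

```python
import numpy as np

# Probabilities printed above for John's profile: [P(class 0), P(class 1)]
john_proba = np.array([[0.29046994, 0.70953006]])


def recommend(proba, threshold=0.5):
    """Return 1 where the positive-class probability reaches the threshold, else 0."""
    return (proba[:, 1] >= threshold).astype(int)


print(recommend(john_proba))                 # default cut-off
print(recommend(john_proba, threshold=0.8))  # stricter, hypothetical cut-off
```

With the default cut-off the profile is recommended; with the stricter 0.8 cut-off it is not, which shows how the threshold controls who gets contacted.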