In [1]:
# Add needed libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random

# Preprocessing
from sklearn.preprocessing import OneHotEncoder

# VIF for multi-collinearity detection
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Models and modeling tools
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
import statsmodels.api as sm

# Cross-validation and evaluation metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import balanced_accuracy_score

# Resampling for class imbalance (imblearn's Pipeline applies samplers only during fit)
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Change inline plots default size
plt.rcParams['figure.figsize'] = [14, 10]

## Data description: ##
The data comes from the direct marketing efforts of a European banking institution. The marketing campaign involves phoning a customer, often several times, to secure a subscription to a product, in this case a term deposit. Term deposits are usually short-term deposits with maturities ranging from one month to a few years. When buying a term deposit, the customer must understand that funds can be withdrawn only after the term ends. All information that might identify a customer has been removed for privacy reasons.

- y = has the client subscribed to a term deposit? (binary)
- age = age of customer (numeric)
- job = type of job (categorical)
- marital = marital status (categorical)
- education = education level (categorical)
- balance = average yearly balance, in euros (numeric)
- housing = has a housing loan? (binary)
- loan = has a personal loan? (binary)
- contact = contact communication type (categorical)
- day = last contact day of the month (numeric)
- month = last contact month of year (categorical)
- duration = last contact duration, in seconds (numeric)
- campaign = number of contacts performed during this campaign for this client (numeric, includes last contact)
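
The binary and categorical fields listed above need numeric encoding before most models can use them. The following is a minimal sketch of one way to do that with pandas. The toy rows and the category values (`"yes"`/`"no"`, `"married"`, `"single"`) are assumptions made for illustration, not values confirmed by the dataset.

```python
import pandas as pd

# Hypothetical toy rows using column names from the data description;
# the category values are assumed, not taken from the real data.
df = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "married"],
    "housing": ["yes", "yes", "no"],
    "y": ["no", "no", "yes"],
})

binary_cols = ["housing", "y"]
categorical_cols = ["marital"]

encoded = df.copy()

# Map yes/no columns to 1/0
for col in binary_cols:
    encoded[col] = encoded[col].map({"yes": 1, "no": 0})

# One-hot encode multi-level categorical columns
encoded = pd.get_dummies(encoded, columns=categorical_cols, dtype=int)

print(encoded)
```

The same idea extends to `job`, `education`, `contact` and `month` by adding them to `categorical_cols`.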

## Analysis Summary: ##

Goal(s):

Predict whether the customer will subscribe (yes/no) to a term deposit (variable y).

Bonus(es):

We are also interested in finding customers who are more likely to buy the investment product. Determine the segment(s) of customers our client should prioritize.

What makes the customers buy? Tell us which feature we should be focusing more on.


<font size="3" >
    
Demographic information, financial status and call center interaction data were collected for customers during a direct marketing campaign selling term deposits at a European bank. The data was used to develop a model that predicts whether a customer will subscribe to the term deposit. The subscription rate is low (only 7.2% in the dataset), so it is more important for the model to capture as many actual subscribers as possible, even at the cost of flagging additional false leads (in data science terms, maximizing recall).  
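
To illustrate why accuracy alone is misleading at a 7.2% subscription rate, here is a small worked example in plain Python. The cohort of 1,000 customers and the predictions of the "hypothetical model" are invented for illustration; only the 7.2% rate comes from the dataset.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 = subscribed."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Assumed cohort: 1,000 customers, 72 of whom subscribe (7.2%)
y_true = [1] * 72 + [0] * 928

# Baseline that predicts "no" for everyone
always_no = [0] * 1000

# Hypothetical model: finds 50 of the 72 subscribers, plus 100 false leads
model_pred = [1] * 50 + [0] * 22 + [1] * 100 + [0] * 828

print(accuracy(y_true, always_no), recall(y_true, always_no))    # 0.928 0.0
print(accuracy(y_true, model_pred), recall(y_true, model_pred))  # 0.878, about 0.694
```

The baseline scores higher on accuracy while finding no subscribers at all, which is why recall is the more useful target here.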
    
    
    
Survey data was collected from a customer cohort and responses were used to develop a model to determine what aspects of the ordering and delivery process were most likely to lead to customer happiness. The model suggests the most important characteristics leading to customer happiness were `Find Everything Customer Wanted` (30.3%) and `On Time Delivery` (27.3%).

The clearest story in the sample data was for `On Time Delivery`, where 65% of Happy respondents gave a 5/5 rating compared to only 35% of Unhappy respondents.  For `Find Everything Customer Wanted`, 48% of Happy respondents rated it 4/5 or above, compared to 30% of Unhappy respondents.

This suggests that to improve customer satisfaction, business development and investment should be focused on improving On Time Delivery and expanding inventory/partnerships to increase the likelihood a customer is able to find the products they are looking for.

The model, which uses survey responses on `On Time Delivery`, `Find Everything Customer Wanted`, `Good Prices` and `Easy to Use App`, predicts customer happiness with about 70% accuracy. The small survey sample may be limiting both the achievable accuracy and the types of analysis available, so future surveys should, costs permitting, prioritize larger samples.

The analysis excluded two survey responses after initial data exploration: `Customer Order was as Expected` and `Delivery Satisfaction`. `Customer Order was as Expected` showed the weakest relationship with customer happiness, and dropping `Delivery Satisfaction` along with it gave the largest improvement in model accuracy.

A potentially concerning insight from the sample data was that `Customer Order was as Expected` had the lowest average rating, at only 2.53/5. Although this response was only weakly related to customer happiness in the sample, one hypothesis is that because customers typically do not receive what they expect, the rating carries little information about their happiness. Intuition suggests that receiving what you expect should matter. Future surveys may want to investigate this through additional questions, to better understand why customers give low ratings for `Customer Order was as Expected`.
