# Data 602 - Advanced Programming Techniques - Final Project
## Kishore Prasad

## Overview

Auto insurance is a competitive area. Insurance companies vie with each other to woo customers, and because policies are renewed yearly, they also have to work to retain them. At renewal time there are several considerations: Should this customer be renewed? Should we propose an alternative insurance policy? Should we cross-sell or upsell? Various marketing campaigns are targeted at customers, each with a specific agenda and goal. We should not execute all campaigns for all customers, nor will one general campaign work for everyone. Doing so is not feasible for the following reasons:

 - Each customer's needs are different. A campaign is more effective if it is addressed to the right audience.
 - It is costly to execute all campaigns for all customers, which wastes both effort and money. It also overwhelms the customer who is trying to make the right choice.
 - Too many campaign mails, SMSs, etc. might annoy a customer and might cause the customer to churn.

That is why a well-planned strategy is needed to target the right audience.

In this project, we will study customer demographic and transaction data to understand whether a customer will respond to a campaign. We will use logistic regression techniques to carry out this analysis. This exercise is primarily aimed at renewal of existing customers.  


In [626]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
import plotly.tools as tools
import scipy.stats as stats
import statsmodels.api as sm
from patsy import dmatrices
from sklearn import linear_model
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import f_regression
# sklearn.cross_validation was removed in scikit-learn 0.20; these helpers live in model_selection
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly.graph_objs import *

plotly.tools.set_credentials_file(username='kishkp', api_key='DxJQrhnrXYCRF7MdMktU')

init_notebook_mode(connected=True) 

%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.options.display.max_colwidth = 0
plt.style.use('ggplot')

## Data Exploration Analysis and Data Preparation

In this section we will explore the dataset and gain some insights into it. We will look at the variable types and create descriptive statistics for the variables. We will also transform the data to suit the task at hand and, while doing so, create some additional variables / features. Finally, we will carry out feature selection to include only those features that might influence the dependent variable ('Response').

The following are some of the activities in this section:

- Univariate / Bivariate / Descriptive statistics
- Outlier and Missing values treatment
- Creating Additional Variables
- Feature Selection

### Feature / Variable Identification

In this section we will look at the variables / features at our disposal. In addition, we will outline the kind of analysis that we will carry out for each variable.

Below is the list of variables, their properties and the relevant analysis to be carried out:

In [698]:
file = "https://raw.githubusercontent.com/kishkp/Data-602-Advanced-Programming-Techniques/master/Project1_desc.csv"
df = pd.read_csv(file)
df

|   | Variable / Feature | Type | Comments / Description |
|---|---|---|---|
| 0 | Response | Dependent Variable - Character | Whether the customer responded to the campaign and signed up for the respective offer. This is the dependent variable to be predicted. |
| 1 | State | Categorical | We will check to see if some states have a higher tendency to subscribe to the offer. If so, we will retain this variable. If not, we will discard the variable. |
| 2 | Coverage | Categorical | What type of coverage is currently enjoyed by the customer. We will look at the frequency distribution and correlation to response to see if this is a variable worth considering. |
| 3 | Education | Categorical | The education level of the customer. We will look at the frequency distribution and correlation to response to see if this is a variable worth considering. |
| 4 | EmploymentStatus | Categorical | Is the person currently employed? We will look at the frequency distribution and correlation to response to see if this is a variable worth considering. |
| 5 | Gender | Categorical | What is the gender of the customer? Does the gender have any impact on the subscription? We will look at the frequency distribution and correlation to response to see if this is a variable worth considering. |
| 6 | Location Code | Categorical | Are Urban folks more likely to subscribe? We will look at the frequency distribution and correlation to response to see if this is a variable worth considering. |
| 7 | Marital Status | Categorical | Are Married folks more likely to subscribe? We will look at the frequency distribution and correlation to response to see if this is a variable worth considering. |
| 8 | Policy Type | Categorical | Does the type of policy impact the campaign response? Are personal policy holders more likely to subscribe? We will look at the frequency distribution and correlation to response to see if this is a variable worth considering. |
| 9 | Policy | Categorical | Does the level of policy impact the campaign response? We will look at the frequency distribution and correlation to response to see if this is a variable worth considering. |


From the table above, we can see that we have quite a few categorical and numeric variables. The "Response" variable is dichotomous, with "Yes" and "No" as its values; it is the dependent variable that we will be predicting.

The following analysis will be carried out for the various features / variables:

- Categorical : We will create bar plots / frequency tables for each of the categorical variables to understand the distribution of these variables in the data. In addition, we will use chi-square test to determine if these variables have a statistically significant relationship with the "Response" variable.

- Numerical : We will have a look at the histogram to see if the data is normally distributed. If not, we will look at carrying out some transformations.   


In [699]:
file = "https://raw.githubusercontent.com/kishkp/Data-602-Advanced-Programming-Techniques/master/WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv"
data = pd.read_csv(file)
data.columns = ['Customer','State','CustomerLifetimeValue','Response','Coverage','Education','EffectiveToDate','EmploymentStatus',
'Gender','Income','LocationCode','MaritalStatus', 'MonthlyPremiumAuto','MonthsSinceLastClaim', 'MonthsSincePolicyInception','NumberofOpenComplaints',
'NumberofPolicies','PolicyType','Policy','RenewOfferType', 'SalesChannel','TotalClaimAmount','VehicleClass', 'VehicleSize'] 


### Univariate / BiVariate / Descriptive Statistics

#### Categorical Variables

##### Customer

We discard the "customer" variable as it is a row identifier and does not add any value to the outcome.

In [131]:
# data.drop('Customer', axis=1, inplace=True)

##### State

We will check to see if some states have a higher tendency to subscribe to the offer. If so, we will retain this variable. If not, we will discard the variable.

In [5]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.State, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

# Create the percentage of the total 
totals = y_0 + y_1

y_2 = y_0 / totals
y_3 = y_1 / totals

# Assign each bar with the respective series
trace1 = Bar(x=x_0, y=y_0, name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1, name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2, name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

# Make subplots and add the series to the respective sub-plots
fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

# Make the layout as 'Stack'
fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [286]:
# Perform chi-sq test for independence

# create a cross tab from the data
cross_tab = pd.crosstab(data.State, data.Response, margins = False)
cross_tab

# show chi-sq statistics - a high p-value (> 0.05) means we cannot reject the hypothesis that the variables are independent.
chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

# UNUSED - split the chi-sq statistics into the respective components for further use if any.
chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]

#data.drop('State', axis=1, inplace=True)

| State | Response = No | Response = Yes |
|---|---|---|
| Arizona | 1460 | 243 |
| California | 2694 | 456 |
| Nevada | 758 | 124 |
| Oregon | 2225 | 376 |
| Washington | 689 | 109 |


(0.43847752995883232,
 0.97920715123356927,
 4L,
 array([[ 1459.1283118 ,   243.8716882 ],
        [ 2698.91613751,   451.08386249],
        [  755.6965185 ,   126.3034815 ],
        [ 2228.53361069,   372.46638931],
        [  683.7254215 ,   114.2745785 ]]))

From the 100 percent stacked bar, the 'Response' variable does not appear to depend on 'State': the proportion of responders is nearly the same in every state. The high p-value (0.9792) from the chi-square test of independence is consistent with this, as we cannot reject independence.

Hence, we go ahead and exclude "State" from the analysis.
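
As a cross-check on the figures above, the chi-square statistic can be reproduced by hand from the State cross tab. The sketch below is plain Python rather than the `scipy` call used in the notebook; the helper names `chi_square_stat` and `chi2_sf_even_dof` are hypothetical, and the closed-form p-value used here only holds when the degrees of freedom are even (4 in this case).

```python
import math

# Observed counts from the State cross tab above: (No, Yes) per state
observed = {
    "Arizona": (1460, 243),
    "California": (2694, 456),
    "Nevada": (758, 124),
    "Oregon": (2225, 376),
    "Washington": (689, 109),
}

def chi_square_stat(table):
    """Return (statistic, dof, expected) for a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof, expected

def chi2_sf_even_dof(x, dof):
    """Upper-tail chi-square probability; this closed form is valid for even dof only."""
    if dof % 2 != 0:
        raise ValueError("closed form used here requires an even number of degrees of freedom")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(dof // 2))

stat, dof, expected = chi_square_stat(list(observed.values()))
p_value = chi2_sf_even_dof(stat, dof)
print(round(stat, 4), dof, round(p_value, 4))
```

The statistic, degrees of freedom and p-value computed this way should agree with the `chi2_contingency` output shown above (about 0.4385, 4 and 0.9792), and the first row of `expected` should match the first row of the expected-count array.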

##### Coverage

What type of coverage does the customer currently hold? We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.



In [556]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.Coverage, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [287]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.Coverage, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]



| Coverage | Response = No | Response = Yes |
|---|---|---|
| Basic | 4770 | 798 |
| Extended | 2352 | 390 |
| Premium | 704 | 120 |


(0.061276542077876382,
 0.96982632390239099,
 2L,
 array([[ 4770.65557259,   797.34442741],
        [ 2349.34223779,   392.65776221],
        [  706.00218962,   117.99781038]]))

From the 100 percent stacked bar, the 'Response' variable does not appear to depend on 'Coverage': the proportion of responders is nearly the same for every coverage type. The high p-value (0.9698) from the chi-square test of independence is consistent with this.

Hence, we go ahead and exclude "Coverage" from the analysis as well.

##### Education	

The variable denotes the education level of the customer. We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.


In [9]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.Education, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [288]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.Education, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]

| Education | Response = No | Response = Yes |
|---|---|---|
| Bachelor | 2370 | 378 |
| College | 2273 | 408 |
| Doctor | 282 | 60 |
| High School or Below | 2280 | 342 |
| Master | 621 | 120 |


(10.977692567761025,
 0.026815866387950998,
 4L,
 array([[ 2354.48303044,   393.51696956],
        [ 2297.07751259,   383.92248741],
        [  293.02518064,    48.97481936],
        [ 2246.52638494,   375.47361506],
        [  634.88789139,   106.11210861]]))

From the 100 percent stacked bar, the 'Response' variable appears to vary across education levels. The low p-value (0.0268) from the chi-square test of independence confirms this.

We will retain this variable for further analysis. However, we will 'dummify' this categorical variable and remove the original variable.


In [289]:
# .ix was removed from pandas; .iloc[:, 1:] drops the first dummy column by position
data = data.join(pd.get_dummies(data['Education'], prefix='Education').iloc[:, 1:])
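
To make the effect of the 'dummify' step concrete, here is a minimal plain-Python sketch of one-hot encoding with the first level dropped, which is what slicing off the first column of the `get_dummies` output achieves. The function `one_hot_drop_first` and the sample values are hypothetical, and the sketch assumes the levels are ordered alphabetically so that the dropped baseline is the first level in sorted order.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode `values`, omitting the column for the first (sorted) level."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level becomes the implicit baseline
    columns = ["{}_{}".format(prefix, level) for level in kept]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return columns, rows

# Hypothetical sample of education levels, for illustration only
sample = ["Bachelor", "College", "Master", "Bachelor", "Doctor"]
columns, rows = one_hot_drop_first(sample, "Education")
print(columns)
for value, row in zip(sample, rows):
    print(value, row)
```

A row of all zeros then denotes the baseline level ('Bachelor' in this sample). Dropping one level avoids perfectly collinear dummy columns when the model also includes an intercept.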


##### EmploymentStatus	

Is the person currently employed? We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.


In [12]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.EmploymentStatus, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)



This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [290]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.EmploymentStatus, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]


| EmploymentStatus | Response = No | Response = Yes |
|---|---|---|
| Disabled | 333 | 72 |
| Employed | 4942 | 756 |
| Medical Leave | 354 | 78 |
| Retired | 78 | 204 |
| Unemployed | 2119 | 198 |


(850.69262594458121,
 8.0205821207649546e-183,
 4L,
 array([[  347.00350339,    57.99649661],
        [ 4882.03941318,   815.96058682],
        [  370.13707029,    61.86292971],
        [  241.61725422,    40.38274578],
        [ 1985.20275892,   331.79724108]]))

From the 100 percent stacked bar, the 'Response' variable appears to vary significantly across employment statuses. The very low p-value (8.02e-183) from the chi-square test of independence confirms this.

We will retain this variable for further analysis. However, we will 'dummify' this categorical variable and remove the original variable.
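
The size of the differences is easier to see as response rates per category. The short sketch below recomputes them in plain Python from the cross tab shown above; the variable names are illustrative only.

```python
# (No, Yes) counts per employment status, copied from the cross tab above
counts = {
    "Disabled": (333, 72),
    "Employed": (4942, 756),
    "Medical Leave": (354, 78),
    "Retired": (78, 204),
    "Unemployed": (2119, 198),
}

# Response rate = Yes / (No + Yes) within each category
response_rate = {status: yes / (no + yes) for status, (no, yes) in counts.items()}

# Overall rate across all customers, for comparison
total_yes = sum(yes for _, yes in counts.values())
total_all = sum(no + yes for no, yes in counts.values())
overall_rate = total_yes / total_all

for status, rate in sorted(response_rate.items(), key=lambda item: item[1], reverse=True):
    print("{:<15s} {:.1%}".format(status, rate))
print("{:<15s} {:.1%}".format("Overall", overall_rate))
```

Retired customers respond at roughly 72 percent against about 14 percent overall, while unemployed customers respond at under 9 percent. Gaps of this size are what drive the very large chi-square statistic.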


In [291]:
# .ix was removed from pandas; .iloc[:, 1:] drops the first dummy column by position
data = data.join(pd.get_dummies(data['EmploymentStatus'], prefix='EmploymentStatus').iloc[:, 1:])


##### Gender

What is the gender of the customer? Does the gender have any impact on the subscription?  We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.

In [15]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.Gender, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [292]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.Gender, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]



| Gender | Response = No | Response = Yes |
|---|---|---|
| F | 3998 | 660 |
| M | 3828 | 648 |


(0.15231640387892736,
 0.69633147580094035,
 1L,
 array([[ 3990.96868842,   667.03131158],
        [ 3835.03131158,   640.96868842]]))

From the 100 percent stacked bar, the 'Response' variable does not appear to depend on 'Gender': the proportion of responders is nearly the same for both genders. The high p-value (0.6963) from the chi-square test of independence is consistent with this.

Hence, we go ahead and exclude "Gender" from the analysis as well.
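
One detail worth noting for this 2x2 table: `stats.chi2_contingency` applies Yates' continuity correction by default when there is a single degree of freedom, so the reported statistic is slightly smaller than the uncorrected Pearson value. The plain-Python sketch below reproduces the corrected figure from the cross tab above and computes the uncorrected one for comparison; the function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(table, yates=True):
    """Chi-square statistic for a 2x2 table, optionally with Yates' continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            diff = abs(observed - expected)
            if yates:
                # Shrink each absolute deviation by 0.5 (but never below zero)
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    return stat

# (No, Yes) counts for F and M from the Gender cross tab above
gender = [(3998, 660), (3828, 648)]

corrected = chi_square_2x2(gender, yates=True)
uncorrected = chi_square_2x2(gender, yates=False)

# With one degree of freedom the upper-tail probability is erfc(sqrt(x / 2))
p_corrected = math.erfc(math.sqrt(corrected / 2.0))
print(round(corrected, 4), round(p_corrected, 4), round(uncorrected, 4))
```

The corrected statistic and its p-value should match the output above (about 0.1523 and 0.6963). The uncorrected statistic is somewhat larger, but the conclusion is the same either way.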

##### Location Code	

Are Urban folks more likely to subscribe?  We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.


In [17]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.LocationCode, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)



This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [293]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.LocationCode, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]

| LocationCode | Response = No | Response = Yes |
|---|---|---|
| Rural | 1611 | 162 |
| Suburban | 4771 | 1008 |
| Urban | 1444 | 138 |


(125.13009510410569,
 6.7351161253195568e-28,
 2L,
 array([[ 1519.10422597,   253.89577403],
        [ 4951.44011386,   827.55988614],
        [ 1355.45566017,   226.54433983]]))

From the 100 percent stacked bar, the 'Response' variable appears to vary across locations. The low p-value (6.73e-28) from the chi-square test of independence confirms this.

We will retain this variable for further analysis. However, we will 'dummify' this categorical variable and remove the original variable.


In [294]:
# .ix was removed from pandas; .iloc[:, 1:] drops the first dummy column by position
data = data.join(pd.get_dummies(data['LocationCode'], prefix='LocationCode').iloc[:, 1:])


##### Marital Status

Are Married folks more likely to subscribe?  We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.


In [20]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.MaritalStatus, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)



This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [295]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.MaritalStatus, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]


| MaritalStatus | Response = No | Response = Yes |
|---|---|---|
| Divorced | 1045 | 324 |
| Married | 4602 | 696 |
| Single | 2179 | 288 |


(117.59875352408642,
 2.9090764387287105e-26,
 2L,
 array([[ 1172.95752135,   196.04247865],
        [ 4539.31990366,   758.68009634],
        [ 2113.72257499,   353.27742501]]))

From the 100 percent stacked bar, the 'Response' variable appears to vary across marital statuses. The low p-value (2.91e-26) from the chi-square test of independence confirms this.

We will retain this variable for further analysis. However, we will 'dummify' this categorical variable and remove the original variable.



In [296]:
# .ix was removed from pandas; .iloc[:, 1:] drops the first dummy column by position
data = data.join(pd.get_dummies(data['MaritalStatus'], prefix='MaritalStatus').iloc[:, 1:])


##### Policy Type	

Does the type of policy impact the campaign response? Are personal policy holders more likely to subscribe?  We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.


In [23]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.PolicyType, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [297]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.PolicyType, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]



| PolicyType | Response = No | Response = Yes |
|---|---|---|
| Corporate Auto | 1680 | 288 |
| Personal Auto | 5830 | 958 |
| Special Auto | 316 | 62 |


(1.7306298294691826,
 0.42091897816529777,
 2L,
 array([[ 1686.17998686,   281.82001314],
        [ 5815.95007664,   972.04992336],
        [  323.8699365 ,    54.1300635 ]]))

From the 100 percent stacked bar, we can see a slight variation for 'Special Auto'. However, it is not sufficient to influence the "Response" variable. The high p-value (0.4209) from the chi-square test of independence confirms this.

Hence, we go ahead and exclude "PolicyType" from the analysis as well.

##### Policy	

Does the level of policy impact the campaign response? We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.


In [25]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.Policy, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [298]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.Policy, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]



| Policy | Response = No | Response = Yes |
|---|---|---|
| Corporate L1 | 311 | 48 |
| Corporate L2 | 507 | 88 |
| Corporate L3 | 862 | 152 |
| Personal L1 | 1055 | 185 |
| Personal L2 | 1817 | 305 |
| Personal L3 | 2958 | 468 |
| Special L1 | 54 | 12 |
| Special L2 | 145 | 19 |
| Special L3 | 117 | 31 |


(9.4230219909540036,
 0.3078757089264722,
 8L,
 array([[  307.5907598 ,    51.4092402 ],
        [  509.79527042,    85.20472958],
        [  868.79395665,   145.20604335],
        [ 1062.43047953,   177.56952047],
        [ 1818.12699803,   303.87300197],
        [ 2935.39259908,   490.60740092],
        [   56.54871907,     9.45128093],
        [  140.51499891,    23.48500109],
        [  126.80621852,    21.19378148]]))

There seems to be some variation in "Response", especially around 'Special' policies. However, as with PolicyType, it is not sufficient to retain the variable. The high p-value (0.3079) from the chi-square test of independence confirms this.

Hence, we go ahead and exclude "Policy" from the analysis as well.

##### Renew Offer Type	

Is a particular offer more attractive than the other?  We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.


In [557]:
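# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order;
# this also keeps levels with zero 'Yes' responses (such as Offer4) in both series
counts = pd.crosstab(data.RenewOfferType, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']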

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [300]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.RenewOfferType, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]


| RenewOfferType | Response = No | Response = Yes |
|---|---|---|
| Offer1 | 3158 | 594 |
| Offer2 | 2242 | 684 |
| Offer3 | 1402 | 30 |
| Offer4 | 1024 | 0 |


(548.16445142938346,
 1.7369503065426601e-118,
 3L,
 array([[ 3214.70899934,   537.29100066],
        [ 2506.99321217,   419.00678783],
        [ 1226.9358441 ,   205.0641559 ],
        [  877.36194438,   146.63805562]]))

From the 100 percent stacked bar, the 'Response' variable appears to vary across renewal offer types. The low p-value (1.74e-118) from the chi-square test of independence confirms this.

We will retain this variable for further analysis. However, we will 'dummify' this categorical variable and remove the original variable.



In [301]:
# .ix was removed from pandas; .iloc[:, 1:] drops the first dummy column by position
data = data.join(pd.get_dummies(data['RenewOfferType'], prefix='RenewOfferType').iloc[:, 1:])


##### Sales Channel	

Is a particular sales channel more likely to lead to a successful response? We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.


In [558]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.SalesChannel, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [302]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.SalesChannel, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]


| SalesChannel | Response = No | Response = Yes |
|---|---|---|
| Agent | 2811 | 666 |
| Branch | 2273 | 294 |
| Call Center | 1573 | 192 |
| Web | 1169 | 156 |


(107.47244037488014,
 3.8391117221024828e-23,
 3L,
 array([[ 2979.08933654,   497.91066346],
        [ 2199.40245238,   367.59754762],
        [ 1512.24983578,   252.75016422],
        [ 1135.2583753 ,   189.7416247 ]]))

From the 100 percent stacked bar, the 'Response' variable appears to vary across sales channels. The low p-value (3.84e-23) from the chi-square test of independence confirms this.

We will retain this variable for further analysis. However, we will 'dummify' this categorical variable and remove the original variable.



In [303]:
# .ix was removed from pandas; .iloc[:, 1:] drops the first dummy column by position
data = data.join(pd.get_dummies(data['SalesChannel'], prefix='SalesChannel').iloc[:, 1:])


##### Vehicle Class	

Does the vehicle class impact campaign response?  We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.


In [559]:
# Show stacked bar plots of the response counts and proportions

# Cross-tabulate once so the labels and both count series share the same order
counts = pd.crosstab(data.VehicleClass, data.Response)
x_0 = counts.index.tolist()
y_0 = counts['Yes']
y_1 = counts['No']

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = Bar(x=x_0, y=y_0,  name = 'Response = Yes',)
trace2 = Bar(x=x_0,y=y_1,  name = 'Response = No',)
trace3 = Bar(x=x_0, y=y_2,  name = 'Response = Yes',)
trace4 = Bar(x=x_0,y=y_3,  name = 'Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [304]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.VehicleClass, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]


| VehicleClass | Response = No | Response = Yes |
|---|---|---|
| Four-Door Car | 3997 | 624 |
| Luxury Car | 151 | 12 |
| Luxury SUV | 154 | 30 |
| SUV | 1508 | 288 |
| Sports Car | 394 | 90 |
| Two-Door Car | 1622 | 264 |


(21.210243375074022,
 0.00073921311309917527,
 5L,
 array([[ 3959.26713379,   661.73286621],
        [  139.65820013,    23.34179987],
        [  157.65097438,    26.34902562],
        [ 1538.81059777,   257.18940223],
        [  414.69060653,    69.30939347],
        [ 1615.92248741,   270.07751259]]))

From the 100 percent stacked bar, 'Response' variable seems to have a variation between the various Vehicle Classes. A low p-value (0.00074) from the chi-square test of independence confirms this. 

We will retain this variable for further analysis. However, we will 'dummify' this categorical variable and remove the original variable.



In [305]:
data = data.join(pd.get_dummies(data['VehicleClass'], prefix='VehicleClass').iloc[:, 1:])


##### Vehicle Size	

Does the vehicle size impact campaign response?  We will look at the frequency distribution and correlation to response to see if this is a variable worth considering.


In [560]:
# Show stacked bars: raw counts (left) and share of total (right)

x_0 = sorted(data.VehicleSize.unique())
# Reindex so the counts follow the same category order as x_0
y_0 = data[(data["Response"] == 'Yes')].VehicleSize.value_counts().reindex(x_0, fill_value=0)
y_1 = data[(data["Response"] == 'No')].VehicleSize.value_counts().reindex(x_0, fill_value=0)

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = go.Bar(x=x_0, y=y_0, name='Response = Yes')
trace2 = go.Bar(x=x_0, y=y_1, name='Response = No')
trace3 = go.Bar(x=x_0, y=y_2, name='Response = Yes')
trace4 = go.Bar(x=x_0, y=y_3, name='Response = No')

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [306]:
# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.VehicleSize, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]


VehicleSize,Response = No,Response = Yes
Large,778,168
Medsize,5482,942
Small,1566,198


(23.513731678546584,
 7.8353435165186028e-06,
 2L,
 array([[  810.53164003,   135.46835997],
        [ 5504.07532297,   919.92467703],
        [ 1511.393037  ,   252.606963  ]]))

From the 100 percent stacked bar, 'Response' variable seems to have a variation between the various Vehicle Sizes. A low p-value (7.84e-06) from the chi-square test of independence confirms this. 

We will retain this variable for further analysis. However, we will 'dummify' this categorical variable and remove the original variable.
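
To make the chi-square test of independence less of a black box, the sketch below recomputes it by hand from the Vehicle Size counts shown above. The expected count in each cell is (row total x column total) / grand total, and the statistic sums (observed - expected)^2 / expected over all cells. The variable names are chosen for illustration only.

```python
import numpy as np
from scipy import stats

# Observed counts copied from the Vehicle Size cross-tab (rows: Large, Medsize, Small;
# columns: Response = No, Response = Yes)
observed = np.array([[778, 168],
                     [5482, 942],
                     [1566, 198]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson chi-square statistic and its degrees of freedom
chi_sq = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-square distribution
p_value = stats.chi2.sf(chi_sq, dof)

print(chi_sq, dof, p_value)
```

The hand-computed statistic, degrees of freedom, and expected counts should agree with what `stats.chi2_contingency` reports for the same table.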



In [307]:
data = data.join(pd.get_dummies(data['VehicleSize'], prefix='VehicleSize').iloc[:, 1:])


#### Continuous Variables

Now let's look at the continuous variables.

##### Effective To Date	

This variable tells us the date until which the current policy is covered. The dates in this variable range from '01-Jan-2011' to '28-Feb-2011'. We will use '31-Dec-2010' as the reference date for the campaigns, which is the day before the earliest insurance expiry date. 

We will derive an 'ExpiryDays' variable that tells us how much time the customer has left to renew the current policy. Is the customer in a hurry to sign up? Does the time remaining on the current policy have an impact on the campaign response? We will do some exploration to see if this variable is useful. 


In [308]:
data['ExpiryDays'] = (pd.to_datetime(data.EffectiveToDate) - pd.to_datetime('31-Dec-2010')) / np.timedelta64(1, 'D')
data['Resp_codes'] = pd.Categorical(data.Response).codes

In [682]:

x_0 = data[(data["Response"] == 'Yes')].ExpiryDays
x_1 = data[(data["Response"] == 'No')].ExpiryDays

trace1 = go.Histogram(x=x_0, opacity=0.75, name='Response = Yes', )
trace2 = go.Histogram(x=x_1, opacity=0.75, name='Response = No',)
trace3 = go.Box(y=x_0, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = Yes',)
trace4 = go.Box(y=x_1, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='overlay')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [309]:
data[['Resp_codes','ExpiryDays']].corr()



,Resp_codes,ExpiryDays
Resp_codes,1.0,-0.006047
ExpiryDays,-0.006047,1.0


From the above histogram and box plot, Expiry Days (and hence the Effective To Date) does not appear to have a strong influence on Response. The near-zero correlation (-0.006) between the two variables supports this. 

We go ahead and drop this variable from further processing.
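
The Pearson correlation between a 0/1 response code and a continuous variable is the point-biserial correlation. The sketch below checks that equivalence on randomly generated data; the data, seed, and variable names are assumptions for illustration and do not come from the insurance dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)               # arbitrary seed for repeatability
response = rng.integers(0, 2, size=500)      # hypothetical 0/1 response codes
expiry_days = rng.integers(1, 60, size=500)  # hypothetical days until expiry

# Pearson correlation, as computed by DataFrame.corr() in the notebook
pearson_r = np.corrcoef(response, expiry_days)[0, 1]

# Point-biserial correlation, which also returns a p-value
pb_r, pb_p = stats.pointbiserialr(response, expiry_days)

print(pearson_r, pb_r, pb_p)
```

Using `pointbiserialr` has the practical advantage of returning a p-value alongside the coefficient, which makes the "no strong influence" judgment less subjective.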

##### Customer Lifetime Value	

What is the value generated by the customer for the insurance company? We will also bin this variable to see whether customers with a higher lifetime value (LTV) are more loyal to the company and respond more often, or whether the opposite holds.


In [683]:
x_0 = data[(data["Response"] == 'Yes')].CustomerLifetimeValue
x_1 = data[(data["Response"] == 'No')].CustomerLifetimeValue

trace1 = go.Histogram(x=x_0, opacity=0.75, name='Response = Yes', )
trace2 = go.Histogram(x=x_1, opacity=0.75, name='Response = No',)
trace3 = go.Box(y=x_0, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = Yes',)
trace4 = go.Box(y=x_1, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='overlay')
py.iplot(fig)



This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [310]:
data[['Resp_codes','CustomerLifetimeValue']].corr()

,Resp_codes,CustomerLifetimeValue
Resp_codes,1.0,-0.00893
CustomerLifetimeValue,-0.00893,1.0


From the above histogram and box plot, Customer Lifetime Value does not appear to have a strong influence on Response. The near-zero correlation (-0.009) between the two variables supports this. 

Let's see if binning the customer lifetime value reveals a stronger relationship.

In [311]:
bins = [data['CustomerLifetimeValue'].min()-10, 10000, 15000, data['CustomerLifetimeValue'].max()+1]
group_names = ['le_10000', '10000 to 15000', 'gt_15000']
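
As a small illustration of how these bin edges behave, the sketch below applies the same style of edges and labels to a handful of made-up values. By default `pd.cut` uses right-closed intervals, so a value exactly equal to an edge (for example 10000) falls into the lower bin. The sample values are assumptions and are not taken from the dataset.

```python
import pandas as pd

# Made-up lifetime values, including two that sit exactly on a bin edge
values = pd.Series([1900.0, 10000.0, 10000.01, 15000.0, 83000.0])

# Same construction as in the notebook: pad the outer edges so min and max are included
bins = [values.min() - 10, 10000, 15000, values.max() + 1]
labels = ['le_10000', '10000 to 15000', 'gt_15000']

# Intervals are (low, high], so 10000 -> 'le_10000' and 15000 -> '10000 to 15000'
binned = pd.cut(values, bins, labels=labels)

print(binned.tolist())
```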


In [690]:
data['CLTV_Bins'] = pd.cut(data['CustomerLifetimeValue'], bins, labels=group_names).cat.codes

# Show stacked bars: raw counts (left) and share of total (right)

x_0 = sorted(data.CLTV_Bins.unique())
# Reindex so the counts follow the same bin order as x_0
y_0 = data[(data["Response"] == 'Yes')].CLTV_Bins.value_counts().reindex(x_0, fill_value=0)
y_1 = data[(data["Response"] == 'No')].CLTV_Bins.value_counts().reindex(x_0, fill_value=0)

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = go.Bar(x=x_0, y=y_0, name='Response = Yes')
trace2 = go.Bar(x=x_0, y=y_1, name='Response = No')
trace3 = go.Bar(x=x_0, y=y_2, name='Response = Yes')
trace4 = go.Bar(x=x_0, y=y_3, name='Response = No')

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [312]:
data['CLTV_Bins'] = pd.cut(data['CustomerLifetimeValue'], bins, labels=group_names)

# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.CLTV_Bins, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]



CLTV_Bins,Response = No,Response = Yes
le_10000,6246,1002
10000 to 15000,735,180
gt_15000,845,126


(24.241564192409005,
 5.4451671933329041e-06,
 2L,
 array([[ 6210.07751259,  1037.92248741],
        [  783.97087804,   131.02912196],
        [  831.95160937,   139.04839063]]))

From the 100 percent stacked bar, 'Response' variable seems to have a variation between the various CLTV Bins. A low p-value (5.45e-06) from the chi-square test of independence confirms this. 

We will retain the binned variable for further analysis. We drop the continuous Customer Lifetime Value variable, 'dummify' the binned categorical variable, and later remove the intermediate bin column.


In [313]:
data = data.join(pd.get_dummies(data['CLTV_Bins'], prefix='CLTV').iloc[:, 1:])


##### Income	

Does a higher income indicate a successful campaign response? Is there a pattern to find here? We will try binning this variable to see if we can get more predictive power. We will look at the frequency distribution / histograms and the correlation with Response to see if this is a variable worth considering.


In [691]:
x_0 = data[(data["Response"] == 'Yes')].Income
x_1 = data[(data["Response"] == 'No')].Income

trace1 = go.Histogram(x=x_0, opacity=0.75, name='Response = Yes', )
trace2 = go.Histogram(x=x_1, opacity=0.75, name='Response = No',)
trace3 = go.Box(y=x_0, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = Yes',)
trace4 = go.Box(y=x_1, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='overlay')
py.iplot(fig)

# data.drop('Income', axis=1, inplace=True)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [314]:
data[['Resp_codes','Income']].corr()

,Resp_codes,Income
Resp_codes,1.0,0.011932
Income,0.011932,1.0


From the above histogram and box plot, Income does not appear to have a strong influence on Response. The near-zero correlation (0.012) between the two variables supports this. 

Let's see if binning reveals a stronger relationship.

In [315]:
bins = [data['Income'].min()-10, 10000, 20000, 30000, 40000, 55000, 65000, 70000, data['Income'].max()+1]
group_names = ['le_10K', '10K to 20K', '20K to 30K', '30K to 40K', '40K to 55K', '55K to 65K', '65K to 70K', 'gt_70K']


In [692]:
data['Income_Bins'] = pd.cut(data['Income'], bins, labels=group_names).cat.codes

# Show stacked bars: raw counts (left) and share of total (right)

x_0 = sorted(data.Income_Bins.unique())

# Reindex so the counts follow the same bin order as x_0
y_0 = data[(data["Response"] == 'Yes')].Income_Bins.value_counts().reindex(x_0, fill_value=0)
y_1 = data[(data["Response"] == 'No')].Income_Bins.value_counts().reindex(x_0, fill_value=0)

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = go.Bar(x=x_0, y=y_0, name='Response = Yes')
trace2 = go.Bar(x=x_0, y=y_1, name='Response = No')
trace3 = go.Bar(x=x_0, y=y_2, name='Response = Yes')
trace4 = go.Bar(x=x_0, y=y_3, name='Response = No')

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [316]:
data['Income_Bins'] = pd.cut(data['Income'], bins, labels=group_names)

# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.Income_Bins, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]



Income_Bins,Response = No,Response = Yes
le_10K,2119,198
10K to 20K,360,162
20K to 30K,1099,282
30K to 40K,767,126
40K to 55K,1022,108
55K to 65K,659,162
65K to 70K,313,36
gt_70K,1487,234


(269.52539723729649,
 1.9218649037582163e-54,
 7L,
 array([[ 1985.20275892,   331.79724108],
        [  447.24895993,    74.75104007],
        [ 1183.23910663,   197.76089337],
        [  765.12130501,   127.87869499],
        [  968.18261441,   161.81738559],
        [  703.4317933 ,   117.5682067 ],
        [  299.02277206,    49.97722794],
        [ 1474.55068973,   246.44931027]]))

From the 100 percent stacked bar, 'Response' variable seems to have a variation between the various Income Bins. A low p-value (1.92e-54) from the chi-square test of independence confirms this. 

We will retain the binned variable for further analysis. We drop the continuous Income variable, 'dummify' the binned categorical variable, and later remove the intermediate bin column.


In [317]:
data = data.join(pd.get_dummies(data['Income_Bins'], prefix='Income').iloc[:, 1:])


##### Monthly Premium Auto	

Is there a pattern here? We will look at the histograms and/or create binned variables to see if this variable is useful for prediction.


In [681]:

x_0 = data[(data["Response"] == 'Yes')].MonthlyPremiumAuto
x_1 = data[(data["Response"] == 'No')].MonthlyPremiumAuto


trace1 = go.Histogram(x=x_0, opacity=0.75, name='Response = Yes', )
trace2 = go.Histogram(x=x_1, opacity=0.75, name='Response = No',)
trace3 = go.Box(y=x_0, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = Yes',)
trace4 = go.Box(y=x_1, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='overlay')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [318]:
data[['Resp_codes','MonthlyPremiumAuto']].corr()

,Resp_codes,MonthlyPremiumAuto
Resp_codes,1.0,0.010966
MonthlyPremiumAuto,0.010966,1.0


From the above histogram and box plot, Monthly Premium Auto does not appear to have a strong influence on Response. The near-zero correlation (0.011) between the two variables supports this. 

Let's see if binning reveals a stronger relationship.

In [320]:
bins = [data['MonthlyPremiumAuto'].min()-10, 75, 95, 120, 140, data['MonthlyPremiumAuto'].max()+1]
group_names = ['le_75', '75 to 95', '95 to 120', '120 to 140', 'gt_140']

In [693]:

data['MonthlyPremium_Bins'] = pd.cut(data['MonthlyPremiumAuto'], bins, labels=group_names).cat.codes

# Show stacked bars: raw counts (left) and share of total (right)

x_0 = sorted(data.MonthlyPremium_Bins.unique())
# Reindex so the counts follow the same bin order as x_0
y_0 = data[(data["Response"] == 'Yes')].MonthlyPremium_Bins.value_counts().reindex(x_0, fill_value=0)
y_1 = data[(data["Response"] == 'No')].MonthlyPremium_Bins.value_counts().reindex(x_0, fill_value=0)

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = go.Bar(x=x_0, y=y_0, name='Response = Yes')
trace2 = go.Bar(x=x_0, y=y_1, name='Response = No')
trace3 = go.Bar(x=x_0, y=y_2, name='Response = Yes')
trace4 = go.Bar(x=x_0, y=y_3, name='Response = No')

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [321]:
data['MonthlyPremium_Bins'] = pd.cut(data['MonthlyPremiumAuto'], bins, labels=group_names)

# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.MonthlyPremium_Bins, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]



MonthlyPremium_Bins,Response = No,Response = Yes
le_75,3433,540
75 to 95,1413,216
95 to 120,1924,390
120 to 140,577,102
gt_140,479,60


(20.057964791264904,
 0.00048641138326177601,
 4L,
 array([[ 3404.06152836,   568.93847164],
        [ 1395.72520254,   233.27479746],
        [ 1982.6323626 ,   331.3676374 ],
        [  581.76636742,    97.23363258],
        [  461.81453908,    77.18546092]]))

From the 100 percent stacked bar, the 'Response' variable seems to vary between the Monthly Premium Bins. A low p-value (0.00049) from the chi-square test of independence confirms this. 

We will retain the binned variable for further analysis. We drop the continuous Monthly Premium Auto variable, 'dummify' the binned categorical variable, and later remove the intermediate bin column.


In [322]:
data = data.join(pd.get_dummies(data['MonthlyPremium_Bins'], prefix='MntlyPrem').iloc[:, 1:])


##### Months Since Last Claim	

Is there a pattern here? We will look at the histograms and/or create binned variables to see if this variable is useful for prediction.


In [694]:

x_0 = data[(data["Response"] == 'Yes')].MonthsSinceLastClaim
x_1 = data[(data["Response"] == 'No')].MonthsSinceLastClaim

trace1 = go.Histogram(x=x_0, opacity=0.75, name='Response = Yes', )
trace2 = go.Histogram(x=x_1, opacity=0.75, name='Response = No',)
trace3 = go.Box(y=x_0, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = Yes',)
trace4 = go.Box(y=x_1, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='overlay')
py.iplot(fig)



This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [323]:
data[['Resp_codes','MonthsSinceLastClaim']].corr()


,Resp_codes,MonthsSinceLastClaim
Resp_codes,1.0,-0.016597
MonthsSinceLastClaim,-0.016597,1.0


From the above histogram and box-plot, it seems like Months Since Last Claim does not have a strong influence on Response. This is confirmed by the correlation between these two variables. We will drop this variable from further analysis.


##### Months Since Policy Inception	

Is there a pattern here? We will look at the histograms and/or create binned variables to see if this variable is useful for prediction.


In [695]:
x_0 = data[(data["Response"] == 'Yes')].MonthsSincePolicyInception
x_1 = data[(data["Response"] == 'No')].MonthsSincePolicyInception

trace1 = go.Histogram(x=x_0, opacity=0.75, name='Response = Yes', )
trace2 = go.Histogram(x=x_1, opacity=0.75, name='Response = No',)
trace3 = go.Box(y=x_0, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = Yes',)
trace4 = go.Box(y=x_1, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='overlay')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [324]:
data[['Resp_codes','MonthsSincePolicyInception']].corr()


,Resp_codes,MonthsSincePolicyInception
Resp_codes,1.0,0.002952
MonthsSincePolicyInception,0.002952,1.0


From the above histogram and box-plot, it seems like Months Since Policy Inception does not have a strong influence on Response. This is confirmed by the correlation between these two variables. We will drop this variable from further analysis.


##### Number of Open Complaints	

Is there a pattern here? We can treat this field as a categorical variable rather than a numerical one. Let's see if this helps with the predictions. 


In [325]:
data["NumberofOpenComplaints_str"] =  data.NumberofOpenComplaints.astype('category')


# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.NumberofOpenComplaints_str, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]



NumberofOpenComplaints_str,Response = No,Response = Yes
0,6190,1062
1,873,138
2,350,24
3,238,54
4,125,24
5,50,6


(25.155262642292971,
 0.00013003705338174944,
 5L,
 array([[ 6213.50470769,  1038.49529231],
        [  866.22356032,   144.77643968],
        [  320.44274141,    53.55725859],
        [  250.18524195,    41.81475805],
        [  127.6630173 ,    21.3369827 ],
        [   47.98073133,     8.01926867]]))

The low p-value (0.00013) from the chi-square test of independence indicates that, when treated as a categorical variable, the number of open complaints is significantly associated with Response.

We will retain the categorical version of this variable for further analysis. We drop the numerical Number of Open Complaints variable, 'dummify' the categorical variable, and later remove the intermediate categorical column.


In [326]:
data = data.join(pd.get_dummies(data['NumberofOpenComplaints_str'], prefix='OpnCmplnts').iloc[:, 1:])


##### Number of Policies	

We will follow a similar pattern for this variable as well. We can treat this field as a categorical variable rather than a numerical one. Let's see if this helps with the predictions.

In [327]:
data["NumberofPolicies_str"] =  data.NumberofPolicies.astype('category')


# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.NumberofPolicies_str, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]



NumberofPolicies_str,Response = No,Response = Yes
1,2735,516
2,1952,342
3,1036,132
4,367,42
5,347,60
6,330,42
7,373,60
8,342,42
9,344,72


(30.700029543086512,
 0.0001588618575107321,
 8L,
 array([[ 2785.45281366,   465.54718634],
        [ 1965.49638713,   328.50361287],
        [ 1000.74096781,   167.25903219],
        [  350.43069849,    58.56930151],
        [  348.71710094,    58.28289906],
        [  318.72914386,    53.27085614],
        [  370.99386906,    62.00613094],
        [  329.01072914,    54.98927086],
        [  356.42828991,    59.57171009]]))

The low p-value (0.00016) from the chi-square test of independence indicates that, when treated as a categorical variable, the number of policies is significantly associated with Response.

We will retain the categorical version of this variable for further analysis. We drop the numerical Number of Policies variable, 'dummify' the categorical variable, and later remove the intermediate categorical column.


In [328]:
data = data.join(pd.get_dummies(data['NumberofPolicies_str'], prefix='NumPolcs').iloc[:, 1:])


##### Total Claim Amount	

Does the total claim amount impact campaign response? Is there a pattern to find here? We will try binning this variable to see if we can get more predictive power. We will look at the frequency distribution / histograms and the correlation with Response to see if this is a variable worth considering.


In [696]:
x_0 = data[(data["Response"] == 'Yes')].TotalClaimAmount
x_1 = data[(data["Response"] == 'No')].TotalClaimAmount

trace1 = go.Histogram(x=x_0, opacity=0.75, name='Response = Yes', )
trace2 = go.Histogram(x=x_1, opacity=0.75, name='Response = No',)
trace3 = go.Box(y=x_0, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = Yes',)
trace4 = go.Box(y=x_1, boxpoints='all', jitter=0.3, pointpos=-1.8, name='Response = No',)

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='overlay')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [236]:
data[['Resp_codes','TotalClaimAmount']].corr()

,Resp_codes,TotalClaimAmount
Resp_codes,1.0,0.016877
TotalClaimAmount,0.016877,1.0


From the above histogram and box plot, Total Claim Amount does not appear to have a strong influence on Response. The near-zero correlation (0.017) between the two variables supports this. 

Let's see if binning reveals a stronger relationship.

In [329]:
bins = [data['TotalClaimAmount'].min()-10, 250, 400, 600, 725, data['TotalClaimAmount'].max()+1]
group_names = ['le_250', '250 to 400', '400 to 600', '600 to 725', 'gt_725']

In [697]:

data['TotalClaimAmount_Bins'] = pd.cut(data['TotalClaimAmount'], bins, labels=group_names).cat.codes

# Show stacked bars: raw counts (left) and share of total (right)

# Use the Total Claim Amount bins (not the Monthly Premium bins) on the x-axis
x_0 = sorted(data.TotalClaimAmount_Bins.unique())
# Reindex so the counts follow the same bin order as x_0
y_0 = data[(data["Response"] == 'Yes')].TotalClaimAmount_Bins.value_counts().reindex(x_0, fill_value=0)
y_1 = data[(data["Response"] == 'No')].TotalClaimAmount_Bins.value_counts().reindex(x_0, fill_value=0)

totals = y_0 + y_1

# Create the percentage of the total 
y_2 = y_0 / totals
y_3 = y_1 / totals

trace1 = go.Bar(x=x_0, y=y_0, name='Response = Yes')
trace2 = go.Bar(x=x_0, y=y_1, name='Response = No')
trace3 = go.Bar(x=x_0, y=y_2, name='Response = Yes')
trace4 = go.Bar(x=x_0, y=y_3, name='Response = No')

fig = tools.make_subplots(1, 2)
fig.append_trace(trace1, 1, 1)
fig.append_trace(trace2, 1, 1)
fig.append_trace(trace3, 1, 2)
fig.append_trace(trace4, 1, 2)

fig['layout'].update(barmode='stack')
py.iplot(fig)


This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]



In [330]:
data['TotalClaimAmount_Bins'] = pd.cut(data['TotalClaimAmount'], bins, labels=group_names)

# Perform chi-sq test for independence 
cross_tab = pd.crosstab(data.TotalClaimAmount_Bins, data.Response, margins = False)
cross_tab


chi_Sq_res = stats.chi2_contingency(observed= cross_tab)
chi_Sq_res

chi_sq_stat = chi_Sq_res[0]
p_val = chi_Sq_res[1]
deg_free = chi_Sq_res[2]
exp_counts = chi_Sq_res[3]



TotalClaimAmount_Bins,Response = No,Response = Yes
le_250,1922,186
250 to 400,2227,450
400 to 600,2091,450
600 to 725,598,108
gt_725,988,114


(103.96805885041904,
 1.4052961510061792e-21,
 4L,
 array([[ 1806.1318152 ,   301.8681848 ],
        [ 2293.6503175 ,   383.3496825 ],
        [ 2177.12568426,   363.87431574],
        [  604.89993431,   101.10006569],
        [  944.19224874,   157.80775126]]))

From the 100 percent stacked bar, the 'Response' variable seems to vary between the Total Claim Amount bins. A low p-value (1.41e-21) from the chi-square test of independence confirms this. 

We will retain the binned variable for further analysis. We drop the continuous Total Claim Amount variable, 'dummify' the binned categorical variable, and later remove the intermediate bin column.


In [331]:
data = data.join(pd.get_dummies(data['TotalClaimAmount_Bins'], prefix='TotClmAmt').iloc[:, 1:])


## 	Prepare Test / Train dataset

We will now prepare the train and test datasets, using an 80-20 split. Before we do that, we add an intercept column. Rather than dropping columns one by one, we select only the dependent and independent columns we need, which leaves out the columns we decided to discard earlier in the analysis as well as the intermediate bin columns. 

In [438]:
data.columns = data.columns.str.replace(' ', '')
data.columns = data.columns.str.replace('-', '')

data['intercept'] = 1.0

#data.drop('Customer', axis=1, inplace=True)
#data.drop('State', axis=1, inplace=True)
#data.drop('Coverage', axis=1, inplace=True)
#data.drop('Education', axis=1, inplace=True)
#data.drop('EmploymentStatus', axis=1, inplace=True)
#data.drop('Gender', axis=1, inplace=True)
#data.drop('LocationCode', axis=1, inplace=True)
#data.drop('MaritalStatus', axis=1, inplace=True)
#data.drop('PolicyType', axis=1, inplace=True)
#data.drop('Policy', axis=1, inplace=True)
#data.drop('RenewOfferType', axis=1, inplace=True)
#data.drop('SalesChannel', axis=1, inplace=True)
#data.drop('VehicleClass', axis=1, inplace=True)
#data.drop('VehicleSize', axis=1, inplace=True)
#data.drop('ExpiryDays', axis=1, inplace=True)
#data.drop('EffectiveToDate', axis=1, inplace=True)
#data.drop('CustomerLifetimeValue', axis=1, inplace=True)
#data.drop('CLTV_Bins', axis=1, inplace=True)
#data.drop('Income', axis=1, inplace=True)
#data.drop('Income_Bins', axis=1, inplace=True)
#data.drop('MonthlyPremiumAuto', axis=1, inplace=True)
#data.drop('MonthlyPremium_Bins', axis=1, inplace=True)
#data.drop('MonthsSinceLastClaim', axis=1, inplace=True)
#data.drop('MonthsSincePolicyInception', axis=1, inplace=True)
#data.drop('NumberofOpenComplaints', axis=1, inplace=True)
#data.drop('NumberofOpenComplaints_str', axis=1, inplace=True)
#data.drop('NumberofPolicies', axis=1, inplace=True)
#data.drop('NumberofPolicies_str', axis=1, inplace=True)
#data.drop('TotalClaimAmount', axis=1, inplace=True)
#data.drop('TotalClaimAmount_Bins', axis=1, inplace=True)
#data.drop('Resp_codes', axis=1, inplace=True)

Dependent_col = ['Resp_codes', 'Response']
Independent_cols = ['Education_College', 'Education_Doctor', 'Education_HighSchoolorBelow', 'Education_Master', \
                    'EmploymentStatus_Employed', 'EmploymentStatus_MedicalLeave', 'EmploymentStatus_Retired', \
                    'EmploymentStatus_Unemployed', 'LocationCode_Suburban', 'LocationCode_Urban', 'MaritalStatus_Married', \
                    'MaritalStatus_Single', 'RenewOfferType_Offer2', 'RenewOfferType_Offer3', 'RenewOfferType_Offer4', \
                    'SalesChannel_Branch', 'SalesChannel_CallCenter', 'SalesChannel_Web', 'VehicleClass_LuxuryCar', \
                    'VehicleClass_LuxurySUV', 'VehicleClass_SUV', 'VehicleClass_SportsCar', 'VehicleClass_TwoDoorCar', \
                    'VehicleSize_Medsize', 'VehicleSize_Small', 'CLTV_10000to15000', 'CLTV_gt_15000', \
                    'Income_10Kto20K', 'Income_20Kto30K', 'Income_30Kto40K', 'Income_40Kto55K', 'Income_55Kto65K', \
                    'Income_65Kto70K', 'Income_gt_70K', 'MntlyPrem_75to95', 'MntlyPrem_95to120', 'MntlyPrem_120to140', \
                    'MntlyPrem_gt_140', 'OpnCmplnts_1', 'OpnCmplnts_2', 'OpnCmplnts_3', 'OpnCmplnts_4', 'OpnCmplnts_5', \
                    'NumPolcs_2', 'NumPolcs_3', 'NumPolcs_4', 'NumPolcs_5', 'NumPolcs_6', 'NumPolcs_7', 'NumPolcs_8', \
                    'NumPolcs_9', 'TotClmAmt_250to400', 'TotClmAmt_400to600', 'TotClmAmt_600to725', 'TotClmAmt_gt_725', \
                    'intercept']

All_cols = Dependent_col + Independent_cols

msk = np.random.rand(len(data)) < 0.8

data_train = data.loc[msk, All_cols]
#data_train.head()


data_test = data.loc[~msk, All_cols]
#data_test.head()
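
Because the mask above is drawn without a fixed seed, each run produces a different split. The sketch below shows the same boolean-mask approach with a seeded generator so that the split can be reproduced. The toy frame and the seed value 42 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator: the seed value is arbitrary, it only makes the split repeatable
rng = np.random.RandomState(42)

# Toy stand-in for the prepared dataset
toy = pd.DataFrame({"x": np.arange(1000), "y": np.arange(1000) % 2})

# Each row lands in the training set with probability 0.8
msk = rng.rand(len(toy)) < 0.8

train = toy.loc[msk]
test = toy.loc[~msk]

print(len(train), len(test))
```

Note that a random mask gives an approximately 80-20 split rather than an exact one, and it does not stratify on the response.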


## Build and Evaluate models

Next, we will build at least two models using different subsets of variables and/or different algorithms, and then compare and evaluate them. 

### Model 1 

In this model we will use the Logit class from the statsmodels package. 

In [534]:
#np.asarray(data)

logit = sm.Logit(data_train['Resp_codes'], data_train[Independent_cols])

# fit the model
mod1 = logit.fit()

         Current function value: 0.312477
         Iterations: 35



Maximum Likelihood optimization failed to converge. Check mle_retvals



#### Model 1 Summary

In [535]:
mod1.summary2()

Model Summary


Model:,Logit,Pseudo R-squared:,0.239
Dependent Variable:,Resp_codes,AIC:,4685.9173
Date:,2016-12-23 18:47,BIC:,5065.3424
No. Observations:,7322,Log-Likelihood:,-2288.0
Df Model:,54,LL-Null:,-3008.2
Df Residuals:,7267,LLR p-value:,7.7686e-266
Converged:,0.0000,Scale:,1.0
No. Iterations:,35.0000,,

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Education_College,0.1471,0.0994,1.4803,0.1388,-0.0477,0.3419
Education_Doctor,0.6445,0.1871,3.4439,0.0006,0.2777,1.0112
Education_HighSchoolorBelow,-0.0387,0.1029,-0.3767,0.7064,-0.2404,0.1629
Education_Master,0.4436,0.1449,3.0611,0.0022,0.1596,0.7277
EmploymentStatus_Employed,-0.2437,0.2357,-1.0341,0.3011,-0.7057,0.2182
EmploymentStatus_MedicalLeave,0.2066,0.2239,0.9227,0.3562,-0.2322,0.6453
EmploymentStatus_Retired,2.6552,0.2448,10.8466,0.0000,2.1754,3.1350
EmploymentStatus_Unemployed,-0.8060,,,,,
LocationCode_Suburban,1.1736,0.2493,4.7082,0.0000,0.6851,1.6622


#### Regression Equation

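Using the coefficients from the above output, the regression equation is shown below. Because this is a logistic regression, the right-hand side gives the log-odds of a 'Yes' response rather than the response itself.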


In [537]:
# Start with the intercept, which is the last entry in mod1.params
n_params = mod1.params.size
eq = 'Response = ' + str(round(mod1.params.values[n_params - 1], 4))

# Append one '(name * coefficient)' term per predictor
for i in range(0, n_params - 1):
    eq = eq + ' + (' + mod1.params.index[i] + ' * ' + str(round(mod1.params.values[i], 4)) + ')'

print(eq)

Response = -1.1426 + (Education_College * 0.1471) + (Education_Doctor * 0.6445) + (Education_HighSchoolorBelow * -0.0387) + (Education_Master * 0.4436) + (EmploymentStatus_Employed * -0.2437) + (EmploymentStatus_MedicalLeave * 0.2066) + (EmploymentStatus_Retired * 2.6552) + (EmploymentStatus_Unemployed * -0.806) + (LocationCode_Suburban * 1.1736) + (LocationCode_Urban * -0.0984) + (MaritalStatus_Married * -0.5915) + (MaritalStatus_Single * -0.574) + (RenewOfferType_Offer2 * 0.6498) + (RenewOfferType_Offer3 * -2.0635) + (RenewOfferType_Offer4 * -18.7399) + (SalesChannel_Branch * -0.6023) + (SalesChannel_CallCenter * -0.4875) + (SalesChannel_Web * -0.5745) + (VehicleClass_LuxuryCar * 0.444) + (VehicleClass_LuxurySUV * 0.8558) + (VehicleClass_SUV * 0.3209) + (VehicleClass_SportsCar * 0.4277) + (VehicleClass_TwoDoorCar * 0.0006) + (VehicleSize_Medsize * -0.2182) + (VehicleSize_Small * -0.6494) + (CLTV_10000to15000 * 0.5561) + (CLTV_gt_15000 * 0.0525) + (Income_10Kto20K * -0.2931) + (Income
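
To turn the equation into a predicted probability, the linear combination (the log-odds) is passed through the logistic function. The sketch below does this for a hypothetical customer, using the intercept and two of the coefficients printed above and assuming every other dummy variable is 0. The customer profile is an assumption for illustration, and since the fit did not converge the resulting number should be read as illustrative only.

```python
import math

# Values copied from the printed equation
intercept = -1.1426
coefs = {"Education_Master": 0.4436, "EmploymentStatus_Retired": 2.6552}

# Hypothetical customer: retired, master's degree, all other dummies equal to 0
profile = {"Education_Master": 1, "EmploymentStatus_Retired": 1}

# Linear predictor = log-odds of a 'Yes' response
log_odds = intercept + sum(coefs[name] * profile[name] for name in coefs)

# Logistic (sigmoid) function maps log-odds to a probability in (0, 1)
probability = 1 / (1 + math.exp(-log_odds))

print(round(log_odds, 4), probability)
```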


#### Predictor p-values

From the model summary above, we can see the p-values associated with the various independent variables. 

The following predictors are statistically significant at the 5% level (their p-values are less than 0.05):
                                                     

In [533]:

print('Best Predictors with p-values less than 0.05:')

mod1.pvalues[mod1.pvalues < 0.05]


Best Predictors with p-values less than 0.05:


Education_Doctor            5.734542e-04
Education_Master            2.205225e-03
EmploymentStatus_Retired    2.069753e-27
LocationCode_Suburban       2.499023e-06
MaritalStatus_Married       4.158796e-09
MaritalStatus_Single        1.743096e-06
RenewOfferType_Offer2       1.355761e-14
RenewOfferType_Offer3       4.402376e-21
SalesChannel_Branch         2.788413e-10
SalesChannel_CallCenter     9.017594e-06
SalesChannel_Web            3.584498e-06
VehicleClass_SUV            3.509634e-02
VehicleClass_SportsCar      2.954764e-02
VehicleSize_Small           1.024849e-05
CLTV_10000to15000           8.631589e-04
MntlyPrem_95to120           3.900089e-02
OpnCmplnts_2                2.222515e-05
OpnCmplnts_5                4.874909e-02
NumPolcs_2                  2.247492e-02
NumPolcs_3                  3.268327e-04
NumPolcs_6                  4.621108e-02
NumPolcs_8                  4.550807e-02
TotClmAmt_600to725          3.318136e-02
TotClmAmt_gt_725            7.003467e-04
dtype: float64
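As a rough check on where these p-values come from, the sketch below recomputes one of them by hand. It assumes the columns of the coefficient table are the coefficient, its standard error, the z statistic and the two-sided p-value, and that the p-value comes from a standard normal Wald test. The helper name `wald_p_value` is made up for this illustration.

```python
import math

def wald_p_value(coef, std_err):
    """Return the z statistic and two-sided p-value for the test coef == 0."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2))
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Education_Doctor: coefficient 0.6445, standard error 0.1871 (from the table)
z, p = wald_p_value(0.6445, 0.1871)
print('z = %.4f, p = %.6f' % (z, p))
```

The result differs slightly from the reported value in the last digits because the inputs are rounded to four decimals.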


#### Odds Ratio

The odds ratio tells us how a one-unit change in a predictor multiplies the odds of a successful response, holding the other predictors constant. For example, in the output below, Education_Master has an odds ratio of 1.558, meaning that the odds of a response for a customer with a master's degree are about 1.56 times (that is, roughly 55.8% higher than) the odds for the baseline education category. An odds ratio below 1 indicates lower odds.

Below are the odds ratio for all variables:
 

In [555]:
odds_ratio = np.exp(mod1.params)
odds_ratio.round(3)


Education_College                1.159 
Education_Doctor                 1.905 
Education_HighSchoolorBelow      0.962 
Education_Master                 1.558 
EmploymentStatus_Employed        0.784 
EmploymentStatus_MedicalLeave    1.229 
EmploymentStatus_Retired         14.228
EmploymentStatus_Unemployed      0.447 
LocationCode_Suburban            3.234 
LocationCode_Urban               0.906 
MaritalStatus_Married            0.553 
MaritalStatus_Single             0.563 
RenewOfferType_Offer2            1.915 
RenewOfferType_Offer3            0.127 
RenewOfferType_Offer4            0.000 
SalesChannel_Branch              0.548 
SalesChannel_CallCenter          0.614 
SalesChannel_Web                 0.563 
VehicleClass_LuxuryCar           1.559 
VehicleClass_LuxurySUV           2.353 
VehicleClass_SUV                 1.378 
VehicleClass_SportsCar           1.534 
VehicleClass_TwoDoorCar          1.001 
VehicleSize_Medsize              0.804 
VehicleSize_Small                0.522 
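The link between a coefficient and its odds ratio can be checked by hand. This small sketch uses two coefficients from the Model 1 output and expresses each odds ratio as a percentage change in the odds. The helper names are illustrative only.

```python
import math

def odds_ratio(coef):
    """Odds ratio implied by a logistic regression coefficient."""
    return math.exp(coef)

def percent_change_in_odds(coef):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio(coef) - 1.0) * 100.0

# Coefficients copied from the Model 1 output
for name, coef in [('Education_Master', 0.4436), ('MaritalStatus_Married', -0.5915)]:
    print('%s: odds ratio %.3f, change in odds %.1f%%'
          % (name, odds_ratio(coef), percent_change_in_odds(coef)))
```

A positive coefficient gives an odds ratio above 1 (higher odds), and a negative coefficient gives an odds ratio below 1 (lower odds).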




### Model 2

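In this model, we will use LogisticRegression() from the sklearn package. Note that LogisticRegression() applies L2 regularization by default, so its coefficients will be somewhat shrunk compared with those of Model 1.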


In [596]:
dm_text = ''.join(Dependent_col[0]) + ' ~ ' + ' + '.join(Independent_cols[0:np.size(Independent_cols)-1])
#dm_text
y, X = dmatrices(dm_text , data_train, return_type="dataframe")

y = np.ravel(y)

mod2 = LogisticRegression()
mod2 = mod2.fit(X, y)




#### Model 2 Score

To check the accuracy of the model, we can use its score() method, which returns the mean accuracy on the data passed to it (here, the training data). Also included are the coefficients for the regression equation.

In [618]:
print('Score : ')

print(mod2.score(X, y))


pd.DataFrame(zip(X.columns, np.transpose(mod2.coef_)))

Score : 
0.872848948375


Unnamed: 0,0,1
0,Intercept,[-0.584054862524]
1,Education_College,[0.137917225463]
2,Education_Doctor,[0.608657235556]
3,Education_HighSchoolorBelow,[-0.0522168011488]
4,Education_Master,[0.416126443884]
5,EmploymentStatus_Employed,[-0.286690439823]
6,EmploymentStatus_MedicalLeave,[0.144921487186]
7,EmploymentStatus_Retired,[2.48612931808]
8,EmploymentStatus_Unemployed,[-0.774881278774]
9,LocationCode_Suburban,[1.01678156189]


#### Regression Equation

Using the coefficients from the output above, we can generate the regression equation (on the log-odds scale) as below:


In [619]:
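# Pair each design-matrix column with its fitted coefficient
a = pd.DataFrame(zip(X.columns, np.transpose(mod2.coef_)))
n_terms = a[0].size

# Row 0 is the coefficient of patsy's Intercept column; sklearn's own
# mod2.intercept_ is fitted separately and is not part of this equation
eq = 'Response = ' + str(round(a[1][0],4)) 
for i in range(1, n_terms):
    eq = eq + ' + (' + a[0][i] + ' * ' +  str(round(a[1][i],4)) + ')'

print(eq)   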

Response = -0.5841 + (Education_College * 0.1379) + (Education_Doctor * 0.6087) + (Education_HighSchoolorBelow * -0.0522) + (Education_Master * 0.4161) + (EmploymentStatus_Employed * -0.2867) + (EmploymentStatus_MedicalLeave * 0.1449) + (EmploymentStatus_Retired * 2.4861) + (EmploymentStatus_Unemployed * -0.7749) + (LocationCode_Suburban * 1.0168) + (LocationCode_Urban * -0.1951) + (MaritalStatus_Married * -0.5915) + (MaritalStatus_Single * -0.5689) + (RenewOfferType_Offer2 * 0.6619) + (RenewOfferType_Offer3 * -1.9511) + (RenewOfferType_Offer4 * -3.6561) + (SalesChannel_Branch * -0.5951) + (SalesChannel_CallCenter * -0.4885) + (SalesChannel_Web * -0.574) + (VehicleClass_LuxuryCar * 0.2621) + (VehicleClass_LuxurySUV * 0.6626) + (VehicleClass_SUV * 0.2903) + (VehicleClass_SportsCar * 0.3904) + (VehicleClass_TwoDoorCar * -0.0034) + (VehicleSize_Medsize * -0.2095) + (VehicleSize_Small * -0.6404) + (CLTV_10000to15000 * 0.5359) + (CLTV_gt_15000 * 0.0472) + (Income_10Kto20K * -0.1915) + (Inco


We can see from the equation that any predictor with a negative coefficient lowers the log-odds, and hence the chances, of a positive response. So, for example, an unemployed status or a total claims amount greater than 725 reduces the chances of a response. Similarly, being retired or holding a doctorate increases the chances of a response.


### Model Comparison and Selection 

Let's prepare classification reports to check which of the two models performed better:



In [665]:

print('Model 1 performance')

# Predicted probabilities from Model 1, cut at 0.5 to obtain class labels
predictions = mod1.predict()
predictions_nominal = [ "No" if x < 0.5 else "Yes" for x in predictions]
print metrics.classification_report(data_train["Response"], predictions_nominal, digits=3)

# generate evaluation metrics
print 'Accuracy of Model 1: %1.5f' %metrics.accuracy_score(data_train["Response"], predictions_nominal)
print 'Auc Score of Model 1: %1.5f' %metrics.roc_auc_score(data_train["Resp_codes"], predictions)


print('\n\nModel 2 performance')

predicted = mod2.predict(X)
print metrics.classification_report(data_train["Resp_codes"], predicted, digits=3)

# Probability of the positive class (second column) is used for the AUC
probs = mod2.predict_proba(X)
# generate evaluation metrics
print 'Accuracy of Model 2: %1.5f' %metrics.accuracy_score(y, predicted)
print 'Auc Score of Model 2: %1.5f' %metrics.roc_auc_score(y, probs[:, 1])


Model 1 performance
             precision    recall  f1-score   support

         No      0.880     0.987     0.930      6273
        Yes      0.710     0.194     0.304      1049

avg / total      0.855     0.873     0.841      7322

Accuracy of Model 1: 0.87312
Auc Score of Model 1: 0.82698


Model 2 performance
             precision    recall  f1-score   support

          0      0.879     0.987     0.930      6273
          1      0.712     0.189     0.298      1049

avg / total      0.855     0.873     0.840      7322

Accuracy of Model 2: 0.87285
Auc Score of Model 2: 0.82679



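From the output above, we can see that both models perform almost identically on the training data, with Model 1 marginally ahead (accuracy 0.87312 vs 0.87285, AUC 0.82698 vs 0.82679). Both models, however, recall fewer than 20% of the "Yes" responders.

We will now apply both models to the test data and check their performance.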
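To put the training accuracy figures in context, it can help to compare them with a trivial model that always predicts the majority class. The sketch below uses the class counts from the "support" column of the training reports (6273 "No", 1049 "Yes") and the accuracies printed above. It is only an arithmetic illustration.

```python
# Class counts taken from the "support" column of the training reports
n_no, n_yes = 6273, 1049

# Accuracy of a trivial model that always predicts "No"
baseline_accuracy = n_no / float(n_no + n_yes)

# Training accuracies reported above
model1_accuracy = 0.87312
model2_accuracy = 0.87285

print('Majority-class baseline accuracy: %.5f' % baseline_accuracy)
print('Model 1 gain over baseline: %.5f' % (model1_accuracy - baseline_accuracy))
print('Model 2 gain over baseline: %.5f' % (model2_accuracy - baseline_accuracy))
```

Because the classes are imbalanced, accuracy alone says little here. The AUC and the recall on the "Yes" class are more informative for comparing the models.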



## Final Model Performance on Test Data 





In [680]:
X = data_test[Independent_cols]
#X.drop('intercept', axis=1, inplace=True)

print('Model 1 performance')

# Predicted probabilities from Model 1, cut at 0.5 to obtain class labels
predictions = mod1.predict(X)
predictions_nominal = [ "No" if x < 0.5 else "Yes" for x in predictions]
print metrics.classification_report(data_test["Response"], predictions_nominal, digits=3)

# generate evaluation metrics
print 'Accuracy of Model 1: %1.5f' %metrics.accuracy_score(data_test["Response"], predictions_nominal)
print 'Auc Score of Model 1: %1.5f' %metrics.roc_auc_score(data_test["Resp_codes"], predictions)


print('\n\nModel 2 performance')

# Note: mod2 was fit on the patsy design matrix (Intercept column first);
# X here follows Independent_cols, so its column order may differ
predicted = mod2.predict(X)
print metrics.classification_report(data_test["Resp_codes"], predicted, digits=3)

probs = mod2.predict_proba(X)
# generate evaluation metrics
print 'Accuracy of Model 2: %1.5f' %metrics.accuracy_score(data_test["Resp_codes"], predicted)
print 'Auc Score of Model 2: %1.5f' %metrics.roc_auc_score(data_test["Resp_codes"], probs[:, 1])


Model 1 performance
             precision    recall  f1-score   support

         No      0.877     0.988     0.929      1553
        Yes      0.705     0.166     0.269       259

avg / total      0.852     0.871     0.835      1812

Accuracy of Model 1: 0.87086
Auc Score of Model 1: 0.81221


Model 2 performance
             precision    recall  f1-score   support

          0      0.855     0.936     0.894      1553
          1      0.108     0.046     0.065       259

avg / total      0.748     0.809     0.775      1812

Accuracy of Model 2: 0.80905
Auc Score of Model 2: 0.46094


## Conclusion

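Based on the output above, Model 1 performed better on unseen data (the test data): its accuracy (0.87086) and AUC (0.81221) are close to its training figures. Hence it makes sense to use Model 1 for further implementation. Model 2's test AUC of 0.46094 falls below the 0.5 expected from random guessing, despite a training AUC of 0.82679. This points to a possible problem in how the test features were passed to Model 2 rather than to a genuine loss of predictive power, so that part of the comparison should be re-verified.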
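Because Model 2 was fit on the patsy design matrix while the test cell passes `data_test[Independent_cols]`, it is worth confirming that both matrices share the same column order before relying on the Model 2 test figures. Below is a minimal sketch of aligning columns by name with pandas. The column names and values are hypothetical stand-ins, not the project data.

```python
import pandas as pd

# Hypothetical column order of the training design matrix (constant first)
train_columns = ['Intercept', 'Education_Doctor', 'LocationCode_Suburban']

# Hypothetical test frame: same features, constant last and named differently
X_test_raw = pd.DataFrame({
    'Education_Doctor': [0, 1],
    'LocationCode_Suburban': [1, 0],
    'intercept': [1.0, 1.0],
})

# Rename the constant, then select the columns by name in the training order
X_test_aligned = (X_test_raw
                  .rename(columns={'intercept': 'Intercept'})
                  .reindex(columns=train_columns))

print(list(X_test_aligned.columns))
```

Selecting columns by name in this way keeps each coefficient paired with the feature it was estimated for, regardless of how the test frame was assembled.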


#### Further steps

There are several steps that could be tried in order to improve the models:

- Run a feature selection algorithm or PCA to find the most informative features
- Look for interaction terms and include them in the model
- Look at the distribution of the features and see whether any normalization / regularization needs to be carried out
- Finally, try using a non-linear model
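Since both models reach a recall below 0.2 on the "Yes" class with the 0.5 cut-off used above, adjusting the classification threshold is another step that could be explored. The sketch below uses made-up labels and probabilities (not the project data) to show how precision and recall move as the threshold is lowered. The function name `precision_recall` is illustrative.

```python
def precision_recall(y_true, probs, threshold):
    """Precision and recall of the positive class at a given cut-off."""
    predicted = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, predicted) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, predicted) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, predicted) if y == 1 and yhat == 0)
    precision = tp / float(tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / float(tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up example: 3 responders out of 10 customers
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
probs = [0.70, 0.40, 0.30, 0.45, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05]

for threshold in (0.5, 0.25):
    precision, recall = precision_recall(y_true, probs, threshold)
    print('threshold %.2f: precision %.2f, recall %.2f' % (threshold, precision, recall))
```

Lowering the threshold typically trades some precision for higher recall. Which balance is appropriate depends on the cost of contacting a non-responder versus missing a responder.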
