# Case Study

## 1. Objective

1. Predict the most probable policy for a new customer
2. Recommend an alternate policy to existing customers
3. Identify the factors affecting customer lifetime value
4. Understand customer demographics and behaviour

 
Let's start with the first part: predicting the policy type.

## 2. Section 1 - Data Extraction

In [124]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# Read the dataset into a DataFrame, using the customer ID as the index
df = pd.read_csv("policy_data.csv", index_col='Customer')
# Features used to predict 'Policy Type'
columns = ['State', 'Coverage', 'Education', 'Gender', 'Income', 'Location Code',
           'Marital Status', 'Sales Channel', 'Vehicle Class', 'Vehicle Size']
# Drop columns that are not used in this section
df.drop(['EmploymentStatus', 'Customer Lifetime Value'], axis=1, inplace=True)

# Preview the target and the selected features
df[['Policy Type'] + columns].head()

| Customer | Policy Type | State | Coverage | Education | Gender | Income | Location Code | Marital Status | Sales Channel | Vehicle Class | Vehicle Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| QC35222 | Corporate Auto | California | Basic | Bachelor | F | 48269 | Urban | Married | Web | Four-Door Car | Medsize |
| AE98193 | Personal Auto | Washington | Basic | High School or Below | M | 0 | Suburban | Single | Branch | SUV | Medsize |
| TM23514 | Personal Auto | Oregon | Extended | College | M | 60145 | Urban | Single | Web | SUV | Medsize |
| WB38524 | Personal Auto | California | Basic | High School or Below | M | 46131 | Suburban | Married | Branch | Two-Door Car | Small |
| QZ42725 | Personal Auto | Washington | Basic | Bachelor | F | 0 | Suburban | Single | Agent | Four-Door Car | Medsize |


In [25]:
# Pairwise correlations between the numeric columns
df.corr(numeric_only=True)

|  | Income | Monthly Premium Auto | Months Since Last Claim | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Total Claim Amount |
|---|---|---|---|---|---|---|---|
| Income | 1.0 | -0.016665 | -0.026715 | -0.000875 | 0.006408 | -0.008656 | -0.355254 |
| Monthly Premium Auto | -0.016665 | 1.0 | 0.005026 | 0.020257 | -0.013122 | -0.011233 | 0.632017 |
| Months Since Last Claim | -0.026715 | 0.005026 | 1.0 | -0.042959 | 0.005354 | 0.009136 | 0.007563 |
| Months Since Policy Inception | -0.000875 | 0.020257 | -0.042959 | 1.0 | -0.001158 | -0.013333 | 0.003335 |
| Number of Open Complaints | 0.006408 | -0.013122 | 0.005354 | -0.001158 | 1.0 | 0.001498 | -0.014241 |
| Number of Policies | -0.008656 | -0.011233 | 0.009136 | -0.013333 | 0.001498 | 1.0 | -0.002354 |
| Total Claim Amount | -0.355254 | 0.632017 | 0.007563 | 0.003335 | -0.014241 | -0.002354 | 1.0 |
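
A correlation matrix is easier to read once the strongest pairs are pulled out. The following is a minimal sketch, not part of the original notebook: the helper name `strongest_pairs` is hypothetical, and the small matrix only reuses three of the columns and values shown above.

```python
import numpy as np
import pandas as pd

def strongest_pairs(corr: pd.DataFrame, top: int = 3) -> pd.Series:
    """Return the `top` off-diagonal pairs with the largest absolute correlation."""
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Sort by absolute value, largest first
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(top)

# Three of the columns from the matrix above, with the values shown there
cols = ["Income", "Monthly Premium Auto", "Total Claim Amount"]
corr = pd.DataFrame(
    [[1.0, -0.016665, -0.355254],
     [-0.016665, 1.0, 0.632017],
     [-0.355254, 0.632017, 1.0]],
    index=cols, columns=cols,
)
top_pairs = strongest_pairs(corr, top=2)
print(top_pairs)
```

On the full matrix the same call would be `strongest_pairs(df.corr(numeric_only=True))`.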


## 3. Exploratory Data Analysis
First, let us look at the distribution of existing customers.
![one](images/5.png)

### Selection of Policy based on customer characteristics
![one](images/policy_personal.png)

### Selection of Policy based on vehicle
![one](images/7.png)


## 4. Modeling

In [67]:
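from sklearn.preprocessing import LabelEncoder
# Categorical (object-typed) columns among the target and the selected features
selected = df[['Policy Type'] + columns]
categorical_variables = selected.dtypes[selected.dtypes == 'object'].index
categorical_variables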

Index([u'Policy Type', u'State', u'Coverage', u'Education', u'Gender',
       u'Location Code', u'Marital Status', u'Sales Channel', u'Vehicle Class',
       u'Vehicle Size'],
      dtype='object')

In [68]:
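# Encode each categorical column as integer codes (the encoder is refit for every column)
le = LabelEncoder()
for var in categorical_variables:
    df[var] = le.fit_transform(df[var])

df[['Policy Type'] + columns].head()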

| Customer | Policy Type | State | Coverage | Education | Gender | Income | Location Code | Marital Status | Sales Channel | Vehicle Class | Vehicle Size |
|---|---|---|---|---|---|---|---|---|---|---|---|
| QC35222 | 0 | 1 | 0 | 0 | 0 | 48269 | 2 | 1 | 3 | 0 | 1 |
| AE98193 | 1 | 4 | 0 | 3 | 1 | 0 | 1 | 2 | 1 | 3 | 1 |
| TM23514 | 1 | 3 | 1 | 1 | 1 | 60145 | 2 | 2 | 3 | 3 | 1 |
| WB38524 | 1 | 1 | 0 | 3 | 1 | 46131 | 1 | 1 | 1 | 5 | 2 |
| QZ42725 | 1 | 4 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 1 |
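
The loop above reuses one `LabelEncoder` for every column, so after it finishes only the last column's mapping is still available. When the codes have to be decoded later, or new customers have to be encoded the same way, one option is to keep a separate encoder per column. A minimal sketch on a toy frame (the `toy` data and the `encoders` dictionary are illustrative assumptions, not part of the original notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two of the categorical columns from the table above
toy = pd.DataFrame({
    "Coverage": ["Basic", "Extended", "Basic"],
    "Gender": ["F", "M", "M"],
})

# One fitted encoder per column, so every mapping can be reused later
encoders = {}
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Map an encoded column back to its original labels
coverage_labels = encoders["Coverage"].inverse_transform(toy["Coverage"])
print(list(coverage_labels))

# Encode a new value with the mapping learned above
new_code = encoders["Gender"].transform(["M"])[0]
print(new_code)
```

`LabelEncoder` assigns codes in sorted label order, which is why `Basic` maps to 0 and `Extended` to 1 here, as in the encoded table.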


In [69]:
# Select the features and the target by name rather than by position
X = df[columns]
y = df['Policy Type']

In [70]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 0)

In [71]:
from sklearn.tree import DecisionTreeClassifier
decision = DecisionTreeClassifier()
decision = decision.fit(X_train, y_train)
y_pred = decision.predict(X_test)

In [72]:
from sklearn.metrics import accuracy_score
result = accuracy_score(y_test, y_pred) * 100
result

59.44170771756979

In [73]:
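# Pair each feature name with its importance (in percent); the loop variable
# must not be called df, or it would overwrite the DataFrame
for name, importance in zip(columns, decision.feature_importances_):
    print(name, importance * 100)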

('State', 4.142330515271204)
('Coverage', 5.959836921223922)
('Education', 9.563685604328315)
('Gender', 4.416232854200223)
('Income', 36.60680211433712)
('Location Code', 5.631057088638507)
('Marital Status', 5.88444657451218)
('Sales Channel', 11.594660769923756)
('Vehicle Class', 8.773590033980124)
('Vehicle Size', 7.4273575235846305)
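
Importances are easier to compare when they are indexed by feature name and sorted. A minimal sketch on synthetic data, assuming placeholder feature names `f0` to `f3` and a `make_classification` dataset rather than the case-study columns:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the feature names are placeholders
X_arr, y_demo = make_classification(n_samples=200, n_features=4, n_informative=2,
                                    n_redundant=0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)

# Index the importances by the training columns so labels cannot drift out of order
importances = pd.Series(tree.feature_importances_, index=X_demo.columns)
ranked = (importances * 100).sort_values(ascending=False)
print(ranked.round(2))
```

With the notebook's objects the equivalent would be `pd.Series(decision.feature_importances_, index=X.columns)`.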


In [85]:
from sklearn.naive_bayes import GaussianNB
GNBClassifier = GaussianNB()
GNBClassifier = GNBClassifier.fit(X_train, y_train)
y_pred = GNBClassifier.predict(X_test)
result = accuracy_score(y_test, y_pred) * 100
result

74.71264367816092

In [75]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier()
knnClassifier = classifier.fit(X_train, y_train)
y_pred = knnClassifier.predict(X_test)
result = accuracy_score(y_test, y_pred)*100
result

68.74657909140667

In [81]:
from sklearn.ensemble import RandomForestClassifier
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
classifier = RandomForestClassifier(max_depth=10, min_samples_leaf=10, max_features='sqrt')
rfClassifier = classifier.fit(X_train, y_train)
y_pred = rfClassifier.predict(X_test)
result = accuracy_score(y_test, y_pred) * 100
result

74.71264367816092
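
The naive Bayes and random forest scores above are identical. One possible explanation, which the notebook does not verify, is that both models predict the most frequent policy type for nearly every row. Comparing against a majority-class baseline is a quick way to check. A minimal sketch with hypothetical labels (a 75/25 class split chosen for illustration, not taken from the case-study data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels with a 3:1 class imbalance; not the case-study data
rng = np.random.default_rng(0)
y_demo = np.array([1] * 75 + [0] * 25)
X_demo = rng.normal(size=(100, 3))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo)) * 100
print(baseline_acc)  # share of the majority class, in percent
```

A model is only adding value if its test accuracy is clearly above this baseline; `y_test.value_counts(normalize=True)` gives the same reference number directly.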

In [82]:
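# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
# Note: the file is named nb.pkl, but it stores the random forest model
joblib.dump(rfClassifier, 'model/nb.pkl')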

['model/nb.pkl']
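
A saved model can be restored with `joblib.load` and used for prediction without retraining. A minimal round-trip sketch, assuming a stand-in `DummyClassifier` and a temporary file instead of the notebook's `rfClassifier` and `model/nb.pkl`:

```python
import os
import tempfile

import joblib
from sklearn.dummy import DummyClassifier

# Stand-in model; in the case study this would be the fitted rfClassifier
model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [1, 1, 0])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")  # hypothetical path for this sketch
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored object behaves like the original estimator
prediction = restored.predict([[5]])
print(prediction)
```

Pickled models are generally only safe to load with the same scikit-learn version that wrote them.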

## 5. Serving endpoints with Flask API

The trained model is served through a Flask API on localhost, port 5000.

![one](images/a.png)
![one](images/b.png)
![one](images/c.png)
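
The source only states that Flask is used on port 5000, so the following is a rough sketch of what such an endpoint could look like, not the author's implementation. The `/predict` route, the JSON payload shape, the `create_app` factory, and the `policy_type` response key are all assumptions made for illustration.

```python
from flask import Flask, jsonify, request
from sklearn.dummy import DummyClassifier

# Feature order used for training in Section 1
FEATURES = ['State', 'Coverage', 'Education', 'Gender', 'Income', 'Location Code',
            'Marital Status', 'Sales Channel', 'Vehicle Class', 'Vehicle Size']

def create_app(model):
    """Build a Flask app that serves predictions from an already fitted model."""
    app = Flask(__name__)

    @app.route("/predict", methods=["POST"])  # hypothetical route name
    def predict():
        payload = request.get_json(silent=True) or {}
        missing = [name for name in FEATURES if name not in payload]
        if missing:
            return jsonify({"error": "missing features", "fields": missing}), 400
        # Values must already be label-encoded, in the training column order
        row = [[payload[name] for name in FEATURES]]
        return jsonify({"policy_type": int(model.predict(row)[0])})

    return app

# Stand-in model so the sketch runs on its own; the case study would instead
# pass joblib.load('model/nb.pkl')
demo_model = DummyClassifier(strategy="most_frequent").fit([[0] * 10, [1] * 10], [1, 1])
app = create_app(demo_model)
# app.run(port=5000)  # uncomment to serve on localhost:5000
```

Passing the model into a factory keeps the route logic testable with `app.test_client()` without touching the pickle file.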


## Section 2 - Customer Lifetime Value (CLV)

### 2.1 - Overall

![one](images/20.png)

### 2.2 - Based on Complaints

![one](images/24.png)

### 2.3 - Based on Demographics

![one](images/21.png)

### 2.4 - Based on Sales

![one](images/22.png)

### 2.5 - Based on Personal Characteristics

![one](images/23.png)

In [18]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

#Reading the dataset in a dataframe using Pandas
df = pd.read_csv("data.csv")


In [8]:
#finding missing values
df.isnull().sum()


State                            0
Customer Lifetime Value          0
Response                         0
Coverage                         0
Education                        0
Effective To Date                0
EmploymentStatus                 0
Gender                           0
Income                           0
Location Code                    0
Marital Status                   0
Monthly Premium Auto             0
Months Since Last Claim          0
Months Since Policy Inception    0
Number of Open Complaints        0
Number of Policies               0
Policy Type                      0
Policy                           0
Renew Offer Type                 0
Sales Channel                    0
Total Claim Amount               0
Vehicle Class                    0
Vehicle Size                     0
dtype: int64

In [9]:
from sklearn.preprocessing import LabelEncoder
categorical_variables = df.dtypes[df.dtypes == 'object'].index
categorical_variables

Index([u'State', u'Response', u'Coverage', u'Education', u'Effective To Date',
       u'EmploymentStatus', u'Gender', u'Location Code', u'Marital Status',
       u'Policy Type', u'Policy', u'Renew Offer Type', u'Sales Channel',
       u'Vehicle Class', u'Vehicle Size'],
      dtype='object')

In [10]:
le = LabelEncoder()
for var in categorical_variables:
    df[var] = le.fit_transform(df[var])

df.head()

| Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | Income | Location Code | ... | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QC35222 | 1 | 2683.470677 | 0 | 0 | 0 | 0 | 1 | 0 | 48269 | 2 | ... | 79 | 3 | 1 | 0 | 1 | 2 | 3 | 282.151207 | 0 | 1 |
| AE98193 | 4 | 7859.414569 | 0 | 0 | 3 | 0 | 4 | 1 | 0 | 1 | ... | 10 | 0 | 7 | 1 | 3 | 0 | 1 | 813.6 | 3 | 1 |
| TM23514 | 3 | 10272.6082 | 0 | 1 | 1 | 0 | 1 | 1 | 60145 | 2 | ... | 28 | 0 | 3 | 1 | 5 | 2 | 3 | 580.473259 | 3 | 1 |
| WB38524 | 1 | 2969.593296 | 0 | 0 | 3 | 0 | 1 | 1 | 46131 | 1 | ... | 28 | 0 | 1 | 1 | 5 | 1 | 1 | 355.2 | 5 | 2 |
| QZ42725 | 4 | 2310.882998 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 1 | ... | 24 | 0 | 1 | 1 | 5 | 1 | 0 | 460.8 | 0 | 1 |


In [19]:
pd.cut(df['Customer Lifetime Value'], 8).head()

0    (1816.58, 12076.429]
1    (1816.58, 12076.429]
2    (1816.58, 12076.429]
3    (1816.58, 12076.429]
4    (1816.58, 12076.429]
Name: Customer Lifetime Value, dtype: category
Categories (8, interval[float64]): [(1816.58, 12076.429] < (12076.429, 22254.851] < (22254.851, 32433.273] < (32433.273, 42611.694] < (42611.694, 52790.116] < (52790.116, 62968.538] < (62968.538, 73146.96] < (73146.96, 83325.381]]

In [21]:
custom_bucket_array = np.linspace(0, 20, 9)
custom_bucket_array

array([  0. ,   2.5,   5. ,   7.5,  10. ,  12.5,  15. ,  17.5,  20. ])

In [24]:
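# Note: pd.cut returns NaN for values outside the bin edges, and the edges above
# (0 to 20) lie far below the CLV values, which are in the thousands
df['Customer Lifetime Value'] = pd.cut(df['Customer Lifetime Value'], custom_bucket_array)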
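
The custom edges above run from 0 to 20, while the CLV values shown earlier are in the thousands, so this `pd.cut` call leaves every row unassigned. A minimal sketch of two alternatives, using only the five CLV values from the table above (the choice of four bins is an assumption made for illustration):

```python
import numpy as np
import pandas as pd

# The five CLV values shown in the table above
clv = pd.Series([2683.470677, 7859.414569, 10272.6082, 2969.593296, 2310.882998])

# Edges that do not cover the data leave every value unassigned (NaN)
too_small = pd.cut(clv, np.linspace(0, 20, 9))

# Edges derived from the observed range cover every value
edges = np.linspace(clv.min(), clv.max(), 5)
equal_width = pd.cut(clv, edges, include_lowest=True)

# Quantile-based bins give groups of roughly equal size instead
quartile = pd.qcut(clv, 4, labels=False)

print(too_small.isna().sum(), equal_width.isna().sum(), quartile.isna().sum())
```

Equal-width bins keep the scale interpretable, while quantile bins avoid near-empty buckets when the distribution is heavily skewed, as the single populated interval in the `pd.cut(..., 8)` output above suggests.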

In [11]:
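# Note: column 0 is 'State', so this uses 'State' as the target, and
# 'Customer Lifetime Value' (column 1) is among the features
X = df.iloc[:, 1:]
y = df.iloc[:, 0]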

In [12]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 0)

In [13]:
from sklearn.tree import DecisionTreeRegressor
decision = DecisionTreeRegressor(max_depth=2)
decision = decision.fit(X_train, y_train)
y_pred = decision.predict(X_test)
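
The notebook fits the regressor but does not score it. A minimal sketch of how the held-out predictions could be evaluated, using synthetic stand-in data (the data-generating formula and variable names are assumptions, not the case-study data):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: the target depends on the first feature plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 3))
y_demo = 5 * X_demo[:, 0] + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_tr, y_tr)
pred = reg.predict(X_te)

r2 = r2_score(y_te, pred)              # 1.0 is a perfect fit; 0.0 matches always predicting the mean
mae = mean_absolute_error(y_te, pred)  # average absolute error, in target units
print(round(r2, 3), round(mae, 3))
```

With the notebook's objects the same two calls would be `r2_score(y_test, y_pred)` and `mean_absolute_error(y_test, y_pred)`.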

In [14]:
# Pair each importance with its feature column in X (not df, whose first column
# is the target, which would shift every label by one)
for name, importance in zip(X.columns, decision.feature_importances_):
    print(name, importance * 100)

('Customer Lifetime Value', 0.0)
('Response', 0.0)
('Coverage', 0.0)
('Education', 0.0)
('Effective To Date', 0.0)
('EmploymentStatus', 0.0)
('Gender', 0.0)
('Income', 0.0)
('Location Code', 0.0)
('Marital Status', 0.0)
('Monthly Premium Auto', 39.16369938736544)
('Months Since Last Claim', 0.0)
('Months Since Policy Inception', 34.031533891309593)
('Number of Open Complaints', 0.0)
('Number of Policies', 0.0)
('Policy Type', 0.0)
('Policy', 0.0)
('Renew Offer Type', 0.0)
('Sales Channel', 26.804766721324981)
('Total Claim Amount', 0.0)
('Vehicle Class', 0.0)
('Vehicle Size', 0.0)


## Section 3 - Demographic