# The Root Insurance Project

## The goals of the project

- Understand the dataset: we have a dataset of 10,000 ad "impressions", in which the relationship between customers and the insurance ads is hidden.
- The marketing manager of the insurance company wants to understand how to bid differently for different customers in order to improve ad performance.
- Bidding strategy: "optimize the cost per customer while having 4% customer rate over all ads shown". Bidding higher places the ad higher in the ranking, but we do not know how the bid changes the rank.
- Find some interesting relations for the website manager.

# Exploratory data analysis

We'll first explore what the dataset looks like and which features describe the customers.

In [2]:
# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
# read the data as dataframe
df = pd.read_csv("Root_insurance_data.csv")

df.head()

Unnamed: 0,Currently Insured,Number of Vehicles,Number of Drivers,Marital Status,bid,rank,click,policies_sold
0,unknown,2,1,M,10.0,2,False,0
1,Y,3,1,M,10.0,5,False,0
2,Y,3,2,M,10.0,5,False,0
3,N,3,2,S,10.0,4,False,0
4,unknown,2,2,S,10.0,2,False,0


In [4]:
df.describe()

Unnamed: 0,Number of Vehicles,Number of Drivers,bid,rank,policies_sold
count,10000.0,10000.0,10000.0,10000.0,10000.0
mean,1.9654,1.4999,10.0,3.1841,0.0783
std,0.807755,0.500025,0.0,1.377242,0.268657
min,1.0,1.0,10.0,1.0,0.0
25%,1.0,1.0,10.0,2.0,0.0
50%,2.0,1.0,10.0,3.0,0.0
75%,3.0,2.0,10.0,4.0,0.0
max,3.0,2.0,10.0,5.0,1.0


In [5]:
cols = df.columns
df_unique = df[cols[:4]].drop_duplicates()
print(len(df_unique))
df_unique.head()

35


Unnamed: 0,Currently Insured,Number of Vehicles,Number of Drivers,Marital Status
0,unknown,2,1,M
1,Y,3,1,M
2,Y,3,2,M
3,N,3,2,S
4,unknown,2,2,S


## The Customer features

There are four main features for each customer: `Currently Insured`, `Number of Vehicles`, `Number of Drivers` and `Marital Status`. Each can serve as a categorical feature, giving 3 × 3 × 2 × 2 = 36 possible combinations in total. Note that missing data for `Currently Insured` is recorded as `unknown`. We could interpret this missing value as lying between insured and not insured, but given the small number of features, we treat it as an independent category.

Moreover, after removing duplicate rows, only 35 unique combinations of features appear in the data.

Only two features are numerical (integers): `Number of Vehicles` and `Number of Drivers`, taking values in [1, 3] and [1, 2], respectively. So there are not many numerical relations to explore.
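The gap between 36 possible and 35 observed combinations can be located with a set difference. The sketch below is plain Python; the `observed` set is a made-up stand-in for the rows of `df_unique`, and the combination dropped from it is arbitrary, not the one actually missing from the data.

```python
from itertools import product

# Category values as they appear in the dataset.
insured = ["Y", "N", "unknown"]
vehicles = [1, 2, 3]
drivers = [1, 2]
marital = ["M", "S"]

# Every possible customer type: 3 * 3 * 2 * 2 combinations.
all_types = set(product(insured, vehicles, drivers, marital))

# Hypothetical stand-in for the observed types: drop one arbitrary
# combination purely for illustration.
observed = all_types - {("N", 1, 2, "M")}

missing = all_types - observed
print(len(all_types), len(observed), missing)
```

With the real data, `observed` could presumably be built from the four feature columns of `df_unique`, for example with `itertuples(index=False, name=None)`.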

Since the total number of samples ($10,000$) is much larger than the number of unique customer types ($35$), we expect that:
- 1). For each type of customer, there is a distribution over the ad ranks shown, $P_i(r)$, where $i$ is the customer type and $r \in [1,5]$ denotes the ad rank;
- 2). Given the customer features and the rank, there is a probability that the customer clicks on the ad, $P_C$;
- 3). Given that the customer clicked, there is a conditional probability that a policy is sold, $P_{S|C}$.

Thus, statistically, we are dealing less with a classification problem than with a probability-estimation problem, where the accuracy of individual predictions is not very important. However, from the perspective of the insurance company, we do want to invest more in the types of customers who are likely to buy a policy after clicking (only clicked ads are paid for). By investing more (bidding more), we could improve the rank of the ad and thus increase the click probability.
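As a rough illustration of these quantities, the click probability and $P_{S|C}$ for each customer type can be estimated by simple counting. The sketch below uses plain Python and a handful of made-up impressions; the type labels `"A"` and `"B"` and the helper name `rates_by_type` are hypothetical.

```python
from collections import defaultdict

def rates_by_type(impressions):
    """Return {type: (P(click), P(sold | click))} from (type, rank, clicked, sold) rows."""
    shown = defaultdict(int)
    clicks = defaultdict(int)
    sales = defaultdict(int)
    for ctype, _rank, clicked, sold in impressions:
        shown[ctype] += 1
        clicks[ctype] += int(clicked)
        sales[ctype] += int(sold)
    out = {}
    for ctype in shown:
        p_click = clicks[ctype] / shown[ctype]
        # P(sold | click) is undefined for a type with no clicks.
        p_sold_click = sales[ctype] / clicks[ctype] if clicks[ctype] else float("nan")
        out[ctype] = (p_click, p_sold_click)
    return out

# Hypothetical impressions: (customer_type, rank, clicked, sold).
toy_impressions = [
    ("A", 1, True, True),
    ("A", 1, True, False),
    ("A", 3, False, False),
    ("A", 3, False, False),
    ("B", 2, True, False),
    ("B", 5, False, False),
    ("B", 5, False, False),
    ("B", 5, False, False),
]
toy_rates = rates_by_type(toy_impressions)
print(toy_rates)
```

The rank distribution $P_i(r)$ could be tallied the same way by counting `(type, rank)` pairs.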

In [6]:
from sklearn.preprocessing import FunctionTransformer

In [7]:
def one_hot_encode(df):
    df_copy = df.copy()
    
    hot_encoding = pd.get_dummies(df_copy['Currently Insured'])
    df_copy[hot_encoding.columns] = hot_encoding[hot_encoding.columns]
    
    return df_copy

In [8]:
one_hot = FunctionTransformer(one_hot_encode)

In [9]:
df = one_hot.transform(df)
df.loc[df.click == True, "Click"] = int(1)
df.loc[df.click == False, "Click"] = int(0)
df.loc[df['Marital Status'] == "M", "Marital"] = int(1)
df.loc[df['Marital Status'] == "S", "Marital"] = int(0)
# convert to integers
df = df.astype({"Click": int}) 
#df.loc[df.click == True][:5]
df.describe()

Unnamed: 0,Number of Vehicles,Number of Drivers,bid,rank,policies_sold,N,Y,unknown,Click,Marital
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,1.9654,1.4999,10.0,3.1841,0.0783,0.3444,0.3419,0.3137,0.1878,0.5191
std,0.807755,0.500025,0.0,1.377242,0.268657,0.475196,0.47437,0.464019,0.390572,0.49966
min,1.0,1.0,10.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,10.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2.0,1.0,10.0,3.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,3.0,2.0,10.0,4.0,0.0,1.0,1.0,1.0,0.0,1.0
max,3.0,2.0,10.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0


In [10]:
df_unique = one_hot.transform(df_unique)
# build the masks from df_unique itself rather than from df
df_unique.loc[df_unique['Marital Status'] == "M", "Marital"] = int(1)
df_unique.loc[df_unique['Marital Status'] == "S", "Marital"] = int(0)
#df.loc[df.click == True][:5]
df_unique.describe()

Unnamed: 0,Number of Vehicles,Number of Drivers,N,Y,unknown,Marital
count,35.0,35.0,35.0,35.0,35.0,35.0
mean,1.971429,1.485714,0.342857,0.342857,0.314286,0.514286
std,0.821967,0.507093,0.481594,0.481594,0.471008,0.507093
min,1.0,1.0,0.0,0.0,0.0,0.0
25%,1.0,1.0,0.0,0.0,0.0,0.0
50%,2.0,1.0,0.0,0.0,0.0,1.0
75%,3.0,2.0,1.0,1.0,1.0,1.0
max,3.0,2.0,1.0,1.0,1.0,1.0


## The new target for the problem

Notice that the probabilities are all conditional, e.g. the probability of being shown at rank 2 and then clicked but not sold. Thus, we actually have a classification problem with 15 classes (5 ranks, each with three outcomes: click and sold, click but not sold, no click). Here, we create this new class label from the rank:
$$ newtar = 3(r-1) + i, \quad i = 0, 1, 2,$$
where $i=0,1,2$ stands for sold, clicked but not sold, and not clicked, respectively. However, some of these targets have too few samples to stratify on. A more suitable choice is to assume that the sold rate is independent of the rank once the customer clicks; then we only need 10 classes:
$$ newtar = 2(r-1) + i, \quad i = 0, 1,$$
where $i=0$ for a click and $i=1$ for no click.
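The 10-class encoding can be checked with a small round-trip sketch. It follows the convention of the code cell that builds `newtarget` (a click gives the even label, no click the odd one); the helper names `encode_target` and `decode_target` are illustrative only.

```python
def encode_target(rank, clicked):
    """Map (rank in 1..5, clicked) to a label in 0..9: even = click, odd = no click."""
    return 2 * (rank - 1) + (0 if clicked else 1)

def decode_target(label):
    """Invert encode_target: return (rank, clicked)."""
    rank_minus_one, i = divmod(label, 2)
    return rank_minus_one + 1, i == 0

labels = [encode_target(r, c) for r in range(1, 6) for c in (True, False)]
print(labels)                 # the ten labels 0..9, each used exactly once
print(encode_target(2, False))  # rank 2, no click
print(decode_target(7))
```

For instance, an impression at rank 2 without a click maps to label 3, which matches the `newtarget` value of the first row shown later in the notebook.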

## However

How should we deal with the relation between the bid and the ranking? The dataset gives us no further information on it, so we would need to look for additional supporting data. But we can adopt a simple and very reasonable assumption:
`The overall buying probability of a particular type of clicked customers is independent of their ranking`
After all, the ranking reflects how the market (other companies) values the customer (how much they want to win this customer). Once the customer has clicked, the probability of buying should be an intrinsic feature of the customer. Thus, if we view a customer as a stock, the ranking is more like the market price, while the buying probability is like the EPS (earnings per share), which measures how profitable the company behind the stock is.
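One way to probe this assumption would be to compare the empirical $P_{S|C}$ across ranks: if it holds, the values should be roughly equal. A minimal sketch on made-up rows (the function name `sold_rate_by_rank` and the toy data are hypothetical):

```python
from collections import defaultdict

def sold_rate_by_rank(rows):
    """rows: iterable of (rank, clicked, sold). Returns {rank: P(sold | click)}."""
    clicks = defaultdict(int)
    sales = defaultdict(int)
    for rank, clicked, sold in rows:
        if clicked:
            clicks[rank] += 1
            sales[rank] += int(sold)
    # Ranks without any click are left out, since the ratio is undefined there.
    return {rank: sales[rank] / clicks[rank] for rank in sorted(clicks)}

toy_rows = [
    (1, True, True), (1, True, False), (1, False, False),
    (3, True, True), (3, True, False), (3, True, False), (3, True, True),
]
print(sold_rate_by_rank(toy_rows))
```

On real data the per-rank estimates would be noisy, so differences should be read against the number of clicks behind each rank.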

In [11]:
# new target label by relating rank(5), click (2) 
# total 10 classes (no sold for no click),[0,9]
# even label = click, odd label = no click, within each rank;
# computed for every row of df
df['newtarget'] = (2 * (df["rank"] - 1) + (1 - df["Click"])).astype(int)

In [12]:
X = df[["Number of Vehicles", "Number of Drivers", "N","Y","unknown","Marital"]].copy()
X_unique = df_unique[["Number of Vehicles", "Number of Drivers", "N","Y","unknown","Marital"]].copy()
y_rank = df["rank"].copy()
y_click = df["Click"].copy()
y_sold = df["policies_sold"].copy()
y_newtar = df["newtarget"].copy()
#X[:5]
y_newtar.astype('int32').dtypes

dtype('int32')

In [13]:
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.base import clone

In [14]:
y = y_click
# train/test split, stratified on the current target y (here: click)
X_train,X_test,y_train,y_test = train_test_split(X, y,test_size = .25,
                                                 random_state = 614,
                                                 shuffle = True,
                                                 stratify = y)

# Machine Learning method for probability

Although we are working on a probability problem, we can use powerful machine learning methods with cross-validation to obtain the probabilities conditioned on the observed data. We'll explore a variety of classification methods and use the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a> as a metric to quantify the difference between the distributions of the training and testing sets. Specifically, we'll show the probabilities obtained from <a href="https://github.com/dmlc/xgboost">XGBoost</a>, which is an efficient implementation of <a href="https://en.wikipedia.org/wiki/Gradient_boosting">gradient boosting</a>.
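For intuition on the metric, here is a small plain-Python sketch of the discrete Kullback-Leibler divergence $D_{KL}(p\,\|\,q)=\sum_k p_k \ln(p_k/q_k)$. The two distributions are invented class frequencies, not values from this dataset.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two discrete distributions on the same support."""
    # Terms with p_k == 0 contribute nothing, by the usual convention.
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

train_dist = [0.8, 0.2]    # hypothetical class frequencies in a training fold
test_dist = [0.75, 0.25]   # hypothetical class frequencies in a test fold

same = kl_divergence(train_dist, train_dist)
diff = kl_divergence(train_dist, test_dist)
print(same, diff)
```

The divergence is zero for identical distributions and positive otherwise; it is not symmetric in its two arguments.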

In [15]:
import xgboost as xgb

from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, mean_squared_error
from sklearn.model_selection import GridSearchCV

In [16]:
def KLdivergence(a, b):
    # the np.float alias was removed from NumPy; the builtin float is equivalent
    a = np.asarray(a, dtype=float) # sum(a) should be 1, measured probability
    b = np.asarray(b, dtype=float) # theoretical probability

    return np.sum(np.where(a != 0, a * np.log(a / b), 0))

In [17]:
kfold = StratifiedKFold(n_splits = 5, shuffle = True, random_state = 440)

In [18]:
# class xgboost.XGBClassifier(*, objective='binary:logistic', use_label_encoder=True, **kwargs)
# https://xgboost.readthedocs.io/en/latest/python/python_api.html?highlight=xgbclassifier#xgboost.XGBClassifier
# cross-validation
probs = []
for train_index, test_index in kfold.split(X, y):
    xgb_model = xgb.XGBClassifier(n_jobs=1,use_label_encoder=False)
    xgb_model.fit(X.iloc[train_index], y.iloc[train_index],eval_metric='logloss')
    predictions = xgb_model.predict(X.iloc[test_index])
    actuals = y.iloc[test_index]
    print(confusion_matrix(actuals, predictions))
    prob = xgb_model.predict_proba(X_unique)
    probs.append(prob[:,1])

[[1625    0]
 [ 375    0]]
[[1625    0]
 [ 375    0]]
[[1624    0]
 [ 376    0]]
[[1624    0]
 [ 376    0]]
[[1624    0]
 [ 376    0]]


In [19]:
#probs

In [20]:
click_prob = np.mean(probs,axis=0)

In [21]:
y = y_sold
probs = []
for train_index, test_index in kfold.split(X, y):
    xgb_model = xgb.XGBClassifier(n_jobs=1,use_label_encoder=False)
    xgb_model.fit(X.iloc[train_index], y.iloc[train_index],eval_metric='logloss')
    predictions = xgb_model.predict(X.iloc[test_index])
    actuals = y.iloc[test_index]
    print(confusion_matrix(actuals, predictions))
    prob = xgb_model.predict_proba(X_unique)
    probs.append(prob[:,1])
sold_prob = np.mean(probs,axis=0)

[[1844    0]
 [ 156    0]]
[[1844    0]
 [ 156    0]]
[[1843    0]
 [ 157    0]]
[[1843    0]
 [ 157    0]]
[[1843    0]
 [ 157    0]]


In [22]:
meanPSPC=np.mean(sold_prob/click_prob)
print(meanPSPC)
np.round(10*sold_prob/click_prob/meanPSPC)

0.38695312


array([13.,  5.,  3.,  6., 10., 10., 10.,  8., 11., 14.,  5.,  7., 10.,
        8.,  8., 16., 14.,  8.,  9.,  7., 14., 11., 13., 11., 15., 12.,
        9., 12.,  8.,  5.,  9., 14., 14., 11., 10.], dtype=float32)

In [23]:
y = y_rank - 1 # set from [1,5] to [0, 4=n_class-1]
probs = []
for train_index, test_index in kfold.split(X, y):
    xgb_model = xgb.XGBClassifier(n_jobs=1,use_label_encoder=False)
    xgb_model.fit(X.iloc[train_index], y.iloc[train_index],eval_metric='logloss')
    predictions = xgb_model.predict(X.iloc[test_index])
    actuals = y.iloc[test_index]
    print(confusion_matrix(actuals, predictions))
    prob = xgb_model.predict_proba(X_unique)
    #print(prob[:5])
    probs.append(prob)

[[223  37  62   0   0]
 [150  49 105  18   0]
 [ 86  54 143  33 164]
 [  0  31 113  35 239]
 [  0   0  46  24 388]]
[[235  22  65   0   0]
 [149  40  94  39   0]
 [ 80  40 125  77 158]
 [  0  24  88  55 251]
 [  0   0   5  42 411]]
[[235   0  87   0   0]
 [140   0 155  27   0]
 [ 96   0 169  58 157]
 [  0   0 129  51 238]
 [  0   0  29  36 393]]
[[232  20  71   0   0]
 [137  32 136  16   0]
 [ 76  42 156  36 170]
 [  0  21 109  43 244]
 [  0   0  35  24 400]]
[[245  12  65   0   0]
 [132  17 137  35   0]
 [101  15 166  58 141]
 [  0  12 128  48 230]
 [  0   0  18  43 397]]


In [24]:
rank_prob = np.mean(probs,axis=0)

# From probability to the bidding price

The goal is to "optimize the cost per customer while having 4% customer rate over all ads shown". The simplest intuition is to bid more on valuable customers. If we set aside the $4\%$ constraint for a moment, then to decrease the cost per policy sold we only need to consider the probability `P(sold|click)` for each customer type, as the company only pays when a click happens. Since a policy can only be sold after a click,
$$P(sold|click)=\frac{P(\text{sold and click})}{P(click)}=\frac{P(sold)}{P(click)}$$

The current cost per customer is around 24.0 dollars, the sold rate (sold/shown) is 7.83\%, and the average P(sold|click) is 41.69\%. If we take this average as the baseline for a 10-dollar bid and scale the bid linearly with `P(sold|click)`, the cost per customer becomes even higher: 24.19 dollars.
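The baseline figures follow from simple arithmetic on the counts implied by the summary statistics above (10,000 impressions, a click rate of 18.78% and a sold rate of 7.83%). A short check, with the counts written out by hand:

```python
shown = 10_000
clicks = 1_878   # 18.78% of impressions (mean of `Click`)
sold = 783       # 7.83% of impressions (mean of `policies_sold`)
bid = 10.0       # flat bid in the dataset

p_sold_given_click = sold / clicks
# Only clicks are paid for, so total cost = bid * clicks.
cost_per_customer = bid * clicks / sold

print(round(p_sold_given_click, 4))
print(round(cost_per_customer, 2))
```

With a flat bid, the cost per customer is simply the bid divided by `P(sold|click)`, which is why raising the share of spend on high-`P(sold|click)` customers is the lever to pull.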

In [25]:
curCostPerCustomer = 10*np.sum(df["Click"])/np.sum(df["policies_sold"])
print(curCostPerCustomer)

23.984674329501917


In [26]:
avgPSC=np.sum(df["policies_sold"])/np.sum(df["Click"])
avgPSC

0.4169329073482428

In [27]:
biddings = np.round(10*sold_prob/click_prob/avgPSC)
biddings

array([12.,  5.,  3.,  6.,  9., 10.,  9.,  8., 11., 13.,  4.,  6., 10.,
        8.,  8., 14., 13.,  7.,  9.,  7., 13., 10., 12., 10., 14., 11.,
        9., 11.,  7.,  5.,  9., 13., 13., 10.,  9.], dtype=float32)

In [28]:
df_unique['newbid'] = biddings

In [29]:
newcols = list(df_unique.columns)
newcols = newcols[1:3] + newcols[4:8]
featuredict = {}
for i in range(len(df_unique)):
    temp = list(df_unique.iloc[i][newcols])
    string = ''
    for num in temp:
        string += str(num)
    if string not in featuredict:
        featuredict[string] = df_unique.iloc[i]['newbid']

df_copy = df.copy()
newbid = []
for i in range(len(df_copy)):
    temp = list(df_copy.iloc[i][newcols])
    string = ''
    for num in temp:
        string += str(num)
    newbid.append(featuredict[string])

In [30]:
df_copy['newbid'] = newbid

In [31]:
df_copy.head()

Unnamed: 0,Currently Insured,Number of Vehicles,Number of Drivers,Marital Status,bid,rank,click,policies_sold,N,Y,unknown,Click,Marital,newtarget,newbid
0,unknown,2,1,M,10.0,2,False,0,0,0,1,0,1.0,3,12.0
1,Y,3,1,M,10.0,5,False,0,0,1,0,0,1.0,9,5.0
2,Y,3,2,M,10.0,5,False,0,0,1,0,0,1.0,9,3.0
3,N,3,2,S,10.0,4,False,0,1,0,0,0,0.0,7,6.0
4,unknown,2,2,S,10.0,2,False,0,0,0,1,0,0.0,3,9.0


In [32]:
costs = 0
for i in range(len(df_copy)):
    if df_copy.iloc[i]['click']:
        costs += df_copy.iloc[i]['newbid']
costs/np.sum(df["policies_sold"])

24.18773946360153

However, this shows ads to all the samples. In reality, we have a budget and should stop showing ads once the total paid for clicks reaches it. The current cost is 18,780 dollars; we set this as our budget and stop once it is reached, drawing impressions through random sampling.
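The stopping rule can be sketched independently of the dataframe. The version below is a minimal illustration in plain Python: it samples with replacement from a tiny made-up population of `(clicked, sold, bid)` rows and stops once the accumulated click cost reaches the budget. The function name `simulate` and the toy rows are assumptions for illustration.

```python
import random

def simulate(rows, budget, seed=0):
    """Sample rows with replacement until the paid click cost reaches the budget.

    rows: list of (clicked, sold, bid). Returns (shown, cost, sold_count).
    """
    rng = random.Random(seed)
    cost = 0.0
    sold_count = 0
    shown = 0
    while cost < budget:
        shown += 1
        clicked, was_sold, bid = rng.choice(rows)
        if clicked:
            cost += bid              # only clicked ads are paid for
            sold_count += int(was_sold)
    return shown, cost, sold_count

# Hypothetical toy population, not the notebook's data.
toy_population = [
    (True, True, 12.0),
    (True, False, 8.0),
    (False, False, 10.0),
    (False, False, 10.0),
]
shown, cost, sold_count = simulate(toy_population, budget=200.0, seed=1)
print(shown, cost, sold_count)
```

Because the loop only checks the budget before each impression, the final cost can overshoot the budget by at most one bid, which is also visible in the totals printed by the notebook's trials.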

In [33]:
budget = 10*np.sum(df["Click"])
budget

18780

In [34]:
import random

In [35]:
for trial in range(10):
    costs = 0
    sold = 0
    clicks = 0
    shown = 0
    random.seed(trial)
    while costs < budget:
        shown += 1
        ind = random.randint(0, len(df_copy)-1)
        if df_copy.iloc[ind]['click']:
            costs += df_copy.iloc[ind]['newbid']
            clicks += 1
            if df_copy.iloc[ind]['policies_sold']:
                sold += 1
    print(shown, costs,sold, costs/sold, 10*clicks/sold)

10647 18788.0 744 25.252688172043012 25.282258064516128
9909 18781.0 786 23.89440203562341 23.638676844783713
10022 18785.0 755 24.880794701986755 24.52980132450331
10081 18784.0 782 24.020460358056265 24.028132992327365
10401 18784.0 784 23.959183673469386 23.698979591836736
10183 18784.0 775 24.23741935483871 24.09032258064516
10028 18784.0 798 23.538847117794486 23.295739348370926
9771 18780.0 796 23.592964824120603 23.266331658291456
9941 18781.0 793 23.68348045397226 23.581336696090794
10201 18780.0 746 25.17426273458445 24.89276139410188


#### As we can see, in most of these sampling trials the simple linear strategy gives a slightly higher cost per customer. We need to refine the current strategy.

What if we take the extreme case? In the limit of infinite budget and customer samples, we should invest the whole budget in the most valuable customer type to obtain the best cost per customer. However, the limited budget and number of customers require us to invest in more customer types, with the lower bound given by the 4% customer rate. Compared with this extreme, the previous strategy, which scales linearly with the average `P(sold|click)` rate, might be too slow. Thus, here we try an exponential function
$$ B= 1+e^{C (P-\bar{P})}$$
where the bidding price has a minimum of 1 dollar, $P$ is $P(S|C)$ for the customer type, and $\bar{P}$ is its average over customer types. The coefficient in the exponent is set to $C = 20$.
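To get a feel for how steeply this rule reacts, the sketch below evaluates it in plain Python, mirroring the expression `1 + np.round(np.exp(C*(PSC-PSCmean)))` used in the next cell, so that bids grow with $P$. The average of 0.40 and the sample probabilities are made-up values.

```python
import math

def exponential_bid(p, p_mean, c=20.0):
    """Bid of at least 1 dollar that grows exponentially with p - p_mean."""
    return 1 + round(math.exp(c * (p - p_mean)))

p_mean = 0.40  # hypothetical average P(sold|click)
example_bids = {p: exponential_bid(p, p_mean) for p in (0.25, 0.40, 0.50, 0.60)}
print(example_bids)
```

With $C = 20$, a customer type ten percentage points above the average already receives several times the bid of an average one, while types well below the average fall to the 1-dollar floor.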

In [146]:
PSCmean = np.mean(sold_prob/click_prob)

In [153]:
Cmax = 19
C = 20
PSC = sold_prob/click_prob
biddings = 1 + np.round(np.exp(C*(PSC-PSCmean))) #np.round(1+Cmax/(1+np.exp(-C*(PSC-PSCmean))))
biddings

array([ 9.,  1.,  1.,  1.,  2.,  2.,  2.,  1.,  4., 23.,  1.,  1.,  2.,
        1.,  1., 74., 33.,  1.,  2.,  1., 17.,  3.,  8.,  3., 49.,  5.,
        2.,  6.,  1.,  1.,  2., 18., 24.,  3.,  2.], dtype=float32)

In [154]:
#PSC
np.mean(biddings)

8.771428

In [155]:
df_unique['newbid'] = biddings
newcols = list(df_unique.columns)
newcols = newcols[1:3] + newcols[4:8]
featuredict = {}
for i in range(len(df_unique)):
    temp = list(df_unique.iloc[i][newcols])
    string = ''
    for num in temp:
        string += str(num)
    if string not in featuredict:
        featuredict[string] = df_unique.iloc[i]['newbid']

df_copy = df.copy()
newbid = []
for i in range(len(df_copy)):
    temp = list(df_copy.iloc[i][newcols])
    string = ''
    for num in temp:
        string += str(num)
    newbid.append(featuredict[string])
    
df_copy['newbid'] = newbid

In [156]:
for trial in range(5):
    costs = 0
    sold = 0
    clicks = 0
    shown = 0
    random.seed(trial)
    while costs < budget:
        shown += 1
        ind = random.randint(0, len(df_copy)-1)
        if df_copy.iloc[ind]['click'] and df_copy.iloc[ind]['newbid'] > 0:
            costs += df_copy.iloc[ind]['newbid']
            clicks += 1
            if df_copy.iloc[ind]['policies_sold']:
                sold += 1
    print("showned times", shown, "total", costs, "sold",sold, "sold rate", sold/shown)
    print("cost per customer", costs/sold, "Old", 10*clicks/sold,"\n")

showned times 12939 total 18794.0 sold 932 sold rate 0.072030296004328
cost per customer 20.165236051502145 Old 24.624463519313306 

showned times 11024 total 18782.0 sold 875 sold rate 0.0793722786647315
cost per customer 21.465142857142858 Old 23.588571428571427 

showned times 11905 total 18782.0 sold 910 sold rate 0.07643847123057539
cost per customer 20.63956043956044 Old 24.164835164835164 

showned times 12088 total 18790.0 sold 926 sold rate 0.07660489741892786
cost per customer 20.29157667386609 Old 24.37365010799136 

showned times 12133 total 18798.0 sold 929 sold rate 0.0765680375834501
cost per customer 20.234660925726587 Old 23.379978471474704 

