## Mini Project
### Machine Learning Applications in Business and Economics


Li, Jiazhen \
Victor

### Introduction

In e-commerce, a large share of customers place a single order and, for a variety of reasons, never buy again. To counter this, platforms use a range of strategies to build customer loyalty. One of them is to send a discount voucher after the initial purchase to encourage a repeat order. Sending vouchers to every customer is not optimal, however, because some customers would have ordered again even without the incentive. When such customers redeem the voucher, the retailer’s profit simply falls.

Empirical analyses by the media retailer show that for 10% of non-buyers, the voucher triggers a purchase with an average order value of €20. Thus, sending a voucher to a customer who would otherwise not have made another purchase increases revenue by €1.5 on average. Sending a voucher to a customer who would have purchased anyway results in a revenue loss equal to the voucher value of €5. For customers who do not receive a voucher, there is no impact on revenue. A more targeted approach to distributing vouchers is therefore needed.
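The €1.5 figure can be reproduced from the numbers quoted above. The short sketch below is one reading of that arithmetic, assuming the €5 voucher is deducted only when the triggered order is actually placed; the variable names are illustrative and not taken from the data set.

```python
# Figures quoted in the introduction.
uplift_rate = 0.10        # share of non-buyers who order because of the voucher
avg_order_value = 20.0    # average order value in EUR
voucher_value = 5.0       # face value of the voucher in EUR

# Assumption: the voucher is redeemed only when an order is placed,
# so each triggered order brings in the order value minus the voucher.
gain_non_buyer = uplift_rate * (avg_order_value - voucher_value)

# A customer who would have bought anyway simply pays 5 EUR less.
loss_buyer = -voucher_value

print(f"voucher to non-buyer: {gain_non_buyer:+.2f} EUR")
print(f"voucher to buyer:     {loss_buyer:+.2f} EUR")
```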

### Task

The task is to build a predictive model that uses features of a customer’s initial order to decide whether that customer should receive a €5.00 voucher. The features are described in detail in the `data_dictionary.pdf` file. \
The model should predict whether a customer will place another order within 90 days of the initial purchase. This outcome is given by the `target90` variable in the dataset. \
The model is evaluated by the expected revenue across all customers in a given dataset, computed from the model’s predictions together with the associated costs and revenues. The model should therefore be optimized to maximize this expected revenue.
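The evaluation rule can be written as a small scoring function. This is a minimal sketch, assuming a voucher decision vector `send_voucher` (a hypothetical name, 1 = voucher sent) and the per-customer payoffs stated in the introduction (+€1.5 for a non-buyer, −€5 for a buyer, €0 when no voucher is sent); it is not necessarily the scoring code used for the official evaluation.

```python
import numpy as np

GAIN_NON_BUYER = 1.5   # EUR, voucher sent to a customer who would not reorder
LOSS_BUYER = -5.0      # EUR, voucher sent to a customer who would reorder anyway


def expected_revenue(y_true, send_voucher):
    """Total expected revenue change for a set of voucher decisions.

    y_true:       1 if the customer reordered within 90 days (target90), else 0.
    send_voucher: 1 if a voucher is sent to the customer, else 0.
    """
    y_true = np.asarray(y_true)
    send_voucher = np.asarray(send_voucher).astype(bool)
    # Payoff each customer would generate *if* a voucher were sent.
    per_customer = np.where(y_true == 1, LOSS_BUYER, GAIN_NON_BUYER)
    # Customers without a voucher contribute nothing.
    return float(per_customer[send_voucher].sum())


# Toy example: vouchers go to customers 0, 2 and 4.
y_true = [0, 0, 1, 1, 0]
send_voucher = [1, 0, 1, 0, 1]
print(expected_revenue(y_true, send_voucher))  # 1.5 - 5.0 + 1.5 = -2.0
```

Because a wrongly sent voucher costs more than three times what a correctly sent one earns, a decision rule tuned for this score will generally differ from one tuned for plain accuracy.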

Before starting, we load the libraries needed for this project.

In [3]:
import numpy as np
import pandas as pd
import sklearn

### Data Preparation and Preprocessing

Before training a model, we have to:
- Get familiar with the data.
- Preprocess the data in a meaningful manner.

In [33]:
# Import data
df = pd.read_csv('train.csv', sep=';')
df



Unnamed: 0,customernumber,date,salutation,title,domain,datecreated,newsletter,model,paymenttype,deliverytype,...,w2,w3,w4,w5,w6,w7,w8,w9,w10,target90
0,41191,2008-12-01,0,0,9,2008-12-01,0,2,2,0,...,0,0,0,0,0,0,0,0,0,0
1,38860,2008-12-16,1,0,4,2008-12-16,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,61917,2008-08-19,0,0,12,2008-08-19,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
3,40647,2008-06-16,1,0,8,2008-06-16,0,1,0,0,...,0,0,0,2,0,0,0,0,0,0
4,1347,2008-08-08,0,0,1,2008-08-08,0,1,1,1,...,2,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32423,7784,2008-10-21,1,0,8,2008-10-21,0,1,2,0,...,0,0,0,0,0,0,0,0,0,0
32424,41695,2008-11-09,1,0,4,2008-11-09,0,1,3,0,...,0,0,0,1,0,0,0,0,0,1
32425,7612,2008-04-12,2,0,9,2008-04-12,0,3,0,0,...,0,0,0,0,0,0,0,0,0,0
32426,31941,2008-11-15,0,0,12,2008-11-15,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [45]:
df.isnull().sum()

customernumber              0
date                        0
salutation                  0
title                       0
domain                      0
datecreated                 0
newsletter                  0
model                       0
paymenttype                 0
deliverytype                0
invoicepostcode             0
delivpostcode           31036
voucher                     0
advertisingdatacode     25905
case                        0
numberitems                 0
gift                        0
entry                       0
points                      0
shippingcosts               0
deliverydatepromised        0
deliverydatereal            0
weight                      0
remi                        0
cancel                      0
used                        0
w0                          0
w1                          0
w2                          0
w3                          0
w4                          0
w5                          0
w6                          0
w7        

Only `delivpostcode` and `advertisingdatacode` contain missing values; all other features are complete. For the missing delivery postal codes, it is reasonable to assume that they coincide with `invoicepostcode`.
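The imputation described above could look roughly like this. The sketch runs on a small toy frame with made-up values rather than on `df`, and the indicator column `has_delivpostcode` is a hypothetical addition that preserves the information that a separate delivery address was given; it also assumes both postcode columns are numeric.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    "invoicepostcode": [30, 47, 66],
    "delivpostcode": [np.nan, 12.0, np.nan],
})

# Record whether a separate delivery postcode was provided at all.
toy["has_delivpostcode"] = toy["delivpostcode"].notna().astype(int)

# Fill missing delivery postcodes with the invoice postcode (index-aligned).
toy["delivpostcode"] = toy["delivpostcode"].fillna(toy["invoicepostcode"])

print(toy)
```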

In [36]:
df.describe()

Unnamed: 0,customernumber,salutation,title,domain,newsletter,model,paymenttype,deliverytype,invoicepostcode,voucher,...,w2,w3,w4,w5,w6,w7,w8,w9,w10,target90
count,32428.0,32428.0,32428.0,32428.0,32428.0,32428.0,32428.0,32428.0,32428.0,32428.0,...,32428.0,32428.0,32428.0,32428.0,32428.0,32428.0,32428.0,32428.0,32428.0,32428.0
mean,33389.298569,0.541569,0.006969,7.517115,0.169483,1.64691,1.000987,0.201955,48.752282,0.16202,...,0.276644,0.018903,0.047027,0.180986,0.027908,0.023128,0.000185,0.164981,0.092883,0.186598
std,19148.090449,0.657044,0.083192,3.683945,0.375184,0.825981,1.092677,0.401465,24.361425,0.368475,...,1.353981,0.253596,0.434265,0.561751,0.299862,0.401782,0.013601,0.836705,0.610509,0.389594
min,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,16802.75,0.0,0.0,4.0,0.0,1.0,0.0,0.0,30.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,33552.5,0.0,0.0,9.0,0.0,1.0,1.0,0.0,47.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,50034.25,1.0,0.0,11.0,0.0,2.0,2.0,0.0,66.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,66251.0,2.0,1.0,12.0,1.0,3.0,3.0,1.0,99.0,1.0,...,90.0,15.0,36.0,14.0,27.0,55.0,1.0,48.0,50.0,1.0


In [37]:
# Check how often the order date equals the account creation date
temp = df['date'] == df['datecreated']
temp.value_counts()

True     30955
False     1473
Name: count, dtype: int64

In [40]:
# Check how often the promised delivery date equals the actual delivery date
temp = df['deliverydatepromised'] == df['deliverydatereal']
temp.value_counts()

False    29029
True      3399
Name: count, dtype: int64

In [42]:
df[['deliverydatepromised', 'deliverydatereal']]

Unnamed: 0,deliverydatepromised,deliverydatereal
0,2008-12-03,2008-12-02
1,2008-12-30,2009-02-03
2,2008-09-02,2008-08-28
3,2008-06-17,0000-00-00
4,2008-08-11,2008-08-08
...,...,...
32423,2008-10-22,2008-10-22
32424,2008-11-11,0000-00-00
32425,2008-04-15,2008-04-14
32426,2008-11-18,2008-11-17


In [43]:
df['deliverydatereal'].value_counts()

deliverydatereal
0000-00-00    5472
2008-12-16     351
2008-12-17     252
2008-12-18     246
2008-12-19     227
              ... 
2009-06-18       1
2009-06-17       1
2009-10-26       1
2009-08-04       1
2009-09-26       1
Name: count, Length: 412, dtype: int64
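The counts show 5,472 rows where `deliverydatereal` holds the placeholder `0000-00-00` instead of a real date. One possible way to handle this, sketched here on three rows taken from the preview above, is to parse both columns with `pd.to_datetime(..., errors="coerce")` so that the placeholder becomes `NaT`, and then derive a missing-date flag and a delay in days. The column names `delivery_missing` and `delivery_delay_days` are hypothetical.

```python
import pandas as pd

# Three rows copied from the preview of the two delivery date columns.
toy = pd.DataFrame({
    "deliverydatepromised": ["2008-12-03", "2008-06-17", "2008-10-22"],
    "deliverydatereal": ["2008-12-02", "0000-00-00", "2008-10-22"],
})

# Invalid dates such as "0000-00-00" are coerced to NaT instead of raising.
promised = pd.to_datetime(toy["deliverydatepromised"], format="%Y-%m-%d", errors="coerce")
real = pd.to_datetime(toy["deliverydatereal"], format="%Y-%m-%d", errors="coerce")

# Flag rows without a usable actual delivery date.
toy["delivery_missing"] = real.isna().astype(int)

# Negative values mean the order arrived earlier than promised; NaN if unknown.
toy["delivery_delay_days"] = (real - promised).dt.days

print(toy)
```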