## Mini Project
### Machine Learning Applications in Business and Economics


Fong, Victor \
Li, Jiazhen

### Introduction

In e-commerce, a large share of customers place a single order and never buy again, for many possible reasons. To build loyalty, platforms use a variety of strategies. One of them is to send a discount voucher after the first purchase to encourage a repeat order. Sending vouchers to every customer is not optimal, however: some customers would have reordered without any incentive, and when they redeem the voucher the retailer simply loses profit.

Empirical analyses conducted by the media retailer show that for 10% of non-buyers, the voucher triggers a purchase with an average order value of €20. Thus, if a voucher is sent to a customer who would not otherwise have made another purchase, revenue increases by an average of €1.5. Sending a voucher to a customer who would have purchased anyway, on the other hand, results in a revenue loss equal to the voucher value of €5. For customers who do not receive a voucher, there is no impact on revenue. A more targeted approach to distributing these vouchers is therefore needed.
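As a rough check of these figures (our reading, which the task description does not spell out): the €1.5 gain matches a 10% conversion rate applied to the €20 order value net of the €5 voucher. Writing $p$ for the probability that a customer reorders without a voucher (a symbol introduced here for illustration), the expected revenue change from sending a voucher and the resulting break-even probability would then be:

$$
0.10 \times (20 - 5) = 1.5
$$

$$
\mathbb{E}[\Delta R \mid \text{voucher}] = 1.5\,(1 - p) - 5\,p > 0 \iff p < \frac{1.5}{6.5} \approx 0.23
$$

Under this reading, a voucher pays off on average only for customers whose estimated reorder probability is below roughly 23%.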

### Task

The task at hand involves constructing a predictive model that leverages various features associated with a customer’s initial order. The objective is to determine whether a €5.00 voucher should be issued to a specific customer. Detailed descriptions of these features can be found in the `data_dictionary.pdf` file. \
The model should be designed to predict if a customer will place a subsequent order within a 90-day period following their initial purchase. This information is represented by the `target90` variable in the dataset. \
The model’s performance is evaluated by the expected revenue across all customers in a given dataset. This is computed from the model’s predictions together with the payoffs described in the introduction: +€1.5 on average for a voucher sent to a customer who would not have reordered, −€5 for a voucher sent to a customer who would have reordered anyway, and €0 when no voucher is sent. The model should therefore be optimized to maximize this expected revenue.
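The evaluation criterion can be sketched as a small function. This is a minimal illustration, not the official scoring code: the function name `expected_revenue`, the constants, and the toy labels are our own, while the payoffs (+€1.5 and −€5) come from the introduction.

```python
import numpy as np

GAIN_NON_BUYER = 1.5  # average gain when a voucher reaches a customer who would not reorder
LOSS_BUYER = -5.0     # voucher value lost when the customer would have reordered anyway


def expected_revenue(target90, send_voucher):
    """Sum the expected revenue change over all customers who receive a voucher."""
    target90 = np.asarray(target90)
    send_voucher = np.asarray(send_voucher, dtype=bool)
    # Payoff each customer would generate if a voucher were sent to them
    payoff_if_sent = np.where(target90 == 1, LOSS_BUYER, GAIN_NON_BUYER)
    # Customers without a voucher contribute nothing
    return float(payoff_if_sent[send_voucher].sum())


# Toy example: four customers, the second and fourth reorder within 90 days
y_true = [0, 1, 0, 1]
print(expected_revenue(y_true, [1, 1, 1, 1]))  # vouchers for everyone: -7.0
print(expected_revenue(y_true, [1, 0, 1, 0]))  # vouchers only for non-buyers: 3.0
print(expected_revenue(y_true, [0, 0, 0, 0]))  # no vouchers: 0.0
```

Note that the voucher decision is the inverse of the predicted label: a voucher is worthwhile for customers predicted *not* to reorder.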

Before starting, we first load the necessary libraries for this project.

In [1]:
import numpy as np
import pandas as pd
import sklearn

### Data Preparation and Preprocessing

Before starting with model training, we have to:
- Get familiar with the data.
- Preprocess the data in a meaningful manner.

In [2]:
# Import data
df = pd.read_csv('train.csv', sep=';')
df



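| | customernumber | date | salutation | title | domain | datecreated | newsletter | model | paymenttype | deliverytype | ... | w2 | w3 | w4 | w5 | w6 | w7 | w8 | w9 | w10 | target90 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41191 | 2008-12-01 | 0 | 0 | 9 | 2008-12-01 | 0 | 2 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 38860 | 2008-12-16 | 1 | 0 | 4 | 2008-12-16 | 0 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 61917 | 2008-08-19 | 0 | 0 | 12 | 2008-08-19 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 40647 | 2008-06-16 | 1 | 0 | 8 | 2008-06-16 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1347 | 2008-08-08 | 0 | 0 | 1 | 2008-08-08 | 0 | 1 | 1 | 1 | ... | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32423 | 7784 | 2008-10-21 | 1 | 0 | 8 | 2008-10-21 | 0 | 1 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32424 | 41695 | 2008-11-09 | 1 | 0 | 4 | 2008-11-09 | 0 | 1 | 3 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 32425 | 7612 | 2008-04-12 | 2 | 0 | 9 | 2008-04-12 | 0 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 32426 | 31941 | 2008-11-15 | 0 | 0 | 12 | 2008-11-15 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |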


In [3]:
# Check for missing values
df.isnull().sum()

customernumber              0
date                        0
salutation                  0
title                       0
domain                      0
datecreated                 0
newsletter                  0
model                       0
paymenttype                 0
deliverytype                0
invoicepostcode             0
delivpostcode           31036
voucher                     0
advertisingdatacode     25905
case                        0
numberitems                 0
gift                        0
entry                       0
points                      0
shippingcosts               0
deliverydatepromised        0
deliverydatereal            0
weight                      0
remi                        0
cancel                      0
used                        0
w0                          0
w1                          0
w2                          0
w3                          0
w4                          0
w5                          0
w6                          0
w7        

None of the features except `delivpostcode` and `advertisingdatacode` contain missing values. For the missing delivery postal codes, it is reasonable to assume that they coincide with `invoicepostcode`, i.e. that the order was delivered to the invoice address.

In [4]:
# Update delivpostcode so that missing values are the same as invoicepostcode
df['delivpostcode'] = np.where(df['delivpostcode'].isna(), df['invoicepostcode'], df['delivpostcode'])
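The same imputation can also be written with `fillna`, and the second column with missing values, `advertisingdatacode`, could be reduced to a presence flag. The sketch below runs on made-up toy rows; the indicator column name `has_advertisingdatacode` and the example codes are hypothetical, and whether such a flag helps the model is an assumption that would need checking.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real data (values are made up)
toy = pd.DataFrame({
    'invoicepostcode': [10, 45, 72, 3],
    'delivpostcode': [np.nan, 44, np.nan, 3],
    'advertisingdatacode': ['AB', np.nan, np.nan, 'BQ'],
})

# Equivalent to the np.where approach: fall back to the invoice postcode where missing
toy['delivpostcode'] = toy['delivpostcode'].fillna(toy['invoicepostcode'])

# Encode only whether an advertising code is present (1) or missing (0)
toy['has_advertisingdatacode'] = toy['advertisingdatacode'].notna().astype(int)

print(toy)
```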

As a second step, the date variables should be transformed so that their information can be used for training the model. For instance, from the promised and actual delivery dates we can create a feature `delivontime` indicating whether an item was delivered within the promised shipping time. \
Similarly, from `date` and `datecreated` we can create a new feature `purchasecreated` indicating whether the first purchase took place on the same day as account creation.

In [5]:
# Create delivontime column based on condition
df['delivontime'] = (df['deliverydatereal'] <= df['deliverydatepromised']).astype(int)

# Drop delivery date columns
df = df.drop(['deliverydatepromised', 'deliverydatereal'], axis=1)

In [6]:
# Create purchasecreated column based on condition
df['purchasecreated'] = (df['date'] == df['datecreated']).astype(int)

# Drop the order date and account creation date columns
df = df.drop(['date', 'datecreated'], axis=1)
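The comparisons above operate on date strings, which works for ISO-formatted dates because they sort lexicographically. A possibly more robust variant parses the columns first, so that invalid entries become `NaT` instead of being compared as text. The sketch below uses toy rows; the `0000-00-00` placeholder and the extra `delaydays` column are hypothetical illustrations, not something confirmed for this dataset.

```python
import pandas as pd

# Toy rows; '0000-00-00' is a hypothetical placeholder for an unknown delivery date
toy = pd.DataFrame({
    'deliverydatepromised': ['2008-12-05', '2008-12-20', '2008-08-25'],
    'deliverydatereal': ['2008-12-04', '2008-12-23', '0000-00-00'],
})

# Parse to datetimes; values that are not valid dates become NaT
promised = pd.to_datetime(toy['deliverydatepromised'], format='%Y-%m-%d', errors='coerce')
real = pd.to_datetime(toy['deliverydatereal'], format='%Y-%m-%d', errors='coerce')

# Comparisons involving NaT evaluate to False, so invalid dates count as not on time
toy['delivontime'] = (real <= promised).astype(int)

# Signed delay in days (NaN where a date could not be parsed)
toy['delaydays'] = (real - promised).dt.days

print(toy)
```

With plain string comparison, the placeholder row would instead be flagged as delivered on time, since `'0000-00-00'` sorts before any real date.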

Here we check whether `numberitems` equals the sum of the features `remi` through `w10`.

In [13]:
# Columns whose values should add up to numberitems
columns_to_sum = ['remi', 'cancel', 'used', 'w0', 'w1', 'w2', 'w3', 'w4', 'w5', 'w6', 'w7', 'w8', 'w9', 'w10']

# Compare the row-wise sum with numberitems (vectorized, no per-row apply needed)
df['is_correct'] = df['numberitems'] == df[columns_to_sum].sum(axis=1)

df['is_correct'].value_counts(dropna=False)

is_correct
True     28479
False     3949
Name: count, dtype: int64

### Feature Selection

Instead of using all features for model training, we can use Lasso to perform feature selection. This reduces the number of model inputs and thereby the computation time.
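One way this selection step might look with scikit-learn is sketched below. It is an illustration under assumptions: the data is synthetic (a stand-in for the prepared features and `target90`), and `alpha=0.05` is an arbitrary, untuned value. For a binary target, an L1-penalized logistic regression would be a common alternative to plain Lasso.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared feature matrix and the target90 labels
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=0,
    n_clusters_per_class=1, random_state=0,
)

# Standardize first: the L1 penalty is sensitive to feature scale
selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(Lasso(alpha=0.05)),  # alpha is illustrative, not tuned
)
X_selected = selector.fit_transform(X, y)

# Boolean mask marking the features whose Lasso coefficient is non-zero
mask = selector.named_steps['selectfrommodel'].get_support()
print(X.shape[1], '->', X_selected.shape[1])
```

In practice, `alpha` would be chosen by cross-validation (for example with `LassoCV`), since it directly controls how many features survive.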

### Model Training

### Model Evaluation