# Capstone Project: Predicting Delivery Time for Online Shopping

Huy Hoang Vuong | June 25, 2023

This project focuses on predicting the estimated delivery time for online shopping orders. A close estimate improves the customer experience by helping buyers answer the question: "When will I get my order?"

***Please Note:*** This is Notebook 2 of 2. It covers feature selection, model design, training, evaluation, and optimization.

In [110]:
#base import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Model Import
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras import layers
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM
from tensorflow.keras.optimizers import Adam
#preprocessing import
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
np.random.seed(123)


In [190]:
ebay_clean= pd.read_csv('../data/cleaned/Ebay_cleaned.csv', index_col=0)
pd.set_option('display.max_columns', None)
ebay_clean.head()

Unnamed: 0,b2c_c2c,seller_id,declared_handling_days,acceptance_scan_timestamp,shipment_method_id,shipping_fee,carrier_min_estimate,carrier_max_estimate,item_zip,buyer_zip,category_id,item_price,quantity,payment_datetime,delivery_date,weight,weight_units,package_size,record_number,distance,handling_date,shipping_date,total_time,pay_year,pay_month,pay_date
0,B2C,25454,3.0,2019-03-27,0,0.0,3.0,5.0,97219,49040,13,28.0,1,2019-03-24,2019-03-29,5,1,LETTER,1,3002.0,3,2,5,2019,3,24
1,C2C,6727381,2.0,2018-06-03,0,3.0,3.0,5.0,11415-3528,62521,0,20.0,1,2018-06-02,2018-06-05,0,1,PACKAGE_THICK_ENVELOPE,2,1283.0,1,2,3,2018,6,2
2,B2C,18507,1.0,2019-01-08,0,4.0,3.0,5.0,27292,53010,1,20.0,1,2019-01-06,2019-01-10,9,1,PACKAGE_THICK_ENVELOPE,3,1104.0,2,2,4,2019,1,6
3,B2C,4677,1.0,2018-12-18,0,0.0,3.0,5.0,90703,80022,1,36.0,1,2018-12-17,2018-12-21,8,1,PACKAGE_THICK_ENVELOPE,4,1353.0,1,3,4,2018,12,17
4,B2C,4677,1.0,2018-07-28,0,0.0,3.0,5.0,90703,55070,1,25.0,1,2018-07-27,2018-07-30,3,1,PACKAGE_THICK_ENVELOPE,5,2456.0,1,2,3,2018,7,27


In [191]:
ebay_clean.isna().sum()

b2c_c2c                      0
seller_id                    0
declared_handling_days       0
acceptance_scan_timestamp    0
shipment_method_id           0
shipping_fee                 0
carrier_min_estimate         0
carrier_max_estimate         0
item_zip                     0
buyer_zip                    0
category_id                  0
item_price                   0
quantity                     0
payment_datetime             0
delivery_date                0
weight                       0
weight_units                 0
package_size                 0
record_number                0
distance                     0
handling_date                0
shipping_date                0
total_time                   0
pay_year                     0
pay_month                    0
pay_date                     0
dtype: int64

In [192]:
ebay_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 493951 entries, 0 to 499999
Data columns (total 26 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   b2c_c2c                    493951 non-null  object 
 1   seller_id                  493951 non-null  int64  
 2   declared_handling_days     493951 non-null  float64
 3   acceptance_scan_timestamp  493951 non-null  object 
 4   shipment_method_id         493951 non-null  int64  
 5   shipping_fee               493951 non-null  float64
 6   carrier_min_estimate       493951 non-null  float64
 7   carrier_max_estimate       493951 non-null  float64
 8   item_zip                   493951 non-null  object 
 9   buyer_zip                  493951 non-null  object 
 10  category_id                493951 non-null  int64  
 11  item_price                 493951 non-null  float64
 12  quantity                   493951 non-null  int64  
 13  payment_datetime           49

Declare feature and target columns


In [114]:
X= ebay_clean[['b2c_c2c', 'declared_handling_days', 'shipment_method_id', 'shipping_fee', 'item_price', 'weight', 'package_size', 'distance']]
y= ebay_clean['total_time']
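In the data preview earlier in this notebook, `b2c_c2c` and `package_size` appear as text values, while `StandardScaler` expects numeric input. Notebook 1 may already take care of this. If it does not, one possible approach is one-hot encoding before scaling. The sketch below uses a made-up three-row frame (`toy` is a hypothetical name, not part of the original notebook) to show the idea:

```python
import pandas as pd

# Toy frame mimicking three of the feature columns (illustrative values only).
toy = pd.DataFrame({
    "b2c_c2c": ["B2C", "C2C", "B2C"],
    "package_size": ["LETTER", "PACKAGE_THICK_ENVELOPE", "PACKAGE_THICK_ENVELOPE"],
    "distance": [3002.0, 1283.0, 1104.0],
})

# One-hot encode the text columns so every feature is numeric before scaling.
toy_encoded = pd.get_dummies(toy, columns=["b2c_c2c", "package_size"], dtype=int)
print(toy_encoded.dtypes)
```

Each category becomes its own 0/1 column, so the encoded frame can be passed to `StandardScaler` without type errors.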

In [115]:
#Check shape 
print(y.shape)
print(X.shape)

(493951,)
(493951, 8)


In [116]:
#split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

Scale data


In [117]:
scaler= StandardScaler()
X_train_ss= scaler.fit_transform(X_train, y_train)
X_test_ss= scaler.transform(X_test)

In [118]:
#get value of y train and test to use in the accuracy function
y_true_test= y_test.values
y_true_train= y_train.values

**Accuracy function**

In [119]:
def define_late(y_actual, pred):
    '''
    Calculate a business-oriented accuracy for the model.
    A plain accuracy score counts only exact matches as correct; any prediction
    that is too early or too late counts as wrong.
    From the business point of view, an order delivered earlier than predicted will not
    draw a customer complaint, so it is treated as an on-time delivery.
    This function therefore scores predictions with the following logic:
        predicted days > actual delivery days: on time
        predicted days = actual delivery days: on time
        predicted days < actual delivery days: late
    Returns the percentage of on-time predictions.
    '''
    ontime= 0
    late= 0
    accuracy_sc=0
    for i in range(len(pred)):
        if pred[i] == y_actual[i] or pred[i]> y_actual[i]:
            ontime+=1
        else:
            late+=1
    accuracy_sc= ontime/len(y_actual)*100
    return accuracy_sc
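The loop above visits every record one at a time. As a rough alternative, the same on-time rule (prediction greater than or equal to actual) can be written with NumPy array operations. The name `define_late_vectorized` is hypothetical and is not used elsewhere in this notebook:

```python
import numpy as np

def define_late_vectorized(y_actual, pred):
    """Percentage of predictions that are on time (prediction >= actual)."""
    y_actual = np.asarray(y_actual).ravel()
    pred = np.asarray(pred).ravel()
    return float(np.mean(pred >= y_actual) * 100)

# Toy check: predictions of 5, 3, 2 days against actuals of 4, 3, 3 days.
# The first two count as on time, the last one as late.
toy_accuracy = define_late_vectorized([4, 3, 3], [5, 3, 2])
print(toy_accuracy)
```

The `ravel()` calls flatten column-shaped predictions, such as those returned by Keras, so both inputs are compared element by element.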

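**Loss Function Calculation**

The loss function was provided by the organizer.
$$L = \frac{1}{N}\left(P_E \sum_{\text{early shipments}} \left|\text{actual delivery days} - \text{predicted delivery days}\right| + P_L \sum_{\text{late shipments}} \left|\text{actual delivery days} - \text{predicted delivery days}\right|\right)$$
where $P_E = 0.4$, $P_L = 0.6$, and $N$ is the number of records in the dataset. In the implementation, a record contributes to the early sum when the prediction is below the actual value and to the late sum when the prediction is above it.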

In [120]:
def evaluate_loss(preds, actual):
    '''
        This loss function was provided by the eBay team, who released the dataset for their machine learning challenge.
        From a business point of view, it is a worse experience for a buyer if a shipment arrives after the estimated delivery date ("late shipment")
            than if it arrives before the estimated delivery date ("early shipment").
            The formula for the loss function is given above.
        
    '''
    early_loss, late_loss = 0,0 
    for i in range(len(preds)):
        if preds[i] < actual[i]:
            #early shipment
            early_loss += actual[i] - preds[i]
        elif preds[i] > actual[i]:
            #late shipment
            late_loss += preds[i] - actual[i]
    loss = (1/len(preds)) * (0.4 * (early_loss) + 0.6 * (late_loss))
    return loss
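To make the arithmetic concrete, here is a small vectorized sketch that mirrors the branches of `evaluate_loss` as written above: weight 0.4 where the prediction is below the actual value, and weight 0.6 where it is above. The function name and the `w_under` / `w_over` parameters are hypothetical and introduced only for illustration:

```python
import numpy as np

def evaluate_loss_vectorized(preds, actual, w_under=0.4, w_over=0.6):
    """Weighted mean absolute error with separate weights per direction."""
    preds = np.asarray(preds, dtype=float).ravel()
    actual = np.asarray(actual, dtype=float).ravel()
    diff = preds - actual
    under = np.clip(-diff, 0, None).sum()  # total days where prediction < actual
    over = np.clip(diff, 0, None).sum()    # total days where prediction > actual
    return float((w_under * under + w_over * over) / len(preds))

# Worked example: predictions 5, 3, 2 against actuals 4, 3, 3.
# One record is over by 1 day, one is exact, one is under by 1 day,
# so the loss is (0.4 * 1 + 0.6 * 1) / 3.
toy_loss = evaluate_loss_vectorized([5, 3, 2], [4, 3, 3])
print(toy_loss)
```

On arrays with hundreds of thousands of rows, this form is usually much faster than a Python loop and always returns a plain float.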

## Model

#### LinearRegression model

In [55]:
#Initialize
linear_model= LinearRegression()
#fit model
linear_model.fit(X_train_ss, y_train)
#Predict
linear_preds= linear_model.predict(X_test_ss)
linear_train_pred= linear_model.predict(X_train_ss)
#rounding
linear_train_pred= np.round(linear_train_pred)
linear_preds= np.round(linear_preds)
linear_accuracy_test= define_late(y_true_test, linear_preds)
linear_accuracy_train= define_late(y_true_train, linear_train_pred)

Calculate loss

In [200]:
print(f'Accuracy score test:  {linear_accuracy_test}')
#Loss Calculation
print(f"Linear regression Loss= {evaluate_loss(linear_preds, y_true_test)}")

Accuracy score test:  70.61371987326781
Linear regression Loss= 0.8006579546719841


#### Ridge model

In [57]:
# Initialize
ridge_model= Ridge(solver='lsqr')

# fit
ridge_model.fit(X_train_ss, y_train)
#Predict and round
ridge_preds= ridge_model.predict(X_test_ss)
ridge_preds= np.round(ridge_preds)
ridge_train_pred= ridge_model.predict(X_train_ss)
ridge_train_pred= np.round(ridge_train_pred)

#Calculate accuracy
ridge_accuracy_test= define_late(y_true_test, ridge_preds)
ridge_accuracy_train= define_late(y_true_train, ridge_train_pred)


Calculate loss

In [201]:
print(f'Ridge Accuracy score test:  {ridge_accuracy_test}')
#Loss calculation
print(f"Ridge regression Loss= {evaluate_loss(ridge_preds, y_true_test)}")

Ridge Accuracy score test:  70.61878106305231
Ridge regression Loss= 0.8006822483829498


#### XGboost

In [150]:
# initialize 
xg_boost= XGBRegressor()
#fit
xg_boost.fit(X_train_ss, y_train)
#Predict and round
xg_pred= xg_boost.predict(X_test_ss)
xg_pred= np.round(xg_pred)
xg_train_pred= xg_boost.predict(X_train_ss)
xg_train_pred= np.round(xg_train_pred)
#Calculate accuracy
xg_train_accuracy=define_late(y_true_train, xg_train_pred)
xg_test_accuracy= define_late(y_true_test, xg_pred)


Evaluate XGboost Regression model

In [189]:
print(f'Test Accuracy : {xg_test_accuracy}')
#Loss calculate
print(f'xgboost lost= {evaluate_loss(xg_pred, y_true_test)}')

Test Accuracy : 71.60368859511495
xgboost lost= 0.758265429037058


#### Neural Network

In [184]:
#build model
tf.random.set_seed(123)
# Create a new sequential model
nn_model= keras.Sequential()
# regularizer= keras.regularizers.l2(0.02)
#hidden layers
nn_model.add(Dense(128, activation="relu"))
nn_model.add(Dropout(0.2))
nn_model.add(Dense(64, activation="relu"))
nn_model.add(Dense(32, activation="relu"))
# nn_model.add(Dropout(0.2))
#output layer
nn_model.add(Dense(1))

#compile nn_model
nn_model.compile(
     optimizer=keras.optimizers.Adam(),
     loss=keras.losses.MeanSquaredError(),
     metrics=[keras.metrics.MeanAbsoluteError()]  # regression metric; binary accuracy does not apply to a day count
)

In [185]:
history= nn_model.fit(X_train_ss, y_train, epochs=50,batch_size=64, verbose=0)

In [186]:
#predict 
NN_pred_test = np.round(nn_model.predict(X_test_ss))
NN_pred_train = np.round(nn_model.predict(X_train_ss))



Calculate loss

In [202]:
print(f'Neural Network Test Accuracy: {define_late(y_true_test, NN_pred_test)}')
#calculate Loss
print(f"Neural Network regression Loss= {evaluate_loss(NN_pred_test, y_true_test)}")

Neural Network Test Accuracy: 72.00352258809001
Neural Network regression Loss= [0.75889915]
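The loss above is printed as `[0.75889915]` rather than as a plain number. A likely reason is that Keras `predict` returns an array of shape `(n, 1)` for a `Dense(1)` output, so each `preds[i]` inside `evaluate_loss` is a length-1 array instead of a scalar. The sketch below reproduces the effect with small NumPy arrays (all names and values are illustrative) and shows that flattening with `ravel()` avoids it:

```python
import numpy as np

actual = np.array([4.0, 3.0, 3.0])
preds_2d = np.array([[5.0], [3.0], [2.0]])  # shape (3, 1), like a Dense(1) output
preds_1d = preds_2d.ravel()                 # shape (3,)

# Indexing a (3, 1) array yields a length-1 array, so sums built from it stay arrays.
diff_from_2d = preds_2d[0] - actual[0]
# Indexing a (3,) array yields a scalar.
diff_from_1d = preds_1d[0] - actual[0]

print(np.shape(diff_from_2d), np.shape(diff_from_1d))
```

Flattening the predictions once, right after `predict`, keeps the printed loss in the same format as for the scikit-learn models.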


#### Recurrent Neural Network


In [155]:
#check Shape of feature and target
print(X_train_ss.shape, y_train.shape)

(395160, 8) (395160,)


In [34]:
#embedding configure
number_class= X_train_ss.shape[1]
embedding_dim= 8

In [71]:
tf.random.set_seed(123)
#define rnn
rnn_mode= keras.Sequential()
# add layers
embedding_layer=Embedding(number_class, embedding_dim)

rnn_mode.add(LSTM(64, activation='relu', input_shape=(X_train_ss.shape[1], 1)))

rnn_mode.add(Dense(64, activation= 'relu'))

#output layer
rnn_mode.add(Dense(1))

# Compile model
rnn_mode.compile(
    loss='mean_squared_error', 
    optimizer=Adam(learning_rate=0.02),
    metrics=['mae']  # regression metric; classification accuracy does not apply here
)
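The LSTM layer is declared with `input_shape=(X_train_ss.shape[1], 1)`, meaning each record is treated as a sequence of 8 steps with one value per step, while the scaled feature matrix is 2-D with shape `(samples, 8)`. If the arrays need to be passed in explicit 3-D form, the reshape could look like this sketch, which uses random stand-in data (`X_demo` is a hypothetical name):

```python
import numpy as np

# Stand-in for a scaled feature matrix: 4 samples, 8 features.
X_demo = np.random.default_rng(0).normal(size=(4, 8))

# Add a trailing axis so the shape becomes (samples, timesteps, features_per_step).
X_demo_3d = X_demo[..., np.newaxis]
print(X_demo_3d.shape)
```

Treating tabular columns as time steps is a modelling choice rather than a requirement; the columns here have no natural order.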

In [157]:
# fit model
rnn_history= rnn_mode.fit(X_train_ss, y_train, epochs=50, batch_size=64, validation_split=0.2, verbose= 1)

Epoch 1/50 through Epoch 50/50 (per-epoch progress output not preserved)


In [158]:
#Predict and rounding
rnn_pred_test= np.round(rnn_mode.predict(X_test_ss))
rnn_pred_train= np.round(rnn_mode.predict(X_train_ss))



In [159]:
rnn_mode.summary()

Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 lstm_1 (LSTM)               (None, 64)                16896     
                                                                 
 dense_10 (Dense)            (None, 64)                4160      
                                                                 
 dense_11 (Dense)            (None, 1)                 65        
                                                                 
Total params: 21,121
Trainable params: 21,121
Non-trainable params: 0
_________________________________________________________________


Evaluation Loss

In [199]:
print(f'Test  accuracy= {define_late(y_true_test, rnn_pred_test)}')
#loss calculation
print(f"Recurrent Neural Network Loss= {evaluate_loss(rnn_pred_test, y_true_test)}")


Test  accuracy= 71.93873935884848
Recurrent Neural Network Loss= [0.824561]


#### Soft Summary

So far we have tried Linear Regression, Ridge Regression, XGboost, a Neural Network, and a Recurrent Neural Network.

The score we have for each model is:

  | Model | Accuracy Score (%) | Loss | 
  | ----------- | ----------- |----|
  | Linear Regression | 70.6137 |0.8007 |
  | Ridge Regression | 70.6187 |0.8007 |
  | XGboost | 71.6037 |0.7583 |
  | Neural Network | 72.0035 |0.7589 |
  | Recurrent Neural Network | 71.9387 |0.8246 |

At the moment, XGboost has the lowest loss on the dataset with `Loss= 0.7583`. Coming very close behind is the Neural Network with `Loss= 0.7589` and a slightly higher accuracy of 72.0035. Next, we tune hyperparameters to see if we can reduce the `loss` of the models.

### Hyperparameter Tuning


##### Hyperparameter tuning for `Ridge` and `XGboost`

In [193]:
#Estimator
estimators= [
    ('normalise', StandardScaler()),
    ('model', LinearRegression())
] 
#Pipeline
my_pipe= Pipeline(estimators)
#Ridge
grid1= [
    {
        'model': [Ridge()],
        'normalise':[StandardScaler()],
        'model__alpha':[0.001, 0.01, 0.1, 1],
        'model__solver':['auto', ]
    }]
gridCV1= GridSearchCV(my_pipe, grid1, cv=5, verbose=0)
fit_grid1= gridCV1.fit(X_train, y_train)


In [123]:
#XGBoost
grid2=[    
    {
        'model':[XGBRegressor()],
        'normalise':[StandardScaler()],
        'model__subsample': np.arange(0.1, 1, 0.2),
        'model__max_depth': range (4, 12, 2),
        'model__n_estimators': [60, 120, 180],
        'model__learning_rate': [0.1, 0.01, 0.05]
    }
]
gridCV2= GridSearchCV(my_pipe, grid2, cv=10, verbose=0)
fit_grid2= gridCV2.fit(X_train, y_train)
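Without a `scoring` argument, `GridSearchCV` ranks regressors by their default score (R²), which is not the weighted loss used elsewhere in this notebook. One option, sketched here on synthetic data rather than the eBay dataset, is to wrap the loss with scikit-learn's `make_scorer`. The names `weighted_loss`, `loss_scorer`, `X_demo`, and `y_demo` are hypothetical:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def weighted_loss(y_true, y_pred):
    """Same weighting as evaluate_loss, applied to rounded predictions."""
    diff = np.round(np.asarray(y_pred, dtype=float)) - np.asarray(y_true, dtype=float)
    under = np.clip(-diff, 0, None).sum()  # prediction below actual, weight 0.4
    over = np.clip(diff, 0, None).sum()    # prediction above actual, weight 0.6
    return (0.4 * under + 0.6 * over) / len(diff)

# greater_is_better=False makes the search maximise the negated loss.
loss_scorer = make_scorer(weighted_loss, greater_is_better=False)

# Synthetic stand-in data (not the eBay dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.round(4 + X_demo @ np.array([1.0, 0.5, -0.5]) + rng.normal(scale=0.5, size=200))

pipe = Pipeline([("normalise", StandardScaler()), ("model", Ridge())])
search = GridSearchCV(pipe, {"model__alpha": [0.01, 1, 100]}, scoring=loss_scorer, cv=3)
search.fit(X_demo, y_demo)

# best_score_ is the negated loss, so flip the sign to report it.
print(search.best_params_, -search.best_score_)
```

With this scorer, the selected hyperparameters are the ones that minimise the business loss during cross-validation, which may differ from those that maximise R².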

In [103]:
#Find Best hyperparameter for Ridge
fit_grid1.best_estimator_

In [122]:
#Find Best hyperparameter for Ridge
fit_grid1.best_params_

{'model': Ridge(alpha=1),
 'model__alpha': 1,
 'model__solver': 'auto',
 'normalise': StandardScaler()}

In [127]:
#Find Best hyperparameter for XGBoost
fit_grid2.best_estimator_

In [128]:
fit_grid2.best_params_

{'model': XGBRegressor(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.05, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=4, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=180, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...),
 'model__learning_rate': 0.05,
 'model__max_depth': 4,
 'model__n_estimators': 180,
 'model__subsample': 0.9000000000000001,
 'normalise': StandardScaler()}

In [196]:
grid1_pred_test= np.round(fit_grid1.predict(X_test))
print(f'Ridge tuned hyperparameter Test Accuracy: {define_late(y_true_test, grid1_pred_test)}')

Ridge tuned hyperparameter Test Accuracy: 70.61371987326781


In [197]:
print(f'XGBoost Test Accuracy: {define_late(y_true_test, np.round(fit_grid2.predict(X_test)))}')

XGBoost Test Accuracy: 72.08348938668502


Calculate the loss of each model after hyperparameter tuning

In [198]:
print(f"Ridge tuned loss= {evaluate_loss(grid1_pred_test, y_true_test)}")
print(f"XGboost tuned loss= {evaluate_loss(np.round(fit_grid2.predict(X_test)), y_true_test)}")

Ridge tuned loss= 0.8006579546719841
XGboost tuned loss= 0.7612191393952891


After tuning the hyperparameters for XGboost and Ridge Regression, the results are:

  | Model | Accuracy (%) | Loss | 
  | ----------- | ----------- |----|
  | Linear Regression | 70.6137 |0.8007 |
  | Ridge Regression | 70.6187 |0.8007 |
  | XGboost | 71.6037 |0.7583 |
  | Neural Network | 72.0035 |0.7589 |
  | Recurrent Neural Network | 71.9387 |0.8246 |
  | Tuned hyperparameter Ridge Regression | 70.6137 |0.8007 |
  | Tuned hyperparameter XGboost| 72.0835 |0.7612 |

Overall, the tuned XGboost model gives the highest accuracy on the dataset (72.0835). Its loss (0.7612) is slightly higher than that of the untuned XGboost model (0.7583) and the Neural Network (0.7589).
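Because accuracy and loss do not rank the models in the same order, it can help to sort the results by each metric. The sketch below copies the figures from the table into a pandas DataFrame; the variable names are illustrative only:

```python
import pandas as pd

# Figures copied from the results table.
results = pd.DataFrame(
    [
        ("Linear Regression", 70.6137, 0.8007),
        ("Ridge Regression", 70.6187, 0.8007),
        ("XGboost", 71.6037, 0.7583),
        ("Neural Network", 72.0035, 0.7589),
        ("Recurrent Neural Network", 71.9387, 0.8246),
        ("Tuned hyperparameter Ridge Regression", 70.6137, 0.8007),
        ("Tuned hyperparameter XGboost", 72.0835, 0.7612),
    ],
    columns=["model", "accuracy", "loss"],
)

# Rank by loss (lower is better) and look up the leader on each metric.
print(results.sort_values("loss").to_string(index=False))
best_by_loss = results.loc[results["loss"].idxmin(), "model"]
best_by_accuracy = results.loc[results["accuracy"].idxmax(), "model"]
print(best_by_loss, "|", best_by_accuracy)
```

Keeping the results in a DataFrame also makes it easy to append further models later without rebuilding the table by hand.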

### Next steps
The model estimates the number of days needed to deliver a package at the time the customer places an order. Around 28 percent of orders still arrive later than predicted, and the weighted loss of the predictions is about 0.76 days.

In the future, the model's performance could be improved by finding more informative features.

The model could also be extended with an extra step that predicts handling days. This would give businesses an early warning about which orders might be late, so they can notify customers and improve their experience.