In this notebook, we'll evaluate the best model chosen from our modeling experiments on our test data.

Some of the feature engineering steps we applied to the training data have to be replicated here. Specifically, we'll need to create the following features:
- store_id_freq
- store_category_type
- item_price_range
- hour_of_day

The rest of the data cleaning/preprocessing steps (imputing missing values, scaling data, dropping correlated features) will be taken care of in the pipeline defined in the model artifact.
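To make the division of labor concrete, here is a minimal sketch of what a pipeline with those preprocessing steps could look like. It is not the saved artifact: the toy data, the choice of `LinearRegression`, the median imputation strategy, and the names `toy_X`, `toy_y`, and `toy_pipeline` are all assumptions for illustration, and the custom correlated-feature dropping step is left out because its implementation is not shown in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in data: column names borrowed from this notebook, values made up
toy_X = pd.DataFrame({
    'subtotal': [2450.0, 875.0, np.nan, 2594.0],
    'total_items': [2, 1, 4, 1],
    'store_category_type': ['other', 'asian', 'asian', 'italian'],
})
toy_y = pd.Series([1335.0, 3290.0, 2616.0, 1642.0])

numeric_cols = ['subtotal', 'total_items']
categorical_cols = ['store_category_type']

# Numeric columns: impute missing values, then scale (assumed strategy: median)
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

preprocess = ColumnTransformer([
    ('num', numeric_steps, numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

# Assumed estimator; the real artifact may use a different model entirely
toy_pipeline = Pipeline([('preprocess', preprocess), ('model', LinearRegression())])
toy_pipeline.fit(toy_X, toy_y)
toy_preds = toy_pipeline.predict(toy_X)
```

Because the preprocessing lives inside the fitted pipeline, calling `predict` on raw feature columns applies the same imputation and scaling that were learned during training.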

In [53]:
# ensures notebook automatically receives updates from relevant .py files
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [54]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

Load test data.

In [55]:
data_path = '../datasets/test_data.csv'

test_df = pd.read_csv(data_path)

test_df.head()

Unnamed: 0,market_id,created_at,actual_delivery_time,store_id,store_primary_category,order_protocol,total_items,subtotal,num_distinct_items,min_item_price,max_item_price,total_onshift_dashers,total_busy_dashers,total_outstanding_orders,estimated_order_place_duration,estimated_store_to_consumer_driving_duration,seconds_to_delivery
0,4.0,2015-01-26 03:38:40+00:00,2015-01-26 04:00:55+00:00,3502,mediterranean,3.0,2,2450,2,1200,1250,32.0,50.0,29.0,251,487.0,1335.0
1,4.0,2015-02-07 01:59:48+00:00,2015-02-07 02:54:38+00:00,6917,vietnamese,5.0,1,875,1,875,875,124.0,121.0,201.0,251,601.0,3290.0
2,6.0,2015-02-03 03:32:09+00:00,2015-02-03 04:15:45+00:00,704,indian,3.0,4,2340,3,225,995,,,,251,666.0,2616.0
3,1.0,2015-01-26 02:17:51+00:00,2015-01-26 02:45:13+00:00,1424,pizza,5.0,1,2594,1,1599,1599,43.0,43.0,55.0,251,452.0,1642.0
4,4.0,2015-02-12 21:00:01+00:00,2015-02-12 21:32:57+00:00,5583,sandwich,2.0,6,3871,5,420,703,47.0,36.0,36.0,251,304.0,1976.0


**Prep Test Data**

Create store_id_freq

In [56]:
import sys
sys.path.append('../utils')
import feature_eng_utils

In [57]:
from feature_eng_utils import encode_frequency

value_counts = test_df['store_id'].value_counts()
percentiles = np.percentile(value_counts, [50, 75, 90, 99]) 

# apply encode_frequency to each store_id based on their number of orders
test_df['store_id_freq'] = test_df['store_id'].apply(lambda x: encode_frequency(value_counts[x], percentiles))

pd.DataFrame({'Count':test_df['store_id_freq'].value_counts()}).rename_axis('Frequency Bin')


Frequency Bin,Count
[90-99),14434
[75-90),9725
[50-75),7440
99+,4832
[0-50),3054
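The source of `encode_frequency` is not shown in this notebook, so the following is only a guess at its behavior based on the call signature and the bin labels in the output. The function name `encode_frequency_sketch`, the label list, and the lower-inclusive bin boundaries are assumptions.

```python
from bisect import bisect_right

# Bin labels copied from the output table; their order is an assumption
BIN_LABELS = ['[0-50)', '[50-75)', '[75-90)', '[90-99)', '99+']

def encode_frequency_sketch(count, percentiles):
    """Return the label of the percentile bin that `count` falls into.

    `percentiles` holds the ascending cut points for the 50th, 75th,
    90th and 99th percentiles of per-store order counts.
    """
    # bisect_right returns how many cut points are <= count,
    # which makes each bin inclusive of its lower bound
    return BIN_LABELS[bisect_right(list(percentiles), count)]

# Made-up cut points for illustration
cuts = [2, 5, 10, 40]
low = encode_frequency_sketch(1, cuts)
boundary = encode_frequency_sketch(5, cuts)
high = encode_frequency_sketch(100, cuts)
```

With these made-up cut points, a store with 1 order lands in `[0-50)`, a store with exactly 5 orders lands in `[75-90)` because it sits on the 75th-percentile boundary, and a store with 100 orders lands in `99+`.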


Create store_category_type

In [58]:
from feature_eng_utils import map_to_category_type

test_df['store_category_type'] = test_df['store_primary_category'].apply(map_to_category_type)

value_counts = test_df['store_category_type'].value_counts()

pd.DataFrame({'Count':test_df['store_category_type'].value_counts()}).rename_axis('Store Category Type')

Store Category Type,Count
other,13583
asian,9526
american,7990
italian,4965
mexican,3421


Create item_price_range

In [59]:
test_df['item_price_range'] = test_df['max_item_price'] - test_df['min_item_price']

test_df[['max_item_price', 'min_item_price', 'item_price_range']].head()

Unnamed: 0,max_item_price,min_item_price,item_price_range
0,1250,1200,50
1,875,875,0
2,995,225,770
3,1599,1599,0
4,703,420,283


Create hour_of_day

In [60]:
time_info = test_df['created_at'].astype(str).str.split().str[1]
test_df['hour_of_day'] = time_info.str.split(":").str[0]
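Note that the string-splitting approach produces zero-padded strings rather than integers. The sketch below compares it with a datetime-based alternative on two timestamps taken from the test data preview. Whether the saved pipeline expects strings or integers depends on how the training notebook built this feature, which is not shown here, so treat the alternative as an option to check against training rather than a drop-in replacement. The names `toy`, `by_split`, and `by_datetime` are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    'created_at': ['2015-01-26 03:38:40+00:00', '2015-02-12 21:00:01+00:00'],
})

# String-splitting approach used in this notebook: yields zero-padded strings
time_part = toy['created_at'].astype(str).str.split().str[1]
by_split = time_part.str.split(':').str[0]

# Datetime-based alternative: parse the timestamp and read the hour as an integer
by_datetime = pd.to_datetime(toy['created_at'], utc=True).dt.hour

print(by_split.tolist())     # ['03', '21']
print(by_datetime.tolist())  # [3, 21]
```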

Establish numeric vs. categorical features, & separate features from target.

In [61]:
numeric_feats = [
    'total_items',
    'subtotal',
    'num_distinct_items',
    'total_onshift_dashers',
    'total_busy_dashers',
    'total_outstanding_orders',
    'estimated_order_place_duration',
    'estimated_store_to_consumer_driving_duration',
    'item_price_range',
    'hour_of_day',
]

categorical_feats = [
    'market_id',
    'order_protocol',
    'store_id_freq',
    'store_category_type',
]

target = 'seconds_to_delivery'

In [62]:
test_df_X = test_df[numeric_feats + categorical_feats]

test_df_y = test_df[target]

In [63]:
test_df_X.head()

Unnamed: 0,total_items,subtotal,num_distinct_items,total_onshift_dashers,total_busy_dashers,total_outstanding_orders,estimated_order_place_duration,estimated_store_to_consumer_driving_duration,item_price_range,hour_of_day,market_id,order_protocol,store_id_freq,store_category_type
0,2,2450,2,32.0,50.0,29.0,251,487.0,50,3,4.0,3.0,[90-99),other
1,1,875,1,124.0,121.0,201.0,251,601.0,0,1,4.0,5.0,99+,asian
2,4,2340,3,,,,251,666.0,770,3,6.0,3.0,[0-50),asian
3,1,2594,1,43.0,43.0,55.0,251,452.0,0,2,1.0,5.0,[75-90),italian
4,6,3871,5,47.0,36.0,36.0,251,304.0,283,21,4.0,2.0,[90-99),american


In [64]:
test_df_y.head()

0    1335.0
1    3290.0
2    2616.0
3    1642.0
4    1976.0
Name: seconds_to_delivery, dtype: float64

**Evaluate on Test Set**

Our test data is prepped, so we're ready to evaluate the model on it.

In [65]:
import sys
sys.path.append('../utils')
import custom_transformers

In [66]:
import joblib
# Must be importable for joblib to unpickle a pipeline that contains this custom transformer
from custom_transformers import DropHighlyCorrelatedFeatures

# Load the saved model
loaded_model = joblib.load('../models/best_model.pkl')

In [68]:
from sklearn.metrics import root_mean_squared_error

# Make predictions
y_pred = loaded_model.predict(test_df_X)

# Compute root mean squared error
test_rmse = root_mean_squared_error(test_df_y, y_pred)

print("Test RMSE:", test_rmse)

Test RMSE: 1080.7058538615915
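For reference, RMSE is the square root of the mean squared difference between true and predicted values, and since the target is in seconds, so is the RMSE. The sketch below computes it by hand on made-up predictions (the `toy_pred` values are not model output) and converts the reported test RMSE to minutes.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# First three true values from the test set, paired with made-up predictions
toy_true = [1335.0, 3290.0, 2616.0]
toy_pred = [1500.0, 2400.0, 2616.0]
toy_rmse = rmse(toy_true, toy_pred)

# Convert the test RMSE reported above from seconds to minutes
reported_rmse_seconds = 1080.7058538615915
reported_rmse_minutes = reported_rmse_seconds / 60
print(round(reported_rmse_minutes, 1))  # 18.0
```

Because errors are squared before averaging, a single large miss (like the 890-second error in the toy data) pulls RMSE up more than it would a mean absolute error.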




Final Result Summary:
- Our predictions for delivery duration are off by roughly 18 minutes (test RMSE of ~1,081 seconds), approximately 30 seconds worse than the performance we were seeing during cross-validation.
- Overall, that is poor performance. For any food delivery service looking to model delivery duration, I would not recommend relying on this model as a deployed endpoint.

Future Considerations:
- A logical next step to improve model performance would be to identify which features matter most for prediction and gather more data related to those features.
- Additionally, several of the original features were ID values without any context (market_id, order_protocol), so learning what those features mean may help improve some of the mapping/encoding decisions we made during feature engineering.
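One way to approach the first point is permutation importance, which measures how much a score degrades when a single feature's values are shuffled. The sketch below runs it on synthetic data with a stand-in `LinearRegression`; the data, the model, and the variable names are assumptions for illustration. In this notebook, the analogous call would pass `loaded_model`, `test_df_X`, and `test_df_y` instead.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data in which only the first column drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=200)

model = LinearRegression().fit(X, y)

# Shuffle each column several times and record the average drop in score
result = permutation_importance(
    model, X, y,
    scoring='neg_root_mean_squared_error',
    n_repeats=5,
    random_state=0,
)

# Column indices ordered from most to least important
ranking = np.argsort(result.importances_mean)[::-1]
```

Here the first column ranks highest, as expected from how the target was constructed. Importances computed this way reflect the fitted model's reliance on each feature, not necessarily the feature's causal effect on delivery time.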