# Assignment

When a consumer places an order, we show the expected time of delivery. It is very important to get this right, as it has a big impact on consumer experience. In this exercise, I will build a model to predict the estimated time taken for a delivery.

Concretely, for a given delivery I must predict the total delivery duration in seconds, i.e., the time elapsed from:

Start: the time the consumer submits the order (`created_at`) to

End: the time the order is delivered to the consumer (`actual_delivery_time`)


# Data Description

The file `historical_data.csv` contains a subset of deliveries received in early 2015 in a subset of the cities. Each row in this file corresponds to one unique delivery. Noise has been added to the dataset to obfuscate certain business details. Each column corresponds to a feature, as explained below. Note that all monetary (dollar) values in the data are given in cents and all time durations are given in seconds.

The target value to predict here is the total seconds value between `created_at` and `actual_delivery_time`.
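
As a minimal sketch of how this target could be derived with pandas, the snippet below uses two rows copied from the first rows of `historical_data.csv` shown later in this notebook. The column name `delivery_duration_seconds` is a hypothetical choice made here for illustration, not a name from the dataset.

```python
import pandas as pd

# Two sample rows copied from the head of historical_data.csv.
sample = pd.DataFrame({
    "created_at": ["2015-02-06 22:24:17", "2015-02-10 21:49:25"],
    "actual_delivery_time": ["2015-02-06 23:27:16", "2015-02-10 22:56:29"],
})

# Parse the timestamp strings (both columns are in UTC per the data description).
for col in ["created_at", "actual_delivery_time"]:
    sample[col] = pd.to_datetime(sample[col])

# Target: elapsed seconds between order submission and delivery.
# "delivery_duration_seconds" is a hypothetical column name.
sample["delivery_duration_seconds"] = (
    sample["actual_delivery_time"] - sample["created_at"]
).dt.total_seconds()

print(sample["delivery_duration_seconds"].tolist())  # [3779.0, 4024.0]
```

Rows with a missing `actual_delivery_time` would yield `NaN` under this approach and would need to be handled before training.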



**Time features**
>
>- `market_id`: A city/region in which the company operates, e.g., Los Angeles, given in the data as an id
>- `created_at`: Timestamp in UTC when the order was submitted by the consumer to the company. (Note this timestamp is in UTC, but in case you need it, the actual timezone of the region was US/Pacific)
>- `actual_delivery_time`: Timestamp in UTC when the order was delivered to the consumer
>
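
Since the timestamps are stored in UTC while the region operated in US/Pacific, local-time features such as hour of day may need a conversion first. The following is a sketch under that assumption, using two `created_at` values from the sample rows shown later in this notebook; the variable names are illustrative only.

```python
import pandas as pd

# Two created_at values from the sample rows, stored as naive UTC timestamps.
created_at = pd.Series(
    pd.to_datetime(["2015-02-06 22:24:17", "2015-02-15 02:40:36"])
)

# Mark the naive timestamps as UTC, then convert to the region's local time.
created_local = created_at.dt.tz_localize("UTC").dt.tz_convert("US/Pacific")

# Local hour of day, a possible feature (hypothetical, not part of the dataset).
local_hour = created_local.dt.hour
print(local_hour.tolist())  # [14, 18]
```

Note that the second order falls on the previous calendar day in local time (2015-02-14), so day-of-week features would also differ between UTC and local time.
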
**Store features**
>
>- `store_id`: an id representing the restaurant the order was submitted for
>- `store_primary_category`: cuisine category of the restaurant, e.g., italian, asian
>- `order_protocol`: a store can receive orders from the company through many modes. This field represents an id denoting the protocol
>
**Order features**
>
>- `total_items`: total number of items in the order
>- `subtotal`: total value of the order submitted (in cents)
>- `num_distinct_items`: number of distinct items included in the order
>- `min_item_price`: price of the item with the least cost in the order (in cents)
>- `max_item_price`: price of the item with the highest cost in the order (in cents)
>
**Market features**
>
>Because the company is a marketplace, it has information on the state of the marketplace when the order is placed, which can be used to estimate delivery time. The following features are values at the time of `created_at` (order submission time):
>
>- `total_onshift_dashers`: Number of available dashers who are within 10 miles of the store at the time of order creation.
>- `total_busy_dashers`: Subset of the above `total_onshift_dashers` who are currently working on an order
>- `total_outstanding_orders`: Number of orders within 10 miles of this order that are currently being processed.
>
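
Because noise was added to the dataset, the stated subset relationship between `total_busy_dashers` and `total_onshift_dashers` may not hold in every row. The sketch below checks this on the five sample rows shown later in this notebook; the variable names are illustrative, and how to treat such rows is left as a modeling decision.

```python
import pandas as pd

# Market columns from the five sample rows at the head of historical_data.csv.
market = pd.DataFrame({
    "total_onshift_dashers": [33.0, 1.0, 1.0, 1.0, 6.0],
    "total_busy_dashers": [14.0, 2.0, 0.0, 1.0, 6.0],
})

# Busy dashers are described as a subset of on-shift dashers,
# so a larger busy count contradicts the description.
violates_subset = market["total_busy_dashers"] > market["total_onshift_dashers"]

print(int(violates_subset.sum()))  # 1
print(market[violates_subset])
```
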
**Predictions from other models**
>
>We have predictions from other models for various stages of the delivery process that we can use:
>
>- `estimated_order_place_duration`: Estimated time for the restaurant to receive the order from the company (in seconds)
>- `estimated_store_to_consumer_driving_duration`: Estimated travel time between store and consumer (in seconds)
>

# Practicalities

Build a model to predict the total delivery duration in seconds (as defined above). Explain:

- model(s) used,
- how you evaluated your model performance on the historical data,
- any data processing you performed on the data,
- feature engineering choices you made,
- other information you would like to share about your modeling approach.


In [1]:
import pandas as pd
import numpy as np
from skimpy import skim
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import os

In [3]:
path_to_data = os.getenv("path_to_data")

In [5]:
df = pd.read_csv(f"{path_to_data}/historical_data.csv")

In [6]:
df.head()

Unnamed: 0,market_id,created_at,actual_delivery_time,store_id,store_primary_category,order_protocol,total_items,subtotal,num_distinct_items,min_item_price,max_item_price,total_onshift_dashers,total_busy_dashers,total_outstanding_orders,estimated_order_place_duration,estimated_store_to_consumer_driving_duration
0,1.0,2015-02-06 22:24:17,2015-02-06 23:27:16,1845,american,1.0,4,3441,4,557,1239,33.0,14.0,21.0,446,861.0
1,2.0,2015-02-10 21:49:25,2015-02-10 22:56:29,5477,mexican,2.0,1,1900,1,1400,1400,1.0,2.0,2.0,446,690.0
2,3.0,2015-01-22 20:39:28,2015-01-22 21:09:09,5477,,1.0,1,1900,1,1900,1900,1.0,0.0,0.0,446,690.0
3,3.0,2015-02-03 21:21:45,2015-02-03 22:13:00,5477,,1.0,6,6900,5,600,1800,1.0,1.0,2.0,446,289.0
4,3.0,2015-02-15 02:40:36,2015-02-15 03:20:26,5477,,1.0,3,3900,3,1100,1600,6.0,6.0,9.0,446,650.0
