# Dataset
The dataset contains information about the delivery person, the restaurant and delivery locations, the date and time of the order, the weather, and the traffic. It covers around 45.5k deliveries.

In [1]:
import pandas as pd
df = pd.read_csv("raw_data.csv", na_values="NaN ")
# print(len(df))

# Dataset cleaning
First, deliveries with missing data or with geographical coordinates outside India were filtered out. There are 4225 rows with missing values. Additionally, there are 3450 rows where the restaurant or the delivery location lies outside India. After this initial preprocessing, almost 38k deliveries (37918) remain for further processing.

In [2]:
# Approximate bounding box of India: (min_lon, min_lat, max_lon, max_lat)
india_bbox = (68.1766451354, 7.96553477623, 97.4025614766, 35.4940095078)
india_high_lat, india_low_lat = 35.4940095078, 7.96553477623
india_high_long, india_low_long = 97.4025614766, 68.1766451354

n_rows_before = len(df)
df.dropna(inplace=True)

# Number of rows removed because of missing values
print(n_rows_before - len(df))


4225


In [3]:
outside_locations_df = df.loc[(df['Restaurant_latitude'] < india_low_lat) |
                              (df['Restaurant_latitude'] > india_high_lat) |
                              (df['Restaurant_longitude'] < india_low_long) |
                              (df['Restaurant_longitude'] > india_high_long) |
                              (df['Delivery_location_latitude'] < india_low_lat) |
                              (df['Delivery_location_latitude'] > india_high_lat) |
                              (df['Delivery_location_longitude'] < india_low_long) |
                              (df['Delivery_location_longitude'] > india_high_long)]
print(len(outside_locations_df))

df.drop(outside_locations_df.index, inplace=True)
print(len(df))

3450
37918
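
The eight comparisons in cell In [3] all apply the same bounding-box test. As a rough illustration, the check for a single delivery can be written with one small helper. The function names below are illustrative and not part of the notebook; the box values are the ones from `india_bbox`.

```python
# (min_lon, min_lat, max_lon, max_lat), same values as india_bbox in the notebook
INDIA_BBOX = (68.1766451354, 7.96553477623, 97.4025614766, 35.4940095078)


def in_bbox(lat, lon, bbox=INDIA_BBOX):
    """Return True if the point (lat, lon) lies inside the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def delivery_in_india(rest_lat, rest_lon, dest_lat, dest_lon):
    """A delivery is kept only if both of its end points are inside the box."""
    return in_bbox(rest_lat, rest_lon) and in_bbox(dest_lat, dest_lon)


print(delivery_in_india(22.745049, 75.892472, 22.765049, 75.912472))  # True
print(delivery_in_india(0.0, 0.0, 0.13, 0.13))                        # False
```

A bounding box is only an approximation of the country's outline, so points in neighbouring countries that fall inside the box still pass this test.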


# Preprocessing using OSRM API
The next step is to map the geographic coordinates onto road distances. To produce routes we used the OSRM API, which finds the fastest route between coordinates via an HTTP request (it is not yet confirmed whether we use the route service or the trip service). The API is preferred over library-based clients because it produces higher-quality data (fewer errors such as impossible routes or incorrect distances). The OSRM client we created sends asynchronous HTTP requests to produce a route between the restaurant and the delivery location and, if the response is successful, saves the route to a CSV file. Additionally, a logger was created that records each unsuccessful request and its reason. While generating routes only one type of error occurred: "Too many requests". This indicates that the server had reached its maximum number of requests per minute (5000). The request limit is shared by all users.
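
The client described above might look roughly like the following sketch. It is not the project's actual code. It assumes the OSRM route service (the text leaves open whether the route or the trip service was used) and the public demo server URL. `fetch_json` is a hypothetical coroutine standing in for whatever asynchronous HTTP library the real client uses, and all function names are illustrative.

```python
import asyncio
import logging

logger = logging.getLogger("osrm_client")

# Assumption: the public OSRM demo server; the real client may use another host.
OSRM_BASE_URL = "https://router.project-osrm.org"


def build_route_url(rest_lat, rest_lon, dest_lat, dest_lon, base_url=OSRM_BASE_URL):
    """Build a route-service URL. OSRM expects coordinates as longitude,latitude."""
    coords = f"{rest_lon},{rest_lat};{dest_lon},{dest_lat}"
    return f"{base_url}/route/v1/driving/{coords}?overview=full&geometries=geojson"


def parse_route(payload):
    """Return (distance_km, [(lat, lon), ...]) or None if no route was found."""
    if payload.get("code") != "Ok" or not payload.get("routes"):
        return None
    route = payload["routes"][0]
    # GeoJSON stores points as [lon, lat]; swap to (lat, lon).
    points = [(lat, lon) for lon, lat in route["geometry"]["coordinates"]]
    return route["distance"] / 1000.0, points  # OSRM reports metres


async def fetch_routes(rows, fetch_json, max_concurrent=10):
    """rows: iterable of (row_number, rest_lat, rest_lon, dest_lat, dest_lon).

    fetch_json is a hypothetical coroutine: it takes a URL and returns the
    decoded JSON body, raising an exception on HTTP errors.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(row):
        row_number, *coords = row
        async with semaphore:
            try:
                payload = await fetch_json(build_route_url(*coords))
            except Exception as exc:
                # Log the failed request and its reason, then skip the row.
                logger.warning("row %s: request failed: %s", row_number, exc)
                return None
        parsed = parse_route(payload)
        if parsed is None:
            logger.warning("row %s: OSRM returned code %s", row_number, payload.get("code"))
            return None
        distance_km, points = parsed
        return {
            "Og-data-row-number": row_number,
            "Route-coordinates": points,
            "Distance [km]": distance_km,
        }

    results = await asyncio.gather(*(fetch_one(row) for row in rows))
    return [result for result in results if result is not None]
```

Passing the HTTP call in as `fetch_json` keeps the sketch independent of a particular HTTP library. The semaphore caps the number of requests in flight at once, which may help to stay below a shared rate limit.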

In [4]:
routes_df = pd.read_csv("routes2.csv")
routes_df.head(5)

Unnamed: 0,Og-data-row-number,Route-coordinates,Distance [km]
0,0,"[[(22.745049, 75.892472), (22.745074, 75.89247...",4.1606
1,1,"[[(12.913096, 77.682969), (12.911958, 77.68272...",28.9704
2,2,"[[(12.914273, 77.678365), (12.913841, 77.67824...",5.1441
3,3,"[[(11.00363, 76.975974), (11.004857, 76.975877...",13.1787
4,5,"[[(17.431655, 78.408324), (17.431691, 78.40848...",9.6564


# Final preprocessing
Final preprocessing merges the OSRM results with the dataset and removes unused columns (such as IDs, festival, and vehicle information). Then, to simplify modeling, the meal preparation time is calculated.

In [35]:
full_data = df.merge(routes_df, left_index=True, right_on="Og-data-row-number", how="inner")
full_data.drop(columns=["ID", "Delivery_person_ID", "Type_of_vehicle", "Festival", "City", "Vehicle_condition", "Og-data-row-number", "Delivery_person_Age", "Restaurant_latitude",
                        "Restaurant_longitude", "Delivery_location_latitude", "Delivery_location_longitude", "Order_Date", "Weatherconditions", "Type_of_order", "Route-coordinates"], inplace=True)
full_data.head(5)

Unnamed: 0,Delivery_person_Ratings,Time_Orderd,Time_Order_picked,Road_traffic_density,multiple_deliveries,Time_taken(min),Distance [km]
0,4.9,11:30:00,11:45:00,High,0.0,(min) 24,4.1606
1,4.5,19:45:00,19:50:00,Jam,1.0,(min) 33,28.9704
2,4.4,08:30:00,08:45:00,Low,1.0,(min) 26,5.1441
3,4.7,18:00:00,18:10:00,Medium,1.0,(min) 21,13.1787
5,4.6,13:30:00,13:45:00,High,1.0,(min) 30,8.8685


In [43]:
# Strip the "(min) " prefix so that the delivery time can later be cast to int
full_data["Time_taken(min)"] = full_data["Time_taken(min)"].str.removeprefix("(min) ")
full_data[["Time_Orderd","Time_Order_picked"]] = full_data[["Time_Orderd","Time_Order_picked"]].apply(pd.to_datetime, format="%H:%M:%S")
# Meal preparation time in minutes: pick-up time minus order time
full_data["Meal_preparation_time"] = full_data["Time_Order_picked"]-full_data["Time_Orderd"]
full_data["Meal_preparation_time"] = full_data["Meal_preparation_time"].dt.total_seconds()/60
# Orders picked up after midnight give negative values; add one day (1440 min)
full_data.loc[full_data["Meal_preparation_time"]<0, "Meal_preparation_time"] = full_data["Meal_preparation_time"] + 1440
model_data = full_data.drop(columns=["Time_Orderd","Time_Order_picked"])
model_data.head(5)


Unnamed: 0,Delivery_person_Ratings,Road_traffic_density,multiple_deliveries,Time_taken(min),Distance [km],Meal_preparation_time
0,4.9,High,0.0,24,4.1606,15.0
1,4.5,Jam,1.0,33,28.9704,5.0
2,4.4,Low,1.0,26,5.1441,15.0
3,4.7,Medium,1.0,21,13.1787,10.0
5,4.6,High,1.0,30,8.8685,15.0
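
The midnight correction in the cell above can be checked on single values. The sketch below repeats the same arithmetic with the standard library only. `preparation_minutes` is an illustrative helper and not part of the notebook.

```python
from datetime import datetime


def preparation_minutes(ordered, picked, fmt="%H:%M:%S"):
    """Minutes between order and pick-up, given as time-of-day strings."""
    delta = datetime.strptime(picked, fmt) - datetime.strptime(ordered, fmt)
    minutes = delta.total_seconds() / 60
    # A negative value means the pick-up happened after midnight: add one day.
    return minutes + 1440 if minutes < 0 else minutes


print(preparation_minutes("11:30:00", "11:45:00"))  # 15.0
print(preparation_minutes("23:55:00", "00:05:00"))  # 10.0 (-1430 + 1440)
```

Because only the time of day is available, this correction assumes that no meal takes 24 hours or longer to prepare.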


Check whether each delivery is possible given the time taken and the meal preparation time. Around 1600 deliveries are impossible (the total time is less than or equal to the meal preparation time).

In [48]:
model_data["Time_taken(min)"] = model_data["Time_taken(min)"].astype(int)
# A delivery cannot take less time in total than the meal preparation alone
impossible_delivery = model_data.loc[model_data["Time_taken(min)"] <= model_data["Meal_preparation_time"]]
model_data.drop(impossible_delivery.index, inplace=True)
model_data.head(5)

Unnamed: 0,Delivery_person_Ratings,Road_traffic_density,multiple_deliveries,Time_taken(min),Distance [km],Meal_preparation_time
0,4.9,High,0.0,24,4.1606,15.0
1,4.5,Jam,1.0,33,28.9704,5.0
2,4.4,Low,1.0,26,5.1441,15.0
3,4.7,Medium,1.0,21,13.1787,10.0
5,4.6,High,1.0,30,8.8685,15.0


# Additional preparation for model
To use categorical data such as traffic level or multiple deliveries, we mapped the non-numerical values to an index variable in the range 1-4 for use in the Stan model. We also standardized road distance, meal preparation time, and delivery person rating, because the data was too widely spread, which caused problems in modeling. This also simplified the interpretation of the model and its parts.
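
The index mapping can be sketched as follows. Stan arrays are indexed from 1, which is why the indices start at 1 rather than 0. The ordering of the levels below is an assumption; the text only states that the indices cover the range 1-4. The names are illustrative.

```python
# Assumed ordering of the traffic levels (lowest to highest density).
TRAFFIC_INDEX = {"Low": 1, "Medium": 2, "High": 3, "Jam": 4}


def traffic_to_index(label):
    """Map a traffic label to its 1-based index for the Stan model."""
    # Labels in the raw file may carry stray whitespace, so strip before lookup.
    return TRAFFIC_INDEX[label.strip()]


traffic = ["High", "Jam", "Low", "Medium", "High"]  # first rows of model_data
traffic_idx = [traffic_to_index(t) for t in traffic]
print(traffic_idx)  # [3, 4, 1, 2, 3]
```

On a DataFrame column the same dictionary could be applied with `Series.map`.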

Filtering the data to a manually set maximum delivery distance is still a viable option.

In [51]:
# Standardization
model_data["Distance [km]"].max()
# model_data["Normalized Distance"] = (model_data["Distance [km]"] - model_data["Distance [km]"].mean())/ model_data

121.8955
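
A minimal sketch of the standardization described in this section, assuming the usual z-score (subtract the mean, divide by the sample standard deviation). The `standardize` helper is illustrative, and the input values are the five distances shown in the table above.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(x - mu) / sigma for x in values]


distances_km = [4.1606, 28.9704, 5.1441, 13.1787, 8.8685]
z_scores = standardize(distances_km)

# After standardization the values have mean 0 and standard deviation 1.
print(round(mean(z_scores), 6), round(stdev(z_scores), 6))
```

pandas' `Series.std()` also uses the sample standard deviation by default, so `(s - s.mean()) / s.std()` on a column gives the same result. The same transformation would apply to meal preparation time and delivery person rating.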