# Information

`order_datetime` - time of the order

`origin_longitude` - longitude of the order

`origin_latitude` - latitude of the order

`m_order_eta` - time before order arrival

`order_gk` - order number

`order_status_key` - order status, an enumeration with the following values:

- `4` - cancelled by client
- `9` - cancelled by system, i.e., a reject

`is_driver_assigned_key` - whether a driver has been assigned

`cancellations_time_in_seconds` - how many seconds passed before cancellation

The `data_offers` data set is a simple mapping with 2 columns:

`order_gk` - order number, associated with the same column from the orders data set

`offer_id` - ID of an offer
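
Because `order_gk` appears in both data sets, the offers can be attached to their orders with a join. The snippet below is only a minimal sketch of that idea: the two small frames and their values are made up for illustration and are not taken from the real files.

```python
import pandas as pd

# Toy stand-ins for the two data sets (values are made up for illustration).
orders = pd.DataFrame({"order_gk": [1, 2, 3], "order_status_key": [4, 9, 4]})
offers = pd.DataFrame({"order_gk": [1, 1, 3], "offer_id": [10, 11, 12]})

# A left join keeps every order, including those that received no offer.
merged = orders.merge(offers, on="order_gk", how="left")

# Number of offers per order; count() ignores the NaN left by unmatched orders.
offers_per_order = merged.groupby("order_gk")["offer_id"].count()
print(offers_per_order)
```

A left join is used here so that orders without any offer stay visible; an inner join would silently drop them.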

## Loading data

In [2]:
from ipywidgets import interact
from skimpy import skim
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [3]:
# Root folder of the project data, read from the "path_data" environment variable
data_source = os.environ.get("path_data")
df_data_offers = pd.read_csv(f"{data_source}/raw_data/data_offers.csv")
df_data_orders = pd.read_csv(f"{data_source}/raw_data/data_orders.csv")

In [4]:
df_data_offers.head()

Unnamed: 0,order_gk,offer_id
0,3000579625629,300050936206
1,3000627306450,300052064651
2,3000632920686,300052408812
3,3000632771725,300052393030
4,3000583467642,300051001196


In [5]:
df_data_orders.head()

Unnamed: 0,order_datetime,origin_longitude,origin_latitude,m_order_eta,order_gk,order_status_key,is_driver_assigned_key,cancellations_time_in_seconds
0,18:08:07,-0.978916,51.456173,60.0,3000583041974,4,1,198.0
1,20:57:32,-0.950385,51.456843,,3000583116437,4,0,128.0
2,12:07:50,-0.96952,51.455544,477.0,3000582891479,4,1,46.0
3,13:50:20,-1.054671,51.460544,658.0,3000582941169,4,1,62.0
4,21:24:45,-0.967605,51.458236,,3000583140877,9,0,


## Exploratory data analysis

In [6]:
df_data_offers.shape

(334363, 2)

In [7]:
df_data_orders.shape

(10716, 8)

In [8]:
skim(df_data_orders)

In [9]:
df_data_orders.describe()

Unnamed: 0,origin_longitude,origin_latitude,m_order_eta,order_gk,order_status_key,is_driver_assigned_key,cancellations_time_in_seconds
count,10716.0,10716.0,2814.0,10716.0,10716.0,10716.0,7307.0
mean,-0.964323,51.450541,441.415423,3000598000000.0,5.590612,0.262598,157.892021
std,0.022818,0.011984,288.006379,23962610.0,2.328845,0.440066,213.366963
min,-1.066957,51.399323,60.0,3000550000000.0,4.0,0.0,3.0
25%,-0.974363,51.444643,233.0,3000583000000.0,4.0,0.0,45.0
50%,-0.966386,51.451972,368.5,3000595000000.0,4.0,0.0,98.0
75%,-0.949605,51.456725,653.0,3000623000000.0,9.0,1.0,187.5
max,-0.867088,51.496169,1559.0,3000633000000.0,9.0,1.0,4303.0


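**Comment**: The mean and median are close for the coordinates, but both `m_order_eta` (mean 441 vs. median 368.5) and `cancellations_time_in_seconds` (mean 158 vs. median 98, maximum 4303) have a mean well above the median, i.e., they are skewed to the right. `order_status_key` and `is_driver_assigned_key` are categorical codes, so comparing their mean and median says little.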
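
The mean-versus-median comparison can also be made explicit, together with the sample skewness. The sketch below runs on the five rows shown by `head()` only, so its numbers are not representative; on the full data one would presumably pass `df_data_orders[["m_order_eta", "cancellations_time_in_seconds"]]` instead of `sample`.

```python
import numpy as np
import pandas as pd

# First five rows of df_data_orders, copied from the head() output.
sample = pd.DataFrame({
    "m_order_eta": [60.0, np.nan, 477.0, 658.0, np.nan],
    "cancellations_time_in_seconds": [198.0, 128.0, 46.0, 62.0, np.nan],
})

# NaN values are skipped by default in all three statistics.
summary = pd.DataFrame({
    "mean": sample.mean(),
    "median": sample.median(),
    "skew": sample.skew(),  # sample skewness; positive means a longer right tail
})
summary["mean_minus_median"] = summary["mean"] - summary["median"]
print(summary)
```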

In [12]:
df_data_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10716 entries, 0 to 10715
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   order_datetime                 10716 non-null  object 
 1   origin_longitude               10716 non-null  float64
 2   origin_latitude                10716 non-null  float64
 3   m_order_eta                    2814 non-null   float64
 4   order_gk                       10716 non-null  int64  
 5   order_status_key               10716 non-null  int64  
 6   is_driver_assigned_key         10716 non-null  int64  
 7   cancellations_time_in_seconds  7307 non-null   float64
dtypes: float64(4), int64(3), object(1)
memory usage: 669.9+ KB


In [11]:
df_data_orders.isnull().sum()/len(df_data_orders)

order_datetime                   0.000000
origin_longitude                 0.000000
origin_latitude                  0.000000
m_order_eta                      0.737402
order_gk                         0.000000
order_status_key                 0.000000
is_driver_assigned_key           0.000000
cancellations_time_in_seconds    0.318122
dtype: float64

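**Comment**: `m_order_eta` is missing in about 74% of the rows, so there are far more null values than observations for this variable, and the observed values are right-skewed. The missing share (0.737) equals the share of orders without an assigned driver (1 - 0.263, the mean of `is_driver_assigned_key` in the `describe()` output), which suggests that the ETA is only recorded once a driver is assigned; this still needs to be checked row by row. For analyses over all orders it would therefore be better to discard this variable, since it could add noise to the dataset.
Regarding `cancellations_time_in_seconds`, about 32% of its values are null. Its distribution is also right-skewed (mean 158 vs. median 98), though most values stay in a narrow range (75% are below 188 seconds).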
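
Whether the missing values are tied to another column can be checked with a cross-tabulation of a missing-value indicator against that column. The sketch below does this for `m_order_eta` against `is_driver_assigned_key`, and for `cancellations_time_in_seconds` against `order_status_key`. Which pairs to compare is an assumption made here for illustration, and the five `head()` rows are far too few to draw conclusions; the same calls should work on `df_data_orders`.

```python
import numpy as np
import pandas as pd

# First five rows of df_data_orders, copied from the head() output.
sample = pd.DataFrame({
    "m_order_eta": [60.0, np.nan, 477.0, 658.0, np.nan],
    "is_driver_assigned_key": [1, 0, 1, 1, 0],
    "order_status_key": [4, 4, 4, 4, 9],
    "cancellations_time_in_seconds": [198.0, 128.0, 46.0, 62.0, np.nan],
})

# Rows: driver assigned (0/1); columns: whether the ETA is missing.
eta_missing = sample["m_order_eta"].isna().rename("eta_missing")
eta_table = pd.crosstab(sample["is_driver_assigned_key"], eta_missing)

# Rows: order status (4/9); columns: whether the cancellation time is missing.
cancel_missing = (
    sample["cancellations_time_in_seconds"].isna().rename("cancel_time_missing")
)
cancel_table = pd.crosstab(sample["order_status_key"], cancel_missing)

print(eta_table)
print(cancel_table)
```

If the off-diagonal cells are zero on the full data as well, the null values are structural (no driver, hence no ETA) rather than random gaps.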

In [14]:
df_data_orders['order_status_key'].value_counts(normalize=True)*100

4    68.187757
9    31.812243
Name: order_status_key, dtype: float64

**Comment**: About 68% of the orders were cancelled by the client, while about 32% were cancelled by the system.
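
To make such output easier to read, the numeric keys can be replaced by the labels from the data description before counting. This is a small sketch on the five `head()` rows; the name `STATUS_LABELS` is introduced here for illustration only.

```python
import pandas as pd

# Labels taken from the description of order_status_key.
STATUS_LABELS = {4: "cancelled by client", 9: "cancelled by system"}

# order_status_key of the first five rows, copied from the head() output.
status = pd.Series([4, 4, 4, 4, 9], name="order_status_key")

# Share of each status in percent, indexed by label instead of key.
shares = status.map(STATUS_LABELS).value_counts(normalize=True) * 100
print(shares)
```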

In [16]:
df_data_orders['is_driver_assigned_key'].value_counts(normalize=True)*100

0    73.740202
1    26.259798
Name: is_driver_assigned_key, dtype: float64

**Comment**: Only 26% of the cancelled orders had a driver assigned.