# Acquire training and testing data

The data comes from the UberEats database and covers about 45 thousand recent deliveries across different cities.

## Libraries

In [1]:
# Warnings
import warnings
warnings.filterwarnings('ignore')

# Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Settings
pd.set_option('display.max_columns', None)
sns.set_style('whitegrid')
%matplotlib inline

## Loading data

In [2]:
data = pd.read_csv("../data/uber-eats-deliveries.csv")

---
# Wrangle, prepare, cleanse the data

## Understanding data information

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45593 entries, 0 to 45592
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   ID                           45593 non-null  object 
 1   Delivery_person_ID           45593 non-null  object 
 2   Delivery_person_Age          45593 non-null  object 
 3   Delivery_person_Ratings      45593 non-null  object 
 4   Restaurant_latitude          45593 non-null  float64
 5   Restaurant_longitude         45593 non-null  float64
 6   Delivery_location_latitude   45593 non-null  float64
 7   Delivery_location_longitude  45593 non-null  float64
 8   Order_Date                   45593 non-null  object 
 9   Time_Orderd                  45593 non-null  object 
 10  Time_Order_picked            45593 non-null  object 
 11  Weatherconditions            45593 non-null  object 
 12  Road_traffic_density         45593 non-null  object 
 13  Vehicle_conditio

In [4]:
data.shape

(45593, 20)

Uber's collection systems record unavailable information as the string 'NaN ' rather than as a true missing value, which is why the columns listed by data.info() show no nulls. When I handle missing values later, I need to capture these strings and replace them with np.nan. Apart from that, the dataset has the information I need to train a model and provide a solution.
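
Before cleaning, it can help to see how many of these placeholder strings each column holds. The following is a minimal sketch on a small made-up frame, not the real file; the helper name `count_nan_strings` and the toy values are assumptions for illustration. It strips whitespace before comparing, so both 'NaN' and 'NaN ' are counted.

```python
import pandas as pd


def count_nan_strings(frame: pd.DataFrame, token: str = "NaN") -> pd.Series:
    """Count, per column, the cells whose text equals `token` after stripping."""
    counts = {}
    for column in frame.columns:
        # Cast to str so the .str accessor works whatever the column dtype is
        as_text = frame[column].astype(str).str.strip()
        counts[column] = int((as_text == token).sum())
    return pd.Series(counts)


# Hypothetical rows shaped like the raw dataset
toy = pd.DataFrame({
    "Delivery_person_Age": ["37", "NaN ", "23"],
    "Delivery_person_Ratings": ["4.9", "NaN ", "NaN "],
    "City": ["Urban ", "Metropolitian ", "Urban "],
})

placeholder_counts = count_nan_strings(toy)
print(placeholder_counts)
```

Running the same helper on the full frame would give a per-column picture of what data.info() cannot show while the placeholders are still strings.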

## Understanding data dictionary

|Column|Description |
| :------------ |:---------------:|
|**ID**|order ID number| 
|**Delivery_person_ID**|ID number of the delivery partner|
|**Delivery_person_Age**|Age of the delivery partner|
|**Delivery_person_Ratings**|Ratings of the delivery partner based on past deliveries|
|**Restaurant_latitude**|The latitude of the restaurant|
|**Restaurant_longitude**|The longitude of the restaurant|
|**Delivery_location_latitude**|The latitude of the delivery location|
|**Delivery_location_longitude**|The longitude of the delivery location|
|**Order_Date**|Date of the order|
|**Time_Orderd**|Time the order was placed|
|**Time_Order_picked**|Time the order was picked up|
|**Weatherconditions**|Weather conditions of the day|
|**Road_traffic_density**|Density of the traffic|
|**Vehicle_condition**|Condition of the vehicle|
|**Type_of_order**|The type of meal ordered by the customer|
|**Type_of_vehicle**|The type of vehicle the delivery partner rides|
|**multiple_deliveries**|Number of deliveries the driver picked up|
|**Festival**|Whether there was a festival or not|
|**City**|Type of city|
|**Time_taken(min)**| The time taken by the delivery partner to complete the order|

## Understanding data structure
Understanding the data structure is essential in data science projects, because the data has been organized and stored in a particular way for efficiency. Using the data the way it was meant to be used helps reduce complexity, improve data quality, and enable faster and more accurate analysis.

In [5]:
data.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min)
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,19-03-2022,11:30:00,11:45:00,conditions Sunny,High,2,Snack,motorcycle,0,No,Urban,(min) 24
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,25-03-2022,19:45:00,19:50:00,conditions Stormy,Jam,2,Snack,scooter,1,No,Metropolitian,(min) 33
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,19-03-2022,08:30:00,08:45:00,conditions Sandstorms,Low,0,Drinks,motorcycle,1,No,Urban,(min) 26
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,05-04-2022,18:00:00,18:10:00,conditions Sunny,Medium,0,Buffet,motorcycle,1,No,Metropolitian,(min) 21
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,26-03-2022,13:30:00,13:45:00,conditions Cloudy,High,1,Snack,scooter,1,No,Metropolitian,(min) 30


From the data structure I can see that some variables can be grouped together, such as Order_Date and Time_Orderd, and then engineered using a datetime format. The target (Time_taken(min)) also needs to be transformed into a discrete (integer) variable.

## Data Cleanse 
Data cleansing, also known as data cleaning or data scrubbing, is the process of identifying and correcting or removing errors, inconsistencies, and inaccuracies in a dataset. The goal of data cleansing is to improve the quality of data so that it can be used effectively in data analysis, decision-making, and other applications.

Fortunately the data is already fairly clean, but some columns need attention, so in this section my focus will be:

- Map all the missing values.
- Transform target (Time_taken(min)) into an int.
- Combine the time features, convert them with pd.to_datetime, then engineer new features from them.
- Remove spaces from str features.

### Mapping all the missing values

In [6]:
# Replace the literal string 'NaN ' (note the trailing space) with a real missing value
for column in data.columns:
    data[column] = data[column].apply(lambda value: np.nan if value == 'NaN ' else value)

### Transform target

In [33]:
# Extract the number from strings like '(min) 24', whatever its digit count
data['Time_taken(min)'] = data['Time_taken(min)'].str.extract(r'(\d+)', expand=False).astype(int)
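
As a quick sanity check of the target format, the sketch below parses a few made-up values shaped like those in data.head() with a digit-extracting regular expression. The one-digit and three-digit values are hypothetical and are included only to exercise the edge cases.

```python
import pandas as pd

# Made-up values in the same "(min) NN" shape shown by data.head()
raw_target = pd.Series(["(min) 24", "(min) 33", "(min) 9", "(min) 105"])

# Pull out the first run of digits, then cast the result to int
parsed_target = raw_target.str.extract(r"(\d+)", expand=False).astype(int)

print(parsed_target.tolist())  # [24, 33, 9, 105]
```

Unlike slicing a fixed number of trailing characters, this does not depend on the duration having exactly two digits.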

### Order_Date, Time_Orderd & Time_Order_picked

In [None]:
# Combine time variables

data['Time_Orderd'] = data['Order_Date'] + ' ' + data['Time_Orderd']
data['Time_Order_picked'] = data['Order_Date'] + ' ' + data['Time_Order_picked']

In [7]:
# Convert to pd.to_datetime

data['Time_Orderd'] = pd.to_datetime(data['Time_Orderd'], format = '%d-%m-%Y %H:%M:%S')
data['Time_Order_picked'] = pd.to_datetime(data['Time_Order_picked'], format = '%d-%m-%Y %H:%M:%S')

In [25]:
# Feature Engineering new variables out of Time_Orderd & Time_Order_picked

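# Time_To_Pick: total time, in minutes, it took the restaurant to prepare the order and the Delivery_person to pick it up
# (.astype('timedelta64[m]') is rejected by pandas 2.x, so convert via seconds instead)

data['Time_To_Pick'] = (data['Time_Order_picked'] - data['Time_Orderd']).dt.total_seconds() / 60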
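
One edge case the subtraction above does not cover: both timestamps are built from Order_Date, so an order placed shortly before midnight and picked up after it yields a negative Time_To_Pick. The sketch below, on made-up rows, shows one possible correction, which is to shift the pickup by one day whenever it precedes the order. Whether this matches how the source system records dates is an assumption.

```python
import pandas as pd

# Hypothetical rows; the second order crosses midnight
toy = pd.DataFrame({
    "Order_Date": ["19-03-2022", "25-03-2022"],
    "Time_Orderd": ["11:30:00", "23:55:00"],
    "Time_Order_picked": ["11:45:00", "00:05:00"],
})

fmt = "%d-%m-%Y %H:%M:%S"
ordered = pd.to_datetime(toy["Order_Date"] + " " + toy["Time_Orderd"], format=fmt)
picked = pd.to_datetime(toy["Order_Date"] + " " + toy["Time_Order_picked"], format=fmt)

# Assumption: a pickup earlier than the order means it happened on the next day
crossed_midnight = picked < ordered
picked = picked + pd.to_timedelta(crossed_midnight.astype(int), unit="D")

# Elapsed minutes between order and pickup
toy["Time_To_Pick"] = (picked - ordered).dt.total_seconds() / 60

print(toy["Time_To_Pick"].tolist())  # [15.0, 10.0]
```

Without the one-day shift, the second row would come out as a large negative number of minutes, which is an easy way to spot affected rows in the real data.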
