## Food Delivery Estimator


#### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) Problem statement
- This project estimates how long a food delivery takes based on characteristics such as the courier's age and rating, weather, traffic, and distance.


### 2) Data Collection
- Dataset source: https://www.kaggle.com/datasets/gauravmalik26/food-delivery-dataset

### 3) About Dataset

Food delivery is a courier service in which a restaurant, store, or independent food-delivery company delivers food to a customer. An order is typically made either through a restaurant or grocer's website or mobile app, or through a food ordering company. The delivered items can include entrees, sides, drinks, desserts, or grocery items and are typically delivered in boxes or bags. The delivery person will normally drive a car, but in bigger cities where homes and restaurants are closer together, they may use bikes or motorized scooters.

# EDA

In [21]:
import numpy as np
import pandas as pd
import math
import seaborn as sns
import haversine as hs
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#### Import the CSV Data as Pandas DataFrame

In [22]:
df = pd.read_csv('../data/deliverytime.csv')


#### Show Top 5 Records to understand the data

In [23]:
df.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min)
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,19-03-2022,11:30:00,11:45:00,conditions Sunny,High,2,Snack,motorcycle,0,No,Urban,(min) 24
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,25-03-2022,19:45:00,19:50:00,conditions Stormy,Jam,2,Snack,scooter,1,No,Metropolitian,(min) 33
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,19-03-2022,08:30:00,08:45:00,conditions Sandstorms,Low,0,Drinks,motorcycle,1,No,Urban,(min) 26
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,05-04-2022,18:00:00,18:10:00,conditions Sunny,Medium,0,Buffet,motorcycle,1,No,Metropolitian,(min) 21
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,26-03-2022,13:30:00,13:45:00,conditions Cloudy,High,1,Snack,scooter,1,No,Metropolitian,(min) 30


In [24]:
# Dropping duplicates
df.drop_duplicates(inplace= True)

In [25]:
# Get Info about the dataset / features

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45593 entries, 0 to 45592
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   ID                           45593 non-null  object 
 1   Delivery_person_ID           45593 non-null  object 
 2   Delivery_person_Age          45593 non-null  object 
 3   Delivery_person_Ratings      45593 non-null  object 
 4   Restaurant_latitude          45593 non-null  float64
 5   Restaurant_longitude         45593 non-null  float64
 6   Delivery_location_latitude   45593 non-null  float64
 7   Delivery_location_longitude  45593 non-null  float64
 8   Order_Date                   45593 non-null  object 
 9   Time_Orderd                  45593 non-null  object 
 10  Time_Order_picked            45593 non-null  object 
 11  Weatherconditions            45593 non-null  object 
 12  Road_traffic_density         45593 non-null  object 
 13  Vehicle_conditio

In [26]:
# Check whether there are any null values

df.isnull().sum()

ID                             0
Delivery_person_ID             0
Delivery_person_Age            0
Delivery_person_Ratings        0
Restaurant_latitude            0
Restaurant_longitude           0
Delivery_location_latitude     0
Delivery_location_longitude    0
Order_Date                     0
Time_Orderd                    0
Time_Order_picked              0
Weatherconditions              0
Road_traffic_density           0
Vehicle_condition              0
Type_of_order                  0
Type_of_vehicle                0
multiple_deliveries            0
Festival                       0
City                           0
Time_taken(min)                0
dtype: int64

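    `isnull()` reports no missing values, but several columns that should be numeric (for example `Delivery_person_Age` and `Delivery_person_Ratings`) are typed as "object". Missing entries may therefore be stored as the literal text "NaN" rather than as real nulls. Let's investigate this further to ensure data integrity.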
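One way to make this check explicit is to count the cells in non-numeric columns that contain the literal text `NaN`. The sketch below runs on a small made-up frame; the `toy` data and the `count_nan_strings` helper are illustrative names, not part of the notebook.

```python
import pandas as pd

def count_nan_strings(frame: pd.DataFrame) -> pd.Series:
    """Count, per non-numeric column, the cells whose text contains 'NaN'."""
    counts = {}
    for name, col in frame.items():
        if pd.api.types.is_numeric_dtype(col):
            continue  # numeric columns cannot hold a 'NaN' string
        counts[name] = int(col.astype(str).str.contains("NaN", regex=False).sum())
    return pd.Series(counts)

# Toy data (hypothetical) mimicking how the raw CSV stores missing values
toy = pd.DataFrame({
    "Delivery_person_Age": ["37", "NaN ", "23"],
    "Weatherconditions": ["conditions Sunny", "conditions NaN", "conditions Cloudy"],
    "Vehicle_condition": [2, 0, 1],
})

print(toy.isnull().sum())      # reports zero nulls everywhere
hidden = count_nan_strings(toy)
print(hidden)                  # reveals the string-encoded missing values
```

On the real data the same helper could be called as `count_nan_strings(df)` before any replacement is done.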

In [27]:
df.sort_values(by='Time_Orderd', ascending=False).head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min)
43053,0x1386,HYDRES14DEL02,,,17.426228,78.407495,17.506228,78.487495,31-03-2022,,22:25:00,conditions NaN,,2,Drinks,electric_scooter,0,No,Metropolitian,(min) 27
1427,0x41d,SURRES04DEL03,,,21.173493,72.801953,21.203493,72.831953,17-03-2022,,23:25:00,conditions NaN,,0,Snack,motorcycle,1,No,Metropolitian,(min) 28
7294,0xe5d,COIMBRES09DEL02,,,11.008638,76.984311,11.028638,77.004311,24-03-2022,,09:00:00,conditions Cloudy,Low,2,Meal,scooter,1,No,Metropolitian,(min) 16
22817,0x747,MYSRES11DEL03,,,12.323225,76.630028,12.353225,76.660028,01-04-2022,,18:35:00,conditions NaN,,0,Drinks,motorcycle,1,No,Metropolitian,(min) 29
22812,0xe62,CHENRES12DEL02,,,12.972793,80.249982,12.992793,80.269982,01-04-2022,,08:15:00,conditions Windy,Low,1,Drinks,scooter,1,No,Metropolitian,(min) 10


In [28]:
# Replace the literal 'NaN' strings with real null values.

df.replace({"NaN": np.nan}, regex=True, inplace = True)

In [29]:
### As we can see, there are NaN values.
df.isnull().sum()

ID                                0
Delivery_person_ID                0
Delivery_person_Age            1854
Delivery_person_Ratings        1908
Restaurant_latitude               0
Restaurant_longitude              0
Delivery_location_latitude        0
Delivery_location_longitude       0
Order_Date                        0
Time_Orderd                    1731
Time_Order_picked                 0
Weatherconditions               616
Road_traffic_density            601
Vehicle_condition                 0
Type_of_order                     0
Type_of_vehicle                   0
multiple_deliveries             993
Festival                        228
City                           1200
Time_taken(min)                   0
dtype: int64

In [30]:
# Drop the rows with NaN values and check how much data remains
df.dropna(inplace=True)
df.shape

(41368, 20)

In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41368 entries, 0 to 45592
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   ID                           41368 non-null  object 
 1   Delivery_person_ID           41368 non-null  object 
 2   Delivery_person_Age          41368 non-null  object 
 3   Delivery_person_Ratings      41368 non-null  object 
 4   Restaurant_latitude          41368 non-null  float64
 5   Restaurant_longitude         41368 non-null  float64
 6   Delivery_location_latitude   41368 non-null  float64
 7   Delivery_location_longitude  41368 non-null  float64
 8   Order_Date                   41368 non-null  object 
 9   Time_Orderd                  41368 non-null  object 
 10  Time_Order_picked            41368 non-null  object 
 11  Weatherconditions            41368 non-null  object 
 12  Road_traffic_density         41368 non-null  object 
 13  Vehicle_condition    

### Cleaning the Time_taken(min) column and converting column types from object to numeric


In [32]:
df["Time_taken(min)"] = df["Time_taken(min)"].str.split().str[1]
df.rename(columns={"Time_taken(min)": "Time_taken_min"}, inplace=True)
df["Delivery_person_Age"] = pd.to_numeric(df["Delivery_person_Age"]).astype(int)
df["Time_taken_min"] = pd.to_numeric(df["Time_taken_min"]).astype(int)
df["Delivery_person_Ratings"] = pd.to_numeric(df["Delivery_person_Ratings"]).astype(float)
df["Weatherconditions"] = df["Weatherconditions"].str.split().str[1]
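Splitting on whitespace and taking the second token works as long as every value has exactly the shape `(min) 24` or `conditions Sunny`. As a hedged alternative, the sketch below pulls the digits out with a regular expression and strips the fixed prefix explicitly. The `toy` frame is made-up data that only mimics the raw format.

```python
import pandas as pd

# Toy data (hypothetical) in the same raw format as the CSV
toy = pd.DataFrame({
    "Time_taken(min)": ["(min) 24", "(min) 33", "(min) 26"],
    "Weatherconditions": ["conditions Sunny", "conditions Stormy", "conditions Fog"],
})

# Extract the first run of digits rather than relying on token position
toy["Time_taken_min"] = (
    toy["Time_taken(min)"].str.extract(r"(\d+)", expand=False).astype(int)
)

# Remove the constant "conditions " prefix and keep the rest of the text
toy["Weatherconditions"] = toy["Weatherconditions"].str.replace(
    r"^conditions\s+", "", regex=True
)

print(toy[["Time_taken_min", "Weatherconditions"]])
```

The regex version does not depend on how many spaces separate the prefix from the value, which is the main reason to prefer it if the raw format is not fully trusted.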

In [33]:
df.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken_min
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,19-03-2022,11:30:00,11:45:00,Sunny,High,2,Snack,motorcycle,0,No,Urban,24
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,25-03-2022,19:45:00,19:50:00,Stormy,Jam,2,Snack,scooter,1,No,Metropolitian,33
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,19-03-2022,08:30:00,08:45:00,Sandstorms,Low,0,Drinks,motorcycle,1,No,Urban,26
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,05-04-2022,18:00:00,18:10:00,Sunny,Medium,0,Buffet,motorcycle,1,No,Metropolitian,21
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,26-03-2022,13:30:00,13:45:00,Cloudy,High,1,Snack,scooter,1,No,Metropolitian,30


### Calculate the distance using haversine and create the column distance

In [35]:
# Function distance: calculates the distance between two points given their latitude and longitude.
# It is used in a lambda function to create the new column "distance".
# Returns a float, the distance in kilometers, computed with the haversine library.
def distance(rest_lat, rest_log, dest_lat, dest_long):
    rest = (rest_lat,rest_log )
    dest = (dest_lat,dest_long)

    return hs.haversine(rest,dest)
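For reference, the haversine formula itself is short: with latitudes φ and longitudes λ in radians, a = sin²(Δφ/2) + cos φ₁ · cos φ₂ · sin²(Δλ/2) and d = 2R · arcsin(√a). The sketch below is a NumPy version that works on whole columns at once. The function name `haversine_km` and the radius constant are choices made for this example, not taken from the notebook.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; accepts scalars or arrays of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# First record of the dataset: restaurant and delivery coordinates
d_first = haversine_km(22.745049, 75.892471, 22.765049, 75.912471)
print(d_first)

# One degree of longitude along the equator, as a sanity check
d_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
print(d_equator)
```

Applied to the DataFrame, the same function could take the four coordinate columns directly, for example `haversine_km(df["Restaurant_latitude"], df["Restaurant_longitude"], df["Delivery_location_latitude"], df["Delivery_location_longitude"])`, which avoids a row-by-row `apply`.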

In [36]:
# Apply the function distance row by row with a lambda, rounding the result to the nearest whole kilometer
df['distance'] = df.apply(lambda row: round(distance(row['Restaurant_latitude'], row['Restaurant_longitude'], row['Delivery_location_latitude'], row['Delivery_location_longitude'])), axis=1)


In [37]:
# Checking the result
df.sort_values(by='distance',ascending=False).head()


Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,...,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken_min,distance
4075,0xc06c,LUDHRES19DEL01,36,5.0,-30.902872,75.826808,31.012872,75.936808,12-02-2022,18:10:00,...,Windy,Medium,1,Meal,motorcycle,0,No,Metropolitian,25,6885
2908,0xc05a,LUDHRES16DEL01,30,4.8,-30.895817,75.813112,31.005817,75.923112,18-02-2022,23:55:00,...,Stormy,Low,1,Meal,scooter,0,No,Metropolitian,19,6883
2636,0xc0f7,LUDHRES19DEL01,24,4.7,-30.902872,75.826808,30.972872,75.896808,12-02-2022,22:35:00,...,Cloudy,Low,1,Meal,motorcycle,1,No,Urban,20,6880
25871,0xc0e9,LUDHRES16DEL01,35,3.9,-30.895817,75.813112,30.965817,75.883112,14-02-2022,20:25:00,...,Sandstorms,Jam,1,Buffet,scooter,1,No,Metropolitian,32,6879
23157,0xc0e4,LUDHRES15DEL03,25,4.5,-30.899584,75.809346,30.959584,75.869346,15-02-2022,21:10:00,...,Cloudy,Jam,0,Snack,motorcycle,1,No,Urban,29,6878


    The largest distances (almost 7,000 km) reveal an error in the distance calculation, most likely caused by incorrect latitude and longitude coordinates. For example, Restaurant_latitude: -30.902872 with Restaurant_longitude: 75.826808 falls in the middle of the ocean, while the matching delivery location sits at latitude 31.012872, so the negative sign looks like a data-entry error. To address this, negative values should be replaced with their absolute values, and rows where both latitude and longitude are zero should be removed.
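A simple way to find such rows before fixing them is a bounding-box check. The sketch below assumes the deliveries take place in India (the valid coordinates and the city codes in `Delivery_person_ID` suggest this) and uses rough latitude and longitude limits chosen for this example. Both the limits and the helper name `flag_suspect_coords` are assumptions.

```python
import pandas as pd

# Rough bounding box for India; these limits are an assumption of this sketch
LAT_RANGE = (6.0, 38.0)
LON_RANGE = (68.0, 98.0)

def flag_suspect_coords(frame: pd.DataFrame, lat_col: str, lon_col: str) -> pd.Series:
    """Return a boolean Series that is True where a coordinate falls outside the box."""
    lat_ok = frame[lat_col].between(*LAT_RANGE)
    lon_ok = frame[lon_col].between(*LON_RANGE)
    return ~(lat_ok & lon_ok)

# Toy rows (hypothetical selection): one valid, one with a negative latitude, one all-zero
toy = pd.DataFrame({
    "Restaurant_latitude": [22.745049, -30.902872, 0.0],
    "Restaurant_longitude": [75.892471, 75.826808, 0.0],
})

suspect = flag_suspect_coords(toy, "Restaurant_latitude", "Restaurant_longitude")
print(toy[suspect])
```

Counting `suspect.sum()` on the full data would show how many rows are affected before deciding whether to repair or drop them.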

### Convert negative coordinates to positive

In [38]:
# Take the absolute value of the restaurant coordinates, then drop rows whose latitude is zero
df["Restaurant_latitude"] = abs(df["Restaurant_latitude"])
df["Restaurant_longitude"] = abs(df["Restaurant_longitude"])

df = df[df["Restaurant_latitude"] > 0 ]

In [39]:
df.sort_values(by='Delivery_location_longitude',ascending=True).head()


Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,...,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken_min,distance
6460,0x4e1d,SURRES14DEL01,23,4.5,21.157729,72.768726,21.167729,72.778726,24-03-2022,08:15:00,...,Sunny,Low,0,Meal,motorcycle,1,No,Metropolitian,14,2
9939,0x1b05,SURRES14DEL01,26,4.7,21.157729,72.768726,21.167729,72.778726,26-03-2022,08:15:00,...,Stormy,Low,2,Drinks,electric_scooter,0,No,Urban,10,2
38556,0x7754,SURRES14DEL01,31,4.7,21.157729,72.768726,21.167729,72.778726,11-03-2022,09:55:00,...,Cloudy,Low,1,Buffet,motorcycle,1,No,Metropolitian,18,2
34951,0x6b1d,SURRES14DEL01,33,5.0,21.157729,72.768726,21.167729,72.778726,07-03-2022,08:10:00,...,Sunny,Low,1,Drinks,scooter,0,No,Urban,13,2
863,0x7da4,SURRES14DEL01,28,4.6,21.157729,72.768726,21.167729,72.778726,17-03-2022,10:00:00,...,Stormy,Low,1,Meal,motorcycle,0,No,Metropolitian,11,2


In [40]:
# Reapply the lambda function to get the right distance values
df['distance'] = df.apply(lambda row: round(distance(row['Restaurant_latitude'], row['Restaurant_longitude'], row['Delivery_location_latitude'], row['Delivery_location_longitude'])), axis=1)


In [41]:
df.sort_values(by='distance',ascending=False).head()


Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,...,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken_min,distance
24845,0x1a96,JAPRES09DEL03,32,4.6,26.911378,75.789034,27.051378,75.929034,25-03-2022,20:25:00,...,Sandstorms,Jam,1,Snack,scooter,1,No,Metropolitian,25,21
19538,0xadf,JAPRES15DEL03,21,4.1,26.891191,75.802083,27.031191,75.942083,08-03-2022,20:40:00,...,Stormy,Jam,0,Meal,motorcycle,2,No,Metropolitian,35,21
18100,0x9102,JAPRES20DEL03,21,4.7,26.956431,75.776649,27.096431,75.916649,02-03-2022,17:45:00,...,Sandstorms,Medium,1,Drinks,motorcycle,1,No,Urban,15,21
35412,0xd488,AGRRES08DEL03,39,5.0,27.160934,78.044095,27.300934,78.184095,18-02-2022,21:25:00,...,Fog,Jam,2,Snack,scooter,2,No,Metropolitian,42,21
8378,0x4f46,JAPRES06DEL03,39,4.8,26.911927,75.797282,27.051927,75.937282,10-03-2022,23:25:00,...,Windy,Low,2,Meal,scooter,1,No,Metropolitian,26,21


In [46]:
df["City"].unique()

array(['Urban ', 'Metropolitian ', 'Semi-Urban '], dtype=object)

### 4) EDA - Exploring Data (Visualization)
Grouping restaurants and creating a map.

#### 4.1.1 Histogram & KDE

In [42]:
rest_localization = df.groupby(['Restaurant_latitude'])

In [1]:
import plotly.express as px

restaurant_map  = px.scatter_mapbox(df, 
                        lat="Restaurant_latitude",
                        lon="Restaurant_longitude", 
                        
                        size_max=15, 
                        zoom=12, 
                        mapbox_style="open-street-map")

# Update layout for larger figure size
restaurant_map.update_layout(
    title="Restaurant Location",
    autosize=False,
    width=1200,
    height=500,  
)
restaurant_map.update_layout(title_x=0.5)

restaurant_map.show()


# Machine Learning

In [58]:
# Basic imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Modelling
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
import warnings

### Selecting Important features for Machine Learning

In [27]:
# Keep only the columns that are useful for machine learning


df = df[['Delivery_person_Age',
       'Delivery_person_Ratings', 'Time_Order_picked', 'Weatherconditions', 'Road_traffic_density',
       'Type_of_vehicle', 'multiple_deliveries', 'Festival', 'City',
       'distance', 'Time_taken_min']]

In [41]:
# Numeric and categorical feature names

num_features = [
    "Delivery_person_Age",
    "Delivery_person_Ratings",
    "multiple_deliveries",
    "distance",
]

cat_features = [
    "Time_Order_picked",
    "Weatherconditions",
    "Road_traffic_density",
    "Type_of_vehicle",
    "Festival",
    "City",
]

# Remove leading and trailing spaces in the categorical columns.
# cat_features is left untouched so it can still be used as a list of names.
try:
    for feature in cat_features:
        df[feature] = df[feature].str.strip()
except Exception as e:
    print(e)



In [42]:
X = df.drop(columns=['Time_taken_min'],axis=1)
y = df['Time_taken_min']

In [56]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

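numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()

# Note: only the numeric features are passed to the ColumnTransformer below.
# oh_transformer is created but not used, so the categorical columns are dropped
# and X_encoded ends up with 4 columns.
preprocessor = ColumnTransformer(
    [
        ("StandardScaler", numeric_transformer, num_features),
    ]
)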


X_encoded = preprocessor.fit_transform(X)
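The cell above scales the numeric columns only and fits the scaler on the full dataset. A hedged sketch of a variant that also one-hot encodes categorical columns, and fits the preprocessing on the training split only, is shown below. It uses a small made-up frame (the `toy` values, the reduced column lists, and the step names `prep` and `reg` are assumptions of this example), so its scores say nothing about the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values) with a subset of the notebook's columns
toy = pd.DataFrame({
    "Delivery_person_Age": [37, 34, 23, 38, 32, 29, 25, 31],
    "distance": [3, 20, 2, 8, 6, 5, 12, 9],
    "Road_traffic_density": ["High", "Jam", "Low", "Medium", "High", "Low", "Jam", "Medium"],
    "City": ["Urban", "Metropolitian", "Urban", "Metropolitian",
             "Metropolitian", "Urban", "Urban", "Metropolitian"],
    "Time_taken_min": [24, 33, 26, 21, 30, 18, 35, 27],
})

num_cols = ["Delivery_person_Age", "distance"]
cat_cols = ["Road_traffic_density", "City"]

X = toy.drop(columns=["Time_taken_min"])
y = toy["Time_taken_min"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Scale numeric columns and one-hot encode categorical ones in a single transformer
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# The pipeline fits the preprocessing on the training split only
model = Pipeline([("prep", preprocessor), ("reg", LinearRegression())])
model.fit(X_train, y_train)

n_features = model.named_steps["prep"].transform(X_test).shape[1]
predictions = model.predict(X_test)
print(n_features, predictions)
```

Wrapping the transformer and the estimator in one `Pipeline` keeps the test rows out of the scaler's statistics, and `handle_unknown="ignore"` prevents an error when the test split contains a category that the training split did not.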

In [48]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_encoded,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape

((30451, 4), (7613, 4))

In [49]:
def evaluate_model(true, predicted):
    """Return MAE, RMSE and R2 for the given true and predicted values."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  # reuse the MSE instead of computing it twice
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
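To make the three metrics concrete, the sketch below computes them by hand with NumPy on three made-up values. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np

# Made-up targets and predictions, in minutes
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

errors = y_pred - y_true                      # [ 2., -2.,  3.]
mae = np.mean(np.abs(errors))                 # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))          # root mean squared error
ss_res = np.sum(errors ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                      # coefficient of determination

print(f"MAE={mae:.4f} RMSE={rmse:.4f} R2={r2:.4f}")
# prints: MAE=2.3333 RMSE=2.3805 R2=0.9150
```

MAE and RMSE are in the unit of the target (minutes here), while R2 is unitless and compares the model against always predicting the mean.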

In [52]:
# List of models that will be used
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(), 
    "AdaBoost Regressor": AdaBoostRegressor()
}
model_list = []
r2_list =[]

for name, model in models.items():
    model.fit(X_train, y_train) # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    
    # Evaluate Train and Test dataset
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)

    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    
    print(name)
    model_list.append(name)
    
    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')
    
    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))
    r2_list.append(model_test_r2)
    
    print('='*35)
    print('\n')

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 7.4565
- Mean Absolute Error: 5.8953
- R2 Score: 0.3606
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 7.4457
- Mean Absolute Error: 5.8952
- R2 Score: 0.3661


Lasso
Model performance for Training set
- Root Mean Squared Error: 7.6635
- Mean Absolute Error: 6.0999
- R2 Score: 0.3246
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 7.6536
- Mean Absolute Error: 6.0972
- R2 Score: 0.3302


Ridge
Model performance for Training set
- Root Mean Squared Error: 7.4565
- Mean Absolute Error: 5.8953
- R2 Score: 0.3606
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 7.4457
- Mean Absolute Error: 5.8952
- R2 Score: 0.3661


K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 6.4477
- Mean Absolute Error: 4.9644
- R2 Score: 0.5219
-----------------------
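The loop collects `model_list` and `r2_list`, which can be combined into a ranking for the "Choose best model" step. The sketch below is one possible way to do that. It is filled in with only the three test R2 scores printed in full above, so the other models are left out here, and the column names are choices made for this example.

```python
import pandas as pd

# Test-set R2 scores copied from the output above (only the models shown in full)
model_list = ["Linear Regression", "Lasso", "Ridge"]
r2_list = [0.3661, 0.3302, 0.3661]

# Build a table and sort it from best to worst test R2
ranking = (
    pd.DataFrame({"Model": model_list, "Test_R2": r2_list})
    .sort_values("Test_R2", ascending=False)
    .reset_index(drop=True)
)
print(ranking)

best_model_name = ranking.loc[0, "Model"]
print("Best by test R2:", best_model_name)
```

In the notebook itself the two lists already hold every model, so the same two statements can be run directly after the loop without redefining them.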

In [53]:
lin_model = LinearRegression(fit_intercept=True)
lin_model = lin_model.fit(X_train, y_train)
y_pred = lin_model.predict(X_test)
# r2_score is the coefficient of determination, not a classification accuracy
score = r2_score(y_test, y_pred)*100
print(" R2 score of the model is %.2f%%" %score)

 R2 score of the model is 36.61%
