## Food Delivery Estimator


#### Life Cycle of a Machine Learning Project

- Understanding the Problem Statement
- Data Collection
- Data Checks to perform
- Exploratory data analysis
- Data Pre-Processing
- Model Training
- Choose best model

### 1) Problem statement
- This project estimates how long a food delivery takes, based on characteristics of the order such as the courier, the distance, the weather, and the traffic.


### 2) Data Collection
- Dataset Source - https://www.kaggle.com/datasets/gauravmalik26/food-delivery-dataset

### 3) About Dataset

Food delivery is a courier service in which a restaurant, store, or independent food-delivery company delivers food to a customer. An order is typically made either through a restaurant or grocer's website or mobile app, or through a food ordering company. The delivered items can include entrees, sides, drinks, desserts, or grocery items and are typically delivered in boxes or bags. The delivery person will normally drive a car, but in bigger cities where homes and restaurants are closer together, they may use bikes or motorized scooters.

# EDA

In [3]:
import numpy as np
import pandas as pd
import math
import seaborn as sns
import haversine as hs
import matplotlib.pyplot as plt
import plotly.express as px
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

#### Import the CSV Data as Pandas DataFrame

In [4]:
df = pd.read_csv('../data/deliverytime.csv')


#### Show Top 5 Records to understand the data

In [152]:
df.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min)
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,19-03-2022,11:30:00,11:45:00,conditions Sunny,High,2,Snack,motorcycle,0,No,Urban,(min) 24
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,25-03-2022,19:45:00,19:50:00,conditions Stormy,Jam,2,Snack,scooter,1,No,Metropolitian,(min) 33
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,19-03-2022,08:30:00,08:45:00,conditions Sandstorms,Low,0,Drinks,motorcycle,1,No,Urban,(min) 26
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,05-04-2022,18:00:00,18:10:00,conditions Sunny,Medium,0,Buffet,motorcycle,1,No,Metropolitian,(min) 21
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,26-03-2022,13:30:00,13:45:00,conditions Cloudy,High,1,Snack,scooter,1,No,Metropolitian,(min) 30


In [5]:
# Dropping duplicates
df.drop_duplicates(inplace= True)

In [154]:
# Get Info about the dataset / features

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45593 entries, 0 to 45592
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   ID                           45593 non-null  object 
 1   Delivery_person_ID           45593 non-null  object 
 2   Delivery_person_Age          45593 non-null  object 
 3   Delivery_person_Ratings      45593 non-null  object 
 4   Restaurant_latitude          45593 non-null  float64
 5   Restaurant_longitude         45593 non-null  float64
 6   Delivery_location_latitude   45593 non-null  float64
 7   Delivery_location_longitude  45593 non-null  float64
 8   Order_Date                   45593 non-null  object 
 9   Time_Orderd                  45593 non-null  object 
 10  Time_Order_picked            45593 non-null  object 
 11  Weatherconditions            45593 non-null  object 
 12  Road_traffic_density         45593 non-null  object 
 13  Vehicle_conditio

In [155]:
# Check if there are any null values

df.isnull().sum()

ID                             0
Delivery_person_ID             0
Delivery_person_Age            0
Delivery_person_Ratings        0
Restaurant_latitude            0
Restaurant_longitude           0
Delivery_location_latitude     0
Delivery_location_longitude    0
Order_Date                     0
Time_Orderd                    0
Time_Order_picked              0
Weatherconditions              0
Road_traffic_density           0
Vehicle_condition              0
Type_of_order                  0
Type_of_vehicle                0
multiple_deliveries            0
Festival                       0
City                           0
Time_taken(min)                0
dtype: int64

    Although isnull() reports no missing values, several columns that should be numeric (for example Delivery_person_Age) have dtype "object". This suggests that missing values may be stored as text rather than as real NaN values. Let's investigate this further to ensure data integrity.

In [156]:
df.sort_values(by='Time_Orderd', ascending=False).head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken(min)
43053,0x1386,HYDRES14DEL02,,,17.426228,78.407495,17.506228,78.487495,31-03-2022,,22:25:00,conditions NaN,,2,Drinks,electric_scooter,0,No,Metropolitian,(min) 27
1427,0x41d,SURRES04DEL03,,,21.173493,72.801953,21.203493,72.831953,17-03-2022,,23:25:00,conditions NaN,,0,Snack,motorcycle,1,No,Metropolitian,(min) 28
7294,0xe5d,COIMBRES09DEL02,,,11.008638,76.984311,11.028638,77.004311,24-03-2022,,09:00:00,conditions Cloudy,Low,2,Meal,scooter,1,No,Metropolitian,(min) 16
22817,0x747,MYSRES11DEL03,,,12.323225,76.630028,12.353225,76.660028,01-04-2022,,18:35:00,conditions NaN,,0,Drinks,motorcycle,1,No,Metropolitian,(min) 29
22812,0xe62,CHENRES12DEL02,,,12.972793,80.249982,12.992793,80.269982,01-04-2022,,08:15:00,conditions Windy,Low,1,Drinks,scooter,1,No,Metropolitian,(min) 10


In [157]:
# Let's replace these 'NaN' strings with real null values.

df.replace({"NaN": np.nan}, regex=True, inplace = True)
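The reason `isnull()` missed these values is that a cell holding the text "NaN" is an ordinary string to pandas. The sketch below reproduces the effect on a tiny made-up frame; the `toy` data and its trailing space are illustrative assumptions, not rows taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): missing entries stored as text, not as real NaN
toy = pd.DataFrame({
    "Delivery_person_Age": ["37", "NaN ", "23"],
    "Weatherconditions": ["conditions Sunny", "conditions NaN", "conditions Cloudy"],
})

# Strings are never null, so nothing is reported as missing
missing_before = toy.isnull().sum()

# With regex=True, any cell whose text contains "NaN" is replaced entirely by np.nan
cleaned = toy.replace({"NaN": np.nan}, regex=True)
missing_after = cleaned.isnull().sum()

print(missing_before.to_dict())
print(missing_after.to_dict())
```

Because the pattern is matched anywhere in the cell, "conditions NaN" is treated as missing too, which matches the non-zero count for Weatherconditions in the next output.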

In [158]:
# As we can see, the NaN values are now detected.
df.isnull().sum()

ID                                0
Delivery_person_ID                0
Delivery_person_Age            1854
Delivery_person_Ratings        1908
Restaurant_latitude               0
Restaurant_longitude              0
Delivery_location_latitude        0
Delivery_location_longitude       0
Order_Date                        0
Time_Orderd                    1731
Time_Order_picked                 0
Weatherconditions               616
Road_traffic_density            601
Vehicle_condition                 0
Type_of_order                     0
Type_of_vehicle                   0
multiple_deliveries             993
Festival                        228
City                           1200
Time_taken(min)                   0
dtype: int64

In [159]:
# Now we can drop the rows with NaN values and check the remaining data
df.dropna(inplace=True)
df.shape

(41368, 20)
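It is worth quantifying how much data `dropna()` removed before moving on. A small worked calculation, using only the row counts printed above:

```python
rows_before = 45593  # entries reported by the first df.info()
rows_after = 41368   # rows reported by df.shape after dropna()

dropped = rows_before - rows_after
dropped_pct = 100 * dropped / rows_before

print(f"Dropped {dropped} rows ({dropped_pct:.2f}% of the data)")
```

Roughly 9% of the rows are lost, so imputing some columns instead of dropping them could be considered as an alternative.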

In [160]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 41368 entries, 0 to 45592
Data columns (total 20 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   ID                           41368 non-null  object 
 1   Delivery_person_ID           41368 non-null  object 
 2   Delivery_person_Age          41368 non-null  object 
 3   Delivery_person_Ratings      41368 non-null  object 
 4   Restaurant_latitude          41368 non-null  float64
 5   Restaurant_longitude         41368 non-null  float64
 6   Delivery_location_latitude   41368 non-null  float64
 7   Delivery_location_longitude  41368 non-null  float64
 8   Order_Date                   41368 non-null  object 
 9   Time_Orderd                  41368 non-null  object 
 10  Time_Order_picked            41368 non-null  object 
 11  Weatherconditions            41368 non-null  object 
 12  Road_traffic_density         41368 non-null  object 
 13  Vehicle_condition    

    Adjust the Time_taken(min) column (strip the "(min)" prefix and rename it) and convert columns from object to numeric types


In [161]:
df["Time_taken(min)"] = df["Time_taken(min)"].str.split().str[1]
df.rename(columns={"Time_taken(min)": "Time_taken_min"}, inplace=True)
df["Delivery_person_Age"] = pd.to_numeric(df["Delivery_person_Age"]).astype(int)
df["Time_taken_min"] = pd.to_numeric(df["Time_taken_min"]).astype(int)
df["Delivery_person_Ratings"] = pd.to_numeric(df["Delivery_person_Ratings"]).astype(float)
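Splitting on whitespace works as long as every value looks exactly like "(min) 24". A slightly more defensive variant is sketched below; it is an alternative, not the approach used in this notebook, and the sample values are made up to mimic the raw column.

```python
import pandas as pd

# Hypothetical raw values in the same "(min) NN" shape as the target column
raw = pd.Series(["(min) 24", "(min) 33", "(min) 26"])

# Pull out the digits explicitly instead of relying on the token position
minutes = raw.str.extract(r"(\d+)", expand=False)

# errors="coerce" turns anything unparsable into NaN instead of raising
minutes = pd.to_numeric(minutes, errors="coerce").astype("Int64")

print(minutes.tolist())
```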


In [162]:
df.head()

Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken_min
0,0x4607,INDORES13DEL02,37,4.9,22.745049,75.892471,22.765049,75.912471,19-03-2022,11:30:00,11:45:00,conditions Sunny,High,2,Snack,motorcycle,0,No,Urban,24
1,0xb379,BANGRES18DEL02,34,4.5,12.913041,77.683237,13.043041,77.813237,25-03-2022,19:45:00,19:50:00,conditions Stormy,Jam,2,Snack,scooter,1,No,Metropolitian,33
2,0x5d6d,BANGRES19DEL01,23,4.4,12.914264,77.6784,12.924264,77.6884,19-03-2022,08:30:00,08:45:00,conditions Sandstorms,Low,0,Drinks,motorcycle,1,No,Urban,26
3,0x7a6a,COIMBRES13DEL02,38,4.7,11.003669,76.976494,11.053669,77.026494,05-04-2022,18:00:00,18:10:00,conditions Sunny,Medium,0,Buffet,motorcycle,1,No,Metropolitian,21
4,0x70a2,CHENRES12DEL01,32,4.6,12.972793,80.249982,13.012793,80.289982,26-03-2022,13:30:00,13:45:00,conditions Cloudy,High,1,Snack,scooter,1,No,Metropolitian,30


In [163]:
# Checking if there are any duplicated rows

df.duplicated().sum()

0

### Calculate the distance using haversine - Create column distance

In [164]:
# Calculates the distance between a restaurant and a delivery location from their
# latitude and longitude. It is applied row by row with a lambda function to create
# the new column "distance". Returns a float: the distance in kilometers (haversine library).
def distance(rest_lat, rest_lon, dest_lat, dest_lon):
    rest = (rest_lat, rest_lon)
    dest = (dest_lat, dest_lon)

    return hs.haversine(rest, dest)

In [165]:
# using lambda to apply the function distance, also rounding the distance
df['distance'] = df.apply(lambda row: round(distance(row['Restaurant_latitude'], row['Restaurant_longitude'], row['Delivery_location_latitude'], row['Delivery_location_longitude'])), axis=1)
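For reference, the haversine formula itself is short enough to write with NumPy, which also lets it run on whole columns at once instead of row by row. This is a sketch under stated assumptions: `haversine_km` and `EARTH_RADIUS_KM` are names introduced here, and the coordinates are those of the first two rows shown by `df.head()`.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumed constant for this sketch

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers; accepts scalars or NumPy arrays in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Restaurant and delivery coordinates of the first two rows of df.head()
rest_lat = np.array([22.745049, 12.913041])
rest_lon = np.array([75.892471, 77.683237])
dest_lat = np.array([22.765049, 13.043041])
dest_lon = np.array([75.912471, 77.813237])

km = haversine_km(rest_lat, rest_lon, dest_lat, dest_lon)
print(np.round(km))
```

On the full DataFrame, the same function could be called once with the four coordinate columns converted via `.to_numpy()`, which avoids the row-by-row `apply`.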


In [166]:
# Checking the result
df.sort_values(by='distance',ascending=False).head()


Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,...,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken_min,distance
4075,0xc06c,LUDHRES19DEL01,36,5.0,-30.902872,75.826808,31.012872,75.936808,12-02-2022,18:10:00,...,conditions Windy,Medium,1,Meal,motorcycle,0,No,Metropolitian,25,6885
2908,0xc05a,LUDHRES16DEL01,30,4.8,-30.895817,75.813112,31.005817,75.923112,18-02-2022,23:55:00,...,conditions Stormy,Low,1,Meal,scooter,0,No,Metropolitian,19,6883
2636,0xc0f7,LUDHRES19DEL01,24,4.7,-30.902872,75.826808,30.972872,75.896808,12-02-2022,22:35:00,...,conditions Cloudy,Low,1,Meal,motorcycle,1,No,Urban,20,6880
25871,0xc0e9,LUDHRES16DEL01,35,3.9,-30.895817,75.813112,30.965817,75.883112,14-02-2022,20:25:00,...,conditions Sandstorms,Jam,1,Buffet,scooter,1,No,Metropolitian,32,6879
23157,0xc0e4,LUDHRES15DEL03,25,4.5,-30.899584,75.809346,30.959584,75.869346,15-02-2022,21:10:00,...,conditions Cloudy,Jam,0,Snack,motorcycle,1,No,Urban,29,6878


    The result reveals an inaccuracy in the distance calculation: distances of almost 6,900 km are impossible for a food delivery. The cause is incorrect latitude and longitude coordinates. For example, Restaurant_latitude -30.902872 with Restaurant_longitude 75.826808 lies in the middle of the ocean, while the matching delivery latitude is 31.012872, so the sign of the restaurant latitude is evidently wrong. To address this, negative restaurant coordinates are replaced with their absolute values, and rows where the restaurant latitude is zero are removed.

### Convert negative coordinates to positive

In [167]:
# Convert negative restaurant coordinates to positive and drop rows with a zero latitude
df["Restaurant_latitude"] = abs(df["Restaurant_latitude"])
df["Restaurant_longitude"] = abs(df["Restaurant_longitude"])

df = df[df["Restaurant_latitude"] > 0 ]
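A quick way to confirm that the fix worked is to test whether coordinates fall inside a plausible region. The sketch below uses a rough bounding box for India; the box limits, the `inside_box` helper, and the (0, 0) sample row are assumptions made for illustration.

```python
import pandas as pd

# Rough bounding box for India; these limits are assumptions made for this sketch
LAT_MIN, LAT_MAX = 6.0, 38.0
LON_MIN, LON_MAX = 68.0, 98.0

def inside_box(frame):
    """Return a boolean Series: True where the restaurant coordinates fall inside the box."""
    return (
        frame["Restaurant_latitude"].between(LAT_MIN, LAT_MAX)
        & frame["Restaurant_longitude"].between(LON_MIN, LON_MAX)
    )

# One valid pair, one sign-flipped latitude from the output above, one zero pair
coords = pd.DataFrame({
    "Restaurant_latitude": [22.745049, -30.902872, 0.0],
    "Restaurant_longitude": [75.892471, 75.826808, 0.0],
})

before = inside_box(coords)        # the sign-flipped and zero rows fail
after = inside_box(coords.abs())   # only the zero row still fails
print(before.tolist(), after.tolist())
```

Rows that still fail after taking absolute values are the ones that need to be dropped.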

In [168]:
df.sort_values(by='Delivery_location_longitude',ascending=True).head()


Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,...,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken_min,distance
6460,0x4e1d,SURRES14DEL01,23,4.5,21.157729,72.768726,21.167729,72.778726,24-03-2022,08:15:00,...,conditions Sunny,Low,0,Meal,motorcycle,1,No,Metropolitian,14,2
9939,0x1b05,SURRES14DEL01,26,4.7,21.157729,72.768726,21.167729,72.778726,26-03-2022,08:15:00,...,conditions Stormy,Low,2,Drinks,electric_scooter,0,No,Urban,10,2
38556,0x7754,SURRES14DEL01,31,4.7,21.157729,72.768726,21.167729,72.778726,11-03-2022,09:55:00,...,conditions Cloudy,Low,1,Buffet,motorcycle,1,No,Metropolitian,18,2
34951,0x6b1d,SURRES14DEL01,33,5.0,21.157729,72.768726,21.167729,72.778726,07-03-2022,08:10:00,...,conditions Sunny,Low,1,Drinks,scooter,0,No,Urban,13,2
863,0x7da4,SURRES14DEL01,28,4.6,21.157729,72.768726,21.167729,72.778726,17-03-2022,10:00:00,...,conditions Stormy,Low,1,Meal,motorcycle,0,No,Metropolitian,11,2


In [169]:
# Reapply the lambda function to get the right distance values
df['distance'] = df.apply(lambda row: round(distance(row['Restaurant_latitude'], row['Restaurant_longitude'], row['Delivery_location_latitude'], row['Delivery_location_longitude'])), axis=1)


In [170]:
df.sort_values(by='distance',ascending=False).head()


Unnamed: 0,ID,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Restaurant_latitude,Restaurant_longitude,Delivery_location_latitude,Delivery_location_longitude,Order_Date,Time_Orderd,...,Weatherconditions,Road_traffic_density,Vehicle_condition,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,Time_taken_min,distance
24845,0x1a96,JAPRES09DEL03,32,4.6,26.911378,75.789034,27.051378,75.929034,25-03-2022,20:25:00,...,conditions Sandstorms,Jam,1,Snack,scooter,1,No,Metropolitian,25,21
19538,0xadf,JAPRES15DEL03,21,4.1,26.891191,75.802083,27.031191,75.942083,08-03-2022,20:40:00,...,conditions Stormy,Jam,0,Meal,motorcycle,2,No,Metropolitian,35,21
18100,0x9102,JAPRES20DEL03,21,4.7,26.956431,75.776649,27.096431,75.916649,02-03-2022,17:45:00,...,conditions Sandstorms,Medium,1,Drinks,motorcycle,1,No,Urban,15,21
35412,0xd488,AGRRES08DEL03,39,5.0,27.160934,78.044095,27.300934,78.184095,18-02-2022,21:25:00,...,conditions Fog,Jam,2,Snack,scooter,2,No,Metropolitian,42,21
8378,0x4f46,JAPRES06DEL03,39,4.8,26.911927,75.797282,27.051927,75.937282,10-03-2022,23:25:00,...,conditions Windy,Low,2,Meal,scooter,1,No,Metropolitian,26,21


### 4. EDA - Exploring Data (Visualization)
Plotting the restaurant locations on a map.

#### 4.1.1 Restaurant map

In [171]:
rest_localization = df.groupby(['Restaurant_latitude'])

In [172]:
import plotly.express as px

restaurant_map = px.scatter_mapbox(df,
                        lat="Restaurant_latitude",
                        lon="Restaurant_longitude",  # fixed: the column name has no trailing "s"
                        size_max=15,
                        zoom=4,  # country-level zoom, since the restaurants span many cities
                        mapbox_style="open-street-map")

# Update layout for larger figure size
restaurant_map.update_layout(
    title="Restaurant Locations",
    autosize=False,
    width=800,
    height=500,
)
restaurant_map.update_layout(title_x=0.5)

restaurant_map.show()

# Machine Learning

In [173]:
# Basic Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Modelling
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
import warnings

# Selecting Important features for Machine Learning

In [174]:
# First, let's drop the columns we know are unnecessary for machine learning


df = df[['Delivery_person_ID', 'Delivery_person_Age',
       'Delivery_person_Ratings', 'Order_Date', 'Time_Orderd',
       'Time_Order_picked', 'Weatherconditions', 'Road_traffic_density',
       'Type_of_order', 'Type_of_vehicle',
       'multiple_deliveries', 'Festival', 'City',
       'distance', 'Time_taken_min']]

### From these columns, let's check the correlation with Time_taken_min to decide which ones should remain for machine learning



In [175]:

# encoding Categorical Columns

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

cat_features = df.select_dtypes(include="object").columns   # Selecting only categorical features


for column in cat_features:
    df[column] = encoder.fit_transform(df[column])



df


Unnamed: 0,Delivery_person_ID,Delivery_person_Age,Delivery_person_Ratings,Order_Date,Time_Orderd,Time_Order_picked,Weatherconditions,Road_traffic_density,Type_of_order,Type_of_vehicle,multiple_deliveries,Festival,City,distance,Time_taken_min
0,544,37,4.9,32,38,46,4,0,3,1,0,0,2,3,24
1,187,34,4.5,37,129,143,3,1,3,2,1,0,0,20,33
2,189,23,4.4,32,5,10,2,2,1,1,1,0,2,2,26
3,334,38,4.7,9,110,123,4,3,0,1,1,0,0,8,21
4,270,32,4.6,38,60,70,0,0,3,2,1,0,0,6,30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45587,1065,35,4.2,13,151,168,5,1,1,1,1,0,0,17,33
45588,579,30,4.8,36,39,46,5,0,2,1,0,0,0,1,32
45590,263,30,4.9,17,174,1,0,2,1,2,0,0,0,5,16
45591,327,20,4.7,12,61,69,0,0,3,1,1,0,0,6,26
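One caveat about the loop above: it reuses a single `LabelEncoder`, and each `fit_transform` call refits it, so afterwards only the classes of the last column are remembered. If the original labels need to be recovered later, one option is to keep one encoder per column. A minimal sketch on made-up rows (the `toy` frame and the `encoders` dict are illustrative names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical rows) with two categorical columns
toy = pd.DataFrame({
    "Road_traffic_density": ["High", "Jam", "Low", "Medium"],
    "City": ["Urban", "Metropolitian", "Urban", "Metropolitian"],
})

encoders = {}  # one fitted encoder per column
for column in toy.columns:
    encoders[column] = LabelEncoder()
    toy[column] = encoders[column].fit_transform(toy[column])

# Codes follow the sorted order of the labels, so they can be mapped back
traffic_labels = encoders["Road_traffic_density"].inverse_transform(toy["Road_traffic_density"])
print(list(traffic_labels))
```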


In [176]:
# Correlation
import seaborn as sns
# Correlation between the features and Time_taken_min
plt.figure(figsize=(12,10))
corr = df.corr()

sns.heatmap(corr, annot=True, cmap=plt.cm.CMRmap_r)

plt.show()

In [177]:
# Based on this correlation map, we will keep the features with a strong or moderate (positive or negative) correlation.
df = df[['Delivery_person_Age',
       'Delivery_person_Ratings', 'Time_Order_picked', 'Weatherconditions', 'Road_traffic_density',
       'Type_of_vehicle', 'multiple_deliveries', 'Festival', 'City',
       'distance', 'Time_taken_min']]
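Reading a heatmap by eye can be complemented by ranking the features numerically. The sketch below does this on synthetic stand-in data (not the real dataset) in which the target is constructed to depend mostly on `distance`. Keep in mind that for label-encoded nominal columns the integer codes are arbitrary, so their linear correlation with the target should be interpreted with care.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data: the target depends strongly on `distance`,
# weakly on `Delivery_person_Age`, and not at all on `noise`
demo = pd.DataFrame({
    "distance": rng.uniform(1, 20, n),
    "Delivery_person_Age": rng.integers(20, 40, n),
    "noise": rng.normal(size=n),
})
demo["Time_taken_min"] = (
    2.0 * demo["distance"]
    + 0.2 * demo["Delivery_person_Age"]
    + rng.normal(scale=3, size=n)
)

# Absolute correlation of every feature with the target, strongest first
ranking = (
    demo.corr()["Time_taken_min"]
    .drop("Time_taken_min")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```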


In [178]:
df

Unnamed: 0,Delivery_person_Age,Delivery_person_Ratings,Time_Order_picked,Weatherconditions,Road_traffic_density,Type_of_vehicle,multiple_deliveries,Festival,City,distance,Time_taken_min
0,37,4.9,46,4,0,1,0,0,2,3,24
1,34,4.5,143,3,1,2,1,0,0,20,33
2,23,4.4,10,2,2,1,1,0,2,2,26
3,38,4.7,123,4,3,1,1,0,0,8,21
4,32,4.6,70,0,0,2,1,0,0,6,30
...,...,...,...,...,...,...,...,...,...,...,...
45587,35,4.2,168,5,1,1,1,0,0,17,33
45588,30,4.8,46,5,0,1,0,0,0,1,32
45590,30,4.9,1,0,2,2,0,0,0,5,16
45591,20,4.7,69,0,0,1,1,0,0,6,26


In [179]:
df['Time_Order_picked'].unique()

array([ 46, 143,  10, 123,  70, 163, 139, 115, 158, 171,  86, 117,  19,
       146, 152,  87, 153, 150, 178,   7, 142,  55, 131, 187, 164, 190,
       179, 177, 188,  69, 135,  78,  39,  24,  12, 183, 138,  37, 166,
       136, 108,  45,  88, 180, 156, 182, 121, 181, 176, 169, 192,  93,
        31, 157, 148,  52,  94,   1,  58, 160, 125, 126,  47,  59,  36,
       165,  29, 120, 174, 191,  60,   2,  33,  13, 149, 154,  35, 145,
       184, 128, 173, 109,   3, 161,  80,  30,  14,   9, 155, 189, 167,
        85, 113, 114, 147, 116,   0, 133, 134,  65, 122, 137,  26, 162,
        15, 168, 141,  34,  23,  95, 127, 124, 100,  40, 159,  91, 175,
        89, 185,  42,  72, 130, 170,  48, 132, 129,  22, 112,  50, 140,
         6,  38,  98,  49,  32,  57,  54, 186, 119,   4,  21, 151,  11,
        82, 118,  25,   8, 144,  44,  41,  79,   5,  51, 110,  27,  43,
       102, 107,  68,  16,  96, 172,  28,  61,  92,  18,  63,  62,  20,
       101, 103, 106,  17,  66,  76, 104, 105,  74,  67,  81,  5
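Label encoding turned each distinct pickup time into an integer code, as the array above shows. An alternative that keeps the actual time scale is to convert the time strings into minutes since midnight. This is only a sketch of that alternative, assuming the values are clean "HH:MM:SS" strings like those shown in `df.head()`; it is not what the notebook does.

```python
import pandas as pd

# Pickup times in the "HH:MM:SS" format shown in df.head()
picked = pd.Series(["11:45:00", "19:50:00", "08:45:00"])

# Parse as durations from midnight, then express them in minutes
minutes_since_midnight = pd.to_timedelta(picked).dt.total_seconds() / 60

print(minutes_since_midnight.tolist())
```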

In [180]:
X = df.drop(columns=['Time_taken_min'],axis=1)
y = df['Time_taken_min']

In [181]:
X

Unnamed: 0,Delivery_person_Age,Delivery_person_Ratings,Time_Order_picked,Weatherconditions,Road_traffic_density,Type_of_vehicle,multiple_deliveries,Festival,City,distance
0,37,4.9,46,4,0,1,0,0,2,3
1,34,4.5,143,3,1,2,1,0,0,20
2,23,4.4,10,2,2,1,1,0,2,2
3,38,4.7,123,4,3,1,1,0,0,8
4,32,4.6,70,0,0,2,1,0,0,6
...,...,...,...,...,...,...,...,...,...,...
45587,35,4.2,168,5,1,1,1,0,0,17
45588,30,4.8,46,5,0,1,0,0,0,1
45590,30,4.9,1,0,2,2,0,0,0,5
45591,20,4.7,69,0,0,1,1,0,0,6


In [182]:
# Create a ColumnTransformer that scales the numeric features and
# one-hot encodes the categorical features
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer

# Column selectors must be flat lists of names; the nested lists used
# originally raised "TypeError: unhashable type: 'list'"
num_features = [
    "Delivery_person_Age",
    "Delivery_person_Ratings",
    "multiple_deliveries",
    "distance",
]
cat_features = [
    "Weatherconditions",
    "Road_traffic_density",
    "Type_of_vehicle",
    "Festival",
    "City",
]

# No .str.strip() is needed here: the categorical columns were already
# label-encoded to integers in an earlier cell

numeric_transformer = StandardScaler()
oh_transformer = OneHotEncoder()

preprocessor = ColumnTransformer(
    [
        ("OneHotEncoder", oh_transformer, cat_features),
        ("StandardScaler", numeric_transformer, num_features),
    ]
)

# Columns not listed above (Time_Order_picked) are dropped by default
encoded = preprocessor.fit_transform(X)

In [None]:
X

array([], shape=(38064, 0), dtype=float64)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
X_train.shape, X_test.shape

((30451, 10), (7613, 10))
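When scaling and encoding are used for modelling, it is safer to fit them on the training split only, so that no information from the test split leaks into the transformation. A scikit-learn `Pipeline` handles this automatically. The sketch below runs on synthetic stand-in data (`X_demo`, `y_demo` and their values are hypothetical) and is meant to show the wiring, not to reproduce the results of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(42)
n = 200

# Synthetic stand-in for X and y (hypothetical values, column names as above)
X_demo = pd.DataFrame({
    "distance": rng.uniform(1, 20, n),
    "Delivery_person_Age": rng.integers(20, 40, n),
    "Road_traffic_density": rng.choice(["Low", "Medium", "High", "Jam"], n),
})
y_demo = 10 + 1.5 * X_demo["distance"] + rng.normal(scale=2, size=n)

preprocessor = ColumnTransformer([
    ("scale", StandardScaler(), ["distance", "Delivery_person_Age"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Road_traffic_density"]),
])

# The pipeline fits the preprocessing on the training split only
model = Pipeline([("prep", preprocessor), ("reg", LinearRegression())])

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print(f"Test R2 on synthetic data: {test_r2:.3f}")
```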

In [None]:
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
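To make the three metrics concrete, here is a small worked example with made-up delivery times. The function is repeated so the snippet runs on its own; the numbers in `y_true` and `y_pred` are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    rmse = np.sqrt(mean_squared_error(true, predicted))
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square

# Hypothetical delivery times in minutes
y_true = np.array([10, 20, 30])
y_pred = np.array([12, 18, 33])

mae, rmse, r2 = evaluate_model(y_true, y_pred)
# Errors are 2, -2, 3: MAE = 7/3, RMSE = sqrt(17/3), R2 = 1 - 17/200
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

MAE and RMSE are in minutes, so they can be read directly as the typical size of the prediction error, while R2 is unitless.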

In [None]:
# List of models that will be used
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
    "XGBRegressor": XGBRegressor(),
    "AdaBoost Regressor": AdaBoostRegressor()
}
model_list = []
r2_list = []

for name, model in models.items():
    model.fit(X_train, y_train)  # Train model

    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    # Evaluate Train and Test dataset
    model_train_mae, model_train_rmse, model_train_r2 = evaluate_model(y_train, y_train_pred)
    model_test_mae, model_test_rmse, model_test_r2 = evaluate_model(y_test, y_test_pred)

    print(name)
    model_list.append(name)

    print('Model performance for Training set')
    print("- Root Mean Squared Error: {:.4f}".format(model_train_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_train_mae))
    print("- R2 Score: {:.4f}".format(model_train_r2))

    print('----------------------------------')

    print('Model performance for Test set')
    print("- Root Mean Squared Error: {:.4f}".format(model_test_rmse))
    print("- Mean Absolute Error: {:.4f}".format(model_test_mae))
    print("- R2 Score: {:.4f}".format(model_test_r2))
    r2_list.append(model_test_r2)

    print('='*35)
    print('\n')

Linear Regression
Model performance for Training set
- Root Mean Squared Error: 6.7944
- Mean Absolute Error: 5.4078
- R2 Score: 0.4691
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.7371
- Mean Absolute Error: 5.3680
- R2 Score: 0.4810


Lasso
Model performance for Training set
- Root Mean Squared Error: 7.2576
- Mean Absolute Error: 5.8065
- R2 Score: 0.3943
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 7.2152
- Mean Absolute Error: 5.7740
- R2 Score: 0.4047


Ridge
Model performance for Training set
- Root Mean Squared Error: 6.7944
- Mean Absolute Error: 5.4078
- R2 Score: 0.4691
----------------------------------
Model performance for Test set
- Root Mean Squared Error: 6.7371
- Mean Absolute Error: 5.3680
- R2 Score: 0.4810


K-Neighbors Regressor
Model performance for Training set
- Root Mean Squared Error: 4.1055
- Mean Absolute Error: 3.1827
- R2 Score: 0.8062
-----------------------
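The loop collects `model_list` and `r2_list`, which can be combined into a small table to support the "Choose best model" step. The sketch below fills the two lists by hand with the test-set R2 values printed above; models whose test scores are not shown in the output are left out rather than guessed.

```python
import pandas as pd

# Test-set R2 scores copied from the printed output above
# (models whose test scores are not shown are omitted)
model_list = ["Linear Regression", "Lasso", "Ridge"]
r2_list = [0.4810, 0.4047, 0.4810]

results = (
    pd.DataFrame({"Model Name": model_list, "R2_Score": r2_list})
    .sort_values(by="R2_Score", ascending=False)
    .reset_index(drop=True)
)
print(results)
```

In the notebook itself the same two lines of pandas can be applied directly to the lists filled by the loop.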

In [None]:
lin_model = LinearRegression(fit_intercept=True)
lin_model = lin_model.fit(X_train, y_train)
y_pred = lin_model.predict(X_test)
# r2_score is the coefficient of determination, not a classification accuracy
score = r2_score(y_test, y_pred)*100
print(" R2 score of the model (as a percentage) is %.2f" %score)

 R2 score of the model (as a percentage) is 48.10
