# Feature Engineering Project

## 1. Defining the research question:



### Background:

Sendy is a business-to-business platform, established in 2014, that enables businesses of all
types and sizes to transport goods more efficiently across East Africa. The company is
headquartered in Kenya with a team of more than 100 staff, focused on building practical
solutions for Africa’s dynamic transportation needs, from developing apps and web
solutions to providing dedicated support for goods on the move.


### Problem Statement:

Sendy has hired you to help predict the estimated time of delivery of orders, from the
point of driver pickup to the point of arrival at the final destination. Build a model that
predicts an accurate delivery time, from picking up a package to arriving at the final
destination. An accurate arrival time prediction will help all businesses improve their
logistics and communicate accurate delivery times to their customers. You will be
required to perform various feature engineering techniques while preparing your data for
further analysis.

## 2. Import the required libraries

In [97]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR 
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics   
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis



## 3. Data Cleaning and preparation

In [98]:
# Load data

sendy = pd.read_csv('https://bit.ly/3deaKEM')
sendy.head()

|  | Order No | User Id | Vehicle Type | Platform Type | Personal or Business | Placement - Day of Month | Placement - Weekday (Mo = 1) | Placement - Time | Confirmation - Day of Month | Confirmation - Weekday (Mo = 1) | ... | Arrival at Destination - Time | Distance (KM) | Temperature | Precipitation in millimeters | Pickup Lat | Pickup Long | Destination Lat | Destination Long | Rider Id | Time from Pickup to Arrival |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Order_No_4211 | User_Id_633 | Bike | 3 | Business | 9 | 5 | 9:35:46 AM | 9 | 5 | ... | 10:39:55 AM | 4 | 20.4 | NaN | -1.317755 | 36.83037 | -1.300406 | 36.829741 | Rider_Id_432 | 745 |
| 1 | Order_No_25375 | User_Id_2285 | Bike | 3 | Personal | 12 | 5 | 11:16:16 AM | 12 | 5 | ... | 12:17:22 PM | 16 | 26.4 | NaN | -1.351453 | 36.899315 | -1.295004 | 36.814358 | Rider_Id_856 | 1993 |
| 2 | Order_No_1899 | User_Id_265 | Bike | 3 | Business | 30 | 2 | 12:39:25 PM | 30 | 2 | ... | 1:00:38 PM | 3 | NaN | NaN | -1.308284 | 36.843419 | -1.300921 | 36.828195 | Rider_Id_155 | 455 |
| 3 | Order_No_9336 | User_Id_1402 | Bike | 3 | Business | 15 | 5 | 9:25:34 AM | 15 | 5 | ... | 10:05:27 AM | 9 | 19.2 | NaN | -1.281301 | 36.832396 | -1.257147 | 36.795063 | Rider_Id_855 | 1341 |
| 4 | Order_No_27883 | User_Id_1737 | Bike | 1 | Personal | 13 | 1 | 9:55:18 AM | 13 | 1 | ... | 10:25:37 AM | 9 | 15.4 | NaN | -1.266597 | 36.792118 | -1.295041 | 36.809817 | Rider_Id_770 | 1214 |


In [99]:
# Check shape of the data

sendy.shape

(21201, 29)

In [100]:
# Describe the dataset

sendy.describe()

|  | Platform Type | Placement - Day of Month | Placement - Weekday (Mo = 1) | Confirmation - Day of Month | Confirmation - Weekday (Mo = 1) | Arrival at Pickup - Day of Month | Arrival at Pickup - Weekday (Mo = 1) | Pickup - Day of Month | Pickup - Weekday (Mo = 1) | Arrival at Destination - Day of Month | Arrival at Destination - Weekday (Mo = 1) | Distance (KM) | Temperature | Precipitation in millimeters | Pickup Lat | Pickup Long | Destination Lat | Destination Long | Time from Pickup to Arrival |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 21201.0 | 21201.0 | 21201.0 | 21201.0 | 21201.0 | 21201.0 | 21201.0 | 21201.0 | 21201.0 | 21201.0 | 21201.0 | 21201.0 | 16835.0 | 552.0 | 21201.0 | 21201.0 | 21201.0 | 21201.0 | 21201.0 |
| mean | 2.752182 | 15.653696 | 3.240083 | 15.653837 | 3.240225 | 15.653837 | 3.240225 | 15.653837 | 3.240225 | 15.653837 | 3.240225 | 9.506533 | 23.258889 | 7.905797 | -1.28147 | 36.811264 | -1.282581 | 36.81122 | 1556.920947 |
| std | 0.625178 | 8.798916 | 1.567295 | 8.798886 | 1.567228 | 8.798886 | 1.567228 | 8.798886 | 1.567228 | 8.798886 | 1.567228 | 5.668963 | 3.615768 | 17.089971 | 0.030507 | 0.037473 | 0.034824 | 0.044721 | 987.270788 |
| min | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 11.2 | 0.1 | -1.438302 | 36.653621 | -1.430298 | 36.606594 | 1.0 |
| 25% | 3.0 | 8.0 | 2.0 | 8.0 | 2.0 | 8.0 | 2.0 | 8.0 | 2.0 | 8.0 | 2.0 | 5.0 | 20.6 | 1.075 | -1.300921 | 36.784605 | -1.301201 | 36.785661 | 882.0 |
| 50% | 3.0 | 15.0 | 3.0 | 15.0 | 3.0 | 15.0 | 3.0 | 15.0 | 3.0 | 15.0 | 3.0 | 8.0 | 23.5 | 2.9 | -1.279395 | 36.80704 | -1.284382 | 36.808002 | 1369.0 |
| 75% | 3.0 | 23.0 | 5.0 | 23.0 | 5.0 | 23.0 | 5.0 | 23.0 | 5.0 | 23.0 | 5.0 | 13.0 | 26.0 | 4.9 | -1.257147 | 36.829741 | -1.261177 | 36.829477 | 2040.0 |
| max | 4.0 | 31.0 | 7.0 | 31.0 | 7.0 | 31.0 | 7.0 | 31.0 | 7.0 | 31.0 | 7.0 | 49.0 | 32.1 | 99.1 | -1.14717 | 36.991046 | -1.030225 | 37.016779 | 7883.0 |


In [101]:
# Check for missing values

sendy.isnull().sum()

Order No                                         0
User Id                                          0
Vehicle Type                                     0
Platform Type                                    0
Personal or Business                             0
Placement - Day of Month                         0
Placement - Weekday (Mo = 1)                     0
Placement - Time                                 0
Confirmation - Day of Month                      0
Confirmation - Weekday (Mo = 1)                  0
Confirmation - Time                              0
Arrival at Pickup - Day of Month                 0
Arrival at Pickup - Weekday (Mo = 1)             0
Arrival at Pickup - Time                         0
Pickup - Day of Month                            0
Pickup - Weekday (Mo = 1)                        0
Pickup - Time                                    0
Arrival at Destination - Day of Month            0
Arrival at Destination - Weekday (Mo = 1)        0
Arrival at Destination - Time  

In [102]:
# Drop the 'Precipitation in millimeters' column, which is missing most of its values
# Replace missing values in the Temperature column with the mean temperature

sendy = sendy.drop('Precipitation in millimeters', axis=1)
# Assign the result back; inplace=True on a column selection is unreliable in recent pandas
sendy["Temperature"] = sendy["Temperature"].fillna(sendy["Temperature"].mean())

# Recheck for missing values

sendy.isnull().sum()

Order No                                     0
User Id                                      0
Vehicle Type                                 0
Platform Type                                0
Personal or Business                         0
Placement - Day of Month                     0
Placement - Weekday (Mo = 1)                 0
Placement - Time                             0
Confirmation - Day of Month                  0
Confirmation - Weekday (Mo = 1)              0
Confirmation - Time                          0
Arrival at Pickup - Day of Month             0
Arrival at Pickup - Weekday (Mo = 1)         0
Arrival at Pickup - Time                     0
Pickup - Day of Month                        0
Pickup - Weekday (Mo = 1)                    0
Pickup - Time                                0
Arrival at Destination - Day of Month        0
Arrival at Destination - Weekday (Mo = 1)    0
Arrival at Destination - Time                0
Distance (KM)                                0
Temperature  
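
The mean-imputation step can be sketched on a toy frame. The values below are made up for illustration and are not taken from the Sendy data; only the column name is borrowed from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Sendy data (illustrative values only)
toy = pd.DataFrame({"Temperature": [20.0, np.nan, 26.0, np.nan, 23.0]})

# Series.mean() skips NaN by default, so this is the mean of 20, 26 and 23
temp_mean = toy["Temperature"].mean()

# Assign the filled column back rather than relying on inplace=True
toy["Temperature"] = toy["Temperature"].fillna(temp_mean)

missing_after = int(toy["Temperature"].isnull().sum())
print(temp_mean, missing_after)
```

Note that filling with the mean keeps the column average unchanged but shrinks its variance, which may matter for scale-sensitive models.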

In [103]:
# Check for duplicates

sendy.duplicated().sum()

0

In [104]:
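# Find and drop outliers using the interquartile range (IQR)
# Quantiles are only defined for numeric columns, so the check is restricted to those

numeric = sendy.select_dtypes(include='number')
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1

# A row is dropped if any numeric value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outlier_rows = ((numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))).any(axis=1)
sendy_clean = sendy[~outlier_rows]

sendy_clean.shape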

  


(14730, 28)
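
To make the IQR rule concrete, here is a minimal single-column sketch. The helper name `iqr_bounds` and the sample distances are hypothetical; the fences follow the same `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR` rule used in the cell above.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative distances in km; 100 is an obvious outlier
distances = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100], dtype=float)

lower, upper = iqr_bounds(distances)
kept = distances[(distances >= lower) & (distances <= upper)]
print(lower, upper, kept)
```

A column whose first and third quartiles coincide has an IQR of zero, so every value different from that quartile is flagged. In this dataset that applies to `Platform Type`, whose 25% and 75% quantiles are both 3.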

## 4. Feature preparation and splitting the data

In [105]:
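# Define features to use
# Caution: 'Time from Pickup to Arrival' is also the target, so listing it among the
# features leaks the answer into the model inputs; the RMSE values reported in the
# following cells should be read with that in mind.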

features = sendy_clean[['Distance (KM)', 'Pickup Lat', 'Pickup Long', 'Destination Lat', 'Destination Long', 'Time from Pickup to Arrival']]
target = sendy_clean['Time from Pickup to Arrival']

# Split the data into training and test sets

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.3, random_state=45
)
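
The `features` frame above includes the target column itself. A leakage-free split could look like the sketch below, which assumes a toy frame with made-up values and uses the hypothetical names `X` and `y`; it is an illustration of the pattern rather than a rerun of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

TARGET = "Time from Pickup to Arrival"

# Toy data (illustrative values, not the real Sendy rows)
toy = pd.DataFrame({
    "Distance (KM)": [4, 16, 3, 9, 9, 5, 12, 7, 2, 10],
    "Pickup Lat": [-1.31, -1.35, -1.30, -1.28, -1.26, -1.29, -1.27, -1.33, -1.30, -1.25],
    TARGET: [745, 1993, 455, 1341, 1214, 800, 1700, 1100, 300, 1500],
})

# Keep the target out of the feature matrix
X = toy.drop(columns=[TARGET])
y = toy[TARGET]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=45
)
print(X_train.shape, X_test.shape)
```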

## 5. Feature scaling 

### Check model without any scaling

In [106]:
# First, we check modeling without normalisation or standardisation

# Fitting the model

svm_regressor = SVR(kernel='rbf', C=10)
knn_regressor = KNeighborsRegressor()
dec_regressor = DecisionTreeRegressor(random_state=27)

svm_regressor.fit(features_train, target_train)
knn_regressor.fit(features_train, target_train)
dec_regressor.fit(features_train, target_train)

# Make Predictions  
svm_target_pred = svm_regressor.predict(features_test)
knn_target_pred = knn_regressor.predict(features_test)
dec_target_pred = dec_regressor.predict(features_test)

# Finally, we evaluate the models

print('SVM RMSE:', np.sqrt(metrics.mean_squared_error(target_test, svm_target_pred)))
print('KNN RMSE:', np.sqrt(metrics.mean_squared_error(target_test, knn_target_pred)))
print('Decision Tree RMSE:', np.sqrt(metrics.mean_squared_error(target_test, dec_target_pred)))

SVM RMSE: 29.565333421236268
KNN RMSE: 0.8457553455630903
Decision Tree RMSE: 0.9134285211620193
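
The RMSE figures printed above come from `np.sqrt(metrics.mean_squared_error(...))`. As a minimal sketch of what that expression computes, the helper below reproduces it with NumPy alone. The name `rmse` and the sample numbers are illustrative and do not come from the notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Errors of -10, +10 and -30 seconds on three hypothetical deliveries
value = rmse([100, 200, 300], [110, 190, 330])
print(value)
```

Because the errors are squared before averaging, RMSE is expressed in the target's units (seconds here) and penalises large misses more heavily than small ones.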


### Normalization

In [107]:
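# Performing normalisation
# Note: features_train and features_test are overwritten here, so every later cell
# starts from the already-transformed arrays rather than from the raw features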

norm = MinMaxScaler().fit(features_train) 
features_train = norm.transform(features_train) 
features_test = norm.transform(features_test)

# Fitting the model

svm_regressor = SVR(kernel='rbf', C=10)
knn_regressor = KNeighborsRegressor()
dec_regressor = DecisionTreeRegressor(random_state=27)

svm_regressor.fit(features_train, target_train)
knn_regressor.fit(features_train, target_train)
dec_regressor.fit(features_train, target_train)

# Make Predictions  
svm_target_pred = svm_regressor.predict(features_test)
knn_target_pred = knn_regressor.predict(features_test)
dec_target_pred = dec_regressor.predict(features_test)

# Finally, we evaluate the models

print('SVM RMSE:', np.sqrt(metrics.mean_squared_error(target_test, svm_target_pred)))
print('KNN RMSE:', np.sqrt(metrics.mean_squared_error(target_test, knn_target_pred)))
print('Decision Tree RMSE:', np.sqrt(metrics.mean_squared_error(target_test, dec_target_pred)))

SVM RMSE: 148.01541465119323
KNN RMSE: 95.4126782073876
Decision Tree RMSE: 0.9129329006582698
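
One way to compare scaled and unscaled models without overwriting `features_train` is to wrap the scaler and the estimator in a scikit-learn pipeline. The sketch below assumes synthetic data and hypothetical variable names (`X_train`, `model`); it illustrates the pattern and is not the notebook's own workflow.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the delivery features (assumption)
rng = np.random.default_rng(0)
X = rng.uniform(0, 50, size=(40, 2))
y = 150 * X[:, 0] + rng.normal(0, 5, size=40)

X_train, X_test = X[:30], X[30:]
y_train, y_test = y[:30], y[30:]
X_train_before = X_train.copy()

# The scaler is fitted on the training data only, inside the pipeline
model = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=3))
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Inspect the scaled training data without touching X_train itself
scaled_train = model.named_steps["minmaxscaler"].transform(X_train)
print(scaled_train.min(), scaled_train.max())
```

With this arrangement the raw arrays remain available for the next experiment, so each scaling method can be evaluated from the same starting point.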


### Standardization

In [108]:
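# Performing standardisation
# Caution: the inputs here are the min-max scaled arrays from the previous cell, and
# calling fit_transform on the test set refits the scaler on test statistics;
# sc.transform(features_test) would reuse the training mean and standard deviation.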

sc = StandardScaler() 
features_train = sc.fit_transform(features_train)
features_test = sc.fit_transform(features_test)

# Fitting the model

svm_regressor = SVR(kernel='rbf', C=10)
knn_regressor = KNeighborsRegressor()
dec_regressor = DecisionTreeRegressor(random_state=27)

svm_regressor.fit(features_train, target_train)
knn_regressor.fit(features_train, target_train)
dec_regressor.fit(features_train, target_train)

# Make Predictions  
svm_target_pred = svm_regressor.predict(features_test)
knn_target_pred = knn_regressor.predict(features_test)
dec_target_pred = dec_regressor.predict(features_test)

# Finally, we evaluate the models

print('SVM RMSE:', np.sqrt(metrics.mean_squared_error(target_test, svm_target_pred)))
print('KNN RMSE:', np.sqrt(metrics.mean_squared_error(target_test, knn_target_pred)))
print('Decision Tree RMSE:', np.sqrt(metrics.mean_squared_error(target_test, dec_target_pred)))

SVM RMSE: 182.2590081878989
KNN RMSE: 114.16117150713866
Decision Tree RMSE: 18.700483382812557
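
The difference between `transform` and `fit_transform` on the test set can be seen on a tiny example. The arrays below are made up for illustration; the point is only that refitting on the test data places it on a different scale from the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [10.0]])    # mean 5, population std 5
test = np.array([[20.0], [30.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)   # learns mean and std from train
test_scaled = sc.transform(test)         # reuses them: (20-5)/5 and (30-5)/5

# Refitting on the test set uses its own mean (25) and std (5) instead
refit_scaled = StandardScaler().fit_transform(test)

print(test_scaled.ravel(), refit_scaled.ravel())
```

A model trained on `train_scaled` would see the same raw value mapped to different inputs depending on which of the two test transforms is used.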


## 6. Feature Selection:

### Feature Transformation: Principal Component Analysis

In [109]:
# Performing normalisation 

norm = MinMaxScaler().fit(features_train) 
features_train = norm.transform(features_train) 
features_test = norm.transform(features_test)

# Applying PCA
 
pca = PCA()
features_train = pca.fit_transform(features_train)
features_test = pca.transform(features_test)

# Fitting the model

svm_regressor = SVR(kernel='rbf', C=10)
knn_regressor = KNeighborsRegressor()
dec_regressor = DecisionTreeRegressor(random_state=27)

svm_regressor.fit(features_train, target_train)
knn_regressor.fit(features_train, target_train)
dec_regressor.fit(features_train, target_train)

# Make Predictions  
svm_target_pred = svm_regressor.predict(features_test)
knn_target_pred = knn_regressor.predict(features_test)
dec_target_pred = dec_regressor.predict(features_test)

# Finally, we evaluate the models

print('SVM RMSE:', np.sqrt(metrics.mean_squared_error(target_test, svm_target_pred)))
print('KNN RMSE:', np.sqrt(metrics.mean_squared_error(target_test, knn_target_pred)))
print('Decision Tree RMSE:', np.sqrt(metrics.mean_squared_error(target_test, dec_target_pred)))


SVM RMSE: 173.78867759253905
KNN RMSE: 93.89825412209801
Decision Tree RMSE: 84.67214569246815
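
`PCA()` with no arguments keeps every component, so it rotates the features without reducing their number. The sketch below, on synthetic data with one redundant column (an assumption made for illustration), shows how `explained_variance_ratio_` and a fractional `n_components` can be used to drop components that carry little variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic features: the third column is the sum of the first two
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + b])

# Full PCA: inspect how much variance each component explains
full = PCA(svd_solver="full").fit(X)
ratios = full.explained_variance_ratio_

# Keep just enough components to explain at least 95% of the variance
pca95 = PCA(n_components=0.95, svd_solver="full")
reduced = pca95.fit_transform(X)

print(ratios.round(3), reduced.shape)
```

Here the redundant column contributes almost no additional variance, so two components are retained instead of three.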


### Feature Transformation: Linear Discriminant Analysis

In [110]:
# Performing normalisation 

norm = MinMaxScaler().fit(features_train) 
features_train = norm.transform(features_train) 
features_test = norm.transform(features_test)

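# Applying LDA
# Note: LinearDiscriminantAnalysis is a classifier, so each distinct delivery time
# in the target is treated as its own class here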

lda = LinearDiscriminantAnalysis()
features_train = lda.fit_transform(features_train, target_train)
features_test = lda.transform(features_test)

# Fitting the model

svm_regressor = SVR(kernel='rbf', C=10)
knn_regressor = KNeighborsRegressor()
dec_regressor = DecisionTreeRegressor(random_state=27)

svm_regressor.fit(features_train, target_train)
knn_regressor.fit(features_train, target_train)
dec_regressor.fit(features_train, target_train)

# Make Predictions  
svm_target_pred = svm_regressor.predict(features_test)
knn_target_pred = knn_regressor.predict(features_test)
dec_target_pred = dec_regressor.predict(features_test)

# Finally, we evaluate the models

print('SVM RMSE:', np.sqrt(metrics.mean_squared_error(target_test, svm_target_pred)))
print('KNN RMSE:', np.sqrt(metrics.mean_squared_error(target_test, knn_target_pred)))
print('Decision Tree RMSE:', np.sqrt(metrics.mean_squared_error(target_test, dec_target_pred)))


SVM RMSE: 498.0547484868409
KNN RMSE: 356.52544575897605
Decision Tree RMSE: 517.6005518705264
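
Because LDA expects class labels, one possible adaptation for a regression target is to bin the target into a few classes before fitting the transform. The sketch below is an assumption for illustration only: the tercile binning, the synthetic data and the names `y_class` and `X_lda` are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic features and a continuous target (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 4))
y_cont = 1000 + 300 * X[:, 0] + rng.normal(0, 50, size=120)

# Hypothetical binning: short / medium / long deliveries by terciles
edges = np.quantile(y_cont, [1 / 3, 2 / 3])
y_class = np.digitize(y_cont, edges)   # labels 0, 1, 2

# LDA yields at most (n_classes - 1) components, so 2 here
lda = LinearDiscriminantAnalysis()
X_lda = lda.fit_transform(X, y_class)

print(X_lda.shape)
```

The regressors would still be trained on the continuous target; only the projection is learned from the binned labels.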


### Wrapper Method: Step Forward Feature Selection

In [111]:
# We'll need to install mlrose and import six, sys, mlrose and joblib
# before using `SequentialFeatureSelector` from mlxtend for feature selection.

# importing six and sys
import six
import sys
sys.modules['sklearn.externals.six'] = six

# installing mlrose
!pip install mlrose
import mlrose

# importing joblib
import joblib
sys.modules['sklearn.externals.joblib'] = joblib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [112]:
# We perform feature selection on the features after standardisation and normalisation.

# Performing normalisation 
norm = MinMaxScaler().fit(features_train) 
features_train = norm.transform(features_train) 
features_test = norm.transform(features_test)

# Selecting the ML algorithm to use   
dec_regressor = DecisionTreeRegressor(random_state=27)

from mlxtend.feature_selection import SequentialFeatureSelector
feature_selector = SequentialFeatureSelector(dec_regressor,
           k_features=4,
           forward=True,
           verbose=2,
           scoring='r2',
           cv=4)
 
# Perform step forward feature selection
feature_selector = feature_selector.fit(features_train, target_train) 

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    0.4s finished

[2022-11-25 19:22:00] Features: 1/4 -- score: 0.057473947347903426[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:    0.5s finished

[2022-11-25 19:22:00] Features: 2/4 -- score: 0.21690937677937092[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.2s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:    0.5s finished

[2022-11-25 19:22:01] Features: 3/4 -- score: 0.37444747225462377[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 

In [113]:
# The columns at these indexes are those which were selected

feat_cols = list(feature_selector.k_feature_idx_)
print(feat_cols)

[0, 1, 2, 3]


In [114]:
# We can now use those features to build our model

# Without step forward feature selection (sffs)
dec_regressor = DecisionTreeRegressor(random_state=27)
dec_regressor.fit(features_train, target_train)

# With step forward feature selection

dec_regressor2 = DecisionTreeRegressor(random_state=27)
dec_regressor2.fit(features_train[:, feat_cols], target_train)

# Making predictions and computing the RMSE

target_test_pred = dec_regressor.predict(features_test)
print('Decision Tree RMSE Without sffs:', np.sqrt(metrics.mean_squared_error(target_test, target_test_pred)))

target_test_pred2 = dec_regressor2.predict(features_test[:, feat_cols])
print('Decision Tree RMSE with sffs:', np.sqrt(metrics.mean_squared_error(target_test, target_test_pred2)))

Decision Tree RMSE Without sffs: 517.6035755446537
Decision Tree RMSE with sffs: 562.8970982738125
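
scikit-learn (version 0.24 and later) ships its own `SequentialFeatureSelector`, which offers forward selection without the mlxtend dependency. The sketch below assumes such a version is installed and uses synthetic data in which only columns 0 and 2 drive the target; it is an alternative illustration, not the mlxtend call used above.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on columns 0 and 2 only (assumption)
rng = np.random.default_rng(27)
X = rng.normal(size=(200, 4))
y = 10 * X[:, 0] + 5 * X[:, 2] + rng.normal(0, 0.5, size=200)

selector = SequentialFeatureSelector(
    DecisionTreeRegressor(random_state=27),
    n_features_to_select=2,   # analogous to k_features in mlxtend
    direction="forward",      # step forward selection
    scoring="r2",
    cv=4,
)
selector.fit(X, y)

# Indices of the selected columns
selected_idx = list(np.flatnonzero(selector.get_support()))
print(selected_idx)

# Reduce any array with the same columns to the selected features
X_selected = selector.transform(X)
```

Forward selection only helps when it discards features, so the number of features to select should be smaller than the number available.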
