# <a name="0">Prediction of Food Delivery</a>


In this notebook we address John's food delivery problem and shape a machine learning solution with the <b>sklearn</b> library. Given John's dataset of food delivery records, the task is to predict whether an order will be on time or delayed.  

1. <a href="#1">The dataset</a>
2. <a href="#2">Select features to build a model</a>
3. <a href="#3">Train a classifier (<b>K Nearest Neighbours Classifier</b>)</a>
4. <a href="#4">Use the trained classifier to make predictions</a>
5. <a href="#5">Model Evaluation</a>
6. <a href="#6">Training and test datasets</a>
7. <a href="#7">Overfitting</a>

## 1. <a name="1">The dataset</a>
(<a href="#0">Go to top</a>)

Let's enter John's food deliveries dataset here, using the <b>numpy</b> high performance array-processing package.

In [3]:
import numpy as np

#This is John's data
data = np.array([[0, 1, 5, 1, 0], # record of John's 1st food delivery
               [1, 0, 7, 0, 1],   # record of John's 2nd food delivery
               [0, 1, 2, 1, 1],   # record of John's 3rd food delivery
               [1, 1, 4.2, 1, 0], # record of John's 4th food delivery
               [0, 0, 7.8, 0, 1], # ...
               [1, 0, 3.9, 1, 0],
               [0, 1, 4, 1, 0],
               [1, 1, 2, 0, 0],
               [0, 0, 3.5, 0, 1],
               [1, 0, 2.6, 1, 0],
               [0, 0, 4.1, 0, 1],
               [0, 1, 1.5, 0, 1],
               [1, 1, 1.75, 1, 0],
               [1, 0, 1.3, 0, 0],
               [1, 1, 2.1, 0, 0],
               [1, 1, 0.2, 1, 0],
               [1, 1, 5.2, 0, 1],
               [0, 1, 2, 1, 0],
               [1, 0, 5.5, 0, 1],
               [0, 0, 2, 1, 0],
               [1, 1, 1.7, 0, 0],
               [0, 1, 3, 1, 1],
               [1, 1, 1.9, 1, 0],
               [0, 1, 3.1, 0, 1],
               [0, 1, 2.3, 0, 0],
               [0, 0, 1.1, 1, 0],
               [1, 1, 2.5, 1, 1],
               [1, 1, 5, 0, 1],
               [1, 0, 7.5, 1, 1],
               [0, 0, 0.5, 1, 0],
               [0, 0, 0.5, 1, 0],
               [0, 0, 1.5, 1, 0],
               [1, 0, 3.2, 1, 0],
               [0, 0, 2.15, 1, 0],
               [1, 0, 4.2, 0, 1],   
               [1, 0, 6.5, 0, 1],
               [1, 0, 0.5, 0, 0], 
               [0, 0, 3.5, 0, 1],
               [0, 0, 1.75, 0, 0],
               [1, 1, 5, 0, 1],
               [0, 0, 2, 1, 0],
               [0, 1, 1.3, 1, 1],
               [0, 1, 0.2, 0, 0],
               [1, 1, 2.2, 0, 0],
               [0, 1, 1.2, 1, 0],
               [1, 1, 4.2, 0, 1]])

print(data)
print(data.shape)

[[0.   1.   5.   1.   0.  ]
 [1.   0.   7.   0.   1.  ]
 [0.   1.   2.   1.   1.  ]
 [1.   1.   4.2  1.   0.  ]
 [0.   0.   7.8  0.   1.  ]
 [1.   0.   3.9  1.   0.  ]
 [0.   1.   4.   1.   0.  ]
 [1.   1.   2.   0.   0.  ]
 [0.   0.   3.5  0.   1.  ]
 [1.   0.   2.6  1.   0.  ]
 [0.   0.   4.1  0.   1.  ]
 [0.   1.   1.5  0.   1.  ]
 [1.   1.   1.75 1.   0.  ]
 [1.   0.   1.3  0.   0.  ]
 [1.   1.   2.1  0.   0.  ]
 [1.   1.   0.2  1.   0.  ]
 [1.   1.   5.2  0.   1.  ]
 [0.   1.   2.   1.   0.  ]
 [1.   0.   5.5  0.   1.  ]
 [0.   0.   2.   1.   0.  ]
 [1.   1.   1.7  0.   0.  ]
 [0.   1.   3.   1.   1.  ]
 [1.   1.   1.9  1.   0.  ]
 [0.   1.   3.1  0.   1.  ]
 [0.   1.   2.3  0.   0.  ]
 [0.   0.   1.1  1.   0.  ]
 [1.   1.   2.5  1.   1.  ]
 [1.   1.   5.   0.   1.  ]
 [1.   0.   7.5  1.   1.  ]
 [0.   0.   0.5  1.   0.  ]
 [0.   0.   0.5  1.   0.  ]
 [0.   0.   1.5  1.   0.  ]
 [1.   0.   3.2  1.   0.  ]
 [0.   0.   2.15 1.   0.  ]
 [1.   0.   4.2  0.   1.  ]
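 [1.   0.   6.5  0.   1.  ]
 [1.   0.   0.5  0.   0.  ]
 [0.   0.   3.5  0.   1.  ]
 [0.   0.   1.75 0.   0.  ]
 [1.   1.   5.   0.   1.  ]
 [0.   0.   2.   1.   0.  ]
 [0.   1.   1.3  1.   1.  ]
 [0.   1.   0.2  0.   0.  ]
 [1.   1.   2.2  0.   0.  ]
 [0.   1.   1.2  1.   0.  ]
 [1.   1.   4.2  0.   1.  ]]
(46, 5)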

Let's now write our toy dataset into a <b>pandas</b> dataframe, labeling the columns for easier access. 

In [5]:
import pandas as pd

#Create the dataframe with this data, labeling the columns
delivery_data = pd.DataFrame(data, columns=["bad_weather", "is_rush_hour", "mile_distance", "urban_address", "late"])

Looking at our dataset as a dataframe:

In [6]:
# Print the first 10 rows
delivery_data.head(10)

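| | bad_weather | is_rush_hour | mile_distance | urban_address | late |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 5.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 7.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 2.0 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 | 4.2 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 7.8 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 3.9 | 1.0 | 0.0 |
| 6 | 0.0 | 1.0 | 4.0 | 1.0 | 0.0 |
| 7 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 3.5 | 0.0 | 1.0 |
| 9 | 1.0 | 0.0 | 2.6 | 1.0 | 0.0 |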


Dataframes are not just more meaningful to look at; they are also powerful, expressive and flexible data structures that make data manipulation and analysis much easier. 

## 2. <a name="2">Select features to build a model</a>
(<a href="#0">Go to top</a>)

Let's start using the dataframe, by first grabbing the input and output of our machine learning problem. 

In [7]:
input_data = delivery_data[["bad_weather", "is_rush_hour", "mile_distance", "urban_address"]]
target = delivery_data["late"]

On this dataset, which contains samples of each of the two possible classes, we fit an estimator from the <b>sklearn</b> library to capture the relationship between the input and the output. We then use that learned relationship to predict the classes of unseen samples. 

In <b>sklearn</b>, an estimator is a Python object that implements the methods <b>.fit()</b> and <b>.predict()</b>. The estimator's constructor takes the model's parameters as arguments.
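To make the <b>.fit()</b> / <b>.predict()</b> pattern concrete, here is a minimal sketch of a hand-written estimator. The class `MajorityClassEstimator` is hypothetical (it is not part of sklearn) and only imitates the interface: it "learns" the most frequent label and predicts it for every input. It is fitted on the first three records of the dataset.

```python
import numpy as np

class MajorityClassEstimator:
    """Hypothetical toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        # "Learning" here is just counting how often each label occurs.
        labels, counts = np.unique(y, return_counts=True)
        self.majority_ = labels[np.argmax(counts)]
        return self  # returning self allows chaining, as sklearn estimators do

    def predict(self, X):
        # Predict the stored majority label for every row of X.
        return np.full(len(X), self.majority_)

# First three records of the dataset (features, then the "late" label)
X_toy = np.array([[0, 1, 5.0, 1], [1, 0, 7.0, 0], [0, 1, 2.0, 1]])
y_toy = np.array([0, 1, 1])

toy_model = MajorityClassEstimator().fit(X_toy, y_toy)
toy_predictions = toy_model.predict(np.array([[1, 1, 3.0, 0]]))
print(toy_predictions)
```

A real estimator such as the one used in the next section follows the same two-step usage: fit on known records, then predict on new ones.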

## 3. <a name="3">Train a classifier</a>
(<a href="#0">Go to top</a>)

Let's fit a K Nearest Neighbours (KNN) model to our data. We use the <b>sklearn</b> library's KNeighborsClassifier here. 

In [9]:
from sklearn.neighbors import KNeighborsClassifier

# Use n_neighbors = 1
# This means the KNN will consider the "closest" record to make a decision.
classifier = KNeighborsClassifier(n_neighbors = 1)

# Fit the model to our data
classifier.fit(input_data, target)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')
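To illustrate what "closest record" means when K = 1, the following is a rough NumPy sketch of a single nearest-neighbour lookup using Euclidean distance (the printed settings above, `metric='minkowski'` with `p=2`, correspond to Euclidean distance). The helper `nearest_neighbor_predict` and the query point are made up for illustration; this is not how sklearn implements the search internally.

```python
import numpy as np

def nearest_neighbor_predict(X_train, y_train, x_new):
    """Return the label, index and distance of the training record closest to x_new."""
    # Euclidean distance from x_new to every training record
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = int(np.argmin(distances))
    return y_train[nearest], nearest, distances[nearest]

# First five records of the dataset (features and "late" labels)
X_small = np.array([[0, 1, 5.0, 1],
                    [1, 0, 7.0, 0],
                    [0, 1, 2.0, 1],
                    [1, 1, 4.2, 1],
                    [0, 0, 7.8, 0]])
y_small = np.array([0, 1, 1, 0, 1])

# Hypothetical new order: good weather, no rush hour, 7.5 miles, non-urban address
label, index, distance = nearest_neighbor_predict(X_small, y_small, np.array([0, 0, 7.5, 0]))
print("closest record:", index, "predicted label:", label, "distance:", distance)
```

The closest record is the fifth one (index 4), which differs only by 0.3 miles, so the new order inherits its "late" label.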

## 4. <a name="4">Use the trained classifier to make predictions</a>
(<a href="#0">Go to top</a>)

Let's make some predictions with our fitted model. Assume we have the following data:

In [10]:
import numpy as np

some_data = np.array([[0, 0, 2.1, 1]]) # bad_weather ->0, is_rush_hour->0, mile_distance-> 2.1 and urban_address->1

# Use the fitted model to make predictions on new data
print(classifier.predict(some_data))

[0.]


We predicted this delivery to be on time.

We can also predict multiple records, as shown below.

In [11]:
import numpy as np

some_data = np.array([[0, 0, 2.1, 1], # bad_weather ->0, is_rush_hour->0, mile_distance-> 2.1 and urban_address->1
                     [0, 1, 5, 0],    # bad_weather ->0, is_rush_hour->1, mile_distance-> 5 and urban_address->0
                     [1, 1, 3.1, 1]   # bad_weather ->1, is_rush_hour->1, mile_distance-> 3.1 and urban_address->1
                     ])

# Use the fitted model to make predictions on more new data
print(classifier.predict(some_data))

[0. 0. 1.]


The last delivery is predicted to be late. The first two will be on time. (hopefully)
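To see which training record drives a given prediction, one option is the classifier's `kneighbors` method, which returns the distances and row positions of the nearest training records. The sketch below is self-contained and therefore fits a separate small model (`knn_small`, a name chosen here) on the first five records only; the query point is hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# First five records of the dataset (features and "late" labels)
X_small = np.array([[0, 1, 5.0, 1],
                    [1, 0, 7.0, 0],
                    [0, 1, 2.0, 1],
                    [1, 1, 4.2, 1],
                    [0, 0, 7.8, 0]])
y_small = np.array([0, 1, 1, 0, 1])

knn_small = KNeighborsClassifier(n_neighbors=1).fit(X_small, y_small)

# Hypothetical order: bad weather, rush hour, 4.0 miles, urban address
query = np.array([[1, 1, 4.0, 1]])

# kneighbors returns one row per query: distances and training-row positions
neighbor_distances, neighbor_indices = knn_small.kneighbors(query)
print("nearest training row:", neighbor_indices[0, 0])
print("distance to it:", neighbor_distances[0, 0])
print("prediction:", knn_small.predict(query))
```

Here the nearest record is the fourth one (row 3), an on-time delivery, so the prediction is 0.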

## 5. <a name="5">Model Evaluation</a>
(<a href="#0">Go to top</a>)

<b>How do we know whether our predictions were good or bad predictions?</b>
If we don't have the correct label for this input, we won't know. Similarly, we won't have any idea about how good this model is.

One thing we can do is to test the model with the data we used to train it, and use sklearn's metrics functions to examine the performance of the classifier.  

In [12]:
# Use the fitted model to make predictions on our training dataset.
predictions = classifier.predict(input_data)

<b>Confusion matrix:</b> Each row corresponds to an actual class and each column to a predicted class. The first row and column correspond to the "on time" class (0), the second row and column to the "late" class (1). The diagonal entries count the correct classifications.

In [13]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(target, predictions))

[[26  1]
 [ 0 19]]


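When we look at the confusion matrix, we can quickly see that all but one of the 46 predictions were correct: one on-time delivery was predicted as late. The dataset contains two records with identical features but different labels (the 3rd and the 18th), so a 1-nearest-neighbour model cannot get both right. Our classifier should still have a high score.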
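As a small illustration of how such a matrix is assembled, the sketch below counts (actual, predicted) pairs by hand. The function `confusion_counts` and the six labels are made up for this example; the layout follows the same convention as above, with rows for actual classes and columns for predicted classes.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes=2):
    """Count how often each actual class (row) was predicted as each class (column)."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for actual, predicted in zip(y_true, y_pred):
        matrix[int(actual), int(predicted)] += 1
    return matrix

# Hypothetical labels: 0 = on time, 1 = late
y_true_demo = [0, 0, 0, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 1, 0]

cm_demo = confusion_counts(y_true_demo, y_pred_demo)
print(cm_demo)
```

In this toy case, three on-time orders are predicted correctly, one on-time order is wrongly predicted as late (top-right entry), and both late orders are predicted correctly.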

<b>Classification metrics:</b> Here we use the <b>accuracy</b> metric, the fraction of all predictions (late or on time) that the trained model gets right. Let's look at the classification report and the accuracy score below.

In [14]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print(classification_report(target, predictions))

print("Accuracy:", accuracy_score(target, predictions))

             precision    recall  f1-score   support

        0.0       1.00      0.96      0.98        27
        1.0       0.95      1.00      0.97        19

avg / total       0.98      0.98      0.98        46

Accuracy: 0.9782608695652174


Indeed, 45 of the 46 outcomes were predicted correctly, an accuracy of about 98%.
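The numbers in the report can be recomputed from the confusion matrix printed earlier. The following sketch does the arithmetic with NumPy; the variable names are chosen here for illustration.

```python
import numpy as np

# Confusion matrix from above: rows = actual class, columns = predicted class
cm = np.array([[26, 1],
               [0, 19]])

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(cm) / cm.sum()            # 45 / 46

# Precision for "late" (class 1): of all orders predicted late, how many were late
precision_late = cm[1, 1] / cm[:, 1].sum()    # 19 / 20

# Recall for "on time" (class 0): of all on-time orders, how many were predicted on time
recall_on_time = cm[0, 0] / cm[0, :].sum()    # 26 / 27

print("accuracy:", round(accuracy, 4))
print("precision (late):", round(precision_late, 2))
print("recall (on time):", round(recall_on_time, 2))
```

These values match the 0.98 accuracy, the 0.95 precision for class 1.0 and the 0.96 recall for class 0.0 in the report.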

## 6. <a name="6">Training and test datasets</a>
(<a href="#0">Go to top</a>)

John's model reached about 98% accuracy on the whole dataset. This might seem promising, but because the model was evaluated on the same data it was trained on, it tells us nothing about performance on future orders. One way to test whether the model works on new, unseen orders is to reserve some data from our original dataset for testing. 

<b>Let's split our data into train and test sets: Training (about 83%) and test (about 17%).</b> With 46 records in total, taking the first 38 for training leaves 8 for testing.

In [16]:
delivery_data.shape

(46, 5)

In [18]:
# Let's split our data into two sets: training (about 83%) and test (about 17%)
# This gives us 38 training records and 8 test records (total 46 records)

training_data = delivery_data.iloc[:38, :] # First 38
test_data = delivery_data.iloc[38:, :] # Remaining

#Print the first 5 rows
training_data.head()

| | bad_weather | is_rush_hour | mile_distance | urban_address | late |
|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 5.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 7.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 2.0 | 1.0 | 1.0 |
| 3 | 1.0 | 1.0 | 4.2 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 7.8 | 0.0 | 1.0 |
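Slicing by position keeps the records in their original order. A common alternative is sklearn's `train_test_split`, which shuffles before splitting and can keep the class proportions similar in both parts via `stratify`. The sketch below is only an illustration on the first ten records of the dataset; the names `sample`, `shuffled_train` and `shuffled_test`, the 20% test size and the `random_state` value are choices made here, not part of John's workflow.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# First ten records of the dataset
sample = pd.DataFrame(
    [[0, 1, 5.0, 1, 0], [1, 0, 7.0, 0, 1], [0, 1, 2.0, 1, 1], [1, 1, 4.2, 1, 0],
     [0, 0, 7.8, 0, 1], [1, 0, 3.9, 1, 0], [0, 1, 4.0, 1, 0], [1, 1, 2.0, 0, 0],
     [0, 0, 3.5, 0, 1], [1, 0, 2.6, 1, 0]],
    columns=["bad_weather", "is_rush_hour", "mile_distance", "urban_address", "late"],
)

# Shuffle, then hold out 20% of the rows; stratify keeps the late/on-time mix similar
shuffled_train, shuffled_test = train_test_split(
    sample, test_size=0.2, random_state=42, stratify=sample["late"]
)

print(shuffled_train.shape, shuffled_test.shape)
```

The rest of this notebook continues with the simple positional split shown above.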


This time, we fit the KNN on the training dataset only.

In [21]:
from sklearn.neighbors import KNeighborsClassifier

X_train = training_data[["bad_weather", "is_rush_hour", "mile_distance", "urban_address"]].values
y_train = training_data["late"].tolist()

# Use n_neighbors = 1
# This means the KNN will consider the "closest" record to make a decision.
classifier = KNeighborsClassifier(n_neighbors = 1)

# Fit the model to our training data
classifier.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

Let's check the accuracy on training data.

In [22]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted model to make predictions on the same dataset we trained the model on
train_predictions = classifier.predict(X_train)

print('Model evaluation on the training set:\n')
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))
print('Training accuracy:', accuracy_score(y_train, train_predictions))

Model evaluation on the training set:

[[21  1]
 [ 0 16]]
             precision    recall  f1-score   support

        0.0       1.00      0.95      0.98        22
        1.0       0.94      1.00      0.97        16

avg / total       0.98      0.97      0.97        38

Training accuracy: 0.9736842105263158


And now, let's check the accuracy on the test data.

In [23]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

X_test = test_data[["bad_weather", "is_rush_hour", "mile_distance", "urban_address"]].values
y_test = test_data["late"].tolist()

# Use the fitted model to make predictions on the held-out test set
test_predictions = classifier.predict(X_test)

print('Model evaluation on the test set:\n')
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print('Test accuracy:', accuracy_score(y_test, test_predictions))

Model evaluation on the test set:

[[2 3]
 [0 3]]
             precision    recall  f1-score   support

        0.0       1.00      0.40      0.57         5
        1.0       0.50      1.00      0.67         3

avg / total       0.81      0.62      0.61         8

Test accuracy: 0.625


## 7. <a name="7">Overfitting</a>
(<a href="#0">Go to top</a>)

<b>This doesn't look good!</b>

We only achieved 62.5% accuracy on data that the model hasn't seen before.
Can we trust this model? Probably not.

<b>Let's explain what happened here.</b>

We experienced a common problem called <b>"Overfitting"</b>. This means our model "over-learned" or memorized our training data, and failed on the new data it hasn't seen before.

Experienced practitioners would have spotted the problem even before fitting the classifier: with K = 1, the model looks only at the single closest record and assigns that record's class. This doesn't generalize well beyond the training data; it "overfits" the training set.
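The memorization effect of K = 1 can be illustrated on data where there is nothing to learn at all. In the sketch below, both the features and the labels are random noise generated here for the demonstration (they are not John's data). Each training record is its own nearest neighbour, so training accuracy is perfect, while accuracy on held-out records is typically close to chance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Synthetic data: 4 random features and labels that are pure noise
X_noise = rng.normal(size=(200, 4))
y_noise = rng.integers(0, 2, size=200)

# Fit on the first 150 records, hold out the remaining 50
X_fit, y_fit = X_noise[:150], y_noise[:150]
X_new, y_new = X_noise[150:], y_noise[150:]

memorizer = KNeighborsClassifier(n_neighbors=1).fit(X_fit, y_fit)

train_acc = accuracy_score(y_fit, memorizer.predict(X_fit))
new_acc = accuracy_score(y_new, memorizer.predict(X_new))

print("accuracy on the records used for fitting:", train_acc)
print("accuracy on held-out records:", new_acc)
```

A high training score with K = 1 therefore says very little about how the model will behave on new orders.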

<b>Where is the validation subset?</b>

To improve the performance of our algorithm, and thereby reduce the so-called <i>generalization gap</i>, we need to look for the best-performing K value using a validation set. We pick the K value that gives the best validation score for the metric of our choice, and only then check model performance on the test dataset.

In [24]:
# Let's further split our training data into two sets: Training (about 80%) and Validation (about 20%)
# This gives us 30 training records and 8 validation records

train_data = training_data.iloc[:30, :] #First 30
val_data = training_data.iloc[30:, :] #Remaining

X_train = train_data[["bad_weather", "is_rush_hour", "mile_distance", "urban_address"]].values
y_train = train_data["late"].tolist()

X_val = val_data[["bad_weather", "is_rush_hour", "mile_distance", "urban_address"]].values
y_val = val_data["late"].tolist()

<b>Trying different K values.</b>

Let's try different K values and see how the model performs with each one, on the validation set.

In [27]:
from sklearn.neighbors import KNeighborsClassifier

K_values = [1, 2, 3, 4, 5, 6]

for K in K_values:
    classifier = KNeighborsClassifier(n_neighbors = K)
    classifier.fit(X_train, y_train)
    val_predictions = classifier.predict(X_val)
    print("K=%d, Validation accuracy score:%f" % (K, accuracy_score(y_val, val_predictions)))

K=1, Validation accuracy score:1.000000
K=2, Validation accuracy score:0.875000
K=3, Validation accuracy score:1.000000
K=4, Validation accuracy score:1.000000
K=5, Validation accuracy score:1.000000
K=6, Validation accuracy score:1.000000


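Every K except K=2 reaches a validation accuracy of 1.0, so this small validation set (8 records) cannot separate the candidates. K=1 is the setting that overfit above, so we prefer a larger K. Let's choose K=4 to build the classifier, train on the train set, and finally test on the test set.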

In [28]:
classifier = KNeighborsClassifier(n_neighbors = 4)
classifier.fit(X_train, y_train)
test_predictions = classifier.predict(X_test)
print("Test accuracy score: %f" % (accuracy_score(y_test, test_predictions)))

Test accuracy score: 0.875000


Indeed, accuracy on the test set improved from 62.5% to 87.5%, reducing the generalization gap. With only 8 test records, this corresponds to 7 correct predictions instead of 5.
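Because a single validation split of 8 records gives a coarse signal, one possible refinement is k-fold cross-validation, which averages the score over several train/validation splits. The sketch below uses sklearn's `cross_val_score` on synthetic data generated here as a stand-in (the rule "late when the distance exceeds 3.5 miles" is an assumption made for the demo, not a property of John's data); the names `X_demo`, `y_demo` and `cv_means` are likewise illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n = 60

# Synthetic stand-in data: three random binary flags plus a distance column
distance = np.linspace(0.2, 8.0, n)
X_demo = np.column_stack([
    rng.integers(0, 2, n),   # bad_weather
    rng.integers(0, 2, n),   # is_rush_hour
    distance,                # mile_distance
    rng.integers(0, 2, n),   # urban_address
])
# Assumed toy rule: an order is late when the distance exceeds 3.5 miles
y_demo = (distance > 3.5).astype(int)

cv_means = {}
for K in range(1, 7):
    # 5-fold cross-validation: each fold serves once as the validation set
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=K), X_demo, y_demo, cv=5)
    cv_means[K] = scores.mean()
    print("K=%d, mean cross-validation accuracy: %f" % (K, cv_means[K]))
```

The K with the highest mean score would then be refitted on the training data and checked once on the test set, as in the cells above.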