# ML Model Development

In this notebook we start from a simple machine learning problem, John's food delivery problem, and build a machine learning solution with the __sklearn__ library. Given John's dataset of food delivery records, the task is to predict whether an order will be on time or late.

1. <a href="#1">The dataset</a>
2. <a href="#2">Select features to build the model</a>
3. <a href="#3">Train a classifier</a> ([__K Nearest Neighbors Classifier__](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))
4. <a href="#4">Use the trained classifier to make predictions</a>
5. <a href="#5">Model evaluation</a>
6. <a href="#6">Training and test datasets</a>


## 1. <a id="1">The dataset</a>

Let's enter John's food delivery dataset here, using __numpy__, a high-performance array-processing package.

In [34]:
import numpy as np

# This is John's data
data = np.array([[0, 1, 5, 1, 0],  # record of John's 1st food delivery
                [1, 0, 7, 0, 1],   # record of John's 2nd food delivery
                [0, 1, 2, 1, 0],   # record of John's 3rd food delivery
                [1, 1, 4.2, 1, 0], # record of John's 4th food delivery
                [0, 0, 7.8, 0, 1], # ...
                [1, 0, 3.9, 1, 0],
                [0, 1, 4, 1, 0],
                [1, 1, 2, 0, 0],
                [0, 0, 3.5, 0, 1],
                [1, 0, 2.6, 1, 0],
                [0, 0, 4.1, 0, 1],
                [0, 1, 1.5, 0, 1],
                [1, 1, 1.75, 1, 0],
                [1, 0, 1.3, 0, 0],
                [1, 1, 2.1, 0, 0],
                [1, 1, 0.2, 1, 0],
                [1, 1, 5.2, 0, 1],
                [0, 1, 2, 1, 0],
                [1, 0, 5.5, 0, 1],
                [0, 0, 2, 1, 0],
                [1, 1, 1.7, 0, 0],
                [0, 1, 3, 1, 1],
                [1, 1, 1.9, 1, 0],
                [0, 1, 3.1, 0, 1],
                [0, 1, 2.3, 0, 0],
                [0, 0, 1.1, 1, 0],
                [1, 1, 2.5, 1, 1],
                [1, 1, 5, 0, 1],
                [1, 0, 7.5, 1, 1],
                [0, 0, 0.5, 1, 0],
                [0, 0, 1.5, 1, 0],
                [1, 0, 3.2, 1, 0],
                [0, 0, 2.15, 1, 0],
                [1, 1, 4.2, 0, 1],
                [1, 0, 6.5, 0, 1],
                [1, 0, 0.5, 0, 0],
                [0, 0, 3.5, 0, 1],
                [0, 0, 1.75, 0, 0],
                [1, 1, 5, 0, 1],
                [0, 0, 2, 1, 0],
                [0, 1, 1.3, 1, 1],
                [0, 1, 0.2, 0, 0],
                [1, 1, 2.2, 0, 0],
                [0, 1, 1.2, 1, 0],
                [1, 1, 4.2, 0, 1]])

print(data)

[[0.   1.   5.   1.   0.  ]
 [1.   0.   7.   0.   1.  ]
 [0.   1.   2.   1.   0.  ]
 [1.   1.   4.2  1.   0.  ]
 [0.   0.   7.8  0.   1.  ]
 [1.   0.   3.9  1.   0.  ]
 [0.   1.   4.   1.   0.  ]
 [1.   1.   2.   0.   0.  ]
 [0.   0.   3.5  0.   1.  ]
 [1.   0.   2.6  1.   0.  ]
 [0.   0.   4.1  0.   1.  ]
 [0.   1.   1.5  0.   1.  ]
 [1.   1.   1.75 1.   0.  ]
 [1.   0.   1.3  0.   0.  ]
 [1.   1.   2.1  0.   0.  ]
 [1.   1.   0.2  1.   0.  ]
 [1.   1.   5.2  0.   1.  ]
 [0.   1.   2.   1.   0.  ]
 [1.   0.   5.5  0.   1.  ]
 [0.   0.   2.   1.   0.  ]
 [1.   1.   1.7  0.   0.  ]
 [0.   1.   3.   1.   1.  ]
 [1.   1.   1.9  1.   0.  ]
 [0.   1.   3.1  0.   1.  ]
 [0.   1.   2.3  0.   0.  ]
 [0.   0.   1.1  1.   0.  ]
 [1.   1.   2.5  1.   1.  ]
 [1.   1.   5.   0.   1.  ]
 [1.   0.   7.5  1.   1.  ]
 [0.   0.   0.5  1.   0.  ]
 [0.   0.   1.5  1.   0.  ]
 [1.   0.   3.2  1.   0.  ]
 [0.   0.   2.15 1.   0.  ]
 [1.   1.   4.2  0.   1.  ]
 [1.   0.   6.5  0.   1.  ]
 [1.   0.   0.5  0. 

Let's now write our toy dataset into a __pandas__ dataframe, labeling the columns for easier access. 

In [35]:
import pandas as pd

# Create the dataframe with this data, labeling the columns
delivery_data = pd.DataFrame(data, columns=["bad_weather", "is_rush_hour", "mile_distance", "urban_address", "late"])


Looking at our dataset as a dataframe:

In [36]:
# Print the first 10 rows
delivery_data.head(10)

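|   | bad_weather | is_rush_hour | mile_distance | urban_address | late |
|---|-------------|--------------|---------------|---------------|------|
| 0 | 0.0 | 1.0 | 5.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 7.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 |
| 3 | 1.0 | 1.0 | 4.2 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 7.8 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 3.9 | 1.0 | 0.0 |
| 6 | 0.0 | 1.0 | 4.0 | 1.0 | 0.0 |
| 7 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 3.5 | 0.0 | 1.0 |
| 9 | 1.0 | 0.0 | 2.6 | 1.0 | 0.0 |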


Dataframes are not just more meaningful to look at; they are also powerful, expressive and flexible data structures that make data manipulation and analysis much easier.
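As a small illustration of that flexibility, the sketch below rebuilds a dataframe from only the first five records of John's data and runs three common operations on it: selecting a column by name, filtering rows with a condition, and aggregating a column. The variable names (`sample`, `long_trips`, `late_share`) and the 4.5-mile threshold are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# First five records of John's data, copied from the array above
rows = np.array([[0, 1, 5, 1, 0],
                 [1, 0, 7, 0, 1],
                 [0, 1, 2, 1, 0],
                 [1, 1, 4.2, 1, 0],
                 [0, 0, 7.8, 0, 1]])
columns = ["bad_weather", "is_rush_hour", "mile_distance", "urban_address", "late"]
sample = pd.DataFrame(rows, columns=columns)

# Select a single column by its name
distances = sample["mile_distance"]

# Filter rows with a boolean condition (4.5 miles is an arbitrary threshold)
long_trips = sample[sample["mile_distance"] > 4.5]

# Aggregate: share of late deliveries among these five records
late_share = sample["late"].mean()

print(len(long_trips), late_share)
```

With a plain numpy array, each of these steps would need column positions instead of column names.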

## 2. <a id="2">Select features to build the model</a>

Let's start using the dataframe by first grabbing the input and output of our machine learning problem.

In [37]:
input_data = delivery_data[["bad_weather", "is_rush_hour", "mile_distance", "urban_address"]]
target = delivery_data["late"]

This dataset contains samples of each of the two possible classes. We fit an estimator from the __sklearn__ library to capture the relationship between the input and the output, and then use that learned relationship to predict the classes of unseen samples.

In __sklearn__, an estimator is a Python object that implements the methods __.fit()__ and __.predict()__. The estimator’s constructor takes as arguments the model’s parameters.
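To make the fit/predict pattern concrete, here is a minimal sketch of an object that follows it. `MajorityClassifier` is a hypothetical toy class written for this illustration; it is not part of sklearn and does not inherit from any sklearn base class. It only mirrors the shape of the interface: parameters go into the constructor, `fit` learns from data, and `predict` returns one label per input row.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most common training label."""

    def __init__(self, default_label=0):
        # Constructor arguments play the role of the model's parameters
        self.default_label = default_label
        self.majority_ = None

    def fit(self, X, y):
        # "Learning" here is just counting labels; the inputs X are ignored
        labels = list(y)
        if labels:
            self.majority_ = Counter(labels).most_common(1)[0][0]
        else:
            self.majority_ = self.default_label
        return self  # returning self mirrors what sklearn estimators do

    def predict(self, X):
        # One prediction per input row
        return [self.majority_ for _ in X]


model = MajorityClassifier()
model.fit([[0, 1], [1, 0], [1, 1]], [0, 0, 1])
predictions = model.predict([[0, 0], [1, 1]])
print(predictions)
```

A real sklearn estimator does far more inside `fit`, but it is called in the same way.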


## 3. <a id="3">Train a classifier</a>


Let's fit a K Nearest Neighbors (KNN) model to our data. We use the sklearn library's [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) here.

In [38]:
from sklearn.neighbors import KNeighborsClassifier

# Use n_neighbors = 1
# This means the KNN will consider the "closest" record to make a decision.
classifier = KNeighborsClassifier(n_neighbors = 1)

# Fit the model to our data
classifier.fit(input_data, target)

KNeighborsClassifier(n_neighbors=1)
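What "closest record" means can be sketched by hand. The block below assumes Euclidean distance, which is what KNeighborsClassifier uses with its default settings, and works on only the first five records of John's data. The helper `predict_1nn` and the query values are illustrative, not part of sklearn.

```python
import numpy as np

# First five records of John's data: four features, then the "late" label
records = np.array([[0, 1, 5, 1, 0],
                    [1, 0, 7, 0, 1],
                    [0, 1, 2, 1, 0],
                    [1, 1, 4.2, 1, 0],
                    [0, 0, 7.8, 0, 1]])
features, labels = records[:, :4], records[:, 4]


def predict_1nn(query):
    # Euclidean distance from the query to every stored record
    distances = np.sqrt(((features - query) ** 2).sum(axis=1))
    nearest = int(np.argmin(distances))
    # The prediction is simply the label of the nearest record
    return nearest, labels[nearest]


query = np.array([0, 0, 2.1, 1])  # hypothetical new delivery
nearest_index, predicted_label = predict_1nn(query)
print(nearest_index, predicted_label)
```

Because mile_distance spans a wider range than the 0/1 features, it tends to dominate the distance in this sketch.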

## 4. <a id="4">Use the trained classifier to make predictions</a>

Let's make a prediction with our fitted model. Assume we have the following data:

In [39]:
import numpy as np

some_data = np.array([[0, 0, 2.1, 1]]) # bad_weather->0, is_rush_hour->0, mile_distance->2.1 and urban_address->1

# Use the fitted model to make predictions on new data
print(classifier.predict(some_data))

[0.]


We predicted this delivery to be on time.

We can also predict multiple records, as shown below.

In [40]:
import numpy as np

some_data = np.array([[0, 0, 2.1, 1], # bad_weather->0, is_rush_hour->0, mile_distance->2.1 and urban_address->1
                 [0, 1, 5, 0],   # bad_weather->0, is_rush_hour->1, mile_distance->5.0 and urban_address->0
                 [1, 1, 3.1, 1]  # bad_weather->1, is_rush_hour->1, mile_distance->3.1 and urban_address->1
                ])

# Use the fitted model to make predictions on more new data
print(classifier.predict(some_data))

[0. 0. 1.]


The last delivery is predicted to be late. The first two are predicted to be on time (hopefully!).

## 5. <a id="5">Model evaluation</a>

__How do we know whether our predictions were good or bad predictions?__ <br/>
If we don't have the correct labels for these inputs, we can't tell. For the same reason, we have no idea yet how good this model is.

One thing we can do is to test the model with the data we used to train it, and use sklearn's metrics functions to examine the performance of the classifier.

In [41]:
# Use the fitted model to make predictions on our training dataset
predictions = classifier.predict(input_data)


__Confusion matrix__: Each row corresponds to a true class and each column to a predicted class. The first row and column correspond to the "on time" class (0), the second row and column to the "late" class (1). The diagonal entries count correct classifications; the off-diagonal entries count mistakes.


In [42]:
from sklearn.metrics import confusion_matrix

print(confusion_matrix(target, predictions))

[[27  0]
 [ 0 18]]


When we look at the confusion matrix, we can quickly see that all predictions were correct, so our classifier should have a high score.
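A matrix with no mistakes hides how the layout works, so here is a minimal sketch that builds a two-class confusion matrix by hand from made-up labels that include some errors. The function `confusion_counts` and the label lists are hypothetical; the row/column convention (rows are true classes, columns are predicted classes) follows sklearn's confusion_matrix.

```python
def confusion_counts(y_true, y_pred):
    # matrix[i][j] counts samples whose true class is i and predicted class is j
    matrix = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        matrix[int(actual)][int(predicted)] += 1
    return matrix


# Hypothetical labels: 0 = on time, 1 = late
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

matrix = confusion_counts(y_true, y_pred)
print(matrix)
```

Here one on-time delivery was predicted late (top right) and one late delivery was predicted on time (bottom left).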

__Classification metrics__: Here we use the __accuracy__ metric, which measures the fraction of outcomes (late or on time) that the trained model predicts correctly. Let's look at the classification report and the accuracy score below.

In [43]:
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

print(classification_report(target, predictions))

print("Accuracy:", accuracy_score(target, predictions))


              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        27
         1.0       1.00      1.00      1.00        18

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

Accuracy: 1.0


Indeed, we predicted all outcomes with 100% accuracy!

## 6. <a id="6">Training and test datasets</a>

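John's model reached 100% accuracy on the whole dataset. This might seem promising, but it tells us nothing about performance on future orders: with a single nearest neighbor (n_neighbors = 1), each training record is its own nearest neighbor, so a perfect score on the data the model was trained on is expected. One way to test whether the model works on new, "unseen" orders is to reserve some data from our original dataset for testing.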

__Let's split our data into two sets: Training (85%) and Test (15%)__. This will give us 38 training records and 7 test records (of the total 45 records).

In [44]:
delivery_data.shape

(45, 5)

In [45]:
# Let's split our data into two sets: Training (85%) and Test (15%)
# This gives us 38 training records and 7 test records (total 45 records)

training_data = delivery_data.iloc[:38, :] # First 38
test_data = delivery_data.iloc[38:, :] # Remaining

# Print the first 5 rows
training_data.head()

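|   | bad_weather | is_rush_hour | mile_distance | urban_address | late |
|---|-------------|--------------|---------------|---------------|------|
| 0 | 0.0 | 1.0 | 5.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 7.0 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 |
| 3 | 1.0 | 1.0 | 4.2 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 7.8 | 0.0 | 1.0 |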
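The split above takes the records in their original order. A common alternative, shown here only as a sketch and not as what this notebook does, is sklearn's train_test_split, which can shuffle the records and keep the class ratio similar in both parts. The stand-in arrays `X` and `y` and the seed value are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape as John's problem:
# 45 records with 4 features, 27 on time (0) and 18 late (1)
X = np.arange(45 * 4).reshape(45, 4)
y = np.array([0] * 27 + [1] * 18)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y,
    test_size=0.15,   # hold out about 15% of the records for testing
    shuffle=True,     # mix the records before splitting
    stratify=y,       # keep a similar late / on-time ratio in both parts
    random_state=0,   # arbitrary seed so the split is repeatable
)

print(X_tr.shape, X_te.shape)
```

With 45 records and test_size=0.15, this also yields 38 training records and 7 test records.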


This time, we fit the KNN on the training dataset only.

In [46]:
from sklearn.neighbors import KNeighborsClassifier

X_train = training_data[["bad_weather", "is_rush_hour", "mile_distance", "urban_address"]].values
y_train = training_data["late"].tolist()

# Use n_neighbors = 1
# This means the KNN will consider only the single "closest" record to make a decision.
classifier = KNeighborsClassifier(n_neighbors = 1)

# Fit the model to our training data
classifier.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=1)

Let's check the accuracy on the training data.

In [47]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted model to make predictions on the same dataset we trained the model on
train_predictions = classifier.predict(X_train)

print('Model evaluation on the training set: \n')
print(confusion_matrix(y_train, train_predictions))
print(classification_report(y_train, train_predictions))
print("Training accuracy:", accuracy_score(y_train, train_predictions))


Model evaluation on the training set: 

[[23  0]
 [ 0 15]]
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00        23
         1.0       1.00      1.00      1.00        15

    accuracy                           1.00        38
   macro avg       1.00      1.00      1.00        38
weighted avg       1.00      1.00      1.00        38

Training accuracy: 1.0


And now, let's check the accuracy on the test data.

In [48]:
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

X_test = test_data[["bad_weather", "is_rush_hour", "mile_distance", "urban_address"]].values
y_test = test_data["late"].tolist()

# Use the fitted model to make predictions on the test dataset
test_predictions = classifier.predict(X_test)

print('Model evaluation on the test set: \n')
print(confusion_matrix(y_test, test_predictions))
print(classification_report(y_test, test_predictions))
print("Test accuracy:", accuracy_score(y_test, test_predictions))


Model evaluation on the test set: 

[[3 1]
 [1 2]]
              precision    recall  f1-score   support

         0.0       0.75      0.75      0.75         4
         1.0       0.67      0.67      0.67         3

    accuracy                           0.71         7
   macro avg       0.71      0.71      0.71         7
weighted avg       0.71      0.71      0.71         7

Test accuracy: 0.7142857142857143
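
The numbers in this report can be reproduced from the confusion matrix alone. The sketch below copies the matrix printed above for the 7 held-out records and recomputes accuracy, precision and recall with plain Python. The helper names `precision` and `recall` are written for this illustration and are not sklearn functions.

```python
# Confusion matrix printed above for the 7 held-out records
# rows = true class, columns = predicted class (0 = on time, 1 = late)
matrix = [[3, 1],
          [1, 2]]

total = sum(sum(row) for row in matrix)
# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = (matrix[0][0] + matrix[1][1]) / total


def precision(cls):
    # Of everything predicted as cls, the share that truly was cls
    predicted_as_cls = matrix[0][cls] + matrix[1][cls]
    return matrix[cls][cls] / predicted_as_cls


def recall(cls):
    # Of everything that truly was cls, the share that was predicted as cls
    return matrix[cls][cls] / sum(matrix[cls])


print(round(accuracy, 2))                            # 0.71
print(round(precision(0), 2), round(recall(0), 2))   # 0.75 0.75
print(round(precision(1), 2), round(recall(1), 2))   # 0.67 0.67
```

So 5 of the 7 held-out deliveries were classified correctly, compared with all 38 of the records the model was fitted on.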
