## Create Realtime ML Model

We use this notebook to create the model we will use for real-time scoring.

#### Pre-reqs

Ensure you have the following Python packages installed (`pip install`):

* `pandas`
* `scikit-learn`
* `numpy`

We also use `pickle`, which is part of the Python standard library and needs no installation.

The data we use to create the model is in the `advertising.csv` file found in the top-level `data` directory. The file is "borrowed" from this blog post: [Online Ad Click Prediction with Machine Learning](https://towardsdatascience.com/online-ad-click-prediction-with-machine-learning-b68c1467d960).

If you are running this notebook from **VS Code**, you may also have to install `ipykernel`.

With the above requirements satisfied we can get going!

### Import Required Libraries/Packages

Let's import what we need:

In [18]:
import numpy as np
import pandas as pd
import sklearn
import pickle

# We'll use LogisticRegression for our model
from sklearn.linear_model import LogisticRegression
# To split our data we use train_test_split.
from sklearn.model_selection import train_test_split

### Load the Data

Let's load the data and get a feel for what it looks like:

In [19]:
# read in the data from the csv file 
ad_df = pd.read_csv('../../data/advertising.csv')
# display the first 5 rows of the data
ad_df.head()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Ad Topic Line,City,Male,Country,Timestamp,Clicked on Ad
0,68.95,35,61833.9,256.09,Cloned 5thgeneration orchestration,Wrightburgh,0,Tunisia,2016-03-27 00:53:11,0
1,80.23,31,68441.85,193.77,Monitored national standardization,West Jodi,1,Nauru,2016-04-04 01:39:02,0
2,69.47,26,59785.94,236.5,Organic bottom-line service-desk,Davidton,0,San Marino,2016-03-13 20:35:42,0
3,74.15,29,54806.18,245.89,Triple-buffered reciprocal time-frame,West Terrifurt,1,Italy,2016-01-10 02:31:19,0
4,68.37,35,73889.99,225.58,Robust logistical utilization,South Manuel,0,Iceland,2016-06-03 03:36:18,0


#### Feature Engineering

From `ad_df.head()` above we see that the dataset has 10 columns, including the one we want to predict, `Clicked on Ad`.

As per the [blog post](https://towardsdatascience.com/online-ad-click-prediction-with-machine-learning-b68c1467d960), some of the columns have little impact on the model, so let us drop those:

In [5]:
X = ad_df.drop(labels=['Ad Topic Line','City','Country','Timestamp','Clicked on Ad'], axis=1)
y = ad_df['Clicked on Ad']
# After the drop the dataset looks like so
X.head()

Unnamed: 0,Daily Time Spent on Site,Age,Area Income,Daily Internet Usage,Male
0,68.95,35,61833.9,256.09,0
1,80.23,31,68441.85,193.77,1
2,69.47,26,59785.94,236.5,0
3,74.15,29,54806.18,245.89,1
4,68.37,35,73889.99,225.58,0


### Create the Model

We are now ready to create the model, but first let us split the dataset into training and test sets:

In [21]:
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.3, random_state=42)

#### Create Logistic Regression Model

Which model is best matters less here (we just want one to use in a real-time scenario), so we create a Logistic Regression model:

In [22]:
lr_model = LogisticRegression(solver='lbfgs', max_iter=1000)
lr_model.fit(X_train, y_train)
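
Before relying on the model, it can help to check how it does on the held-out split. The sketch below is self-contained, so it uses synthetic data from `make_classification` as a stand-in for `advertising.csv`; the data and the `demo_`-style names are assumptions for illustration, not part of the notebook. In the notebook itself, the equivalent check would be `lr_model.score(X_test, y_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ad data (assumption): 1000 rows, 5 numeric features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Same split settings as the notebook: 30% held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

demo_model = LogisticRegression(solver='lbfgs', max_iter=1000)
demo_model.fit(X_tr, y_tr)

# Compare predictions on the held-out rows with the true labels.
y_pred = demo_model.predict(X_te)
accuracy = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)  # rows: true class, columns: predicted class

print(f"accuracy: {accuracy:.3f}")
print(cm)
```

The confusion matrix is worth a glance alongside the accuracy, as it shows whether errors lean towards missed clicks or false alarms.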

#### Predictions

Just to ensure that everything works, let us do some predictions:

In [23]:
# Here we predict whether or not a user will click on an ad based on the features of that user.
binPredict = lr_model.predict(X_test)
print(binPredict)

[0 1 1 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1 0 1 0 1 1 0 0 1 0 0 1 1 0 1 0 1 1 0 0
 0 1 1 0 1 0 1 0 0 1 1 0 1 1 0 1 0 0 0 0 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1 0 0
 1 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 1
 0 1 1 0 1 0 0 0 0 1 1 1 0 0 0 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 1 0
 0 0 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1
 1 0 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0
 1 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 1 1 1
 0 0 1 1 0 1 0 1 0 0 1 1 1 1 0 1 0 1 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0 1 0 1
 0 1 1 0]


In [24]:
# Here we predict the probability that a user will click on an ad based on the features of that user.
# The first column is the probability that the user will not click on the ad, and the second column is the probability that the user will click on the ad.
# This is what we will use in the real-time scoring script.
prob = lr_model.predict_proba(X_test)
print(prob)

[[5.05899747e-01 4.94100253e-01]
 [3.01971071e-03 9.96980289e-01]
 [4.95289102e-02 9.50471090e-01]
 [1.00266761e-02 9.89973324e-01]
 [9.70050585e-01 2.99494150e-02]
 [7.15351236e-01 2.84648764e-01]
 [9.76649834e-01 2.33501657e-02]
 [9.21280366e-03 9.90787196e-01]
 [6.91399492e-01 3.08600508e-01]
 [5.18626496e-02 9.48137350e-01]
 [9.80802654e-01 1.91973459e-02]
 [9.09078753e-02 9.09092125e-01]
 [2.72400259e-03 9.97275997e-01]
 [9.75705812e-01 2.42941879e-02]
 [8.16276978e-02 9.18372302e-01]
 [6.29021287e-03 9.93709787e-01]
 [1.17786190e-03 9.98822138e-01]
 [3.24858449e-02 9.67514155e-01]
 [9.07252415e-01 9.27475849e-02]
 [9.69607953e-03 9.90303920e-01]
 [9.07315493e-01 9.26845067e-02]
 [1.32724713e-02 9.86727529e-01]
 [5.90823355e-03 9.94091766e-01]
 [9.84901444e-01 1.50985558e-02]
 [9.79923019e-01 2.00769811e-02]
 [4.57255934e-03 9.95427441e-01]
 [9.91313062e-01 8.68693822e-03]
 [9.85066207e-01 1.49337933e-02]
 [9.15462235e-03 9.90845378e-01]
 [2.69132722e-01 7.30867278e-01]
 [9.826569

In [25]:
# Below we see the probabilities for the first user in the test set.
print(prob[0, 0:2])

[0.50589975 0.49410025]
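
The column order of `predict_proba` follows the `classes_` attribute of the fitted model, so the click probability can be picked out by looking up the position of label `1` rather than hard-coding the second column. A minimal sketch on a small synthetic dataset (the toy data and the `demo_` names are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical toy data: one feature, label 1 when the feature is positive.
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_model = LogisticRegression(solver='lbfgs', max_iter=1000).fit(X_demo, y_demo)

demo_prob = demo_model.predict_proba(X_demo)

# classes_ lists the labels in the same order as the predict_proba columns.
click_col = list(demo_model.classes_).index(1)
click_prob = demo_prob[:, click_col]

print(demo_model.classes_)
print(click_prob[:5])
```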


#### Example of Calling it for One User

In the real-time scenario we will get data for one user, and the way we call the model looks like so:

In [26]:
# create an array of the incoming data
# columns: time spent on site, age, area income, daily internet usage, male
array = np.array([42.60, 55, 55121.65,168.29, 0])

# below we call array.reshape as predict_proba expects a 2D array.
prob = lr_model.predict_proba(array.reshape(1, -1))

print(prob)


[[0.0011894 0.9988106]]
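
In a scoring script, the single-user call could be wrapped in a small helper that fixes the feature order in one place. The sketch below is one possible shape, not the notebook's implementation: `click_probability`, `FEATURES`, and the toy model are hypothetical names and data introduced for illustration. The feature order mirrors the columns kept in `X`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature order used at training time (the columns kept in X).
FEATURES = ['Daily Time Spent on Site', 'Age', 'Area Income',
            'Daily Internet Usage', 'Male']

def click_probability(model, user):
    """Return P(click) for one user given as a dict keyed by feature name."""
    # Build a single-row 2D array in the training column order.
    row = np.array([[user[name] for name in FEATURES]], dtype=float)
    # Look up the column that corresponds to label 1 (clicked).
    click_col = list(model.classes_).index(1)
    return float(model.predict_proba(row)[0, click_col])

# Toy model so the sketch runs on its own (assumption, not the ad model).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, len(FEATURES)))
y_demo = (X_demo[:, 0] < 0).astype(int)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

user = {'Daily Time Spent on Site': 42.60, 'Age': 55, 'Area Income': 55121.65,
        'Daily Internet Usage': 168.29, 'Male': 0}
p_click = click_probability(demo_model, user)
print(p_click)
```

Keying the input by feature name means a missing field raises a `KeyError` instead of silently shifting values into the wrong columns.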


### Save the Model

We now have a model we are happy with. Let us save it so we can later load it and push it to a Kafka topic.

In [28]:
# we use pickle to save the model
# the folder needs to exist beforehand
with open('../../models/adclick.pkl', 'wb') as f:
    pickle.dump(lr_model, f)

In [29]:
# just for "giggles" we load the model back in
with open('../../models/adclick.pkl', 'rb') as f:
    model = pickle.load(f)
array = np.array([42.60, 55, 55121.65, 168.29, 0])
# and then we do a prediction with the loaded model
prob = model.predict_proba(array.reshape(1, -1))
print(prob)

[[0.0011894 0.9988106]]


So the above seemed to work, and we can now continue.
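
Since the model is meant to be pushed to a Kafka topic later, it may be useful to know that `pickle.dumps` produces the serialized model as bytes without going through a file. The sketch below round-trips a toy model through bytes and checks that the predictions match. The toy data and `demo_` names are assumptions, and no Kafka client is involved here. Only unpickle bytes from a source you trust, as unpickling can execute arbitrary code.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy model so the sketch runs on its own (assumption, not the ad model).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Serialize to bytes (what a message payload could carry) and restore.
payload = pickle.dumps(demo_model)
restored = pickle.loads(payload)

# The restored model should give the same probabilities as the original.
same = np.allclose(demo_model.predict_proba(X_demo),
                   restored.predict_proba(X_demo))
print(type(payload).__name__, len(payload) > 0, same)
```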