## Practical example. Audiobooks

### Problem
We are given data from an Audiobook app, so it relates only to the audio versions of books. Every customer in the database has made at least one purchase; that is why they are in the database. We want to build a machine learning model from the available data that predicts whether a customer will buy again from the Audiobook company.
The main idea is that if a customer has a low probability of coming back, there is no reason to spend money on advertising to them. If we focus our efforts ONLY on customers who are likely to convert again, we can save a substantial amount. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.
We have a .csv file summarizing the data. It contains the following variables: Customer ID, Book length in mins_avg (average over all purchases), Book length in minutes_sum (sum over all purchases), Price Paid_avg (average over all purchases), Price paid_sum (sum over all purchases), Review (a Boolean variable: whether a review was left), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).
These are the inputs, excluding Customer ID: it is completely arbitrary, more like a name than a number. That leaves 10 input columns.
The targets are a Boolean variable (0 or 1). We use a period of 2 years for the inputs and the following 6 months for the targets. In other words, we predict whether a customer will convert in the next 6 months based on the last 2 years of activity and engagement. Six months sounds like a reasonable window: if they don't convert within 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.
The task is simple: create a machine learning algorithm that can predict whether a customer will buy again.
This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.
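
The notebook loads three `.npz` files rather than the raw .csv, so some preprocessing has to happen in between. The following is a minimal sketch of one way to do it, not the procedure actually used to produce those files. It assumes the first column is the customer ID and the last column is the 0/1 target; the function name `preprocess`, the 80/10/10 split and the seed are illustrative assumptions.

```python
import numpy as np

def preprocess(raw, train_frac=0.8, val_frac=0.1, seed=0):
    """Standardize, shuffle and split a raw data matrix (hypothetical helper).

    Assumes column 0 is the customer ID and the last column is the 0/1 target.
    """
    inputs = raw[:, 1:-1].astype(np.float64)   # drop the ID and the target
    targets = raw[:, -1].astype(np.int64)

    # Standardize each input column to zero mean and unit variance.
    std = inputs.std(axis=0)
    std[std == 0] = 1.0                        # guard against constant columns
    inputs = (inputs - inputs.mean(axis=0)) / std

    # Shuffle inputs and targets with the same permutation.
    order = np.random.default_rng(seed).permutation(len(inputs))
    inputs, targets = inputs[order], targets[order]

    n_train = int(train_frac * len(inputs))
    n_val = int(val_frac * len(inputs))
    return {
        "train": (inputs[:n_train], targets[:n_train]),
        "validation": (inputs[n_train:n_train + n_val],
                       targets[n_train:n_train + n_val]),
        "test": (inputs[n_train + n_val:], targets[n_train + n_val:]),
    }
```

Each split could then be written with `np.savez(path, inputs=..., targets=...)`, which matches the `"inputs"` and `"targets"` keys read by the loading cell.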

### Import libraries

In [1]:
import numpy as np
import tensorflow as tf 
import os 

In [2]:
def file_path(filename):
    return os.path.join(os.path.expanduser("~"), "Desktop", filename)
# Load train data
npz = np.load(file_path("Audiobooks_data_train.npz"))
train_inputs = npz["inputs"].astype(np.float64)
train_targets = npz["targets"].astype(np.int64)
# Load validation data
npz = np.load(file_path("Audiobooks_data_validation.npz"))
validation_inputs = npz["inputs"].astype(np.float64)
validation_targets = npz["targets"].astype(np.int64)
# Load test data
npz = np.load(file_path("Audiobooks_data_test.npz"))
test_inputs = npz["inputs"].astype(np.float64)
test_targets = npz["targets"].astype(np.int64)

### Model

In [3]:
output_size = 2 # two classes: 0 (won't buy) and 1 (will buy)
input_size = 10 # number of input columns in the training data
hidden_layer_size = 50 # width of each hidden layer (a tunable hyperparameter)
# define what the model will look like
model = tf.keras.Sequential([
    # tf.keras.layers.Dense() basically implements: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation="relu"), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation="relu"), # 2nd hidden layer
    # the final layer is no different, we just make sure to activate it with softmax
    tf.keras.layers.Dense(output_size, activation="softmax")
])
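
To make the formula `output = activation(dot(input, weight) + bias)` concrete, here is a small NumPy sketch of the same three-layer forward pass. It is an illustration only: the weights are random rather than trained, and the helper names `relu`, `softmax` and `dense` are made up for this example and are not part of Keras.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    # Subtract the row-wise max for numerical stability.
    exps = np.exp(x - x.max(axis=1, keepdims=True))
    return exps / exps.sum(axis=1, keepdims=True)

def dense(inputs, weights, bias, activation):
    # output = activation(dot(input, weight) + bias)
    return activation(inputs @ weights + bias)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 10))                 # 4 customers, 10 inputs each

# Randomly initialised parameters, purely for illustration.
w1, b1 = rng.normal(size=(10, 50)), np.zeros(50)
w2, b2 = rng.normal(size=(50, 50)), np.zeros(50)
w3, b3 = rng.normal(size=(50, 2)), np.zeros(2)

hidden1 = dense(batch, w1, b1, relu)             # shape (4, 50)
hidden2 = dense(hidden1, w2, b2, relu)           # shape (4, 50)
probabilities = dense(hidden2, w3, b3, softmax)  # shape (4, 2), each row sums to 1
```

The softmax output gives one probability per class for each customer, which is why the final layer has `output_size = 2` units.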

### Choose the optimizer and the loss function

# we define the optimizer we'd like to use, 
# the loss function, 
# and the metrics we are interested in obtaining at each iteration
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
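
The "sparse" in `sparse_categorical_crossentropy` means the targets are integer class labels (0 or 1 here) rather than one-hot vectors. As a rough sketch of what the loss measures, the NumPy function below averages the negative log-probability the model assigns to the true class. The function and the example numbers are illustrative, and the clipping constant `eps` is an assumption, not the exact value Keras uses.

```python
import numpy as np

def sparse_categorical_crossentropy(targets, probabilities, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    probabilities = np.clip(probabilities, eps, 1.0)   # avoid log(0)
    true_class_probs = probabilities[np.arange(len(targets)), targets]
    return float(-np.log(true_class_probs).mean())

# Made-up predictions for three customers.
targets = np.array([0, 1, 1])
probabilities = np.array([[0.9, 0.1],
                          [0.2, 0.8],
                          [0.5, 0.5]])
loss = sparse_categorical_crossentropy(targets, probabilities)  # about 0.34
```

Confident correct predictions contribute little to the loss, while the undecided third row (0.5 for the true class) contributes the most.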

### Training
# This is where we train the model we have built.
batch_size = 100 
max_epochs = 100 
# NOTE: batch_size is defined here but not passed to fit() below, so Keras uses its default batch size of 32.
# Add batch_size=batch_size to the fit() call to actually use it.
# Set up early stopping to prevent overfitting
# let's set patience=2, to be a bit tolerant against random validation loss increases
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)
# fit the model
# note that this time the training and validation data are plain NumPy arrays rather than iterable datasets
model.fit(
    train_inputs,
    train_targets,
    epochs=max_epochs,
    callbacks=[early_stopping], # callbacks are passed as a list
    validation_data=(validation_inputs, validation_targets),
    verbose=2
)

Epoch 1/100
112/112 - 1s - loss: 0.4761 - accuracy: 0.7960 - val_loss: 0.3849 - val_accuracy: 0.8814
Epoch 2/100
112/112 - 0s - loss: 0.3157 - accuracy: 0.8854 - val_loss: 0.3390 - val_accuracy: 0.8792
Epoch 3/100
112/112 - 1s - loss: 0.2839 - accuracy: 0.8935 - val_loss: 0.3148 - val_accuracy: 0.8904
Epoch 4/100
112/112 - 0s - loss: 0.2657 - accuracy: 0.9016 - val_loss: 0.2966 - val_accuracy: 0.8881
Epoch 5/100
112/112 - 0s - loss: 0.2568 - accuracy: 0.9044 - val_loss: 0.2893 - val_accuracy: 0.8881
Epoch 6/100
112/112 - 0s - loss: 0.2496 - accuracy: 0.9061 - val_loss: 0.2799 - val_accuracy: 0.8904
Epoch 7/100
112/112 - 0s - loss: 0.2493 - accuracy: 0.9058 - val_loss: 0.2853 - val_accuracy: 0.8926
Epoch 8/100
112/112 - 0s - loss: 0.2415 - accuracy: 0.9106 - val_loss: 0.2904 - val_accuracy: 0.8770
Epoch 9/100
112/112 - 0s - loss: 0.2389 - accuracy: 0.9111 - val_loss: 0.2834 - val_accuracy: 0.8904


<tensorflow.python.keras.callbacks.History at 0x7faf8163edd8>
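
The patience rule can be sketched in plain Python. This is a simplified model of the idea, assuming the default behaviour of monitoring validation loss; the real `EarlyStopping` callback has further options such as `min_delta` and `restore_best_weights`. The function name `stopping_epoch` and the loss values are made up and are not taken from the run above.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0                          # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None                       # never ran out of patience

# Made-up validation losses for illustration.
example_losses = [0.50, 0.40, 0.35, 0.36, 0.34, 0.35, 0.36, 0.33]
stopped_at = stopping_epoch(example_losses, patience=2)
```

With `patience=2` the single increase at epoch 4 is tolerated because epoch 5 improves again; training stops only after two consecutive epochs without a new best value.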

### Test the model

After training and validation, we test the final predictive power of our model by running it on the test dataset, which the algorithm has NEVER seen before.

It is very important to realize that fiddling with the hyperparameters overfits the validation dataset.

The test is the absolutely final instance. We should not test before we are completely done adjusting our model.

If we keep adjusting the model after testing, we will start overfitting the test dataset, which defeats its purpose.

In [5]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)




In [None]:
print(f"Test loss: {test_loss:.2f}, Test accuracy: {test_accuracy*100:.2f}%")
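
The accuracy reported by `model.evaluate` can be reproduced by hand from the softmax outputs: take the most probable class for each customer and compare it with the target. The sketch below uses made-up probabilities so that it runs without the trained model; the function name `accuracy_from_probabilities` is hypothetical.

```python
import numpy as np

def accuracy_from_probabilities(probabilities, targets):
    """Fraction of rows whose most probable class equals the target."""
    predicted = np.argmax(probabilities, axis=1)
    return float((predicted == targets).mean())

# Made-up softmax outputs for four customers.
example_probabilities = np.array([[0.9, 0.1],
                                  [0.3, 0.7],
                                  [0.6, 0.4],
                                  [0.2, 0.8]])
example_targets = np.array([0, 1, 1, 1])
example_accuracy = accuracy_from_probabilities(example_probabilities, example_targets)
```

With the trained model, `model.predict(test_inputs)` returns an array of the same form, one row of two class probabilities per customer, which can be passed in together with `test_targets`.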