# **Audiobook Purchase Prediction**
## **Project description**

The data comes from an audiobook app. Every customer in the database has made at least one purchase, which is why they appear in it. We want to build a model, based on the available data, that predicts whether a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend money advertising to them. If we focus our efforts only on customers who are likely to convert again, we can make significant savings. Moreover, the model can help identify the metrics that matter most for a customer to come back. Identifying new customers creates value and growth opportunities.

The data is summarized in a .csv file.

### **Variables:**

*   Customer ID
*   Book length in mins_avg (average of all purchases)
*   Book length in minutes_sum (sum of all purchases; if equal to Book length in mins_avg, the customer made only one purchase)
*   Price Paid_avg (average of all purchases)
*   Price paid_sum (sum of all purchases; if equal to Price Paid_avg, the customer made only one purchase)
*   Review (Boolean variable. 1 = Customer left a review)
*   Review (out of 10)
*   Total minutes listened
*   Completion (from 0 to 1) => Total minutes listened / Total lengths of books the person has purchased
*   Support requests (number)
*   Last visited minus purchase date (in days) => measures the difference between the last time a person interacted with the platform and the first purchase date. The bigger the difference, the bigger the engagement.

These variables are the inputs of our model (excluding Customer ID, which is completely arbitrary).
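
The two derived quantities described in the list, Completion and the "one purchase only" rule, can be sketched as follows. This is a minimal illustration, not the preprocessing actually used for this dataset: the field names and values are hypothetical, and clipping Completion to 1 is an assumption made here to match the stated 0 to 1 range.

```python
# Hypothetical record for one customer; field names and values are illustrative only.
customer = {
    "book_length_avg": 1620.0,   # minutes, average over all purchases
    "book_length_sum": 1620.0,   # minutes, sum over all purchases
    "minutes_listened": 810.0,   # total minutes listened
}


def completion(minutes_listened, book_length_sum):
    """Fraction of purchased content listened to (assumed clipped to [0, 1])."""
    if book_length_sum <= 0:
        return 0.0
    return min(minutes_listened / book_length_sum, 1.0)


def made_single_purchase(book_length_avg, book_length_sum):
    """When the sum equals the average, the customer bought exactly one book."""
    return book_length_sum == book_length_avg


completion_value = completion(customer["minutes_listened"], customer["book_length_sum"])
single = made_single_purchase(customer["book_length_avg"], customer["book_length_sum"])
print(completion_value, single)
```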

### **Targets:**

The targets are a Boolean variable (0 or 1). We use a 2-year period for the inputs and the following 6 months for the targets.<br />
In effect, we predict whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. If they do not convert within 6 months, chances are they have gone to a competitor or did not like the audiobook way of digesting information.

**This is a supervised classification problem with two classes: won't buy and will buy, represented by 0s and 1s.**

## Create the Deep Learning algorithm

In [None]:
# import relevant packages
import numpy as np
import tensorflow as tf

In [None]:
# import the preprocessed data
# (np.float and np.int were removed in NumPy 1.24; use explicit dtypes instead)
npz = np.load("audiobooks-data-train.npz")
train_inputs, train_targets = npz["inputs"].astype(np.float64), npz["targets"].astype(np.int64)

npz = np.load("audiobooks-data-validation.npz")
validation_inputs, validation_targets = npz["inputs"].astype(np.float64), npz["targets"].astype(np.int64)

npz = np.load("audiobooks-data-test.npz")
test_inputs, test_targets = npz["inputs"].astype(np.float64), npz["targets"].astype(np.int64)
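
Before training, it can be worth checking how the two classes are distributed in the loaded targets, since accuracy is only easy to interpret on a roughly balanced dataset. The helper below is a small sketch; `class_balance` and the example array are hypothetical, and in the notebook the function would be called on `train_targets`.

```python
import numpy as np


def class_balance(targets):
    """Return the fraction of samples in class 0 and class 1."""
    counts = np.bincount(np.asarray(targets, dtype=np.int64), minlength=2)
    return counts / counts.sum()


# Illustrative targets only; replace with train_targets in the notebook.
example_targets = np.array([0, 1, 1, 0, 1, 0])
balance = class_balance(example_targets)
print(balance)
```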

In [None]:
# define layer sizes
input_size = 10
output_size = 2
hidden_layer_size = 150

model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layer_size, activation="relu"),
    tf.keras.layers.Dense(hidden_layer_size, activation="relu"),
    tf.keras.layers.Dense(output_size, activation="softmax"),
])

In [None]:
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
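
The `sparse_categorical_crossentropy` loss takes integer class labels (0 or 1) rather than one-hot vectors, which is why the targets are loaded as integers while the output layer has two softmax units. The NumPy sketch below shows the idea on hand-picked probabilities; it is a simplified illustration, not the Keras implementation (which, for example, also clips probabilities for numerical stability).

```python
import numpy as np


def sparse_categorical_crossentropy(probs, targets):
    """Mean negative log-probability assigned to the true class of each sample."""
    probs = np.asarray(probs, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.int64)
    # Pick, for every row, the predicted probability of its true class.
    picked = probs[np.arange(len(targets)), targets]
    return float(np.mean(-np.log(picked)))


# Hypothetical softmax outputs for two customers and their true labels.
demo_probs = [[0.9, 0.1], [0.2, 0.8]]
demo_labels = [0, 1]
loss = sparse_categorical_crossentropy(demo_probs, demo_labels)
print(round(loss, 4))
```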

In [None]:
print(f"Model trained on {train_inputs.shape[0]} sample, validated on {validation_inputs.shape[0]} sample.")
batch_size = 100
max_epochs = 100
# set an early stopping callback to limit overfitting: training stops once the
# validation loss (the default monitored quantity) has not improved for 2 epochs
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

model.fit(train_inputs, train_targets, batch_size=batch_size, epochs=max_epochs,
          callbacks=[early_stopping],
          validation_data=(validation_inputs, validation_targets),
          verbose=2)

Model trained on 3579 sample, validated on 447 sample.
Epoch 1/100
36/36 - 1s - loss: 0.5062 - accuracy: 0.7402 - val_loss: 0.3891 - val_accuracy: 0.8233
Epoch 2/100
36/36 - 0s - loss: 0.4081 - accuracy: 0.7843 - val_loss: 0.3467 - val_accuracy: 0.8479
Epoch 3/100
36/36 - 0s - loss: 0.3874 - accuracy: 0.7927 - val_loss: 0.3465 - val_accuracy: 0.8434
Epoch 4/100
36/36 - 0s - loss: 0.3947 - accuracy: 0.7918 - val_loss: 0.3242 - val_accuracy: 0.8523
Epoch 5/100
36/36 - 0s - loss: 0.3749 - accuracy: 0.8055 - val_loss: 0.3379 - val_accuracy: 0.8300
Epoch 6/100
36/36 - 0s - loss: 0.3720 - accuracy: 0.8013 - val_loss: 0.3183 - val_accuracy: 0.8479
Epoch 7/100
36/36 - 0s - loss: 0.3665 - accuracy: 0.8050 - val_loss: 0.3115 - val_accuracy: 0.8479
Epoch 8/100
36/36 - 0s - loss: 0.3609 - accuracy: 0.8022 - val_loss: 0.3196 - val_accuracy: 0.8434
Epoch 9/100
36/36 - 0s - loss: 0.3628 - accuracy: 0.8047 - val_loss: 0.3061 - val_accuracy: 0.8546
Epoch 10/100
36/36 - 0s - loss: 0.3604 - accuracy: 0.8

<tensorflow.python.keras.callbacks.History at 0x7f5e198a2dd0>
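
The `patience=2` rule used by the early stopping callback can be sketched in plain Python. This is a simplified model of the behaviour (stop after two epochs without a new best validation loss), not the Keras source, and `early_stopping_epoch` is a hypothetical helper. Applied to the nine validation losses that are fully visible in the log above, the rule does not trigger.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # epochs elapsed since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Validation losses of epochs 1 to 9, copied from the training log.
logged_val_losses = [0.3891, 0.3467, 0.3465, 0.3242, 0.3379,
                     0.3183, 0.3115, 0.3196, 0.3061]
print(early_stopping_epoch(logged_val_losses))  # None: no stop within these epochs

# Hypothetical sequence in which the loss stalls for two epochs in a row.
stalled_val_losses = [0.50, 0.40, 0.45, 0.41, 0.39]
print(early_stopping_epoch(stalled_val_losses))  # 4
```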

On the validation data, the model reaches an accuracy of about 85%, which is a good result on a balanced dataset. We should also try a simpler machine learning model, such as logistic regression, to see whether we can reach better results.
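
As a rough idea of what such a baseline could look like, here is a minimal logistic regression trained with batch gradient descent in NumPy. It is only a sketch under stated assumptions: the data is synthetic (a stand-in with 10 features, like the model inputs), the hyperparameters are arbitrary, and all function names are hypothetical. In the notebook, `train_inputs`, `train_targets`, `validation_inputs` and `validation_targets` would take the place of the synthetic arrays.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic_regression(inputs, targets, learning_rate=0.1, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n_samples, n_features = inputs.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(epochs):
        probs = sigmoid(inputs @ weights + bias)
        error = probs - targets  # gradient of the log loss w.r.t. the logits
        weights -= learning_rate * (inputs.T @ error) / n_samples
        bias -= learning_rate * error.mean()
    return weights, bias


def accuracy(inputs, targets, weights, bias):
    """Share of samples whose thresholded prediction matches the target."""
    preds = (sigmoid(inputs @ weights + bias) >= 0.5).astype(np.int64)
    return float((preds == targets).mean())


# Synthetic, linearly separable stand-in data (NOT the audiobook dataset).
rng = np.random.default_rng(0)
demo_inputs = rng.normal(size=(400, 10))
true_weights = rng.normal(size=10)
demo_targets = (demo_inputs @ true_weights > 0).astype(np.int64)

weights, bias = fit_logistic_regression(demo_inputs[:300], demo_targets[:300])
holdout_accuracy = accuracy(demo_inputs[300:], demo_targets[300:], weights, bias)
print(f"Hold-out accuracy on synthetic data: {holdout_accuracy:.2f}")
```

Comparing the validation accuracy of such a baseline with the neural network's would show whether the two hidden layers add any predictive value.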

### Test the model

In [None]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)
print("Test loss: {0:.2f}. Test accuracy: {1:.2f}%".format(test_loss, test_accuracy * 100.))

Test loss: 0.42. Test accuracy: 75.45%
