# Practical example: Audiobooks

## Preprocess the data. Balance the dataset. Create 3 datasets: training, validation, and test. Save the newly created sets in a tensor-friendly format (e.g. *.npz)

Since we are dealing with real-life data, we need to preprocess it a bit. The relevant code follows. It is not that hard, but it has more to do with data engineering than with machine learning.

If you want to know how it works, go through the code and the comments. In any case, the same steps should do the trick for any dataset organized in this way: many input columns, followed by 1 column containing the targets (the usual layout of supervised learning datasets).

Note that we have removed the header row, which contains the names of the categories. We simply want the data.

### Extract the data from the csv

In [1]:
import numpy as np
from sklearn import preprocessing

raw_csv_data = np.loadtxt('The Data Science Course 2018/Part_7_Deep_Learning/S55_L392/Audiobooks-data.csv', delimiter=',')

unscaled_inputs_all = raw_csv_data[:,1:-1]
targets_all = raw_csv_data[:,-1]
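
To make the slicing concrete, here is a minimal sketch on a tiny in-memory CSV. The column names and values are invented for illustration and are not taken from the Audiobooks file. It also shows that, if a header row were still present, `np.loadtxt` could skip it through its `skiprows` parameter.

```python
import io
import numpy as np

# Hypothetical 2-row CSV: ID, two inputs, target (values are made up)
toy_csv_text = "ID,length,review,target\n101,5.0,1,0\n102,7.5,0,1\n"

# skiprows=1 skips the header row; our real file already has it removed
toy_data = np.loadtxt(io.StringIO(toy_csv_text), delimiter=',', skiprows=1)

# Drop the first column (ID) and the last column (target) to get the inputs
toy_inputs = toy_data[:, 1:-1]
# The last column holds the targets
toy_targets = toy_data[:, -1]

print(toy_inputs.shape, toy_targets)
```

The same `[:, 1:-1]` and `[:, -1]` slices are what the cell above applies to the full dataset.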

### Balance the dataset

In [2]:
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)
            
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
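
The loop above keeps every 1 and only as many 0s as there are 1s, dropping the surplus 0s in row order. As a rough sketch, the same selection can be written without an explicit loop. The helper name `balance_indices` and the toy targets are assumptions made for this illustration.

```python
import numpy as np

def balance_indices(targets):
    """Return the row indices to keep so that 0s do not outnumber 1s."""
    targets = np.asarray(targets)
    one_positions = np.flatnonzero(targets == 1)
    zero_positions = np.flatnonzero(targets == 0)
    # Keep all 1s and only the first len(one_positions) 0s, in original row order
    keep = np.concatenate([one_positions, zero_positions[:len(one_positions)]])
    return np.sort(keep)

# Toy targets (made up): two 1s and five 0s
toy_targets = np.array([0, 0, 1, 0, 1, 0, 0])
kept = balance_indices(toy_targets)
balanced_targets = toy_targets[kept]

print(kept, balanced_targets)
```

Indexing the inputs with the same `kept` array keeps inputs and targets aligned, just as the two `np.delete` calls do.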

### Standardize the inputs

In [3]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
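
With its default arguments, `preprocessing.scale` centers each column to zero mean and scales it to unit variance. The sketch below reproduces that arithmetic in plain NumPy on made-up numbers. The helper `standardize` is hypothetical and only meant to show what happens to each column.

```python
import numpy as np

def standardize(x):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    x = np.asarray(x, dtype=np.float64)
    mean = x.mean(axis=0)
    std = x.std(axis=0)   # population standard deviation (ddof=0)
    std[std == 0] = 1.0   # leave constant columns unscaled to avoid dividing by 0
    return (x - mean) / std

# Toy inputs (made up): two columns on very different scales
toy_inputs = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_scaled = standardize(toy_inputs)

print(toy_scaled.mean(axis=0), toy_scaled.std(axis=0))
```

After standardization, both columns live on the same scale, so neither dominates the other purely because of its units.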

### Shuffle the data

In [4]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
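
The shuffle above gives a different order on every run. If you want a repeatable split, one option is a seeded generator. This is a sketch on toy arrays, and the seed value 42 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for repeatability

# Toy data (made up): row i of the inputs is [2*i, 2*i + 1]
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.array([0, 1, 0, 1, 1])

# One permutation of the row indices, applied to both arrays,
# so every input row stays paired with its own target
order = rng.permutation(toy_inputs.shape[0])
shuffled_toy_inputs = toy_inputs[order]
shuffled_toy_targets = toy_targets[order]

print(order)
```

The key point is the same as in the cell above: shuffle the indices once and index both arrays with them, rather than shuffling inputs and targets separately.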

### Split the dataset into train, validation, and test

In [5]:
samples_count = shuffled_inputs.shape[0]

train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1798.0 3579 0.502374965074043
219.0 447 0.4899328859060403
220.0 448 0.49107142857142855
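
The split sizes follow directly from the 80/10/10 arithmetic: the training and validation counts are truncated to integers, and the test set receives whatever is left. A small sketch of that calculation, with a hypothetical helper `split_counts`, applied to the total implied by the output above (3579 + 447 + 448 samples):

```python
def split_counts(samples_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train = int(train_fraction * samples_count)
    validation = int(validation_fraction * samples_count)
    test = samples_count - train - validation
    return train, validation, test

total_samples = 3579 + 447 + 448
counts = split_counts(total_samples)

print(counts)  # (3579, 447, 448)
```

Because the test count is computed by subtraction, the three sets always add up to the full dataset, even when the fractions do not divide it evenly.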


### Save the three datasets in *.npz

In [6]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
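
Note that `np.savez` appends the `.npz` extension to the file name, and each keyword argument becomes a named array inside the archive. The round trip can be sketched on toy arrays as follows. The file name `toy_data` and the temporary directory are assumptions for this example only.

```python
import os
import tempfile
import numpy as np

# Toy arrays (made up) standing in for one of the three datasets
toy_inputs = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
toy_targets = np.array([0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data')
    # Writes toy_data.npz; the keyword names become the keys of the archive
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # Load it back and read the arrays by the same keys
    with np.load(path + '.npz') as npz:
        stored_keys = sorted(npz.files)
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(stored_keys)
```

This is why the later cells can read the data back with `npz['inputs']` and `npz['targets']`.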

## Problem

You are given data from an Audiobook app. Logically, it relates only to the audio versions of books. Each customer in the database has made a purchase at least once; that's why he/she is in the database. We want to create a machine learning algorithm based on our available data that can predict if a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to him/her. If we can focus our efforts ONLY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

These are the inputs, excluding Customer ID, which is completely arbitrary: it is more like a name than a number.

The targets are a Boolean variable (so 0 or 1). We are taking a period of 2 years in our inputs, and the next 6 months are targets. So, in fact, we are predicting whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. 6 months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

The task is simple: create a machine learning algorithm that is able to predict if a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

## Create the machine learning algorithm

### Modules

In [7]:
import tensorflow as tf

### Data

In [8]:
# np.float and np.int were removed in NumPy 1.24, so we use explicit dtypes
npz = np.load('Audiobooks_data_train.npz')

train_inputs = npz['inputs'].astype(np.float64)
train_targets = npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_validation.npz')

validation_inputs = npz['inputs'].astype(np.float64)
validation_targets = npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_test.npz')

test_inputs = npz['inputs'].astype(np.float64)
test_targets = npz['targets'].astype(np.int64)

### Model

Outline, optimizers, loss, early stopping, and training

In [9]:
input_size = 10
output_size = 2
hidden_layer_size = 50

model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),  # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'),  # 2nd hidden layer
    tf.keras.layers.Dense(output_size, activation='softmax')  # one probability per class
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

BATCH_SIZE = 100
MAX_EPOCHS = 100

early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

model.fit(train_inputs,
         train_targets,
         batch_size=BATCH_SIZE,
         epochs=MAX_EPOCHS,
         callbacks=[early_stopping],
         validation_data=(validation_inputs, validation_targets),
         verbose=2)

Train on 3579 samples, validate on 447 samples
Epoch 1/100
3579/3579 - 1s - loss: 0.5920 - accuracy: 0.6734 - val_loss: 0.5152 - val_accuracy: 0.7673
Epoch 2/100
3579/3579 - 0s - loss: 0.4643 - accuracy: 0.7639 - val_loss: 0.4409 - val_accuracy: 0.7629
Epoch 3/100
3579/3579 - 0s - loss: 0.4112 - accuracy: 0.7890 - val_loss: 0.4091 - val_accuracy: 0.8009
Epoch 4/100
3579/3579 - 0s - loss: 0.3852 - accuracy: 0.7994 - val_loss: 0.3834 - val_accuracy: 0.7942
Epoch 5/100
3579/3579 - 0s - loss: 0.3702 - accuracy: 0.8019 - val_loss: 0.3850 - val_accuracy: 0.8098
Epoch 6/100
3579/3579 - 0s - loss: 0.3617 - accuracy: 0.8092 - val_loss: 0.3713 - val_accuracy: 0.8054
Epoch 7/100
3579/3579 - 0s - loss: 0.3561 - accuracy: 0.8030 - val_loss: 0.3671 - val_accuracy: 0.7987
Epoch 8/100
3579/3579 - 0s - loss: 0.3527 - accuracy: 0.8047 - val_loss: 0.3579 - val_accuracy: 0.8121
Epoch 9/100
3579/3579 - 0s - loss: 0.3457 - accuracy: 0.8111 - val_loss: 0.3573 - val_accuracy: 0.7852
Epoch 10/100
3579/3579 - 0

<tensorflow.python.keras.callbacks.History at 0x2db58a472e8>
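
The `patience=2` argument means training stops once the validation loss has failed to improve for 2 epochs in a row. The plain-Python sketch below is a simplified model of that rule, not the Keras implementation. The function name `epoch_of_stop` and the toy loss values are invented. The second call uses the nine `val_loss` values visible in the log above.

```python
def epoch_of_stop(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float('inf')
    wait = 0  # consecutive epochs without a new best validation loss
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Toy validation losses (made up): epochs 5 and 6 both fail to beat epoch 4
toy_val_losses = [0.50, 0.45, 0.46, 0.44, 0.45, 0.47, 0.43]
toy_stop = epoch_of_stop(toy_val_losses, patience=2)

# val_loss values from epochs 1 to 9 of the log above
logged_val_losses = [0.5152, 0.4409, 0.4091, 0.3834, 0.3850, 0.3713, 0.3671, 0.3579, 0.3573]
logged_stop = epoch_of_stop(logged_val_losses, patience=2)

print(toy_stop, logged_stop)
```

In the logged run, the single increase at epoch 5 is followed by an improvement at epoch 6, so the counter resets and no stop is triggered within the nine epochs shown.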

### Test the model

In [11]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets, verbose=2)

448/1 - 0s - loss: 0.3235 - accuracy: 0.8304


In [12]:
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))


Test loss: 0.34. Test accuracy: 83.04%
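
To see what the two reported numbers measure, here is a NumPy sketch that computes accuracy and sparse categorical cross-entropy from softmax outputs. The probabilities and targets are made up, and `accuracy_and_loss` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def accuracy_and_loss(probabilities, targets):
    """Compute accuracy and mean sparse categorical cross-entropy."""
    probabilities = np.asarray(probabilities, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.int64)
    # The predicted class is the one with the highest probability
    predicted = np.argmax(probabilities, axis=1)
    accuracy = np.mean(predicted == targets)
    # Loss: average of -log(probability assigned to the true class)
    true_class_probabilities = probabilities[np.arange(len(targets)), targets]
    loss = -np.mean(np.log(true_class_probabilities))
    return accuracy, loss

# Toy softmax outputs (made up) for 4 customers: [P(won't buy), P(will buy)]
toy_probabilities = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
toy_targets = np.array([0, 1, 1, 1])

toy_accuracy, toy_loss = accuracy_and_loss(toy_probabilities, toy_targets)
print('Toy loss: {0:.2f}. Toy accuracy: {1:.2f}%'.format(toy_loss, toy_accuracy * 100.))
```

Accuracy only counts whether the most probable class is correct, while the loss also penalizes low confidence in the true class, which is why the two numbers can move somewhat independently.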
