## Audiobook classification

## Problem

You are given data from an audiobook app, so it relates only to the audio versions of books. Every customer in the database has made at least one purchase, which is why they are in the database. We want to build a machine learning model on the available data that predicts whether a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend money on advertising to them. If we focus our efforts ONLY on customers who are likely to convert again, we can make great savings. Moreover, the model can help identify the metrics that matter most for a customer to come back. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days).

These are the inputs, excluding the customer ID, which is completely arbitrary: it is more like a name than a number.

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. In other words, we predict whether a customer will convert in the next 6 months based on the previous 2 years of activity and engagement. 6 months sounds like a reasonable window: if customers don't convert within 6 months, chances are they have gone to a competitor or didn't like the audiobook way of digesting information.

The task is simple: create a machine learning algorithm that can predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 
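
As a minimal sketch of the layout described above, the following labels the twelve columns and separates the ten inputs from the target. The column names are hypothetical labels derived from the variable list (the file itself is loaded without a header), and the two rows are taken from the data preview shown later in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical column labels, derived from the variable list above.
COLUMNS = [
    "customer_id", "book_length_avg", "book_length_sum", "price_avg", "price_sum",
    "review_given", "review_score", "minutes_listened", "completion",
    "support_requests", "last_visit_minus_purchase", "target",
]

# Two sample rows in the same 12-column layout as the CSV.
rows = np.array([
    [873.0, 2160.0, 2160.0, 10.13, 10.13, 0.0, 8.91, 0.0, 0.0, 0.0, 0.0, 1.0],
    [611.0, 1404.0, 2808.0, 6.66, 13.33, 1.0, 6.5, 0.0, 0.0, 0.0, 182.0, 1.0],
])

df = pd.DataFrame(rows, columns=COLUMNS)

# Drop the arbitrary ID and the target to obtain the 10 input features.
inputs = df.drop(columns=["customer_id", "target"])
targets = df["target"].astype(int)

print(inputs.shape, targets.tolist())
```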

## Preprocessing

In [None]:
# Let's preprocess the data
from sklearn import preprocessing
import numpy as np
import pandas as pd
import tensorflow as tf

raw_data = np.loadtxt('Audiobooks_data.csv', delimiter=',')

# To extract inputs and targets
inputs_all = raw_data[:,1:-1]
target_all = raw_data[:,-1]


In [None]:
target_all.shape[0]

14084

### Visualize the number of 1s and 0s in the dataset

In [None]:
data_panda = pd.DataFrame(raw_data)
data_panda.head()


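|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|
| 0 | 873.0 | 2160.0 | 2160.0 | 10.13 | 10.13 | 0.0 | 8.91 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 611.0 | 1404.0 | 2808.0 | 6.66 | 13.33 | 1.0 | 6.5 | 0.0 | 0.0 | 0.0 | 182.0 | 1.0 |
| 2 | 705.0 | 324.0 | 324.0 | 10.13 | 10.13 | 1.0 | 9.0 | 0.0 | 0.0 | 1.0 | 334.0 | 1.0 |
| 3 | 391.0 | 1620.0 | 1620.0 | 15.31 | 15.31 | 0.0 | 9.0 | 0.0 | 0.0 | 0.0 | 183.0 | 1.0 |
| 4 | 819.0 | 432.0 | 1296.0 | 7.11 | 21.33 | 1.0 | 9.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |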


In [None]:
one = np.sum(raw_data[:,-1])
print('number of one are {} and the number of zeros are {}'.format(one, len(data_panda)-one))

number of one are 2237.0 and the number of zeros are 11847.0


#### The data is imbalanced: about 84% of the targets are 0. So, we need to balance the data.
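
To illustrate why the imbalance matters, here is a small sketch that counts the classes with `np.bincount` and computes the accuracy a trivial "always predict 0" classifier would reach. The target array is synthetic, built only from the two counts printed above.

```python
import numpy as np

# Synthetic targets reproducing the class counts printed above.
targets = np.concatenate([np.ones(2237), np.zeros(11847)]).astype(int)

# counts[0] is the number of zeros, counts[1] the number of ones.
counts = np.bincount(targets)
total = counts.sum()

# Accuracy of a classifier that always predicts the majority class (0).
majority_baseline = counts.max() / total

print(counts, total, round(majority_baseline, 4))
```

A model trained on the unbalanced data could score about 84% simply by always predicting 0, which is why the classes are balanced before training.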

## Balance the data

In [None]:
# We need to balance the data
num_one_target = int(np.sum(target_all))
target_zero_counter = 0
indice_to_remove = []

for i in range(target_all.shape[0]):
  if target_all[i] == 0:
    target_zero_counter +=1
    if target_zero_counter > num_one_target:
      indice_to_remove.append(i)

inputs_all = np.delete(inputs_all, indice_to_remove, axis=0)
target_all = np.delete(target_all, indice_to_remove, axis=0)
target_all.shape


(4474,)
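
The loop above keeps every row with target 1 and only the first zeros, as many as there are ones. A vectorized sketch of the same idea follows; the helper name `balance_by_truncation` is hypothetical and the arrays are toy data.

```python
import numpy as np

def balance_by_truncation(inputs, targets):
    """Keep all rows with target 1 and the first equally many rows with target 0."""
    targets = np.asarray(targets)
    one_idx = np.flatnonzero(targets == 1)
    # Keep only as many zeros as there are ones, in their original order.
    zero_idx = np.flatnonzero(targets == 0)[: one_idx.size]
    keep = np.sort(np.concatenate([one_idx, zero_idx]))
    return inputs[keep], targets[keep]

# Toy data: 6 rows, 2 of which have target 1.
toy_inputs = np.arange(12).reshape(6, 2)
toy_targets = np.array([0, 1, 0, 0, 1, 0])

balanced_inputs, balanced_targets = balance_by_truncation(toy_inputs, toy_targets)
print(balanced_targets)
```

Because the zeros are dropped by position rather than at random, the rows that are kept depend on the order of the file. Sampling the zeros at random would be an alternative design.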

In [None]:
# Let's standardize the data
scale_inputs = preprocessing.scale(inputs_all)
scale_inputs.shape

(4474, 10)
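
As a rough illustration of what standardization does, the sketch below subtracts the column mean and divides by the column standard deviation using NumPy only. It is meant to mirror the default behaviour of `preprocessing.scale` and assumes that no column is constant. The sample values are a few rows and two columns from the data preview.

```python
import numpy as np

def standardize(x):
    """Scale each column to zero mean and unit (population) standard deviation."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)  # ddof=0; assumes no column has zero variance
    return (x - mean) / std

# Book length (avg) and price (avg) for four customers from the preview.
sample = np.array([
    [2160.0, 10.13],
    [1404.0, 6.66],
    [324.0, 10.13],
    [1620.0, 15.31],
])

scaled = standardize(sample)
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

One design consideration: scaling before splitting means the validation and test rows contribute to the mean and standard deviation. Computing these statistics on the training rows only is a common alternative.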

## Shuffle the data

##### The data was collected in date order. Shuffle the indices so that the data is not arranged in any particular order when we feed it to the model.

In [None]:
# Let's shuffle the data

shuffle_indice = np.arange(scale_inputs.shape[0])
np.random.shuffle(shuffle_indice)

# let's use the shuffle_indice to shuffle the inputs and targets
shuffle_inputs = scale_inputs[shuffle_indice]
shuffle_targets = target_all[shuffle_indice]
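
If a reproducible shuffle is wanted, one option is a seeded generator. The sketch below uses `np.random.default_rng` with an arbitrary seed (an assumption, not part of the original notebook) and checks that inputs and targets stay paired after shuffling.

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed value is arbitrary

# Toy data: each target is derived from its row so the pairing can be verified.
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = toy_inputs[:, 0] // 2

# One permutation of the row indices, applied to both arrays.
perm = rng.permutation(toy_inputs.shape[0])
shuffled_inputs = toy_inputs[perm]
shuffled_targets = toy_targets[perm]

# Every shuffled target still belongs to its shuffled input row.
print(np.array_equal(shuffled_targets, shuffled_inputs[:, 0] // 2))
```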

In [None]:
# Let's split the data into train, validation, test
samples_count = shuffle_inputs.shape[0]

# Count the samples in each subset, assuming we want 80-10-10 distribution of training, validation, and test.
# Naturally, the numbers are integers.
train_sample_count = int(0.8*samples_count)
validation_sample_count = int(0.1*samples_count)

# Create variables that record the inputs and targets for training
# In our shuffled dataset, they are the first "train_sample_count" observations
train_inputs = shuffle_inputs[:train_sample_count]
target_inputs = shuffle_targets[:train_sample_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_sample_count" observations, following the "train_sample_count" we already assigned
validation_inputs = shuffle_inputs[train_sample_count:train_sample_count+validation_sample_count]
validation_target = shuffle_targets[train_sample_count:train_sample_count+validation_sample_count]

# The 'test' dataset contains all remaining data.
test_sample_count = samples_count - train_sample_count - validation_sample_count

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = shuffle_inputs[train_sample_count+validation_sample_count:]
test_targets = shuffle_targets[train_sample_count+validation_sample_count:]


# We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test were 
# taken from a shuffled dataset. Check if they are balanced, too. 

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(target_inputs), train_sample_count, np.sum(target_inputs)/train_sample_count)
print(np.sum(validation_target), validation_sample_count, np.sum(validation_target)/validation_sample_count)
print(np.sum(test_targets), test_sample_count, np.sum(test_targets)/test_sample_count)

1821.0 3579 0.5088013411567477
193.0 447 0.4317673378076063
223.0 448 0.49776785714285715


In [None]:
3579+447+448

4474
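
The split arithmetic above can be sketched as a small helper. The function name `split_counts` is hypothetical. It truncates the training and validation sizes to integers and assigns the remainder to the test set, which reproduces the three counts printed above.

```python
import numpy as np

def split_counts(n_samples, train_frac=0.8, val_frac=0.1):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    n_train = int(train_frac * n_samples)
    n_val = int(val_frac * n_samples)
    return n_train, n_val, n_samples - n_train - n_val

n_train, n_val, n_test = split_counts(4474)
print(n_train, n_val, n_test)

# Slicing a toy array with the same logic gives three non-overlapping parts.
toy = np.arange(20)
t, v, _ = split_counts(len(toy))
train_part, val_part, test_part = toy[:t], toy[t:t + v], toy[t + v:]
print(len(train_part), len(val_part), len(test_part))
```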

## Save the three datasets

In [None]:
# Let's save the three datasets in *.npz.
np.savez('Audiobooks_data_train', inputs=train_inputs, targets = target_inputs)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets = validation_target)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets = test_targets)
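
As a quick check of the `.npz` format, here is a sketch of a save and load round trip in a temporary directory. The file name `example_split` and the toy arrays are made up for illustration. Note that `np.savez` appends the `.npz` extension when it is missing.

```python
import os
import tempfile

import numpy as np

# Toy arrays standing in for one of the three splits.
toy_inputs = np.random.default_rng(0).normal(size=(4, 10))
toy_targets = np.array([0, 1, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_split")  # hypothetical file name
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # The arrays are stored under the keyword names used in np.savez.
    with np.load(path + ".npz") as npz:
        loaded_inputs = npz["inputs"].astype(np.float64)
        loaded_targets = npz["targets"].astype(np.int64)

print(loaded_inputs.shape, loaded_targets.tolist())
```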

## Create the Machine learning algorithm

Load the data

In [None]:
# Let's create a temporary variable npz
npz = np.load('Audiobooks_data_train.npz')
# inputs must be float (np.float and np.int were removed in NumPy 1.24, so use explicit dtypes)
train_inputs = npz['inputs'].astype(np.float64)
# targets must be integer
target_inputs = npz['targets'].astype(np.int64)

# Do the same thing for validation and test set
npz = np.load('Audiobooks_data_validation.npz')
validation_inputs, validation_target = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

npz = np.load('Audiobooks_data_test.npz')
test_inputs, test_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)



### Let's build our Model here
Outline, optimizers, loss, early stopping and training

In [None]:
input_size = 10
output_size = 2

hidden_layer_size = 50

model_audio = tf.keras.Sequential([
                  tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
                  tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
                  #tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'), 

                  tf.keras.layers.Dense(output_size, activation = 'softmax'),
])

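batch_size = 100
num_epochs = 100
# Stop training once val_loss has not improved for 2 consecutive epochs
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)

model_audio.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics=['accuracy'])
#print(validation_inputs.shape)
#print(validation_target.shape)
# The keyword must be spelled batch_size. The log below shows 112 steps per epoch, which is
# consistent with the Keras default batch size of 32 (ceil(3579 / 32) = 112) rather than 100.
model_audio.fit(
    train_inputs,
    target_inputs,
    batch_size = batch_size,
    epochs = num_epochs,
    validation_data = (validation_inputs, validation_target),
    callbacks = [early_stopping],
    verbose = 2
)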

Epoch 1/100
112/112 - 0s - loss: 0.4299 - accuracy: 0.8326 - val_loss: 0.3099 - val_accuracy: 0.8747
Epoch 2/100
112/112 - 0s - loss: 0.2972 - accuracy: 0.8908 - val_loss: 0.2897 - val_accuracy: 0.8792
Epoch 3/100
112/112 - 0s - loss: 0.2714 - accuracy: 0.8972 - val_loss: 0.2835 - val_accuracy: 0.8792
Epoch 4/100
112/112 - 0s - loss: 0.2633 - accuracy: 0.9019 - val_loss: 0.2619 - val_accuracy: 0.8837
Epoch 5/100
112/112 - 0s - loss: 0.2552 - accuracy: 0.9044 - val_loss: 0.2692 - val_accuracy: 0.8837
Epoch 6/100
112/112 - 0s - loss: 0.2491 - accuracy: 0.9078 - val_loss: 0.2853 - val_accuracy: 0.8837


<tensorflow.python.keras.callbacks.History at 0x7f6cd287ba90>
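
To see why training stopped after epoch 6, the sketch below replays the validation losses from the log above through a simplified patience rule. It is a plain-Python approximation of the early stopping logic (monitoring `val_loss`, no minimum delta), not the Keras implementation itself, and the function name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 1-based, for a simple patience rule."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Validation losses copied from the training log above.
logged_val_losses = [0.3099, 0.2897, 0.2835, 0.2619, 0.2692, 0.2853]

stop_epoch, best_epoch = early_stopping_epoch(logged_val_losses, patience=2)
print(stop_epoch, best_epoch)
```

The lowest validation loss occurs at epoch 4, and the two following epochs bring no improvement, so a patience of 2 ends training at epoch 6.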

### Test the model

##### The model has to be tested on unseen data. This tells us how well the model classifies data it was not trained on.

In [None]:
test_loss, test_accuracy = model_audio.evaluate(test_inputs, test_targets)

print('test_loss: {0:.2f}. test_accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))

# the accuracy should be around 90%

test_loss: 0.24. test_accuracy: 91.07%


#### The test accuracy above shows that the model classifies about 91% of the test samples correctly, so we can expect it to misclassify roughly 9% of new, similarly distributed data.
##### Note that the accuracy will likely change when we rerun the notebook, because the shuffling and the weight initialization are random. The remaining challenge is to find hyperparameters that give the best model.
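
To make the accuracy figure concrete, the sketch below shows how accuracy follows from softmax outputs: the predicted class is the index of the larger probability, and accuracy is the share of matching predictions. The probabilities and targets are made-up values. The last lines convert the reported 91.07% on 448 test samples into a count of correct predictions.

```python
import numpy as np

# Made-up softmax outputs for four samples (columns: class 0, class 1).
probabilities = np.array([
    [0.9, 0.1],
    [0.2, 0.8],
    [0.6, 0.4],
    [0.3, 0.7],
])
true_targets = np.array([0, 1, 1, 1])

# The predicted class is the column with the highest probability.
predictions = probabilities.argmax(axis=1)
toy_accuracy = (predictions == true_targets).mean()
print(predictions, toy_accuracy)

# Reported test accuracy expressed as a number of samples.
n_test = 448
n_correct = round(0.9107 * n_test)
print(n_correct, n_test - n_correct)
```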