# Audiobooks business case

## Machine Learning

Here we implement our Machine Learning model to predict the purchase of audiobooks.

## Problem Description

You are given data from an **Audiobook App**. Logically, it relates to the audio versions of books ONLY. Each customer in the database has made at least one purchase, which is why they appear in the database. We want to create a machine learning algorithm based on our available data that can **predict if a customer will buy again from the Audiobook company**.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to them. If we can focus our efforts SOLELY on customers who are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back. Identifying new customers creates value and growth opportunities.

The targets are a Boolean variable (0 or 1). We take a period of 2 years for our inputs, and the following 6 months for the targets. In other words, we are predicting whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. Six months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.
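
The labelling rule described above (2 years of inputs, the following 6 months as the target window) could be sketched as follows. This is only an illustration: the function `make_target`, the cutoff date, the sample purchase dates, and the 183-day approximation of 6 months are all assumptions and not part of the original data preparation.

```python
from datetime import date, timedelta

def make_target(purchase_dates, cutoff, window_days=183):
    """Return 1 if any purchase falls in (cutoff, cutoff + window_days], else 0."""
    window_end = cutoff + timedelta(days=window_days)
    return int(any(cutoff < d <= window_end for d in purchase_dates))

# Hypothetical end of the 2-year input period
cutoff = date(2020, 1, 1)

# Made-up purchase histories for two customers
returning = [date(2019, 3, 5), date(2020, 2, 10)]  # buys again inside the window
churned = [date(2018, 6, 1)]                       # no purchase after the cutoff

print(make_target(returning, cutoff))
print(make_target(churned, cutoff))
```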

The task is simple: create a machine learning algorithm that is able to predict whether a customer will buy again.

## Create a class that will batch the data

Whenever you want to batch the data, you need appropriate methods. There are some batching methods integrated in TensorFlow and sklearn, but some problems may need specific coding.

Here we show what these methods look like. You can use them with any machine learning framework you need (directly or after a little fine-tuning).

This part is more programming than Machine Learning.

In [28]:
# Load libraries
import numpy as np

# Create a class that will do the batching for the algorithm
class Audiobooks_Data_Reader():
    # dataset (string): name of the dataset: 'train', 'validation' or 'test'
    # batch_size (integer): number of samples in each batch.
    #             If no batch size is specified, the dataset is not
    #             split and is served as a single batch
    def __init__(self, dataset, batch_size = None):
        # Load the dataset
        npz = np.load('Audiobooks_data_{0}.npz'.format(dataset))
        
        # store data in the variables: inputs(float); targets(integer)
        # (the np.float and np.int aliases were removed in NumPy 1.24, so explicit dtypes are used)
        self.inputs, self.targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

        # If the batch size is None, we are either 
        # validating or testing, so we want to take the data in a single batch.
        if batch_size is None:
            self.batch_size = self.inputs.shape[0]
        else:
            self.batch_size = batch_size
            
        # Index of the current batch; __next__ advances it
        self.curr_batch = 0
        # Number of full batches. Operator // is floor division,
        # so an incomplete final batch is dropped
        self.batch_count = self.inputs.shape[0] // self.batch_size
        
    # We slice the dataset in batches and then 
    # the "next" function loads them one after the other
    # This method loads the next batch
    def __next__(self):
        if self.curr_batch >= self.batch_count:
            self.curr_batch = 0
            raise StopIteration()
        
        # Define the batch slice
        # The slice object is used to slice a given sequence 
        # slice(start,stop,step)
        batch_slice = slice(self.curr_batch * self.batch_size, 
                            (self.curr_batch + 1) * self.batch_size)

        # Slice the batch
        inputs_batch = self.inputs[batch_slice]
        targets_batch = self.targets[batch_slice]
        self.curr_batch += 1
              
        # One-hot encode (OHE) targets. In this example it's a bit superfluous since we have a single 0/1 column.
        # However, it will be useful for any classification task with more than two classes.
        # Example with three classes:
        # target = 0 OHE = [1,0,0] ; target = 1 OHE = [0,1,0] ; target = 2 OHE = [0,0,1]
        classes_num = 2
        targets_one_hot = np.zeros((targets_batch.shape[0], classes_num))
        targets_one_hot[range(targets_batch.shape[0]), targets_batch] = 1  
        
        # The function will return the inputs batch and the one-hot encoded targets
        return inputs_batch, targets_one_hot
        
    # A method needed for iterating over the batches, as we will put them in a loop.
    # This tells Python that the class we're defining is iterable, i.e. that we can use it like:
    #     for inputs, targets in data:
    #         do things
    # An iterator in Python is an object whose __next__ method defines exactly how to step through its items
    # and whose __iter__ method returns the iterator itself
    def __iter__(self):
        return self
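
To see the batching logic in isolation, here is a minimal sketch that applies the same slicing and one-hot encoding to small in-memory arrays instead of an `.npz` file. The generator `iterate_batches` and the made-up data are assumptions for illustration; they are not part of the class above.

```python
import numpy as np

def iterate_batches(inputs, targets, batch_size, classes_num=2):
    """Yield (inputs_batch, one_hot_targets) pairs, dropping any incomplete final batch."""
    batch_count = inputs.shape[0] // batch_size  # floor division, as in the class
    for i in range(batch_count):
        batch_slice = slice(i * batch_size, (i + 1) * batch_size)
        targets_batch = targets[batch_slice]
        # One row per sample, with a 1 in the column given by the target value
        one_hot = np.zeros((targets_batch.shape[0], classes_num))
        one_hot[np.arange(targets_batch.shape[0]), targets_batch] = 1
        yield inputs[batch_slice], one_hot

# Made-up data: 7 samples with 10 features each
inputs = np.arange(70, dtype=np.float64).reshape(7, 10)
targets = np.array([0, 1, 1, 0, 1, 0, 0])

batches = list(iterate_batches(inputs, targets, batch_size=3))
print(len(batches))   # number of full batches
print(batches[0][1])  # one-hot targets of the first batch
```

Because the batch count uses floor division, the seventh sample never appears in a batch. The class above behaves the same way whenever the dataset size is not a multiple of `batch_size`.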
        

## Create the machine learning algorithm

In [29]:
#import tensorflow as tf
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Number of features (10)
# Book length(min)_overall, Book length(min)_average, Price_average, Review, Minutes Listened,
# Review 10/10, Completion, Support Requests, Last visited minus, Purchase Date 
input_size = 10
output_size = 2
hidden_layer_size = 50 # width

tf.reset_default_graph()

inputs = tf.placeholder(tf.float32, [None, input_size])
targets = tf.placeholder(tf.int32, [None, output_size])

# Hidden Layer 1
# The activation function introduces the non-linearity to the model
weights_1 = tf.get_variable("weights_1", [input_size, hidden_layer_size])
biases_1 = tf.get_variable("biases_1", [hidden_layer_size])
outputs_1 = tf.nn.relu(tf.matmul(inputs, weights_1) + biases_1)  # Activation function: ReLU

# Hidden Layer 2
# The activation function introduces the non-linearity to the model
weights_2 = tf.get_variable("weights_2", [hidden_layer_size, hidden_layer_size])
biases_2 = tf.get_variable("biases_2", [hidden_layer_size])
outputs_2 = tf.nn.relu(tf.matmul(outputs_1, weights_2) + biases_2)  # Activation function: ReLU

# Output Layer
weights_3 = tf.get_variable("weights_3", [hidden_layer_size, output_size])
biases_3 = tf.get_variable("biases_3", [output_size])
outputs = tf.matmul(outputs_2, weights_3) + biases_3

# Loss Function that incorporates the activation function for the output layer
loss = tf.nn.softmax_cross_entropy_with_logits(logits = outputs, labels = targets)
mean_loss = tf.reduce_mean(loss)

# Optimization of the Loss Function
optimize = tf.train.AdamOptimizer(learning_rate=0.001).minimize(mean_loss)

# Define the accuracy of the model
out_equals_target = tf.equal(tf.argmax(outputs,1), tf.argmax(targets,1))
accuracy = tf.reduce_mean(tf.cast(out_equals_target, tf.float32))

# Create the session and initialize the variables
sess = tf.InteractiveSession()
initializer = tf.global_variables_initializer()
sess.run(initializer)

batch_size = 100
max_epochs = 50
prev_validation_loss = 9999999.

train_data = Audiobooks_Data_Reader('train', batch_size)
validation_data = Audiobooks_Data_Reader('validation')

# Note: for validation_data we don't specify a batch_size.
# This means the data is not split into batches,
# so the for loop over it runs only once.

for epoch_counter in range(max_epochs):
    
    curr_epoch_loss = 0.
    
    for input_batch, target_batch in train_data:
        _, batch_loss = sess.run([optimize, mean_loss], 
            feed_dict={inputs: input_batch, targets: target_batch})
        
        curr_epoch_loss += batch_loss
        
    curr_epoch_loss /= train_data.batch_count
    
    validation_loss = 0.
    validation_accuracy = 0.
    
    for input_batch, target_batch in validation_data:
        validation_loss, validation_accuracy = sess.run([mean_loss, accuracy],
            feed_dict={inputs: input_batch, targets: target_batch})
        
    print('Epoch '+str(epoch_counter+1)+
          '. Training loss: '+'{0:.3f}'.format(curr_epoch_loss)+
          '. Validation loss: '+'{0:.3f}'.format(validation_loss)+
          '. Validation accuracy: '+'{0:.2f}'.format(validation_accuracy * 100.)+'%')
    
    # Early stopping: stop as soon as the validation loss starts increasing
    if validation_loss > prev_validation_loss:
        break
        
    prev_validation_loss = validation_loss
    
print('End of training.')

Epoch 1. Training loss: 0.596. Validation loss: 0.511. Validation accuracy: 78.52%
Epoch 2. Training loss: 0.483. Validation loss: 0.428. Validation accuracy: 81.43%
Epoch 3. Training loss: 0.425. Validation loss: 0.388. Validation accuracy: 80.76%
Epoch 4. Training loss: 0.395. Validation loss: 0.369. Validation accuracy: 81.21%
Epoch 5. Training loss: 0.376. Validation loss: 0.359. Validation accuracy: 80.54%
Epoch 6. Training loss: 0.365. Validation loss: 0.353. Validation accuracy: 80.09%
Epoch 7. Training loss: 0.357. Validation loss: 0.348. Validation accuracy: 79.87%
Epoch 8. Training loss: 0.351. Validation loss: 0.344. Validation accuracy: 80.54%
Epoch 9. Training loss: 0.346. Validation loss: 0.342. Validation accuracy: 80.76%
Epoch 10. Training loss: 0.342. Validation loss: 0.339. Validation accuracy: 81.21%
Epoch 11. Training loss: 0.339. Validation loss: 0.337. Validation accuracy: 81.43%
Epoch 12. Training loss: 0.335. Validation loss: 0.335. Validation accuracy: 81.43%
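
The training loop stops at the first epoch whose validation loss is higher than that of the previous epoch. A minimal sketch of that rule on its own, using made-up loss values (the function `epochs_until_stop` is hypothetical and not part of the notebook):

```python
def epochs_until_stop(validation_losses):
    """Return how many epochs run before the loop breaks on a rising validation loss."""
    prev_validation_loss = float("inf")
    for epoch, validation_loss in enumerate(validation_losses, start=1):
        if validation_loss > prev_validation_loss:
            return epoch  # this epoch still ran; training stops right after it
        prev_validation_loss = validation_loss
    return len(validation_losses)  # the rule never triggered: every epoch ran

# Made-up validation losses, one per epoch
made_up_losses = [0.51, 0.43, 0.39, 0.40, 0.35]
print(epochs_until_stop(made_up_losses))
```

With these values the loop ends after the fourth epoch, even though the fifth loss would have been lower. The rule reacts to a single uptick, which keeps it simple but makes it sensitive to noise in the validation loss.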

## Test Model

In [31]:
test_data = Audiobooks_Data_Reader('test')

for input_batch, target_batch in test_data:
    test_accuracy = sess.run([accuracy],
        feed_dict={inputs: input_batch, targets: target_batch})
        
test_accuracy_percent = test_accuracy[0] * 100.

print('Test Accuracy: '+'{0:.2f}'.format(test_accuracy_percent)+'%')

Test Accuracy: 80.13%
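
For intuition about what the `mean_loss` and `accuracy` tensors compute, here is a NumPy sketch of softmax cross-entropy on raw network outputs (logits) and of accuracy as the share of rows where the predicted class matches the one-hot target. The helper names and the sample values are made up for illustration, and this is not TensorFlow's implementation.

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_targets):
    """Per-sample cross-entropy between softmax(logits) and one-hot targets."""
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(one_hot_targets * log_probs).sum(axis=1)

def accuracy(logits, one_hot_targets):
    """Fraction of samples whose highest logit is at the target class."""
    predicted = np.argmax(logits, axis=1)
    actual = np.argmax(one_hot_targets, axis=1)
    return float(np.mean(predicted == actual))

# Made-up logits for 4 samples and 2 classes, with one-hot targets
logits = np.array([[2.0, 0.0], [0.0, 1.0], [3.0, 1.0], [0.5, 0.2]])
targets = np.array([[1, 0], [0, 1], [0, 1], [1, 0]])

print(softmax_cross_entropy(logits, targets).mean())  # mean loss over the batch
print(accuracy(logits, targets))                      # 3 of 4 predictions match
```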


## Improve the model

Usually we can improve our model by checking the items below:
- Improve the preprocessing
- Fine-tune the model: increase the width and the depth of the neural network
- Play around with the activation functions
- Fiddle with the batch size: a single batch containing the whole dataset means Gradient Descent, while thousands of small batches mean Stochastic Gradient Descent
- Experiment with the learning rate / optimizers: visit the TensorFlow website to see the options that we have for optimizers
- Try Kaggle challenges to acquire more knowledge
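
On the batch size item in the list above: the batch size determines how many weight updates happen in one pass over the data. The sketch below counts them for a made-up dataset size (`updates_per_epoch` and `n_samples` are assumptions for illustration), using the same floor division as the batching class.

```python
def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one epoch when incomplete batches are dropped."""
    return n_samples // batch_size

n_samples = 1000  # made-up dataset size

# Whole dataset in one batch, mini-batches of 100, and one sample per batch
for batch_size in (n_samples, 100, 1):
    print(batch_size, updates_per_epoch(n_samples, batch_size))
```

One batch holding every sample gives a single, exact gradient step per epoch. One sample per batch gives as many noisy steps as there are samples. The value of 100 used in training sits between the two.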