# Audiobooks business case

### Problem

Data from an audiobook app has been collected from public sources. It relates to the audio versions of books ONLY.

Each customer in the database has made at least one purchase, which is why they are in the database. The idea is to create a machine learning algorithm, based on the available data, that can predict whether a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to them. If efforts are focused SOLELY on customers who are likely to convert again, great savings can be made. Moreover, the objective is to identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

From the .csv file, the data can be summarised. There are several variables:

- Customer ID
- Book length overall (sum of the minute length of all purchases)
- Book length avg (average length in minutes of all purchases)
- Price paid overall (sum of all purchases)
- Price paid avg (average of all purchases)
- Review (a Boolean variable indicating whether the customer left a review)
- Review out of 10 (if the customer left a review, their review out of 10)
- Total minutes listened
- Completion (from 0 to 1)
- Support requests (number of support requests; everything from a forgotten password to assistance with using the app)
- Last visited minus purchase date (in days)

These are the inputs, excluding the customer ID: it is completely arbitrary and works more like a name than a number.

The targets are a Boolean variable (0 or 1). Data is available for a period of 2 years, and the predictions are based on it.

So the aim is to find out whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. If they don't convert within 6 months, chances are they've gone to a competitor or didn't like the audiobook way of digesting information.

The task is to create a machine learning algorithm that is able to predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 
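
The split between inputs and targets described above can be sketched with NumPy. This is only an illustration under stated assumptions: the rows are made-up values (not taken from the real file), and the column order (customer ID first, the 0/1 target last) is assumed, since the actual .csv layout is not shown here.

```python
import io
import numpy as np

# Made-up rows for illustration only. Assumed column order:
# ID, length overall, length avg, price overall, price avg, review flag,
# review score, minutes listened, completion, support requests,
# last visited minus purchase date, target
raw_csv = io.StringIO(
    "1,1620,1620,19.73,19.73,1,10.0,1603.8,0.99,5,92,0\n"
    "2,2160,2160,5.33,5.33,0,0.0,0,0.0,0,0,0\n"
    "3,648,324,11.90,5.95,1,6.0,259.2,0.40,1,15,1\n"
)
data = np.loadtxt(raw_csv, delimiter=",")

# Drop the arbitrary customer ID (first column) and the target (last column)
inputs = data[:, 1:-1]
# The target is the 0/1 column: 1 means the customer bought again
targets = data[:, -1].astype(int)

print(inputs.shape)   # (3, 10): three customers, ten input variables
print(targets)        # [0 0 1]
```

The ten remaining columns match the `input_size = 10` used later in the model.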

## Creating a class that will batch the data

Whenever you want to batch the data you need to have appropriate methods. There are some batching methods integrated in TensorFlow and sklearn, but some problems may need specific coding. 

In [0]:
import numpy as np

# Create a class that will do the batching for the algorithm
# This code is highly reusable: changing Audiobooks_data everywhere in the code is enough to use this class in other problems.
class Audiobooks_Data_Reader():
    # Dataset is a mandatory argument, while the batch_size is optional
    # If batch_size input is not given, it will automatically take the value: None
    def __init__(self, dataset, batch_size = None):
    
        # The dataset that loads is one of "train", "validation", "test".
        # e.g. if I call this class with x('train',5), it will load 'Audiobooks_data_train.npz' with a batch size of 5.
        npz = np.load('Audiobooks_data_{0}.npz'.format(dataset))
        
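        # Two variables that take the values of the inputs and the targets. Inputs are floats, targets are integers
        # (np.float64 / np.int64 are used because the np.float and np.int aliases were removed in NumPy 1.24)
        self.inputs, self.targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)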
        
        # Set the batch size and count the number of full batches
        # If the batch size is None, we are either validating or testing, so we want to take the data in a single batch
        # Note that any samples left over after the last full batch are skipped (integer division below)
        if batch_size is None:
            self.batch_size = self.inputs.shape[0]
        else:
            self.batch_size = batch_size
        self.curr_batch = 0
        self.batch_count = self.inputs.shape[0] // self.batch_size
    
    # A method which loads the next batch
    def __next__(self):
        if self.curr_batch >= self.batch_count:
            self.curr_batch = 0
            raise StopIteration()
            
        # You slice the dataset in batches and then the "next" function loads them one after the other
        batch_slice = slice(self.curr_batch * self.batch_size, (self.curr_batch + 1) * self.batch_size)
        inputs_batch = self.inputs[batch_slice]
        targets_batch = self.targets[batch_slice]
        self.curr_batch += 1
        
        # One-hot encoding the targets. In this problem, it's a bit superfluous since the target is already
        # a single 0/1 column, but the same approach works for classification tasks with more than two classes
        # (after changing classes_num)
        classes_num = 2
        targets_one_hot = np.zeros((targets_batch.shape[0], classes_num))
        targets_one_hot[range(targets_batch.shape[0]), targets_batch] = 1
        
        # The function will return the inputs batch and the one-hot encoded targets
        return inputs_batch, targets_one_hot
    
        
    # A method needed for iterating over the batches, as it will be in a loop
    # This tells Python that the class we're defining is iterable, i.e. that we can use it like:
    # for input, output in data: 
        # do things
    # An iterator in Python is a class with a method __next__ that defines exactly how to iterate through its objects
    def __iter__(self):
        return self
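
As a rough, file-free sketch of the batching and one-hot logic used by the class above, the same slicing can be written as a generator over in-memory arrays. The function name `iterate_batches` and the demo arrays are hypothetical and exist only for this illustration.

```python
import numpy as np

def iterate_batches(inputs, targets, batch_size, classes_num=2):
    """Yield (inputs_batch, one_hot_targets) pairs for every full batch."""
    # Integer division: samples beyond the last full batch are not yielded
    batch_count = inputs.shape[0] // batch_size
    for i in range(batch_count):
        batch_slice = slice(i * batch_size, (i + 1) * batch_size)
        targets_batch = targets[batch_slice]
        # One row per sample, with a 1 in the column of the target class
        one_hot = np.zeros((targets_batch.shape[0], classes_num))
        one_hot[np.arange(targets_batch.shape[0]), targets_batch] = 1
        yield inputs[batch_slice], one_hot

# Made-up data: 7 samples with 10 features each, and binary targets
demo_inputs = np.arange(70, dtype=np.float64).reshape(7, 10)
demo_targets = np.array([0, 1, 1, 0, 1, 0, 0])

batches = list(iterate_batches(demo_inputs, demo_targets, batch_size=3))
print(len(batches))      # 2 full batches; the 7th sample is left out
print(batches[0][1])     # one-hot targets of the first batch
```

With 7 samples and a batch size of 3, only two batches are produced. The class above behaves the same way, so with a batch size that does not divide the dataset evenly, a few training samples are never used.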

## Creating the machine learning algorithm

I am reusing the algorithm code from a previous ML project and making changes wherever necessary.
Once more, I put the whole code in a single cell, so I can simply rerun the cell to train a new model. That works because the whole algorithm is contained in the cell and the graph is reset with the tf.reset_default_graph() function.
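
The network in this section ends in a layer of raw scores (logits), and the softmax activation is folded into the loss. As a rough NumPy sketch of what a softmax cross-entropy loss computes per sample (an illustration with made-up logits, not TensorFlow's implementation; the function name `softmax_cross_entropy` is hypothetical):

```python
import numpy as np

def softmax_cross_entropy(logits, one_hot_targets):
    """Per-sample cross-entropy between softmax(logits) and one-hot targets."""
    # Subtract the row maximum for numerical stability (softmax is unchanged)
    shifted = logits - logits.max(axis=1, keepdims=True)
    # Log of the softmax probabilities
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Only the log-probability of the true class contributes
    return -(one_hot_targets * log_probs).sum(axis=1)

# Made-up logits for two samples and their one-hot targets
demo_logits = np.array([[0.0, 0.0], [2.0, -1.0]])
demo_one_hot = np.array([[1.0, 0.0], [0.0, 1.0]])

losses = softmax_cross_entropy(demo_logits, demo_one_hot)
mean_loss_value = losses.mean()
print(losses)
print(mean_loss_value)
```

For two classes, a sample with equal logits gets a loss of ln 2 (about 0.693), which is a handy reference point when reading the training log: a mean loss near that value means the model is doing no better than assigning equal probability to both classes.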

In [10]:
# Importing TensorFlow through its v1 compatibility module, because the code below uses TF1-style graphs and sessions
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()


# Input size depends on the number of input variables. There are 10
input_size = 10
# Output size is 2
output_size = 2
# Choosing a hidden_layer_size
hidden_layer_size = 50

# Reset the default graph, so later I can fiddle with the hyperparameters and then rerun the code.
# (tf already refers to tensorflow.compat.v1, so no extra prefix is needed)
tf.reset_default_graph()
# Creating the placeholders
inputs = tf.placeholder(tf.float32, [None, input_size])
targets = tf.placeholder(tf.int32, [None, output_size])

# Outlining the model and creating a net with 2 hidden layers
weights_1 = tf.get_variable("weights_1", [input_size, hidden_layer_size])
biases_1 = tf.get_variable("biases_1", [hidden_layer_size])
outputs_1 = tf.nn.relu(tf.matmul(inputs, weights_1) + biases_1)

weights_2 = tf.get_variable("weights_2", [hidden_layer_size, hidden_layer_size])
biases_2 = tf.get_variable("biases_2", [hidden_layer_size])
outputs_2 = tf.nn.sigmoid(tf.matmul(outputs_1, weights_2) + biases_2)

weights_3 = tf.get_variable("weights_3", [hidden_layer_size, output_size])
biases_3 = tf.get_variable("biases_3", [output_size])
# incorporating the softmax activation into the loss
outputs = tf.matmul(outputs_2, weights_3) + biases_3

# Using the softmax cross entropy loss with logits
loss = tf.nn.softmax_cross_entropy_with_logits(logits=outputs, labels=targets)
mean_loss = tf.reduce_mean(loss)

# Get a 0 or 1 for every input indicating whether it output the correct answer
out_equals_target = tf.equal(tf.argmax(outputs, 1), tf.argmax(targets, 1))
accuracy = tf.reduce_mean(tf.cast(out_equals_target, tf.float32))

# Optimizing with Adam
optimize = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(mean_loss)

# Creating a session
sess = tf.InteractiveSession()

# Initializing the variables
initializer = tf.global_variables_initializer()
sess.run(initializer)

# Choosing the batch size
batch_size = 100

# Setting early stopping mechanisms
max_epochs = 100
prev_validation_loss = 9999999.

# Creating the readers for the training and validation data, using the class we created.
# The argument is the ending of the file name 'Audiobooks_data_<...>.npz', where <...> is 'train', 'validation', or 'test'
# depending on what needs to be loaded
train_data = Audiobooks_Data_Reader('train', batch_size)
validation_data = Audiobooks_Data_Reader('validation')

# Creating the loop for epochs 
for epoch_counter in range(max_epochs):
    
    # Setting the epoch loss to 0, and make it a float
    curr_epoch_loss = 0.
    
    # Iterating over the training data 
    # Since train_data is an instance of the Audiobooks_Data_Reader class,
    # it can be iterated through by implicitly using the __next__ method.
    # it batches samples together, one-hot encodes the targets, and returns
    # inputs and targets batch by batch
    for input_batch, target_batch in train_data:
        _, batch_loss = sess.run([optimize, mean_loss], 
            feed_dict={inputs: input_batch, targets: target_batch})
        
        #Recording the batch loss into the current epoch loss
        curr_epoch_loss += batch_loss
    
    # Finding the mean curr_epoch_loss
    # batch_count is a variable, defined in the Audiobooks_Data_Reader class
    curr_epoch_loss /= train_data.batch_count
    
    # Setting validation loss and accuracy for the epoch to zero
    validation_loss = 0.
    validation_accuracy = 0.
    
    # Using the same logic of the code to forward propagate the validation set
    # There will be a single batch, as the class was created in this way
    for input_batch, target_batch in validation_data:
        validation_loss, validation_accuracy = sess.run([mean_loss, accuracy],
            feed_dict={inputs: input_batch, targets: target_batch})
    
    # Printing statistics for the current epoch
    print('Epoch '+str(epoch_counter+1)+
          '. Training loss: '+'{0:.3f}'.format(curr_epoch_loss)+
          '. Validation loss: '+'{0:.3f}'.format(validation_loss)+
          '. Validation accuracy: '+'{0:.2f}'.format(validation_accuracy * 100.)+'%')
    
    # Triggering early stopping if validation loss begins increasing.
    if validation_loss > prev_validation_loss:
        break
        
    # Storing this epoch's validation loss to be used as previous in the next iteration.
    prev_validation_loss = validation_loss
    
print('End of training.')



Epoch 1. Training loss: 1.838. Validation loss: 1.761. Validation accuracy: 50.34%
Epoch 2. Training loss: 1.698. Validation loss: 1.628. Validation accuracy: 50.34%
Epoch 3. Training loss: 1.563. Validation loss: 1.499. Validation accuracy: 50.34%
Epoch 4. Training loss: 1.433. Validation loss: 1.376. Validation accuracy: 50.34%
Epoch 5. Training loss: 1.309. Validation loss: 1.260. Validation accuracy: 50.34%
Epoch 6. Training loss: 1.193. Validation loss: 1.152. Validation accuracy: 50.34%
Epoch 7. Training loss: 1.088. Validation loss: 1.055. Validation accuracy: 50.34%
Epoch 8. Training loss: 0.993. Validation loss: 0.969. Validation accuracy: 50.11%
Epoch 9. Training loss: 0.911. Validation loss: 0.896. Validation accuracy: 49.66%
Epoch 10. Training loss: 0.841. Validation loss: 0.834. Validation accuracy: 49.44%
Epoch 11. Training loss: 0.785. Validation loss: 0.784. Validation accuracy: 48.99%
Epoch 12. Training loss: 0.740. Validation loss: 0.744. Validation accuracy: 49.66%
...
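
The early stopping rule in the training loop stops at the first epoch whose validation loss is higher than the previous epoch's. A minimal sketch of that rule on a list of made-up losses (the function name `first_increase_epoch` is hypothetical, and `float("inf")` stands in for the large starting value used in the loop):

```python
def first_increase_epoch(validation_losses):
    """Return the 1-based epoch where the loss first rises, or None if it never does."""
    prev_loss = float("inf")
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss > prev_loss:
            return epoch
        prev_loss = loss
    return None

# Made-up validation losses: the first increase happens at epoch 4
demo_losses = [0.90, 0.70, 0.65, 0.66, 0.60]
stop_epoch = first_increase_epoch(demo_losses)
print(stop_epoch)  # 4
```

In this made-up sequence the loss would have improved again at epoch 5, which shows the trade-off of the rule: it is simple, but a single noisy increase ends training. Waiting for several consecutive increases before stopping is one possible variation, though it is not what the code above does.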

## Test the model

In [11]:
# Loading the test data, following the same logic done for the train_data and validation data
test_data = Audiobooks_Data_Reader('test')

# Forward propagate through the test set to check the accuracy
for inputs_batch, targets_batch in test_data:
    test_accuracy = sess.run([accuracy],
                     feed_dict={inputs: inputs_batch, targets: targets_batch})

# Getting the test accuracy in percentages
# Because the fetches are passed to sess.run as a list ([accuracy]), the result is also a list rather than a float.
# Therefore, the first value from the list (the value at position 0) should be taken
test_accuracy_percent = test_accuracy[0] * 100.

# Printing the test accuracy
print('Test accuracy: '+'{0:.2f}'.format(test_accuracy_percent)+'%')

Test accuracy: 81.70%
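
For reference, the accuracy reported above is the share of samples for which the highest-scoring output matches the class marked in the one-hot target. A small NumPy sketch of that computation, using made-up outputs and targets rather than the real test set:

```python
import numpy as np

# Made-up network outputs (logits) and one-hot targets for four samples
demo_outputs = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 1.0], [-0.2, 0.4]])
demo_one_hot = np.array([[1, 0], [0, 1], [0, 1], [0, 1]])

# Index of the largest value in each row: predicted and actual class
predicted = demo_outputs.argmax(axis=1)
actual = demo_one_hot.argmax(axis=1)

# Mean of the 0/1 matches is the fraction of correct predictions
demo_accuracy = (predicted == actual).astype(np.float32).mean()
print('Accuracy: {0:.2f}%'.format(demo_accuracy * 100.))  # Accuracy: 75.00%
```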
