# Audiobooks business case

## Problem Description

You are given data from an **Audiobook App**. Logically, it relates to the audio versions of books ONLY. Every customer in the database has made at least one purchase, which is why they are in the database. We want to create a machine learning algorithm based on our available data that can **predict if a customer will buy again from the Audiobook company**.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to them. If we can focus our efforts SOLELY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

The targets are a Boolean variable (0 or 1). We take a period of 2 years for our inputs and the following 6 months for the targets. In other words, we predict whether a customer will convert in the next 6 months, based on the last 2 years of activity and engagement. 6 months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

The task is simple: create a machine learning algorithm that is able to predict if a customer will buy again.

## Dataset Description

Understanding the meaning of the variables/features in a dataset is important for a Data Scientist who wants to develop a model.

Each row represents a person.

- **ID:** The customer ID. We will skip it in our algorithm.
- **Book length(min)_overall:** The overall book length is the sum of the lengths of all purchases.
- **Book length(min)_average:** The average book length is the sum divided by the number of purchases.
- **Price_overall:** The overall price paid (dollars).
- **Price_average:** The average price paid (dollars).
- **Review:** A Boolean that shows if the customer left a review. This is a metric that shows engagement with the platform. We can assume that people who leave reviews are more likely to convert again.
- **Review 10/10:** It measures the review of a customer on a scale from 1 to 10. Logically, only customers who left a review have a value of their own here.
- **Minutes Listened:** The total minutes listened, which is a measure of engagement.
- **Completion:** It is the total minutes listened divided by the total length of books a person has purchased.
- **Support Requests:** It shows the total number of support requests the person has opened.
- **Last visited minus Purchase Date:** It measures the difference between the last time a person interacted with the platform and their first purchase date. The bigger the difference, the better. If a person engages regularly with the platform, this difference will be bigger, so the customer is likely to convert again. If the value of this variable is zero, the customer has never accessed what they bought, or perhaps did so on the first day only, so it is unlikely they will convert again.
- **Targets:** 1 if the customer bought another book, which we count as a conversion. Otherwise it is 0.
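
The Completion definition can be checked with a few lines of arithmetic. The sketch below uses the values from rows 0, 3 and 4 of the `head()` output shown later in this notebook; which column holds which feature is inferred from the values, not from a file header, so treat the column roles as an assumption.

```python
# Values copied from rows 0, 3 and 4 of the data preview in this notebook.
# Column roles are an assumption inferred from the feature list.
total_length_min = [1620.0, 1620.0, 2160.0]   # overall book length (min)
minutes_listened = [1603.8, 680.4, 475.2]     # total minutes listened

# Completion = minutes listened / total length of purchased books
completion = [round(m / t, 2) for m, t in zip(minutes_listened, total_length_min)]
print(completion)  # [0.99, 0.42, 0.22]
```

These results match the completion values stored in the same rows of the preview.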

**Note:** The price variable is almost always a good predictor of behavior.

**Review 10/10 details:**

We quickly see that most people leave no review.
As in most marketplaces, that's bad for our dataset and bad in general.
We decided to keep the reviews posted to the platform and substitute all missing values with the average review.

The average is 8.91. For our machine learning algorithm, 8.91 would mean the status quo.
A review higher than 8.91 would indicate above-average feelings, while a review lower than 8.91 would indicate below-average feelings.

Note that a customer may have bought two or three books on the platform as a whole.
An average of two out of ten indicates the person did not have a pleasant experience with audiobooks.
Especially when the overall average is 8.91, it is logical that such a customer is not likely to buy again.
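
As a minimal sketch of the substitution described above, the snippet below fills missing ratings with the mean of the observed ones. The ratings are hypothetical toy values, and marking "no review" with `np.nan` is an assumption made for illustration; the CSV used in this notebook already contains the filled value 8.91.

```python
import numpy as np

# Hypothetical ratings for five customers; np.nan marks "no review left".
ratings = np.array([10.0, np.nan, 8.0, np.nan, 9.0])

# Mean of the observed ratings only (NaNs are ignored).
mean_rating = np.nanmean(ratings)

# Replace every missing rating with that mean.
filled = np.where(np.isnan(ratings), mean_rating, ratings)
print(mean_rating, filled)
```

After the fill, a customer without a review sits exactly at the average, which is the "status quo" reading described above.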

## Load Libraries

In [10]:
import numpy as np
import pandas as pd
from sklearn import preprocessing 

## Extract the data from the csv

In [6]:
# Load data from file
raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter = ',')

# Select every column from the 2nd up to (but not including) the last one.
# The ID (first column) is of no interest, and the last column holds the targets.
unscaled_inputs_all = raw_csv_data[:,1:-1]

# The targets are in the last column. That's how datasets are conventionally organized.
targets_all = raw_csv_data[:,-1]

## Verify if Dataset is Balanced 

In [19]:
raw_csv_data_df = pd.DataFrame(raw_csv_data)
raw_csv_data_df.head()

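|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|
| 0 | 994.0 | 1620.0 | 1620.0 | 19.73 | 19.73 | 1.0 | 10.0 | 0.99 | 1603.8 | 5.0 | 92.0 | 0.0 |
| 1 | 1143.0 | 2160.0 | 2160.0 | 5.33 | 5.33 | 0.0 | 8.91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 2059.0 | 2160.0 | 2160.0 | 5.33 | 5.33 | 0.0 | 8.91 | 0.0 | 0.0 | 0.0 | 388.0 | 0.0 |
| 3 | 2882.0 | 1620.0 | 1620.0 | 5.96 | 5.96 | 0.0 | 8.91 | 0.42 | 680.4 | 1.0 | 129.0 | 0.0 |
| 4 | 3342.0 | 2160.0 | 2160.0 | 5.33 | 5.33 | 0.0 | 8.91 | 0.22 | 475.2 | 0.0 | 361.0 | 0.0 |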
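
Because the CSV has no header row, the DataFrame columns are plain integers. A possible way to make the preview easier to read is to attach names, sketched below. The names are hypothetical, and their order is inferred from the dataset description and the preview values: position 7 holds values between 0 and 1, so it is taken to be Completion, and position 8 is taken to be Minutes Listened.

```python
import pandas as pd

# Assumed names, inferred from the dataset description and the preview values.
column_names = [
    "id", "book_length_overall", "book_length_avg", "price_overall", "price_avg",
    "review", "review_10_10", "completion", "minutes_listened",
    "support_requests", "last_visited_minus_purchase", "targets",
]

# First two rows of the preview, copied verbatim.
rows = [
    [994.0, 1620.0, 1620.0, 19.73, 19.73, 1.0, 10.0, 0.99, 1603.8, 5.0, 92.0, 0.0],
    [1143.0, 2160.0, 2160.0, 5.33, 5.33, 0.0, 8.91, 0.0, 0.0, 0.0, 0.0, 0.0],
]
df = pd.DataFrame(rows, columns=column_names)
print(df[["completion", "minutes_listened", "targets"]])
```

With the full data, the same list could be passed as `columns=column_names` when building `raw_csv_data_df`, provided the inferred order is confirmed against the data source.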


In [35]:
# Share of each class in the targets column (column 11)
num_samples = len(raw_csv_data_df[11])
perc_ones = len(raw_csv_data_df[raw_csv_data_df[11]==1])/num_samples
perc_zeros = len(raw_csv_data_df[raw_csv_data_df[11]==0])/num_samples

print("Percentage of ones {}".format(round(perc_ones,2)))
print("Percentage of zeros {}".format(round(perc_zeros,2)))
# Flag the dataset when the two class shares differ by more than 0.6
if abs(perc_ones - perc_zeros) > 0.6:
    print("Unbalanced Dataset")

Percentage of ones 0.16
Percentage of zeros 0.84
Unbalanced Dataset
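
The class shares can also be computed directly from a 0/1 target array. The helper below is a small sketch with a hypothetical name (`class_fractions`) and toy targets, not part of the original notebook.

```python
import numpy as np

def class_fractions(targets):
    """Return (fraction of ones, fraction of zeros) for a 0/1 target array."""
    targets = np.asarray(targets)
    frac_ones = float(np.mean(targets == 1))
    return frac_ones, 1.0 - frac_ones

# Hypothetical toy targets: 1 conversion out of 5 customers.
frac_ones, frac_zeros = class_fractions([0, 0, 1, 0, 0])
print(round(frac_ones, 2), round(frac_zeros, 2))
```

On the notebook's data, it could be called as `class_fractions(targets_all)`.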


## Balance the Dataset

In [36]:
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))

# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

# We want to create a "balanced" dataset, so we will have to remove some input/target pairs.
# Declare a variable that will do that:
indices_to_remove = []

# Walk through the targets and count the zeros seen so far.
# Once that count exceeds the number of ones,
# mark every further zero-target observation for removal.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# np.delete removes the given indices along an axis:
# np.delete(array, obj_to_delete, axis)
unscale_inputs_equal_priors = np.delete(unscaled_inputs_all,
                                       indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, 
                                 indices_to_remove, axis=0)
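
The loop above keeps all ones and the first zeros in file order, up to the number of ones. The same selection can be written without an explicit loop, as in the sketch below. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def balance_by_dropping_zeros(inputs, targets):
    """Keep all ones and only the first len(ones) zeros, preserving row order."""
    num_ones = int(np.sum(targets == 1))
    zero_idx = np.flatnonzero(targets == 0)
    # Zeros beyond the first num_ones are the ones to remove.
    to_remove = zero_idx[num_ones:]
    return (np.delete(inputs, to_remove, axis=0),
            np.delete(targets, to_remove, axis=0))

# Hypothetical toy data: 2 ones and 5 zeros.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)
bal_inputs, bal_targets = balance_by_dropping_zeros(toy_inputs, toy_targets)
print(bal_targets)  # [0 1 0 1]
```

Like the loop, this drops the zeros that appear last in the file. If the file is sorted in some meaningful way, picking the zeros to drop at random may be preferable.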

## Standardize the Inputs

Scaling (standardizing) the inputs is a very common preprocessing step in machine learning because it often improves the algorithm's predictions considerably.

In [37]:
scaled_inputs = preprocessing.scale(unscale_inputs_equal_priors)
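
To make the transformation concrete, the sketch below standardizes each column to zero mean and unit variance with plain NumPy, which is what `preprocessing.scale` does with its default arguments. The function name and toy matrix are hypothetical, and unlike scikit-learn this version does not handle constant columns (a zero standard deviation would cause a division by zero).

```python
import numpy as np

def standardize(x):
    """Column-wise (x - mean) / std, using the population std (ddof=0)."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Hypothetical toy matrix with two features on very different scales.
toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
scaled = standardize(toy)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, both columns have a mean of about 0 and a standard deviation of about 1, so no feature dominates simply because of its units.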

## Shuffle the data

Sometimes the data was collected in order. To make our model independent of that, we shuffle the inputs and the targets together.

Imagine the data is ordered so that each batch represents approximately a different day of purchases.
Inside a batch the data is homogeneous, while between batches it is very heterogeneous because of promotions, day-of-the-week effects, and so on. This will confuse the stochastic gradient descent when we average the loss across batches.

Overall, we want the data shuffled.

In [42]:
# np.arange(n) returns the evenly spaced integers 0, 1, ..., n-1, used here as row indices
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
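
The shuffle above gives a different order on every run. If a reproducible split is wanted, one option is a seeded generator, sketched below with toy arrays. The seed value and variable names are arbitrary choices, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # the seed value is an arbitrary choice

# Hypothetical toy data: row i of the inputs belongs to target i.
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.array([0, 1, 0, 1, 0])

# One permutation, applied to both arrays, keeps each input with its target.
perm = rng.permutation(toy_inputs.shape[0])
shuffled_toy_inputs = toy_inputs[perm]
shuffled_toy_targets = toy_targets[perm]
```

Using the same index array for inputs and targets is the important part; shuffling them separately would break the pairing.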

## Split the dataset into train, validation and test

In [43]:
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset, assuming we want 80-10-10 distribution of training, validation, and test.
# Naturally, the numbers are integers.
train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)

# The 'test' dataset contains all remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]
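
The size arithmetic of the 80-10-10 split can be isolated in a small helper, sketched below with a hypothetical name. The sample count 4474 is the sum of the three subset sizes printed in the balance check of this notebook.

```python
def split_counts(samples_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes; test takes whatever remains."""
    train = int(train_frac * samples_count)
    validation = int(validation_frac * samples_count)
    test = samples_count - train - validation
    return train, validation, test

print(split_counts(4474))  # (3579, 447, 448)
```

Giving the remainder to the test set guarantees that the three sizes always add up to the total, even when the fractions do not divide evenly.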

## Verify if the dataset is balanced

In [44]:
print('Train: ', np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print('Validation: ', np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print('Test: ', np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

Train:  1780.0 3579 0.4973456272701872
Validation:  230.0 447 0.5145413870246085
Test:  227.0 448 0.5066964285714286


## Save the three datasets in *.npz

In [46]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
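
To read a saved split back, `np.load` returns an object indexed by the keyword names passed to `np.savez`. The sketch below round-trips a toy file in a temporary directory; the file name and array contents are hypothetical, while the `inputs` and `targets` keys match the ones used above.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_split")  # np.savez appends ".npz"
    np.savez(path, inputs=np.zeros((3, 10)), targets=np.array([0.0, 1.0, 0.0]))

    # The keys are the keyword names given to np.savez.
    with np.load(path + ".npz") as npz:
        loaded_inputs = npz["inputs"].astype(float)
        loaded_targets = npz["targets"].astype(int)

print(loaded_inputs.shape, loaded_targets)
```

For the files written above, the same pattern would apply with `np.load('Audiobooks_data_train.npz')` and the other two file names.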