# Audiobooks business case

### Problem

You are given data from an Audiobook app. Logically, it relates ONLY to the audio versions of books. Each customer in the database has made at least one purchase; that is why he/she is in the database. We want to create a machine learning algorithm, based on our available data, that can predict whether a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to him/her. If we can focus our efforts SOLELY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length overall (sum of the minute length of all purchases), Book length avg (average length in minutes of all purchases), Price paid overall (sum of all purchases), Price paid avg (average of all purchases), Review (a Boolean variable indicating whether the customer left a review), Review out of 10 (if the customer left a review, his/her review out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number of support requests; everything from forgotten password to assistance for using the App), and Last visited minus purchase date (in days).

These are the inputs, excluding Customer ID, which is completely arbitrary: it is more like a name than a number.

The targets are a Boolean variable (0 or 1). The inputs cover a period of 2 years, and the targets cover the following 6 months. So, in fact, we are predicting whether, based on the last 2 years of activity and engagement, a customer will convert in the next 6 months. 6 months sounds like a reasonable time. If they don't convert after 6 months, chances are they've gone to a competitor or didn't like the Audiobook way of digesting information.

The task is simple: create a machine learning algorithm that is able to predict whether a customer will buy again.

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s. 

Good luck!

# The business case action plan

## 1. Preprocess the data

#### 1.1 Balance the data set

#### 1.2 Divide the dataset into training, validation and test.

#### 1.3 Save the data in tensor friendly format (*.npz)

## 2. Create the ML algorithm (similar to the previous problem)






## Preprocess the data

Since we are dealing with real-life data, we will need to preprocess it a bit.

### Extract the data from .csv file

In [19]:
import numpy as np
from sklearn import preprocessing


raw_csv_data = np.loadtxt('Audiobooks_data.csv',delimiter=',')
np.set_printoptions(formatter={'float': '{: 0.1f}'.format})
raw_csv_data

# The inputs are all columns in the csv, except for the first one [:,0]
# (which is just the arbitrary customer IDs that bear no useful information),
# and the last one [:,-1] (which is our targets)

unscaled_inputs_all = raw_csv_data[:,1:-1]

# The targets are in the last column. That's how datasets are conventionally organized.
targets_all = raw_csv_data[:,-1]

targets_all

array([ 0.0,  0.0,  0.0, ...,  0.0,  0.0,  1.0])

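Because we store raw_csv_data in a 2D array of <b>numpy (not pandas)</b>, we choose columns with numpy slicing, as in raw_csv_data[:,1:-1] for the inputs and raw_csv_data[:,-1] for the targets.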
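To make the slicing concrete, here is a minimal sketch on a made-up array. The values and the name `toy` are hypothetical and are not taken from Audiobooks_data.csv; column 0 plays the role of the customer ID and the last column plays the role of the target.

```python
import numpy as np

# Hypothetical toy data: [ID, feature_a, feature_b, feature_c, target]
toy = np.array([
    [101, 1620.0, 19.7, 1.0, 0.0],
    [102, 2160.0,  5.3, 0.0, 1.0],
    [103, 1080.0,  6.6, 1.0, 0.0],
])

toy_inputs = toy[:, 1:-1]   # every row, all columns except the first and the last
toy_targets = toy[:, -1]    # every row, last column only

print(toy_inputs.shape)   # (3, 3)
print(toy_targets)        # [0. 1. 0.]
```

The same two slices work on any 2D array laid out as ID first, target last, regardless of how many feature columns sit in between.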

<b>Why don't we use read_csv here?</b>

As you can see in the example below:

1. read_csv is a pandas function, so using it means importing pandas, an extra import that adds overhead we can avoid here.

2. By default, read_csv takes the first line of our csv file as a header, which does not suit our header-less file.


In [3]:
import pandas as pd

raw_csv_data1 = pd.read_csv('Audiobooks_data.csv')
raw_csv_data1

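| Unnamed: 0 | 00994 | 1620 | 1620.1 | 19.73 | 19.73.1 | 1 | 10.00 | 0.99 | 1603.80 | 5 | 92 | 0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1143 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.00 | 0.0 | 0 | 0 | 0 |
| 1 | 2059 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.00 | 0.0 | 0 | 388 | 0 |
| 2 | 2882 | 1620.0 | 1620 | 5.96 | 5.96 | 0 | 8.91 | 0.42 | 680.4 | 1 | 129 | 0 |
| 3 | 3342 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.22 | 475.2 | 0 | 361 | 0 |
| 4 | 3416 | 2160.0 | 2160 | 4.61 | 4.61 | 0 | 8.91 | 0.00 | 0.0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14078 | 28220 | 1620.0 | 1620 | 5.33 | 5.33 | 1 | 9.00 | 0.61 | 988.2 | 0 | 4 | 0 |
| 14079 | 28671 | 1080.0 | 1080 | 6.55 | 6.55 | 1 | 6.00 | 0.29 | 313.2 | 0 | 29 | 0 |
| 14080 | 31134 | 2160.0 | 2160 | 6.14 | 6.14 | 0 | 8.91 | 0.00 | 0.0 | 0 | 0 | 0 |
| 14081 | 32832 | 1620.0 | 1620 | 5.33 | 5.33 | 1 | 8.00 | 0.38 | 615.6 | 0 | 90 | 0 |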
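If pandas were preferred anyway, the header issue can be avoided by passing `header=None` to `read_csv`. The following is only a sketch on a hypothetical two-row CSV held in memory (the string `csv_text` is made up for illustration), showing how the default swallows the first row while `header=None` keeps it as data.

```python
import io
import pandas as pd

# Hypothetical two-row, header-less CSV standing in for the real file
csv_text = "994,1620,19.73,0\n1143,2160,5.33,1\n"

with_default = pd.read_csv(io.StringIO(csv_text))                 # first row becomes the header
without_header = pd.read_csv(io.StringIO(csv_text), header=None)  # all rows are kept as data

print(with_default.shape)     # (1, 4)
print(without_header.shape)   # (2, 4)

# .to_numpy() gives a plain 2D array that can be sliced like raw_csv_data
as_array = without_header.to_numpy()
```

With `header=None`, pandas labels the columns 0, 1, 2, ... instead of borrowing values from the first data row.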


## Balance the dataset

1. We will count the number of targets that are <b>1s</b>

2. We will keep as many <b>0s</b> as <b>1s</b> (we will delete the others)

In [27]:
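# Count the targets that are 1s; int() because np.sum returns a float here
num_one_target = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

# Mark every 0-target row beyond the first num_one_target zeros for removal
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_target:
            indices_to_remove.append(i)

# Drop the surplus rows from both inputs and targets so they stay aligned
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
unscaled_inputs_equal_priors[0]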

array([ 1620.0,  1620.0,  19.7,  19.7,  1.0,  10.0,  1.0,  1603.8,  5.0,
        92.0])
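The loop above can also be written without an explicit Python loop. The sketch below is one possible vectorized variant, not the notebook's own code: the helper name `balance_by_dropping_zeros` and the toy arrays are hypothetical, and it assumes the same rule of keeping the first zeros and dropping the rest.

```python
import numpy as np

def balance_by_dropping_zeros(inputs, targets):
    """Keep every 1-target row and only as many 0-target rows as there are 1s."""
    num_ones = int(np.sum(targets))
    zero_positions = np.flatnonzero(targets == 0)
    # Zeros beyond the first num_ones are surplus and get removed
    surplus = zero_positions[num_ones:]
    return np.delete(inputs, surplus, axis=0), np.delete(targets, surplus, axis=0)

# Hypothetical toy data: 2 ones and 5 zeros
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0], dtype=float)
toy_inputs = np.arange(14, dtype=float).reshape(7, 2)

balanced_inputs, balanced_targets = balance_by_dropping_zeros(toy_inputs, toy_targets)
print(balanced_targets)   # [0. 1. 0. 1.]
```

Because the surviving zeros are always the earliest ones, data that is ordered by date keeps only its oldest 0-target rows under this rule.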

## Standardize the inputs

In [35]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
scaled_inputs[0]

array([ 0.2, -0.2,  2.0,  1.4,  2.1,  1.5,  4.2,  4.8,  11.8,  0.1])
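For intuition, standardizing means shifting each column to zero mean and rescaling it to unit variance. Below is a minimal NumPy sketch of that idea; the helper `standardize_columns` and the toy array are hypothetical, and it assumes the population standard deviation (ddof=0), which is what preprocessing.scale is documented to use.

```python
import numpy as np

def standardize_columns(x):
    """Return (x - column mean) / column std, using the population std (ddof=0)."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / std

# Hypothetical toy data with two columns on very different scales
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

scaled = standardize_columns(toy)
print(scaled.mean(axis=0))   # approximately 0 for every column
print(scaled.std(axis=0))    # approximately 1 for every column
```

This sketch does not guard against a constant column, whose standard deviation of 0 would cause a division by zero.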

## Shuffle the data

<b>np.arange([start,] stop)  (not arrange)</b> is a function that returns evenly spaced values within a given interval

In [36]:
# When the data was collected it was actually arranged by date
# Shuffle the indices of the data, so the data is not arranged in any way when we feed it.
# Since we will be batching, we want the data to be as randomly spread out as possible
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]

In [37]:
scaled_inputs.shape[0]

4474
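The key point of index-based shuffling is that inputs and targets are reordered with the same permutation, so each row keeps its own target. Here is a small sketch on hypothetical toy arrays; the seeded generator and the seed value 42 are assumptions added for reproducibility and are not part of the cells above.

```python
import numpy as np

rng = np.random.default_rng(42)   # the seed 42 is an arbitrary choice for this sketch

toy_inputs = np.arange(10).reshape(5, 2)   # rows: [0, 1], [2, 3], [4, 5], ...
toy_targets = toy_inputs[:, 0] // 2        # target i equals the row number i

indices = np.arange(toy_inputs.shape[0])   # array([0, 1, 2, 3, 4])
rng.shuffle(indices)                       # shuffles the indices in place

# Index both arrays with the same permutation so the rows stay paired
shuffled_toy_inputs = toy_inputs[indices]
shuffled_toy_targets = toy_targets[indices]
```

After the shuffle, `shuffled_toy_inputs[k, 0] // 2` still equals `shuffled_toy_targets[k]` for every row `k`, which is a quick way to confirm the pairing survived.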

## Split the dataset into training, validation and test

I will use an 80-10-10 split for training, validation and test

In [42]:
samples_count = shuffled_inputs.shape[0]

train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count 

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1790.0 3579 0.5001397038278849
218.0 447 0.48769574944071586
229.0 448 0.5111607142857143
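The counts printed above follow directly from the arithmetic of the split: the training and validation sizes are truncated with int(), and the test set takes whatever remains so that no sample is lost. A short sketch, using a hypothetical helper named `split_counts`:

```python
def split_counts(samples_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes; test absorbs the rounding remainder."""
    train = int(train_frac * samples_count)
    validation = int(validation_frac * samples_count)
    test = samples_count - train - validation
    return train, validation, test

print(split_counts(4474))   # (3579, 447, 448)
```

For 4474 samples, 0.8 and 0.1 of the total are 3579.2 and 447.4, which truncate to 3579 and 447, leaving 448 for the test set.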


## Save the dataset in *.npz

In [43]:
np.savez('Audiobooks_data_train', inputs = train_inputs, targets = train_targets)
np.savez('Audiobooks_data_validation', inputs = validation_inputs, targets = validation_targets)
np.savez('Audiobooks_data_test', inputs = test_inputs, targets = test_targets)
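np.savez appends the .npz extension to the file name, and the keyword names (inputs, targets) become the keys used to read the arrays back. The round trip below is a sketch on hypothetical toy arrays written to a temporary directory, so it does not touch the Audiobooks files.

```python
import os
import tempfile
import numpy as np

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data')   # np.savez appends '.npz'
    np.savez(path, inputs=np.zeros((4, 10)), targets=np.array([0.0, 1.0, 0.0, 1.0]))

    # np.load returns a dict-like object keyed by the names given to np.savez
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(loaded_inputs.shape, loaded_targets.shape)   # (4, 10) (4,)
```

The three saved files can be read back the same way, for example with np.load('Audiobooks_data_train.npz').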