### Task:
Create an ML algorithm that predicts whether a customer will buy an audiobook or not.


You are given data from an Audiobook App. Logically, it relates to the audio versions of books ONLY. Each customer in the database has made a purchase at least once, that's why he/she is in the database. We want to create a machine learning algorithm based on our available data that can predict if a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to him/her. If we can focus our efforts SOLELY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

You have a .csv summarizing the data. There are several variables: Customer ID, Book length overall (sum of the minute length of all purchases), Book length avg (average length in minutes of all purchases), Price paid overall (sum of all purchases), Price paid avg (average of all purchases), Review (a Boolean variable indicating whether the customer left a review), Review out of 10 (if the customer left a review, his/her review out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number of support requests; everything from forgotten password to assistance for using the App), and Last visited minus purchase date (in days).

These are the inputs (excluding Customer ID, which is completely arbitrary; it's more like a name than a number).

The targets are a Boolean variable (0 or 1). We take a period of 2 years for our inputs and the following 6 months for the targets. In other words, based on the last 2 years of activity and engagement, we predict whether a customer will convert in the next 6 months. If they don't convert within 6 months, chances are they've gone to a competitor or didn't like the audiobook way of digesting information.

The task is simple: create a machine learning algorithm, which is able to predict if a customer will buy again. 

This is a classification problem with two classes: won't buy and will buy, represented by 0s and 1s.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from sklearn import preprocessing     #extremely helpful to standardize variables -> improve model accuracy

In [2]:
raw_data = pd.read_csv('C:/Users/HP/Desktop/Privat/The Data Science Course 2018 - All Resources/Part_7_Deep_Learning/S51_L353/Audiobooks_data.csv')

Columns: 
ID, Book length (mins) total, Book length (mins) mean, Price total, Price mean, Review (bool: 1 = has reviewed),
Review 10/10 (only if reviewed; if empty, it is filled with the mean), Minutes listened, Completion %, # support requests, Last visited app minus purchase date, Target (1: bought a book, 0: didn't buy)
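
The file has no header row, so `pd.read_csv` with default settings uses the first customer as column labels. That is why the outputs below show headers such as `873` and `2160.1`, and why `describe()` counts 14083 rows. A minimal sketch of loading with explicit names follows. The short column names are hypothetical and simply follow the order of the column list above, which has not been checked against the values. The two sample rows are reconstructed from the notebook output and stand in for the real file.

```python
import io
import pandas as pd

# Hypothetical column names (not in the CSV), in the order of the list above.
COLUMN_NAMES = [
    "id", "book_length_total", "book_length_mean", "price_total", "price_mean",
    "review", "review_score", "minutes_listened", "completion",
    "support_requests", "last_visit_minus_purchase", "target",
]

# Two rows reconstructed from the notebook output; a stand-in for the real file.
sample_csv = io.StringIO(
    "873,2160,2160,10.13,10.13,0,8.91,0,0,0,0,1\n"
    "611,1404,2808,6.66,13.33,1,6.5,0,0,0,182,1\n"
)

# header=None keeps the first line as data instead of turning it into labels.
df = pd.read_csv(sample_csv, header=None, names=COLUMN_NAMES)
print(df.shape)
```

With the real file, the same `header=None, names=...` arguments would keep all rows as data, so the row count would be one higher than the one reported by `describe()` below.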
    

In [3]:
raw_data.head()

Unnamed: 0,873,2160,2160.1,10.13,10.13.1,0,8.91,0.1,0.2,0.3,0.4,1
0,611,1404.0,2808,6.66,13.33,1,6.5,0.0,0.0,0,182,1
1,705,324.0,324,10.13,10.13,1,9.0,0.0,0.0,1,334,1
2,391,1620.0,1620,15.31,15.31,0,9.0,0.0,0.0,0,183,1
3,819,432.0,1296,7.11,21.33,1,9.0,0.0,0.0,0,0,1
4,138,2160.0,2160,10.13,10.13,1,9.0,0.0,0.0,0,5,1


In [4]:
raw_data.describe(include='all')

Unnamed: 0,873,2160,2160.1,10.13,10.13.1,0,8.91,0.1,0.2,0.3,0.4,1
count,14083.0,14083.0,14083.0,14083.0,14083.0,14083.0,14083.0,14083.0,14083.0,14083.0,14083.0,14083.0
mean,16773.620535,1591.241302,1678.574451,7.103576,7.543621,0.160761,8.909795,0.125668,118.595166,0.070227,61.939431,0.158773
std,9691.225165,504.335798,654.849284,4.931782,5.560284,0.367324,0.643429,0.241212,268.739618,0.472173,88.209221,0.365477
min,2.0,216.0,216.0,3.86,3.86,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,8371.5,1188.0,1188.0,5.33,5.33,0.0,8.91,0.0,0.0,0.0,0.0,0.0
50%,16715.0,1620.0,1620.0,5.95,6.07,0.0,8.91,0.0,0.0,0.0,11.0,0.0
75%,25187.5,2160.0,2160.0,8.0,8.0,0.0,8.91,0.13,64.8,0.0,105.0,0.0
max,33683.0,2160.0,7020.0,130.94,130.94,1.0,10.0,1.0,2116.8,30.0,464.0,1.0


In [5]:
import numpy as np
import pandas as pd
# We will use the sklearn preprocessing library, as it will be easier to standardize the data.
from sklearn import preprocessing

# Load the data
raw_csv_data = np.loadtxt('C:/Users/HP/Desktop/Privat/The Data Science Course 2018 - All Resources/Part_7_Deep_Learning/S51_L353/Audiobooks_data.csv',delimiter=',')


# The inputs are all columns in the csv, except for the first one [:,0]
# (which is just the arbitrary customer IDs that bear no useful information),
# and the last one [:,-1] (which is our targets)

unscaled_inputs_all = raw_csv_data[:,1:-1]

# The targets are in the last column. That's how datasets are conventionally organized.
targets_all = raw_csv_data[:,-1]

#### Balance the data

We need an (about) equal number of the two target classes (1 and 0), so we have to balance the data:
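
To see how unbalanced the raw data is, the figures from the `describe()` output above can be turned into class counts. This is a small arithmetic sketch: the two constants are copied from that output, and the extra row accounts for the first customer, which pandas consumed as a header (its last value, the target, is 1).

```python
# Figures copied from the describe() output shown earlier.
rows_in_describe = 14083   # pandas used the first data row as a header
target_mean = 0.158773     # share of rows with target == 1

ones_in_describe = round(rows_in_describe * target_mean)

# The row pandas used as a header ends in 1, so it is one more converted customer.
total_rows = rows_in_describe + 1
total_ones = ones_in_describe + 1
total_zeros = total_rows - total_ones

# Keeping every 1 and an equal number of 0s gives the balanced dataset size.
balanced_size = 2 * total_ones

print(total_ones, total_zeros, balanced_size)  # 2237 11847 4474
```

Only about 16% of customers converted, so a model that always predicts 0 would already be right roughly 84% of the time. Balancing removes that shortcut, and the resulting size of 4474 agrees with the sample count the notebook prints after shuffling.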

In [6]:
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets = int(np.sum(targets_all))

# Set a counter for targets that are 0 (meaning that the customer did not convert)
zero_targets_counter = 0

# We want to create a "balanced" dataset, so we will have to remove some input/target pairs.
# Declare a variable that will do that:
zero_targets_to_remove = []

# Count the number of targets that are 0. 
# Once there are as many 0s as 1s, mark entries where the target is 0.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            zero_targets_to_remove.append(i)
            
# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# We delete all indices (from inputs and targets) that we marked "to remove" in the loop above.
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, zero_targets_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, zero_targets_to_remove, axis=0)
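
The loop above keeps every 1 and only the first `num_one_targets` zeros. The same selection can be sketched without an explicit loop. The helper name `balance_by_truncation` and the toy arrays are made up for illustration; this is an alternative formulation, not the notebook's code.

```python
import numpy as np

def balance_by_truncation(inputs, targets):
    """Keep all 1-targets and only the first equally many 0-targets."""
    targets = np.asarray(targets)
    num_ones = int(np.sum(targets == 1))
    zero_idx = np.flatnonzero(targets == 0)
    # Zeros beyond the first num_ones are the ones the loop marks for removal.
    to_remove = zero_idx[num_ones:]
    return (np.delete(inputs, to_remove, axis=0),
            np.delete(targets, to_remove, axis=0))

# Toy data: 2 ones and 5 zeros.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)

bal_inputs, bal_targets = balance_by_truncation(toy_inputs, toy_targets)
print(bal_targets)  # [0 1 0 1]
```

Because the zeros that survive are always the earliest ones in the file, shuffling before balancing may be worth considering if the file is ordered in some meaningful way.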

#### Standardize the inputs

In [7]:
# use sklearn's preprocessing module to easily scale all inputs
inputs_scaled = preprocessing.scale(unscaled_inputs_equal_priors)
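
With default arguments, `preprocessing.scale` centers each column to mean 0 and scales it to unit variance. The arithmetic can be sketched in plain NumPy; the helper `standardize` and the toy array are illustrative, not part of the notebook.

```python
import numpy as np

def standardize(x):
    """Column-wise (x - mean) / std, using the population standard deviation."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)      # ddof=0, i.e. population standard deviation
    std[std == 0] = 1.0      # leave constant columns unscaled instead of dividing by zero
    return (x - mean) / std

# Two toy columns on very different scales.
toy = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
scaled = standardize(toy)

print(scaled.mean(axis=0), scaled.std(axis=0))
```

Note that the notebook scales the whole balanced dataset before splitting, so the validation and test rows contribute to the mean and standard deviation. Computing those statistics on the training rows only is a common alternative.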

#### Shuffle the data

In [8]:
# Data is originally sorted by date, but we want it to be as randomly spread as possible: shuffle it!
# 1. Create the indices with np.arange, which returns evenly spaced values within a given interval, then shuffle them in place
shuffled_indices = np.arange(inputs_scaled.shape[0])
np.random.shuffle(shuffled_indices)

# 2. Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = inputs_scaled[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
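
The shuffle above gives a different order on every run, so the split sizes stay the same but the class proportions and results vary slightly. A seeded variant can be sketched as follows; the seed value and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # arbitrary seed, only for reproducibility

# Toy data: row k of the inputs is [2k, 2k + 1].
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.array([0, 1, 0, 1, 1])

# One permutation applied to both arrays keeps each row paired with its target.
perm = rng.permutation(toy_inputs.shape[0])
toy_shuffled_inputs = toy_inputs[perm]
toy_shuffled_targets = toy_targets[perm]

print(perm)
```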

#### Split data set into Train, Validation, Test

In [9]:
# count the samples
samples_count = shuffled_inputs.shape[0]
samples_count

4474

In [10]:
# Split: train 85%, validation 10%, test 5% (the remainder)
train_samples_count = int(0.85*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count =  samples_count -  train_samples_count -  validation_samples_count

In [11]:
# Assign the first 85% of inputs to train_inputs and the first 85% of targets to train_targets
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Assign the next 10% (the 85% to 95% range) of inputs to validation_inputs and of targets to validation_targets
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Assign the remaining 5% (the 95% to 100% range) of inputs to test_inputs and of targets to test_targets
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]
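
The split sizes follow directly from the fractions in the code (0.85 and 0.1, with the test set taking the remainder). A small sketch of that arithmetic, using a hypothetical helper `split_counts`:

```python
def split_counts(n_samples, train_frac=0.85, val_frac=0.10):
    """Return (train, validation, test) sizes; the test set gets the remainder."""
    train = int(train_frac * n_samples)   # int() truncates towards zero
    val = int(val_frac * n_samples)
    test = n_samples - train - val
    return train, val, test

print(split_counts(4474))  # (3802, 447, 225)
```

For the 4474 balanced samples this gives 3802, 447 and 225, which are the totals printed by the balance check in the next cell.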

In [12]:
# Check if each of the three subsets is still balanced 50/50 regarding Targets 1 and 0
# even though we balanced the main data set before, it's possible that e.g. in our 10% test_targets data we randomly got more target=1's than target=0's
# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1895.0 3802 0.4984218832193582
232.0 447 0.5190156599552572
110.0 225 0.4888888888888889


#### Save the three datasets in *.npz

In [13]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
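
`np.savez` appends the `.npz` extension when the file name does not already have it, which is why the loading cells further down open `Audiobooks_data_train.npz` and so on. A round trip can be sketched with toy arrays in a temporary directory; the file name `toy_split` is made up.

```python
import os
import tempfile
import numpy as np

toy_inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
toy_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_split")   # no extension given
    np.savez(path, inputs=toy_inputs, targets=toy_targets)
    saved_files = os.listdir(tmp_dir)           # the file on disk ends in .npz

    # Arrays are stored under the keyword names used in np.savez.
    with np.load(path + ".npz") as npz:
        loaded_inputs = npz["inputs"].astype(np.float64)
        loaded_targets = npz["targets"].astype(np.int64)

print(saved_files, loaded_targets)
```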

### Create the machine learning algorithm

In [14]:
import tensorflow as tf

##### Data

Train data

In [15]:
# create a temporary variable npz, where we will store each of the three Audiobooks datasets
npz = np.load('Audiobooks_data_train.npz')
# define train inputs and targets
train_inputs = npz['inputs'].astype(np.float64)  # inputs must be floats (the np.float alias no longer exists in recent NumPy)
train_targets = npz['targets'].astype(np.int64)  # targets must be integers because sparse_categorical_crossentropy takes integer class labels, so no explicit one-hot encoding is needed

Validation data

In [16]:
npz = np.load('Audiobooks_data_validation.npz')
validation_inputs, validation_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

Test data

In [17]:
npz = np.load('Audiobooks_data_test.npz')
test_inputs, test_targets = npz['inputs'].astype(np.float64), npz['targets'].astype(np.int64)

### Model
Outline, optimizers, loss, early stopping and training

In [20]:
# Set the input and output sizes
input_size = 10
output_size = 2
# Use same hidden layer size for both hidden layers. Not a necessity.
hidden_layer_size = 50

# define what the model will look like
model = tf.keras.Sequential([
    # tf.keras.layers.Dense basically implements: output = activation(dot(input, weight) + bias)
    # it takes several arguments, but the most important ones for us are the hidden_layer_size and the activation function
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 1st hidden layer
    tf.keras.layers.Dense(hidden_layer_size, activation='relu'), # 2nd hidden layer
    # the final layer is no different, we just make sure to activate it with softmax (softmax because our targets are categorical)
    tf.keras.layers.Dense(output_size, activation='softmax') # output layer
])

### Optimizer and loss function
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy']) # metrics: report the accuracy of the model after each epoch

### Training our built model
batch_size = 10
max_epochs = 150

# set an early stopping mechanism
# when the validation loss starts increasing instead of decreasing during training (as happens a few times in this model),
# the model is starting to overfit. To prevent this, we tell the model to stop training once the validation loss has failed to improve for x epochs
# patience=2: training stops after 2 consecutive epochs without an improvement in validation loss (this tolerates a single random increase)
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2)


# Fit the model
# note that this time the train, validation and test data are not iterable
model.fit(train_inputs, 
          train_targets, 
          batch_size=batch_size, 
          epochs=max_epochs, # epochs that we will train for (assuming early stopping doesn't kick in)
          # callbacks are functions called by a task when a task is completed
          # task here is to check if val_loss is increasing
          callbacks=[early_stopping], # early stopping
          validation_data=(validation_inputs, validation_targets), 
          verbose = 2 # making sure we get enough information about the training process
          )  
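
What the three `Dense` layers and the loss compute can be sketched in plain NumPy. This is an illustration of the arithmetic only: the weights are random and untrained, the inputs and labels are made up, and the helper functions are not the Keras implementation.

```python
import numpy as np

def dense(x, weights, bias, activation):
    """One fully connected layer: activation(dot(x, weights) + bias)."""
    return activation(x @ weights + bias)

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    # Subtract the row maximum first for numerical stability.
    shifted = z - z.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true integer class."""
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 10))            # 4 made-up customers, 10 inputs each

# Random, untrained weights with the same shapes as the model above.
w1, b1 = rng.normal(size=(10, 50)) * 0.1, np.zeros(50)
w2, b2 = rng.normal(size=(50, 50)) * 0.1, np.zeros(50)
w3, b3 = rng.normal(size=(50, 2)) * 0.1, np.zeros(2)

hidden1 = dense(x, w1, b1, relu)
hidden2 = dense(hidden1, w2, b2, relu)
probs = dense(hidden2, w3, b3, softmax)  # each row sums to 1

labels = np.array([0, 1, 1, 0])          # integer class labels, no one-hot needed
loss = sparse_categorical_crossentropy(probs, labels)
print(probs.shape, loss)
```

The loss indexes the predicted probabilities with the integer labels directly, which is why the targets were cast to integers rather than one-hot encoded.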


With the early stopping mechanism (a patience of two), the model trains for only 17 epochs instead of the maximum set in max_epochs.
The validation accuracy is 2% lower compared to having no stopping mechanism, but the model is now far less likely to be overfitted!
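
The patience rule can be sketched in a few lines. This is a simplified illustration rather than the Keras implementation: it assumes the monitored quantity is the validation loss (the default for `EarlyStopping`) and that any decrease below the best value so far counts as an improvement. The function name and the loss values are made up.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # new best value: reset the counter
            wait = 0
        else:
            wait += 1        # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses, one per epoch.
losses = [0.60, 0.50, 0.45, 0.47, 0.44, 0.46, 0.45, 0.43]
print(stopping_epoch(losses))                    # 7
print(stopping_epoch([0.50, 0.40, 0.45, 0.39]))  # None
```

In the first list the single increase at epoch 4 is tolerated, while epochs 6 and 7 both fail to beat the best value of 0.44, so training stops at epoch 7. Under this rule, what counts is failing to improve on the best loss so far, not strictly increasing from the previous epoch.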

### Test the model

In [None]:
# .evaluate returns the loss value and the metrics (in our case accuracy); store them in two variables
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)

In [None]:
# put outputs in nice format
print('\nTest loss: {0:.2f}. Test accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100.))
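
The accuracy reported by `.evaluate` is the share of samples whose most probable class matches the target. A sketch of that calculation with made-up softmax outputs (these are not results from the model above):

```python
import numpy as np

# Made-up softmax outputs for five customers: columns are P(class 0), P(class 1).
probs = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.30, 0.70],
    [0.55, 0.45],
])
true_labels = np.array([0, 1, 1, 1, 0])

# The predicted class is the column with the highest probability.
predicted_labels = probs.argmax(axis=1)
accuracy = float(np.mean(predicted_labels == true_labels))

print('Test accuracy: {0:.2f}%'.format(accuracy * 100.))  # Test accuracy: 80.00%
```

On a real model the probabilities would come from `model.predict(test_inputs)`. Since the test set here is close to 50/50, an accuracy well above 50% indicates the model has learned something beyond guessing.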