# Deep Learning Application - Audiobook buyer business case

This project explores how a deep learning method can make predictions from a dataset provided by an audiobook company. The result is a model that predicts whether a customer will buy another audiobook from that company.

## Dataset Preparation

### Extract the data from the CSV

In [14]:
import numpy as np
from sklearn import preprocessing
import tensorflow as tf

raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter = ',')

# Use every column except the first and the last as the input data
unscaled_inputs_all = raw_csv_data[:,1:-1]

# Use the last column as the target data
targets_all = raw_csv_data[:,-1]

The data was collected in date order, so we shuffle the row indices to make sure the data is not arranged in any particular way when we feed it to the model. Since we will be batching, we want the samples to be spread as randomly as possible across batches.

In [3]:
shuffle_indicies = np.arange(unscaled_inputs_all.shape[0])
np.random.shuffle(shuffle_indicies)

unscaled_inputs_all = unscaled_inputs_all[shuffle_indicies]
targets_all = targets_all[shuffle_indicies]
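
The shuffle above gives a different order on every run. As a minimal sketch, assuming reproducible runs are wanted, a seeded generator can produce the permutation instead. The toy arrays and the seed value 42 are assumptions made for illustration and are not part of the audiobook data.

```python
import numpy as np

# Seeded generator; the seed value is an arbitrary choice for this sketch
rng = np.random.default_rng(seed=42)

# Toy data: row i of the inputs is [2*i, 2*i + 1] and its target is i
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.arange(5)

# One permutation of the row indices, applied to both arrays
toy_permutation = rng.permutation(toy_inputs.shape[0])
toy_shuffled_inputs = toy_inputs[toy_permutation]
toy_shuffled_targets = toy_targets[toy_permutation]

# Every row is still paired with its own target after shuffling
print(np.array_equal(toy_shuffled_inputs[:, 0] // 2, toy_shuffled_targets))
```

Indexing both arrays with the same permutation is what keeps each input row matched with its target.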

### Balance the dataset

In [5]:
# Sum the targets to count how many of them are 1
num_one_targets = int(np.sum(targets_all))
# Counter for the targets that are 0, and a list of the row indices we will remove
zero_targets_counter = 0
indicies_to_remove = []
# Loop over every target in the vector
for i in range(targets_all.shape[0]):
    # Increase the zero counter by 1 if the target is 0
    if targets_all[i] == 0:
        zero_targets_counter += 1
        # Once we have seen more 0s than there are 1s, record the index of each extra 0
        if zero_targets_counter > num_one_targets:
            # indicies_to_remove collects the rows that we don't need
            indicies_to_remove.append(i)

# Delete the rows at the recorded indices from both the inputs and the targets
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indicies_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indicies_to_remove, axis=0)
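
The same balancing rule (keep every 1 and only the first equally many 0s) can also be written without an explicit loop. This is a sketch of an alternative, not the code used in this notebook. The function name balance_binary and the toy arrays are hypothetical.

```python
import numpy as np

def balance_binary(inputs, targets):
    """Keep all rows with target 1 and the first equally many rows with target 0."""
    one_indices = np.flatnonzero(targets == 1)
    zero_indices = np.flatnonzero(targets == 0)
    # Keep only as many 0 rows as there are 1 rows, then restore the original order
    kept = np.sort(np.concatenate([one_indices, zero_indices[: one_indices.size]]))
    return inputs[kept], targets[kept]

# Toy example: two 1s and five 0s
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)

balanced_inputs, balanced_targets = balance_binary(toy_inputs, toy_targets)
print(balanced_targets)  # [0 1 0 1]
```

Like the loop, this keeps the rows in their original order and drops only the surplus 0s that come last.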

### Standardize the inputs

In [6]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
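
With its default settings, preprocessing.scale standardizes each column to zero mean and unit variance. The sketch below reproduces that computation in plain NumPy on made-up numbers. The array toy_features is hypothetical, and the sketch does not handle constant columns, where the standard deviation is zero.

```python
import numpy as np

# Toy feature matrix: 3 samples, 2 features on very different scales
toy_features = np.array([[1.0, 10.0],
                         [2.0, 20.0],
                         [3.0, 60.0]])

# Per-column mean and population standard deviation (ddof=0)
column_means = toy_features.mean(axis=0)
column_stds = toy_features.std(axis=0)

toy_standardized = (toy_features - column_means) / column_stds

# After standardizing, each column has mean 0 and standard deviation 1
print(np.allclose(toy_standardized.mean(axis=0), 0.0))  # True
print(np.allclose(toy_standardized.std(axis=0), 1.0))   # True
```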

### Shuffle the data

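We build an array of row indices from axis 0 of the scaled inputs and store it in a variable. np.random.shuffle is a method that shuffles the numbers of a given sequence in place. Using the same shuffled indices for the inputs and the targets keeps each row paired with its target.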

In [7]:
shuffle_indicies = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffle_indicies)

shuffled_inputs = scaled_inputs[shuffle_indicies]
shuffled_targets = targets_equal_priors[shuffle_indicies]

### Split the dataset into train, validation and test data

In [8]:
sample_count = shuffled_inputs.shape[0]

train_sample_count = int(0.8 * sample_count)
validation_sample_count = int(0.1 * sample_count)
test_sample_count = sample_count - train_sample_count - validation_sample_count

train_inputs = shuffled_inputs[: train_sample_count]
train_targets = shuffled_targets[: train_sample_count]

validation_inputs = shuffled_inputs[train_sample_count : train_sample_count + validation_sample_count]
validation_targets = shuffled_targets[train_sample_count : train_sample_count + validation_sample_count]

test_inputs =  shuffled_inputs[train_sample_count + validation_sample_count : ]
test_targets =  shuffled_targets[train_sample_count + validation_sample_count : ]

print(sample_count)
print(np.sum(train_targets), train_sample_count, np.sum(train_targets)/train_sample_count)
print(np.sum(validation_targets), validation_sample_count, np.sum(validation_targets)/validation_sample_count)
print(np.sum(test_targets), test_sample_count, np.sum(test_targets)/test_sample_count)

4474
1780.0 3579 0.4973456272701872
231.0 447 0.5167785234899329
226.0 448 0.5044642857142857
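
The split sizes printed above follow directly from the 80/10/10 arithmetic. As a small worked check, the hypothetical helper split_counts below repeats the same calculation for the 4474 samples.

```python
def split_counts(sample_count, train_fraction=0.8, validation_fraction=0.1):
    """Return (train, validation, test) sizes; the test set takes the remainder."""
    train_count = int(train_fraction * sample_count)
    validation_count = int(validation_fraction * sample_count)
    test_count = sample_count - train_count - validation_count
    return train_count, validation_count, test_count

counts = split_counts(4474)
print(counts)  # (3579, 447, 448)
```

Because int() rounds down, the test set absorbs the leftover samples, which is why it has one more sample than the validation set.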


### Save the datasets to .npz files

In [10]:
np.savez('Audiobook_data_train', inputs = train_inputs, targets = train_targets)
np.savez('Audiobook_data_validation', inputs = validation_inputs, targets = validation_targets)
np.savez('Audiobook_data_test', inputs = test_inputs, targets = test_targets)

### Load the preprocessed dataset

In [11]:
# Train data
# Store the train data to a temporary variable npz
npz = np.load('Audiobook_data_train.npz')
# Load them into train input and target variable
# The input data type will be float and the target data will be integer
train_inputs = npz['inputs'].astype(float)
train_targets = npz['targets'].astype(int)

# Validation data
npz = np.load('Audiobook_data_validation.npz')
validation_inputs = npz['inputs'].astype(float)
validation_targets = npz['targets'].astype(int)

# Test data
npz = np.load('Audiobook_data_test.npz')
test_inputs = npz['inputs'].astype(float)
test_targets = npz['targets'].astype(int)
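
The save and load steps can be checked with a small round trip. This is a sketch on made-up arrays that writes to a temporary directory. The file name toy_split and the toy arrays are assumptions and do not touch the real dataset files.

```python
import os
import tempfile

import numpy as np

toy_inputs = np.array([[0.5, -1.2], [1.5, 0.3]])
toy_targets = np.array([1.0, 0.0])  # targets are stored as floats, as in the CSV

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_split")
    # np.savez appends the .npz extension when the name does not have one
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # Load the arrays back by the keyword names used when saving
    with np.load(path + ".npz") as npz:
        loaded_inputs = npz["inputs"].astype(float)
        loaded_targets = npz["targets"].astype(int)

print(np.array_equal(loaded_inputs, toy_inputs))  # True
print(loaded_targets)                             # [1 0]
```

Casting the targets to integers matters here because the loss used later expects integer class labels.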

## Create The Model

In [15]:
# Set the input, output and hidden layer sizes
input_size = 10 #number of input features (kept for reference; it is not passed to the model below)
output_size = 2 #number of classes: 0 and 1
hidden_layer_size = 100 #hyperparameter

# Define the model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
    tf.keras.layers.Dense(hidden_layer_size, activation = 'relu'),
    tf.keras.layers.Dense(output_size, activation = 'softmax')
                            ])

# Set Optimizer and loss function
model.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

# Set Batch size and Epoch
batch_size = 500 #hyperparameter
max_epoch = 100 #hyperparameter

# Set the early stopping mechanism
early_stopping = tf.keras.callbacks.EarlyStopping(patience = 2)
# 'patience' sets how many consecutive epochs without an improvement in validation loss we tolerate

# Fit the model
model.fit(train_inputs, #train inputs
          train_targets, #train targets
          batch_size = batch_size, #batch size
          epochs = max_epoch, #maximum number of epochs to train for
          callbacks = [early_stopping], #early stopping mechanism
          validation_data = (validation_inputs, validation_targets), #validation data
          verbose = 2 #make sure we get enough information about the training process
         )

Epoch 1/100
8/8 - 1s - loss: 0.6382 - accuracy: 0.6239 - val_loss: 0.5727 - val_accuracy: 0.7405 - 624ms/epoch - 78ms/step
Epoch 2/100
8/8 - 0s - loss: 0.5451 - accuracy: 0.7427 - val_loss: 0.5083 - val_accuracy: 0.7629 - 45ms/epoch - 6ms/step
Epoch 3/100
8/8 - 0s - loss: 0.4945 - accuracy: 0.7586 - val_loss: 0.4679 - val_accuracy: 0.7562 - 45ms/epoch - 6ms/step
Epoch 4/100
8/8 - 0s - loss: 0.4631 - accuracy: 0.7617 - val_loss: 0.4448 - val_accuracy: 0.7673 - 41ms/epoch - 5ms/step
Epoch 5/100
8/8 - 0s - loss: 0.4419 - accuracy: 0.7698 - val_loss: 0.4270 - val_accuracy: 0.7785 - 47ms/epoch - 6ms/step
Epoch 6/100
8/8 - 0s - loss: 0.4275 - accuracy: 0.7807 - val_loss: 0.4169 - val_accuracy: 0.7875 - 44ms/epoch - 6ms/step
Epoch 7/100
8/8 - 0s - loss: 0.4166 - accuracy: 0.7851 - val_loss: 0.4034 - val_accuracy: 0.7987 - 43ms/epoch - 5ms/step
Epoch 8/100
8/8 - 0s - loss: 0.4068 - accuracy: 0.7885 - val_loss: 0.3963 - val_accuracy: 0.8009 - 41ms/epoch - 5ms/step
Epoch 9/100
8/8 - 0s - loss: 0

<keras.callbacks.History at 0x2925f6db400>
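
To make the effect of patience concrete, here is a simplified sketch of the stopping rule in plain Python. It is not the Keras implementation, and the helper stopping_epoch and the list of validation losses are hypothetical values that were not taken from the run above.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None if it never stops."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: reset the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation losses, one per epoch
toy_val_losses = [0.57, 0.51, 0.47, 0.48, 0.46, 0.47, 0.465, 0.45]
print(stopping_epoch(toy_val_losses, patience=2))  # 7
```

In this example training stops at epoch 7 even though the loss at epoch 7 is lower than at epoch 6, because neither epoch improved on the best loss reached at epoch 5.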

### Test The Model 

In [16]:
test_loss, test_accuracy = model.evaluate(test_inputs, test_targets)



In [17]:
print('\nTest Loss: {0:.2f}. Test Accuracy: {1:.2f}%'.format(test_loss, test_accuracy*100))


Test Loss: 0.37. Test Accuracy: 82.37%
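
For intuition about the two reported numbers, the sketch below computes an accuracy and a cross-entropy loss by hand from softmax-style outputs. The probabilities and targets are made up for illustration. This shows the idea behind the metrics and is not how model.evaluate is implemented internally.

```python
import numpy as np

# Hypothetical softmax outputs for four customers: columns are P(class 0), P(class 1)
toy_probabilities = np.array([[0.9, 0.1],
                              [0.2, 0.8],
                              [0.6, 0.4],
                              [0.3, 0.7]])
toy_targets = np.array([0, 1, 1, 1])

# The predicted class is the one with the highest probability
toy_predictions = toy_probabilities.argmax(axis=1)
toy_accuracy = np.mean(toy_predictions == toy_targets)

# Cross-entropy with integer labels: mean of -log(probability given to the true class)
true_class_probabilities = toy_probabilities[np.arange(toy_targets.size), toy_targets]
toy_loss = -np.mean(np.log(true_class_probabilities))

print(toy_predictions)  # [0 1 0 1]
print(toy_accuracy)     # 0.75
```

The third customer is misclassified, so the accuracy is 3 out of 4, and the low probability given to that customer's true class is what contributes most to the loss.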
