# Business Case Example: An Audiobooks Platform

## Preprocessing

1. Balance the dataset
2. Split the dataset into Training, Validation, Test
3. Save the new datasets into a tensor-friendly format (.npz)

### Extract the data from the csv

In [1]:
import numpy as np
# We'll use sklearn capabilities for standardizing the inputs
from sklearn import preprocessing

raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter = ',')
# We create a new variable that holds all the inputs, i.e. every column except the IDs and the targets,
# so we drop the zeroth column and the last one (index -1)
unscaled_inputs_all = raw_csv_data[:,1:-1]
# We record the targets in a separate variable using the same method
targets_all = raw_csv_data[:,-1]
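To make the column slicing concrete, here is a minimal sketch on a made-up three-row CSV. The values and the `toy_*` names are hypothetical and are not taken from `Audiobooks_data.csv`; only the slicing pattern matches the cell above.

```python
import io
import numpy as np

# Hypothetical CSV with the same layout: ID, two input columns, target
toy_csv = io.StringIO("101,1.5,20,0\n102,2.0,35,1\n103,0.5,10,0\n")
toy_data = np.loadtxt(toy_csv, delimiter=',')

toy_inputs = toy_data[:, 1:-1]   # drop the ID (first) and the target (last) columns
toy_targets = toy_data[:, -1]    # keep only the last column

print(toy_inputs.shape)   # (3, 2)
print(toy_targets)        # [0. 1. 0.]
```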

### Shuffle the data

It's possible that the original data was collected in order of date. Since we'll be batching, we must shuffle the data.

Imagine a batch corresponds to a day's worth of purchases: inside the batch the data is homogeneous, but between batches it is very heterogeneous due to promotions, day-of-the-week effects, etc.

This will confuse SGD: we update the weights after each batch, so every update would be based on the loss of a batch that does not represent the whole dataset.

In [2]:
shuffled_indices = np.arange(unscaled_inputs_all.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets.
unscaled_inputs_all = unscaled_inputs_all[shuffled_indices]
targets_all = targets_all[shuffled_indices]
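The key point of the cell above is that inputs and targets are indexed with the same shuffled indices, so each row keeps its own target. A small sketch of that idea follows; the seeded generator (`np.random.default_rng(42)`) is an assumption added here for reproducibility and is not used in the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, assumed only for repeatable runs

# Toy data: row i is [2i, 2i+1] and its target is i
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = toy_inputs[:, 0] // 2

# One permutation, applied to both arrays
perm = rng.permutation(toy_inputs.shape[0])
shuffled_toy_inputs = toy_inputs[perm]
shuffled_toy_targets = toy_targets[perm]

# Every row still sits next to its own target
print(np.array_equal(shuffled_toy_inputs[:, 0] // 2, shuffled_toy_targets))  # True
```

Shuffling the two arrays with separate calls to `np.random.shuffle` would break this pairing, which is why the notebook shuffles the indices instead.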

### Balance the dataset

In [3]:
# We'll count the number of targets that are 1s.
# The targets are 0s and 1s, so their sum is the number of 1s
num_one_targets = int(np.sum(targets_all))

# Then, we'll keep as many 0s as 1s and delete the others:
zero_targets_counter = 0 # counter for the 0 targets seen so far
indices_to_remove = [] # records the indices of the rows to remove;
                        # it starts out as an empty list

# Iterate over the dataset and balance it:
# targets_all.shape[0] is the length of the targets vector, i.e. the total number of targets
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1 # increase the zeros counter by 1 whenever the target is 0
        if zero_targets_counter > num_one_targets:
            # Once we have seen more 0s than there are 1s, we take note of the index;
            # deleting these entries will balance the dataset
            indices_to_remove.append(i)

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis = 0) #np.delete(array, obj to delete, axis)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis = 0)
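The loop above can also be written without explicit iteration. The following is a sketch of an equivalent vectorized approach on toy data, not the notebook's own method; it assumes binary 0/1 targets, and the `toy_*` and `surplus_zero_positions` names are hypothetical.

```python
import numpy as np

toy_targets = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0])
toy_inputs = np.arange(toy_targets.size * 2).reshape(-1, 2)

num_ones = int(toy_targets.sum())

# Positions of all 0 targets, in order of appearance
zero_positions = np.flatnonzero(toy_targets == 0)
# Every 0 after the first `num_ones` zeros is surplus
surplus_zero_positions = zero_positions[num_ones:]

balanced_inputs = np.delete(toy_inputs, surplus_zero_positions, axis=0)
balanced_targets = np.delete(toy_targets, surplus_zero_positions, axis=0)

print(balanced_targets)  # [0 0 1 0 1 1]
```

Note that the kept 0s all come from the start of the array, so the end of the balanced array consists of 1s only.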

### Standardize the inputs

Scaling the inputs greatly improves the algorithm's performance (by about 10%).

In [4]:
# Using the preprocessing library we imported from sklearn
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
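With its default arguments, `preprocessing.scale` centers each column to zero mean and scales it to unit variance. The sketch below reproduces that computation by hand with NumPy on made-up numbers, only to show what "standardizing" means here; the `toy` array is hypothetical.

```python
import numpy as np

# Toy data (hypothetical): two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

# Standardize column by column: subtract the mean, divide by the standard deviation
toy_scaled = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(toy_scaled.mean(axis=0))  # approximately 0 for every column
print(toy_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns hold the same values even though the raw features differed by two orders of magnitude.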

### Shuffle the data (again)

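We still have to shuffle the data AFTER we balance the dataset. The balancing step keeps only the first 0s it encounters and deletes the later ones, so the end of the arrays contains only 1s; without a second shuffle, the train, validation, and test sets would have very different class balances. This code might be suboptimal, but it is the easiest way to complete the exercise. Since we do the preprocessing only once, speed is not something we are aiming for.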

In [5]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]

### Split the dataset into training, validation and test

In [6]:
samples_count = shuffled_inputs.shape[0]

# To determine the size of each subset we'll use the 80-10-10 split for train, validation, and test
train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]
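As a quick sanity check of the arithmetic, the sketch below applies the same 80-10-10 rule to 4474 samples, which is the sum of the three subset sizes reported by this notebook's balance check. Using `np.split` is an alternative to the manual slicing above, not what the notebook does, and `toy` is a stand-in array.

```python
import numpy as np

samples_count = 4474  # 3579 + 447 + 448, the subset sizes printed in this notebook

train_count = int(0.8 * samples_count)
validation_count = int(0.1 * samples_count)
# The test set takes whatever is left, so no sample is lost to rounding
test_count = samples_count - train_count - validation_count

toy = np.arange(samples_count)
# Cut at the two boundaries: end of train, end of validation
train, validation, test = np.split(toy, [train_count, train_count + validation_count])

print(len(train), len(validation), len(test))  # 3579 447 448
```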

In [9]:
# Let's check if the datasets are balanced

print('Train Dataset:', 'Targets 1s:', np.sum(train_targets), 'Num. samples:', train_samples_count, 'Balance %', np.sum(train_targets) / train_samples_count)
print('Validation Dataset:', 'Targets 1s:', np.sum(validation_targets), 'Num. samples:', validation_samples_count, 'Balance %', np.sum(validation_targets) / validation_samples_count)
print('Test Dataset:', 'Targets 1s:', np.sum(test_targets), 'Num. samples:', test_samples_count, 'Balance %', np.sum(test_targets) / test_samples_count)

Train Dataset: Targets 1s: 1787.0 Num. samples: 3579 Balance % 0.4993014808605756
Validation Dataset: Targets 1s: 215.0 Num. samples: 447 Balance % 0.4809843400447427
Test Dataset: Targets 1s: 235.0 Num. samples: 448 Balance % 0.5245535714285714


### Save three datasets in *.npz

In [8]:
# We save each dataset as an .npz file holding two named arrays: inputs and targets
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
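For completeness, here is a minimal sketch of a save-and-load round trip with the same `inputs` and `targets` keys. It writes a toy file to a temporary directory rather than touching the real datasets; the file name `toy_data` and the cast of the targets to integers are assumptions made for illustration.

```python
import os
import tempfile
import numpy as np

toy_inputs = np.random.rand(4, 3)
toy_targets = np.array([0., 1., 1., 0.])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data')  # np.savez appends the '.npz' extension
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # The arrays are retrieved by the keyword names used when saving
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(np.float64)
        loaded_targets = npz['targets'].astype(np.int64)

print(loaded_inputs.shape)  # (4, 3)
print(loaded_targets)       # [0 1 1 0]
```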