# Audiobooks business case - Preprocessing

The data comes from an audiobook app, so it relates only to the audio versions of books. Every customer in the database has made at least one purchase, which is why they appear in it. The goal is to build a machine learning model on the available data that predicts whether a customer will buy again from the Audiobook company. Before that, the data has to be preprocessed.

The Audiobook data has several columns: Customer ID, Book length in mins_avg (average of all purchases), Book length in minutes_sum (sum of all purchases), Price Paid_avg (average of all purchases), Price paid_sum (sum of all purchases), Review (a Boolean variable), Review (out of 10), Total minutes listened, Completion (from 0 to 1), Support requests (number), and Last visited minus purchase date (in days). In the CSV, the first column is the customer ID and the last column holds the targets.

In this notebook, the dataset is preprocessed: it is balanced, standardized and shuffled. It is then split into training, validation and test sets, which are exported as .npz files.


### Extract the data from the csv

In [1]:
# Import relevant libraries
import numpy as np
from sklearn import preprocessing

# Load the data
raw_csv_data = np.loadtxt('data/Audiobooks_data.csv', delimiter=',')

# Select all columns in the csv, except for the first one [:,0]
# (which is customer IDs column), and the last one [:,-1] (which is the targets)
unscaled_inputs_all = raw_csv_data[:,1:-1]

# Select the targets column (which is the last one)
targets_all = raw_csv_data[:,-1]
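
To illustrate the column slicing above, here is a minimal sketch on a tiny in-memory CSV. The values and the `csv_text` name are made up for illustration and are not taken from `Audiobooks_data.csv`; only the layout (ID first, targets last) mirrors the real file.

```python
import io
import numpy as np

# Hypothetical two-row CSV: customer ID, two feature columns, target
csv_text = "1,10.0,5.0,0\n2,20.0,7.5,1\n"

# np.loadtxt also accepts file-like objects, so no file on disk is needed
raw = np.loadtxt(io.StringIO(csv_text), delimiter=',')

# Drop the first column (IDs) and the last column (targets)
inputs = raw[:, 1:-1]

# Keep only the last column as the targets
targets = raw[:, -1]

print(inputs.shape, targets.shape)
```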

### Balance the dataset

In [2]:
# Determine the number of ones in the targets data
num_one_targets = int(np.sum(targets_all))

# Set a counter for targets that are 0
zero_targets_counter = 0

# Create a "balanced" dataset by removing some input/target pairs
# Collect the row indices to remove in this list
indices_to_remove = []

# Count the targets that are 0
# Once there are as many 0s as 1s, mark every further 0-target row for removal
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# Create two new variables, one that will contain the inputs, and one that will contain the targets
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
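
The loop above keeps the first `num_one_targets` rows whose target is 0 and removes every later one. As a sketch of the same idea without an explicit loop, the indices can be obtained with `np.flatnonzero`. The toy arrays below are assumptions made for illustration, not the real data.

```python
import numpy as np

# Toy stand-ins for targets_all and unscaled_inputs_all (hypothetical values)
targets = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])
inputs = np.arange(targets.size * 2).reshape(-1, 2)

# Number of targets equal to 1
num_ones = int(np.sum(targets))

# Positions of all 0-targets, in their original order
zero_positions = np.flatnonzero(targets == 0)

# Keep the first num_ones zeros, mark the remaining ones for removal
indices_to_remove = zero_positions[num_ones:]

balanced_inputs = np.delete(inputs, indices_to_remove, axis=0)
balanced_targets = np.delete(targets, indices_to_remove, axis=0)

print(balanced_targets)  # [0 1 0 0 1 1]
```

After balancing, half of the remaining targets are 1 and half are 0, which is what the name `targets_equal_priors` refers to.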

### Standardize the inputs

In [3]:
# Standardize the balanced inputs
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
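
With its default settings, `preprocessing.scale` centers each column to zero mean and scales it to unit variance. The sketch below reproduces that arithmetic with plain NumPy on a made-up array, to show what "standardized" means here; it is an illustration, not a replacement for the call above.

```python
import numpy as np

# Hypothetical inputs: two features on very different scales
x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Column-wise mean and population standard deviation (ddof=0)
col_mean = x.mean(axis=0)
col_std = x.std(axis=0)

# Standardize: subtract the mean, divide by the standard deviation
scaled = (x - col_mean) / col_std

# Each column now has mean ~0 and standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```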

### Shuffle the data

In [4]:
# Shuffle the indices of the data
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Use the shuffled indices to shuffle the inputs and targets
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
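
The shuffle above gives a different order on every run. If reproducible splits are wanted, one option is a seeded NumPy `Generator`. This is a sketch under that assumption; the seed value 42 and the toy arrays are arbitrary and not part of the original notebook.

```python
import numpy as np

# Seeded random generator so the permutation is the same on every run
rng = np.random.default_rng(seed=42)

# Toy data: row i of inputs is [2i, 2i+1] and its target is i
inputs = np.arange(10).reshape(5, 2)
targets = np.arange(5)

# One permutation of the row indices, applied to both arrays
perm = rng.permutation(inputs.shape[0])
shuffled_inputs = inputs[perm]
shuffled_targets = targets[perm]

# Inputs and targets stay aligned because they share the same permutation
print(shuffled_inputs[:, 0] // 2)
print(shuffled_targets)
```

Indexing both arrays with the same index array is what keeps each input row paired with its own target.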

### Split the dataset into train, validation, and test

In [5]:
# Count the total number of samples
samples_count = shuffled_inputs.shape[0]

# Count the samples in each subset, using an 80-10-10 split into training, validation, and test
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

# The 'test' dataset contains all remaining data
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Create variables that record the inputs and targets for training
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation
validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test
test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# For each subset (training, validation, test), print the number of targets equal to 1, the number of samples, and the proportion of 1s
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1757.0 3579 0.49091925118748253
241.0 447 0.5391498881431768
239.0 448 0.5334821428571429
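
The subset sizes printed above follow from the 80-10-10 arithmetic. The sketch below takes the total from that output (3579 + 447 + 448 samples) and repeats the calculation on a plain index array, which also shows that the three slices cover every sample exactly once.

```python
import numpy as np

# Total number of samples, taken from the printed subset sizes
samples_count = 3579 + 447 + 448

# int() truncates, so the test set absorbs the rounding remainder
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Same slicing pattern as in the cell above, applied to row indices
indices = np.arange(samples_count)
train_idx = indices[:train_samples_count]
validation_idx = indices[train_samples_count:train_samples_count + validation_samples_count]
test_idx = indices[train_samples_count + validation_samples_count:]

print(train_samples_count, validation_samples_count, test_samples_count)  # 3579 447 448
```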


### Save the three datasets in *.npz

In [6]:
# Save the three datasets in *.npz
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
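
`np.savez` appends the `.npz` extension when the file name does not already end with it, and the arrays can be read back by the keyword names used when saving. The sketch below shows the round trip in a temporary directory; the file name `example_train` and the array values are hypothetical, and casting the targets to integers is an optional choice rather than something the notebook requires.

```python
import os
import tempfile
import numpy as np

# Hypothetical arrays standing in for train_inputs and train_targets
inputs = np.array([[0.1, -1.2], [0.5, 0.3]])
targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example_train')

    # Saved as example_train.npz (the extension is added automatically)
    np.savez(path, inputs=inputs, targets=targets)

    # Load the archive and read each array by its keyword name
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(np.float64)
        loaded_targets = npz['targets'].astype(np.int64)

print(loaded_inputs.shape, loaded_targets)
```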