Problem: You are given data from an audiobook app. Each customer in the database has made at least one purchase during a 2-year period. We want to build a machine learning model from this data that predicts whether a customer will buy again from the audiobook company.

The business case action plan:
1. Preprocess the data
   1. Balance the dataset
   2. Divide the dataset into training, validation and test sets (to help prevent overfitting)
   3. Save the data in a tensor-friendly format (.npz)
2. Create the machine learning algorithm (same structure, different model)

## Extract the data from the csv

In [9]:
import numpy as np
from sklearn import preprocessing

In [7]:
path='C:/Users/Mendes/Desktop/The Data Science Course 2021 - All Resources/Part_7_Deep_Learning/S51_L352/'
raw_csv_data = np.loadtxt(path+'Audiobooks_data.csv', delimiter=',')
# Columns: Customer ID, Book length (mins)_overall, Book length (mins)_avg, Price_overall, Price_avg, Review (yes=1, no=0),
# Review 10/10, Minutes listened, Completion, Support Requests, Last visited minus Purchase date,
# Targets (whether this person bought another book in the 6 months following these 2 years)

unscaled_inputs_all=raw_csv_data[:,1:-1]  # the customer ID doesn't matter to us, and the targets are stored separately
targets_all=raw_csv_data[:,-1]

## Balance the dataset

Balancing the dataset: when one class has far more samples than the other, the model can reach a high accuracy simply by always predicting the majority class, without learning anything useful. So it is important that the data contains approximately the same number of samples for each target value.
1. We will count the number of targets that are 1s (there are fewer 1s than 0s)
2. We will keep as many 0s as 1s (we will delete the others)

In [20]:
num_one_targets=int(np.sum(targets_all))  #count the targets that are 1s
zero_targets_counter=0
indices_to_remove=[]   #list to record the indices to be removed

# targets_all.shape[0] is the length of the targets vector
for i in range(targets_all.shape[0]):                 # iterate over the dataset to balance it
    if targets_all[i] == 0:
        zero_targets_counter += 1                     # increase the zeros count by 1 if the target is 0
        # if the target at position i is 0 and we have already seen more 0s than
        # there are 1s, take note of that index
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)               # record the index of a data point to be removed

unscaled_inputs_equal_priors=np.delete(unscaled_inputs_all,indices_to_remove,axis=0)  #deletes from the inputs
targets_equal_priors=np.delete(targets_all,indices_to_remove,axis=0)                  #deletes from the targets
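
The same balancing rule (keep every 1, keep only the first `num_one_targets` zeros) can also be sketched without an explicit loop. This is a minimal illustration on made-up toy arrays; `toy_inputs`, `toy_targets` and the other names below are hypothetical and stand in for `unscaled_inputs_all` and `targets_all`.

```python
import numpy as np

# Toy data standing in for unscaled_inputs_all / targets_all (made-up values).
toy_inputs = np.arange(20).reshape(10, 2)
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])

num_ones = int(np.sum(toy_targets))          # number of targets that are 1s

# Positions of all zero-targets, in their original order.
zero_positions = np.flatnonzero(toy_targets == 0)

# Every zero after the first `num_ones` zeros is surplus and gets removed.
surplus = zero_positions[num_ones:]

balanced_inputs = np.delete(toy_inputs, surplus, axis=0)
balanced_targets = np.delete(toy_targets, surplus, axis=0)

print(balanced_targets)                      # remaining targets, half 0s and half 1s
```

Like the loop above, this keeps the earliest zeros and drops the later ones, so the result depends on the row order of the original file.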

## Standardize the inputs

In [23]:
scaled_inputs=preprocessing.scale(unscaled_inputs_equal_priors) #from sklearn library
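
To make the standardization step concrete, here is a small NumPy-only sketch of what it amounts to, assuming `preprocessing.scale` is called with its default arguments: each column is centered on its mean and divided by its standard deviation. The array `toy` and its values are made up for illustration.

```python
import numpy as np

# Made-up feature matrix: 3 samples, 2 features on very different scales.
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

col_mean = toy.mean(axis=0)                  # per-column mean
col_std = toy.std(axis=0)                    # per-column standard deviation (ddof=0)

standardized = (toy - col_mean) / col_std    # zero mean, unit variance per column

print(standardized.mean(axis=0))             # approximately 0 for every column
print(standardized.std(axis=0))              # approximately 1 for every column
```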

## Shuffle the data

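We shuffle the inputs and the targets together, using the same random order for both. This keeps the same information but removes any ordering in the file: the original dataset may have been collected in date order, and without shuffling the split below would simply cut the data into consecutive time periods.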

In [28]:
shuffled_indices=np.arange(scaled_inputs.shape[0]) #create a vector of indices with the length of scaled_inputs
np.random.shuffle(shuffled_indices)                #shuffle the vector

shuffled_inputs=scaled_inputs[shuffled_indices]    # create the variables with inputs and targets with the shuffled indices
shuffled_targets=targets_equal_priors[shuffled_indices]
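
The shuffle above gives a different order on every run. If a reproducible split is wanted, one option is a seeded generator; the sketch below uses made-up toy arrays, and the seed value 42 is an arbitrary assumption, not something taken from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)              # arbitrary seed, chosen only for illustration

# Toy data: row i of the inputs starts with 2*i, and its target is i % 2.
toy_inputs = np.arange(12).reshape(6, 2)
toy_targets = np.array([0, 1, 0, 1, 0, 1])

# One permutation of the row indices, applied to both arrays.
perm = rng.permutation(toy_inputs.shape[0])

shuffled_toy_inputs = toy_inputs[perm]
shuffled_toy_targets = toy_targets[perm]

# Each input row is still paired with its own target after shuffling.
original_row = shuffled_toy_inputs[:, 0] // 2
print(np.array_equal(original_row % 2, shuffled_toy_targets))
```

Indexing both arrays with the same permutation is what keeps inputs and targets aligned; shuffling them separately would break the pairing.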

## Split the dataset into train, validation and test

In [31]:
samples_count=shuffled_inputs.shape[0]  #count the total number of samples

#determine the size of each subset (80% train, 10% validation, the remainder for test)
train_samples_count=int(0.8*samples_count)  
validation_samples_count=int(0.1*samples_count)
test_samples_count=samples_count-train_samples_count-validation_samples_count

#extract them
train_inputs=shuffled_inputs[:train_samples_count]
train_targets=shuffled_targets[:train_samples_count]

validation_inputs=shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets=shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

test_inputs=shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets=shuffled_targets[train_samples_count+validation_samples_count:]

#it is useful to check that each subset is still balanced (about 50% of the targets should be 1s)
print(np.sum(train_targets), train_samples_count, np.sum(train_targets)/train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets)/validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets)/test_samples_count)

1800.0 3579 0.5029337803855826
216.0 447 0.48322147651006714
221.0 448 0.49330357142857145


## Save the three datasets in *.npz

In [32]:
np.savez('Audiobooks_data_train',inputs=train_inputs,targets=train_targets)
np.savez('Audiobooks_data_validation',inputs=validation_inputs,targets=validation_targets)
np.savez('Audiobooks_data_test',inputs=test_inputs,targets=test_targets)
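
For the modelling step, the saved files can be read back with `np.load`, using the same keyword names (`inputs`, `targets`) that were passed to `np.savez`. The following is a minimal round-trip sketch on made-up toy arrays written to a temporary directory; the file name `toy_data` and the dtype conversions are assumptions for illustration.

```python
import os
import tempfile
import numpy as np

# Made-up arrays standing in for one of the three datasets.
toy_inputs = np.random.rand(5, 3)
toy_targets = np.array([0.0, 1.0, 0.0, 1.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, 'toy_data')

    # np.savez appends the .npz extension when it is missing.
    np.savez(file_path, inputs=toy_inputs, targets=toy_targets)

    # The arrays are retrieved by the keyword names used when saving.
    with np.load(file_path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(float)
        loaded_targets = npz['targets'].astype(int)

print(loaded_inputs.shape, loaded_targets)
```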