# Practical example. Audiobooks

## Preprocess the data. Balance the dataset. Create 3 datasets: training, validation and test. Save the newly created sets in a tensor friendly format (e.g. *.npz)

Since we are dealing with real-life data, we need to preprocess it first. The code below is not difficult, but it is data engineering rather than machine learning.

If you want to know how it works, go through the code and the comments. In any case, the same steps should work for any dataset organized as many input columns followed by one column containing the targets (that is, supervised learning datasets).

Note that we have removed the header row, which contains the names of the categories. We simply want the data. 

## Extract the data from the csv

In [1]:
import numpy as np
from sklearn import preprocessing ## used to standardize the inputs; accuracy drops by roughly 10% otherwise
raw_csv_data = np.loadtxt("Audiobooks_data.csv", delimiter = ",")
unscaled_inputs_all = raw_csv_data[:,1:-1] ## takes all data except first and last columns (Id and target columns)
targets_all = raw_csv_data[:,-1] ## creates variable for just targets
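
The column slicing above can be checked on a tiny in-memory CSV. This is a minimal sketch, assuming a hypothetical three-row file with the same layout (ID, inputs, target); the column names and values are made up. `skiprows=1` is only needed when the header row is still present, which is not the case for the file used here.

```python
import io
import numpy as np

# Hypothetical stand-in for the real file: ID, two inputs, target.
csv_text = "id,minutes,price,target\n101,120,5.5,0\n102,300,9.0,1\n103,45,2.0,0\n"

# skiprows=1 drops the header line; omit it when the header is already removed.
raw = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

inputs = raw[:, 1:-1]   # every column except the first (ID) and the last (target)
targets = raw[:, -1]    # the last column only

print(inputs.shape, targets.shape)
```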

## Shuffle data

In [2]:
#shuffled_indices = np.arange(unscaled_inputs_all.shape[0])
#np.random.shuffle(shuffled_indices) ## put indices in variable and shuffle them 

#shuffled_inputs = unscaled_inputs_all[shuffled_indices]
#shuffled_targets = targets_all[shuffled_indices] ## indices are shuffled indices

## Balance the dataset

In [3]:
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = [] ## rows to drop; start with an empty list and append to it in the loop

for i in range(targets_all.shape[0]): ## shape[0] is the length of the targets vector, i.e. the number of samples
    if targets_all[i] == 0:
        zero_targets_counter += 1
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)            ## once as many 0's as 1's have been kept, mark every further 0 for removal
            
            
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all,indices_to_remove,axis=0)
# drops the rows listed in indices_to_remove from the inputs

targets_equal_priors = np.delete(targets_all,indices_to_remove,axis=0)

In [4]:
## balancing matters: with many more cats than dogs (or 0's than 1's), the model could score well by always predicting the majority class
## so we count the targets that are 1's and keep only as many 0's, deleting the rest
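
The same balancing rule can be written without an explicit loop. This is a sketch only, not the notebook's method: `balance_binary` and the toy arrays are hypothetical names, and it assumes the targets are exactly 0 or 1.

```python
import numpy as np

def balance_binary(inputs, targets):
    """Keep every 1-target row and only the first equally many 0-target rows."""
    targets = np.asarray(targets)
    num_ones = int(np.sum(targets == 1))
    zero_idx = np.flatnonzero(targets == 0)   # positions of all 0-targets, in order
    surplus = zero_idx[num_ones:]             # 0's beyond the first num_ones are dropped
    return np.delete(inputs, surplus, axis=0), np.delete(targets, surplus, axis=0)

toy_inputs = np.arange(12).reshape(6, 2)
toy_targets = np.array([0, 1, 0, 0, 1, 0])
bal_inputs, bal_targets = balance_binary(toy_inputs, toy_targets)

print(bal_targets)
```

As in the loop above, the surplus 0's are removed from the end of the data, so the result still depends on the row order.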

## Standardize the inputs 

In [5]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors) # standardizes each input column (feature) to zero mean and unit variance
# preprocessing was imported from sklearn
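
To see what standardization does to each column, here is a NumPy-only sketch of the computation. It should correspond to `preprocessing.scale` with its default settings (column-wise, population standard deviation), but treat that equivalence as an assumption; `standardize_columns` is a hypothetical helper, not part of sklearn.

```python
import numpy as np

def standardize_columns(x):
    """Scale each column to zero mean and unit variance (population std)."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0.0] = 1.0   # constant columns become all zeros instead of dividing by zero
    return (x - mean) / std

toy = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]])
scaled = standardize_columns(toy)

print(scaled.mean(axis=0), scaled.std(axis=0))
```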

## Shuffle the data again

In [6]:
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices) ## put indices in variable and shuffle them 

shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices] 
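## both arrays are indexed with the same shuffled indices, so inputs and targets stay aligned
## the earlier shuffle (before balancing) is commented out, so this is the shuffle that actually runs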

In [7]:
## inputs and targets now hold the same information, but in random order
## the data must be shuffled well for batching: the original file may be sorted in some order,
## so a batch could be homogeneous inside while differing strongly from the other batches,
## which would confuse stochastic gradient descent when it updates the weights after each batch
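
The key point of the shuffle is that inputs and targets are reordered with the same indices. A small sketch on toy data, where each target equals the first input column so the alignment is easy to verify; the fixed seed is an assumption added for reproducibility (the notebook does not set one).

```python
import numpy as np

rng = np.random.default_rng(42)   # seed is an assumption; the notebook shuffles unseeded

toy_inputs = np.array([[0, 0], [1, 10], [2, 20], [3, 30]])
toy_targets = np.array([0, 1, 2, 3])

perm = rng.permutation(toy_inputs.shape[0])   # one random ordering of the row indices
shuffled_toy_inputs = toy_inputs[perm]        # apply it to the inputs...
shuffled_toy_targets = toy_targets[perm]      # ...and the same ordering to the targets

# Every row still sits next to its own target.
print((shuffled_toy_inputs[:, 0] == shuffled_toy_targets).all())
```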

## Split the dataset into train, validation and test

In [8]:
samples_count = shuffled_inputs.shape[0] ## going to split dataset into 3 datasets. 80,10,10 split
train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count] # the first train_samples_count inputs
train_targets = shuffled_targets[:train_samples_count] # the first train_samples_count targets

validation_inputs = shuffled_inputs[train_samples_count: train_samples_count + validation_samples_count]
# from train_samples_count up to train_samples_count + validation_samples_count
validation_targets = shuffled_targets[train_samples_count: train_samples_count + validation_samples_count]
# the same slice, but for the targets

test_inputs = shuffled_inputs[train_samples_count + validation_samples_count:] 
# everything left after validation subset
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]

print(np.sum(train_targets),train_samples_count,np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets),validation_samples_count,np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets),test_samples_count,np.sum(test_targets) / test_samples_count)
## checks that the training, validation and test sets are each balanced, not just the dataset as a whole
## prints the number of 1's, the number of samples, and the proportion of 1's, which should be around 50% in each

1781.0 3579 0.49762503492595694
231.0 447 0.5167785234899329
225.0 448 0.5022321428571429
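
The split arithmetic can be wrapped in a small helper and tried on toy data. This is a sketch; `split_80_10_10` is a hypothetical name, and it assumes the rows have already been shuffled.

```python
import numpy as np

def split_80_10_10(inputs, targets):
    """Split already-shuffled data into train, validation and test parts."""
    n = inputs.shape[0]
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = (inputs[:n_train], targets[:n_train])
    val = (inputs[n_train:n_train + n_val], targets[n_train:n_train + n_val])
    test = (inputs[n_train + n_val:], targets[n_train + n_val:])   # whatever is left
    return train, val, test

toy_inputs = np.arange(46).reshape(23, 2)
toy_targets = np.arange(23) % 2
train, val, test = split_80_10_10(toy_inputs, toy_targets)
sizes = [part[0].shape[0] for part in (train, val, test)]

print(sizes)
```

Because `int()` rounds down, the test set absorbs the remainder. With the 4474 balanced samples here, that gives 3579, 447 and 448 rows, matching the counts printed above.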


## Save the 3 datasets in *.npz

In [9]:
np.savez('Audiobooks_data_train',inputs = train_inputs, targets = train_targets)
np.savez('Audiobooks_data_validation',inputs = validation_inputs,targets = validation_targets)
np.savez('Audiobooks_data_test',inputs = test_inputs,targets = test_targets)
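
For completeness, a sketch of the round trip: saving arrays with `np.savez` and reading them back with `np.load`. The file name `toy_data` and the temporary directory are assumptions for illustration; the keyword names (`inputs`, `targets`) become the keys of the loaded file.

```python
import os
import tempfile
import numpy as np

toy_inputs = np.array([[0.1, 0.2], [0.3, 0.4]])
toy_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_data")   # np.savez appends ".npz" to the name
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    with np.load(path + ".npz") as npz:
        loaded_inputs = npz["inputs"].astype(float)
        loaded_targets = npz["targets"].astype(int)   # class labels as integers

print(loaded_inputs.shape, loaded_targets)
```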

In [10]:
## this approach works for two classes; with more classes, the dataset needs to be balanced for each class
## the proportions change on every run because the shuffle is random
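
For more than two classes, one possible approach is to downsample every class to the size of the rarest one. This is a sketch under that assumption, not the notebook's method: `balance_classes` is a hypothetical helper, and unlike the loop used earlier it keeps a random subset of each class rather than the first rows.

```python
import numpy as np

def balance_classes(inputs, targets, seed=0):
    """Downsample every class to the size of the rarest class."""
    rng = np.random.default_rng(seed)
    targets = np.asarray(targets)
    classes, counts = np.unique(targets, return_counts=True)
    keep_per_class = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(targets == c), size=keep_per_class, replace=False)
        for c in classes
    ])
    keep.sort()   # preserve the original row order
    return inputs[keep], targets[keep]

toy_targets = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2])
toy_inputs = np.arange(toy_targets.size).reshape(-1, 1)
bal_inputs, bal_targets = balance_classes(toy_inputs, toy_targets)

print(np.unique(bal_targets, return_counts=True))
```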

In [11]:
## 50 is a reasonable starting size for the hidden layers:
## not so large that it slows training while we check whether the model learns anything,
## and not so small that the model lacks the capacity to fit the data
## dropping the first (ID) column and the targets column leaves 10 input columns
## there are 2 outputs, one per class (0 or 1)
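
To make these sizes concrete, here is a NumPy sketch of a forward pass with 10 inputs, 50 hidden units and 2 outputs. The single hidden layer, the ReLU activation, the softmax output and the random weights are all assumptions for illustration; the notes above only fix the three sizes.

```python
import numpy as np

input_size, hidden_size, output_size = 10, 50, 2

rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.1, size=(input_size, hidden_size))
b1 = np.zeros(hidden_size)
w2 = rng.normal(scale=0.1, size=(hidden_size, output_size))
b2 = np.zeros(output_size)

def forward(x):
    """Map a batch of inputs to class probabilities (untrained, random weights)."""
    hidden = np.maximum(0.0, x @ w1 + b1)                       # ReLU hidden layer
    logits = hidden @ w2 + b2
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))    # numerically stable softmax
    return exp / exp.sum(axis=1, keepdims=True)

batch = rng.normal(size=(4, input_size))   # 4 samples with 10 standardized inputs each
probs = forward(batch)
num_parameters = w1.size + b1.size + w2.size + b2.size

print(probs.shape, num_parameters)
```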