#### Business Case

Given data from an audiobook app, we want to create an ML algorithm that predicts whether a customer will buy again from the audiobook company.

The main idea is that the company should not spend its advertising budget targeting individuals who are unlikely to come back.

The input data covers 2 years' worth of engagement, plus a further 6 months of data used to check conversion.

Targets: 
- 1 if a customer bought again in the last 6 months of data.
- 0 if a customer did not buy again.

##### Task

Create an ML algorithm that can predict if a customer will buy again.

This is a classification problem with 2 classes, "won't buy" and "will buy", represented by 0 and 1.

#### The Business Case Action Plan

1. Preprocess the data.
    
    This is done in 3 steps:
    
    i. Balance the dataset
    
    ii. Divide the dataset into 3 parts - training, validation and test.
    
    iii. Save the dataset in a tensor-friendly format.
    
2. Create the Machine Learning algorithm.

#### Balancing the dataset

Why balancing the dataset matters.

The initial probability of picking an observation from one of the 2 categories is referred to as a prior. The priors are balanced when the 2 categories account for 50% each.

Examples of unbalanced priors are:
- 90% and 10%
- 70% and 30%
- 60% and 40%

With more classes, balanced priors are:
- 3 classes: about 33% each
- 4 classes: 25% each

90% accuracy is an impressive accomplishment for most problems. With priors of 90% and 10%, however, a model that always outputs the majority class is already 90% accurate without having learned anything. In that case only a result above 90% is a favourable one.
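
To illustrate why accuracy can mislead when the priors are unbalanced, here is a minimal sketch. The targets are a made-up toy array with 90% zeros and 10% ones (not the audiobook data), and the "model" is hypothetical: it simply predicts 0 for everyone.

```python
import numpy as np

# Hypothetical toy targets: 90 zeros and 10 ones (not the audiobook data).
targets = np.array([0] * 90 + [1] * 10)

# A "model" that ignores its inputs and always predicts the majority class.
always_zero_predictions = np.zeros_like(targets)

# Share of all predictions that are correct.
accuracy = np.mean(always_zero_predictions == targets)

# Share of the actual 1s that the model identified as 1.
recall_for_ones = np.mean(always_zero_predictions[targets == 1] == 1)

print(accuracy)         # 0.9
print(recall_for_ones)  # 0.0
```

The accuracy looks impressive, yet the model never finds a single returning customer. With balanced priors, the same trivial model would score only 50%.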

### Preprocessing the data

#### Practical example. Audiobooks

Preprocess the data: balance the dataset, create 3 datasets (training, validation and test), and save the newly created sets in a tensor-friendly format (e.g. *.npz).

##### Extract the data from the csv

In [3]:
pip install -U scikit-learn

Successfully installed joblib-1.2.0 scikit-learn-1.2.1 threadpoolctl-3.1.0
Note: you may need to restart the kernel to use updated packages.


In [11]:
import numpy as np
from sklearn import preprocessing

# We will use sklearn's preprocessing module to standardize the inputs
# Load the CSV file
raw_csv_data = np.loadtxt('Audiobooks_data.csv', delimiter = ',')

# Inputs: every column except the first and the last
unscaled_inputs_all = raw_csv_data[:, 1:-1]
# Targets: the last column
targets_all = raw_csv_data[:, -1]
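
The two slices above can be checked on a small array. This is a sketch on made-up data: the column layout (an identifier first, the target last) is an assumption for illustration, and the `toy_` names are hypothetical.

```python
import numpy as np

# Hypothetical 3-row table: [customer ID, feature A, feature B, target].
toy_data = np.array([
    [101, 1.5, 20.0, 0],
    [102, 2.5, 35.0, 1],
    [103, 0.5, 10.0, 0],
])

toy_inputs = toy_data[:, 1:-1]   # every row, all columns except the first and the last
toy_targets = toy_data[:, -1]    # every row, the last column only

print(toy_inputs.shape)   # (3, 2)
print(toy_targets)        # [0. 1. 0.]
```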

##### Balance the dataset
1. We will count the number of targets that are 1s - if we sum all the targets, we will get the number of targets that are 1s.
2. We will keep as many 0s as 1s (we will delete the others).

In [13]:
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:  # count every 0 target we encounter
        zero_targets_counter += 1
        # once we have seen more 0s than there are 1s, mark the extra 0s for removal
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

# np.delete removes the rows (axis = 0) at the given indices
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis = 0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis = 0)

#We have a balanced dataset
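
The same balancing can also be written without an explicit loop. The following is a sketch of one possible alternative on made-up toy data, not the method used in this notebook. Like the loop, it keeps all the 1s and only the first zeros it meets, as many as there are 1s.

```python
import numpy as np

# Hypothetical toy data: 6 zeros and 2 ones.
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0, 0])
toy_inputs = np.arange(16).reshape(8, 2)

one_indices = np.flatnonzero(toy_targets == 1)
zero_indices = np.flatnonzero(toy_targets == 0)

# Keep only as many zeros as there are ones (the first ones encountered).
kept_zero_indices = zero_indices[:len(one_indices)]

# Sort so that the kept rows stay in their original order.
kept_indices = np.sort(np.concatenate([one_indices, kept_zero_indices]))

balanced_inputs = toy_inputs[kept_indices]
balanced_targets = toy_targets[kept_indices]

print(kept_indices)       # [0 1 2 4]
print(balanced_targets)   # [0 1 0 1]
```

Note that both versions discard the later zeros rather than a random sample of them, which matters if the rows are ordered in some meaningful way.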

##### Standardize the inputs

In [14]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors) # standardizes each column (feature) to zero mean and unit variance
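
To make the standardization concrete, here is a NumPy-only sketch of what it amounts to: subtract each column's mean and divide by each column's standard deviation. To my understanding this matches the default behaviour of `preprocessing.scale`, but treat that equivalence as an assumption. The toy array is made up.

```python
import numpy as np

# Hypothetical toy inputs with two features on very different scales.
toy_inputs = np.array([
    [1.0, 100.0],
    [2.0, 200.0],
    [3.0, 300.0],
])

column_means = toy_inputs.mean(axis=0)
column_stds = toy_inputs.std(axis=0)   # population standard deviation (ddof=0)

toy_scaled = (toy_inputs - column_means) / column_stds

print(toy_scaled.mean(axis=0))   # approximately 0 for each column
print(toy_scaled.std(axis=0))    # approximately 1 for each column
```

After scaling, both features are on the same scale, so neither dominates simply because of its units.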

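##### Shuffle the data

We shuffle the inputs and the targets using the same indices, so each row keeps its own target. The information is unchanged, only the order is random. We must shuffle the data because we will be batching: if the rows are ordered in some way (for example by date), individual batches would not be representative of the whole dataset.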

In [15]:
shuffled_indices = np.arange(scaled_inputs.shape[0])  # np.arange returns evenly spaced values; here the indices 0, 1, ..., n-1
np.random.shuffle(shuffled_indices)  # np.random.shuffle shuffles the sequence in place

shuffled_inputs = scaled_inputs[shuffled_indices] 
shuffled_targets = targets_equal_priors[shuffled_indices]
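
The point of shuffling an index array, rather than the data itself, is that inputs and targets stay aligned. The sketch below shows this on made-up toy data. It uses `np.random.default_rng` with a fixed seed purely so the example is repeatable, which is a choice of this sketch and not something the notebook does.

```python
import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed only to make the sketch repeatable

# Hypothetical toy data: the target k belongs to the row that starts with k * 10.
toy_inputs = np.array([[10, 11], [20, 21], [30, 31], [40, 41]])
toy_targets = np.array([1, 2, 3, 4])

# One random ordering of the row indices, applied to both arrays.
permutation = rng.permutation(toy_inputs.shape[0])

toy_shuffled_inputs = toy_inputs[permutation]
toy_shuffled_targets = toy_targets[permutation]

# Every row still sits next to its own target.
print(np.all(toy_shuffled_inputs[:, 0] == toy_shuffled_targets * 10))   # True
```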

##### Split the dataset into train, validation and test

In [16]:
samples_count = shuffled_inputs.shape[0]

# Determine the size of each dataset. We'll be using an 80-10-10 split.
train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Extract them from the big dataset
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count : train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count : train_samples_count + validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]

# It is useful to check that each dataset is balanced:
# print the number of 1s, the number of samples, and the share of 1s
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

1801.0 3579 0.5032131880413523
226.0 447 0.5055928411633109
210.0 448 0.46875


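Explaining the result:

The training set is considerably larger than the validation and test sets, just the way we want it.

The priors look OK: the share of 1s is about 50% in the training set, 51% in the validation set and 47% in the test set, so all three sets are roughly balanced. Because the shuffle is random, these numbers will differ slightly from run to run.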

##### Save the three datasets in *.npz

In [17]:
np.savez('Audiobooks_data_train', inputs = train_inputs, targets = train_targets)
np.savez('Audiobooks_data_validation', inputs = validation_inputs, targets = validation_targets)
np.savez('Audiobooks_data_test', inputs = test_inputs, targets = test_targets)
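
As a quick sanity check, a saved file can be loaded back with `np.load`. The sketch below round-trips a made-up toy array through a temporary directory. The file name `toy_data_train` is hypothetical. Note that `np.savez` appends the `.npz` extension when the name lacks one, so the file to load is `Audiobooks_data_train.npz` and so on.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for train_inputs and train_targets.
toy_inputs = np.array([[0.1, -0.2], [1.3, 0.4]])
toy_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_data_train')
    np.savez(path, inputs=toy_inputs, targets=toy_targets)   # writes toy_data_train.npz

    # The arrays are stored under the keyword names used in np.savez.
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs']
        loaded_targets = npz['targets']

print(np.array_equal(loaded_inputs, toy_inputs))     # True
print(np.array_equal(loaded_targets, toy_targets))   # True
```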