# Business Case Example

You are given data from an audiobook app. It covers audio versions of books only. Every customer in the database has made at least one purchase; that is the condition for inclusion.

We want to create an ML algorithm based on our data that predicts whether a customer will buy again from the audiobook company.

The idea is that the audiobook company shouldn't spend its advertising budget targeting individuals who are unlikely to come back.

If you focus marketing spend on customers who are likely to return and continue spending, you can improve sales and revenue figures.

Our model will take several metrics and try to predict human behaviour. A side benefit is that it will indicate which metrics matter most in predicting whether a customer comes back.

Having the data and the technology to identify prospective customers creates a lot of value and growth opportunities. It is one of the better applications of data science.

The data is a CSV file, with each row representing one customer. The columns are:

- Customer ID: an identifier; it works like a name.
- Book length (overall) is the sum of the lengths of all purchases.
- Book length (average) is the average length per book.
- Number of purchases is not given explicitly, but can be calculated from the book length variables if required (overall length divided by average length).
- Price (overall and average) works in the same way as book length. NB: price is almost always a good predictor.
- Review is a boolean that shows whether the customer left a review. It shows engagement with the platform.
- Review 10/10 measures the customer's review on a scale of 1-10.
- The two review columns call for an early preprocessing step. If a customer has not left a review, they have a bool value of 0 and no entry for the 10/10 rating. It is possible to combine the two fields. One way is to fill in the missing values with the average review score of the customers who did leave one. That average represents the status quo of book rating, so a review above it indicates above-average feelings towards content on the platform.
- Minutes listened is a measure of engagement.
- Completion is total minutes listened / book length (overall).
- Support requests is the total number of support requests a customer has made (another measure of engagement).
- Last interaction vs first purchase date: the bigger the difference, the bigger the engagement. If the value is 0, the customer has either never accessed what they bought or used it on the first day only.
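The review preprocessing and the number-of-purchases calculation described in the list can be sketched as follows. This is only an illustration: the array names and the toy values are made up, and the real CSV may encode a missing review differently.

```python
import numpy as np

# Hypothetical toy columns: 1/0 review flag and a 1-10 score (NaN = no review left).
left_review = np.array([1, 0, 1, 0, 1])
review_score = np.array([8.0, np.nan, 10.0, np.nan, 6.0])

# Average score over the customers who actually left a review.
mean_score = review_score[left_review == 1].mean()

# Keep real scores, fill the missing ones with that average.
review_filled = np.where(left_review == 1, review_score, mean_score)

# Number of purchases recovered from the two book length columns (toy values).
length_overall = np.array([2160.0, 1620.0, 648.0, 1188.0, 2160.0])
length_average = np.array([2160.0, 1620.0, 324.0, 594.0, 1080.0])
purchases = length_overall / length_average

print(review_filled)
print(purchases)
```

After filling, a score above `mean_score` can be read as an above-average opinion and a score below it as a below-average one.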

It is necessary to ask how the data was gathered. This data was gathered from the audiobook app and represents 2 years' worth of engagement.

We are doing supervised learning, so we need to set targets: 1 if the customer converted, 0 if they did not.

What does it mean to convert?

A customer converted if they bought another book in the 6 months following the 2-year period. We have taken that extra 6 months of data to check this, so overall we have 2 years and 6 months of data.

The inputs are the 2 years of data. The targets record whether the customer bought a book in the following 6 months: a bool value of 1 if they did, 0 if they did not.

That is how the targets column is created. 
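A minimal sketch of how such a targets column could be derived from a purchase log. The dates, customer IDs and array names are all hypothetical; the notes do not say which dates the real data covers or how the raw purchase records are stored.

```python
import numpy as np

# Hypothetical window: the 6 months that follow the 2-year input period.
window_start = np.datetime64("2017-01-01")
window_end = np.datetime64("2017-07-01")

# Hypothetical purchase log: one entry per purchase (customer ID, purchase date).
purchase_customer = np.array([101, 101, 102, 103, 103])
purchase_date = np.array(
    ["2015-03-10", "2017-02-14", "2016-11-30", "2015-06-01", "2017-08-20"],
    dtype="datetime64[D]",
)

# One row per customer in the dataset.
customer_ids = np.array([101, 102, 103])

# A customer converted if at least one of their purchases falls inside the window.
in_window = (purchase_date >= window_start) & (purchase_date < window_end)
converted_ids = np.unique(purchase_customer[in_window])
targets = np.isin(customer_ids, converted_ids).astype(int)

print(targets)
```

Here only customer 101 bought inside the window, so only that customer gets a target of 1.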

We need to create an ML algorithm that can predict if a customer will buy again.



## The Business Case Action Plan

- preprocess the data
    - balance the dataset
    - divide the dataset into training, validation and test sets
    - save the data in a tensor-friendly format
- create the machine learning algorithm



### Balancing the Dataset

What accuracy do you expect from a model that is meant to classify photos of cats and dogs?

- 70% is not too bad
- 80% is good
- 90% is very good for beginners and useful for most problems

Imagine a model that takes animal photos and outputs only cats. No matter what you feed to the algorithm, it will always output cat as the answer. This is a bad model.

Imagine that 90% of the photos in a dataset are cats and 10% are dogs. The model would always output cat. Since 90% of the photos are cats, what is its accuracy? It is 90%.

The ML model is trying to reduce the loss, and if most targets are cats, it is safe to predict that everything is a cat.

On this same dataset, a model with 80% accuracy would be awful: a model that always answered cat would be right 90% of the time, which is much better.
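The 90% figure is easy to check with a toy example. The labels below are made up to match the 90/10 split in the text.

```python
import numpy as np

# Toy dataset: 90 cats (label 1) and 10 dogs (label 0).
labels = np.array([1] * 90 + [0] * 10)

# A "model" that answers cat no matter what it is shown.
always_cat = np.ones_like(labels)

# Share of all photos classified correctly.
accuracy = np.mean(always_cat == labels)

# Share of the dogs that were classified as dogs.
dog_accuracy = np.mean(always_cat[labels == 0] == 0)

print(accuracy, dog_accuracy)
```

The overall accuracy looks respectable, yet not a single dog is recognised, which is why accuracy alone is misleading on an unbalanced dataset.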

The balance of cats and dogs in the original dataset is referred to as the priors. The priors are balanced when 50% of the photos are cats and 50% are dogs.

When the priors are unbalanced, an ML algorithm might quickly learn that one outcome is more likely than the other, and that will skew the results.

With three classes (cats, dogs and horses), balancing would mean each class makes up 1/3 of the dataset, and so on.

By exploring the dataset in this case study, you can see that most customers did not convert in the following 6 months, which means it is an unbalanced dataset. 

To balance the dataset, count the total number of targets that are 1 and keep only the same number of targets that are 0.

## Practical Example. Audiobooks

#### Preprocess the data. Balance the dataset. Create 3 datasets: training, validation, and test. Save the newly created sets in a tensor friendly format (e.g. *.npz)

### Extract the data from the csv

```python
import numpy as np
from sklearn import preprocessing  # lets us standardise the inputs in one line later on

# Load the raw data; every row is one customer.
raw_csv_data = np.loadtxt('/Users/yafja/Google Drive/Jack/Udemy/DataSci/The Data Science Course 2018 - All Resources/Part_7_Deep_Learning/S51_L356/Audiobooks_data.csv', delimiter=',')

# Inputs: every column except the first (customer ID) and the last (targets).
unscaled_inputs_all = raw_csv_data[:, 1:-1]
# Targets: the last column (1 = converted, 0 = did not convert).
targets_all = raw_csv_data[:, -1]
```

### Balance the dataset

```python
# The targets are 0 or 1, so their sum is the number of 1s.
num_one_targets = int(np.sum(targets_all))
zero_targets_counter = 0
indices_to_remove = []

# targets_all.shape[0] is the length of the targets vector.
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1
        # Once there are as many 0s as 1s, mark every further 0 for removal.
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis=0)
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis=0)
```


### Standardise the inputs

```python
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
```
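For intuition, this is a rough NumPy-only sketch of what standardising each column means. With its default arguments, `preprocessing.scale` centres every column on 0 and scales it to unit variance; the toy array below is made up.

```python
import numpy as np

# Hypothetical toy inputs: two columns on very different scales.
x = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Subtract the column mean and divide by the column standard deviation.
standardised = (x - x.mean(axis=0)) / x.std(axis=0)

print(standardised.mean(axis=0))  # each column mean is (numerically) 0
print(standardised.std(axis=0))   # each column standard deviation is 1
```

After this step both columns are on the same scale, so neither dominates simply because its raw numbers are larger.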

### Shuffle the data

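```python
# Build the row indices 0..N-1 and shuffle them in place.
shuffled_indices = np.arange(scaled_inputs.shape[0])
np.random.shuffle(shuffled_indices)

# Reorder inputs and targets with the same indices so the rows stay aligned.
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
```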

### Split the dataset into train, validation and test

```python
samples_count = shuffled_inputs.shape[0]

# 80% training, 10% validation, and whatever is left (about 10%) for testing.
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)
test_samples_count = samples_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count + validation_samples_count]

test_inputs = shuffled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]

# For each set: number of 1s, total number of samples, and the share of 1s.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)
```

```
1788.0 3579 0.49958088851634536
218.0 447 0.48769574944071586
231.0 448 0.515625
```


All three sets are now roughly balanced: in each of them, close to 50% of the targets are 1s.

### Save the three datasets in *.npz

```python
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
```
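When the model is built later, the saved files have to be read back. A small sketch of the round trip, using made-up toy arrays and a temporary folder rather than the real datasets:

```python
import os
import tempfile

import numpy as np

# Hypothetical toy arrays standing in for train_inputs / train_targets.
toy_inputs = np.array([[0.1, -1.2], [0.4, 0.3], [-0.5, 0.9]])
toy_targets = np.array([1.0, 0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Audiobooks_data_train")
    # np.savez appends the .npz extension to the file name.
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # The keyword names used when saving become the keys when loading.
    with np.load(path + ".npz") as npz:
        loaded_inputs = npz["inputs"].astype(float)
        loaded_targets = npz["targets"].astype(int)

print(loaded_inputs.shape, loaded_targets)
```

Casting the targets to integers on load is a design choice, not a requirement; it is convenient when the targets are later used as class labels.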

The data is now preprocessed. Each time we run the code we will get different training, validation and test sets, because the shuffling is random (the balancing step itself always removes the same rows).
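If repeatable splits are wanted, one option is to shuffle with a seeded random generator. This is a sketch of that idea, not part of the original notebook, and the seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same permutation.
rng = np.random.default_rng(42)
indices_a = rng.permutation(10)

rng_again = np.random.default_rng(42)
indices_b = rng_again.permutation(10)

print(np.array_equal(indices_a, indices_b))
```

Indexing the inputs and the targets with such a permutation would give the same shuffle on every run.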

This code is reusable for any dataset that has two classes.

For a dataset with more than two classes, the only change needed is in the balancing part, which must balance x classes rather than 2.
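One possible way to generalise the balancing step to any number of classes is to cut every class down to the size of the smallest one. The function name `balance_classes` and the toy data are made up for this sketch.

```python
import numpy as np

def balance_classes(inputs, targets):
    """Keep the first n rows of every class, where n is the smallest class count."""
    classes, counts = np.unique(targets, return_counts=True)
    n_keep = counts.min()
    # For each class, take the positions of its first n_keep rows.
    keep = np.concatenate(
        [np.flatnonzero(targets == c)[:n_keep] for c in classes]
    )
    keep.sort()  # preserve the original row order
    return inputs[keep], targets[keep]

# Toy data: class 0 has 4 rows, class 1 has 2 rows, class 2 has 3 rows.
toy_targets = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2])
toy_inputs = np.arange(len(toy_targets)).reshape(-1, 1)

balanced_inputs, balanced_targets = balance_classes(toy_inputs, toy_targets)
print(balanced_targets)
```

With two classes this keeps the same rows as the loop-based approach whenever the 1s are the smaller class.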

It would make more sense to shuffle the data before balancing it, so that the removed 0s are not always the ones that come last in the file. The code could be changed accordingly.
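A rough sketch of that reordering for the two-class case: shuffle first, then drop the surplus 0s. The toy arrays and the seed are hypothetical, and the loop from the notebook is replaced here by an equivalent index-based selection.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only to make the sketch repeatable

# Toy data: seven 0s and three 1s.
toy_targets = np.array([0, 0, 0, 1, 0, 0, 1, 0, 0, 1])
toy_inputs = np.arange(len(toy_targets)).reshape(-1, 1)

# Step 1: shuffle inputs and targets with the same permutation.
order = rng.permutation(len(toy_targets))
mixed_inputs = toy_inputs[order]
mixed_targets = toy_targets[order]

# Step 2: keep all the 1s and only the first num_ones 0s in the shuffled order.
num_ones = int(mixed_targets.sum())
zero_positions = np.flatnonzero(mixed_targets == 0)
indices_to_remove = zero_positions[num_ones:]

balanced_inputs = np.delete(mixed_inputs, indices_to_remove, axis=0)
balanced_targets = np.delete(mixed_targets, indices_to_remove, axis=0)

print(balanced_targets.sum(), len(balanced_targets))
```

Because the rows are shuffled before the cut, which 0s survive now depends on the random order rather than on their position in the file.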