# Audiobook customer analysis
##### Given: a dataset (.csv) of every customer who has ever bought from the audiobook marketplace
##### Goal: build an ML model that predicts whether a customer will buy again from the marketplace
##### Simpler: figure out which of the variables we know about a customer carry the most weight in determining whether they buy again
##### --2 years' worth of engagement to train, 6 months to test
##### --Supervised learning, meaning we need targets
##### --Targets = a boolean: either the customer converted (bought again) or did not
##### --Classification problem: the customer either will not buy (0) or will buy (1)


### Action plan
##### 1. Preprocess the data
##### 1a. Balance the dataset (and shuffle it as well)
##### 1b. Divide the dataset into training, validation, and test sets
##### 1c. Save the data in a format TensorFlow can load as tensors, i.e. .npz
##### 2. Create the ML algorithm

#### Import relevant libraries

In [13]:
import numpy as np
#use the preprocessing module from sklearn; it makes standardizing the inputs much quicker
from sklearn import preprocessing

#### Load the data

In [3]:
#load the data with NumPy
raw_csv_data = np.loadtxt('D:/data_science/Neural_Nets/final projects/Audiobooks_data.csv',
                          delimiter = ',')
#the dataset has a bogus column 0, and column -1 holds the targets; exclude both from the inputs
unscaled_inputs_all = raw_csv_data[:,1:-1]
#assign targets to corresponding column
targets_all = raw_csv_data[:,-1]
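
The slicing above can be checked on a tiny array. This is only a sketch: the `toy` array and its values are made up for illustration and are not taken from the audiobook CSV.

```python
import numpy as np

# Hypothetical stand-in for the CSV: a first column to drop, two features, then the target
toy = np.array([
    [101, 5.0, 1.0, 0],
    [102, 7.5, 0.0, 1],
    [103, 2.5, 3.0, 0],
])

toy_inputs = toy[:, 1:-1]   # every row, without the first and last columns
toy_targets = toy[:, -1]    # every row, last column only

print(toy_inputs.shape)     # (3, 2)
print(toy_targets)          # [0. 1. 0.]
```

`1:-1` starts at column 1 and stops before the last column, so neither the first column nor the target leaks into the inputs.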

#### Balance the data

In [4]:
#count how many targets are 'True' (1), meaning the customer converted and bought again
#this works because the targets are only 0s and 1s,
#so the sum of all targets equals the number of 1s
num_one_targets = int(np.sum(targets_all))

#initialize variables for the loop below
#counter for how many customers did not convert ('False', i.e. 0s)
zero_targets_counter = 0
#keep track of which input/target pairs we need to remove to create a balanced dataset
indices_to_remove = []

#for loop: 1. count how many targets are 0s (no conversion)
#2. once there are as many 0s as there are 1s, mark where the excess 0s are located
for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_targets_counter += 1 
        if zero_targets_counter > num_one_targets:
            indices_to_remove.append(i)
    
#assign a variable to the balanced (equal priors) inputs
unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove, axis = 0)
#assign a variable to the balanced (equal priors) targets
targets_equal_priors = np.delete(targets_all, indices_to_remove, axis = 0)
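
The same balancing can be written without an explicit loop. The following is a minimal sketch, not the notebook's method: `balance_by_truncation` is a hypothetical helper name, and the toy arrays are made up. It is meant to reproduce the loop's behavior, keeping every 1 and only the first 0s up to the number of 1s.

```python
import numpy as np

def balance_by_truncation(inputs, targets):
    """Keep every 1 and only as many leading 0s as there are 1s (hypothetical helper)."""
    num_ones = int(np.sum(targets))
    zero_indices = np.flatnonzero(targets == 0)   # positions of all 0 targets, in order
    excess = zero_indices[num_ones:]              # 0s beyond the first num_ones of them
    return (np.delete(inputs, excess, axis=0),
            np.delete(targets, excess, axis=0))

# Made-up example: two 1s and five 0s
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)

bal_inputs, bal_targets = balance_by_truncation(toy_inputs, toy_targets)
print(bal_targets)   # [0 1 0 1]
```

Like the loop, this drops the later 0s rather than a random subset, so the order of rows in the file determines which non-converting customers are kept.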

#### Standardize the inputs

In [14]:
#use sklearn's preprocessing.scale() to standardize each variable individually
#(each column is scaled within itself)
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
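
With its default arguments, `preprocessing.scale` centers each column to zero mean and scales it to unit variance. The sketch below does the equivalent by hand in NumPy on a made-up array, to show what "each column within itself" means; it is an illustration, not a replacement for the sklearn call.

```python
import numpy as np

# Made-up data: two columns on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

col_mean = toy.mean(axis=0)    # one mean per column
col_std = toy.std(axis=0)      # one population standard deviation (ddof=0) per column
toy_scaled = (toy - col_mean) / col_std

# After scaling, both columns have mean ~0 and standard deviation ~1
print(toy_scaled.mean(axis=0))
print(toy_scaled.std(axis=0))
```

Both columns end up with identical scaled values here, because they differ only by a constant factor before scaling.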

#### Shuffle the inputs/targets
##### To avoid ordered time-series correlations

In [15]:
#shuffle the data
#do this by index so that each input row moves together with its target, not independently
shuffled_indices = np.arange(scaled_inputs.shape[0])
#actually shuffle the indices
np.random.shuffle(shuffled_indices)
np.random.shuffle(shuffled_indices)

#assign new variables to the shuffled, balanced, and scaled inputs/targets
shuffled_inputs = scaled_inputs[shuffled_indices]
shuffled_targets = targets_equal_priors[shuffled_indices]
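
A small sketch of why shuffling by index keeps inputs and targets aligned. It uses a seeded `np.random.default_rng` generator so the result is repeatable; the seed value and the toy arrays are arbitrary choices for illustration, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, only for reproducibility

# Made-up data where the first feature of row i equals its target i
toy_inputs = np.array([[0, 0], [1, 10], [2, 20], [3, 30]])
toy_targets = np.array([0, 1, 2, 3])

perm = rng.permutation(toy_inputs.shape[0])   # one shuffled set of row indices
shuf_inputs = toy_inputs[perm]                # the same permutation is applied
shuf_targets = toy_targets[perm]              # to both arrays

# Every row still sits next to its own target
print(np.array_equal(shuf_inputs[:, 0], shuf_targets))   # True
```

Shuffling the two arrays separately would break this pairing, which is why a single index array is used for both.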

#### Split into test/validate/train

In [16]:
#count the total # of samples
samples_count = shuffled_inputs.shape[0]

#make the splits of 80/10/10 but parameterize it for easy editing
train_samples_count = int(0.8*samples_count)
validation_samples_count = int(0.1*samples_count)
test_samples_count = int(0.1*samples_count)
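#note: the test slice below takes every remaining row, so it can hold more rows than this rounded count
#the exact count is:
#test_samples_count = samples_count - train_samples_count - validation_samples_count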

#assign variables to the train inputs/targets; do this using train_samples_count so that it is parameterized
train_inputs = shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

#do the same thing with the validation inputs/targets --> this is easy because we parameterized
validation_inputs = shuffled_inputs[train_samples_count:
                                    train_samples_count + validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:
                                     train_samples_count + validation_samples_count]

#do the same thing with the test inputs/targets: everything after the train and validation rows
test_inputs = shuffled_inputs[train_samples_count + validation_samples_count:]
test_targets = shuffled_targets[train_samples_count + validation_samples_count:]

#the targets were balanced earlier and then shuffled, so each of train/validation/test should be roughly balanced
#let's check this by printing the number of ones, the number of samples, and their ratio for each of
#train/validation/test
print(np.sum(train_targets), train_samples_count, np.sum(train_targets)/train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) 
      / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets)/ test_samples_count)
                                    



1790.0 3579 0.5001397038278849
227.0 447 0.5078299776286354
220.0 447 0.49217002237136465
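
The split sizes can be computed so that they always add up to the total. This is a sketch under assumptions: `split_counts` is a hypothetical helper, and the total of 4474 rows is inferred from the counts printed above rather than stated anywhere in the notebook.

```python
def split_counts(samples_count, train_frac=0.8, validation_frac=0.1):
    """Return (train, validation, test) sizes that sum to samples_count (hypothetical helper)."""
    train = int(train_frac * samples_count)
    validation = int(validation_frac * samples_count)
    test = samples_count - train - validation   # remainder, so no row is lost to rounding
    return train, validation, test

# 4474 is an assumed total, consistent with the train and validation counts printed above
train_n, val_n, test_n = split_counts(4474)
print(train_n, val_n, test_n)   # 3579 447 448
```

Under that assumption the test slice holds 448 rows, one more than the rounded `int(0.1*samples_count)`, so a ratio computed against 447 would be very slightly off.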


#### Save the datasets as .npz files (inputs and targets together), ready to load as tensors

In [18]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
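
A saved `.npz` file can be read back with `np.load`, using the same keyword names that were passed to `np.savez`. The sketch below round-trips a made-up array through a temporary directory; the file name `toy_split` and the dtype conversions are illustrative assumptions, not part of the notebook.

```python
import os
import tempfile
import numpy as np

# Made-up arrays standing in for one of the splits
toy_inputs = np.array([[0.5, -1.0], [1.5, 2.0]])
toy_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_split')   # hypothetical name; np.savez appends '.npz'
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # The keys 'inputs' and 'targets' match the keyword names used in np.savez
    with np.load(path + '.npz') as npz:
        loaded_inputs = npz['inputs'].astype(float)
        loaded_targets = npz['targets'].astype(int)

print(loaded_inputs.shape, loaded_targets)
```

Casting the targets to integers is one common choice for a 0/1 classification label; whether it is needed depends on the loss function used later.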