<a href="https://colab.research.google.com/github/kirat89/Complete-Python-3-Bootcamp/blob/master/Audiobook_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# You are given data from an Audiobook App. Logically, it relates to the audio versions of books ONLY. Each customer in the database has made a purchase at least once, that's why he/she is in the database. We want to create a machine learning algorithm based on our available data that can predict if a customer will buy again from the Audiobook company.

The main idea is that if a customer has a low probability of coming back, there is no reason to spend any money on advertising to him/her. If we can focus our efforts SOLELY on customers that are likely to convert again, we can make great savings. Moreover, this model can identify the most important metrics for a customer to come back again. Identifying new customers creates value and growth opportunities.

In [1]:
import numpy as np
from sklearn import preprocessing

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
raw_csv_data=np.loadtxt('/content/drive/MyDrive/Data Science /Audiobooks_data.csv',delimiter=',')

In [4]:
# The inputs are all columns in the csv, except for the first one [:,0]
# (which is just the arbitrary customer IDs that bear no useful information),
# and the last one [:,-1] (which is our targets)
unscaled_input_all=raw_csv_data[:,1:-1]

In [5]:
# The targets are in the last column.
targets=raw_csv_data[:,-1]

In [6]:
shuffle_indices=np.arange(unscaled_input_all.shape[0])
np.random.shuffle(shuffle_indices)
# Use the shuffled indices to shuffle the inputs and targets.
p_shuffled_inputs = unscaled_input_all[shuffle_indices]
p_shuffled_targets = targets[shuffle_indices]

since in the given dataset most of the customer didn't buy back, so targets are imbalanced. Their's a chance machine could learn that buying back is not possiblity.
so to avoid this, we will **balance the dataset**.


In [7]:
# Count how many targets are 1 (meaning that the customer did convert)
num_one_targets=int(np.sum(p_shuffled_targets))
#setting a counter for coustomers that didn't convert(0)
zero_targets_counter = 0
# We want to create a "balanced" dataset, so we will have to remove some input/target pairs.
# Declare a variable that will do that:
indices_to_remove = []
for i in range(p_shuffled_targets.shape[0]):
  if p_shuffled_targets[i]==0:
   zero_targets_counter+=1
   if zero_targets_counter>num_one_targets:
     indices_to_remove.append(i)


# Create two new variables, one that will contain the inputs, and one that will contain the targets.
# We delete all indices that we marked "to remove" in the loop above.
balanced_inputs = np.delete(p_shuffled_inputs, indices_to_remove, axis=0)
balanced_targets = np.delete(p_shuffled_targets, indices_to_remove, axis=0)

In [8]:
unique, counts = np.unique(balanced_targets, return_counts=True)

print (np.asarray((unique, counts)).T)

[[0.000e+00 2.237e+03]
 [1.000e+00 2.237e+03]]


In [9]:
shuffle_indices=np.arange(balanced_inputs.shape[0])
np.random.shuffle(shuffle_indices)
# Use the shuffled indices to shuffle the inputs and targets.
shuffled_inputs = unscaled_input_all[shuffle_indices]
shuffled_targets = targets[shuffle_indices]

In [10]:
scaled_shuffled_inputs = preprocessing.scale(shuffled_inputs)

In [11]:
samples_count = scaled_shuffled_inputs.shape[0]

# Count the samples in each subset, assuming we want 80-10-10 distribution of training, validation, and test.
# Naturally, the numbers are integers.
train_samples_count = int(0.8 * samples_count)
validation_samples_count = int(0.1 * samples_count)

# The 'test' dataset contains all remaining data.
test_samples_count = samples_count - train_samples_count - validation_samples_count

# Create variables that record the inputs and targets for training
# In our shuffled dataset, they are the first "train_samples_count" observations
train_inputs = scaled_shuffled_inputs[:train_samples_count]
train_targets = shuffled_targets[:train_samples_count]

# Create variables that record the inputs and targets for validation.
# They are the next "validation_samples_count" observations, folllowing the "train_samples_count" we already assigned
validation_inputs = scaled_shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]

# Create variables that record the inputs and targets for test.
# They are everything that is remaining.
test_inputs = scaled_shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]

# We balanced our dataset to be 50-50 (for targets 0 and 1), but the training, validation, and test were 
# taken from a shuffled dataset. Check if they are balanced, too. Note that each time you rerun this code, 
# you will get different values, as each time they are shuffled randomly.
# Normally you preprocess ONCE, so you need not rerun this code once it is done.
# If you rerun this whole sheet, the npzs will be overwritten with your newly preprocessed data.

# Print the number of targets that are 1s, the total number of samples, and the proportion for training, validation, and test.
print(np.sum(train_targets), train_samples_count, np.sum(train_targets) / train_samples_count)
print(np.sum(validation_targets), validation_samples_count, np.sum(validation_targets) / validation_samples_count)
print(np.sum(test_targets), test_samples_count, np.sum(test_targets) / test_samples_count)

717.0 3579 0.20033528918692373
80.0 447 0.1789709172259508
90.0 448 0.20089285714285715


In [12]:
unique, counts = np.unique(test_targets, return_counts=True)

print (np.asarray((unique, counts)).T)

[[  0. 358.]
 [  1.  90.]]


In [13]:
np.savez('/content/drive/MyDrive/Data Science /Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('/content/drive/MyDrive/Data Science /Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('/content/drive/MyDrive/Data Science /Audiobooks_data_test', inputs=test_inputs, targets=test_targets)

In [14]:
import tensorflow as tf

In [15]:
npz=np.load('/content/drive/MyDrive/Data Science /Audiobooks_data_train.npz')

train_inputs=npz['inputs'].astype(np.float)
train_targets=npz['targets'].astype(np.int)

npz=np.load('/content/drive/MyDrive/Data Science /Audiobooks_data_validation.npz')

validation_inputs=npz['inputs'].astype(np.float)
validation_targets=npz['targets'].astype(np.int)

npz=np.load('/content/drive/MyDrive/Data Science /Audiobooks_data_test.npz')

test_inputs=npz['inputs'].astype(np.float)
test_targets=npz['targets'].astype(np.int)

In [16]:
input_size=10
output_size=2
hidden_layer_size=100

model=tf.keras.Sequential([
                           tf.keras.layers.Dense(units=input_size,activation='relu'),
                           tf.keras.layers.Dense(units=hidden_layer_size,activation='relu'),
                           tf.keras.layers.Dense(units=hidden_layer_size,activation='relu'),
                           tf.keras.layers.Dense(units=output_size,activation='softmax')
                           ])
model.compile(optimizer='Adam',loss='sparse_categorical_crossentropy',metrics=['accuracy'])



In [17]:
batch_size=100
epochs=100
early_stopping=tf.keras.callbacks.EarlyStopping(patience=2)
epoch_hist=model.fit(train_inputs,train_targets,batch_size=batch_size,epochs=epochs,callbacks=[early_stopping],validation_data=(validation_inputs,validation_targets),
verbose=2)

Epoch 1/100
36/36 - 1s - loss: 0.4956 - accuracy: 0.8011 - val_loss: 0.4051 - val_accuracy: 0.8523
Epoch 2/100
36/36 - 0s - loss: 0.3942 - accuracy: 0.8500 - val_loss: 0.3508 - val_accuracy: 0.8680
Epoch 3/100
36/36 - 0s - loss: 0.3468 - accuracy: 0.8653 - val_loss: 0.3193 - val_accuracy: 0.8747
Epoch 4/100
36/36 - 0s - loss: 0.3235 - accuracy: 0.8729 - val_loss: 0.2977 - val_accuracy: 0.8837
Epoch 5/100
36/36 - 0s - loss: 0.3114 - accuracy: 0.8793 - val_loss: 0.2871 - val_accuracy: 0.8814
Epoch 6/100
36/36 - 0s - loss: 0.2969 - accuracy: 0.8829 - val_loss: 0.2955 - val_accuracy: 0.8904
Epoch 7/100
36/36 - 0s - loss: 0.2868 - accuracy: 0.8826 - val_loss: 0.2835 - val_accuracy: 0.8926
Epoch 8/100
36/36 - 0s - loss: 0.2806 - accuracy: 0.8852 - val_loss: 0.2766 - val_accuracy: 0.8926
Epoch 9/100
36/36 - 0s - loss: 0.2773 - accuracy: 0.8868 - val_loss: 0.2759 - val_accuracy: 0.8904
Epoch 10/100
36/36 - 0s - loss: 0.2754 - accuracy: 0.8885 - val_loss: 0.2751 - val_accuracy: 0.8770
Epoch 11/