
# AudioBook Supervised Learning Using Deep Neural Networks

##### Background:
* The audiobooks are purchased from an online audiobook retail app. The data was collected over a 2-year span and a 6-month span.

##### Problem Statement:
* The inputs include features such as overall purchase minutes and reviews. The target is **whether the customer will buy another audiobook or not.**
* The targets in the dataset were collected during the last 6 months of the total span.



### Action Plan

* **Preprocessing the Data**
    * Shuffle the dataset to remove bias introduced by the data collection method (such as day effects).
    * Balance the dataset.
    * Scale all the inputs.
    * Shuffle the dataset again to remove bias introduced by balancing.
    * Divide the dataset into train, validation, and test datasets.
    * Convert the dataset into a tensor.

* **Create a Machine Learning Algorithm**
  * Create a model.
  * Validate the model.
  * Test the model.
  

### Import all the libraries and raw data 

In [0]:
import tensorflow as tf
import numpy as np
from sklearn import preprocessing

In [0]:
url ="https://raw.githubusercontent.com/praveen-jalaja/ml-datasets/master/Audiobooks_data.csv?token=AN42ENME6UBQ3WN24LSK3DK6WYZT2"
raw_data = np.loadtxt(url,delimiter=',')

unscaled_inputs_all = raw_data[:,1:-1]
targets_all = raw_data[:,-1]

In [3]:
print("Inputs Have {} rows X {} columns".format(unscaled_inputs_all.shape[0], unscaled_inputs_all.shape[1]))
print("Targets Have {} rows X 1 columns".format(targets_all.shape[0]))

Inputs Have 14084 rows X 10 columns
Targets Have 14084 rows X 1 columns


### Shuffling the Datasets

In [0]:
# shuffled_indices = np.arange(unscaled_inputs_all.shape[0])
# np.random.shuffle(shuffled_indices) 

# unscaled_inputs_all = unscaled_inputs_all[shuffled_indices]
# targets_all = targets_all[shuffled_indices]

In [0]:
#  print("Inputs Have {} rows X {} columns".format(unscaled_inputs_all.shape[0], unscaled_inputs_all.shape[1]))
#  print("Targets Have {} rows X 1 columns".format(targets_all.shape[0]))

### Balancing Dataset

  * Balancing is required because the target is categorical: one category may make up a much larger share of the target values than the other.


In [6]:
no_of_ones = np.sum(targets_all, axis = 0)
no_of_zeros = len(targets_all) - no_of_ones
print("The Percentage of Zeros {:.2f} %, The Percentage of Ones {:.2f}%".format(no_of_zeros/len(targets_all)*100 , no_of_ones/len(targets_all)*100))

The Percentage of Zeros 84.12 %, The Percentage of Ones 15.88%


* The percentage of customers who won't buy another audiobook is much higher than the percentage who will. Training on such an imbalance tends to produce a machine learning algorithm that is biased toward the majority class.
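To make the bias concrete, here is a small sketch of the accuracy a trivial "always predict 0" rule would reach on the unbalanced data. The count of ones (2237) is inferred from the notebook's own outputs (14084 rows, 15.88% ones, and a balanced dataset of 4474 = 2 × 2237 rows); the variable names are illustrative only.

```python
# Counts inferred from the notebook's printed outputs (see lead-in).
total_rows = 14084
ones = 2237
zeros = total_rows - ones

# A rule that always predicts 0 is right on every zero and wrong on every one,
# so its accuracy equals the share of zeros in the data.
always_zero_accuracy = zeros / total_rows
print("Always-predict-0 accuracy: {:.2f}%".format(always_zero_accuracy * 100))
```

A model trained on the unbalanced data can therefore look accurate while learning little more than the class prior, which is the motivation for balancing.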

In [7]:
zero_targets_counter = 0
indices_to_remove = []

# Keep every one, and keep zeros only until there are as many zeros as ones
for i in range(targets_all.shape[0]):
  if targets_all[i] == 0:
    zero_targets_counter+=1
    if zero_targets_counter > no_of_ones:
      indices_to_remove.append(i)

unscaled_inputs_equal_priors = np.delete(unscaled_inputs_all, indices_to_remove ,axis =0)
target_equal_priors = np.delete(targets_all,indices_to_remove ,axis =0)

# np.sum(target_equal_priors) counts the ones, so it belongs in the second slot
print("The Percentage of Zeros {:.2f} %, The Percentage of Ones {:.2f}%".format((len(target_equal_priors) - np.sum(target_equal_priors))/len(target_equal_priors)*100 , np.sum(target_equal_priors)/len(target_equal_priors)*100))

The Percentage of Zeros 50.00 %, The Percentage of Ones 50.00%
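The same undersampling can be written without an explicit loop. The sketch below is an assumed equivalent, not the notebook's own code: `balance_by_undersampling` is a hypothetical helper that keeps every one and the first zeros in row order, which is the same selection the loop above makes. It is shown on toy data.

```python
import numpy as np

def balance_by_undersampling(inputs, targets):
    """Keep all rows with target 1 and the first len(ones) rows with target 0."""
    targets = np.asarray(targets)
    one_idx = np.flatnonzero(targets == 1)
    zero_idx = np.flatnonzero(targets == 0)
    kept_zero_idx = zero_idx[:len(one_idx)]
    # Sort so the kept rows stay in their original order, as np.delete would leave them
    keep = np.sort(np.concatenate([one_idx, kept_zero_idx]))
    return inputs[keep], targets[keep]

# Toy data: 5 zeros and 2 ones
toy_targets = np.array([0, 1, 0, 0, 1, 0, 0])
toy_inputs = np.arange(14).reshape(7, 2)
balanced_inputs, balanced_targets = balance_by_undersampling(toy_inputs, toy_targets)
print(balanced_targets)
```

Because the zeros that survive are simply the earliest ones in the file, any ordering in the raw data carries over into the balanced set unless the data is shuffled first.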


### Standardize the Inputs

In [0]:
scaled_inputs = preprocessing.scale(unscaled_inputs_equal_priors)
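`preprocessing.scale` standardizes each column with the mean and standard deviation of the array it is given. A common alternative, sketched here with plain NumPy on toy data, is to compute those statistics once and reuse them for any later data, so that every dataset the model sees is on the same scale. `fit_standardizer` and `apply_standardizer` are hypothetical helper names, not part of the notebook or of scikit-learn.

```python
import numpy as np

def fit_standardizer(reference_inputs):
    """Return per-column mean and standard deviation of the reference data."""
    mean = reference_inputs.mean(axis=0)
    std = reference_inputs.std(axis=0)   # population std (ddof=0)
    std[std == 0] = 1.0                  # avoid dividing by zero for constant columns
    return mean, std

def apply_standardizer(inputs, mean, std):
    """Standardize inputs with statistics computed elsewhere."""
    return (inputs - mean) / std

# Toy data standing in for a reference set and for data seen later
rng = np.random.default_rng(0)
toy_reference = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
toy_later = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

col_mean, col_std = fit_standardizer(toy_reference)
scaled_reference = apply_standardizer(toy_reference, col_mean, col_std)
scaled_later = apply_standardizer(toy_later, col_mean, col_std)
```

Only the reference data ends up with exactly zero mean and unit variance; later data is merely expressed in the same units, which is what a trained model expects.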

### Shuffle the data 

In [0]:
scaled_indices = np.arange(scaled_inputs.shape[0])

np.random.shuffle(scaled_indices)

shuffled_inputs = scaled_inputs[scaled_indices]
shuffled_targets = target_equal_priors[scaled_indices]
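The key point of the shuffle above is that inputs and targets are indexed with the same permutation, so each row keeps its own label. A minimal sketch on toy data, using a seeded generator so the shuffle is repeatable (the seed value 42 is arbitrary and not from the notebook):

```python
import numpy as np

rng = np.random.default_rng(seed=42)          # arbitrary seed, for reproducibility

toy_inputs = np.arange(12).reshape(6, 2)      # row i is [2*i, 2*i + 1]
toy_targets = np.arange(6)                    # target i belongs to row i

perm = rng.permutation(toy_inputs.shape[0])   # one permutation for both arrays
shuffled_toy_inputs = toy_inputs[perm]
shuffled_toy_targets = toy_targets[perm]

# Each shuffled row still carries its original target
print(np.array_equal(shuffled_toy_inputs[:, 0], 2 * shuffled_toy_targets))
```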

In [10]:
print("Inputs Have {} rows X {} columns".format(shuffled_inputs.shape[0], shuffled_inputs.shape[1]))
print("Targets Have {} rows X 1 columns".format(shuffled_targets.shape[0]))

Inputs Have 4474 rows X 10 columns
Targets Have 4474 rows X 1 columns


### Split into Train, Validation, and Test Datasets

In [0]:
sample_count = shuffled_inputs.shape[0]

train_samples_count = int(0.8 * sample_count)
validation_samples_count = int(0.1 * sample_count)

test_samples_count = sample_count - train_samples_count - validation_samples_count

train_inputs = shuffled_inputs[: train_samples_count]
train_targets = shuffled_targets[: train_samples_count]

validation_inputs = shuffled_inputs[train_samples_count:train_samples_count+validation_samples_count]
validation_targets = shuffled_targets[train_samples_count:train_samples_count+validation_samples_count]


test_inputs = shuffled_inputs[train_samples_count+validation_samples_count:]
test_targets = shuffled_targets[train_samples_count+validation_samples_count:]


In [12]:
print("Train Inputs Have {} rows X {} columns".format(train_inputs.shape[0], train_inputs.shape[1]))
print("Train Targets Have {} rows X 1 columns\n".format(train_targets.shape[0]))

print("validation Inputs Have {} rows X {} columns".format(validation_inputs.shape[0], validation_inputs.shape[1]))
print("validation Targets Have {} rows X 1 columns\n".format(validation_targets.shape[0]))


print("test Inputs Have {} rows X {} columns".format(test_inputs.shape[0], test_inputs.shape[1]))
print("test Targets Have {} rows X 1 columns\n".format(test_targets.shape[0]))

Train Inputs Have 3579 rows X 10 columns
Train Targets Have 3579 rows X 1 columns

validation Inputs Have 447 rows X 10 columns
validation Targets Have 447 rows X 1 columns

test Inputs Have 448 rows X 10 columns
test Targets Have 448 rows X 1 columns
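The 80/10/10 split can also be expressed with `np.split`, which cuts an array at the given indices. This is an assumed alternative to the slicing used in the notebook, shown on a stand-in array with the same number of rows (4474) so the resulting sizes can be compared with the output above.

```python
import numpy as np

toy_sample_count = 4474                              # rows in the balanced dataset
toy_train_count = int(0.8 * toy_sample_count)        # truncates 3579.2 to 3579
toy_validation_count = int(0.1 * toy_sample_count)   # truncates 447.4 to 447

toy_data = np.arange(toy_sample_count)               # stand-in for the shuffled rows

# Cut after the train rows and again after the validation rows;
# whatever is left becomes the test set.
train_part, validation_part, test_part = np.split(
    toy_data, [toy_train_count, toy_train_count + toy_validation_count]
)
print(len(train_part), len(validation_part), len(test_part))
```

Since both counts are rounded down, the test set absorbs the remainder, which is why it has one more row than the validation set.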



In [0]:
np.savez('Audiobooks_data_train', inputs=train_inputs, targets=train_targets)
np.savez('Audiobooks_data_validation', inputs=validation_inputs, targets=validation_targets)
np.savez('Audiobooks_data_test', inputs=test_inputs, targets=test_targets)
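`np.savez` stores each keyword argument as a named array and appends `.npz` to the file name, which is why the files are later loaded as `Audiobooks_data_*.npz`. A small round-trip sketch on toy arrays, written to a temporary directory so nothing is left behind (file and variable names are illustrative):

```python
import os
import tempfile
import numpy as np

toy_inputs = np.array([[0.5, -1.0], [1.5, 2.0]])
toy_targets = np.array([0.0, 1.0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_split")        # saved as toy_split.npz
    np.savez(path, inputs=toy_inputs, targets=toy_targets)

    # The loaded object behaves like a dict keyed by the keyword names
    with np.load(path + ".npz") as npz:
        stored_names = sorted(npz.files)
        loaded_inputs = npz["inputs"].astype(np.float64)
        loaded_targets = npz["targets"].astype(np.int64)

print(stored_names, loaded_targets.dtype)
```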

### Model Creation 

In [0]:
# np.float and np.int were removed in NumPy 1.24; use explicit dtypes instead
npz = np.load('Audiobooks_data_train.npz')
train_inputs , train_targets = npz['inputs'].astype(np.float64) , npz['targets'].astype(np.int64)
npz = np.load('Audiobooks_data_validation.npz')
validation_inputs , validation_targets = npz['inputs'].astype(np.float64) , npz['targets'].astype(np.int64)
npz = np.load('Audiobooks_data_test.npz')
test_inputs , test_targets = npz['inputs'].astype(np.float64) , npz['targets'].astype(np.int64)

In [15]:
print("Train Inputs Have {} rows X {} columns".format(train_inputs.shape[0], train_inputs.shape[1]))
print("Train Targets Have {} rows X 1 columns\n".format(train_targets.shape[0]))

print("validation Inputs Have {} rows X {} columns".format(validation_inputs.shape[0], validation_inputs.shape[1]))
print("validation Targets Have {} rows X 1 columns\n".format(validation_targets.shape[0]))


print("test Inputs Have {} rows X {} columns".format(test_inputs.shape[0], test_inputs.shape[1]))
print("test Targets Have {} rows X 1 columns\n".format(test_targets.shape[0]))

Train Inputs Have 3579 rows X 10 columns
Train Targets Have 3579 rows X 1 columns

validation Inputs Have 447 rows X 10 columns
validation Targets Have 447 rows X 1 columns

test Inputs Have 448 rows X 10 columns
test Targets Have 448 rows X 1 columns



In [0]:
input_size = 10          # number of input features
output_size = 2          # two classes: won't buy (0) / will buy (1)
hidden_layers_size = 50  # units in each hidden layer

model = tf.keras.Sequential([
    tf.keras.layers.Dense(hidden_layers_size, activation='relu'),
    tf.keras.layers.Dense(hidden_layers_size, activation='relu'),
    tf.keras.layers.Dense(output_size, activation='softmax')
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Stop after 2 epochs without improvement in val_loss (the default monitored quantity)
early_stopping = tf.keras.callbacks.EarlyStopping(patience=2, verbose=2)

BATCH_SIZE = 100
EPOCH_MAX = 100
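The model ends in a 2-unit softmax layer and is trained with `sparse_categorical_crossentropy`, which takes integer class labels rather than one-hot vectors. The NumPy sketch below illustrates what those two pieces compute on a toy batch; it is a simplified illustration, not Keras's implementation, and all names and values in it are made up for the example.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probabilities, labels):
    """Mean negative log-probability assigned to the true class of each row."""
    picked = probabilities[np.arange(len(labels)), labels]
    return float(-np.log(picked).mean())

# Toy batch: 3 samples, 2 classes, integer labels as in the notebook's targets
toy_logits = np.array([[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]])
toy_labels = np.array([0, 1, 1])

toy_probs = softmax(toy_logits)
toy_loss = sparse_categorical_crossentropy(toy_probs, toy_labels)
print(toy_probs.round(3), round(toy_loss, 4))
```

A useful reference point: a model that outputs 50/50 for every sample has a loss of ln 2 ≈ 0.693, so a loss well below that indicates the model is doing better than guessing on balanced data.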



In [17]:
model.fit(
    x = train_inputs,
    y = train_targets,
    batch_size = BATCH_SIZE,
    epochs = EPOCH_MAX,
    callbacks =[early_stopping],
    validation_data = (validation_inputs, validation_targets),
    verbose =2
)
 

Epoch 1/100
36/36 - 0s - loss: 0.6017 - accuracy: 0.7153 - val_loss: 0.4459 - val_accuracy: 0.8792
Epoch 2/100
36/36 - 0s - loss: 0.3897 - accuracy: 0.8715 - val_loss: 0.3348 - val_accuracy: 0.8770
Epoch 3/100
36/36 - 0s - loss: 0.3262 - accuracy: 0.8860 - val_loss: 0.3108 - val_accuracy: 0.8747
Epoch 4/100
36/36 - 0s - loss: 0.3039 - accuracy: 0.8908 - val_loss: 0.2997 - val_accuracy: 0.8859
Epoch 5/100
36/36 - 0s - loss: 0.2879 - accuracy: 0.8949 - val_loss: 0.2837 - val_accuracy: 0.8881
Epoch 6/100
36/36 - 0s - loss: 0.2785 - accuracy: 0.8955 - val_loss: 0.2818 - val_accuracy: 0.8881
Epoch 7/100
36/36 - 0s - loss: 0.2688 - accuracy: 0.8994 - val_loss: 0.2714 - val_accuracy: 0.8881
Epoch 8/100
36/36 - 0s - loss: 0.2631 - accuracy: 0.9014 - val_loss: 0.2671 - val_accuracy: 0.8881
Epoch 9/100
36/36 - 0s - loss: 0.2584 - accuracy: 0.9019 - val_loss: 0.2680 - val_accuracy: 0.8949
Epoch 10/100
36/36 - 0s - loss: 0.2547 - accuracy: 0.9033 - val_loss: 0.2648 - val_accuracy: 0.8904
Epoch 11/

<tensorflow.python.keras.callbacks.History at 0x7efb96140320>
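To see how `patience=2` interacts with the log above, the following sketch replays the ten logged `val_loss` values through a simplified model of early stopping: training stops once the loss has gone `patience` epochs without setting a new best. `early_stopping_epoch` is a hypothetical helper written for this illustration, not the Keras callback itself.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # new best: reset the counter
            wait = 0
        else:
            wait += 1        # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# val_loss values for epochs 1-10, copied from the training log above
logged_val_losses = [0.4459, 0.3348, 0.3108, 0.2997, 0.2837,
                     0.2818, 0.2714, 0.2671, 0.2680, 0.2648]

print(early_stopping_epoch(logged_val_losses, patience=2))   # prints None
```

Epoch 9 (0.2680) fails to beat the best value so far (0.2671), but epoch 10 (0.2648) sets a new best and resets the counter, so within the logged epochs the stopping condition is not reached.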

### Test Model

* Testing with the Balanced test Dataset

In [18]:
test_loss , test_accuracy = model.evaluate(test_inputs,test_targets)



In [19]:
print("TestLoss : {:.1f} , Test Accuracy : {:.2f}%".format(test_loss, test_accuracy*100))

TestLoss : 0.2 , Test Accuracy : 91.29%


* Testing on the whole, unbalanced dataset.

In [23]:
whole_inputs = preprocessing.scale(unscaled_inputs_all)

scaled_indices = np.arange(whole_inputs.shape[0])

np.random.shuffle(scaled_indices)

shuffled_whole_inputs = whole_inputs[scaled_indices]
shuffled_whole_targets = targets_all[scaled_indices]

print("Inputs Have {} rows X {} columns".format(shuffled_whole_inputs.shape[0], shuffled_whole_inputs.shape[1]))
print("Targets Have {} rows X 1 columns".format(shuffled_whole_targets.shape[0]))

Inputs Have 14084 rows X 10 columns
Targets Have 14084 rows X 1 columns


In [24]:
test_loss , test_accuracy = model.evaluate(shuffled_whole_inputs,shuffled_whole_targets)
print("TestLoss : {:.1f} , Test Accuracy : {:.2f}%".format(test_loss, test_accuracy*100))

TestLoss : 1.4 , Test Accuracy : 49.44%
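On an unbalanced dataset a single accuracy figure hides which class the errors fall on. One way to look closer is a confusion matrix with per-class recall, sketched here with NumPy on made-up labels. With the notebook's model, the predicted labels would presumably come from something like `np.argmax(model.predict(inputs), axis=1)`; that step is not run here, and `confusion_matrix_2x2` is a hypothetical helper.

```python
import numpy as np

def confusion_matrix_2x2(true_labels, predicted_labels):
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((2, 2), dtype=int)
    np.add.at(matrix, (true_labels, predicted_labels), 1)
    return matrix

# Made-up labels for illustration: 4 true zeros, 2 true ones
toy_true = np.array([0, 0, 0, 0, 1, 1])
toy_pred = np.array([0, 0, 1, 1, 1, 0])

cm = confusion_matrix_2x2(toy_true, toy_pred)
per_class_recall = cm.diagonal() / cm.sum(axis=1)   # share of each true class predicted correctly
overall_accuracy = cm.trace() / cm.sum()
print(cm)
print(per_class_recall, overall_accuracy)
```

Comparing the recall for class 0 and class 1 would show whether a low overall accuracy comes mostly from the majority class, the minority class, or both.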


### Conclusion 
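* The test accuracy of our model is 91.29%, slightly higher than the validation accuracy (89.04% at epoch 10, the last complete epoch in the log above). This is unusual but not impossible: with fewer than 450 samples in each set, a difference of a couple of percentage points is within sampling noise.

* Still, 91.29% is a pretty good accuracy.

* When the model is applied to the whole dataset without balancing, the test accuracy drops to 49.44%.

* This shows that a model trained on the balanced subset does not carry over directly to the original, unbalanced dataset. One likely contributing factor is that the whole dataset is standardized with its own mean and standard deviation rather than those of the balanced data, so the model sees inputs on a different scale.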



