<h1 align='center'> AudioBook Customer Analysis</h1>
<h5 align='center'> ~by Shivam Shukla </h5>




* Goal: predict whether a customer will buy another audiobook.

This helps the company focus on the customers who are more likely to buy again.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data  = pd.read_csv('original (1).csv',header=None)

In [3]:
data.head()

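|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 994 | 1620.0 | 1620 | 19.73 | 19.73 | 1 | 10.0 | 0.99 | 1603.8 | 5 | 92 | 0 |
| 1 | 1143 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.0 | 0.0 | 0 | 0 | 0 |
| 2 | 2059 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.0 | 0.0 | 0 | 388 | 0 |
| 3 | 2882 | 1620.0 | 1620 | 5.96 | 5.96 | 0 | 8.91 | 0.42 | 680.4 | 1 | 129 | 0 |
| 4 | 3342 | 2160.0 | 2160 | 5.33 | 5.33 | 0 | 8.91 | 0.22 | 475.2 | 0 | 361 | 0 |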


In [4]:
data.describe()

|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 | 14084.0 |
| mean | 16772.491551 | 1591.281685 | 1678.608634 | 7.103791 | 7.543805 | 0.16075 | 8.909795 | 0.125659 | 189.888983 | 0.070222 | 61.935033 | 0.158833 |
| std | 9691.807248 | 504.340663 | 654.838599 | 4.931673 | 5.560129 | 0.367313 | 0.643406 | 0.241206 | 371.08401 | 0.472157 | 88.207634 | 0.365533 |
| min | 2.0 | 216.0 | 216.0 | 3.86 | 3.86 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 8368.0 | 1188.0 | 1188.0 | 5.33 | 5.33 | 0.0 | 8.91 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 16711.5 | 1620.0 | 1620.0 | 5.95 | 6.07 | 0.0 | 8.91 | 0.0 | 0.0 | 0.0 | 11.0 | 0.0 |
| 75% | 25187.25 | 2160.0 | 2160.0 | 8.0 | 8.0 | 0.0 | 8.91 | 0.13 | 194.4 | 0.0 | 105.0 | 0.0 |
| max | 33683.0 | 2160.0 | 7020.0 | 130.94 | 130.94 | 1.0 | 10.0 | 1.0 | 2160.0 | 30.0 | 464.0 | 1.0 |


Here the first column is the customer ID and the last column is the target.

* 1: Customer will buy again.

* 0: Customer will not buy again.

In [5]:
unscaled_data = data.values   # Convert the DataFrame to a NumPy array

In [6]:
unscaled_inputs_all = unscaled_data[:,1:-1]
targets_all = unscaled_data[:,-1]
unscaled_inputs_all.shape

(14084, 10)

In [7]:
data[11].value_counts()

0    11847
1     2237
Name: 11, dtype: int64

## Balancing Data

Our data is unbalanced: there are far more 0s (11847) than 1s (2237), so we need to balance it before training.

In [8]:
target_ones = int(np.sum(targets_all))
target_ones

2237

In [9]:
indices_to_remove = []
zero_index_counter = 0

for i in range(targets_all.shape[0]):
    if targets_all[i] == 0:
        zero_index_counter = zero_index_counter +1
        if zero_index_counter > target_ones:
            indices_to_remove.append(i)
zero_index_counter 

11847

In [10]:
len(indices_to_remove)

9610

In [11]:
unscaled_balanced_input = np.delete(unscaled_inputs_all,indices_to_remove,axis=0 )

In [12]:
target_balanced  = np.delete(targets_all, indices_to_remove, axis=0)

In [13]:
sum(target_balanced)   # 2237 targets are 1; we kept the same number of 0 targets and removed the rest.

2237.0
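
The loop above keeps the first 2237 class-0 rows it meets and drops the rest. A vectorized version of the same idea can be sketched as follows. `undersample_majority` and its `seed` parameter are hypothetical names that do not appear in the notebook, and the sketch assumes class 0 is the majority class. With `seed=None` it mirrors the loop; with a seed it samples the kept zeros at random, which is a variation on what the notebook does.

```python
import numpy as np

def undersample_majority(inputs, targets, seed=None):
    """Drop surplus class-0 rows so both classes have the same count.

    Hypothetical helper; assumes class 0 has at least as many rows as class 1.
    """
    ones_idx = np.flatnonzero(targets == 1)
    zeros_idx = np.flatnonzero(targets == 0)
    if seed is None:
        # Keep the first len(ones_idx) zeros, like the loop in the notebook.
        kept_zeros = zeros_idx[: len(ones_idx)]
    else:
        # Variation: keep a random subset of the zeros instead.
        rng = np.random.default_rng(seed)
        kept_zeros = rng.choice(zeros_idx, size=len(ones_idx), replace=False)
    keep = np.sort(np.concatenate([ones_idx, kept_zeros]))
    return inputs[keep], targets[keep]

# Toy data: four zeros and two ones.
toy_inputs = np.arange(12).reshape(6, 2)
toy_targets = np.array([0, 1, 0, 0, 1, 0])
balanced_inputs, balanced_targets = undersample_majority(toy_inputs, toy_targets)
print(balanced_targets)
```

Sampling the kept zeros at random can be preferable when the rows are ordered in some meaningful way, since keeping only the first rows would then bias the class-0 sample.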

## Shuffling the data

The rows may be ordered by timestamp, so we shuffle them randomly before splitting.

In [14]:
shuffled_indicies = np.arange(unscaled_balanced_input.shape[0])

In [15]:
np.random.shuffle(shuffled_indicies)

In [16]:
shuffled_balanced_input = unscaled_balanced_input[shuffled_indicies]

In [17]:
shuffled_balanced_target  = target_balanced[shuffled_indicies]

In [18]:
sum(shuffled_balanced_target)

2237.0
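
`np.random.shuffle` produces a different order on every run, so the split and the reported metrics change between runs. If reproducibility matters, a seeded generator is one option. A minimal sketch, assuming a hypothetical helper `shuffle_together` and an arbitrary seed of 42:

```python
import numpy as np

def shuffle_together(inputs, targets, seed=42):
    """Apply one seeded random permutation to inputs and targets alike.

    Hypothetical helper; the seed value is arbitrary.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(inputs.shape[0])
    return inputs[order], targets[order]

# Toy data where row i of the inputs belongs to target i.
toy_inputs = np.arange(10).reshape(5, 2)
toy_targets = np.arange(5)
shuffled_inputs, shuffled_targets = shuffle_together(toy_inputs, toy_targets)
print(shuffled_targets)
```

Indexing both arrays with the same permutation keeps every input row paired with its own target, which is the property the notebook relies on.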

## Standardising Data

In [19]:
from sklearn.preprocessing import StandardScaler

In [20]:
scalar = StandardScaler()

In [21]:
scaled_input = scalar.fit_transform(shuffled_balanced_input)
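
The cell above fits `StandardScaler` on every row before the split, so validation and test rows contribute to the mean and standard deviation. A common alternative is to compute those statistics on the training rows only and reuse them for the other sets. The following is a minimal NumPy sketch of that idea, mirroring what `StandardScaler` computes by default (per-column mean and population standard deviation). `fit_standardizer` and `apply_standardizer` are hypothetical helper names.

```python
import numpy as np

def fit_standardizer(train_inputs):
    """Return per-column mean and standard deviation of the training rows."""
    mean = train_inputs.mean(axis=0)
    std = train_inputs.std(axis=0)   # population std (ddof=0)
    std[std == 0] = 1.0              # leave constant columns unscaled
    return mean, std

def apply_standardizer(inputs, mean, std):
    """Standardize any set of rows with statistics fitted elsewhere."""
    return (inputs - mean) / std

# Toy example: fit on the first three rows, reuse on a held-out row.
toy = np.array([[1.0, 10.0], [2.0, 10.0], [3.0, 10.0], [4.0, 10.0]])
mean, std = fit_standardizer(toy[:3])
scaled_train = apply_standardizer(toy[:3], mean, std)
scaled_held_out = apply_standardizer(toy[3:], mean, std)
print(scaled_train.mean(axis=0), scaled_held_out)
```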

## Splitting into train, validation, and test sets

* 80/10/10% split

In [22]:
sample_count = scaled_input.shape[0]

In [23]:
train_count = int(0.8*sample_count)
val_count = int(0.1*sample_count)
test_count = sample_count-train_count-val_count

In [24]:
X_train = scaled_input[ : train_count]

In [25]:
y_train = shuffled_balanced_target[ : train_count]

In [26]:
X_val = scaled_input[train_count : train_count+val_count]
y_val = shuffled_balanced_target[train_count : train_count+val_count]

In [27]:
X_test = scaled_input[train_count+val_count : ]
y_test = shuffled_balanced_target[train_count+val_count : ]

#### Making sure each set is balanced

In [28]:
print(np.sum(y_train), train_count , np.sum(y_train)/train_count)
print(np.sum(y_val), val_count , np.sum(y_val)/val_count)
print(np.sum(y_test), test_count , np.sum(y_test)/test_count)

1794.0 3579 0.501257334450964
209.0 447 0.46756152125279643
234.0 448 0.5223214285714286
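
With the 4474 balanced rows (2237 of each class), the counts work out to 3579 training, 447 validation, and 448 test rows, which matches the printed output. The same slicing can be sketched with `np.split`; `split_80_10_10` is a hypothetical helper name.

```python
import numpy as np

def split_80_10_10(inputs, targets):
    """Split already-shuffled arrays into train, validation, and test parts.

    Hypothetical helper; the test set takes whatever rows remain after
    the 80% and 10% cuts, as in the notebook.
    """
    sample_count = inputs.shape[0]
    train_end = int(0.8 * sample_count)
    val_end = train_end + int(0.1 * sample_count)
    x_parts = np.split(inputs, [train_end, val_end])
    y_parts = np.split(targets, [train_end, val_end])
    # Pairs of (inputs, targets) for train, validation, and test.
    return list(zip(x_parts, y_parts))

# Placeholder arrays with the same shape as the balanced data set.
dummy_inputs = np.zeros((4474, 10))
dummy_targets = np.zeros(4474)
(train, val, test) = split_80_10_10(dummy_inputs, dummy_targets)
print(len(train[1]), len(val[1]), len(test[1]))
```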


## Saving our preprocessed data in NPZ files

Saving the preprocessed arrays lets us load them later and go straight to building the model.

In [29]:
np.savez('Audiobooks_data_train', inputs=X_train, targets=y_train)
np.savez('Audiobooks_data_validation', inputs=X_val, targets=y_val)
np.savez('Audiobooks_data_test', inputs=X_test, targets=y_test)  # test inputs, not the training inputs

Now we can directly load these files using np.load() without performing the above preprocessing steps.
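
A minimal sketch of the round trip follows. `np.savez` appends `.npz` to the file name when it is missing, and `np.load` returns an object indexed by the keyword names used when saving. `save_and_reload` and the file name `Audiobooks_data_example` are hypothetical, chosen so the sketch does not overwrite the real files.

```python
import os
import tempfile

import numpy as np

def save_and_reload(inputs, targets, directory, name="Audiobooks_data_example"):
    """Save two arrays to one .npz file and read them back."""
    path = os.path.join(directory, name)
    np.savez(path, inputs=inputs, targets=targets)   # writes path + ".npz"
    with np.load(path + ".npz") as npz:
        # The keys match the keyword arguments passed to np.savez.
        return npz["inputs"], npz["targets"]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_inputs = np.array([[0.5, -1.0], [1.5, 2.0]])
    toy_targets = np.array([0, 1])
    loaded_inputs, loaded_targets = save_and_reload(toy_inputs, toy_targets, tmp_dir)

print(loaded_inputs.shape, loaded_targets)
```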

## Preparing data for model

In [30]:
X_train = X_train.astype(float)

In [31]:
y_train = y_train.astype(int)

In [32]:
X_val = X_val.astype(float)

In [33]:
y_val = y_val.astype(int)

In [34]:
X_test = X_test.astype(float)

In [35]:
y_test = y_test.astype(int)

## Model training

In [36]:
import tensorflow as tf

In [37]:
model = tf.keras.Sequential([tf.keras.layers.Dense(50,activation='relu'),
                            tf.keras.layers.Dense(50,activation='relu'),
                            tf.keras.layers.Dense(1,activation='sigmoid')])                            

In [38]:
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

In [39]:
earlyStopping = tf.keras.callbacks.EarlyStopping(patience=10)  # Stops training once val_loss has not improved for 10 consecutive epochs, to limit overfitting
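
The patience rule can be illustrated without TensorFlow. The sketch below is a simplified model of the callback's behaviour with its default settings (monitoring `val_loss`, no minimum improvement), not the Keras implementation itself. `early_stopping_epoch` is a hypothetical function and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss      # new best value: reset the counter
            wait = 0
        else:
            wait += 1        # one more epoch without improvement
            if wait >= patience:
                return epoch
    return None

# Made-up losses: improvement stops after epoch 3.
stop_epoch = early_stopping_epoch([1.0, 0.9, 0.8, 0.85, 0.9, 0.95], patience=3)
print(stop_epoch)  # 6
```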

In [40]:
model.fit(X_train,y_train,
          epochs=100,
          batch_size=100,
          validation_data=(X_val,y_val),
          callbacks=[earlyStopping]
     )

Train on 3579 samples, validate on 447 samples
Epoch 1/100
...
Epoch 43/100


<tensorflow.python.keras.callbacks.History at 0x7fca3c753510>

In [41]:
from sklearn.metrics import confusion_matrix,classification_report

In [42]:
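# predict_classes is not available in newer TensorFlow releases; thresholding the sigmoid output at 0.5 gives the same labels
pred = (model.predict(X_test) > 0.5).astype(int).ravel()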

In [43]:
print(confusion_matrix(y_test,pred))

[[154  60]
 [ 21 213]]


In [44]:
print(classification_report(y_test,pred))

              precision    recall  f1-score   support

           0       0.88      0.72      0.79       214
           1       0.78      0.91      0.84       234

    accuracy                           0.82       448
   macro avg       0.83      0.81      0.82       448
weighted avg       0.83      0.82      0.82       448
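
The figures in the report follow directly from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. A short sketch of the arithmetic, with `per_class_metrics` as a hypothetical helper name:

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[154, 60],
               [21, 213]])

def per_class_metrics(cm):
    """Compute per-class precision, recall, F1 and overall accuracy."""
    true_pos = np.diag(cm)
    precision = true_pos / cm.sum(axis=0)   # correct / all predicted as that class
    recall = true_pos / cm.sum(axis=1)      # correct / all truly in that class
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = true_pos.sum() / cm.sum()
    return precision, recall, f1, accuracy

precision, recall, f1, accuracy = per_class_metrics(cm)
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2), round(float(accuracy), 2))
```

For class 1, precision is 213 / (213 + 60) and recall is 213 / (213 + 21), so the model misses few actual buyers at the cost of flagging some non-buyers.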



### I also tried training without balancing and shuffling, but that gave very low precision and recall for class 1. That is why I first balanced the classes and then shuffled the rows, so that the data also remains balanced when it is split into train, validation, and test sets.

Using this model, a company can focus more on the customers who are more likely to buy the new audiobook!