# Environment Management

**Environment Name:** Audiobooks    
**Project Directory Name:** audiobooks_prj
<div style="text-align: justify">
<strong>Original Imported Libraries and Python:</strong>  
    
-python=3.12.4 (from conda-forge)  
-numpy=1.26.4 (from conda-forge)  
-pandas=2.2.2 (from conda-forge)  
-scikit-learn=1.5.1 (from conda-forge)  
-matplotlib=3.9.1 (from conda-forge)  
-tensorflow==2.17.0 (from pip)  
 
    
<strong>Project Date:</strong> July 2024
</div>

In [1]:
import sys 
sys.executable  # Display the path to the Python executable to confirm that the correct environment is active

'C:\\Users\\Adespotos\\anaconda3\\envs\\Audiobooks\\python.exe'

# Import Libraries & Read the File

In [2]:
import numpy as np  # For numerical operations and arrays.
import pandas as pd  # For data manipulation and analysis.
import matplotlib.pyplot as plt  # For basic plotting.
import tensorflow as tf  # For building and training ML models.
from sklearn.preprocessing import StandardScaler  # For standardizing the features.
from imblearn.under_sampling import RandomUnderSampler  # For undersampling the majority class.
from sklearn.model_selection import train_test_split  # For splitting the data into train, validation and test sets.
from audiobooks_scripts import create_datasets, create_model_train_eval_present_results  # Project helper functions.

In [3]:
# Read Excel file to a DataFrame:
df = pd.read_excel('Audiobooks_data.xlsx')

# Drop customer ID column:
df_dropped = df.drop(columns='Customer ID')  # drop() returns a new DataFrame, so df is left untouched

# Dealing with the Imbalanced Dataset

In [4]:
# Check how the target values are separated:
df_dropped['Targets'].value_counts()

Targets
0    11847
1     2237
Name: count, dtype: int64

<div style="text-align: justify">
From the counts above, only 15.88% of customers (2,237 out of 14,084) made a purchase again, whereas the rest didn't. We'll proceed by undersampling the majority class.
</div>
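
As a quick check of the percentage quoted above, the class shares can be recomputed from the two counts printed by `value_counts()`. This is a minimal sketch in plain Python; the variable names are illustrative and are not part of the notebook.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 11847, 1: 2237}

total = sum(counts.values())  # all customers in the dataset
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"Class {label}: {share:.2%}")

# Size of the dataset once the majority class is reduced
# to the size of the minority class.
balanced_size = 2 * min(counts.values())
print(balanced_size)
```

The last value is the number of rows to expect after undersampling, which is a useful sanity check before the split.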

In [5]:
x = df_dropped.drop(columns='Targets')  # Create features
y = df_dropped['Targets']  # Create targets

# Create an instance of RandomUnderSampler class:
under_sampler = RandomUnderSampler(random_state=42)

# Undersample the separated data:
x_undersampled, y_undersampled = under_sampler.fit_resample(x, y)

# Convert to DataFrame:
df_undersampled = pd.DataFrame(x_undersampled, columns=x.columns)
df_undersampled['Targets'] = y_undersampled

# Verify the undersampling:
df_undersampled['Targets'].value_counts()

Targets
0    2237
1    2237
Name: count, dtype: int64

In [6]:
df_final = df_undersampled.reset_index(drop=True)

# Train, Validation and Test Splits with Sklearn

In [7]:
# Create the features and the targets from the previous DataFrame:
X = df_final.drop(columns='Targets')
y = df_final['Targets']

# Assign size percentages to variables to automate processes and avoid mistakes:
test_perc = 0.09
mask_perc = 1 - test_perc
val_perc = test_perc / mask_perc

# Split into training+validation (mask set) and test sets:
X_mask, X_test, y_mask, y_test = train_test_split(
    X, 
    y, 
    test_size=test_perc, 
    stratify=y,  # Ensure the new set is balanced
    random_state=42)

# Split the training+validation (mask) set into training and validation sets:
X_train, X_val, y_train, y_val = train_test_split(
    X_mask, 
    y_mask, 
    test_size=val_perc, 
    stratify=y_mask,  # Ensure the new set is balanced
    random_state=42)

In [8]:
# Verify that y_train, y_val and y_test are balanced:
print(y_train.value_counts())
print(y_val.value_counts())
print(y_test.value_counts())

Targets
1    1834
0    1834
Name: count, dtype: int64
Targets
1    202
0    201
Name: count, dtype: int64
Targets
0    202
1    201
Name: count, dtype: int64
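
The split sizes printed above follow from the three percentages. The sketch below reproduces the arithmetic in plain Python. It assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remaining rows go to the other part of the split; treat that rounding rule as an assumption rather than a statement about the library's internals.

```python
import math

n_samples = 4474  # 2237 + 2237 rows after undersampling

test_perc = 0.09
mask_perc = 1 - test_perc
val_perc = test_perc / mask_perc  # fraction of the mask set, not of the whole dataset

# Assumption: the fractional size is rounded up to a whole number of rows.
n_test = math.ceil(test_perc * n_samples)
n_mask = n_samples - n_test
n_val = math.ceil(val_perc * n_mask)
n_train = n_mask - n_val

print(n_train, n_val, n_test)
```

Dividing `test_perc` by `mask_perc` is what makes the validation set the same size as the test set: 9% of the whole dataset is about 9.89% of the remaining 91%.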


# Scale the Data

In [9]:
# Create an instance of StandardScaler class:
scaler = StandardScaler()

# Scale the training data:
X_train_scaled = scaler.fit_transform(X_train)

# Use the same scaler to transform the validation and test sets:
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
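
To illustrate why the scaler is fitted on the training data only, here is a minimal sketch of standardization in plain Python. The feature values are hypothetical and the helper `standardize` is not part of the notebook; the mean and the population standard deviation are learned from the training column and then reused unchanged for the validation column.

```python
from statistics import fmean, pstdev

# Hypothetical values of a single feature, for illustration only.
train_col = [2.0, 4.0, 6.0, 8.0]
val_col = [5.0, 10.0]

# "fit": learn the statistics from the training data only.
mu = fmean(train_col)
sigma = pstdev(train_col)  # population standard deviation (divides by n)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std to every value."""
    return [(v - mean) / std for v in values]


# "transform": reuse the training statistics for every split.
train_scaled = standardize(train_col, mu, sigma)
val_scaled = standardize(val_col, mu, sigma)

print(train_scaled)
print(val_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction; the validation column generally does not, because it is expressed in the training set's units.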

# Data Preprocessing Using TensorFlow

<div style="text-align: justify">
We could shuffle using pandas' `.sample` method. Instead, we convert the scaled arrays and the targets to TensorFlow tensors and shuffle in TensorFlow, which is the more robust option, especially for very large datasets.
</div>

In [10]:
# Convert the scaled features and the targets to tensors:
X_train_tensor = tf.convert_to_tensor(X_train_scaled, dtype=tf.float32)
y_train_tensor = tf.convert_to_tensor(y_train, dtype=tf.float32)
X_val_tensor = tf.convert_to_tensor(X_val_scaled, dtype=tf.float32)
y_val_tensor = tf.convert_to_tensor(y_val, dtype=tf.float32)
X_test_tensor = tf.convert_to_tensor(X_test_scaled, dtype=tf.float32)
y_test_tensor = tf.convert_to_tensor(y_test, dtype=tf.float32)

In [11]:
# Verify that the target tensors are still balanced:
print(np.unique(y_train_tensor, return_counts=True))
print(np.unique(y_val_tensor, return_counts=True))
print(np.unique(y_test_tensor, return_counts=True))

(array([0., 1.], dtype=float32), array([1834, 1834], dtype=int64))
(array([0., 1.], dtype=float32), array([201, 202], dtype=int64))
(array([0., 1.], dtype=float32), array([202, 201], dtype=int64))


In [12]:
# Call a function to create the tensorflow datasets with a specific batch size:
train_set, validation_set, test_set = create_datasets(
    x_train_tens=X_train_tensor,  # Tensor of training features
    y_train_tens=y_train_tensor,  # Tensor of training labels
    x_val_tens=X_val_tensor,  # Tensor of validation features
    y_val_tens=y_val_tensor,  # Tensor of validation labels
    x_test_tens=X_test_tensor,  # Tensor of test features
    y_test_tens=y_test_tensor,  # Tensor of test labels
    buffer_size=len(X_train),  # Buffer size for shuffling, set to the length of the training data
    batch_size=100  # Number of samples per batch
)
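
The `create_datasets` helper lives in `audiobooks_scripts.py` and is not shown here. Assuming it shuffles the training rows and then cuts them into consecutive batches without dropping the remainder, the effect of `buffer_size` and `batch_size` can be sketched in plain Python as follows. This is an illustration of the idea, not the helper's actual code.

```python
import math
import random

n_train = 3668    # rows in the training split
batch_size = 100  # same value passed to create_datasets

# Shuffle the row indices; a buffer as large as the dataset
# means every row can end up in any position.
indices = list(range(n_train))
random.Random(42).shuffle(indices)

# Cut the shuffled indices into consecutive batches.
batches = [indices[i:i + batch_size] for i in range(0, n_train, batch_size)]

steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch, len(batches[-1]))
```

Under these assumptions one epoch consists of 37 steps, and the last batch holds the 68 rows left over after 36 full batches.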

# Baseline Model (with Instructor's Values)

<div style="text-align: justify">
I call this a baseline model even though its parameter values are already finely tuned: they are the same values as in the instructor's neural network model. The goal is to compare my neural network with the instructor's using identical parameters. I also commented out the line 'restore_best_weights=True' inside the 'audiobooks_scripts.py' file, because the instructor's early stopping callback is much simpler.
</div>

<div style="text-align: justify">
I believe my hands-on approach surpasses the instructor's in code readability and comprehension. I have also automated the process more efficiently by passing almost all model parameters into a single function (see below); the only exception is the batch size, which is set when the datasets are created. In addition, my train, validation and test split adapts correctly when only one parameter, the test percentage, is changed. Finally, I feed the model 3 batched and prefetched sets instead of 6, which makes the pipeline easier to follow.
</div>

In [13]:
baseline_df = create_model_train_eval_present_results(
    batch_size=100,  # The batch size used to batch the data in the create_datasets function
    optimizer='adam',  # Optimization technique (see function docstring for the options)
    learn_rate=0.001,  # Default learning rate for the Adam optimizer
    mom=None,  # Momentum parameter for SGD optimizer (ignored if not using 'sgd')
    n_range=30,  # Number of training and evaluation cycles to run using the same model
    input_size=(X_train.shape[1],),  # Shape of the input features (number of features)
    hidden_layer_sizes=[50, 50],  # List of sizes for hidden layers (two hidden layers with 50 neurons each)
    activation_fun='relu',  # Activation function for the hidden layers (see function docstring for the options)
    output_size=len(y_train.unique()),  # Number of output units (one per class)
    activation_fun_output='softmax',  # Activation function for the output layer (see function docstring for the options)
    loss_fun='sparse_categorical_crossentropy',  # Loss function for training (see function docstring for the options)
    train_set=train_set,  # Training dataset
    patience=2,  # Number of epochs with no improvement in validation loss before training stops
    epochs=100,  # Maximum number of epochs to train the model
    validation_set=validation_set,  # Validation dataset
    test_set=test_set,  # Test dataset
    verb=0  # Verbosity mode (0 = silent, 1 = progress bar, 2 = one line per epoch)
)

baseline_df

| | Value |
|---|---|
| Batch Size | 100 |
| Number of Runs | 30 |
| Optimization Technique | adam |
| Loss Function | sparse_categorical_crossentropy |
| Learning Rate | 0.001 |
| Momentum | |
| Patience | 2 |
| Hidden Layers Act. Function | relu |
| Output Activation Function | softmax |
| Epochs | 100 |
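
The baseline pairs a two-unit softmax output with the sparse categorical cross-entropy loss. As a minimal sketch of what that combination computes for a single sample, the plain-Python functions below turn raw outputs into probabilities and score them against an integer label. The logits are hypothetical and the functions are simplified stand-ins, not the library implementations.

```python
import math


def softmax(logits):
    """Convert raw outputs into probabilities that sum to 1."""
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_categorical_crossentropy(true_label, probs):
    """Negative log-probability of the true class.

    "Sparse" means the label is an integer class index
    rather than a one-hot vector.
    """
    return -math.log(probs[true_label])


logits = [1.0, 3.0]  # hypothetical raw outputs for classes 0 and 1
probs = softmax(logits)

loss_if_1 = sparse_categorical_crossentropy(1, probs)  # confident and correct
loss_if_0 = sparse_categorical_crossentropy(0, probs)  # confident and wrong

print(probs, loss_if_1, loss_if_0)
```

The loss is small when the model assigns high probability to the true class and grows quickly when it is confidently wrong, which is why the targets can stay as plain 0/1 integers.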


# VS

These are the results of the instructor's code, using exactly the same parameters over 30 runs of the same model:  
***Average Test Accuracy***: 0.8089  
***Standard Deviation Test Accuracy***: 0.0162  
***Average Test Loss***: 0.3466  
***Standard Deviation Test Loss***: 0.0193

<div style="text-align: justify">
The results are very close to each other. My approach is more consistent, with lower standard deviations for both test accuracy and test loss. However, it produces slightly worse average test accuracy and test loss.
</div>

# Best Model

It is very difficult to beat the finely tuned parameters; however, we'll give it a try in this section.

In [26]:
# Call a function to create the tensorflow datasets:
train_set_2, validation_set_2, test_set_2 = create_datasets(
    x_train_tens=X_train_tensor,  # Tensor of training features
    y_train_tens=y_train_tensor,  # Tensor of training labels
    x_val_tens=X_val_tensor,  # Tensor of validation features
    y_val_tens=y_val_tensor,  # Tensor of validation labels
    x_test_tens=X_test_tensor,  # Tensor of test features
    y_test_tens=y_test_tensor,  # Tensor of test labels
    buffer_size=len(X_train),  # Buffer size for shuffling, set to the length of the training data
    batch_size=150  # Number of samples per batch
)

In [31]:
model_2_df = create_model_train_eval_present_results(
    batch_size=150,
    optimizer='adam', 
    learn_rate=0.0003, 
    mom=None,  
    n_range=30, 
    input_size=(X_train.shape[1],),  
    hidden_layer_sizes=[100, 100],  
    activation_fun='relu',  
    output_size=len(np.unique(y_train)),  
    activation_fun_output='softmax', 
    loss_fun='sparse_categorical_crossentropy',  
    train_set=train_set_2,  
    patience=10,  
    epochs=100,  
    validation_set=validation_set_2,  
    test_set=test_set_2, 
    verb=0
)

model_2_df

| | Value |
|---|---|
| Batch Size | 150 |
| Number of Runs | 30 |
| Optimization Technique | adam |
| Loss Function | sparse_categorical_crossentropy |
| Learning Rate | 0.0003 |
| Momentum | |
| Patience | 10 |
| Hidden Layers Act. Function | relu |
| Output Activation Function | softmax |
| Epochs | 100 |


<div style="text-align: justify">
The results aren't very encouraging. I tried hundreds of different combinations but didn't manage to improve the model's performance significantly. However, I found that the model's performance stays robust even when the hyperparameters are pushed to extreme values, such as batch_size=400 or even higher.
</div>

<div style="text-align: justify">
I noticed a consistent improvement in the results when 'restore_best_weights=True' stays commented out. A possible explanation is that, without restoring the best weights, the model keeps the weights from its final epoch, that is, after training for up to 'patience' more epochs past the point where validation loss stopped improving. This can sometimes allow the model to capture more complex patterns in the data and hence generalize better on unseen data.
</div>
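
To make the role of 'patience' and 'restore_best_weights' concrete, here is a simplified simulation of an early stopping rule in plain Python. The validation losses are hypothetical and the function `simulate_early_stopping` is an illustrative stand-in, not the callback used in 'audiobooks_scripts.py'; it only reports which epoch's weights would be kept.

```python
def simulate_early_stopping(val_losses, patience, restore_best_weights):
    """Return (last_epoch_run, epoch_whose_weights_are_kept)."""
    best_loss = float("inf")
    best_epoch = None
    last_epoch = None
    wait = 0  # epochs since the last improvement

    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # stop training

    kept = best_epoch if restore_best_weights else last_epoch
    return last_epoch, kept


# Hypothetical validation losses, one per epoch (epochs are 0-indexed).
val_losses = [0.60, 0.50, 0.45, 0.47, 0.46, 0.44]

with_restore = simulate_early_stopping(val_losses, patience=2, restore_best_weights=True)
without_restore = simulate_early_stopping(val_losses, patience=2, restore_best_weights=False)

print(with_restore, without_restore)
```

In both cases training stops at the same epoch. The only difference is whether the model ends up with the weights from the best validation epoch or with the weights from the last epoch, which have seen 'patience' additional epochs of training.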