**Table of contents**<a id='toc0_'></a>    
- [Data Preprocessing](#toc1_)    
- [Modelling](#toc2_)    
- [Evaluate & Export Model](#toc3_)    
- [Metrics](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Data Preprocessing](#toc0_)

In [1]:
# Constants & Hyperparameters to define
VERBOSE = True
RANDOM_SEED = 42
EPOCHS = 10
BATCH_SIZE = 32
NUM_WORDS = 5000
MAX_SEQ_LEN = 50
EMBEDDING_DIM = 50
NUM_FILTERS = 64
KERNEL_SIZE = 5
NUM_CLASSES = 2
SAMPLE_FRAC = 0.0001

# Import Libraries
import pandas as pd
from tqdm.notebook import tqdm

# Import Functions
import sys
sys.path.append('../src')
from preprocessing import DataInspection, HandleMissingValues, RemoveDuplicates, TranslateText, HandleOutliers, SplitDataset

# Initialize Preprocessing Steps
data_inspection = DataInspection()
handle_missing_values = HandleMissingValues()
remove_duplicates = RemoveDuplicates()
translate_text = TranslateText()
handle_outliers = HandleOutliers()
split_dataset = SplitDataset()


# Chain Preprocessing Steps
data_inspection.set_next(handle_missing_values)
handle_missing_values.set_next(remove_duplicates)
remove_duplicates.set_next(translate_text)
translate_text.set_next(handle_outliers)
handle_outliers.set_next(split_dataset)

# Load Data
raw_data = pd.read_csv('../data/raw_data.csv')
print(f'raw_data.shape: {raw_data.shape}')

######################################################################
# Sample the data
raw_data = raw_data.sample(frac=SAMPLE_FRAC, 
    random_state=RANDOM_SEED, 
    ignore_index=True)

print(f'raw_data.shape: {raw_data.shape}')
######################################################################

# Execute the pipeline
X_train, X_test, y_train, y_test = data_inspection.process(raw_data, VERBOSE)

raw_data.shape: (4000000, 3)
raw_data.shape: (400, 3)

**View Data Structure**

HEAD


Unnamed: 0,labels,review_title,text
0,__label__1,"Deeply disappointing, faulty morality & social...","The book delves very little into art, aside fr..."
1,__label__2,insight into the philosophy of libertarian soc...,"In ""The Limits of State Action"" Enlightenment ..."
2,__label__2,a great book,"""In vain did the Bedouins strive to cut down a..."
3,__label__1,toys for great sex,"wow, that was bad, I threw it away after watch..."
4,__label__2,i love this movie!!!!,i just finished reading Someone Like You and i...



SAMPLE


Unnamed: 0,labels,review_title,text
375,__label__2,five stars,i have been reading paranormal books for years...
152,__label__2,"For Not Honoring The Price, Take Action!","I also ordered the Toshiba REGZA 47HL167, and ..."
339,__label__2,Fragments from a poisoned mind,This film was an unexpected surprise! I expect...
271,__label__1,Don't Waste Your Money,If you are expecting some guidance on the art ...
323,__label__1,ALL I CAN SAY IS P.U,THIS SHOW SHOULD NOT EVEN HAVE STARGATE IN TIT...



TAIL


Unnamed: 0,labels,review_title,text
395,__label__2,How did this get by me?,Friends of mine and I have been bellyaching ab...
396,__label__1,BAD TERRIBLE THE WORST SPANISH HORROR MOVIE,This is not scary. It is just plain terrible. ...
397,__label__1,All I can taste is rancid nuts,"No refined sugar, no gluten, dairy or soy, no ..."
398,__label__2,Motive Product Power Bleeder Review,The instructions were good. Very easy to use a...
399,__label__2,Truly A Caldecott Classic,Owl Moon by Jane Yolen in my opinion is one of...



**Info Summary**

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   labels        400 non-null    object
 1   review_title  397 non-null    object
 2   text          400 non-null    object
dtypes: object(3)
memory usage: 250.3 KB

**Summary Statistics**


Unnamed: 0,labels,review_title,text
count,400,397,400
unique,2,393,400
top,__label__2,Disappointed,"The book delves very little into art, aside fr..."
freq,213,3,1



**Handle Missing Values**

review_title    3
labels          0
text            0
dtype: int64

Actual Missing Values

labels          0
review_title    0
text            0
dtype: int64

**Remove Duplicates**

There are no duplicated values.
ACTUAL SHAPE: (400, 3)

**Translate Text**


Translating Text: 100%|██████████| 400/400 [00:00<00:00, 473.24it/s]



**Handle Outliers**

Number of detected outliers: 0
SHAPE BEFORE: (400, 4)
ACTUAL SHAPE: (400, 4)



2024-07-18 12:44:17.535338: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-18 12:44:17.548292: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:479] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-18 12:44:17.567942: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:10575] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-18 12:44:17.567980: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1442] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-18 12:44:17.582312: I tensorflow/core/platform/cpu_feature_gua


LABELS:
1.0

X_train shape: (320, 50)
X_test shape: (80, 50)
y_train shape: (320,)
y_test shape: (80,)
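
The `preprocessing` module in `../src` is not shown in this notebook. The `set_next(...)` and `process(...)` calls above suggest a chain-of-responsibility design, so here is a minimal sketch of how such a chain could work. The class names `Step`, `DropMissing` and `DropDuplicates` are hypothetical, and the sketch operates on a plain list of dicts instead of a DataFrame to stay dependency-free; it is not the actual implementation.

```python
class Step:
    """Hypothetical base class: each step transforms the data, then forwards it."""

    def __init__(self):
        self._next = None

    def set_next(self, step):
        self._next = step
        return step  # returning the step allows a.set_next(b).set_next(c)

    def handle(self, rows, verbose):
        return rows  # default behaviour: pass the data through unchanged

    def process(self, rows, verbose=False):
        rows = self.handle(rows, verbose)
        if self._next is None:
            return rows  # the last link returns the final result
        return self._next.process(rows, verbose)


class DropMissing(Step):
    def handle(self, rows, verbose):
        kept = [r for r in rows if all(v is not None for v in r.values())]
        if verbose:
            print(f"DropMissing: {len(rows)} -> {len(kept)} rows")
        return kept


class DropDuplicates(Step):
    def handle(self, rows, verbose):
        seen, kept = set(), []
        for r in rows:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                kept.append(r)
        if verbose:
            print(f"DropDuplicates: {len(rows)} -> {len(kept)} rows")
        return kept


# Toy data standing in for the review table (made up for illustration)
rows = [
    {"labels": "__label__1", "text": "bad"},
    {"labels": "__label__2", "text": None},
    {"labels": "__label__1", "text": "bad"},
    {"labels": "__label__2", "text": "great"},
]

first = DropMissing()
first.set_next(DropDuplicates())
cleaned = first.process(rows, verbose=True)
print(cleaned)
```

Only the first link needs to be called: each step hands its result to the next one. In the notebook the last link (`SplitDataset`) appears to return the four train/test arrays instead of a table, which this sketch does not model.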


# <a id='toc2_'></a>[Modelling](#toc0_)

**Sequential convolutional neural network (CNN) for text classification**

1. Embedding Layer
* `Embedding(input_dim=5000, output_dim=50, input_length=50)`
* `input_dim=5000` (`NUM_WORDS`): This specifies the vocabulary size, meaning the model can handle up to **5000** unique word indices.
* `output_dim=50` (`EMBEDDING_DIM`): This defines the dimensionality of the embedding vector, which maps each word to a dense **50**-dimensional vector.
* `input_length=50` (`MAX_SEQ_LEN`): This fixes the length of every input sequence at **50** tokens, matching the `(320, 50)` shape of `X_train`.

2. Convolutional Layer
* `Conv1D(filters=64, kernel_size=5, activation='relu')`: This 1D convolutional layer extracts features from the embedded text sequences.
* `filters=64`: This indicates the number of filters used to identify patterns in the text.
* `kernel_size=5`: This defines the size of the window that the filter slides over the text sequence (**5** words in this case).
* `activation='relu'`: This activation function introduces non-linearity, allowing the model to learn complex relationships between words.

    * `'relu'` means Rectified Linear Unit (ReLU). 
    * For any input value $x$, it outputs the value itself if it is positive ($x > 0$) and zero otherwise ($x \le 0$). 
    * Mathematically, it can be represented as:
    * $f(x) = \max(0, x)$

3. Pooling Layer
* `MaxPooling1D(pool_size=4)`: This layer shortens the sequence by taking the maximum value from every window of size **4** along it. This helps control overfitting and keeps the strongest feature responses.

4. Flattening Layer
* `Flatten()`: This layer transforms the 2D output (time steps × filters) of the pooling layer into a 1D vector suitable for feeding into the fully connected layers.

5. Fully Connected Layers
* `Dense(10, activation='relu')`: This first fully connected layer has **10** neurons and uses the ReLU activation function. It learns higher-level features by combining the extracted features from the convolutional layers.

* `Dense(1, activation='sigmoid')`: This final fully connected layer has a single neuron and uses the sigmoid activation function. It outputs the probability of the positive class, which suits binary classification (here, the two review labels `__label__1` and `__label__2`).

    * `'sigmoid'`: For an input value $x$, sigmoid calculates the probability $p$ using the following formula:
    * $p = \sigma(x) = \dfrac{1}{1 + \exp(-x)}$
    * The output lies strictly between 0 and 1.
    * Predictions above a threshold (0.5 by default for the Keras metrics used here) count as the positive class.

6. Compiling the Model:
* `model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', 'precision', 'recall'])`: This compiles the model by specifying the optimizer (Adaptive Moment Estimation (Adam) for efficient training), the loss function (binary crossentropy, which matches the single sigmoid output), and the metrics (accuracy, precision, and recall to measure performance).
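
To make the layer walkthrough concrete, the per-sample output shapes and parameter counts can be traced by hand. This is a sketch that uses the constants from the first cell and the layer stack from the code cell below (`padding='same'` on both the convolution and the pooling layer, and a single sigmoid output); `POOL_SIZE` and `HIDDEN_UNITS` are names introduced here for illustration. It assumes the standard Keras rule that `'same'` padding yields `ceil(length / stride)` steps, with the pooling stride defaulting to the pool size.

```python
import math

# Constants from the first notebook cell
NUM_WORDS, MAX_SEQ_LEN, EMBEDDING_DIM = 5000, 50, 50
NUM_FILTERS, KERNEL_SIZE = 64, 5
POOL_SIZE, HIDDEN_UNITS = 4, 10  # hypothetical names for the literals in the model

# Embedding: one EMBEDDING_DIM vector per token position
emb_shape = (MAX_SEQ_LEN, EMBEDDING_DIM)
emb_params = NUM_WORDS * EMBEDDING_DIM

# Conv1D with padding='same' and stride 1 keeps the sequence length
conv_shape = (MAX_SEQ_LEN, NUM_FILTERS)
conv_params = KERNEL_SIZE * EMBEDDING_DIM * NUM_FILTERS + NUM_FILTERS

# MaxPooling1D with padding='same': ceil(length / stride) steps, no weights
pool_shape = (math.ceil(MAX_SEQ_LEN / POOL_SIZE), NUM_FILTERS)

# Flatten: steps * filters
flat_units = pool_shape[0] * pool_shape[1]

# Dense layers: weights plus one bias per unit
dense1_params = flat_units * HIDDEN_UNITS + HIDDEN_UNITS
dense2_params = HIDDEN_UNITS * 1 + 1

total_params = emb_params + conv_params + dense1_params + dense2_params

print("Embedding   ", emb_shape, emb_params)
print("Conv1D      ", conv_shape, conv_params)
print("MaxPooling1D", pool_shape, 0)
print("Flatten     ", (flat_units,), 0)
print("Dense(10)   ", (HIDDEN_UNITS,), dense1_params)
print("Dense(1)    ", (1,), dense2_params)
print("Total parameters:", total_params)
```

Most of the weights sit in the embedding table, so `NUM_WORDS` and `EMBEDDING_DIM` dominate the model size. Calling `model.summary()` after building the model should give the authoritative numbers to compare against.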

In [2]:
# Import Libraries
import visualkeras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, Flatten, Dense

# Hide all GPUs so that TensorFlow runs on the CPU
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

########################################################################################################################

padded_sequences = tf.convert_to_tensor(X_train, dtype=tf.float32)
labels = tf.convert_to_tensor(y_train, dtype=tf.float32)

print('\n')
print(type(padded_sequences))
print(padded_sequences.dtype)
print(padded_sequences[0])
print('\n')
print(type(labels))
print(labels.dtype)
print(labels[0])
print('\n')

# Define the CNN model
model = Sequential([
    Embedding(
        input_dim=NUM_WORDS, 
        output_dim=EMBEDDING_DIM, 
        input_length=MAX_SEQ_LEN),
    Conv1D(
        filters=NUM_FILTERS, 
        kernel_size=KERNEL_SIZE, 
        activation='relu', 
        padding='same'),
    MaxPooling1D(
        pool_size=4, 
        padding='same'),
    Flatten(),
    Dense(
        10, 
        activation='relu'),
    Dense(
        1, 
        activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', 
    loss='binary_crossentropy',
    metrics=['accuracy', 'precision', 'recall'])

# Fit the model    
model.fit(x=padded_sequences, 
    y=labels, 
    epochs=EPOCHS, 
    batch_size=BATCH_SIZE, 
    validation_split=0.2)

# Visualize the CNN scheme
visualkeras.layered_view(
    model, 
    legend=True, 
    draw_volume=True, 
    scale_xy=2, 
    scale_z=2, 
    max_z=1000,
    to_file='output.png'
    ).show()



<class 'tensorflow.python.framework.ops.EagerTensor'>
<dtype: 'float32'>
tf.Tensor(
[   0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.    0.
    0.    0.    0.    0.    0.    0.    0.    0.  797.   13.   14.  104.
    4. 2401.    7.  291.   81.  396.    7.  336. 2402.  150. 1484.    8.
    7.  255.    5.  114.  151. 1485.    7.  141.  565.  241. 1050. 1479.
   67.  205.], shape=(50,), dtype=float32)


<class 'tensorflow.python.framework.ops.EagerTensor'>
<dtype: 'float32'>
tf.Tensor(1.0, shape=(), dtype=float32)


Epoch 1/10


2024-07-18 12:44:19.077628: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:282] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2024-07-18 12:44:19.077661: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:134] retrieving CUDA diagnostic information for host: heroines
2024-07-18 12:44:19.077666: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:141] hostname: heroines
2024-07-18 12:44:19.077963: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:165] libcuda reported version is: 550.90.7
2024-07-18 12:44:19.078020: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:169] kernel reported version is: 550.90.7
2024-07-18 12:44:19.078026: I external/local_xla/xla/stream_executor/cuda/cuda_diagnostics.cc:248] kernel version seems to match DSO: 550.90.7


8/8 ━━━━━━━━━━━━━━━━━━━━ 1s 29ms/step - accuracy: 0.5100 - loss: 0.6927 - precision: 0.5595 - recall: 0.0722 - val_accuracy: 0.5312 - val_loss: 0.6928 - val_precision: 1.0000 - val_recall: 0.0323
Epoch 2/10
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - accuracy: 0.5671 - loss: 0.6796 - precision: 1.0000 - recall: 0.1764 - val_accuracy: 0.5156 - val_loss: 0.6921 - val_precision: 0.0000e+00 - val_recall: 0.0000e+00
Epoch 3/10
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - accuracy: 0.6028 - loss: 0.6489 - precision: 1.0000 - recall: 0.1523 - val_accuracy: 0.5156 - val_loss: 0.6937 - val_precision: 0.0000e+00 - val_recall: 0.0000e+00
Epoch 4/10
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - accuracy: 0.7035 - loss: 0.6139 - precision: 1.0000 - recall: 0.3970 - val_accuracy: 0.5156 - val_loss: 0.6924 - val_precision: 0.0000e+00 - val_recall: 0.0000e+00
Epoch 5/10
8/8

# <a id='toc3_'></a>[Evaluate & Export Model](#toc0_)

$\text{F1-score} = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

In [3]:
model_loss, model_accuracy, model_precision, model_recall = model.evaluate(X_test, y_test)

# Calculate the F1-score
if (model_precision + model_recall) == 0:
    model_f1_score = 0
else:
    model_f1_score = 2 * (model_precision * model_recall) / (model_precision + model_recall)

print(f'Accuracy:\t {model_accuracy:.4f}')
print(f'Precision:\t {model_precision:.4f}')
print(f'Recall:\t\t {model_recall:.4f}')
print(f'F1-score:\t {model_f1_score:.4f}')

# Export model
model.export('../model')

'''
# Later, in a different process / environment...
reloaded_artifact = tf.saved_model.load("path/to/location")
predictions = reloaded_artifact.serve(input_data)
''';

3/3 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - accuracy: 0.6445 - loss: 0.6487 - precision: 0.5905 - recall: 0.4757
Accuracy:	 0.6250
Precision:	 0.5714
Recall:		 0.4706
F1-score:	 0.5161
INFO:tensorflow:Assets written to: ../model/assets



Saved artifact at '../model'. The following endpoints are available:

* Endpoint 'serve'
  args_0 (POSITIONAL_ONLY): TensorSpec(shape=(None, 50), dtype=tf.float32, name='keras_tensor')
Output Type:
  TensorSpec(shape=(None, 1), dtype=tf.float32, name=None)
Captures:
  138212975704720: TensorSpec(shape=(), dtype=tf.resource, name=None)
  138212975835792: TensorSpec(shape=(), dtype=tf.resource, name=None)
  138212975460016: TensorSpec(shape=(), dtype=tf.resource, name=None)
  138212975512736: TensorSpec(shape=(), dtype=tf.resource, name=None)
  138212975566160: TensorSpec(shape=(), dtype=tf.resource, name=None)
  138212975114896: TensorSpec(shape=(), dtype=tf.resource, name=None)
  138212975168144: TensorSpec(shape=(), dtype=tf.resource, name=None)


# <a id='toc4_'></a>[Metrics](#toc0_)

| Sample fraction | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| 0.0001 | 0.5750 | 0.5000 | 0.1176 | 0.1905 |
| 0.001 | 0.5938 | 0.5610 | 0.5806 | 0.5707 |
| 0.01 | 0.6658 | 0.6552 | 0.6726 | 0.6638 |
| 0.1 | 0.7249 | 0.7284 | 0.7161 | 0.7222 |
| 0.2 | 0.7377 | 0.7338 | 0.7451 | 0.7394 |
| 0.5 | 0.7438 | 0.7481 | 0.7349 | 0.7415 |
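
As a consistency check, the F1-scores listed above can be recomputed from the reported precision and recall. The sketch below copies the values from this section; the `reported` dictionary and `f1_score` helper are names introduced here. Small differences in the last digit are expected, because the inputs are already rounded to four decimals.

```python
# sample fraction -> (precision, recall, reported F1-score), copied from above
reported = {
    0.0001: (0.5000, 0.1176, 0.1905),
    0.001: (0.5610, 0.5806, 0.5707),
    0.01: (0.6552, 0.6726, 0.6638),
    0.1: (0.7284, 0.7161, 0.7222),
    0.2: (0.7338, 0.7451, 0.7394),
    0.5: (0.7481, 0.7349, 0.7415),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall, 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


max_diff = 0.0
for frac, (precision, recall, f1_reported) in reported.items():
    f1_recomputed = f1_score(precision, recall)
    max_diff = max(max_diff, abs(f1_recomputed - f1_reported))
    print(f"{frac:<7} reported={f1_reported:.4f} recomputed={f1_recomputed:.4f}")

print(f"largest difference: {max_diff:.5f}")
```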