### **Convolutional Neural Network (1D) - GloVe Embedding**

#### Initial operations

In [None]:
from google.colab import drive
from shutil import copyfile

!pip install iterative-stratification

In [None]:
drive.mount('/content/drive')

In [None]:
path = '/content/drive/MyDrive/FDL Project/Code/'

copyfile(path + 'text_vectorization.py', 'text_vectorization.py')
copyfile(path + 'embedding.py', 'embedding.py')
copyfile(path + 'kfold_cv.py', 'kfold_cv.py')

In [None]:
import pandas as pd
import numpy as np
import importlib
import itertools
import csv

from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

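# Import every Keras symbol from tensorflow.keras to avoid mixing two Keras installations
from tensorflow import keras
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import load_model
from tensorflow.keras.regularizers import l2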

import text_vectorization
importlib.reload(text_vectorization)

import embedding
importlib.reload(embedding)

import kfold_cv
importlib.reload(kfold_cv)

#### Training and test set

In this first stage, we have:
- Read the **training and test sets**;
- Computed the **number of unique categories**, i.e. the number of classes in the text classification task;
- Converted the labels associated with the articles to a **one-hot encoding representation**, which is standard practice in deep learning for a multi-class text classification task.

In [None]:
train_set = pd.read_csv(path + 'data/train-set-cat1-processed.csv')
test_set = pd.read_csv(path + 'data/test-set-cat1-processed.csv')

# Number of different categories
number_of_categories = len(train_set['label'].unique())

# One-hot encoding of the labels
label_train = to_categorical(train_set['label'], num_classes = number_of_categories, dtype = 'int64')
label_test = to_categorical(test_set['label'], num_classes = number_of_categories, dtype = 'int64')
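
To make the encoding concrete, here is a minimal NumPy sketch of what a one-hot conversion produces for a handful of integer labels. The toy labels and variable names are illustrative assumptions, not values from the dataset. It also assumes, as `to_categorical` does, that the labels are integers between 0 and `num_classes - 1`.

```python
import numpy as np

# Toy integer labels (hypothetical; the real ones come from train_set['label'])
toy_labels = np.array([0, 2, 1, 2])
toy_num_classes = 3

# Row i of the identity matrix is the one-hot vector of class i
toy_one_hot = np.eye(toy_num_classes, dtype='int64')[toy_labels]

print(toy_one_hot)
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]
#  [0 0 1]]

# argmax recovers the original integer labels
recovered_labels = toy_one_hot.argmax(axis=1)
```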

#### *Text vectorization and embedding*

Firstly, the following **parameters** are defined:
- **Size of the vocabulary** to create;
- **Number of words** considered for each text (article);
- **Dimension of the embedding**.

In [None]:
vocabulary_size = 50000
words_per_sentence = 200
embedding_dim = 100
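
As an illustration of how the first two parameters shape the input of the network, the sketch below maps texts to fixed-length lists of integer ids. It is a simplified stand-in, not the code in `text_vectorization.py` (which is not shown here): the function name `toy_vectorize` is hypothetical, and it only lowercases and splits on whitespace. It follows the Keras `TextVectorization` convention of reserving id 0 for padding and id 1 for out-of-vocabulary words.

```python
from collections import Counter

def toy_vectorize(texts, vocabulary_size, words_per_sentence):
    """Map each text to a fixed-length list of integer token ids."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Ids 0 and 1 are reserved for padding and out-of-vocabulary words
    most_common = [word for word, _ in counts.most_common(vocabulary_size - 2)]
    word_to_id = {word: index + 2 for index, word in enumerate(most_common)}

    vectors = []
    for text in texts:
        ids = [word_to_id.get(word, 1) for word in text.lower().split()]
        ids = ids[:words_per_sentence]                # truncate long texts
        ids += [0] * (words_per_sentence - len(ids))  # pad short texts
        vectors.append(ids)
    return vectors

toy_vectors = toy_vectorize(
    ['the cat sat', 'the cat on the mat'],
    vocabulary_size=4,
    words_per_sentence=4,
)
print(toy_vectors)  # [[2, 3, 1, 0], [2, 3, 1, 2]]
```

With `vocabulary_size = 4` only the two most frequent words get their own id, so every other word collapses to id 1. The first text is padded with a 0 and the second is cut to four tokens.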

Then, we have opted for the first embedding approach: **Keras vectorization and GloVe embedding**.

- The *vectorization* (and so the creation of the *vocabulary*) is carried out using the **Keras built-in function**, with the text vectorizer adapted on the training set;
- For the *embedding matrix*, we have used a pre-trained solution, **GloVe**, with 100 dimensions;
- Finally, we have created the **vectorized features** for the training phase.

In [None]:
text_vectorizer_keras = text_vectorization.createTextVectorizer(vocabulary_size, words_per_sentence, train_set['text'])
vocabulary_keras = text_vectorizer_keras.get_vocabulary()

embedding_matrix_glove = embedding.buildEmbeddingMatrix(embedding_dim, vocabulary_keras)
embedding_layer_glove = embedding.createEmbeddingLayer(embedding_matrix_glove, None)
embedding_layer_glove._name = 'GloVe'

In [None]:
feature_train_glove = text_vectorization.textVectorization(train_set['text'], text_vectorizer_keras)
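
The helper `embedding.buildEmbeddingMatrix` is not shown in this notebook, so the following is only a sketch of one common way to build such a matrix. It assumes that row `i` holds the pre-trained vector of vocabulary entry `i` and that words missing from the pre-trained vectors keep an all-zero row. The function name, the toy vectors and the toy vocabulary are hypothetical.

```python
import numpy as np

def toy_embedding_matrix(vocabulary, pretrained_vectors, embedding_dim):
    """Stack one vector per vocabulary entry; unknown words stay all-zero."""
    matrix = np.zeros((len(vocabulary), embedding_dim))
    for index, word in enumerate(vocabulary):
        vector = pretrained_vectors.get(word)
        if vector is not None:
            matrix[index] = vector
    return matrix

# Hypothetical 3-dimensional vectors standing in for the GloVe file
toy_pretrained = {'cat': [0.1, 0.2, 0.3], 'dog': [0.4, 0.5, 0.6]}
# Keras vocabularies start with the padding token '' and the OOV token '[UNK]'
toy_vocabulary = ['', '[UNK]', 'cat', 'bird', 'dog']

toy_matrix = toy_embedding_matrix(toy_vocabulary, toy_pretrained, embedding_dim=3)
print(toy_matrix.shape)  # (5, 3)
```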

#### *Neural network architecture*

Here, we have defined a function which creates a **Convolutional Neural Network (1D) architecture** (model) given a set of hyperparameters.

In [None]:
def buildNetwork(hyperparams_combination, words_per_sentence):

    # Input layer
    input_layer = keras.Input(shape = (words_per_sentence,), dtype = 'int64')

    # Embedding layer
    x = embedding_layer_glove(input_layer)

    # Hidden layers
    # The L2 penalty, when enabled, applies to the first convolutional layer only
    kernel_regularizer = l2(0.01) if hyperparams_combination['regularization'] else None

    x = keras.layers.Conv1D(
        filters = hyperparams_combination['filters'],
        kernel_size = hyperparams_combination['kernel_size'],
        activation = 'relu',
        kernel_regularizer = kernel_regularizer)(x)

    x = keras.layers.Conv1D(
        filters = hyperparams_combination['filters'],
        kernel_size = hyperparams_combination['kernel_size'],
        activation = 'relu')(x)
    x = keras.layers.GlobalMaxPooling1D()(x)
    x = keras.layers.Dropout(rate = hyperparams_combination['rate'])(x)

    # Output layer
    x = keras.layers.Dense(number_of_categories, activation = 'softmax')(x)
    output_layer = x

    return keras.Model(input_layer, output_layer, name = 'Conv1D')
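
To see how the sequence length changes through the two convolutional layers, the short calculation below may help. It assumes the Keras defaults used above (stride 1 and `'valid'` padding), for which a `Conv1D` layer shortens the sequence by `kernel_size - 1`. The function name is illustrative.

```python
def conv1d_output_length(input_length, kernel_size):
    """Output length of a Conv1D layer with stride 1 and 'valid' padding."""
    return input_length - kernel_size + 1

for kernel_size in (3, 5):
    after_first = conv1d_output_length(200, kernel_size)
    after_second = conv1d_output_length(after_first, kernel_size)
    print(kernel_size, after_first, after_second)
# 3 198 196
# 5 196 192
```

`GlobalMaxPooling1D` then keeps only the maximum over the remaining positions for each filter, so the dense layer receives a vector with one entry per filter, whatever the kernel size.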

#### K-Fold Cross Validation

In this part of the project, we have implemented **K-Fold Cross Validation** as a strategy to find the **best hyperparameters** for the neural network and to obtain a **performance estimate** of the model on new, unseen data. Our approach has followed this logic:

*   Firstly, we defined the **hyperparameters** and the combinations to try with the K-Fold CV. In detail, we have chosen to vary the **number of filters** and the **kernel size** of the convolutional layers, the **optimizer**, and whether **L2 regularization** is applied; the dropout rate and the batch size are kept fixed;

*   Then, we set the **number of epochs** to *50*, which is an upper bound on the actual number of epochs used to train the model, because we have used an early stopping rule: if performance does not improve for 3 consecutive epochs, the current K-Fold iteration ends and we keep the epoch number with the best performance as a hyperparameter;

*   The **number of folds K** has been set to *3* and a multi-label stratified approach has been carried out;

*   As this is a text classification task (categorical label), the **loss function** is the **categorical cross entropy**. Our goal is to **minimize** this loss in order to improve the performance of the model, so we have used it as our performance proxy. We have also tracked the **accuracy**;

*   To evaluate a single combination of hyperparameters, we have computed the **average performance** over the K iterations;

*   *The best combination of hyperparameters is the one which leads to the lowest loss*;

*   *We write a CSV file with all the hyperparameter combinations and the performance obtained in the K-Fold CV*.
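
The early stopping rule described in the list can be sketched in plain Python as follows. This is only an illustration of the logic, not the code in `kfold_cv.py`: it assumes the monitored quantity is the validation loss, and the function name and the toy loss values are hypothetical.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return (best_epoch, stopped_epoch), both counted from 1."""
    best_loss = float('inf')
    best_epoch = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop here
            return best_epoch, epoch
    return best_epoch, len(validation_losses)

toy_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.67, 0.60]
print(early_stopping_epoch(toy_losses))  # (3, 6)
```

In this toy run the loss stops improving after epoch 3, so training ends at epoch 6 and the later value of 0.60 is never reached. The upper bound of 50 epochs only matters when the loss keeps improving.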

In [None]:
filters = [ 128, 256 ]
kernel_size = [ 3, 5 ]
optimizer = [ 'adam', 'rmsprop' ]
regularization = [ True, False ]
augmentation = [ False ]
rate = [ 0.5 ]
batch_size = [ 128 ]

hyperparams = list(itertools.product(filters, kernel_size, optimizer, batch_size, rate, regularization, augmentation))
columns = [ 'filters', 'kernel_size', 'optimizer', 'batch_size', 'rate', 'regularization', 'augmentation' ]

hyperparams = [ dict(zip(columns, values)) for values in hyperparams ]
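
As a quick check of the size of the search, the self-contained sketch below rebuilds the same grid from a dictionary and counts the combinations. The `toy_` names are illustrative, and the figure of 48 assumes one training run per fold and per combination.

```python
import itertools

toy_grid = {
    'filters': [128, 256],
    'kernel_size': [3, 5],
    'optimizer': ['adam', 'rmsprop'],
    'batch_size': [128],
    'rate': [0.5],
    'regularization': [True, False],
    'augmentation': [False],
}

# One dictionary per combination, keyed by hyperparameter name
toy_combinations = [
    dict(zip(toy_grid, values))
    for values in itertools.product(*toy_grid.values())
]

print(len(toy_combinations))      # 16
print(len(toy_combinations) * 3)  # 48 training runs with k_fold = 3
print(toy_combinations[0])
```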

In [None]:
epochs = 50
k_fold = 3

In [None]:
kfold_results = []

# For each hyperparameters combination
for hyperparams_combination in hyperparams:

    # Building the network model
    network = buildNetwork(hyperparams_combination, words_per_sentence)

    # Print progress information (summary() prints the table itself)
    print(hyperparams_combination)
    network.summary()

    # Performing the K-Fold Cross Validation
    kfold_results.append(kfold_cv.kfoldCrossValidation(k_fold, feature_train_glove, label_train, network, hyperparams_combination, epochs))
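
The content of `kfold_cv.kfoldCrossValidation` is not shown here. Judging from how its result is used later (`loss_kfold`, `best_number_epochs`, `optimizer`, `batch_size`), it appears to return the hyperparameters together with fold-averaged metrics. A minimal sketch of such an aggregation, under that assumption, could look like this. The function name, the `accuracy_kfold` key, the per-fold keys and the rounded-mean rule for the epochs are all hypothetical.

```python
from statistics import mean

def summarize_folds(hyperparams_combination, fold_results):
    """Merge the hyperparameters with the fold-averaged metrics."""
    summary = dict(hyperparams_combination)
    summary['loss_kfold'] = mean(fold['loss'] for fold in fold_results)
    # Hypothetical key name for the averaged accuracy
    summary['accuracy_kfold'] = mean(fold['accuracy'] for fold in fold_results)
    # Assumption: the per-fold best epochs are combined with a rounded mean
    summary['best_number_epochs'] = round(mean(fold['best_epoch'] for fold in fold_results))
    return summary

toy_folds = [
    {'loss': 0.5, 'accuracy': 0.75, 'best_epoch': 6},
    {'loss': 0.25, 'accuracy': 1.0, 'best_epoch': 8},
    {'loss': 0.75, 'accuracy': 0.5, 'best_epoch': 7},
]
toy_summary = summarize_folds({'filters': 128, 'optimizer': 'adam'}, toy_folds)
print(toy_summary)
```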

In [None]:
# Get the best hyperparameters combination (lowest average K-Fold loss)
best_hyperparams = min(kfold_results, key = lambda result: result['loss_kfold'])

print(best_hyperparams)

In [None]:
# Write the K-Fold CV results to a CSV file
with open(path + 'results/kfold-conv1D.csv', mode = 'w', newline = '') as file:

    writer = csv.DictWriter(file, fieldnames = list(kfold_results[0].keys()))
    writer.writeheader()

    for row_data in kfold_results:
        writer.writerow(row_data)
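
When the results file is read back later, note that the `csv` module returns every value as a string. The sketch below shows this round trip on an in-memory buffer with made-up rows, and also sorts the rows by loss before writing. The numbers are illustrative, not actual results.

```python
import csv
import io

toy_results = [
    {'filters': 128, 'optimizer': 'adam', 'loss_kfold': 0.61},
    {'filters': 256, 'optimizer': 'rmsprop', 'loss_kfold': 0.58},
]

buffer = io.StringIO(newline='')  # in-memory stand-in for the results file
writer = csv.DictWriter(buffer, fieldnames=list(toy_results[0].keys()))
writer.writeheader()
# Write the rows sorted by loss, best combination first
writer.writerows(sorted(toy_results, key=lambda row: row['loss_kfold']))

buffer.seek(0)
toy_rows = list(csv.DictReader(buffer))
print(toy_rows[0])
# {'filters': '256', 'optimizer': 'rmsprop', 'loss_kfold': '0.58'}

# Values come back as strings and need an explicit conversion
best_loss = float(toy_rows[0]['loss_kfold'])
```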

#### Convolutional Neural Network (1D) - Final Architecture

Here, we have created the **neural network architecture** model with the best hyperparameters found in the K-Fold Cross Validation.

In [None]:
# Neural Network architecture with best hyperparameters combination
conv1D_network = buildNetwork(best_hyperparams, words_per_sentence)

# Compiling the network
conv1D_network.compile(

    loss = 'categorical_crossentropy',
    optimizer = best_hyperparams['optimizer'],
    metrics = ['accuracy']

)
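
For reference, the categorical cross entropy of a single sample is the negative log of the probability assigned to the true class, that is `-sum(y_i * log(p_i))` with a one-hot target `y`. The plain-Python sketch below evaluates it on two made-up predictions. The function name and the probabilities are illustrative.

```python
import math

def categorical_cross_entropy(one_hot_target, predicted_probabilities):
    """Loss for one sample: -sum(y_i * log(p_i))."""
    return -sum(
        target * math.log(probability)
        for target, probability in zip(one_hot_target, predicted_probabilities)
        if target > 0  # zero targets contribute nothing to the sum
    )

confident = categorical_cross_entropy([0, 1, 0], [0.1, 0.8, 0.1])
unsure = categorical_cross_entropy([0, 1, 0], [0.4, 0.3, 0.3])
print(round(confident, 3), round(unsure, 3))  # 0.223 1.204
```

The loss falls toward 0 as the probability of the true class approaches 1, which is why a lower value indicates a better model.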

#### Training

We train the neural network model on all the training data and save the model as an H5 file.

In [None]:
# Training (fit Neural Network)
training_history = conv1D_network.fit(

    x = feature_train_glove,
    y = label_train,
    batch_size = best_hyperparams['batch_size'],
    epochs = best_hyperparams['best_number_epochs']

)

In [None]:
# path already ends with '/', so no leading slash is needed
conv1D_network.save(path + 'models/conv1D-model.h5')

#### Testing

Testing the neural network model with the test set.

*   The text in the test set has been vectorized with the text vectorizer adapted on the training set, so that it uses the same vocabulary as the GloVe embedding matrix, keeping the results consistent;

*   We have evaluated the performance using the **categorical cross entropy loss**, **global accuracy** and **single class accuracy**.

In [None]:
conv1D_network = load_model(path + 'models/conv1D-model.h5')

In [None]:
feature_test_glove = text_vectorization.textVectorization(test_set['text'], text_vectorizer_keras)
score = conv1D_network.evaluate(feature_test_glove, label_test, verbose = 0)

In [None]:
# Performance metrics
test_loss = round(score[0], 3)
test_accuracy = round(score[1], 3)
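
The cells above report the loss and the global accuracy. One possible way to obtain the single class accuracy is sketched below with NumPy on toy data. It interprets "single class accuracy" as the fraction of samples of each true class that are predicted correctly, which is an assumption. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def per_class_accuracy(one_hot_labels, predicted_probabilities):
    """Fraction of samples of each true class that were predicted correctly."""
    true_classes = one_hot_labels.argmax(axis=1)
    predicted_classes = predicted_probabilities.argmax(axis=1)
    accuracies = {}
    for class_id in np.unique(true_classes):
        mask = true_classes == class_id
        accuracies[int(class_id)] = float((predicted_classes[mask] == class_id).mean())
    return accuracies

toy_labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
toy_probabilities = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8], [0.3, 0.7]])

toy_accuracies = per_class_accuracy(toy_labels, toy_probabilities)
print(toy_accuracies)  # {0: 0.5, 1: 1.0}
```

In the notebook, the probabilities would come from `conv1D_network.predict(feature_test_glove)` and the labels from `label_test`.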

In [None]:
# Write the testing performance on the global final results CSV
with open(path + 'results/final-results.csv', mode = 'a', newline = '') as file:

    writer = csv.writer(file)
    writer.writerow(['Conv1D', 'Glove', 'False', test_loss, test_accuracy])