# Preprocessing Step
This notebook covers the data preprocessing stage, whose output is later fed to the models.

In this step we perform:
* Lowercasing of all words;
* Removal of punctuation;
* Removal of stopwords;
* Stemming;
* Splitting into training, validation and test sets;
* Tokenization;
* Padding.

## Table of Contents
* [Packages](#1)
* [Preprocessing](#2)
    * [RNN Preprocessing](#2.1)
        * [Lowercasing, Stopwords, Stemming and Punctuations](#2.1.1)
        * [Tokenization](#2.1.2)
    * [Transformer Preprocessing](#2.2)

<a name="1"></a>
## Packages
Packages that were used in the system:
* [pandas](https://pandas.pydata.org/): is the main package for data manipulation;
* [numpy](https://numpy.org/): is the main package for scientific computing;
* [re](https://docs.python.org/3/library/re.html): provides regular expression matching operations similar to those found in Perl;
* [string](https://docs.python.org/pt-br/3.13/library/string.html): for common string operations;
* [nltk](https://www.nltk.org/): NLTK is a leading platform for building Python programs to work with human language data;
* [tensorflow](https://www.tensorflow.org/): framework that makes it easy to create ML models that can run in any environment;
* [scikit-learn](https://scikit-learn.org/stable/): open source machine learning library;
* [pickle](https://docs.python.org/3/library/pickle.html): implements binary protocols for serializing and de-serializing a Python object structure;
* [transformers](https://huggingface.co/docs/transformers/index): provides APIs and tools to easily download and train state-of-the-art pretrained models;
* [datasets](https://huggingface.co/docs/datasets/index): is a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks;
* [os](https://docs.python.org/3/library/os.html): built-in module, provides a portable way of using operating system dependent functionality;
* [sys](https://docs.python.org/3/library/sys.html): provides access to some variables used or maintained by the interpreter and to functions that interact strongly with the interpreter;
* [src](../src/): package with the code for all utility functions created for this system. Located inside the `../../src/` directory.

In [7]:
import pandas as pd
import numpy as np
import string
from nltk.corpus import stopwords
from sklearn.model_selection import StratifiedShuffleSplit
import pickle
from transformers import DistilBertTokenizer
from datasets import Dataset

import os
import sys
PROJECT_ROOT = os.path.abspath( # Obtaining the absolute, normalized version of the project root path
    os.path.join( # Concatenating the paths
        os.getcwd(), # Getting the path of the notebooks directory
        os.pardir, # Getting the constant string used by the OS to refer to the parent directory
        os.pardir
    )
)
# Adding path to the list of strings that specify the search path for modules
sys.path.append(PROJECT_ROOT)
from src.preprocessing import *

> **Note**: the code for the utility functions used in this system is in the `preprocessing.py` script within the `../../src/` directory.

<a name="2"></a>
## Preprocessing
We will create 2 models for this project, so we need 2 different preprocessing pipelines, one for each model:
1. The first model is an RNN with a bidirectional LSTM layer, created and trained from scratch.
2. The second model is obtained by fine-tuning a pre-trained transformer architecture.

Reading the dataset that will be preprocessed and displaying its first 5 examples.

In [11]:
comics_data = pd.read_csv('../../data/raw/comics_corpus.csv')
comics_data.head()

| | id | title | description | y |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... | non-action |
| 1 | 93339 | The Mighty Valkyries (2021) #3 | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... | action |
| 2 | 94884 | The Mighty Valkyries (2021) #3 (Variant) | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... | action |
| 3 | 93350 | X-Corp (2021) #2 | A SHARK IN THE WATER! After X-CORP’s shocking ... | non-action |
| 4 | 94896 | X-Corp (2021) #2 (Variant) | A SHARK IN THE WATER! After X-CORP?s shocking ... | non-action |


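We can see some duplicate texts in the `description` feature. On closer inspection, the only thing that keeps them from being 100% identical is the punctuation: in one copy the apostrophes, dashes and ellipses are intact, while in the other they were replaced by `?`. Therefore, we will treat the punctuation in order to eliminate the duplicate examples.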

In [14]:
print(f'Some example:\n{comics_data["description"][1]}\n')
print(f'Duplicated example:\n{comics_data["description"][2]}')

Some example:
CHILDREN OF THE AFTERLIFE! While Kraven the Hunter stalks Jane Foster on Midgard and the newest Valkyrie fights for her soul on Perdita, Karnilla, the queen of Hel, works a miracle in the land of the dead! But Karnilla isn’t Hel’s only ruler—and now she’s upset the cosmic balance. There will be a price to pay…and Karnilla intends to ensure the Valkyries pay it.

Duplicated example:
CHILDREN OF THE AFTERLIFE! While Kraven the Hunter stalks Jane Foster on Midgard and the newest Valkyrie fights for her soul on Perdita, Karnilla, the queen of Hel, works a miracle in the land of the dead! But Karnilla isn?t Hel?s only ruler?and now she?s upset the cosmic balance. There will be a price to pay?and Karnilla intends to ensure the Valkyries pay it.
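
The effect described above can be sketched in a few lines. This is only an illustration under stated assumptions: `normalize` is a hypothetical helper, not the `rnn_preprocess` function from `src`, and the regular expression is just one possible way of removing punctuation.

```python
import re

def normalize(text: str) -> str:
    """Hypothetical helper: lowercase the text and drop punctuation."""
    # [^\w\s] matches ASCII punctuation such as "?" and also typographic
    # characters such as the right single quotation mark
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    # Collapse the runs of whitespace left behind by the substitution
    return " ".join(no_punct.split())

# Short excerpts of the two descriptions printed above
original = "But Karnilla isn’t Hel’s only ruler"
garbled = "But Karnilla isn?t Hel?s only ruler"

print(original == garbled)                        # raw strings are compared as-is
print(normalize(original) == normalize(garbled))  # compared after removing punctuation
```

Once punctuation is removed, both excerpts collapse to the same string, which is why the duplicates can only be counted and dropped after the initial preprocessing.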


<a name="2.1"></a>
### RNN Preprocessing
For a text classifier, we first preprocess the raw data, then tokenize our training set and extract useful features to train our model and make predictions.

Printing the English stopwords and the punctuation marks that will be removed.

In [17]:
stopwords_en = stopwords.words('english')
punct = string.punctuation
print(f'Stopwords:\n{stopwords_en}\n\nPunctuations:\n{punct}')

Stopwords:
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so'

<a name="2.1.1"></a>
#### Lowercasing, Stopwords, Stemming and Punctuations
After the initial preprocessing, we are left only with the words that carry the relevant information in the text. The initial preprocessing steps, applied before splitting and tokenization, are:
* `Lowercasing`: to reduce our vocabulary without losing valuable information, we convert every word to lowercase. Thus, CHILDREN, Children and children are all treated as the same word, children.
* `Special characters`: such as mathematical symbols, currency symbols, section and paragraph signs, inline markup signs and so on. It is usually safe to delete them.
* `Stopwords and punctuation`: we remove stopwords, that is, words that do not add significant meaning to the text, as well as punctuation marks. Once they are eliminated, the general meaning of the sentence can still be inferred without much effort.
* `Stemming`: transforms each word into its base stem, which we can define as the set of characters used to build the word and its derivatives. For example, the stem of the word works is `work`: adding the suffix "s" forms works, adding the suffix "ed" forms worked, and adding the suffix "ing" forms working.
    * After stemming our corpus, the words works, worked and working are all reduced to the stem `work`. Carrying out this process for each word in the corpus therefore reduces our vocabulary significantly.
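
The steps above can be sketched with the standard library only. This is not the `rnn_preprocess` function from `src`: `TOY_STOPWORDS`, `toy_stem` and `toy_preprocess` are hypothetical stand-ins, and the suffix-stripping rule is a deliberately naive substitute for a real stemmer such as NLTK's `PorterStemmer`.

```python
import re

# Hypothetical, tiny stopword list (the notebook uses NLTK's English list)
TOY_STOPWORDS = {"the", "of", "and", "a", "is", "in", "on", "for", "her", "while"}
# Suffixes checked in this order by the naive stemmer
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_preprocess(text: str) -> str:
    """Lowercase, drop non-letters, drop stopwords, then stem each token."""
    text = text.lower()
    # Anything that is not a lowercase letter or whitespace becomes a space
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = [token for token in text.split() if token not in TOY_STOPWORDS]
    return " ".join(toy_stem(token) for token in tokens)

print(toy_preprocess("She works, worked and is WORKING!"))
```

With this toy rule, works, worked and working all map to `work`. A real stemmer applies many more rules, which is why the notebook output contains stems such as `histori` and `afterlif`.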

Preprocessing the data, counting the duplicate examples after the initial preprocessing, and displaying the first 5 examples of the initially preprocessed dataset.

In [20]:
comics_data_pre = comics_data.copy()
comics_data_pre['description'] = comics_data_pre['description'].map(rnn_preprocess)
print(f'Number of duplicate examples: {comics_data_pre["description"].duplicated().sum()}')
comics_data_pre.head()

Number of duplicate examples: 790


| | id | title | description | y |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | shadow kirisaki mountain secret histori come l... | non-action |
| 1 | 93339 | The Mighty Valkyries (2021) #3 | children afterlif kraven hunter stalk jane fos... | action |
| 2 | 94884 | The Mighty Valkyries (2021) #3 (Variant) | children afterlif kraven hunter stalk jane fos... | action |
| 3 | 93350 | X-Corp (2021) #2 | shark water x corp shock debut got fenc mend h... | non-action |
| 4 | 94896 | X-Corp (2021) #2 (Variant) | shark water x corp shock debut got fenc mend h... | non-action |


Now that the texts have gone through the initial preprocessing and no longer contain punctuation, we can remove the duplicate examples.

Printing the number of duplicates after the removal and displaying the first 5 examples of the preprocessed dataset without duplicates.

In [22]:
comics_data_pre = comics_data_pre.drop_duplicates('description')
print(f'Number of duplicate examples: {comics_data_pre["description"].duplicated().sum()}')
comics_data_pre.head()

Number of duplicate examples: 0


| | id | title | description | y |
| :---: | :---: | :---: | :---: | :---: |
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | shadow kirisaki mountain secret histori come l... | non-action |
| 1 | 93339 | The Mighty Valkyries (2021) #3 | children afterlif kraven hunter stalk jane fos... | action |
| 3 | 93350 | X-Corp (2021) #2 | shark water x corp shock debut got fenc mend h... | non-action |
| 5 | 93645 | Heroes Reborn: Weapon X & Final Flight (2021) #1 | best world without aveng squadron suprem prote... | non-action |
| 6 | 93052 | Heroes Reborn (2021) #6 | eon fabl daughter utopia isl known power princ... | non-action |


Creating a dataset with only the features that will be used in the model, converting the target label $y$ to binary, and displaying the first 5 examples of the dataset.

In [24]:
# Setting the new dataset
comics_corpus = comics_data_pre[['description', 'y']].copy()
# Transforming the target label y into binary
comics_corpus['y'] = comics_corpus['y'].map(lambda x: 1 if x == 'action' else 0)
comics_corpus.head()

| | description | y |
| :---: | :---: | :---: |
| 0 | shadow kirisaki mountain secret histori come l... | 0 |
| 1 | children afterlif kraven hunter stalk jane fos... | 1 |
| 3 | shark water x corp shock debut got fenc mend h... | 0 |
| 5 | best world without aveng squadron suprem prote... | 0 |
| 6 | eon fabl daughter utopia isl known power princ... | 0 |


Dividing the dataset into training, validation and test subsets. We use `stratified sampling` to keep approximately the same proportion of labels in each subset, because the classes are slightly imbalanced and we do not want this to affect our model. With `test_size=.4` followed by `test_size=.5`, the result is a 60/20/20 split.

Splitting the dataset and printing the shape of each subset.

In [26]:
# Splitting into the training subset and a held-out subset (validation + test)
split_train = StratifiedShuffleSplit(n_splits=1, test_size=.4, random_state=42)
for train_index, subset_index in split_train.split(comics_corpus, comics_corpus['y']):
    train_corpus, subset_corpus = comics_corpus.iloc[train_index, :].copy(), comics_corpus.iloc[subset_index, :].copy()

# Splitting between validation and testing
split_test = StratifiedShuffleSplit(n_splits=1, test_size=.5, random_state=42)
for val_index, test_index in split_test.split(subset_corpus, subset_corpus['y']):
    val_corpus, test_corpus = subset_corpus.iloc[val_index, :].copy(), subset_corpus.iloc[test_index, :].copy()

print(f'Train set shape: {train_corpus.shape}\nValidation set shape: {val_corpus.shape}\nTest set shape: {test_corpus.shape}')

Train set shape: (9682, 2)
Validation set shape: (3227, 2)
Test set shape: (3228, 2)
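
To make the idea of stratification concrete, here is a plain-Python sketch of a two-stage stratified split. It is not how scikit-learn's `StratifiedShuffleSplit` is implemented: `stratified_split` is a hypothetical helper and the toy labels are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=42):
    """Hypothetical helper: split positions so each class keeps its proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)
    first, second = [], []
    for positions in by_class.values():
        rng.shuffle(positions)
        # Each class contributes the same fraction to the second part
        n_second = round(len(positions) * test_size)
        second.extend(positions[:n_second])
        first.extend(positions[n_second:])
    return sorted(first), sorted(second)

# Toy, slightly imbalanced labels: 30 positives and 70 negatives
labels = [1] * 30 + [0] * 70

# Stage 1: 60% training, 40% held out
train_idx, rest_idx = stratified_split(labels, test_size=0.4)

# Stage 2: split the held-out part in half; the result is relative to
# rest_idx, so it is mapped back to positions in the full dataset
rest_labels = [labels[i] for i in rest_idx]
val_rel, test_rel = stratified_split(rest_labels, test_size=0.5)
val_idx = [rest_idx[i] for i in val_rel]
test_idx = [rest_idx[i] for i in test_rel]

print(len(train_idx), len(val_idx), len(test_idx))
```

Every subset keeps the 30% share of positive labels of the toy data, which is the property that stratified sampling is meant to preserve.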


Setting the global variables `VOCAB_SIZE` and `MAX_LEN` to tokenize the training set.

In [28]:
VOCAB_SIZE = 1000
MAX_LEN = max([len(sentence.split()) for sentence in train_corpus['description']])
print(f'Length of the largest clean sentence: {MAX_LEN}')

Length of the largest clean sentence: 166


<a name="2.1.2"></a>
#### Tokenization
In this step, we encode our training corpus in its vector representation. First, we need to create a vocabulary $V$ that allows us to encode any text as a vector of integers. The vocabulary $V$ is a list of the unique words in our corpus: we go through each word of each text and add to the vocabulary every new word we encounter. Then we replace each word found in the training set with its index in the vocabulary, as illustrated below.

| Token | Index |
| :---: | :---: |
| afterlif | 1 |
| $\vdots$ | $\vdots$ |
| balanc | 615 |
| $\vdots$ | $\vdots$ |
| cosmic | 621 |
| $\vdots$ | $\vdots$ |
| work | VOCAB_SIZE |

The RNN receives as input the vector representation of the tokenized sentences. We then apply an embedding layer that transforms each integer token into a vector of values that numerically represents the semantics of that word. This addresses one of the problems of integer encoding: the order of the indices (alphabetical order, in this example) does not make much sense from a semantic point of view. For example, there is no reason 'cosmic' should be given a higher number than 'afterlif'.
> I will talk more about embeddings when creating the model.

This RNN will allow us to classify complex sentences that simpler methods like Naive Bayes would not classify correctly, because those methods miss important information such as word order.

Our representation $X$ will be a vector of integers, that is, the index of each token in our vocabulary. Once we have the vector representations of all our sentences, we identify the maximum vector length and pad each vector with 0 to match it. This process is called `padding`, and it ensures that all of our vectors have the same length even if our sentences do not.

To create the vocabulary, we define which words belong to it, that is, we create a list of the words we are going to use in our representations. One way to do this is to look at our training set and take the `VOCAB_SIZE` most frequent words. Another is to use an existing dictionary that gives the `VOCAB_SIZE` most commonly used words in the language of our task.
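
A minimal sketch of the first option, building the vocabulary from the most frequent words of the training set. The names `build_vocabulary`, `<PAD>` and `<UNK>` are assumptions for illustration; reserving index 0 for padding and index 1 for unknown words is also an assumption, chosen to resemble what the tokenized output in this notebook looks like.

```python
from collections import Counter

def build_vocabulary(sentences, vocab_size):
    """Hypothetical helper: reserved tokens plus the most frequent words."""
    counts = Counter(word for sentence in sentences for word in sentence.split())
    # Two slots are reserved, so only vocab_size - 2 real words fit
    most_common = [word for word, _ in counts.most_common(vocab_size - 2)]
    return ["<PAD>", "<UNK>"] + most_common

# Toy training corpus, already lowercased and stemmed
train_sentences = [
    "children afterlif kraven hunter",
    "kraven hunter stalk",
    "hunter cosmic balanc",
]

vocabulary = build_vocabulary(train_sentences, vocab_size=4)
# Each word is mapped to its position in the vocabulary
word_to_index = {word: index for index, word in enumerate(vocabulary)}
print(vocabulary)
print(word_to_index)
```

Because the vocabulary is capped, every word of the toy corpus other than the two most frequent ones falls outside of it.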

In some tasks, such as speech recognition or question answering, we only encounter and generate words from a fixed set; for example, a chatbot may only answer a limited set of questions. This fixed list of words is called a `closed vocabulary`. However, a fixed set of words is not always sufficient for the task. Often we need to deal with words we have never seen before, which results in an `open vocabulary`. **Open vocabulary** simply means that we may encounter words outside the vocabulary, like the name of a new city in the training set.

If we train a neural network on a corpus encoded with our vocabulary, then at inference time we must encode the input text with that same vocabulary. Otherwise it will not make sense, because the words would map to different numbers, that is, different tokens. Any word in the training corpus that is not in the vocabulary is replaced by `<UNK>`. Unknown words are also called `Out of vocabulary (OOV)` words. One way to deal with OOV words is to model them with a special word, **\<UNK>**: we replace every OOV word with the special token `<UNK>`. The proportion of unknown words in the test set is called the `OOV rate`.
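
The replacement of OOV words and the OOV rate can be sketched as follows. The vocabulary, the helper names `encode` and `oov_rate`, and the choice of index 1 for `<UNK>` are all assumptions made for this example.

```python
def encode(sentence, word_to_index, unk_index=1):
    """Map each word to its index; unknown words get the <UNK> index."""
    return [word_to_index.get(word, unk_index) for word in sentence.split()]

def oov_rate(sentences, word_to_index):
    """Fraction of words in `sentences` that are not in the vocabulary."""
    words = [word for sentence in sentences for word in sentence.split()]
    unknown = sum(1 for word in words if word not in word_to_index)
    return unknown / len(words)

# Hypothetical vocabulary built on the training set
word_to_index = {"<PAD>": 0, "<UNK>": 1, "hunter": 2, "kraven": 3}
# Toy test sentences containing two words the vocabulary has never seen
test_sentences = ["kraven hunter stalk", "cosmic hunter"]

encoded = [encode(sentence, word_to_index) for sentence in test_sentences]
rate = oov_rate(test_sentences, word_to_index)
print(encoded)
print(rate)
```

The same `word_to_index` mapping must be reused at inference time, which is why the notebook saves the tokenizer configuration and vocabulary to disk.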

At the end, we apply padding: we add 0s to the end of each sequence so that they all have the same length. This is called `post-padding`, because the padding tokens are placed at the end of the sequences.
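
Post-padding itself takes only a few lines. In this sketch `pad_post` is a hypothetical helper, and truncating sequences longer than `max_len` is an added assumption so that every row ends up with exactly the same length.

```python
def pad_post(sequences, max_len, pad_value=0):
    """Append pad_value to each sequence until it has max_len items."""
    padded = []
    for sequence in sequences:
        # Longer sequences are cut so that every row has the same length
        trimmed = sequence[:max_len]
        padded.append(trimmed + [pad_value] * (max_len - len(trimmed)))
    return padded

sequences = [[3, 2, 1], [2], [5, 4, 3, 2, 1]]
padded = pad_post(sequences, max_len=4)
print(padded)
```

All rows now have length 4, with the zeros placed after the real tokens.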

TensorFlow and Keras offer several ways to tokenize words. One of them is the layer used in this system, [`tensorflow.keras.layers.TextVectorization()`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization).
* `TextVectorization()`: generates the vocabulary and creates vectors from the sentences. By default it also lowercases the text and strips punctuation, and it transforms each sentence into a list of integers (the indices of each token in the vocabulary).
* `adapt()`: method of the `TextVectorization()` layer that takes the data and builds a vocabulary from the words found in the sentences.

Fitting the tokenizer on the training set with the previously defined `VOCAB_SIZE` and `MAX_LEN`, and printing the vocabulary size.

In [30]:
sentence_vec = rnn_tokenizer(train_corpus['description'], max_tokens=VOCAB_SIZE, max_len=MAX_LEN)
print(f'Vocabulary size: {sentence_vec.vocabulary_size()}')

Vocabulary size: 1000


Applying the fitted tokenizer to each subset and printing the resulting shapes and the first tokenized example of the training set.

In [32]:
train_tokenized = sentence_vec(train_corpus['description'])
val_tokenized = sentence_vec(val_corpus['description'])
test_tokenized = sentence_vec(test_corpus['description'])
print(f'Tokenized and padded train sequences shape: {train_tokenized.shape}\n\nFirst training padded sequence:\n{train_tokenized[0]}\n')
print(f'Tokenized and padded validation sequences shape: {val_tokenized.shape}\nTokenized and padded test sequences shape: {test_tokenized.shape}')

Tokenized and padded train sequences shape: (9682, 166)

First training padded sequence:
[202 101  80  45 332 101  80 804 323 767   1   1 815 649 499 795 303  92
 409 624   3   1   1   1   1 413   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0]

Tokenized and padded validation sequences shape: (3227, 166)
Tokenized and padded test sequences shape: (3228, 166)


Saving the fitted tokenizer to the `../../models/` directory for later use. We save the configuration used to build the layer and the generated vocabulary.

In [34]:
# Using a context manager so the file is closed (and flushed) after writing
with open('../../models/vectorizer.pkl', 'wb') as file:
    pickle.dump(
        {'config': sentence_vec.get_config(), 'vocabulary': sentence_vec.get_vocabulary()},
        file
    )

Converting the $y$ labels into a column vector, concatenating each tokenized corpus with its corresponding $y$ labels, and printing the shape of each subset.

In [44]:
# Transforming the y labels into a column vector
labels_train = train_corpus[['y']].copy()
labels_val = val_corpus[['y']].copy()
labels_test = test_corpus[['y']].copy()

# Concatenating the corpus of each subset and the corresponding labels
train_tokens = np.concatenate([train_tokenized, labels_train], axis=1)
val_tokens = np.concatenate([val_tokenized, labels_val], axis=1)
test_tokens = np.concatenate([test_tokenized, labels_test], axis=1)
print(f'Preprocessed train set shape: {train_tokens.shape}\nPreprocessed validation set shape: {val_tokens.shape}\nPreprocessed test set shape: {test_tokens.shape}')

Preprocessed train set shape: (9682, 167)
Preprocessed validation set shape: (3227, 167)
Preprocessed test set shape: (3228, 167)
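
Since the labels are stored as the last column of each array, whoever loads these files later has to separate features and labels again. A small sketch with made-up values (the arrays below are toy data, not the notebook's):

```python
import numpy as np

# Toy padded token matrix (3 sentences, length 4) and its column of labels
tokens = np.array([[202, 101, 80, 0], [45, 332, 0, 0], [804, 1, 0, 0]])
labels = np.array([[1], [0], [0]])

# Labels become the last column, as in the cell above
dataset = np.concatenate([tokens, labels], axis=1)

# Recovering features and labels from the combined array
X, y = dataset[:, :-1], dataset[:, -1]
print(dataset.shape, X.shape, y.shape)
```

This explains why the preprocessed shapes have 167 columns: 166 token positions plus 1 label.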


Saving each preprocessed dataset to the `../../data/preprocessed/` directory.

In [47]:
# Saving the dataset with the initial preprocessing to disk
comics_corpus.to_csv('../../data/preprocessed/comics_corpus.csv', index=False)

# Saving the tokenized datasets to disk
np.save('../../data/preprocessed/train_tokens.npy', train_tokens)
np.save('../../data/preprocessed/validation_tokens.npy', val_tokens)
np.save('../../data/preprocessed/test_tokens.npy', test_tokens)

<a name="2.2"></a>
### Transformer Preprocessing
For transformers preprocessing, we will use a pre-trained tokenizer from the `DistilBERT` checkpoint to apply tokenization and padding.

DistilBERT is a small, fast, cheap and light Transformer model trained by distilling the BERT (Bidirectional Encoder Representations from Transformers) base model. It has 40% fewer parameters than bert-base-uncased, runs 60% faster, and preserves more than 95% of BERT's performance as measured on the GLUE (General Language Understanding Evaluation) benchmark.

[Hugging Face](https://huggingface.co/) (🤗) is one of the main resources for pre-trained transformers. Its open-source libraries make it simple to download, fine-tune, and use transformer models like DeepSeek, BERT, Llama, T5, Qwen, GPT-2, and more, and they can be used together with TensorFlow, PyTorch or Flax. In this system, I use 🤗 transformers to apply the `DistilBERT` model to text classification. For the preprocessing step, we use the pre-trained DistilBERT tokenizer `distilbert-base-uncased-finetuned-sst-2-english`: we initialize the `DistilBertTokenizer` class with the desired pre-trained checkpoint.
> In the notebook `05_transformers_finetuning.ipynb`, where the model is fine-tuned, I will talk in more detail about transfer learning and fine-tuning.

Tokenizing and padding the corpus with the pre-trained `DistilBERT` tokenizer, returning the result as PyTorch tensors. Defining the tokenizer and displaying the vector representations of the corpus.

In [53]:
# Loading the pre-trained tokenizer
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
# Tokenizing, padding and truncating the corpus
comics_transformers = tokenizer(
    comics_data['description'].tolist(),
    return_tensors='pt',
    padding='max_length',
    truncation=True
)
# Accessing the vector representations of the corpus
transformers_tokens = comics_transformers['input_ids']
transformers_attention = comics_transformers['attention_mask']
transformers_tokens

tensor([[  101,  1999,  1996,  ...,     0,     0,     0],
        [  101,  2336,  1997,  ...,     0,     0,     0],
        [  101,  2336,  1997,  ...,     0,     0,     0],
        ...,
        [  101,  1037,  3181,  ...,     0,     0,     0],
        [  101,  1996,  2028,  ...,     0,     0,     0],
        [  101, 16228,  2023,  ...,     0,     0,     0]])
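
In the output above, every row starts with 101 and ends with 0s. In the BERT uncased vocabulary, 101 is the id of `[CLS]`, 102 is the id of `[SEP]` and 0 is the id of `[PAD]`. The sketch below reproduces by hand what `padding='max_length'` and the `attention_mask` represent. `pad_with_mask` is a hypothetical helper, the short `max_length` is chosen only for readability, and the truncation is simplified (the real tokenizer keeps `[SEP]` as the last token when it truncates).

```python
PAD_ID = 0  # id of [PAD] in the BERT uncased vocabulary

def pad_with_mask(input_ids, max_length):
    """Pad ids to max_length and mark real tokens with 1, padding with 0."""
    ids = input_ids[:max_length]  # simplified truncation
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad
    return ids + [PAD_ID] * n_pad, attention_mask

# [CLS], two word-piece ids taken from the output above, [SEP]
ids, mask = pad_with_mask([101, 2336, 1997, 102], max_length=6)
print(ids)
print(mask)
```

The attention mask tells the model which positions hold real tokens, so the padded positions are ignored during attention.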

Selecting the labels from the raw dataset and dividing the tokenized tensor into training, validation and test subsets. We again use `stratified sampling` to keep approximately the same proportion of labels in each subset.

Splitting the dataset and printing the size of each subset.

In [56]:
# Selecting labels from the raw dataset
labels = (comics_data['y']
          .map(lambda x: 1 if x == 'action' else 0)
          .to_numpy()
          .reshape(-1, 1))

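# Splitting into the training subset and a held-out subset (validation + test)
train_idx, subset_idx = next(split_train.split(transformers_tokens, labels))
# Splitting the held-out subset between validation and testing
val_rel, test_rel = next(split_test.split(transformers_tokens[subset_idx], labels[subset_idx]))
# The second split returns positions relative to the held-out subset,
# so we map them back to row indices of the full dataset
val_idx, test_idx = subset_idx[val_rel], subset_idx[test_rel]
print(f'Train subset size: {len(train_idx)}\nValidation subset size: {len(val_idx)}\nTest subset size: {len(test_idx)}')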

Train subset size: 10156
Validation subset size: 3385
Test subset size: 3386


Converting the tokenized PyTorch tensors of each stratified subset to the `Dataset` type, in dictionary format, so they can be used to fine-tune the transformer, and printing each dataset.

In [63]:
train_dataset = tensors_to_dataset(transformers_tokens, transformers_attention, labels, train_idx)
val_dataset = tensors_to_dataset(transformers_tokens, transformers_attention, labels, val_idx)
test_dataset = tensors_to_dataset(transformers_tokens, transformers_attention, labels, test_idx)
print(f'Train dataset: {train_dataset}\nValidation dataset: {val_dataset}\nTest dataset: {test_dataset}')

Train dataset: Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 10156
})
Validation dataset: Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 3385
})
Test dataset: Dataset({
    features: ['input_ids', 'attention_mask', 'labels'],
    num_rows: 3386
})
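
The `tensors_to_dataset` helper lives in `src` and is not shown in this notebook. As a rough idea of the row selection such a helper presumably performs, here is a sketch with plain lists. `select_rows` is a hypothetical function and the values are toy data; a dictionary of this shape is the kind of input that `datasets.Dataset.from_dict` accepts.

```python
def select_rows(input_ids, attention_mask, labels, indices):
    """Hypothetical helper: gather the rows at `indices` into a dict of columns."""
    return {
        "input_ids": [input_ids[i] for i in indices],
        "attention_mask": [attention_mask[i] for i in indices],
        "labels": [labels[i] for i in indices],
    }

# Toy data: 3 tokenized examples of length 4
input_ids = [[101, 7, 102, 0], [101, 8, 9, 102], [101, 5, 102, 0]]
attention_mask = [[1, 1, 1, 0], [1, 1, 1, 1], [1, 1, 1, 0]]
labels = [0, 1, 0]

# Indices must refer to rows of the full dataset
subset = select_rows(input_ids, attention_mask, labels, indices=[0, 2])
print(subset)
```

The three keys match the `features` listed in the output above.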


Saving each preprocessed dataset and its metadata to its own directory within the `../../data/preprocessed/` directory.

In [65]:
train_dataset.save_to_disk('../../data/preprocessed/train_dataset')
val_dataset.save_to_disk('../../data/preprocessed/validation_dataset')
test_dataset.save_to_disk('../../data/preprocessed/test_dataset')
