# Preprocessing Step

## Table of Contents
* [Packages](#1)
* [Preprocessing](#2)
    * [RNN Preprocessing](#2.1)
    * [Transformer Preprocessing](#2.2)

<a name="1"></a>
## Packages

Packages used in the system:
* [pandas](https://pandas.pydata.org/): the main package for data manipulation;
* [numpy](https://numpy.org/): the main package for scientific computing;
* [scikit-learn](https://scikit-learn.org/): provides `StratifiedShuffleSplit` for the train/validation/test split;
* [transformers](https://huggingface.co/docs/transformers): provides `DistilBertTokenizer` for the Transformer preprocessing.

In [160]:
import pandas as pd
import numpy as np
import string
from sklearn.model_selection import StratifiedShuffleSplit
from transformers import DistilBertTokenizer

import os
import sys
# Absolute, normalized path of the project root
PROJECT_ROOT = os.path.abspath(
    os.path.join(  # Concatenate the path components
        os.getcwd(),  # Path of the notebooks directory
        os.pardir  # OS constant that refers to the parent directory ('..')
    )
)
# Add the project root to the module search path so that `src` can be imported
sys.path.append(PROJECT_ROOT)
# Project helpers used below (rnn_preprocess, rnn_vectorization, rnn_padding)
from src.preprocessing import *

<a name="2"></a>
## Preprocessing

In [91]:
comics_data = pd.read_csv('../data/raw/comics_corpus.csv')
comics_data.head()

Unnamed: 0,id,title,description,y
0,94799,Demon Days: Mariko (2021) #1 (Variant),IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI...,non-action
1,93339,The Mighty Valkyries (2021) #3,CHILDREN OF THE AFTERLIFE! While Kraven the Hu...,action
2,94884,The Mighty Valkyries (2021) #3 (Variant),CHILDREN OF THE AFTERLIFE! While Kraven the Hu...,action
3,93350,X-Corp (2021) #2,A SHARK IN THE WATER! After X-CORP’s shocking ...,non-action
4,94896,X-Corp (2021) #2 (Variant),A SHARK IN THE WATER! After X-CORP?s shocking ...,non-action


In [93]:
MAX_TOKENS = None
SHUFFLE_BUFFER_SIZE = 1000
BATCH_SIZE = 128

<a name="2.1"></a>
### RNN Preprocessing

In [166]:
stopwords_en = stopwords.words('english')
punct = string.punctuation
print(f'Stopwords: {stopwords_en}\nPunctuations: {punct}')

Stopwords: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so'

In [97]:
comics_data['description'] = comics_data['description'].map(rnn_preprocess)
print(f'Number of duplicate examples: {comics_data["description"].duplicated().sum()}')
comics_data.head()

Number of duplicate examples: 790


Unnamed: 0,id,title,description,y
0,94799,Demon Days: Mariko (2021) #1 (Variant),shadow kirisaki mountain secret histori come l...,non-action
1,93339,The Mighty Valkyries (2021) #3,children afterlif kraven hunter stalk jane fos...,action
2,94884,The Mighty Valkyries (2021) #3 (Variant),children afterlif kraven hunter stalk jane fos...,action
3,93350,X-Corp (2021) #2,shark water x corp shock debut got fenc mend h...,non-action
4,94896,X-Corp (2021) #2 (Variant),shark water x corp shock debut got fenc mend h...,non-action
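
`rnn_preprocess` comes from `src.preprocessing`, and its source is not shown in this notebook. Judging only from the output above (lowercased text with punctuation and stopwords removed, and stemmed words such as `histori`), a simplified stand-in might look like the sketch below. The name `simple_preprocess` and the tiny `STOPWORDS` set are illustrative assumptions, and stemming is deliberately left out to keep the example free of external dependencies.

```python
import string

# Tiny illustrative stopword set (assumption); the notebook uses the full
# NLTK English list printed earlier.
STOPWORDS = {"in", "the", "of", "a", "while", "is", "after", "to"}

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in string.punctuation})


def simple_preprocess(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and drop stopwords.

    Hypothetical stand-in for `rnn_preprocess`; it does NOT apply stemming.
    """
    tokens = text.lower().translate(_PUNCT_TO_SPACE).split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


cleaned = simple_preprocess("A SHARK IN THE WATER! After the debut...")
print(cleaned)  # shark water debut
```

Because variant covers share the same description text, this kind of normalization is what makes the duplicates counted above easy to detect with `duplicated()`.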


In [99]:
comics_data = comics_data.drop_duplicates('description')
print(f'Number of duplicate examples: {comics_data["description"].duplicated().sum()}')
comics_data.head()

Number of duplicate examples: 0


Unnamed: 0,id,title,description,y
0,94799,Demon Days: Mariko (2021) #1 (Variant),shadow kirisaki mountain secret histori come l...,non-action
1,93339,The Mighty Valkyries (2021) #3,children afterlif kraven hunter stalk jane fos...,action
3,93350,X-Corp (2021) #2,shark water x corp shock debut got fenc mend h...,non-action
5,93645,Heroes Reborn: Weapon X & Final Flight (2021) #1,best world without aveng squadron suprem prote...,non-action
6,93052,Heroes Reborn (2021) #6,eon fabl daughter utopia isl known power princ...,non-action


In [101]:
comics_corpus = comics_data[['description', 'y']].copy()
comics_corpus['y'] = comics_corpus['y'].map(lambda x: 1 if x == 'action' else 0)
comics_corpus.head()

Unnamed: 0,description,y
0,shadow kirisaki mountain secret histori come l...,0
1,children afterlif kraven hunter stalk jane fos...,1
3,shark water x corp shock debut got fenc mend h...,0
5,best world without aveng squadron suprem prote...,0
6,eon fabl daughter utopia isl known power princ...,0


In [103]:
sentence_vec, vocab = rnn_vectorization(comics_corpus['description'], max_tokens=MAX_TOKENS)
print(f'Vocabulary size: {len(vocab)}')

Vocabulary size: 15884
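
`rnn_vectorization` is also defined in `src.preprocessing` and not shown here. The general idea of turning a corpus into a vocabulary plus integer sequences can be sketched in plain Python as follows. The helper names, the reserved indices (0 for padding, 1 for out-of-vocabulary tokens), and the frequency ordering are assumptions made for illustration, not the project's actual implementation.

```python
from collections import Counter

# Assumed convention: index 0 is reserved for padding, index 1 for unknown tokens.
PAD, OOV = "", "[UNK]"


def build_vocab(sentences, max_tokens=None):
    """Return PAD and OOV followed by tokens in descending frequency order."""
    counts = Counter(tok for s in sentences for tok in s.split())
    vocab = [PAD, OOV] + [tok for tok, _ in counts.most_common()]
    return vocab if max_tokens is None else vocab[:max_tokens]


def vectorize(sentences, vocab):
    """Map each whitespace-separated token to its vocabulary index."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [[index.get(tok, 1) for tok in s.split()] for s in sentences]


corpus = ["shark water x corp", "children afterlif kraven hunter", "shark hunter"]
vocab = build_vocab(corpus)
ids = vectorize(corpus + ["shark unknown"], vocab)
print(f"Vocabulary size: {len(vocab)}")
print(f"Encoded 'shark unknown': {ids[-1]}")
```

With `max_tokens=None`, as in the cell above, every distinct token is kept, so the vocabulary size depends only on the corpus.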


In [150]:
X_vec = sentence_vec(comics_corpus['description'])

MAX_LEN = max([len(sequence) for sequence in X_vec])
print(f'Max len: {MAX_LEN}')

X_pad = rnn_padding(X_vec, maxlen=MAX_LEN)
print(f'Tokenized and padded corpus shape: {X_pad.shape}\n\nFirst padded sequence: {X_pad[0]}')

Max len: 166
Tokenized and padded corpus shape: (16137, 166)

First padded sequence: [  452  9411  1210    27   241    18   617   375    84   469  5131  5235
  6534    42  3773 10503  1473    46  1868    73   437   558   483  7397
  9289  2825    29   171    64   441   603  3773   586   103  9079  9411
  1210     9   186  1902   153   313   174     7   348    83    87   323
  1222  3540  5235  6534   276    57   720   375    84   469     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0   
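
The output above shows that shorter sequences are filled with zeros on the right until they reach `MAX_LEN`. A minimal NumPy sketch of that behavior is given below; `pad_post` is a hypothetical helper, not the project's `rnn_padding`, whose source is not part of this notebook.

```python
import numpy as np


def pad_post(sequences, maxlen, value=0):
    """Right-pad (or truncate) integer sequences to a fixed length."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int64)
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[:maxlen]
        padded[row, :len(trimmed)] = trimmed
    return padded


sequences = [[452, 9411, 1210], [27], [241, 18]]
max_len = max(len(seq) for seq in sequences)
X_demo = pad_post(sequences, maxlen=max_len)
print(X_demo)
```

Padding to the longest sequence keeps every token, at the cost of rows that are mostly zeros when a few descriptions are much longer than the rest.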

In [145]:
labels = comics_corpus['y'].to_numpy().reshape(-1, 1)

comics_tokens = np.concatenate([X_pad, labels], axis=1)
print(f'Preprocessed dataset shape: {comics_tokens.shape}')

Preprocessed dataset shape: (16137, 167)


In [154]:
split_train = StratifiedShuffleSplit(n_splits=1, test_size=.4, random_state=42)
for train_index, subset_index in split_train.split(comics_tokens, comics_tokens[:, -1]):
    train_corpus, subset_corpus = comics_tokens[train_index], comics_tokens[subset_index]


split_test = StratifiedShuffleSplit(n_splits=1, test_size=.5, random_state=42)
for valid_index, test_index in split_test.split(subset_corpus, subset_corpus[:, -1]):
    valid_corpus, test_corpus = subset_corpus[valid_index], subset_corpus[test_index]

print(f'Train set shape: {train_corpus.shape}\nValidation set shape: {valid_corpus.shape}\nTest set shape: {test_corpus.shape}')

Train set shape: (9682, 167)
Validation set shape: (3227, 167)
Test set shape: (3228, 167)
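
The two chained splits above give roughly 60% / 20% / 20% of the rows to train, validation, and test, and stratifying on the last column keeps the class balance similar in each part. The sketch below checks both properties on synthetic data; the array sizes and the 30% positive rate are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
# Synthetic stand-in for `comics_tokens`: 4 feature columns plus a binary label column.
features = rng.integers(1, 100, size=(1000, 4))
labels = (rng.random(1000) < 0.3).astype(int).reshape(-1, 1)
data = np.concatenate([features, labels], axis=1)

# First split: 60% train, 40% held out.
outer = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
train_idx, subset_idx = next(outer.split(data, data[:, -1]))

# Second split: the held-out 40% is halved into validation and test.
subset = data[subset_idx]
inner = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
valid_idx, test_idx = next(inner.split(subset, subset[:, -1]))

splits = {"train": data[train_idx], "valid": subset[valid_idx], "test": subset[test_idx]}
for name, part in splits.items():
    print(f"{name}: {len(part)} rows, positive rate {part[:, -1].mean():.3f}")
```

Since `n_splits=1`, calling `next()` on the generator is equivalent to the single-iteration `for` loops used in the cell above.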


Saving each preprocessed dataset to the `../data/preprocessed/` directory.

In [115]:
# Dataset with initial preprocessing
comics_corpus.to_csv('../data/preprocessed/comics_corpus.csv', index=False)

# Tokenized datasets
np.save('../data/preprocessed/train_corpus.npy', train_corpus)
np.save('../data/preprocessed/valid_corpus.npy', valid_corpus)
np.save('../data/preprocessed/test_corpus.npy', test_corpus)

<a name="2.2"></a>
### Transformer Preprocessing

In [117]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased-finetuned-sst-2-english')
comics_transformers = tokenizer(
    comics_data['description'].tolist(),
    return_tensors='pt',
    padding=True
)
transformers_tokens = comics_transformers['input_ids']
transformers_tokens

tensor([[  101,  5192, 11382,  ...,     0,     0,     0],
        [  101,  2336,  2044,  ...,     0,     0,     0],
        [  101, 11420,  2300,  ...,     0,     0,     0],
        ...,
        [  101,  2010,  4263,  ...,     0,     0,     0],
        [  101,  2028,  2707,  ...,     0,     0,     0],
        [  101,  9530, 20464,  ...,     0,     0,     0]])

In [152]:
comics_transformers = np.concatenate(
    [transformers_tokens.numpy(), labels], axis=1
)
print(f'Preprocessed transformers dataset shape: {comics_transformers.shape}')

Preprocessed transformers dataset shape: (16137, 252)


In [156]:
# valid_index/test_index are positions within the held-out subset, so select it with subset_index first
subset_transformers = comics_transformers[subset_index]
train_transformers, valid_transformers, test_transformers = comics_transformers[train_index], subset_transformers[valid_index], subset_transformers[test_index]
print(f'Train transformers set shape: {train_transformers.shape}\nValidation transformers set shape: {valid_transformers.shape}\nTest transformers set shape: {test_transformers.shape}')

Train transformers set shape: (9682, 252)
Validation transformers set shape: (3227, 252)
Test transformers set shape: (3228, 252)
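
The indices produced by the second `StratifiedShuffleSplit` are positions within `subset_corpus`, not within the full dataset. Reusing them on another full-size array therefore requires going through `subset_index` first, otherwise validation and test rows can coincide with training rows. The sketch below illustrates this with small, made-up index arrays.

```python
import numpy as np

n_samples = 10
all_idx = np.arange(n_samples)

# Outer split: positions in the full dataset (made-up values for illustration).
train_index = np.array([0, 2, 3, 5, 7, 9])
subset_index = np.array([1, 4, 6, 8])

# Inner split: positions in the 4-row held-out subset, NOT in the full dataset.
valid_index = np.array([0, 3])
test_index = np.array([1, 2])

# Compose the two levels to obtain positions in the full dataset.
valid_global = subset_index[valid_index]
test_global = subset_index[test_index]

# Together, the three groups cover every row exactly once.
combined = np.concatenate([train_index, valid_global, test_global])
print(sorted(combined.tolist()) == all_idx.tolist())

# Applying the inner positions directly to the full dataset would reuse training rows.
leaked = np.intersect1d(train_index, valid_index)
print(f"Rows shared with train when indexing directly: {leaked}")
```

Checking that the three index sets are disjoint is a cheap safeguard against leakage between splits.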


Saving each preprocessed dataset to the `../data/preprocessed/` directory.

In [123]:
np.save('../data/preprocessed/train_transformers.npy', train_transformers)
np.save('../data/preprocessed/valid_transformers.npy', valid_transformers)
np.save('../data/preprocessed/test_transformers.npy', test_transformers)
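
Each saved array stores the token ids with the label appended as the last column, so a consumer has to split the two again after loading. The sketch below shows that round trip on a small stand-in array written to a temporary directory; the array values are made up, and the file name is reused only for illustration.

```python
import os
import tempfile

import numpy as np

# Small stand-in array: 3 padded sequences of length 4 plus a label column.
demo = np.array([[5, 8, 0, 0, 1],
                 [7, 0, 0, 0, 0],
                 [3, 9, 2, 0, 1]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_corpus.npy")
    np.save(path, demo)
    loaded = np.load(path)

# Every column except the last holds token ids; the last column is the label.
X_train, y_train = loaded[:, :-1], loaded[:, -1]
print(X_train.shape, y_train)
```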