# Preprocessing Step

## Table of Contents
* [Packages](#1)
* [Preprocessing](#2)
    * [Global Variables](#2.1)
    * [RNN Preprocessing](#2.2)
    * [Transformer Preprocessing](#2.3)
* [Data Load](#3)

<a name="1"></a>
## Packages
Packages used in the system.
* [pandas](https://pandas.pydata.org/): the main package for data manipulation;
* [numpy](https://numpy.org/): the main package for scientific computing;
* [random](https://docs.python.org/pt-br/3/library/random.html): pseudo-random number generator for various distributions;
* [nltk](https://www.nltk.org/): natural language toolkit, used here for stopwords and stemming;
* [tensorflow](https://www.tensorflow.org/): machine learning framework, used here for text vectorization and padding;
* [scikit-learn](https://scikit-learn.org/stable/): open-source machine learning library;
* [matplotlib](http://matplotlib.org): a library for plotting graphs;
* [seaborn](https://seaborn.pydata.org/): data visualization library based on matplotlib;
* [plotly](https://plotly.com/python/): makes interactive, publication-quality graphs.

In [2]:
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import tensorflow as tf
from tensorflow.data import Dataset, AUTOTUNE
from tensorflow.keras.layers import TextVectorization
from tensorflow.keras.utils import pad_sequences

import os
import sys
PROJECT_ROOT = os.path.abspath(  # Absolute, normalized version of the project root path
    os.path.join(  # Join the two path components
        os.getcwd(),  # Current working directory (the notebooks directory)
        os.pardir  # Constant string the OS uses to refer to the parent directory
    )
)
# Add the project root to the list of paths searched for modules,
# so that the `src` package can be imported
sys.path.append(PROJECT_ROOT)
from src.preprocessing import *
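
As an alternative, the same path setup could be written with `pathlib`. This is only a sketch: the lowercase `project_root` name is introduced here for illustration, and it assumes the notebook is run from a directory one level below the project root, as the cell above does.

```python
import sys
from pathlib import Path

# Assumption: the current working directory is the notebooks directory,
# so its parent is the project root.
project_root = Path.cwd().resolve().parent

# Append only once so that re-running the cell does not add duplicate entries.
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))
```

Checking for an existing entry keeps `sys.path` from growing each time the cell is re-executed.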

<a name="2"></a>
## Preprocessing

In [4]:
comics_data = pd.read_csv('../data/raw/comics_corpus.csv')
comics_data.head()

|   | id | title | description | y |
|---|---|---|---|---|
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... | non-action |
| 1 | 93339 | The Mighty Valkyries (2021) #3 | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... | action |
| 2 | 94884 | The Mighty Valkyries (2021) #3 (Variant) | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... | action |
| 3 | 93350 | X-Corp (2021) #2 | A SHARK IN THE WATER! After X-CORP’s shocking ... | non-action |
| 4 | 94896 | X-Corp (2021) #2 (Variant) | A SHARK IN THE WATER! After X-CORP?s shocking ... | non-action |


<a name="2.1"></a>
### Global Variables

In [6]:
MAX_TOKENS = None
SHUFFLE_BUFFER_SIZE = 1000
BATCH_SIZE = 128

<a name="2.2"></a>
### RNN Preprocessing

In [8]:
stopwords_en = stopwords.words('english')
print(f'Stopwords: {stopwords_en}')

Stopwords: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so'

In [9]:
comics_data['description'] = comics_data['description'].map(rnn_preprocess)
comics_data.head()

|   | id | title | description | y |
|---|---|---|---|---|
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | shadow kirisaki mountain secret histori come l... | non-action |
| 1 | 93339 | The Mighty Valkyries (2021) #3 | children afterlif kraven hunter stalk jane fos... | action |
| 2 | 94884 | The Mighty Valkyries (2021) #3 (Variant) | children afterlif kraven hunter stalk jane fos... | action |
| 3 | 93350 | X-Corp (2021) #2 | shark water x corp shock debut got fenc mend h... | non-action |
| 4 | 94896 | X-Corp (2021) #2 (Variant) | shark water x corp shock debut got fenc mend h... | non-action |
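
The body of `rnn_preprocess` lives in `src/preprocessing` and is not shown in this notebook. Judging only from the output above (lowercased text, punctuation removed, stopwords dropped, stemmed words such as `histori`), a rough sketch of that kind of cleaning might look like the following. The name `preprocess_sketch` and its parameters are hypothetical, and the stemmer is passed in as a callable so the example runs with the standard library alone.

```python
import re
from typing import Callable, Iterable


def preprocess_sketch(
    text: str,
    stop_words: Iterable[str],
    stem: Callable[[str], str] = lambda word: word,
) -> str:
    """Hypothetical stand-in for rnn_preprocess, not the project's implementation."""
    stop_set = set(stop_words)
    # Lowercase, then replace every run of non-letter characters with a space.
    cleaned = re.sub(r"[^a-z]+", " ", text.lower())
    # Drop stopwords and stem whatever remains.
    tokens = [stem(tok) for tok in cleaned.split() if tok not in stop_set]
    return " ".join(tokens)


example = preprocess_sketch(
    "IN THE SHADOW OF KIRISAKI MOUNTAIN",
    stop_words={"in", "the", "of"},
)
print(example)
```

With the imports already in this notebook, one would presumably pass `stopwords.words('english')` as `stop_words` and `PorterStemmer().stem` as `stem`.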


In [18]:
print(f'Number of duplicate examples: {comics_data["description"].duplicated().sum()}')

Number of duplicate examples: 790


In [34]:
comics_data = comics_data.drop_duplicates('description')
comics_data.head()

|   | id | title | description | y |
|---|---|---|---|---|
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | shadow kirisaki mountain secret histori come l... | non-action |
| 1 | 93339 | The Mighty Valkyries (2021) #3 | children afterlif kraven hunter stalk jane fos... | action |
| 3 | 93350 | X-Corp (2021) #2 | shark water x corp shock debut got fenc mend h... | non-action |
| 5 | 93645 | Heroes Reborn: Weapon X & Final Flight (2021) #1 | best world without aveng squadron suprem prote... | non-action |
| 6 | 93052 | Heroes Reborn (2021) #6 | eon fabl daughter utopia isl known power princ... | non-action |


In [22]:
print(f'Number of duplicate examples: {comics_data["description"].duplicated().sum()}')

Number of duplicate examples: 0


In [36]:
comics_corpus = comics_data[['description', 'y']].copy()
comics_corpus['y'] = comics_corpus['y'].map(lambda x: 1 if x == 'action' else 0)
comics_corpus.head()

|   | description | y |
|---|---|---|
| 0 | shadow kirisaki mountain secret histori come l... | 0 |
| 1 | children afterlif kraven hunter stalk jane fos... | 1 |
| 3 | shark water x corp shock debut got fenc mend h... | 0 |
| 5 | best world without aveng squadron suprem prote... | 0 |
| 6 | eon fabl daughter utopia isl known power princ... | 0 |


In [38]:
sentence_vec, vocab = rnn_vectorization(comics_corpus['description'], max_tokens=MAX_TOKENS)
len(vocab)

15884
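
`rnn_vectorization` is defined in `src/preprocessing` and presumably wraps the `TextVectorization` layer imported at the top of the notebook; that is an assumption, since its code is not shown here. To illustrate the idea of a vocabulary lookup, the plain-Python sketch below mimics the convention Keras uses in integer mode: index 0 is reserved for padding and index 1 for out-of-vocabulary tokens, followed by the corpus tokens in descending frequency. The function names are hypothetical.

```python
from collections import Counter


def build_vocab_sketch(texts, max_tokens=None):
    """Toy vocabulary builder: slot 0 is padding, slot 1 is the unknown token."""
    counts = Counter(tok for text in texts for tok in text.split())
    # Two slots are reserved, so at most max_tokens - 2 corpus tokens are kept.
    limit = None if max_tokens is None else max(max_tokens - 2, 0)
    return ["", "[UNK]"] + [tok for tok, _ in counts.most_common(limit)]


def encode_sketch(text, vocab):
    """Map each whitespace-separated token to its index, or 1 if unseen."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in text.split()]


corpus = ["shark water x corp", "shark hunter stalk"]
vocab_sketch = build_vocab_sketch(corpus)
ids = encode_sketch("shark hunter dragon", vocab_sketch)
print(len(vocab_sketch), ids)
```

With `max_tokens=None`, as in the global variables above, no cap is applied and the vocabulary covers every distinct token in the corpus plus the two reserved entries.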

In [65]:
X_vec = sentence_vec(comics_corpus['description'])
print(f'Tokenized corpus shape: {X_vec.shape}')

Tokenized corpus shape: (16137, None)


In [42]:
MAX_LEN = max([len(text) for text in X_vec])
MAX_LEN

166

In [57]:
X_pad = rnn_padding(X_vec, maxlen=MAX_LEN)
X_pad[0]

array([  452,  9411,  1210,    27,   241,    18,   617,   375,    84,
         469,  5131,  5235,  6534,    42,  3773, 10503,  1473,    46,
        1868,    73,   437,   558,   483,  7397,  9289,  2825,    29,
         171,    64,   441,   603,  3773,   586,   103,  9079,  9411,
        1210,     9,   186,  1902,   153,   313,   174,     7,   348,
          83,    87,   323,  1222,  3540,  5235,  6534,   276,    57,
         720,   375,    84,   469,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,     0,     0,     0,     0,     0,     0,     0,
           0,     0,
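
The zeros at the end of the row above are consistent with post-padding, where shorter sequences are filled on the right up to `MAX_LEN`. `rnn_padding` itself is defined in `src/preprocessing` and probably relies on the imported `pad_sequences`; the sketch below is a plain-Python illustration of the same idea under that assumption, with a hypothetical function name.

```python
def pad_post_sketch(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` until it has `maxlen` items."""
    padded = []
    for seq in sequences:
        # Keep at most the first maxlen items. Note that Keras pad_sequences
        # truncates from the start by default, unlike this sketch.
        seq = list(seq)[:maxlen]
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


rows = pad_post_sketch([[452, 9411, 1210], [27]], maxlen=4)
print(rows)
```

Because `MAX_LEN` is the length of the longest tokenized description, no sequence in this notebook needs truncating, so only the padding branch matters here.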

In [50]:
labels = comics_corpus['y'].to_numpy().reshape(-1, 1)

comics_tokens = np.concatenate([X_pad, labels], axis=1)
comics_tokens.shape

(16137, 167)

<a name="2.3"></a>
### Transformer Preprocessing

<a name="3"></a>
## Data Load
Saving each preprocessed dataset to the `../data/preprocessed/` directory.

In [54]:
comics_corpus.to_csv('../data/preprocessed/comics_corpus.csv', index=False)
np.save('../data/preprocessed/comics_tokens.npy', comics_tokens)
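
Since the labels were concatenated as the last column of `comics_tokens`, whoever loads the `.npy` file later has to split that column off again. The sketch below shows one way to do so on a small made-up array written to a temporary directory; the array values and the temporary path are illustrative only.

```python
import os
import tempfile

import numpy as np

# Toy stand-in for comics_tokens: four padded token ids plus one label per row.
tokens = np.array([[452, 9411, 0, 0, 1],
                   [27, 241, 18, 0, 0]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "comics_tokens.npy")
    np.save(path, tokens)
    loaded = np.load(path)

X = loaded[:, :-1]  # every column except the last: padded token ids
y = loaded[:, -1]   # last column: binary label (1 = action, 0 = non-action)
print(X.shape, y.tolist())
```

For the real file, `X` would have 166 columns, matching `MAX_LEN`, and `y` would hold the 16137 labels.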