# Ingestion Step (Etapa de Ingestão)
## Table of Contents
* [Packages](#1)
* [Environment Variables](#2)
* [Extract, Transform and Load (ETL)](#3)
    * [Data Extract](#3.1)
    * [Data Transform](#3.2)
        * [Data Labeling](#3.2.1)
            * [Zero-Shot Learning](#3.2.1.1)
    * [Data Load](#3.3)

<a name="1"></a>
## Packages (Pacotes)
**[EN-US]**

Packages used in the system.
* [os](https://docs.python.org/3/library/os.html): built-in module, provides a portable way of using operating system dependent functionality;
* [requests](https://pypi.org/project/requests/): is a simple HTTP library, for making HTTP requests;
* [hashlib](https://docs.python.org/3/library/hashlib.html): implements a common interface to many different secure hash and message digest algorithms;
* [time](https://docs.python.org/3/library/time.html): provides various time-related functions;
* [dotenv](https://pypi.org/project/python-dotenv/): reads key-value pairs from a .env file and can set them as environment variables;
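* [pandas](https://pandas.pydata.org/): is the main package for data manipulation;
* [sys](https://docs.python.org/3/library/sys.html): built-in module, provides access to variables and functions that interact with the interpreter, such as the module search path;
* [numpy](https://numpy.org/): is the package for numerical computing with arrays;
* [transformers](https://huggingface.co/docs/transformers): provides pretrained models and the `pipeline` API used for labeling.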

**[PT-BR]**

Pacotes utilizados no sistema.
* [os](https://docs.python.org/3/library/os.html): módulo integrado, fornece uma maneira portátil de usar funcionalidades dependentes do sistema operacional;
* [requests](https://pypi.org/project/requests/): é uma biblioteca HTTP simples, para fazer solicitações HTTP;
* [hashlib](https://docs.python.org/3/library/hashlib.html): implementa uma interface comum para muitos algoritmos diferentes de hash seguro e resumo de mensagens;
* [time](https://docs.python.org/3/library/time.html): fornece várias funções relacionadas ao tempo;
* [dotenv](https://pypi.org/project/python-dotenv/): lê pares de chave-valor de um arquivo .env e pode defini-los como variáveis de ambiente;
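* [pandas](https://pandas.pydata.org/): é o principal pacote para manipulação de dados;
* [sys](https://docs.python.org/3/library/sys.html): módulo integrado, fornece acesso a variáveis e funções que interagem com o interpretador, como o path de pesquisa dos módulos;
* [numpy](https://numpy.org/): é o pacote para computação numérica com arrays;
* [transformers](https://huggingface.co/docs/transformers): fornece modelos pré-treinados e a API `pipeline` usada na rotulação.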

In [41]:
import os
import sys
from requests import get
from hashlib import md5
from time import time
from dotenv import load_dotenv
load_dotenv()

import pandas as pd
import numpy as np
from transformers import pipeline

PROJECT_ROOT = os.path.abspath( # Getting the absolute, normalized version of the project root path (Obtendo a versão absoluta normalizada do path raiz do projeto)
    os.path.join( # Concatenating the paths (Concatenando os paths)
        os.getcwd(), # Getting the path of the notebooks directory (Obtendo o path do diretório dos notebooks)
        os.pardir # Getting the constant string used by the OS to refer to the parent directory (Obtendo a string constante usada pelo OS para fazer referência ao diretório pai)
    )
)
# Adding path to the list of strings that specify the search path for modules
# Adicionando o path à lista de strings que especifica o path de pesquisa para os módulos
sys.path.append(PROJECT_ROOT)
from utils.ingestion import *

<a name="2"></a>
## Environment Variables (Variáveis de Ambiente)
**[EN-US]**

Setting the environment variables:
* `MARVEL_PUBLIC_KEY`: the public key for connecting and using the APIs.
* `MARVEL_PRIVATE_KEY`: the private key for connecting and using the APIs.

**[PT-BR]**

Definindo as variáveis de ambiente:
* `MARVEL_PUBLIC_KEY`: a public key para conexão e uso das APIs.
* `MARVEL_PRIVATE_KEY`: a private key para conexão e uso das APIs.
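A minimal sketch of reading these variables defensively, so that a missing key fails with a readable message instead of a bare `KeyError`. The helper name `require_env` and the `DEMO_PUBLIC_KEY` variable are hypothetical and exist only for this illustration; the notebook itself reads `os.environ` directly.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        # Hypothetical error text; adapt it to your own setup.
        raise RuntimeError(f"Missing environment variable: {name}. Check your .env file.")
    return value


# Demo variable set here only so the example runs without a real .env file.
os.environ.setdefault("DEMO_PUBLIC_KEY", "demo-value")
demo_key = require_env("DEMO_PUBLIC_KEY")
print(demo_key)
```

With real credentials, the same helper could be called as `require_env('MARVEL_PUBLIC_KEY')` after `load_dotenv()` has run.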

In [38]:
PUBLIC_KEY = str(os.environ['MARVEL_PUBLIC_KEY'])
PRIVATE_KEY = str(os.environ['MARVEL_PRIVATE_KEY'])

<a name="3"></a>
## Extract, Transform and Load (Extração, Transformação e Carregamento)
**[EN-US]**

ETL (Extract, Transform and Load): requesting authorization from Marvel, extracting data on Marvel characters and comics, transforming this data and loading it to disk.

**[PT-BR]**

ETL (Extração, Transformação e Carregamento): requisitando a autorização da Marvel, extraindo os dados dos personagens e dos comics da Marvel, transformando esses dados e carregando os dados no disco.
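The authorization step can be sketched as follows. The Marvel API documentation describes server-side requests as carrying a timestamp `ts`, the public key as `apikey`, and a `hash` equal to the MD5 digest of `ts + private key + public key`. Whether `MarvelIngestion` builds its requests exactly this way is an assumption, since its source is not shown here; the function name `build_auth_params` and the key values are hypothetical.

```python
from hashlib import md5
from time import time


def build_auth_params(public_key: str, private_key: str, ts=None) -> dict:
    """Build the ts/apikey/hash query parameters for an authenticated request."""
    # Use the current Unix time unless a fixed timestamp is supplied (useful for tests).
    ts = str(int(time())) if ts is None else str(ts)
    # Hash of timestamp + private key + public key, in that order.
    digest = md5(f"{ts}{private_key}{public_key}".encode("utf-8")).hexdigest()
    return {"ts": ts, "apikey": public_key, "hash": digest}


# Placeholder keys, not real credentials.
params = build_auth_params("my-public-key", "my-private-key", ts=1)
print(params["ts"], params["apikey"], len(params["hash"]))
```

These parameters would then be passed to `requests.get(url, params=params)`; the private key itself is never sent.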

<a name="3.1"></a>
### Data Extract (Extração dos Dados)
**[EN-US]**

Extracting data on characters and comics along with their descriptions, then displaying the first 5 examples of the comics dataset `df_comics` and the characters dataset `df_characters`.

**[PT-BR]**

Extraindo os dados dos personagens e comics e suas respectivas descrições e, em seguida, exibindo os 5 primeiros exemplos do dataset dos comics `df_comics` e dos personagens `df_characters`.
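Extracting thousands of records from a paginated API usually means looping over `offset` and `limit` until a short page comes back. The sketch below shows that loop against a fake in-memory source so it runs offline. The names `fetch_all` and `fake_fetch_page` and the page size of 100 are assumptions for illustration, not the actual internals of `MarvelIngestion`.

```python
from typing import Callable, Dict, List


def fetch_all(fetch_page: Callable[[int, int], List[Dict]], limit: int = 100) -> List[Dict]:
    """Collect every record by requesting pages until one comes back short."""
    records: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(offset, limit)
        records.extend(page)
        # A page shorter than the limit (possibly empty) means there is nothing left.
        if len(page) < limit:
            break
        offset += limit
    return records


# Fake data source standing in for the HTTP endpoint.
FAKE_RECORDS = [{"id": i, "title": f"Comic {i}"} for i in range(250)]


def fake_fetch_page(offset: int, limit: int) -> List[Dict]:
    return FAKE_RECORDS[offset:offset + limit]


all_records = fetch_all(fake_fetch_page, limit=100)
print(len(all_records))  # 250
```

In a real client, `fetch_page` would wrap the HTTP call and return the list of results from the response body.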

Initializing the class (Inicializando a classe).

In [234]:
ingestion = MarvelIngestion(PUBLIC_KEY, PRIVATE_KEY)

Extracting the `df_comics` and `df_characters` datasets, and displaying the first 5 examples of each dataset (Extraindo os datasets `df_comics` e `df_characters`, e exibindo os primeiros 5 exemplos de cada dataset).

In [71]:
df_comics = ingestion(endpoint='comics', format_='comic')
print(f'Comics dataset dimension: {df_comics.shape}')
df_comics.head()

Comics dataset dimension: (18802, 3)


| | id | title | description |
|---|---|---|---|
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 1 | 94801 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 2 | 94802 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 3 | 94803 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 4 | 94804 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |


In [81]:
df_characters = ingestion(endpoint='characters', offset=0)
print(f'Characters dataset dimension: {df_characters.shape}')
df_characters.head()

Characters dataset dimension: (305, 3)


| | id | name | description |
|---|---|---|---|
| 0 | 1017100 | A-Bomb (HAS) | Rick Jones has been Hulk's best bud since day ... |
| 1 | 1009144 | A.I.M. | AIM is a terrorist organization bent on destro... |
| 2 | 1009146 | Abomination (Emil Blonsky) | Formerly known as Emil Blonsky, a spy of Sovie... |
| 3 | 1010354 | Adam Warlock | Adam Warlock is an artificially created human ... |
| 4 | 1011031 | Agent X (Nijo) | Originally a partner of the mind-altering assa... |


<a name="3.2"></a>
### Data Transform (Transformação dos Dados)
**[EN-US]**

As we can see above, the `title` and `description` features of the comics dataset contain duplicate data, even when the `id` is different. So let's delete these duplicate examples.

First, let's transform the `id` feature into the index of this dataset to count duplicate examples between the `title` and `description` features.

**[PT-BR]**

Como podemos ver acima, as features `title` e `description` do dataset dos comics têm dados duplicados, mesmo quando o `id` é diferente. Portanto, vamos excluir esses exemplos duplicados.

Primeiro, vamos transformar a feature `id` no índice desse dataset para contar os exemplos duplicados entre as features `title` e `description`.
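A small sketch of why the `id` is moved into the index before counting: `DataFrame.duplicated()` compares columns only, so rows that differ solely by `id` count as duplicates once `id` is no longer a column. The toy data below is made up for illustration; passing `subset=` is an equivalent alternative that leaves the index untouched.

```python
import pandas as pd

# Made-up rows: ids are all different, some title/description pairs repeat.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A #1", "A #1", "B #1", "B #1 (Variant)"],
    "description": ["same text", "same text", "other text", "other text"],
})

# With id as a column, every row is unique.
dup_with_id = int(toy.duplicated().sum())

# With id as the index, only title and description are compared.
dup_by_index = int(toy.set_index("id").duplicated().sum())

# Same result without touching the index.
dup_by_subset = int(toy.duplicated(subset=["title", "description"]).sum())

# Comparing descriptions only catches the variant cover as well.
dup_by_description = int(toy.duplicated(subset="description").sum())

print(dup_with_id, dup_by_index, dup_by_subset, dup_by_description)  # 0 1 1 2
```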

In [90]:
df_comics = df_comics.set_index('id')
df_comics.head()

| id | title | description |
|---|---|---|
| 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 94801 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 94802 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 94803 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 94804 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |


Counting duplicate examples (Contando os exemplos duplicados).

In [93]:
print(f'Number of duplicate examples: {df_comics.duplicated().sum()}')

Number of duplicate examples: 1069


Dropping duplicate examples (Deletando os exemplos duplicados).

In [141]:
df_comics = df_comics.drop_duplicates()
print(f'Number of duplicate examples: {df_comics.duplicated().sum()}')
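df_comics = df_comics.reset_index() # Restoring `id` as a column; there is no `index` column to drop (Restaurando `id` como coluna; não há coluna `index` para remover)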
df_comics.head()

Number of duplicate examples: 0


| | id | title | description |
|---|---|---|---|
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 1 | 93339 | The Mighty Valkyries (2021) #3 | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... |
| 2 | 94884 | The Mighty Valkyries (2021) #3 (Variant) | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... |
| 3 | 93350 | X-Corp (2021) #2 | A SHARK IN THE WATER! After X-CORP’s shocking ... |
| 4 | 94896 | X-Corp (2021) #2 (Variant) | A SHARK IN THE WATER! After X-CORP’s shocking ... |


Checking duplicates only for the `description` feature of each comic (Checando as duplicatas apenas para a feature `description` de cada comic).

In [184]:
print(f'Number of duplicate examples: {df_comics[["description"]].duplicated().sum()}\ndataset dimension: {df_comics.shape}')

Number of duplicate examples: 806
dataset dimension: (17733, 3)


Dropping duplicate examples (Deletando os exemplos duplicados).

In [187]:
df_comics = df_comics.drop_duplicates('description')
df_comics.shape

(16927, 3)

Checking duplicates only for the `description` of each comic after deleting duplicate examples (Checando as duplicatas apenas para a `description` de cada comic após excluir os exemplos duplicados).

In [190]:
print(f'Number of duplicate examples: {df_comics[["description"]].duplicated().sum()}')

Number of duplicate examples: 0


Performing the same validation for the `df_characters` dataset (Fazendo a mesma validação para o dataset `df_characters`).

Checking duplicates only for the `description` feature of each character (Checando as duplicatas apenas para a feature `description` de cada personagem).

Counting duplicate examples (Contando os exemplos duplicados).

In [194]:
print(f'Number of duplicate examples: {df_characters.duplicated("description").sum()}')

Number of duplicate examples: 1


The `df_characters` dataset has one example with a duplicated `description`, so let's drop it (O dataset `df_characters` tem um exemplo com `description` duplicada, portanto vamos excluí-lo).

Dropping the duplicate example and displaying the first 5 examples (Deletando o exemplo duplicado e exibindo os primeiros 5 exemplos).

In [198]:
df_characters = df_characters.drop_duplicates('description')
df_characters.head()

| | id | name | description |
|---|---|---|---|
| 0 | 1017100 | A-Bomb (HAS) | Rick Jones has been Hulk's best bud since day ... |
| 1 | 1009144 | A.I.M. | AIM is a terrorist organization bent on destro... |
| 2 | 1009146 | Abomination (Emil Blonsky) | Formerly known as Emil Blonsky, a spy of Sovie... |
| 3 | 1010354 | Adam Warlock | Adam Warlock is an artificially created human ... |
| 4 | 1011031 | Agent X (Nijo) | Originally a partner of the mind-altering assa... |


In [207]:
print(f'Number of duplicate examples: {df_characters["description"].duplicated().sum()}')

Number of duplicate examples: 0


<a name="3.2.1"></a>
#### Data Labeling (Rotulação de Dados)
**[EN-US]**

With duplicates removed, the extracted data still needs labels. There are several options: labeling each example manually (the `in-house` method), hiring a company to label the data (`outsourced`), using a crowdsourcing platform so that a large group labels the data collectively (`crowdsourced`), or applying `zero-shot learning`, which is the method used in this system.

**[PT-BR]**

Com as duplicatas removidas, os dados extraídos ainda precisam de rótulos. Há várias opções: rotular exemplo por exemplo manualmente (método `in-house`), contratar uma empresa para rotular os dados (`outsourced`), usar uma plataforma de crowdsourcing para que um grande grupo rotule os dados coletivamente (`crowdsourced`), ou aplicar `zero-shot learning`, que foi o método utilizado nesse sistema.

Setting the labels and selecting the texts from each of the 2 datasets (Definindo os labels e selecionando os textos de cada um dos 2 datasets).

In [45]:
labels = [
    'suspense',
    'action',
    'comedy',
    'horror',
    'neutral'
]

corpus_comics = df_comics['description'].tolist()
corpus_char = df_characters['description'].tolist()

print(f'Comic corpus size: {len(corpus_comics)}\nCharacter corpus size: {len(corpus_char)}')

Comic corpus size: 16927
Character corpus size: 304


<a name="3.2.1.1"></a>
##### Zero-Shot Learning
**[EN-US]**

Traditionally, zero-shot learning (ZSL) most often referred to a fairly specific type of task: learn a classifier on one set of labels and then evaluate on a different set of labels that the classifier has never seen before. Recently, especially in NLP, it's been used much more broadly to mean get a model to do something that it wasn't explicitly trained to do. A well-known example of this is in the GPT-2 (Generative Pretrained Transformer-2, OpenAI) paper where the authors evaluate a language model on downstream tasks like machine translation without fine-tuning on these tasks directly.

Let's say we have a sequence embedding model $\phi_{\mathrm{sent}}$, a set of possible class names $\mathrm{C}$, a projection matrix $Z$ applied to every embedding (the reference describes how it is obtained), and the cosine similarity metric $\cos$. Then, we classify a given sequence $\vec{x}$ according to:
$$\hat{c} = \arg\max_{c \in \mathrm{C}} \cos\left(\phi_{\mathrm{sent}}(\vec{x}) Z,\ \phi_{\mathrm{sent}}(c) Z\right)$$
> **Reference**: [Zero-Shot Learning in Modern NLP](https://joeddav.github.io/blog/2020/05/29/ZSL.html)

Therefore, we use the `zero-shot learning` method to label our data.

Setting the pipeline with the `zero-shot-classification` task and the `facebook/bart-large-mnli` model.

**[PT-BR]**

Tradicionalmente, o zero-shot learning (ZSL) geralmente se referia a um tipo bastante específico de tarefa, aprender um classificador em um set de labels e depois avaliar em um set diferente de labels que o classificador nunca viu antes. Recentemente, especialmente em NLP, tem sido usado de forma muito mais ampla para significar fazer com que um modelo faça algo para o qual não foi explicitamente treinado. Um exemplo bem conhecido disso está no paper do GPT-2 (Generative Pretrained Transformer-2, OpenAI), onde os autores avaliam um modelo de linguagem em tarefas posteriores, como neural machine translation (NMT), sem fine-tuning direto nessas tarefas.

Digamos que temos um modelo de embedding $\phi_{\mathrm{sent}}$, um set de possíveis classes $\mathrm{C}$, uma matriz de projeção $Z$ aplicada a cada embedding (a referência descreve como ela é obtida) e a métrica de similaridade de cosseno $\cos$. Então, classificamos uma determinada sequência $\vec{x}$ de acordo com:
$$\hat{c} = \arg\max_{c \in \mathrm{C}} \cos\left(\phi_{\mathrm{sent}}(\vec{x}) Z,\ \phi_{\mathrm{sent}}(c) Z\right)$$
> **Referência**: [Zero-Shot Learning in Modern NLP](https://joeddav.github.io/blog/2020/05/29/ZSL.html)

Portanto, usamos o método `zero-shot learning` para rotular os nossos dados.

Definindo a pipeline com a tarefa de `zero-shot-classification` e o modelo `facebook/bart-large-mnli`.
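The classification rule above can be sketched with NumPy on toy vectors. The embeddings here are made up rather than produced by a real $\phi_{\mathrm{sent}}$, and $Z$ defaults to the identity matrix, which is an assumption made only to keep the example small. Note also that the `zero-shot-classification` pipeline with an MNLI model scores labels through natural language inference (entailment) rather than through embedding similarity, so this sketch illustrates the formula, not the pipeline's internals.

```python
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def classify(x_emb: np.ndarray, class_embs: dict, Z=None):
    """Return the class whose projected embedding is closest to the projected input."""
    if Z is None:
        # Assumption for this sketch: no projection, i.e. Z is the identity.
        Z = np.eye(x_emb.shape[0])
    scores = {c: cosine(x_emb @ Z, emb @ Z) for c, emb in class_embs.items()}
    return max(scores, key=scores.get), scores


# Toy embeddings standing in for phi_sent(c) and phi_sent(x).
class_embs = {
    "action": np.array([1.0, 0.0, 0.0]),
    "comedy": np.array([0.0, 1.0, 0.0]),
    "horror": np.array([0.0, 0.0, 1.0]),
}
x_emb = np.array([0.9, 0.1, 0.2])

best, scores = classify(x_emb, class_embs)
print(best)  # action
```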

In [66]:
pipe_bart = pipeline(
    'zero-shot-classification',
    model='facebook/bart-large-mnli'
)

Device set to use cpu


Running the zero-shot learning task to label the data in each dataset (Executando a tarefa de zero-shot learning para rotular os dados de cada dataset).

In [51]:
output_bart_comics = pipe_bart(corpus_comics, labels)
output_bart_char = pipe_bart(corpus_char, labels)

print(f'Comic corpus label vector size: {len(output_bart_comics)}\nCharacter corpus label vector size: {len(output_bart_char)}')

Comic corpus label vector size: 16927
Character corpus label vector size: 304


**[EN-US]**

Going through the model output, finding the index of the highest score for each example, using that index to pick the corresponding label, collecting the labels in a list to add to the final dataset, and displaying the first 5 examples of the dataset.

**[PT-BR]**

Percorrendo o output do modelo, encontrando o índice do maior score de cada exemplo, usando esse índice para selecionar o label correspondente, reunindo os labels em uma lista para adicionar ao dataset final e exibindo os primeiros 5 exemplos do dataset.
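The selection step can be tried on a mocked result before running it on the full output. The dictionaries below imitate the `sequence` / `labels` / `scores` structure returned by the pipeline; the texts and scores are invented for illustration, and `top_label` is a hypothetical helper name. The `transformers` documentation describes `labels` as sorted by likelihood, so the first label should already be the top one; the explicit search for the maximum is kept to mirror the cell that follows.

```python
# Mocked pipeline results with made-up scores.
mock_output = [
    {"sequence": "text a", "labels": ["action", "suspense", "comedy"], "scores": [0.7, 0.2, 0.1]},
    {"sequence": "text b", "labels": ["horror", "action", "comedy"], "scores": [0.5, 0.3, 0.2]},
]


def top_label(result: dict) -> str:
    """Return the label paired with the highest score in one result."""
    scores = result["scores"]
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return result["labels"][best_idx]


top_labels = [top_label(result) for result in mock_output]
print(top_labels)  # ['action', 'horror']
```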

In [59]:
labels_comic = []
for i in range(len(output_bart_comics)):
    idx = np.argmax(output_bart_comics[i]['scores'])
    label = output_bart_comics[i]['labels'][idx]
    labels_comic.append(label)

df_comics['y'] = labels_comic
df_comics.head()

| | id | title | description | y |
|---|---|---|---|---|
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... | suspense |
| 1 | 93339 | The Mighty Valkyries (2021) #3 | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... | action |
| 2 | 94884 | The Mighty Valkyries (2021) #3 (Variant) | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... | action |
| 3 | 93350 | X-Corp (2021) #2 | A SHARK IN THE WATER! After X-CORP’s shocking ... | suspense |
| 4 | 94896 | X-Corp (2021) #2 (Variant) | A SHARK IN THE WATER! After X-CORP’s shocking ... | suspense |


Repeating the process done for the comics dataset (Repetindo o processo feito para o dataset de comics).

In [61]:
labels_char = []
for i in range(len(output_bart_char)):
    idx = np.argmax(output_bart_char[i]['scores'])
    label = output_bart_char[i]['labels'][idx]
    labels_char.append(label)

df_characters['y'] = labels_char
df_characters.head()

| | id | name | description | y |
|---|---|---|---|---|
| 0 | 1017100 | A-Bomb (HAS) | Rick Jones has been Hulk's best bud since day ... | action |
| 1 | 1009144 | A.I.M. | AIM is a terrorist organization bent on destro... | action |
| 2 | 1009146 | Abomination (Emil Blonsky) | Formerly known as Emil Blonsky, a spy of Sovie... | action |
| 3 | 1010354 | Adam Warlock | Adam Warlock is an artificially created human ... | action |
| 4 | 1011031 | Agent X (Nijo) | Originally a partner of the mind-altering assa... | action |


<a name="3.3"></a>
### Data Load (Carregamento dos Dados)
Loading each dataset into the `../data/raw/` directory (Carregando cada dataset no diretório `../data/raw/`).

In [241]:
df_comics.to_csv('../data/raw/comics_corpus.csv', index=False)
df_characters.to_csv('../data/raw/char_corpus.csv', index=False)
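
`to_csv` does not create missing directories, so it can help to make sure the target folder exists before writing. A minimal sketch, using a temporary directory and a made-up DataFrame so that it does not touch the project's real `../data/raw/` folder; the file name `demo_corpus.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up labeled data standing in for the real datasets.
demo = pd.DataFrame({
    "id": [1, 2],
    "description": ["first text", "second text"],
    "y": ["action", "comedy"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data" / "raw"
    # Create the folder (and its parents) if it is not there yet.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "demo_corpus.csv"
    # index=False keeps the positional index out of the file.
    demo.to_csv(out_file, index=False)
    # Read the file back to confirm the columns survived the round trip.
    restored = pd.read_csv(out_file)

print(list(restored.columns))  # ['id', 'description', 'y']
```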