# Ingestion Step (Etapa de Ingestão)

## Table of Contents
* [Packages](#1)
* [Ingestion Utility Function](#2)
* [Environment Variables](#3)
* [Extract, Transform and Load (ETL)](#4)
    * [Data Extract](#4.1)
    * [Data Transform](#4.2)
    * [Data Load](#4.3)

<a name="1"></a>
## Packages (Pacotes)
**[EN-US]**

Packages used in the system.
* [os](https://docs.python.org/3/library/os.html): built-in module that provides a portable way of using operating system dependent functionality;
* [requests](https://pypi.org/project/requests/): a simple library for making HTTP requests;
* [hashlib](https://docs.python.org/3/library/hashlib.html): implements a common interface to many different secure hash and message digest algorithms;
* [time](https://docs.python.org/3/library/time.html): provides various time-related functions;
* [dotenv](https://pypi.org/project/python-dotenv/): reads key-value pairs from a .env file and can set them as environment variables;
* [pandas](https://pandas.pydata.org/): the main package used here for data manipulation.

**[PT-BR]**

Pacotes utilizados no sistema.
* [os](https://docs.python.org/3/library/os.html): módulo integrado que fornece uma maneira portátil de usar funcionalidades dependentes do sistema operacional;
* [requests](https://pypi.org/project/requests/): uma biblioteca simples para fazer requisições HTTP;
* [hashlib](https://docs.python.org/3/library/hashlib.html): implementa uma interface comum para muitos algoritmos diferentes de hash seguro e resumo de mensagens;
* [time](https://docs.python.org/3/library/time.html): fornece várias funções relacionadas ao tempo;
* [dotenv](https://pypi.org/project/python-dotenv/): lê pares de chave-valor de um arquivo .env e pode defini-los como variáveis de ambiente;
* [pandas](https://pandas.pydata.org/): o principal pacote usado aqui para manipulação de dados.

In [40]:
import os
from hashlib import md5
from time import time

import pandas as pd
from dotenv import load_dotenv
from requests import get

# Load the key-value pairs of the .env file into the environment variables.
load_dotenv()

<a name="2"></a>
## Ingestion Utility Function (Função Utilitária de Ingestão)
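**[EN-US]**

The `MarvelIngestion` class wraps the requests to the Marvel API. For each request it builds the authentication parameters (`ts`, `apikey` and an MD5 `hash` of the timestamp, private key and public key), pages through the chosen endpoint `limit` items at a time, and keeps the `id`, the `name` (characters) or `title` (other endpoints), and the `description` of every item whose description has more than 4 characters. The result is returned as a pandas `DataFrame`.

**[PT-BR]**

A classe `MarvelIngestion` encapsula as requisições à API da Marvel. Para cada requisição, ela monta os parâmetros de autenticação (`ts`, `apikey` e um `hash` MD5 do timestamp, da private key e da public key), percorre o endpoint escolhido em páginas de `limit` itens e mantém o `id`, o `name` (personagens) ou `title` (demais endpoints) e a `description` de cada item cuja descrição tenha mais de 4 caracteres. O resultado é retornado como um `DataFrame` do pandas.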

In [43]:
class MarvelIngestion:
    """
    Client that downloads the items of a Marvel API endpoint and returns
    their `id`, `name` or `title`, and `description` as a pandas DataFrame.
    """
    def __init__(
        self,
        public_key,
        private_key,
        url='http://gateway.marvel.com/v1/public/',
        limit=100,
    ):
        """
        Store the API keys, the base URL, the page size (`limit`) and the
        request headers.
        """
        self.public_key = public_key
        self.private_key = private_key

        self.url = url
        self.limit = limit
        self.headers = {
            'Accept-Encoding': '*',
            'Accept': '*/*',
            'Connection': 'keep-alive'
        }

    def get_params(self, offset, format_=None):
        """
        Build the query parameters of a request, including the `ts`,
        `apikey` and `hash` authentication parameters.
        """
        ts = str(time())
        # The hash is the MD5 digest of ts + private key + public key.
        hash_ = md5(
            (ts + self.private_key + self.public_key).encode('utf-8')
        ).hexdigest()
        params = {
            'ts': ts,
            'apikey': self.public_key,
            'hash': hash_,
            'limit': self.limit,
            'offset': offset,
            'format': format_
        }

        return params

    def __call__(self, endpoint, offset=0, format_=None, retries=5):
        """
        Page through `endpoint` starting at `offset` and return a DataFrame
        with the id, name/title and description of each item. A failed page
        is requested again until `retries` failures have occurred.
        """
        key = 'name' if endpoint == 'characters' else 'title'
        examples = []
        total = None

        # Stop when every item has been read or the retry budget is spent.
        while retries > 0 and (total is None or offset < total):
            try:
                params = self.get_params(offset=offset, format_=format_)
                response = get(
                    self.url + endpoint, params=params, headers=self.headers
                ).json()
                total = response['data']['total']
                page = [
                    [result['id'], result[key], result['description']]
                    for result in response['data']['results']
                ]
            except Exception:
                # Failed request or malformed payload: retry the same page.
                retries -= 1
                continue

            for example in page:
                description = example[2]
                # Keep new examples whose description has more than 4 characters.
                if description and len(description) > 4 and example not in examples:
                    examples.append(example)

            offset += self.limit

        features = ['id', key, 'description']
        df = pd.DataFrame(examples, columns=features)

        return df
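
The request signing done in `get_params` can be looked at in isolation. The block below is a minimal sketch, assuming the same scheme used by the class above (an MD5 digest of `ts + private_key + public_key`). The helper name `build_auth_params` and the key values are hypothetical and exist only for illustration.

```python
from hashlib import md5
from time import time


def build_auth_params(public_key, private_key, ts=None):
    """Return the `ts`, `apikey` and `hash` query parameters (hypothetical helper)."""
    # Use the current time unless a fixed timestamp is given (handy for testing).
    ts = str(time()) if ts is None else str(ts)
    digest = md5((ts + private_key + public_key).encode('utf-8')).hexdigest()
    return {'ts': ts, 'apikey': public_key, 'hash': digest}


# Placeholder keys for illustration only, not real credentials.
params = build_auth_params('my-public-key', 'my-private-key', ts='1')
print(params['ts'], params['apikey'], len(params['hash']))
```

Passing a fixed `ts` makes the hash reproducible, which is convenient when checking the signing logic without calling the API.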

<a name="3"></a>
## Environment Variables (Variáveis de Ambiente)
**[EN-US]**

Setting the environment variables:
* `MARVEL_PUBLIC_KEY`: the public key for connecting and using the APIs.
* `MARVEL_PRIVATE_KEY`: the private key for connecting and using the APIs.

**[PT-BR]**

Definindo as variáveis de ambiente:
* `MARVEL_PUBLIC_KEY`: a public key para conexão e uso das APIs.
* `MARVEL_PRIVATE_KEY`: a private key para conexão e uso das APIs.

In [77]:
PUBLIC_KEY = str(os.environ['MARVEL_PUBLIC_KEY'])
PRIVATE_KEY = str(os.environ['MARVEL_PRIVATE_KEY'])
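
If one of the variables is not set, `os.environ[...]` raises a bare `KeyError`. The following is a small sketch of a more descriptive check. The helper `require_env` and the variable `EXAMPLE_MARVEL_KEY` are assumptions made for this example and are not part of the notebook.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a descriptive error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f'Missing environment variable: {name}')
    return value


# Demonstration with a placeholder variable (an assumption, not used by the notebook).
os.environ['EXAMPLE_MARVEL_KEY'] = 'placeholder'
example_key = require_env('EXAMPLE_MARVEL_KEY')
print(example_key)
```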

<a name="4"></a>
## Extract, Transform and Load (Extração, Transformação e Carregamento)
**[EN-US]**

The ETL (Extract, Transform and Load) step authenticates with the Marvel API, extracts data on Marvel characters and comics, transforms this data and loads it to disk.

**[PT-BR]**

A etapa de ETL (Extração, Transformação e Carregamento) autentica-se na API da Marvel, extrai os dados dos personagens e dos comics da Marvel, transforma esses dados e os carrega no disco.

<a name="4.1"></a>
### Data Extract (Extração dos Dados)
**[EN-US]**

Extracting the characters and comics with their respective descriptions, and then displaying the first 5 examples of the comics dataset `df_comics` and of the characters dataset `df_characters`.

**[PT-BR]**

Extraindo os personagens e os comics com suas respectivas descrições e, em seguida, exibindo os 5 primeiros exemplos do dataset dos comics `df_comics` e do dataset dos personagens `df_characters`.
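
Each extraction walks its endpoint page by page, `limit` items at a time. As a rough sketch of that paging, assuming reading starts at offset 0 and the API reports `total` items, the offsets requested look like this (`page_offsets` is a hypothetical helper, and 250 is a made-up total):

```python
def page_offsets(total, limit=100):
    """Return the offset of every page needed to read `total` items."""
    return list(range(0, total, limit))


offsets = page_offsets(total=250, limit=100)
print(offsets)  # [0, 100, 200]
```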

In [79]:
ingestion = MarvelIngestion(PUBLIC_KEY, PRIVATE_KEY)

In [71]:
df_comics = ingestion(endpoint='comics', format_='comic')
print(f'Comics dataset dimension: {df_comics.shape}')
df_comics.head()

Comics dataset dimension: (18802, 3)


|   | id | title | description |
|---|---|---|---|
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 1 | 94801 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 2 | 94802 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 3 | 94803 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 4 | 94804 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |


In [81]:
df_characters = ingestion(endpoint='characters', offset=0)
print(f'Characters dataset dimension: {df_characters.shape}')
df_characters.head()

Characters dataset dimension: (305, 3)


|   | id | name | description |
|---|---|---|---|
| 0 | 1017100 | A-Bomb (HAS) | Rick Jones has been Hulk's best bud since day ... |
| 1 | 1009144 | A.I.M. | AIM is a terrorist organization bent on destro... |
| 2 | 1009146 | Abomination (Emil Blonsky) | Formerly known as Emil Blonsky, a spy of Sovie... |
| 3 | 1010354 | Adam Warlock | Adam Warlock is an artificially created human ... |
| 4 | 1011031 | Agent X (Nijo) | Originally a partner of the mind-altering assa... |


<a name="4.2"></a>
### Data Transform (Transformação dos Dados)
**[EN-US]**

As we can see above, the `title` and `description` features of the comics dataset contain duplicate data, even though the `id` values differ. So let's delete these duplicate examples.

First, let's set the `id` feature as the index of this dataset, so that only the `title` and `description` features are compared when counting duplicate examples.

**[PT-BR]**

Como podemos ver acima, as features `title` e `description` do dataset dos comics têm dados duplicados, mesmo que os valores de `id` sejam diferentes. Portanto, vamos excluir esses exemplos duplicados.

Primeiro, vamos definir a feature `id` como índice desse dataset, para que apenas as features `title` e `description` sejam comparadas na contagem dos exemplos duplicados.
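
The reason for moving `id` to the index is that `DataFrame.duplicated()` compares column values only and ignores the index. The sketch below illustrates this on a made-up three-row dataset (the rows are invented for the example) and also shows the `subset` argument as an alternative that leaves `id` as a column.

```python
import pandas as pd

# Made-up rows: two different ids share the same title and description.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'title': ['Comic A', 'Comic A', 'Comic B'],
    'description': ['Same text', 'Same text', 'Other text'],
})

# With `id` as a column, every row is unique.
with_id_column = int(df.duplicated().sum())

# With `id` as the index, only `title` and `description` are compared.
with_id_index = int(df.set_index('id').duplicated().sum())

# Alternative: keep `id` as a column and restrict the comparison.
with_subset = int(df.duplicated(subset=['title', 'description']).sum())

print(with_id_column, with_id_index, with_subset)  # 0 1 1
```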

In [90]:
df_comics = df_comics.set_index('id')
df_comics.head()

| id | title | description |
|---|---|---|
| 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 94801 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 94802 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 94803 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 94804 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |


Counting duplicate examples (Contando os exemplos duplicados).

In [93]:
print(f'Number of duplicate examples: {df_comics.duplicated().sum()}')

Number of duplicate examples: 1069


Dropping duplicate examples (Deletando os exemplos duplicados).

In [96]:
df_comics = df_comics.drop_duplicates()
print(f'Number of duplicate examples: {df_comics.duplicated().sum()}')
df_comics.head()

Number of duplicate examples: 0


| id | title | description |
|---|---|---|
| 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 93339 | The Mighty Valkyries (2021) #3 | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... |
| 94884 | The Mighty Valkyries (2021) #3 (Variant) | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... |
| 93350 | X-Corp (2021) #2 | A SHARK IN THE WATER! After X-CORP’s shocking ... |
| 94896 | X-Corp (2021) #2 (Variant) | A SHARK IN THE WATER! After X-CORP?s shocking ... |


Performing the same validation for the `df_characters` dataset (Fazendo a mesma validação para o dataset `df_characters`).

Transforming the feature `id` into the index (Transformando a feature `id` no índice).

In [101]:
df_characters = df_characters.set_index('id')
df_characters.head()

| id | name | description |
|---|---|---|
| 1017100 | A-Bomb (HAS) | Rick Jones has been Hulk's best bud since day ... |
| 1009144 | A.I.M. | AIM is a terrorist organization bent on destro... |
| 1009146 | Abomination (Emil Blonsky) | Formerly known as Emil Blonsky, a spy of Sovie... |
| 1010354 | Adam Warlock | Adam Warlock is an artificially created human ... |
| 1011031 | Agent X (Nijo) | Originally a partner of the mind-altering assa... |


Counting duplicate examples (Contando os exemplos duplicados).

In [104]:
print(f'Number of duplicate examples: {df_characters.duplicated().sum()}')

Number of duplicate examples: 0


The `df_characters` dataset has no duplicate examples. So we won't do anything (O dataset `df_characters` não tem exemplos duplicados. Portanto, não faremos nada).

Resetting the index of each dataset so that `id` is saved as a regular column, and displaying the first 5 examples (Resetando o índice de cada dataset para que o `id` seja salvo como uma coluna comum e exibindo os primeiros 5 exemplos).

In [108]:
df_comics = df_comics.reset_index()
df_characters = df_characters.reset_index()
df_comics.head()

|   | id | title | description |
|---|---|---|---|
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 1 | 93339 | The Mighty Valkyries (2021) #3 | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... |
| 2 | 94884 | The Mighty Valkyries (2021) #3 (Variant) | CHILDREN OF THE AFTERLIFE! While Kraven the Hu... |
| 3 | 93350 | X-Corp (2021) #2 | A SHARK IN THE WATER! After X-CORP’s shocking ... |
| 4 | 94896 | X-Corp (2021) #2 (Variant) | A SHARK IN THE WATER! After X-CORP?s shocking ... |


<a name="4.3"></a>
### Data Load (Carregamento dos Dados)
Loading each dataset into the `../data/raw/` directory (Carregando cada dataset no diretório `../data/raw/`).

In [112]:
df_comics.to_csv('../data/raw/comics_corpus.csv', index=False)
df_characters.to_csv('../data/raw/char_corpus.csv', index=False)
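
`to_csv` does not create missing directories, so the target folder must exist before the cell above runs. The sketch below shows one way to guard against that and to check the written file, using a temporary directory and a made-up two-row dataset. The file name `example_corpus.csv` is an assumption for illustration only.

```python
import os
import tempfile

import pandas as pd

# Made-up dataset standing in for `df_characters`.
df = pd.DataFrame({
    'id': [1, 2],
    'name': ['A', 'B'],
    'description': ['first text', 'second text'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_dir = os.path.join(tmp_dir, 'data', 'raw')
    # Create the directory tree if it does not exist yet.
    os.makedirs(raw_dir, exist_ok=True)

    path = os.path.join(raw_dir, 'example_corpus.csv')
    df.to_csv(path, index=False)

    # Read the file back to confirm the columns and the number of rows.
    restored = pd.read_csv(path)

print(restored.shape, list(restored.columns))
```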