# Ingestion Step (Etapa de Ingestão)

## Table of Contents
* [Packages](#1)
* [Ingestion Utility Functions](#2)
* [Environment Variables](#3)
* [Ingestion](#4)

<a name="1"></a>
## Packages (Pacotes)
**[EN-US]**

Packages used in the system:
* [os](https://docs.python.org/3/library/os.html): built-in module that provides a portable way of using operating-system-dependent functionality;
* [requests](https://pypi.org/project/requests/): a simple library for making HTTP requests;
* [hashlib](https://docs.python.org/3/library/hashlib.html): implements a common interface to many different secure hash and message digest algorithms;
* [time](https://docs.python.org/3/library/time.html): provides various time-related functions;
* [dotenv](https://pypi.org/project/python-dotenv/): reads key-value pairs from a .env file and can set them as environment variables;
* [pandas](https://pandas.pydata.org/): the main package for data manipulation.

**[PT-BR]**

Pacotes utilizados no sistema.
* [os](https://docs.python.org/3/library/os.html): módulo integrado, fornece uma maneira portátil de usar funcionalidades dependentes do sistema operacional;
* [requests](https://pypi.org/project/requests/): é uma biblioteca HTTP simples, para fazer solicitações HTTP;
* [hashlib](https://docs.python.org/3/library/hashlib.html): implementa uma interface comum para muitos algoritmos diferentes de hash seguro e resumo de mensagens;
* [time](https://docs.python.org/3/library/time.html): fornece várias funções relacionadas ao tempo;
* [dotenv](https://pypi.org/project/python-dotenv/): lê pares de chave-valor de um arquivo .env e pode defini-los como variáveis de ambiente;
* [pandas](https://pandas.pydata.org/): é o principal pacote para manipulação de dados;

In [67]:
import os
from requests import get
from hashlib import md5
from time import time
from dotenv import load_dotenv
load_dotenv()

import pandas as pd

<a name="2"></a>
## Ingestion Utility Functions (Funções Utilitárias de Ingestão)

In [74]:
class MarvelIngestion():
    """
    Client that downloads paginated resources from the Marvel API
    and returns them as a pandas DataFrame.
    """
    def __init__(
        self,
        public_key,
        private_key,
        url='http://gateway.marvel.com/v1/public/',
        limit=100,
    ):
        """
        Store the API credentials, the base URL and the page size
        (`limit` results per request).
        """
        super(MarvelIngestion, self).__init__()

        self.public_key = public_key
        self.private_key = private_key

        self.url = url
        self.limit = limit
        self.headers = {
            'Accept-Encoding': '*',
            'Accept': '*/*',
            'Connection': 'keep-alive'
        }

    def get_params(self, offset, format_=None):
        """
        Build the query parameters for one request: a timestamp, the public
        key, the md5 hash of ts + private key + public key, and the paging
        values.
        """
        ts = str(time())
        hash_ = md5(
            (
                ts + self.private_key + self.public_key
            ).encode('utf-8')
        ).hexdigest()
        params = {
            'ts': ts,
            'apikey': self.public_key,
            'hash': hash_,
            'limit': self.limit,
            'offset': offset,
            'format': format_
        }

        return params

    def __call__(self, endpoint, offset=0, format_=None, retries=5):
        """
        Download every page of `endpoint`, starting at `offset`, and return a
        DataFrame with the columns id, name (or title) and description.
        Entries with an empty or very short description are skipped.
        `retries` is the number of failed requests tolerated before giving up
        and returning what has been collected so far.
        """
        key = 'name' if endpoint == 'characters' else 'title'
        examples = []
        seen = set()
        total = None  # unknown until the first successful response

        while total is None or offset < total:
            try:
                params = self.get_params(offset=offset, format_=format_)
                response = get(self.url + endpoint, params=params, headers=self.headers)
                response.raise_for_status()
                data = response.json()['data']
            except Exception:
                # Retry the same page until the retry budget is exhausted
                retries -= 1
                if retries <= 0:
                    break
                continue

            total = data['total']
            for result in data['results']:
                description = result['description']
                example = (result['id'], result[key], description)
                # Keep only unseen entries with a non-trivial description
                if description and len(description) > 4 and example not in seen:
                    seen.add(example)
                    examples.append(example)

            offset += self.limit

        features = ['id', key, 'description']
        df = pd.DataFrame(examples, columns=features)

        return df
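
The request signature built in `get_params` can be checked in isolation. The following is a minimal sketch, assuming the same concatenation order used above (timestamp, then private key, then public key); the function name `marvel_auth_hash` and the credential values are placeholders introduced only for illustration.

```python
from hashlib import md5


def marvel_auth_hash(ts, private_key, public_key):
    """Return the hex md5 digest of ts + private_key + public_key."""
    return md5((ts + private_key + public_key).encode('utf-8')).hexdigest()


# Placeholder credentials, for illustration only (not real keys)
digest = marvel_auth_hash('1', 'abcd', '1234')
print(len(digest))  # an md5 hex digest always has 32 characters
```

Because the timestamp is part of the hashed string, a new hash has to be computed for every request, which is why `get_params` is called once per page.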

<a name="3"></a>
## Environment Variables (Variáveis de Ambiente)

In [77]:
PUBLIC_KEY = str(os.environ['MARVEL_PUBLIC_KEY'])
PRIVATE_KEY = str(os.environ['MARVEL_PRIVATE_KEY'])
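
Indexing `os.environ` directly raises a bare `KeyError` when a variable is missing from both the `.env` file and the shell. As a hedged alternative, a small helper can fail with a more descriptive message; `require_env` is a hypothetical name, not part of the original notebook.

```python
import os


def require_env(name):
    """Return the value of environment variable `name` or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f'Missing environment variable {name}: define it in the .env file or in the shell.'
        )
    return value


# Possible usage in place of the direct lookups:
# PUBLIC_KEY = require_env('MARVEL_PUBLIC_KEY')
# PRIVATE_KEY = require_env('MARVEL_PRIVATE_KEY')
```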

<a name="4"></a>
## Ingestion (Ingestão)

In [79]:
ingestion = MarvelIngestion(PUBLIC_KEY, PRIVATE_KEY)
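
Calling the `ingestion` object walks an endpoint page by page, `limit` results at a time. The offsets requested can be sketched as follows, assuming `total` is the number of resources reported by the API in the first response; `page_offsets` is a hypothetical helper used only to illustrate the arithmetic.

```python
def page_offsets(total, limit=100, start=0):
    """Return the offset of every page needed to cover `total` resources."""
    return list(range(start, total, limit))


print(page_offsets(250))             # [0, 100, 200]
print(page_offsets(250, start=100))  # [100, 200]
```

With the default `limit=100`, an endpoint reporting 250 resources therefore needs three requests, the last of which returns a partial page.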

In [71]:
df_comics = ingestion(endpoint='comics', format_='comic')
print(f'Comics dataset dimension: {df_comics.shape}')
df_comics.head()

Comics dataset dimension: (18802, 3)


|   | id | title | description |
|---|---|---|---|
| 0 | 94799 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 1 | 94801 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 2 | 94802 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 3 | 94803 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |
| 4 | 94804 | Demon Days: Mariko (2021) #1 (Variant) | IN THE SHADOW OF KIRISAKI MOUNTAIN?A SECRET HI... |


In [81]:
df_characters = ingestion(endpoint='characters', offset=0)
print(f'Characters dataset dimension: {df_characters.shape}')
df_characters.head()

Characters dataset dimension: (305, 3)


|   | id | name | description |
|---|---|---|---|
| 0 | 1017100 | A-Bomb (HAS) | Rick Jones has been Hulk's best bud since day ... |
| 1 | 1009144 | A.I.M. | AIM is a terrorist organization bent on destro... |
| 2 | 1009146 | Abomination (Emil Blonsky) | Formerly known as Emil Blonsky, a spy of Sovie... |
| 3 | 1010354 | Adam Warlock | Adam Warlock is an artificially created human ... |
| 4 | 1011031 | Agent X (Nijo) | Originally a partner of the mind-altering assa... |


In [82]:
df_comics.to_csv('../data/raw/comics_corpus.csv', index=False)
df_characters.to_csv('../data/raw/char_corpus.csv', index=False)
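
`DataFrame.to_csv` does not create missing directories, so the calls above fail if `../data/raw/` does not exist yet. A minimal sketch of a guard, assuming a hypothetical helper named `ensure_parent_dir` and demonstrated here on a temporary directory rather than the real data folder:

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of `path` if needed and return it as a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / 'data' / 'raw' / 'comics_corpus.csv')
    print(target.parent.is_dir())  # True
```

In the notebook this could be used as `df_comics.to_csv(ensure_parent_dir('../data/raw/comics_corpus.csv'), index=False)`.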