## Task_18

##### Site: https://erefdn.org

Uses BeautifulSoup to process the site.
Page structure:
* Main page: contains a list of URLs, one for each available grant.
* Article pages: one dedicated page per grant, containing all of its information.




##### Processing the main page

Processing the main page walks through the available grants and stores the title and URL of each one in a list.

<img src="imgs/principal.jpg" style="width: 500px;"/>

##### Processing each grant page (articles)

A single function encapsulates all the processing of an article. Only its URL is passed in.
The result is returned as a JSON-style object (a Python dict) containing the relevant fields.

<img src="imgs/artigo.jpg" style="width: 500px;"/>

### Imports

In [225]:
from bs4 import BeautifulSoup
import requests
import re
from lxml import html
from lxml import etree
import json

### Initial Definitions

In [226]:
url_pag_principal = 'https://erefdn.org/research-grants-projects/currently-funded-projects'

In [227]:
arquivo_saida = 'resultado-raspagem-erefdn.txt'

### Helper Functions

In [228]:
def processa_seletor(sel_bruto):
    """Process the raw CSS selector string copied from the site so that it
    matches the syntax expected by Python/BeautifulSoup."""

    sel_processado = sel_bruto.replace('nth-child', 'nth-of-type')
    return sel_processado
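
As a small illustration of the substitution above, the sketch below applies the same replacement to a selector of the shape used later in this notebook. The id `#post-1353` comes from the docstring example further down; the rest of the selector is only illustrative. Note that `nth-child` and `nth-of-type` are not equivalent in general, so the converted selector matches the same element only when the siblings involved share one tag. Recent BeautifulSoup releases (those backed by the soupsieve package) are understood to accept `nth-child` directly, in which case the substitution may be unnecessary.

```python
def processa_seletor(sel_bruto):
    # Same substitution as the notebook helper
    return sel_bruto.replace('nth-child', 'nth-of-type')

# Selector as it might be copied from the browser's developer tools (illustrative)
sel_copiado = '#post-1353 > section > div:nth-child(2) > div:nth-child(1)'
sel_convertido = processa_seletor(sel_copiado)
print(sel_convertido)
```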

In [229]:
def obtem_soup(url):
    """Fetch a page and return its Soup object.
    Receives:
            url :: str
    Returns:
            soup :: bs4.BeautifulSoup
    """

    # Browser-like User-Agent sent with the request
    headers = {'User-Agent':
               'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
    # Named 'resposta' so it does not shadow the 'html' module imported from lxml
    resposta = requests.get(url, headers=headers)
    soup = BeautifulSoup(resposta.text, 'lxml')

    return soup

In [230]:
def obtem_sel_raiz(soup):
    """Build the root CSS selector string.
    Uses the <article> tag to locate the id.
    Assumes there is only one article per URL, which is the case on the site being scraped.

    Receives:
            soup :: bs4.BeautifulSoup
    Returns:
            sel_raiz :: str

    Example: <url> -> '#post-1353'
    """

    id_artigo = soup.find('article').get('id')
    sel_raiz = f'#{id_artigo}'

    return sel_raiz

In [231]:
def valor_int(str_valor):
    """Strip the currency formatting from the grant amount.
    Note: the result is still a string (with '$', 'U', 'S', ',' and
    spaces removed); it is not converted to int.
    Receives:
            str_valor :: str
    Returns:
            valor_inteiro :: str
    """
    chars_a_remover = ['$', 'U', 'S', ',', ' ']
    valor_inteiro = ''.join([char for char in str_valor if char not in chars_a_remover])
    return valor_inteiro
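
`valor_int` returns text, which is why the output further down shows values such as `"161075"` and, for one grant, an empty string. If a numeric value is wanted, a conversion along the following lines could be used. This is only a sketch: `valor_para_int` is a hypothetical helper that is not part of the notebook, the input formats are assumed, and it presumes whole-dollar amounts (a decimal point would simply be dropped).

```python
import re

def valor_para_int(str_valor):
    """Hypothetical helper: return the amount as an int, or None if it has no digits."""
    digitos = re.sub(r'\D', '', str_valor)  # keep digits only
    return int(digitos) if digitos else None

print(valor_para_int('$161,075'))
print(valor_para_int('US$ 105,000'))
print(valor_para_int(''))
```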

In [232]:
def remove_novas_linhas(str_bruta, sep='|'):
    r"""Collapse each run of newlines ('\n') into a single separator (sep)
    and strip leading/trailing whitespace from the result.
    Receives:
            str_bruta :: str
    Returns:
            str_proc  :: str

    Example:
        str_bruta = '\n\nCampo 1:\n\n\nValor 1\n\nValor2\n\n\n'
        str_proc = '|Campo 1:|Valor 1|Valor2|'
    """
    return re.sub(r'\n+', sep, str_bruta).strip()

### Parsing Functions

In [233]:
def parse_cabecalho(soup, sel_raiz):
    """Parse the header: title and investigators.
    Receives:
            soup of the article
            CSS selector for the article
    Returns:
            dict with the keys:
                titulo :: str
                instituicao :: str
    """
    # Build the CSS selectors
    sel_tronco =  'section > div:nth-child(1) > div > div > div > div > div'

    sel_css_titulo = processa_seletor(f'{sel_raiz} > {sel_tronco} > h1')
    sel_css_instituicao = processa_seletor(f'{sel_raiz} > {sel_tronco} > div > p')

    # Get the title and the investigator
    titulo = soup.select(sel_css_titulo)[0].text

    try:
        investigador_bruto = soup.select(sel_css_instituicao)[0].text
    except IndexError:
        # Fallback layout: the <p> sits directly under the trunk, without the extra <div>
        sel_css_instituicao = processa_seletor(f'{sel_raiz} > {sel_tronco} > p')
        investigador_bruto = soup.select(sel_css_instituicao)[0].text
        print('!!! ERRO !!! ')
        print(investigador_bruto)
    # Keep everything after the first ':' (drops the 'Investigator(s):' label)
    investigador = investigador_bruto.split(':', 1)[1].strip()

    return {'titulo': titulo, 'instituicao': investigador}
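
The investigator line has the form `label: value`, as the printed lines in the output show (for example `Investigator: George Mason University`). The sketch below separates the two with `str.partition`, which also tolerates a line without a label. `extrai_valor_rotulado` is a hypothetical helper, not part of the notebook, and the last two inputs are made up for illustration.

```python
def extrai_valor_rotulado(texto):
    """Hypothetical helper: return the text after the first ':', stripped.
    If there is no ':', return the whole text stripped."""
    _, sep, valor = texto.partition(':')
    return valor.strip() if sep else texto.strip()

print(extrai_valor_rotulado('Investigator: George Mason University'))
print(extrai_valor_rotulado('Investigators: North Carolina State University'))
print(extrai_valor_rotulado('Investigator: Lab A: Example Group'))  # made-up value containing ':'
print(extrai_valor_rotulado('  No label here  '))                   # made-up line without a label
```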

In [234]:
def parse_data_valor(soup, sel_raiz):
    """Parse the start date and the grant amount.
    Receives:
            soup of the article
            CSS selector for the article
    Returns:
            dict with the keys:
                data_ini :: str
                valor_bolsa :: str
    """
    # Build the CSS selectors
    sel_tronco =  'section > div:nth-child(2)'
    sel_css_data = processa_seletor(f'{sel_raiz} > {sel_tronco} > div:nth-child(1)')
    sel_css_valor = processa_seletor(f'{sel_raiz} > {sel_tronco} > div:nth-child(2)')

    # Get the start date and the amount
    data_ini_bruta = soup.select(sel_css_data)[0].text
    try:
        data_ini = remove_novas_linhas(data_ini_bruta, sep='|').split('|')[2]
    except IndexError:
        # The date block has no value part
        data_ini = 'N/A'

    valor_bruto = soup.select(sel_css_valor)[0].text
    valor_str  = remove_novas_linhas(valor_bruto, sep='|').split('|')[2]
    valor_num = valor_int(valor_str)

    return {'data_ini': data_ini, 'valor_bolsa': valor_num}
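
The index `[2]` used above follows from how `remove_novas_linhas` treats a block that starts with newlines: the leading separator produces an empty first element, the label comes second, and the value third. A minimal sketch, assuming the raw text has the shape shown in the `remove_novas_linhas` docstring (the label `Start Date:` is an assumption; the date is taken from the output further down):

```python
import re

def remove_novas_linhas(str_bruta, sep='|'):
    # Same logic as the notebook helper
    return re.sub('\n+', sep, str_bruta).strip()

texto_bruto = '\n\nStart Date:\n\n\nMarch 2020\n\n'  # assumed shape of the raw block
partes = remove_novas_linhas(texto_bruto).split('|')
print(partes)     # empty string, label, value, empty string
print(partes[2])  # the value
```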

In [235]:
def parse_descricao_detalhada(soup, sel_raiz):
    """Parse the detailed description.
    Receives:
            soup of the article
            CSS selector for the article
    Returns:
            dict with the key:
                descr_detalhada :: str
    """
    # Build the CSS selector
    sel_tronco =  'section > div:nth-child(3)'
    sel_css_descr_det = processa_seletor(f'{sel_raiz} > {sel_tronco}')

    # Get the detailed description
    descricao_detalhada_bruta = soup.select(sel_css_descr_det)[0].text
    descricao_detalhada = remove_novas_linhas(descricao_detalhada_bruta, sep='\n')

    return {'descr_detalhada': descricao_detalhada}

In [236]:
def parse_lista_bolsas(soup, sel_raiz):
    """Parse the main page, which holds
       the list of grants to be collected.
    Receives:
            soup of the page
            CSS selector for the page's article element
    Returns:
            list of dicts, one per available grant, with the keys:
                titulo :: str
                url_artigo :: str
    """

    # Get the element that holds the list of available grants
    lista_bolsas = []
    tags_bolsas = soup.select(sel_raiz)[0]

    # For each grant, isolate the article link and add it to the returned list
    for el in tags_bolsas.find_all('a'):
        titulo = el.text
        url_artigo = el.get('href')

        lista_bolsas.append({'titulo': titulo, 'url_artigo': url_artigo})

    return lista_bolsas
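
The `href` values are passed on exactly as they appear in the page. Should some of them turn out to be relative (an assumption; the notebook does not show the raw links), they could be resolved against the main page URL with `urllib.parse.urljoin`, which leaves absolute URLs unchanged. The paths below are hypothetical.

```python
from urllib.parse import urljoin

url_pag_principal = 'https://erefdn.org/research-grants-projects/currently-funded-projects'

# Hypothetical href values: one relative, one already absolute
href_relativo = '/some-grant/'
href_absoluto = 'https://erefdn.org/another-grant/'

url_1 = urljoin(url_pag_principal, href_relativo)
url_2 = urljoin(url_pag_principal, href_absoluto)
print(url_1)
print(url_2)
```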

In [237]:
def parse_artigo(soup, sel_raiz_artigo):
    """Parse an article from its Soup and root selector,
    returning a JSON-style object (dict) with the relevant fields.
    Receives:
            soup of the article
            CSS selector for the article
    Returns:
            dict containing the following fields:
                titulo :: str
                instituicao :: str
                data_ini :: str
                valor_bolsa :: str (USD, digits only)
                descr_detalhada :: str
    """
    campos = {}

    # Get titulo and instituicao
    campos.update(parse_cabecalho(soup, sel_raiz_artigo))

    # Get data_ini and valor_bolsa
    campos.update(parse_data_valor(soup, sel_raiz_artigo))

    # Get descr_detalhada
    campos.update(parse_descricao_detalhada(soup, sel_raiz_artigo))

    # Round trip through JSON: the result is a plain dict
    json_campos = json.loads(json.dumps(campos))
    return json_campos
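
The `json.dumps` / `json.loads` round trip at the end of `parse_artigo` yields an ordinary Python dict, not a JSON string. For a dict of strings the result equals the input, so the step mainly acts as a check that every field is JSON-serializable (`json.dumps` raises `TypeError` otherwise). A short sketch with values taken from the output further down:

```python
import json

campos = {'titulo': 'Liner Systems for Aggressive Coal Combustion Product Leachates',
          'data_ini': 'October 2017',
          'valor_bolsa': '150000'}

copia = json.loads(json.dumps(campos))
print(type(copia).__name__)  # the round trip gives back a dict
print(copia == campos)       # same content
print(copia is campos)       # but a new object
```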

### Full Processing Functions

In [238]:
def processa_artigo(url_artigo):
    """Full processing of one article.
    Receives:
            url of the article :: str
    Returns:
            dict containing the following fields:
                titulo :: str
                instituicao :: str
                data_ini :: str
                valor_bolsa :: str (USD, digits only)
                descr_detalhada :: str
    """
    # Get the soup object for a specific grant
    soup_artigo = obtem_soup(url_artigo)
    sel_css_artigo = obtem_sel_raiz(soup_artigo)

    # Get the dict with the parsing result
    json_artigo = parse_artigo(soup_artigo, sel_css_artigo)

    return json_artigo

In [239]:
def processa_pag_principal(url_pag_principal):
    """Full processing of the main page.
    Builds the list of available grants and, for each grant,
        runs the full processing of its article page.
    Receives:
            url of the main page :: str
    Returns:
            None. The parsed result of each grant is printed as JSON.
    """
    # Get the soup object for the main page,
    # which links to all the grants on offer
    soup_pag_principal = obtem_soup(url_pag_principal)
    sel_css_pag_princ = obtem_sel_raiz(soup_pag_principal)

    # Get the list of grants
    lista_bolsas = parse_lista_bolsas(soup_pag_principal, sel_css_pag_princ)

    # For each grant, process the corresponding article,
    # which lives on a separate page.
    for i, bolsa in enumerate(lista_bolsas):
        # Get the basic information of each grant
        titulo_artigo = bolsa['titulo']
        url_artigo = bolsa['url_artigo']
        print(f'\n----\nProcessando artigo {i + 1}..\n{titulo_artigo}\n..\n')

        # Get the full parsing of the grant
        json_artigo = processa_artigo(url_artigo)
        print(json.dumps(json_artigo, indent=4))
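
`arquivo_saida` is defined at the top of the notebook, but nothing here writes to it; the results are only printed. One possible way to persist them, assuming the per-grant dicts were first collected in a list, is one JSON object per line. `salva_resultados` is a hypothetical helper, and the sketch writes to a temporary directory so that it can run on its own. The sample records reuse titles and values from the output below.

```python
import json
import os
import tempfile

def salva_resultados(lista_json, caminho):
    """Hypothetical helper: write one JSON object per line."""
    with open(caminho, 'w', encoding='utf-8') as arq:
        for item in lista_json:
            arq.write(json.dumps(item, ensure_ascii=False) + '\n')

resultados = [
    {'titulo': 'Non-Recyclable Plastics to Pavements', 'valor_bolsa': '161075'},
    {'titulo': 'Liner Systems for Aggressive Coal Combustion Product Leachates',
     'valor_bolsa': '150000'},
]

# Temporary location, used here only to keep the sketch self-contained
caminho = os.path.join(tempfile.mkdtemp(), 'resultado-raspagem-erefdn.txt')
salva_resultados(resultados, caminho)

# Read the file back to confirm the records survive the round trip
with open(caminho, encoding='utf-8') as arq:
    lidos = [json.loads(linha) for linha in arq]
print(lidos == resultados)
```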


### Execution

In [240]:
processa_pag_principal(url_pag_principal)


----
Processando artigo 1..
Non-Recyclable Plastics to Pavements
..

{
    "titulo": "Non-Recyclable Plastics to Pavements",
    "instituicao": "University of Illinois Urbana-Champaign",
    "data_ini": "TBD",
    "valor_bolsa": "161075",
    "descr_detalhada": "This proposal seeks to create high-value and high-volume products from plastic waste for bitumen (asphalt binder) replacement in pavements. The bitumen replacement market is a potential repurposing for large quantities of waste plastics. It addresses an urgent economic and environmental need for plastic recycling as well as the transportation industry. With 4-5% replacement of bitumen, this market has the potential to consume 1 million tons of waste plastics out of the 26 million tons that go to landfills in the US. Also, the study goal is aligned with the global emphasis on enhancing transportation infrastructure sustainability. Moreover, asphalt pavements are 100% recyclable; therefore, plastic waste will remain in a recycli

{
    "titulo": "Polymer-Based Pre-Treatment for Removal of PFAS from Landfill Leachate ",
    "instituicao": "Geosyntec",
    "data_ini": "March 2020",
    "valor_bolsa": "105000",
    "descr_detalhada": "Per- and polyfluoroalkyl substances (PFAS) are a class of compounds with some or all of the hydrogens in their carbon chain substituted with fluorine. Traditional technologies for treating leachate, including aeration and sedimentation, are insufficient to remove PFAS. Thus, there is a need for pre-treatment alternatives to remove PFAS from landfill leachate and enable landfill operators to continue to sustainably manage landfill leachate via current disposal methods. The team seeks to adapt the technology of applying cationic polymer coagulants currently used in wastewater and drinking water treatment to sequester PFAS for integration with common landfill leachate pre-treatment practices to remove PFAS from leachate.\nThe specific objectives of the proposed work are:\nTo demonstrate

!!! ERRO !!! 
Investigators: University of Texas at Austin, Texas A&M University & EPRI
{
    "titulo": "Rapid and Cost-Effective Approach to Evaluate the Effectiveness of Wastewater and Treatment Byproduct Solidification and Stabilization",
    "instituicao": "University of Texas at Austin, Texas A&M University & EPRI",
    "data_ini": "Mar 2019",
    "valor_bolsa": "185000",
    "descr_detalhada": "Disposal of residual industrial waste streams and treatment byproducts (WTBs) presents many challenges for approaching zero liquid discharge. Solidification/stabilization (S/S) using combinations of additives such as lime, portland cement, and coal combustion residuals (e.g. fly ash) can provide a final disposal option. The solidified waste can be landfilled, where encapsulation of the contaminants prevents leaching into landfill leachate collection systems and ground water. Successful mixture designs will likely depend on the composition, pH, and contaminants of concern in the liquid wast

!!! ERRO !!! 
Investigator: Bridger Photonics and University of Delaware
{
    "titulo": "Gas Mapping Lidar and Tracer Correlation Methods for Landfill Methane Emissions Quantification",
    "instituicao": "Bridger Photonics and University of Delaware",
    "data_ini": "April 2018",
    "valor_bolsa": "240000",
    "descr_detalhada": "There is an established need to quantify methane emissions from landfills, driven primarily by Environmental Protection Agency (EPA) regulations. The EPA currently requires Method 21 for surface monitoring for controlled landfills and requires computational models for whole landfill emissions estimates, which determine tiers of regulation and remediation. However, Method 21 is time consuming and expensive, and whole landfill model input uncertainties and spatio-temporal variations can lead to considerable errors compared to actual emissions. A tracer correlation method (TCM), which represents the current state-of-the-art, can perform measurements under ce

!!! ERRO !!! 
Investigator: George Mason University
{
    "titulo": "Liner Systems for Aggressive Coal Combustion Product Leachates",
    "instituicao": "George Mason University",
    "data_ini": "October 2017",
    "valor_bolsa": "150000",
    "descr_detalhada": "New regulations established for disposal of coal combustion products (CCPs) require that disposal facilities include a composite liner consisting of a geomembrane overlying a 0.6-m-thick clay liner. An economical alternative is to use of a geosynthetic clay liner (GCL) in lieu of the clay liner. Implementation of the new regulations has revealed a large number of waste streams from coal-fired power plants that generate leachates much more concentrated than those from historical coal-combustion wastes (5 M vs. 0.7 M). Preliminary tests show the currently available GCLs cannot retain these concentrated leachates. Finding GCL materials that can withstand these aggressive leachates is critical to prevent leachate from entering th

!!! ERRO !!! 
Investigators: North Carolina State University
{
    "titulo": "Development and Assessment of Cost-Effective Sustainable Integrated Organics Management Strategies",
    "instituicao": "North Carolina State University",
    "data_ini": "Mar 2016",
    "valor_bolsa": "",
    "descr_detalhada": "Jurisdictions representing over 20% of the U.S. have considered or implemented policies that\u00a0require some food waste diversion from landfills, and there is increasing interest in opportunities\u00a0to manage organics in municipal solid waste (MSW). Given the interrelated nature of solid waste\u00a0management (SWM) systems, any new policies or strategies must be fully analyzed to ensure\u00a0that overall solid waste system performance is not negatively affected. This is especially true\u00a0considering how waste generation, composition, the energy system, and policies are changing.\nThe Solid Waste Optimization Life-cycle Framework (SWOLF) is a life-cycle assessment\u00a0(LCA) op

{
    "titulo": "Expert Review of Wisconsin\u2019s Landfill Organic Stability Rule",
    "instituicao": "Colorado State University",
    "data_ini": "Sep 2014",
    "valor_bolsa": "32000",
    "descr_detalhada": "Wisconsin\u2019s landfill organic stability rule (OSR) requires owners and operators of municipal solid waste landfills to \u201cincorporate landfill organic stability strategies into the plans of operation for their facilities\u201d (WDNR 2006). The rule has been in place for more than five years and the Principal Investigators (PIs) recently completed an independent review and report on the manner in which the rule is working. Overall, the PIs concluded that the rule is working well and that all ten landfills that were visited are meeting criteria outlined in the OSR.\nA general perspective shared by the landfill owners and operators interviewed is that the goals of the OSR currently coincide with the industry goals. The OSR provides a performance check on landfill operation

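The `\u00a0` and `\u2019` sequences in the output above are how `json.dumps` escapes non-ASCII characters by default (a non-breaking space and a right single quotation mark). A sketch of one way to get more readable output: replace the non-breaking spaces with regular ones and pass `ensure_ascii=False`. The sample strings are shortened from the output above.

```python
import json

campos = {
    'titulo': 'Expert Review of Wisconsin\u2019s Landfill Organic Stability Rule',
    'descr_detalhada': 'policies that\u00a0require some food waste diversion',
}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes
saida_escapada = json.dumps(campos, indent=4)

# Replace non-breaking spaces, then keep the remaining characters as-is
campos_limpos = {chave: valor.replace('\u00a0', ' ') for chave, valor in campos.items()}
saida_legivel = json.dumps(campos_limpos, indent=4, ensure_ascii=False)

print(saida_escapada)
print(saida_legivel)
```
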
In [241]:
class Bolsa:
    def __init__(self, url):
        self.titulo = ''
        self.data_inicio = ''
        self.valor_bolsa = ''
        self.site_origem = url
        self.descricao_detalhada = ''

    def obtem_dados(self):
        # Placeholder: not implemented yet
        pass
        