## Web Scraping en el Journal of Macroeconomics

El objetivo es vincularme con la página web del Journal of Macroeconomics para extraer la informacion de cada volumen, los articulos de cada uno, los autores, los links, entre otros.

Voy a trabajar con un entorno virtual llamado `env`. El Driver de Chrome se puede descargar en https://sites.google.com/chromium.org/driver/downloads
 

In [276]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from time import sleep

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

Nos vinculamos con la sección de articulos dentro del sitio web del Journal of Macroeconomics

In [261]:
def html_import(url):
    driver = webdriver.Chrome("./driver/chromedriver.exe")
    driver.get(url)

    try:
        # Debo clickear y habilitar todos los paneles (AÚN NO FUNCIONA)
        buttons = driver.find_element(By.CLASS_NAME, "accordion-panel-title").click()

    except:
        pass

    # Conseguir el código fuente como HTML
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    
    driver.close()

    return soup

Definimos los links para la importacion

In [268]:
url_1 = "https://www.sciencedirect.com/journal/journal-of-macroeconomics/issues?page=1"
url_2 = "https://www.sciencedirect.com/journal/journal-of-macroeconomics/issues?page=2"
url_3 = "https://www.sciencedirect.com/journal/journal-of-macroeconomics/issues?page=3"

page_1 = html_import(url_1)
page_2 = html_import(url_2)
page_3 = html_import(url_3)

  driver = webdriver.Chrome("./driver/chromedriver.exe")


Tras haber importado todos los elementos dentro de los links, es importante definir los elementos que deseamos importar y sus clases

* Secciones:          `<li class="accordion-panel js-accordion-panel">`
* Volumenes:          `<div class="issue-item u-margin-s-bottom">`
* Nombre del volumen: `<span class="anchor-text">`
* Link:               `<a class="anchor js-issue-item-link text-m anchor-default">`

In [269]:
def get_volumens(soup):
    sections = soup.find_all("li", {"class": "accordion-panel js-accordion-panel"})

    list_names=[]
    list_urls=[]

    for section in sections:
        volumens = section.find_all("div", {"class": "issue-item u-margin-s-bottom"})

        for volume in volumens:
            name = volume.find("span", {"class": "anchor-text"}).text
            url = volume.find("a", {"class": "anchor js-issue-item-link text-m anchor-default"}).get("href")

            # Guardando los resultados
            list_names.append(name)
            list_urls.append(url)

    return list_names, list_urls

In [295]:
names_1, urls_1 = get_volumens(page_1)
names_2, urls_2 = get_volumens(page_2)
names_3, urls_3 = get_volumens(page_3)

names = names_1 + names_2 + names_3
urls = urls_1 + urls_2 + urls_3

# Dataframe
dta_volumens = pd.DataFrame({"volume_name": names, "volume_url": urls})
dta_volumens.head()

Unnamed: 0,volume_name,volume_url
0,Volume 74,/journal/journal-of-macroeconomics/vol/74/suppl/C
1,Volume 73,/journal/journal-of-macroeconomics/vol/73/suppl/C
2,Volume 72,/journal/journal-of-macroeconomics/vol/72/suppl/C
3,Volume 71,/journal/journal-of-macroeconomics/vol/71/suppl/C
4,"Volume 24, Issue 4",/journal/journal-of-macroeconomics/vol/24/issue/4


* Article: `<h3 class="text-m u-font-serif u-display-inline">`
* Url: `<a class="anchor article-content-title u-margin-xs-top u-margin-s-bottom anchor-default">`
* Name: `<span class="js-article-title">`

In [284]:
def get_articles(array):
    
    list_articles = []

    for i in array:
        # Extraendo los nombres de los articulos en cada HTML
        soup = html_import(f"https://www.sciencedirect.com{i}")
        articles = soup.find_all("h3", {"class": "text-m u-font-serif u-display-inline"})

        for article in articles:
            name = article.find("span", {"class": "js-article-title"}).text
            url = article.find("a", {"class": "anchor article-content-title u-margin-xs-top u-margin-s-bottom anchor-default"}).get("href")
            
            # Guardando resultados
            list_articles.append([i, name, url])    

    return list_articles

In [286]:
urls = dta_volumens["volume_url"]

articles = get_articles(urls)

  driver = webdriver.Chrome("./driver/chromedriver.exe")


In [297]:
dta_articles = pd.DataFrame(articles, columns=["volume_url", "article_name", "article_url"])
dta_articles

Unnamed: 0,volume_url,article_name,article_url
0,/journal/journal-of-macroeconomics/vol/74/suppl/C,Congestion in a public health service: A macro...,/science/article/pii/S0164070422000477
1,/journal/journal-of-macroeconomics/vol/74/suppl/C,The wage dispersion effects of international m...,/science/article/pii/S0164070422000490
2,/journal/journal-of-macroeconomics/vol/74/suppl/C,Balanced-budget rules and macroeconomic stabil...,/science/article/pii/S0164070422000507
3,/journal/journal-of-macroeconomics/vol/74/suppl/C,Illiquid investments and the non-monotone rela...,/science/article/pii/S0164070422000532
4,/journal/journal-of-macroeconomics/vol/74/suppl/C,The health gap and its effect on economic outc...,/science/article/pii/S0164070422000544
...,...,...,...
216,/journal/journal-of-macroeconomics/vol/4/issue/1,Understanding inflation accounting: Timothy S....,/science/article/pii/0164070482900246
217,/journal/journal-of-macroeconomics/vol/4/issue/1,Understanding macroeconomics: Robert L. Heilbr...,/science/article/pii/0164070482900258
218,/journal/journal-of-macroeconomics/vol/4/issue/1,"The U.S. monetary system: money, banking, and ...",/science/article/pii/016407048290026X
219,/journal/journal-of-macroeconomics/vol/4/issue/1,Wealth and poverty: George Gilder. New York: B...,/science/article/pii/0164070482900271


* Autores-Banner: `<div class="author-group">`
* Autores: `<a class="author size-m workspace-trigger">`
* Nombre: `<span class="text given-name">`
* Apellido: `<span class="text surname">`
* Doi: `<a class="doi">`
* Keyword `<div class="keyword">`

In [310]:
def get_components(array):
    
    list_components = []

    for i in array:
        # Entraendo los componentes de cada uno de los articulos
        soup = html_import(f"https://www.sciencedirect.com{i}")

        # Elementos
        doi = soup.find("a", {"class": "doi"}).get("href")
        keywords = soup.find_all("div", {"class": "keyword"})
        group_authors = soup.find_all("a", {"class": "author size-m workspace-trigger"})

        authors = []
        for authors in group_authors:
            name = authors.find("span", {"class": "text given-name"}).text
            surname = authors.find("span", {"class": "text surname"}).text

            author = f"{surname}, {name}"
            authors.append(author)

        # Union
        list_components.append([i, authors, doi, keywords])

    return list_components

In [311]:
urls = dta_articles["article_url"]
urls = urls[0:2]
urls

0    /science/article/pii/S0164070422000477
1    /science/article/pii/S0164070422000490
Name: article_url, dtype: object

In [312]:
components = get_components(urls)

  driver = webdriver.Chrome("./driver/chromedriver.exe")


In [314]:
dta_components = pd.DataFrame(components, columns=["article_url", "authors", "doi", "keywords"]) ### Revisar resultados en authors y keywords!!!!
dta_components

Unnamed: 0,article_url,authors,doi,keywords
0,/science/article/pii/S0164070422000477,"[[[Michael], [Kuhn], [<sup>b</sup>], [<sup>c</...",https://doi.org/10.1016/j.jmacro.2022.103451,[]
1,/science/article/pii/S0164070422000490,"[[[Kristina], [Sargent], [<title>Person</title...",https://doi.org/10.1016/j.jmacro.2022.103454,[]
