# Scraping CRediT statements of journal articles

This code iteratively scrapes the CRediT statements of journal articles. The list of articles to scrape was identified by searching the Dimensions AI database for the term "Writing – original draft".

## Importing packages

In [36]:
import os
import json
import re
import pandas as pd
import doi
import requests
import cloudscraper
from pypdf import PdfReader
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

## Exploring the dataset

Loading DOIs to scrape

In [5]:
with open(os.path.abspath("data/credit_observatory_publication_data.json")) as json_data:
    dimensions_data = json.load(json_data)

Looking at one result to see the structure.

In [18]:
dimensions_data[0]

{'_stats': {'total_count': 200},
 'publications': [{'acknowledgements': 'We thank Kelley Voss for assistance tabulating data from video, Elisabeth Zieger, Christine Huffard, and Jennifer Mather for assistance with literature, Stefan Linquist for assistance with collection of images, Brian Farm for octopus line drawings, and Mick Saliwon and Lyn Cleary of the OceanTrek Diving Resort for their support in the field. The Graphical Abstract was drawn by Eliza Jewett-Hall (© P.G.-S.). This study was unobtrusive observation only of undisturbed wild non-protected invertebrate animals that were not manipulated in any way and therefore was not required to be reviewed by the Alaska Pacific University Institutional Review Board or the CUNY Institutional Animal Care and Use Committee. Financial support for this study was provided to P.G.-S. by the City University of New York and to D.S. through Alaska Pacific University from donations by the Pollock Conservation Consortium. Findings and conclusions

Checking the number of publications.

In [45]:
sum(len(l['publications']) for l in dimensions_data)

400

In [17]:
print(f"The results are stored in a nested array, where each item contains a number of publications. To see the number of publications in each item, look up `{list(dimensions_data[0])[0]}`; to access the publications, look up `{list(dimensions_data[0])[1]}`.")

The results are stored in a nested array, where each item contains a number of publications. To see the number of publications in each item, look up `_stats`; to access the publications, look up `publications`.


There are two main ways to access the articles in our collection. The first is to resolve the DOI stored under the `doi` key using the `doi` Python package; this mostly returns HTML pages. The second is to follow the `linkout` key, which sometimes returns an HTML page but often returns a URL to the article in PDF format. I assume that a returned PDF most likely contains the full text of the article, whereas I cannot say the same about the HTML pages.
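The two access routes can be sketched with the standard library alone. This is a minimal illustration, not the method used later in this notebook: `doi_to_url` and `linkout_is_pdf` are hypothetical helper names, the resolver URL is built by hand rather than through the `doi` package, and the PDF check inspects only the URL path, which is slightly more lenient than the regular expression used below.

```python
from urllib.parse import quote, urlparse


def doi_to_url(doi: str) -> str:
    """Build the doi.org resolver URL for a DOI (route 1)."""
    # Keep "/" unescaped because it separates the DOI prefix and suffix
    return "https://doi.org/" + quote(doi.strip(), safe="/")


def linkout_is_pdf(linkout: str) -> bool:
    """Guess from the URL path whether a linkout points to a PDF (route 2)."""
    # Only the path is inspected, so a query string does not hide the suffix
    path = urlparse(linkout).path.rstrip("/").lower()
    return path.endswith("pdf")


# Example record with the same `doi` and `linkout` keys as the Dimensions export
example = {
    "doi": "10.1016/j.cub.2015.12.045",
    "linkout": "http://www.cell.com/article/S0960982215015766/pdf",
}
print(doi_to_url(example["doi"]))
print(linkout_is_pdf(example["linkout"]))
```

Whether the resolved DOI page or the linkout actually exposes the full text still has to be checked per publisher.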

Also, some articles in our collection are open access whereas others are closed. This too can influence whether we can access the full text of the manuscript to scrape. Thus, we first count how many open-access and closed articles have a PDF linkout, in order to devise a more informed search strategy.

In [48]:
# Collecting one record per publication
records = []
# Extracting whether the article includes a pdf or not
for batch in dimensions_data:
    for publication in batch['publications']:
        # Search PDF format at the end of the linkout
        match = re.match(r".*pdf$", publication['linkout'])
        records.append({"open_access": publication['open_access'][0], "pdf": bool(match)})
# Creating a df to store results (built once instead of concatenating inside the loop)
includes_pdf = pd.DataFrame(records, columns=["open_access", "pdf"])
# Summarizing the results
includes_pdf.groupby(['open_access', 'pdf']).value_counts()

open_access  pdf  
oa_all       False    131
             True     269
Name: count, dtype: int64

A PDF linkout is present in more than half of the cases (269 of 400). It would be preferable to access only the HTML pages so that a single extraction method suffices. However, an initial exploration showed that some webpages are protected by Cloudflare, so we would need a headless browser to access their content. This can be time-consuming, so it makes sense to download PDFs whenever possible and resort to scraping with Selenium only if a PDF is not available.

## Accessing webpages

Creating an array of linkouts for download.

In [11]:
# Creating a container for storing the linkouts
publication_list = []
# Extracting linkouts and flagging PDFs
for batch in dimensions_data:
    for publication in batch['publications']:
        publication_list.append({'doi': publication['doi'], 'linkout': publication['linkout'], 'pdf': bool(re.match(r".*pdf$", publication['linkout']))})
        
# Returning a few results (the items at index 1 to 4)
publication_list[1:5]

[{'doi': '10.1016/j.chembiol.2015.11.013',
  'linkout': 'http://www.cell.com/article/S2451945616000258/pdf',
  'pdf': True},
 {'doi': '10.1016/j.molcel.2016.01.004',
  'linkout': 'http://www.cell.com/article/S1097276516000058/pdf',
  'pdf': True},
 {'doi': '10.1016/j.celrep.2016.01.007',
  'linkout': 'http://www.cell.com/article/S2211124716000279/pdf',
  'pdf': True},
 {'doi': '10.1016/j.cub.2015.12.045',
  'linkout': 'http://www.cell.com/article/S0960982215015766/pdf',
  'pdf': True}]

Getting the HTML or PDF file corresponding to each publication iteratively. I plan to download each file and handle extraction in a separate step. We download the PDF whenever possible; otherwise we fall back to the HTML page.

### Setting up webdriver

In [111]:
# Decided to use Firefox as a browser as Chrome had issues

# Setting download dir
output_dir = os.path.abspath("data/")

# Firefox webdriver issue
## Selenium stops for some reason after download
## Some say it is a firefox issue (present only in new versions)
## There is a suggested solution here: https://stackoverflow.com/questions/63184810/selenium-firefox-browser-is-stuck-after-downloading-pdf

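# Set Firefox browser preferences
# The options object must exist because it is passed to webdriver.Firefox below;
# the individual preferences remain commented out
firefox_options = webdriver.FirefoxOptions()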
#firefox_options.set_preference("browser.download.folderList", 2)
#firefox_options.set_preference("browser.download.dir", output_dir)
#firefox_options.set_preference("browser.download.useDownloadDir", True)
#firefox_options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
#firefox_options.set_preference("browser.download.manager.closeWhenDone", True)
#firefox_options.set_preference("browser.download.manager.focusWhenStarting", False)
#firefox_options.set_preference("browser.download.manager.showWhenStarting", False)
#firefox_options.set_preference("pdfjs.disabled", True)
#firefox_options.set_preference("plugin.scan.Acrobat", "99.0")
#firefox_options.set_preference("plugin.scan.plid.all", False)
#firefox_options.set_preference("pdfjs.enabledCache.state", False)
#firefox_options.set_preference("browser.download.alwaysOpenPanel", False)

# Set driver options
#firefox_options.add_argument("-headless")

# Geckodriver should be downloaded and placed in /usr/local/bin in linux
# Otherwise, uncomment this line and set path to geckodriver executable
# And include param executable_path=webdriver_path in webdriver instance
# webdriver_path = 'path/to/geckodriver'

# Create a new Firefox browser instance
driver = webdriver.Firefox(options=firefox_options)

### Accessing publications

In [109]:
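# Shuts down a previously started browser session; re-run the webdriver setup cell before continuing
driver.quit()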

In [113]:
for index, publication in enumerate(publication_list):
    # If PDF is available downloading the file
    if publication['pdf']:
        # Print downloaded file info
        print(f"Downloading PDF file: {publication['linkout']}")
        # Access the URL using Selenium
        # We have to use Selenium in order to avoid 403 by Cloudflare
        driver.get(publication['linkout'])
        # get the URL from Selenium 
        current_pdf_url = driver.current_url
        # create a requests session
        session = requests.session()
        # add Selenium's cookies to requests
        selenium_cookies = driver.get_cookies()
        for cookie in selenium_cookies:
            session.cookies.set(cookie["name"], cookie["value"])
            
        # Finally, re-send the request with requests.session
        pdf_response = session.get(current_pdf_url)

        # access the bytes response from the session
        pdf_bytes = pdf_response.content
        # Report the payload size instead of dumping the raw bytes
        print(f"Received {len(pdf_bytes)} bytes")
# Close the browser
driver.quit()

Downloading PDF file: http://www.cell.com/article/S0960982215015596/pdf


WebDriverException: Message: Failed to decode response from marionette
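Since the plan is to download each file and extract later, the bytes returned by the session could be checked and written to disk. The sketch below is one possible way to do that, under assumptions: the helper names (`looks_like_pdf`, `doi_to_filename`, `save_pdf`) and the DOI-based filename scheme are hypothetical, and the payloads are made-up stand-ins for `pdf_response.content`. The check relies on the fact that PDF files begin with `%PDF-`, so an HTML challenge page returned in place of the article is skipped rather than saved.

```python
import os
import re
import tempfile
from typing import Optional


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF magic bytes."""
    # PDF files begin with "%PDF-"; an HTML page does not
    return data[:1024].lstrip().startswith(b"%PDF-")


def doi_to_filename(doi: str) -> str:
    """Turn a DOI into a filesystem-safe PDF filename (hypothetical naming scheme)."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", doi) + ".pdf"


def save_pdf(data: bytes, doi: str, target_dir: str) -> Optional[str]:
    """Write the payload to target_dir if it is a PDF; return the path or None."""
    if not looks_like_pdf(data):
        return None
    os.makedirs(target_dir, exist_ok=True)
    path = os.path.join(target_dir, doi_to_filename(doi))
    with open(path, "wb") as pdf_file:
        pdf_file.write(data)
    return path


# Usage with made-up payloads standing in for `pdf_response.content`
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_pdf(b"%PDF-1.7\n...", "10.1016/j.cub.2015.12.045", tmp_dir)
    skipped = save_pdf(b"<html>blocked</html>", "10.1016/j.cub.2015.12.045", tmp_dir)
    print(os.path.basename(saved), skipped)
```

Publications for which `save_pdf` returns `None` could be queued for the HTML route instead.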


Loading the article files iteratively into an array. This code is only needed if the kernel was restarted and the mined HTML files were lost from working memory.

In [None]:
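import os
from bs4 import BeautifulSoup

# Directory holding the downloaded HTML files
sites_dir = "data/sites/"
# Listing filenames (top level of the directory only)
filenames = next(os.walk(sites_dir), (None, None, []))[2]
# Creating array for storing the parsed htmls
sites = []
# Loading htmls and parsing them
for filename in filenames:
    # os.walk returns bare filenames, so the directory has to be prepended
    with open(os.path.join(sites_dir, filename), encoding="utf-8") as fp:
        html = BeautifulSoup(fp, "html.parser")
    # append keeps one parsed document per file; extend would splice in its child nodes
    sites.append(html)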

Finding elements in the HTML files that include our search term: "Writing – original draft". The term may be present multiple times (e.g. if the paper discusses the CRediT taxonomy instead of using it to report the authors' contributions). It is also possible that the search term is not present in the HTML file even though Dimensions AI found it in its records for the given research article, because Dimensions AI does not report in which metadata field the search term was found. Also, not all of the papers in our sample are open access, so the returned HTML may not include the section of the article containing our search term. We will flag these cases for further exploration.
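The flagging logic described above could look roughly like the following sketch. It assumes the page text has already been extracted into a plain string (for example with BeautifulSoup's `get_text()`); the function name `flag_document`, the status labels, and the tolerance for different dash characters and letter case are illustrative assumptions rather than part of the original pipeline.

```python
import re

SEARCH_TERM = "Writing – original draft"

# Assumption: publishers may typeset the separator as a hyphen, en dash, or em dash,
# and may vary whitespace and capitalisation
TERM_PATTERN = re.compile(
    r"writing\s*[-\u2013\u2014]\s*original\s+draft", re.IGNORECASE
)


def flag_document(text: str) -> dict:
    """Count occurrences of the search term and label the document."""
    count = len(TERM_PATTERN.findall(text))
    if count == 0:
        status = "missing"   # term not found: flag for further exploration
    elif count == 1:
        status = "single"    # the expected case for one CRediT statement
    else:
        status = "multiple"  # may discuss the taxonomy or list several authors
    return {"count": count, "status": status}


# Made-up snippets standing in for extracted page text
print(flag_document("A.B.: Writing – original draft; C.D.: Supervision"))
print(flag_document("This page shows only the abstract."))
```

Documents labelled `missing` or `multiple` would then be reviewed manually or routed to the PDF version.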