# Scraping CRediT statements of journal articles

This code iteratively scrapes CRediT statements of journal articles. The list of articles to scrape was identified by using the "Writing – original draft" search term in the Dimensions AI database.

## Importing packages

In [199]:
import os
import json
import re
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup
import doi
import requests
import cloudscraper
from time import sleep
from pypdf import PdfReader
from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
import pdfminer
from pdfminer.high_level import extract_text
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTTextContainer
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pdfminer.high_level import extract_pages
from io import StringIO

## Exploring the dataset

Loading DOIs to scrape

In [2]:
with open(os.path.abspath("data/credit_observatory_publication_data.json")) as json_data:
    dimensions_data = json.load(json_data)

Looking at one result to see the structure.

In [119]:
dimensions_data[1]

{'_stats': {'total_count': 200},
 'publications': [{'acknowledgements': 'We thank Ben Lehner, Mirko Francesconi, Nobert Perrimon, Thomas Richardson, Robin Evans, Jerome Reboul, Peter Sarkies, Yanhui Hu, and staff at the Drosophila RNAi Screening Center; Todd Harris, Kimberly van Auken, and staff at WormBase, for discussion and advice; Justine Melo and Emily Troemel for the kind gift of strains; Clement Ghigo and Sophie Cypowyj for their contributions; Pierre Golstein and Yishi Jin for critical reading of the manuscript; and the bioinformatics platforms of the CIML and the Laboratoire d’Informatique Fondamentale de Marseille for providing computing resources, database, and web servers. We are grateful to system administrators Manuel Bertrand and Kai Poutrain for their support and input, and the anonymous reviewers for their very constructive criticisms. Some nematode strains were provided by the Caenorhabditis Genetics Center, which is funded by NIH Office of Research Infrastructure Pro

Checking the number of publications.

In [45]:
sum(len(l['publications']) for l in dimensions_data)

400

In [17]:
print(f"The results are stored in a nested array, where each item contains a number of publications nested within. To see the number of publications in each item lookup `{list(dimensions_data[0])[0]}`, whereas to access the publications lookup `{list(dimensions_data[0])[1]}`.")

The results are stored in a nested array, where each item contains a number of publications nested within. To see the number of publications in each item lookup `_stats`, whereas to access the publications lookup `publications`.


There are two main ways to access the articles in our collection. The first is to retrieve them by their DOI, using the `doi` key and the `doi` Python package; this option mainly returns HTML pages. The second is to use the `linkout` key, which sometimes returns HTML pages but often returns a URL to the article in PDF format. I assume that a PDF most likely includes the full text of the article, whereas I cannot say the same about the HTML pages. Moreover, downloading PDFs is faster than accessing HTML pages.

Also, some articles in our collection are open access whereas others are closed, which can influence whether we can access the full text of the manuscript. We therefore first count the open-access and closed articles whose `linkout` points to a PDF, in order to choose a more informed download strategy.

In [48]:
# Creating a df to store results
includes_pdf = pd.DataFrame(columns=["open_access", "pdf"])
# Extracting whether the article includes a pdf or not
for batch in dimensions_data:
    for publication in batch['publications']:
        # Search PDF format at the end of the linkout
        match = re.match(r".*pdf$", publication['linkout'])
        includes_pdf = pd.concat([includes_pdf,
                                  pd.DataFrame([
                                      {"open_access": publication['open_access'][0], "pdf": bool(match)}
                                  ])
                                 ])
# Summarizing the results
includes_pdf.groupby(['open_access', 'pdf']).value_counts()

open_access  pdf  
oa_all       False    131
             True     269
Name: count, dtype: int64
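
The same tally can be computed without growing a DataFrame row by row. The following is a minimal standard-library sketch, not the method used above: `is_pdf_link` and `tally_pdf_links` are hypothetical helpers, and the two sample records only mimic the structure of `dimensions_data`. Unlike the regular expression above, it inspects only the path of the URL, so a trailing query string would not hide a PDF link.

```python
from collections import Counter
from urllib.parse import urlparse


def is_pdf_link(url):
    """Return True when the path of the URL ends in 'pdf' (hypothetical helper)."""
    # Only the path is inspected, so query strings are ignored
    return urlparse(url).path.lower().endswith("pdf")


def tally_pdf_links(batches):
    """Count publications by (open access category, linkout is a PDF)."""
    counts = Counter()
    for batch in batches:
        for publication in batch["publications"]:
            key = (publication["open_access"][0], is_pdf_link(publication["linkout"]))
            counts[key] += 1
    return counts


# Two records that only mimic the structure of dimensions_data
sample = [{"publications": [
    {"open_access": ["oa_all"], "linkout": "http://www.cell.com/article/S2451945616000258/pdf"},
    {"open_access": ["oa_all"], "linkout": "https://doi.org/10.1016/j.celrep.2015.12.089"},
]}]
pdf_counts = tally_pdf_links(sample)
print(pdf_counts)
```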

The linkout points to a PDF in more than half of the cases (269 of 400). Accessing only the HTML pages would be preferable, as it would let us use a single extraction method. However, some webpages are protected by Cloudflare, so we need Selenium to access the content, and opening HTML pages is more time-consuming than downloading PDFs. Since we have a lot of files, we download the PDFs wherever possible to save time. The downside is that PDF files take up much more disk space. Saving the files locally and extracting the information in a separate step is the safer option, because we do not have to download a file again if an error occurs. To save space, however, we might download a file, extract the information, and then delete the file. Another option is to keep one batch of files at a time and proceed once the batch has been cleaned up.
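
The batch strategy described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `process_in_batches` is a hypothetical helper, and the `download` and `extract` callables are placeholders for the Selenium download and the text extraction steps.

```python
import os
import tempfile


def process_in_batches(items, batch_size, download, extract):
    """Download a batch, extract from each file, then delete the files.

    download(item) must return the path of the saved file and
    extract(path) the extracted information; both are placeholders.
    """
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # Step 1: download the whole batch
        paths = [download(item) for item in batch]
        # Step 2: extract, then remove each file to free disk space
        for path in paths:
            try:
                results.append(extract(path))
            finally:
                os.remove(path)
    return results


# Toy usage with stand-in callables that write and read small text files
workdir = tempfile.mkdtemp()


def fake_download(name):
    path = os.path.join(workdir, f"{name}.txt")
    with open(path, "w", encoding="utf-8") as file:
        file.write(f"content of {name}")
    return path


def fake_extract(path):
    with open(path, encoding="utf-8") as file:
        return file.read()


texts = process_in_batches(["a", "b", "c"], batch_size=2,
                           download=fake_download, extract=fake_extract)
print(texts, os.listdir(workdir))
```

With this layout, at most `batch_size` files are on disk at any time, at the cost of having to download a file again if its extraction fails.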

Note:
* Downloading all 1.6 million publications may take too long. Instead, we could download a few thousand from each year for each journal title.
* For open-access papers we could search for the download link of the full-text PDF, but that would take more time and could introduce errors.
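
A per-year, per-journal sample like the one suggested in the first note could be drawn as sketched below. The `year` and `journal` keys are assumptions about the records (they do not appear in the excerpt shown above), and `sample_per_group` is a hypothetical helper.

```python
import random
from collections import defaultdict


def sample_per_group(publications, per_group, seed=0):
    """Draw up to per_group publications for every (journal, year) pair."""
    groups = defaultdict(list)
    for publication in publications:
        # 'journal' and 'year' are assumed keys
        groups[(publication["journal"], publication["year"])].append(publication)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sampled = []
    for key in sorted(groups):
        members = groups[key]
        sampled.extend(rng.sample(members, min(per_group, len(members))))
    return sampled


# Made-up records for illustration only
toy = [{"id": i, "journal": "A" if i % 2 else "B", "year": 2016 + i % 3} for i in range(30)]
subset = sample_per_group(toy, per_group=2)
print(len(subset))
```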

## Scraping CRediT statements

### Listing publications for scraping

Creating an array of publications with id, doi, and linkout for download.

In [181]:
# Creating a container for storing the linkouts
publication_list = []
# Extracting linkouts and flagging whether each linkout points to a PDF
for batch in dimensions_data:
    for publication in batch['publications']:
        publication_list.append({'id': publication['id'], 'doi': publication['doi'], 'linkout': publication['linkout'], 'pdf': bool(re.match(r".*pdf$", publication['linkout']))})
        
# Returning the first few results
publication_list[1:100]

[{'id': 'pub.1035271577',
  'doi': '10.1016/j.chembiol.2015.11.013',
  'linkout': 'http://www.cell.com/article/S2451945616000258/pdf',
  'pdf': True},
 {'id': 'pub.1017265217',
  'doi': '10.1016/j.molcel.2016.01.004',
  'linkout': 'http://www.cell.com/article/S1097276516000058/pdf',
  'pdf': True},
 {'id': 'pub.1012497217',
  'doi': '10.1016/j.celrep.2016.01.007',
  'linkout': 'http://www.cell.com/article/S2211124716000279/pdf',
  'pdf': True},
 {'id': 'pub.1002596264',
  'doi': '10.1016/j.cub.2015.12.045',
  'linkout': 'http://www.cell.com/article/S0960982215015766/pdf',
  'pdf': True},
 {'id': 'pub.1047271733',
  'doi': '10.1016/j.str.2015.12.008',
  'linkout': 'http://www.cell.com/article/S0969212615005341/pdf',
  'pdf': True},
 {'id': 'pub.1045666532',
  'doi': '10.1016/j.celrep.2015.12.089',
  'linkout': 'https://doi.org/10.1016/j.celrep.2015.12.089',
  'pdf': False},
 {'id': 'pub.1039891981',
  'doi': '10.1016/j.neuron.2015.12.020',
  'linkout': 'http://www.cell.com/article/S0896

Getting the HTML or PDF file corresponding to each publication iteratively. We extract the PDFs whenever possible, otherwise we resort to the HTMLs.

### Setting up webdriver

We tried using Firefox as our driver; however, the driver got stuck after PDF downloads on recent versions of Firefox. We therefore switched to Chrome.

In [317]:
# Using Chrome webdriver

# Setting download directory
output_dir = os.path.abspath("data/publications/")

# Set Chrome browser preferences
options = ChromeOptions()
options.add_experimental_option('prefs',  {
    "download.default_directory": output_dir,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True,
    "safebrowsing.enabled": True
      }
  )

# Set driver options
#options.add_argument('--headless')
# In headless mode PDF files are not downloaded and there is no error message
# Some suggest Firefox, but that does not work for other reasons
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

# Adding chromedriver path during initiation
service = Service(r"/home/marton/Documents/chromedriver_linux64/chromedriver")
# Create a new Chrome browser instance
driver = webdriver.Chrome(service=service, options=options)

### Exception handling

We store error messages separately to deal with them later. Since we plan to scrape a large number of papers, adding fine-grained error handling would considerably increase the time it takes to scrape the publications, and we would like to run the code overnight. Therefore, we only apply a general try-except block. We also record the pages that we could download.

### Downloading publications

In [312]:
def download_file(input, dir, driver):
    """
    Downloading PDF and HTML documents using Selenium
    
    Filenames are the Dimensions ID for HTML pages, whereas PDFs keep the name assigned by the server, as renaming the files would require waiting for each download to finish.
    
    :param input: Array where data corresponding to each file is stored in an object with the following keys:
        id: unique identifier of the publication
        linkout: link to the file to download
        pdf: true if the file is a pdf
    :param dir: Path to the dir where files should be downloaded
    :param driver: Webdriver to use for downloading
    
    :return: The function downloads the files as a side-effect and returns an array
        containing an object for each file with the following keys:
        id: the name of the file
        message: either an error message or 'downloaded'
    """
    # Create a container for storing the results
    res = []
    for publication in input:
        try:
            # If PDF is available downloading the file
            if publication['pdf']:
                # Print downloaded file info
                print(f"Downloading PDF file: {publication['linkout']}")
                # Download the PDF using Selenium
                # We have to use Selenium in order to avoid 403 by Cloudflare
                download_time = datetime.now()
                driver.get(publication['linkout'])
            # Every other case we download the HTML
            else:
                # Print downloaded file info
                print(f"Downloading HTML file: {publication['linkout']}")
                # Access the HTML
                ## It is possible that some pages use redirect links
                ## Waiting for pages to redirect would slow down process
                ## We will only resort to this if the number of downloaded files are limited
                driver.get(publication['linkout'])
                # Get the page source
                page_source = driver.page_source
                # Save the page source to a file
                ## Since DOIs contain slashes we use the Dimensions AI ID as a filename
                output_file = f"{publication['id']}.html"
                with open(os.path.join(dir, output_file), 'w', encoding='utf-8') as file:
                    file.write(page_source)
            res.append({'id': publication['id'], 'message': "downloaded"})
        except Exception as e:
            print(f"Error occurred: {publication['linkout']}")
            print(str(e))
            res.append({'id': publication['id'], 'message': str(e)})
            continue
            
    # Returning the download state
    return res

In [316]:
# Downloading the first ten publications as a test run
download_list = download_file(input=publication_list[0:10], dir=output_dir, driver=driver)

Downloading PDF file: http://www.cell.com/article/S0960982215015596/pdf
Downloading PDF file: http://www.cell.com/article/S2451945616000258/pdf
Downloading PDF file: http://www.cell.com/article/S1097276516000058/pdf
Downloading PDF file: http://www.cell.com/article/S2211124716000279/pdf
Downloading PDF file: http://www.cell.com/article/S0960982215015766/pdf
Downloading PDF file: http://www.cell.com/article/S0969212615005341/pdf
Downloading HTML file: https://doi.org/10.1016/j.celrep.2015.12.089
Downloading PDF file: http://www.cell.com/article/S0896627315011228/pdf
Downloading PDF file: http://www.cell.com/article/S2211124715015399/pdf
Downloading PDF file: http://www.cell.com/article/S0896627315011216/pdf


In [318]:
# Close the browser
## The driver has to be closed manually later, as the last download may still be in progress when the driver is closed
driver.quit()

# Save the download status of each publication (including error messages) as a JSON file
with open(os.path.abspath("data/downloaded-paper_list_data.json"), 'w') as f:
    json.dump(download_list, f)
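
One way to avoid quitting the driver while a download is still running is to poll the download directory first. The sketch below relies on Chrome giving in-progress downloads a `.crdownload` suffix; `wait_for_downloads` is a hypothetical helper and not part of Selenium.

```python
import os
import tempfile
import time


def wait_for_downloads(directory, timeout=60.0, poll_interval=0.5):
    """Return True once no '.crdownload' file is left in directory.

    Returns False if partial downloads are still present after timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        pending = [name for name in os.listdir(directory) if name.endswith(".crdownload")]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Toy check on an empty temporary directory
empty_dir = tempfile.mkdtemp()
finished = wait_for_downloads(empty_dir, timeout=1.0)
print(finished)
```

Calling `wait_for_downloads(output_dir)` right before `driver.quit()` would let the last PDF finish without a fixed sleep. It cannot detect a download that has not started writing to disk yet.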

For the HTML files, some downloaded pages may not contain the _Authors contribution_ section for two main reasons: the HTML does not contain the full text of the manuscript, or the downloaded page is not the page of the paper (only a redirect). Because of this, the following two counts are only approximations. These cases will be dropped during the extraction process.

The number of cases where we could access the publication:

In [None]:
#with open(os.path.abspath("data/downloaded-paper_list_data.json")) as json_data:
    #download_list = json.load(json_data)

In [291]:
sum(l['message'] == "downloaded" for l in download_list)

6

In [300]:
download_list

[{'id': 'pub.1039110450', 'message': 'downloaded'},
 {'id': 'pub.1035271577', 'message': 'downloaded'},
 {'id': 'pub.1017265217', 'message': 'downloaded'},
 {'id': 'pub.1012497217', 'message': 'downloaded'},
 {'id': 'pub.1002596264', 'message': 'downloaded'},
 {'id': 'pub.1047271733', 'message': 'downloaded'},
 {'id': 'pub.1045666532', 'message': 'downloaded'},
 {'id': 'pub.1039891981', 'message': 'downloaded'},
 {'id': 'pub.1034674785', 'message': 'downloaded'},
 {'id': 'pub.1025201128', 'message': 'downloaded'}]

The number of cases where we could not access the publication:

In [293]:
sum(l['message'] != "downloaded" for l in download_list)

4

### Extracting CRediT roles

Listing the downloaded articles.

In [64]:
publications_downloaded = os.listdir(output_dir)

# Show a few names
print(publications_downloaded[0:5])

['pub.1045666532.html', 'PIIS0969212615005341.pdf', 'PIIS2451945616000258.pdf', 'PIIS2211124716000279.pdf', 'PIIS0960982215015596.pdf']


Trying to find a regular expression that catches _"Writing – original draft"_ regardless of how it is written.

In [138]:
pattern = r"(?i)\bwriting\b.{0,30}\boriginal\b.{0,30}\bdraft\b|\bwriting\b.{0,30}\bdraft\b.{0,30}\boriginal\b|\boriginal\b.{0,30}\bwriting\b.{0,30}\bdraft\b|\boriginal\b.{0,30}\bdraft\b.{0,30}\bwriting\b|\bdraft\b.{0,30}\bwriting\b.{0,30}\boriginal\b|\bdraft\b.{0,30}\boriginal\b.{0,30}\bwriting\b"
variations = ["Writing – original draft", "Writing – Original Draft", "Writing original draft", "Writing (Original draft)", "original draft writing", "Writing – review & editing", "Writing", "Original draft"]
for search_term in variations:
    match = re.match(pattern, search_term)
    if match:
        print("Search term:", search_term, "***", "Match found:", match.group())
    else:
        print("Search term:", search_term, "***", "Match not found")

Search term: Writing – original draft *** Match found: Writing – original draft
Search term: Writing – Original Draft *** Match found: Writing – Original Draft
Search term: Writing original draft *** Match found: Writing original draft
Search term: Writing (Original draft) *** Match found: Writing (Original draft
Search term: original draft writing *** Match found: original draft writing
Search term: Writing – review & editing *** Match not found
Search term: Writing *** Match not found
Search term: Original draft *** Match not found
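
The six alternatives in the pattern are the permutations of three words, so the pattern can also be generated instead of typed out. The sketch below is one possible way to do this; `build_any_order_pattern` is a hypothetical helper, and its `gap` parameter mirrors the `.{0,30}` used above.

```python
import re
from itertools import permutations


def build_any_order_pattern(words, gap=30):
    """Build a case-insensitive regex matching all words in any order."""
    alternatives = []
    for ordering in permutations(words):
        # Each word must appear as a whole word
        bounded = [rf"\b{re.escape(word)}\b" for word in ordering]
        # At most `gap` characters may separate consecutive words
        alternatives.append(f".{{0,{gap}}}".join(bounded))
    return re.compile("|".join(alternatives), re.IGNORECASE)


generated = build_any_order_pattern(["writing", "original", "draft"])
for term in ["Writing – original draft", "original draft writing", "Writing – review & editing"]:
    print(term, "->", bool(generated.search(term)))
```

Generating the alternatives keeps the pattern easy to extend to other CRediT roles by passing a different word list.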


Extract the CRediT roles separately for the HTML and PDF files.

In [None]:
# Container for storing the results of the extraction
publication_results = []

In [319]:
def local_extraction(input, dir, search_term):
    """
    Extracting paragraphs from PDF and HTML documents stored in a local directory based on a search term.
    
    :param input: Array of file names
    :param dir: Path to the dir containing the files
    :param search_term: Regex pattern to search for
    
    :return: The function returns every paragraph including the search term from the document.
        The function returns an array containing an object for each file with the following keys:
        id: the name of the file
        index: unique identifier for each match within the file
        text: the text of the extracted paragraph
    """
    # Container for storing the results
    output = []
    # Extracting text from inputs
    for publication in input:
        # Check the format of the file
        extension = os.path.splitext(publication)[1]
        filename = os.path.splitext(publication)[0]
        # Create full path
        file_path = os.path.join(dir, publication)
        if extension == ".pdf":
            # Getting pdf metadata
            with open(file_path, 'rb') as f:
                reader = PdfReader(f)
                number_of_pages = len(reader.pages)
                meta = reader.metadata
            # Counting matches across the whole file so that each index is unique
            match_index = 0
            # Extracting the layout of each page separately with pdfminer
            for page_layout in extract_pages(file_path, laparams=LAParams()):
                # Iterating through elements on the page
                # It is possible that multiple elements contain the search term
                # Returning every match
                for element in page_layout:
                    # Only working with text elements
                    if isinstance(element, LTTextContainer):
                        # Look for container containing the search term
                        text = element.get_text()
                        if re.search(search_term, text, re.IGNORECASE):
                            # Storing the text itself rather than the layout object
                            output.append({'id': filename, 'index': match_index, 'text': text})
                            match_index += 1
        elif extension == ".html":
            # Parsing html
            with open(file_path, 'rb') as f:
                source = f.read()
                soup = BeautifulSoup(source, "html.parser")
            # Finding the text nodes that include the search term
            # Returning the text of the parent element for every match
            for index, element in enumerate(soup.find_all(string=re.compile(search_term))):
                output.append({'id': filename, 'index': index, 'text': element.parent.text})
    # Returning the results
    return output

In [150]:
publication_results = local_extraction(input=publications_downloaded, dir=output_dir, search_term=pattern)
publication_results[0:5]

Since we do not catch cases during extraction where we could not find the _Authors contributions_ text, we count the number of extracted CRediT statements now.

The number of matches (each publication could contain multiple matches):

The number of unique publications with a match:

The number of cases where we could access the publication but could not extract the CRediT statement:
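
The three counts above could be derived from the extraction output and the directory listing roughly as follows. This is a sketch on made-up data, not actual results: `summarise_extraction` is a hypothetical helper, and it assumes that the `id` in each result is the file name without its extension, as returned by `local_extraction`.

```python
import os


def summarise_extraction(file_names, results):
    """Summarise matches per downloaded file.

    file_names: names of the downloaded files (with extension)
    results: list of dicts with at least an 'id' key (file name without extension)
    """
    downloaded = {os.path.splitext(name)[0] for name in file_names}
    matched = {result["id"] for result in results}
    return {
        "matches": len(results),
        "publications_with_match": len(matched),
        "publications_without_match": len(downloaded - matched),
    }


# Made-up inputs for illustration only
files = ["pub.1.html", "PIIS1.pdf", "PIIS2.pdf"]
hits = [
    {"id": "pub.1", "index": 0, "text": "..."},
    {"id": "pub.1", "index": 1, "text": "..."},
    {"id": "PIIS1", "index": 0, "text": "..."},
]
summary = summarise_extraction(files, hits)
print(summary)
```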

## Exploring the results

Finding elements in the HTML files that include our search term: "Writing – original draft". The term may be present multiple times (e.g. if the paper discusses the CRediT taxonomy instead of using it to report the authors' contributions). It is also possible that the search term is absent from the HTML file even though Dimensions AI found it in its records for the given research article, because Dimensions AI does not report in which metadata field the search term was found. Also, not all of the papers in our sample are open access, so the returned HTML may not include the section of the article containing our search term. We will flag these cases for further exploration.

Notes for myself: Do one with sci-hub instead; it is cleaner. Create a version of the code where the PDF is mined after extraction, the text is saved, and the PDF is deleted. Use functions to make the code more flexible.