lamvin/medRxiv-data-statements

medRxiv-data-statements

This repository contains the code used to scrape the data statements of all submissions to medRxiv, then parse and categorize them using regular expressions.

Usage

The main_data.py script scrapes the metadata of all submissions to medRxiv, collects the data statement of each submission individually, and identifies which submissions are related to COVID-19.

tag_topic.py reads the files generated by main_data.py, categorizes each data statement, and generates a figure summarising the percentage of each type of statement.

Methods

We created a web scraper to extract the data statement of every submission to medRxiv. The platform requires authors to provide a data statement with each submission: “At submission authors must include a statement regarding the availability of all data referred to in the manuscript, along with links to any data hosted in an external repository.” The following regular expression was applied to each submission’s title to identify COVID-19-related research:

(\s|\b)(ncov)([^a-z]|\b)|(\s|\b)(novel)[\s-]?(corona\svirus)([^a-z]|\b)|(\s|\b)(sars-cov-2)([^a-z]|\b)|(\s|\b)(covid)([^a-z]|\b)|(\s|\b)(corona\svirus)[\s-]?2(\s|\b)|(\s|\b)(corona\svirus disease 2019)([^a-z]|\b)

This regular expression query allows for variations in form and spelling of the following COVID-related keywords:

  • ncov
  • novel coronavirus
  • sars-cov-2
  • covid
  • coronavirus
  • coronavirus disease 2019
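As a rough illustration of how the title filter could be applied, the sketch below compiles the pattern with every group closed and searches a lowercased title. Lowercasing is an assumption (suggested by the `[^a-z]` guards, but not stated in the source), and the function name `is_covid_title` is hypothetical.

```python
import re

# Title pattern from the Methods text, one alternative per line, with every group closed.
COVID_PATTERN = re.compile(
    r"(\s|\b)(ncov)([^a-z]|\b)"
    r"|(\s|\b)(novel)[\s-]?(corona\svirus)([^a-z]|\b)"
    r"|(\s|\b)(sars-cov-2)([^a-z]|\b)"
    r"|(\s|\b)(covid)([^a-z]|\b)"
    r"|(\s|\b)(corona\svirus)[\s-]?2(\s|\b)"
    r"|(\s|\b)(corona\svirus disease 2019)([^a-z]|\b)"
)


def is_covid_title(title):
    """Return True if the lowercased title matches any COVID-19 keyword.

    Lowercasing before matching is an assumption, not something the
    source specifies.
    """
    return COVID_PATTERN.search(title.lower()) is not None


if __name__ == "__main__":
    for title in ("COVID-19 outcomes in older adults", "Hypertension in adults"):
        print(title, "->", is_covid_title(title))
```

The trailing `([^a-z]|\b)` guard keeps a keyword from matching inside a longer word, so a title containing "covidence" is not flagged.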

We then ran a second set of regular expression queries to classify the data statement of each manuscript. We first split each statement into sentences and categorized each sentence as referring to the code, the data, or both, based on the presence of the words “code” and “data”. We then ran the following regular expression queries to classify the content of each sentence. The categories and their respective queries are:

  • Available: (is|are)\s(available|archived)|all (data|code) (are|is )?(fully )?(available|included)
  • Publicly available: (are|were|is) collected|\spublic(ly)?\s|open.?source|available.?online|public source
  • Within manuscript: supplement|(with)?in\s([a-zA-Z]*\s){0,2}(paper|main text|article|manuscript)
  • Available later: \bwill\s
  • Conditional availability: request|upon|author|on demand be(\b.\b)?available|can be made available|with ([a-zA-Z]\s){0,2}author|as necessary
  • Hyperlink: https
  • Not available: \b(no|not|none).*(available|online|share)|remains property
  • Not applicable: not ap.?licable|\bn.?a\b|no\s([a-zA-Z]\s){0,1}(data|code)\s([a-zA-Z]\s){0,1}(use|referred)
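The sentence-level classification described above might look roughly like the following. The patterns are copied from the list; the naive sentence splitter, the lowercasing step, and the helper names (`split_sentences`, `subject_of`, `matching_categories`) are assumptions made for illustration and may differ from what tag_topic.py actually does.

```python
import re

# Category patterns copied from the list above.
CATEGORY_PATTERNS = {
    "Available": r"(is|are)\s(available|archived)|all (data|code) (are|is )?(fully )?(available|included)",
    "Publicly available": r"(are|were|is) collected|\spublic(ly)?\s|open.?source|available.?online|public source",
    "Within manuscript": r"supplement|(with)?in\s([a-zA-Z]*\s){0,2}(paper|main text|article|manuscript)",
    "Available later": r"\bwill\s",
    "Conditional availability": r"request|upon|author|on demand be(\b.\b)?available|can be made available|with ([a-zA-Z]\s){0,2}author|as necessary",
    "Hyperlink": r"https",
    "Not available": r"\b(no|not|none).*(available|online|share)|remains property",
    "Not applicable": r"not ap.?licable|\bn.?a\b|no\s([a-zA-Z]\s){0,1}(data|code)\s([a-zA-Z]\s){0,1}(use|referred)",
}


def split_sentences(statement):
    """Naive split after '.', '!' or '?' followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", statement)
    return [part.strip() for part in parts if part.strip()]


def subject_of(sentence):
    """Return 'data', 'code', 'both', or None based on which words appear."""
    text = sentence.lower()
    has_data = "data" in text
    has_code = "code" in text
    if has_data and has_code:
        return "both"
    if has_data:
        return "data"
    if has_code:
        return "code"
    return None


def matching_categories(sentence):
    """Return the set of category names whose pattern matches the sentence."""
    text = sentence.lower()  # lowercasing is an assumption
    return {name for name, pattern in CATEGORY_PATTERNS.items()
            if re.search(pattern, text)}


if __name__ == "__main__":
    statement = "Data are available on request. Code will be shared later."
    for sentence in split_sentences(statement):
        print(sentence, subject_of(sentence), sorted(matching_categories(sentence)))
```

A single sentence can match several patterns at once, which is why a tie-breaking rule is needed afterwards.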

We then ranked the categories in the following hierarchy, from highest to lowest priority, to handle statements that match more than one category:

  1. Conditional availability
  2. Not available
  3. Available later
  4. Publicly available
  5. Not applicable
  6. Within manuscript
  7. Available

When a statement matches several categories, the highest-ranked one is retained. For instance, the statement “data are available on request” would be categorised as both available and conditional availability by the regular expressions above, but the presence of the word “request” indicates that it should be categorised as conditional availability.

Similarly, although the statement “the data used are available in the text” would be classified under both available and within manuscript, the latter has priority.
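The tie-breaking rule can be sketched as a walk down the hierarchy that returns the first category present among the matches. The function name `resolve_category` and the choice to return `None` when nothing matches are assumptions; the hyperlink category is left out because it does not appear in the hierarchy.

```python
# Hierarchy from the Methods text, highest priority first.
HIERARCHY = [
    "Conditional availability",
    "Not available",
    "Available later",
    "Publicly available",
    "Not applicable",
    "Within manuscript",
    "Available",
]


def resolve_category(matched):
    """Return the highest-priority category in `matched`, or None if none apply."""
    for category in HIERARCHY:
        if category in matched:
            return category
    return None


if __name__ == "__main__":
    # "data are available on request" matches two categories; the hierarchy picks one.
    print(resolve_category({"Available", "Conditional availability"}))
    print(resolve_category({"Available", "Within manuscript"}))
```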
