medRxiv-data-statements

This repository contains the code used to scrape the data statements of all submissions to medRxiv, then parse and categorize them using regular expressions.

Usage

The main_data.py script scrapes the metadata of all submissions to medRxiv, collects the data statement of each submission individually, and identifies which submissions are related to COVID-19.

The tag_topic.py script reads the files generated by main_data.py, categorizes each data statement, and generates a figure summarising the percentage of each type of statement.

Methods

We created a web scraper to extract the data statements of all submissions to medRxiv. This platform requires authors to provide a data statement with each submission: “At submission authors must include a statement regarding the availability of all data referred to in the manuscript, along with links to any data hosted in an external repository.” The following regular expression was applied to each submission's title to identify COVID-19 related research:

(\s|\b)(ncov)([^a-z]|\b)|(\s|\b)(novel)[\s-]?(corona\svirus)([^a-z]|\b)|(\s|\b)(sars-cov-2)([^a-z]|\b)|(\s|\b)(covid)([^a-z]|\b)|(\s|\b)(corona\svirus)[\s-]?2(\s|\b)|(\s|\b)(corona\svirus disease 2019)([^a-z]|\b)

This regular expression allows for variations in the form and spelling of the following COVID-19 related keywords:

ncov
novel coronavirus
sars-cov-2
covid
coronavirus
coronavirus disease 2019
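
As a minimal sketch of how the title filter could be applied, the snippet below compiles the pattern from above, with the group before the sars-cov-2 alternative closed so that it compiles. Case-insensitive matching, the function name is_covid_related and the example titles are assumptions made for illustration; they are not taken from main_data.py.

```python
import re

# Pattern copied from the Methods text, one alternative per line.
# Assumption: matching is case-insensitive (the README does not say).
COVID_PATTERN = re.compile(
    r"(\s|\b)(ncov)([^a-z]|\b)"
    r"|(\s|\b)(novel)[\s-]?(corona\svirus)([^a-z]|\b)"
    r"|(\s|\b)(sars-cov-2)([^a-z]|\b)"
    r"|(\s|\b)(covid)([^a-z]|\b)"
    r"|(\s|\b)(corona\svirus)[\s-]?2(\s|\b)"
    r"|(\s|\b)(corona\svirus disease 2019)([^a-z]|\b)",
    re.IGNORECASE,
)


def is_covid_related(title: str) -> bool:
    """Return True if the title matches any COVID-19 keyword alternative."""
    return COVID_PATTERN.search(title) is not None


# Hypothetical titles, for illustration only.
titles = [
    "COVID-19 outcomes in hospitalised adults",
    "Seroprevalence of SARS-CoV-2 in blood donors",
    "Estimating 2019-nCoV transmission",
    "Statin use and cardiovascular risk",
]
flags = {title: is_covid_related(title) for title in titles}
```

Here flags maps each title to a boolean; only the last title is left unflagged.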

We then ran another set of regular expression queries to classify the data statement of each manuscript. We first split each statement into sentences and categorized each sentence as referring to the code, the data, or both, based on the presence of those words. We then ran the following regular expression queries to classify the content of each sentence. The categories and their respective queries are listed below:

Available: (is|are)\s(available|archived)|all (data|code) (are|is )?(fully )?(available|included)
Publicly available: (are|were|is) collected|\spublic(ly)?\s|open.?source|available.?online|public source
Within manuscript: supplement|(with)?in\s([a-zA-Z]*\s){0,2}(paper|main text|article|manuscript)
Available later: \bwill\s
Conditional availability: request|upon|author|on demand be(\b.\b)?available|can be made available|with ([a-zA-Z]\s){0,2}author|as necessary
Hyperlink: https
Not available: \b(no|not|none).*(available|online|share)|remains property
Not applicable: not ap.?licable|\bn.?a\b|no\s([a-zA-Z]\s){0,1}(data|code)\s([a-zA-Z]\s){0,1}(use|referred)
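
The sentence-level steps above can be sketched as follows. The patterns are copied from the list; everything else is an assumption made for illustration rather than the logic of tag_topic.py: the naive sentence splitter, case-insensitive matching, and the helper names split_sentences, sentence_subject and matching_categories.

```python
import re

# Category patterns copied from the list above.
CATEGORY_PATTERNS = {
    "Available": r"(is|are)\s(available|archived)|all (data|code) (are|is )?(fully )?(available|included)",
    "Publicly available": r"(are|were|is) collected|\spublic(ly)?\s|open.?source|available.?online|public source",
    "Within manuscript": r"supplement|(with)?in\s([a-zA-Z]*\s){0,2}(paper|main text|article|manuscript)",
    "Available later": r"\bwill\s",
    "Conditional availability": r"request|upon|author|on demand be(\b.\b)?available|can be made available|with ([a-zA-Z]\s){0,2}author|as necessary",
    "Hyperlink": r"https",
    "Not available": r"\b(no|not|none).*(available|online|share)|remains property",
    "Not applicable": r"not ap.?licable|\bn.?a\b|no\s([a-zA-Z]\s){0,1}(data|code)\s([a-zA-Z]\s){0,1}(use|referred)",
}


def split_sentences(statement):
    """Assumption: split after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", statement)
    return [p.strip() for p in parts if p.strip()]


def sentence_subject(sentence):
    """Return 'data', 'code', 'both' or None based on which words are present."""
    has_data = re.search(r"\bdata\b", sentence, re.IGNORECASE) is not None
    has_code = re.search(r"\bcode\b", sentence, re.IGNORECASE) is not None
    if has_data and has_code:
        return "both"
    if has_data:
        return "data"
    if has_code:
        return "code"
    return None


def matching_categories(sentence):
    """Return every category whose pattern matches the sentence."""
    return [
        name
        for name, pattern in CATEGORY_PATTERNS.items()
        if re.search(pattern, sentence, re.IGNORECASE)
    ]


# Hypothetical data statement, for illustration only.
statement = "Data are available on request. The code will be released after publication."
tagged = [
    (sentence_subject(s), matching_categories(s))
    for s in split_sentences(statement)
]
```

A sentence can match several categories at once, which is why matching_categories returns a list rather than a single label.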

We then ranked those categories in the following hierarchy (highest priority first) to handle statements that are assigned to more than one category:

Conditional availability
Not available
Available later
Publicly available
Not applicable
Within manuscript
Available

We developed this hierarchy to disambiguate statements that belong to more than one category. For instance, the statement “data are available on request” would be categorised as both available and conditional availability based on the previous regular expressions, but the presence of the word request indicates that this statement should be categorised as conditional availability.

Similarly, although the statement “the data used are available in the text” would be classified under both available and within manuscript, the latter has priority.
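
One way to read the hierarchy is as a priority lookup: among the categories a statement matches, keep the one that appears first in the list. The sketch below assumes that reading. The name resolve_category is hypothetical, and returning None for categories outside the hierarchy (such as Hyperlink, which the list does not rank) is an assumption, not documented behaviour.

```python
# Hierarchy copied from the list above, highest priority first.
HIERARCHY = [
    "Conditional availability",
    "Not available",
    "Available later",
    "Publicly available",
    "Not applicable",
    "Within manuscript",
    "Available",
]


def resolve_category(matched):
    """Return the highest-priority category among those matched.

    Assumption: categories that are not part of the hierarchy are ignored,
    and None is returned when nothing in the hierarchy matched.
    """
    matched = set(matched)
    for category in HIERARCHY:
        if category in matched:
            return category
    return None


# "data are available on request" matches both categories below.
on_request = resolve_category(["Available", "Conditional availability"])

# A statement tagged as both available and within manuscript.
in_text = resolve_category(["Available", "Within manuscript"])
```

With these inputs, on_request resolves to "Conditional availability" and in_text to "Within manuscript", mirroring the two examples in the text.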
