jmelot/SoftwareImpactHackathon2023_InstitutionalOSS

Linking Research Software to Research Organizations

This project surfaces a list of possible links from software repositories to ROR IDs.

Criteria for Linking

The criteria for linking software to a research organization are:

  • The software has a creator/contributor who is affiliated with an organization.
  • The software is hosted by an account associated with the organization.
  • The software repository explicitly references the organization (in code, readme, documentation, etc.)

Dataset

Our software-to-ROR links are consolidated using resources/consolidate_links.py and are available in software_to_ror.csv and software_to_ror.json. A piece of software may have multiple entries if our linkage methods returned multiple ROR ids for it. We have currently found 1,442,022 software-to-ROR links (96,430 of moderate quality and 9,235 of highest quality) across 21,920 unique organizations and 176,619 unique GitHub repositories. Some of these are sure to be spurious links, and we're working on methods to identify and remove them.

The JSON output maps each ROR id to its linked software, and each piece of software to its GitHub slug (if available) and extraction methods. The CSV contains the same information, structured like this:

| Field name | Description | Field type |
| --- | --- | --- |
| software_name | Human-generated name of software (e.g. Tensorflow) | text |
| github_slug | Github owner and repo name, e.g. apache/airflow | text |
| ror_id | ROR id, in url form, e.g. https://ror.org/02qenvm24 | text |
| extraction_methods | semicolon-separated list of methods used to extract the software-ror pair, from the set described below | text |
| quality | 1 if high-quality, 0.5 if medium-quality, 0 if low-quality | float |
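
As a rough illustration of working with these fields, the sketch below parses rows into records, splits extraction_methods on semicolons, and filters by quality. It assumes the CSV is comma-delimited with a header row using the field names above, which is not confirmed here. The Link class, the parse_links and at_least helpers, and the sample rows are all hypothetical and are not taken from the dataset.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Link:
    """One software-to-ROR row (hypothetical container, not part of the repo)."""
    software_name: str
    github_slug: str          # may be empty when no slug is available
    ror_id: str               # ROR id in URL form
    extraction_methods: list  # method names split out of the semicolon-separated field
    quality: float            # 1, 0.5, or 0


def parse_links(handle):
    """Parse rows with the documented columns; assumes a header row is present."""
    links = []
    for row in csv.DictReader(handle):
        methods = [m.strip() for m in row["extraction_methods"].split(";") if m.strip()]
        links.append(Link(
            software_name=row["software_name"],
            github_slug=row["github_slug"],
            ror_id=row["ror_id"],
            extraction_methods=methods,
            quality=float(row["quality"]),
        ))
    return links


def at_least(links, min_quality):
    """Keep only links whose quality score is at least min_quality."""
    return [link for link in links if link.quality >= min_quality]


# Made-up sample rows for illustration only; these are NOT real dataset entries.
SAMPLE = """software_name,github_slug,ror_id,extraction_methods,quality
ExampleTool,example-org/example-tool,https://ror.org/02qenvm24,url_matches;ner_text_extraction,1
OtherTool,,https://ror.org/02qenvm24,by_name,0.5
ThirdTool,someone/third-tool,https://ror.org/02qenvm24,openaire_czi,0
"""

links = parse_links(io.StringIO(SAMPLE))
high_quality = at_least(links, 1.0)
```

To run the same parsing against the real file, one would pass an open file handle for software_to_ror.csv (opened with newline="" as the csv module recommends) to parse_links instead of the in-memory sample.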

The extraction/matching methods we currently use - all of which are imperfect - are:

  • czi_affiliation_links - links software to the ROR ids of author affiliations of the paper most likely describing that software, for the CZI software mentions dataset. Code available in resources/czi_affiliation_links_pipeline.
  • joss_affiliation_links - links from software described in a JOSS paper to the ROR ids of the author affiliations of that paper. Code available in resources/joss_affiliations.
  • ner_text_extraction - links from GitHub READMEs to ROR ids of affiliations extracted from those READMEs using named entity recognition (NER). Code available in resources/ner_text_extraction_pipeline.
  • url_matches - links from GitHub repo owner names, which may be individual user accounts or organization accounts, to ROR ids based on URL match. Code available in resources/github_org_url_matching_pipeline.
  • by_name - links from affiliation names, which a human associated with software in the SciCrunch dataset (see scicrunch_working_file_*.csv), to ROR ids matched using the ROR API. Code available in resources/scicrunch.
  • human_curated - ROR ids that a human identified as being affiliated with a piece of software, from the SciCrunch dataset. Data available in resources/scicrunch.
  • openaire_czi - links between software and ROR ids of author affiliations of papers mentioning that software from the CZI software mentions dataset, joined using OpenAIRE. Code available in resources/openaire_x_czi_pipeline.
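
Because a single row can list several of the methods above, it can be useful to tally how many links each method contributed. The sketch below counts method names across extraction_methods values and flags any name outside the seven listed above. The count_methods helper and the sample values are hypothetical, and the consolidation script may handle this differently.

```python
from collections import Counter

# The seven method names listed in this README.
KNOWN_METHODS = {
    "czi_affiliation_links",
    "joss_affiliation_links",
    "ner_text_extraction",
    "url_matches",
    "by_name",
    "human_curated",
    "openaire_czi",
}


def count_methods(values):
    """Tally known method names across semicolon-separated strings.

    Returns (counts, unknown): a Counter of known methods and a set of
    names that do not appear in KNOWN_METHODS.
    """
    counts = Counter()
    unknown = set()
    for value in values:
        for method in (m.strip() for m in value.split(";")):
            if not method:
                continue  # skip empty fragments such as a trailing semicolon
            if method in KNOWN_METHODS:
                counts[method] += 1
            else:
                unknown.add(method)
    return counts, unknown


# Made-up extraction_methods values for illustration only.
sample = ["url_matches;ner_text_extraction", "by_name", "url_matches;", "typo_method"]
counts, unknown = count_methods(sample)
```

Collecting unknown names instead of raising an error keeps the tally usable even if the dataset gains a method that this list does not yet cover.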

Contributing

We are eager to work with new contributors! Please review our contribution guidelines for more information.

About this project

This repository was developed as part of the Mapping the Impact of Research Software in Science hackathon hosted by the Chan Zuckerberg Initiative (CZI). By participating in this hackathon, owners of this repository acknowledge the following:

  1. The code for this project is hosted by the project contributors in a repository created from a template generated by CZI. The purpose of this template is to help ensure that repositories adhere to the hackathon’s project naming conventions and licensing recommendations. CZI does not claim any ownership or intellectual property on the outputs of the hackathon. This repository allows the contributing teams to maintain ownership of code after the project, and indicates that the code produced is not a CZI product, and CZI does not assume responsibility for assuring the legality, usability, safety, or security of the code produced.
  2. This project is published under an MIT license.

Code of Conduct

Contributions to this project are subject to CZI’s Contributor Covenant code of conduct. By participating, contributors are expected to uphold this code of conduct.

Reporting Security Issues

If you believe you have found a security issue, please disclose it responsibly by contacting the repository owner via the ‘security’ tab above.
