Home

Stian Soiland-Reyes edited this page Mar 16, 2016 · 36 revisions

This is a wiki adaptation of "10 Simple rules for design, provision, and reuse of identifiers for web-based life science data" by Julie A. McMurry et al. (Why?) Feel free to contribute!

Citation

Original paper: doi:10.5281/zenodo.31765

Permalink URI (This wiki): https://w3id.org/id-rules/

10 Simple rules for design, provision, and reuse of identifiers for web-based life science data

  1. Use established identifiers
  2. Design identifiers for use by others
  3. Help local identifiers travel well: document Prefix and Namespace
  4. Opt for simple durable web resolution
  5. Avoid embedding meaning
  6. Make URIs clear and findable
  7. Implement a version management policy
  8. Do not re-assign or delete identifiers
  9. Document the identifiers you issue and use
  10. Reference responsibly

Introduction

Life science data is evolving to be ever larger, more distributed, and more natively web-based. However, our collective handling of identifiers has lagged behind these advances. Diverse identifier issues (for instance “link rot” and “content drift”) have hampered our ability to integrate data and derive new knowledge from it. Optimizing web-based identifiers is harder than it appears and no single scheme is perfect: identifiers are reused in different ways, for different reasons, by different consumers. Moreover, digital entities (e.g., files), physical entities (e.g., biosamples), and descriptive entities (e.g., ‘mitosis’) have different requirements for identifiers. Nevertheless, there is substantial room for improvement throughout the life sciences, and several other groups have been converging on identifier standards that are broadly applicable (Force11 Data Citation Principles, data citation principles and practices, Resource Identification Initiative, FAIR data principles).

Building on these efforts and drawing on our experience, we focus on the use case of large-scale data integration: we outline the identifier qualities and best practices that we feel are most important in this context. Specifically, we propose actions that providers of online databases (repositories, registries, and knowledgebases) should take when designing new identifiers or maintaining existing ones (Rules 1-9).

In Rule 10, we conclude with guidance to data integrators and redistributors on how best to reference identifiers from these diverse sources. This article may also be useful to data generators and end users as it offers insight into the issues associated with data provision in a web environment. We call upon data providers to take a long-term view of their entities’ scope and lifecycle, and to consider existing identifier platforms and services.

Conclusion

Better identifier design, provisioning, documentation, and referencing can address many of the identifier problems encountered in the life science data cycle. We recognize that improvements to the quality, diversity, and uptake of identifier tooling would lower barriers to adoption of these rules. We will undertake to address these gaps in the relevant initiatives (Text S1). We also recognize the need for formal software-engineering specifications of identifier formats and/or alignment between existing specifications, and we hope that this paper can catalyze such efforts.