Dataset Management

jhpoelen edited this page Sep 20, 2017 · 17 revisions

This page discusses a way to share and publish your species interaction datasets. This method gives you full control of the data while making it available through GloBI.

For the impatient

Poke around existing data repositories on the status page. Then, make your species interaction data available by following the steps outlined on the how-to-contribute page.

Overview

GloBI integrates with GitHub and Zenodo to discover species interaction datasets. GitHub is sort of a Google Docs for software and data. Zenodo is an open data publication platform that archives digital artifacts (e.g. papers, figures, datasets, software) so that they can be cited in (academic) publications.

Every day or so, GloBI queries GitHub and looks for repositories that mention globalbioticinteractions in their README.md. Then, for each of these repositories, it attempts to locate and parse a file named globi.json. This file contains instructions for GloBI on how to read the dataset. You'll notice that some repositories do not contain data, but point to existing, published datasets. For these "meta" datasets, the globi.json file points to resources that are hosted elsewhere, like on figshare or esapubs.
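The discovery step described above can be pictured with a small sketch. This is not GloBI's actual code: the Repo class, the sample repository names and the case-insensitive match are all assumptions made for illustration, and no real GitHub API is called.

```python
from dataclasses import dataclass, field

MARKER = "globalbioticinteractions"  # keyword looked for in README.md
CONFIG_FILE = "globi.json"           # file with instructions on how to read the dataset


@dataclass
class Repo:
    """Hypothetical in-memory stand-in for a GitHub repository."""
    name: str
    readme: str
    files: list = field(default_factory=list)


def discover_datasets(repos):
    """Return names of repositories that mention the marker and contain globi.json."""
    found = []
    for repo in repos:
        # Assumption: the README match is case-insensitive.
        if MARKER not in repo.readme.lower():
            continue  # README does not mention GloBI
        if CONFIG_FILE not in repo.files:
            continue  # no instructions on how to read the data
        found.append(repo.name)
    return found


# Made-up repositories, for illustration only.
repos = [
    Repo("alice/bee-flowers", "Data for globalbioticinteractions.",
         ["globi.json", "interactions.tsv"]),
    Repo("bob/notes", "Personal notes.", ["notes.txt"]),
    Repo("carol/meta-dataset", "Indexed by GlobalBioticInteractions.",
         ["README.md", "globi.json"]),
    Repo("dave/unfinished", "globalbioticinteractions dataset, work in progress.",
         ["data.csv"]),
]

print(discover_datasets(repos))  # ['alice/bee-flowers', 'carol/meta-dataset']
```

Note that carol/meta-dataset is picked up even though it holds no data files: a "meta" dataset only needs a globi.json that points elsewhere.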

Aside from scanning GitHub, GloBI also queries Zenodo's Global Biotic Interactions community. This community contains published species interaction datasets. If a dataset is published from GitHub to Zenodo, the published version is used instead of the "working copy" in the originating GitHub repository.
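The precedence rule above (published Zenodo version first, GitHub working copy otherwise) might be sketched as follows. The function name, the mapping from repository to DOI and the placeholder DOI are hypothetical; how GloBI actually links a Zenodo record to its GitHub repository is not described on this page.

```python
def select_source(github_repo, zenodo_publications):
    """Pick which copy of a dataset to use.

    zenodo_publications is a hypothetical mapping from a GitHub repository
    name to the DOI of its publication in the Zenodo community.
    """
    doi = zenodo_publications.get(github_repo)
    if doi is not None:
        return ("zenodo", doi)       # the published version wins
    return ("github", github_repo)   # fall back to the working copy


# Placeholder DOI, for illustration only.
published = {"alice/bee-flowers": "10.5281/zenodo.0000000"}

print(select_source("alice/bee-flowers", published))   # ('zenodo', '10.5281/zenodo.0000000')
print(select_source("carol/meta-dataset", published))  # ('github', 'carol/meta-dataset')
```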

Quality Control

If you use the GitHub/Zenodo tools as your method to manage and publish your dataset, you'll be able to leverage some tools to help you monitor the integrity of your dataset.

HINT: Quality control measures have been enabled for https://github.com/globalbioticinteractions/template-dataset. Feel free to copy/paste/imitate from there.

readability

First, you can automatically run a readability test by connecting your repository to travis-ci.org after adding a specific GloBI Travis configuration file, .travis.yml, to your repository. Once this is set up, each time you make a change to your dataset, or trigger a build in another way, Travis will run a GloBI validation tool to check whether the data can be ingested. The tool prints warnings and errors to help you detect data errors or other bugs. Travis sends notifications and offers badges like Build Status to help you monitor the health of your dataset.

issues

To err is human. If an issue with a data source cannot easily be fixed right away, it might make sense to document it, so that you don't forget what the issue was and others can learn about known issues (and perhaps even chime in to help). One way to manage issues is to use GitHub: each data source is associated with one GitHub repository and each repository has an issue management tool. For an example, see open issues.

searchability

Once you have added your data to a GitHub repository, GloBI takes a day or so to make your data accessible through GloBI. You can use a GloBI badge as a way to keep track of the searchability of your dataset.

citability

While GitHub is suited to managing source code and data, it is not specifically built for publishing citable products or artifacts. However, GitHub and Zenodo have collaborated to create a way to publish releases of a GitHub repository to Zenodo. See https://guides.github.com/activities/citable-code/ for more information. After publishing your repository to Zenodo, you can use their spiffy badges like DOI to let others know about your publication. If you submit your Zenodo publication to Zenodo's Global Biotic Interactions community, then GloBI will use your published version instead of your GitHub repository.

stats

To get a glimpse of the integrated datasets, some basic stats are included. These include: the number of interaction records, the number of distinct names of the interacting entities and the percentage of names that resolved to an external name source like https://itis.gov . For more information about name matching, please see the Taxonomy Matching page. These stats are shown in short-hand notation like 123.2k / 320 / 98%, where the numbers represent, in order, the number of interactions, the number of names and the name match percentage.
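To make the short-hand concrete, here is a minimal sketch that unpacks a string like 123.2k / 320 / 98% into plain numbers. The function names are made up for this example, and support for an "M" suffix is an assumption: this page only shows "k".

```python
# Suffix multipliers. Only "k" appears on this page; "M" is an assumption.
SUFFIXES = {"k": 1_000, "M": 1_000_000}


def parse_count(text):
    """Turn '123.2k' into 123200 and '320' into 320."""
    text = text.strip()
    if text and text[-1] in SUFFIXES:
        # round() guards against floating point noise in the multiplication
        return int(round(float(text[:-1]) * SUFFIXES[text[-1]]))
    return int(text)


def parse_stats(shorthand):
    """Split 'interactions / names / match%' into a dictionary of numbers."""
    interactions, names, matched = (part.strip() for part in shorthand.split("/"))
    return {
        "interactions": parse_count(interactions),
        "names": parse_count(names),
        "name_match_percent": int(matched.rstrip("%")),
    }


print(parse_stats("123.2k / 320 / 98%"))
# {'interactions': 123200, 'names': 320, 'name_match_percent': 98}
```

Keep in mind that the short-hand is rounded, so 123.2k stands for roughly 123,200 interaction records, not an exact count.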

Conclusion

By reusing existing, openly available, infrastructures (e.g. GitHub, Zenodo, Travis, GloBI), you can manage and publish your openly accessible dataset and make it searchable through GloBI without breaking the bank.

If you are interested in seeing an example of a GitHub data repository with all the bells and whistles, please have a look at the template-dataset or poke around other existing datasets on the status page.

Please do holler if you have any questions about this: see the contact info at doi:10.1016/j.ecoinf.2014.08.005 or open an issue.
