R package to access dataset registry #109

Open
jhpoelen opened this issue Apr 7, 2020 · 16 comments
Labels
help wanted Extra attention is needed

Comments

@jhpoelen
Member

jhpoelen commented Apr 7, 2020

In discussion with @rachaelgallagher @jmadin @bmaitner, the creation of an R package came up.

Please share your thoughts on what should be included in the first basic version of the R package and who is interested in making the OTN dataset registry accessible programmatically . . .

@iimog
Member

iimog commented Apr 11, 2020

I'd be happy to help with implementation and testing. Making the OTN dataset registry accessible programmatically would be valuable.

@jhpoelen
Member Author

@iimog thanks for offering your help.

One idea I had to make it easier to access the OTN registry programmatically was to render the OTN dataset registry as a JSON array (and TSV/CSV perhaps?) using the existing Jekyll framework (see e.g., https://github.com/snaptortoise/jekyll-json-feeds/blob/master/feed.links.json). Then, we can make this available via something like https://opentraits.org/datasets.json. If we feel ambitious, we can even try to render commonly used EML files from our custom dataset schema.

The R package would then take the feed and munch on it to make it available in appropriate R data slang (e.g., data frames, vectors).
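
A rough sketch of that munching step, assuming the jsonlite package is installed. The inline `feed` string stands in for the real feed, and its field names (`name`, `url`) are assumptions for illustration, not the actual registry schema.

```r
library(jsonlite)

# Stand-in for the feed body; the field names are assumptions, not the real schema.
feed <- '[
  {"name": "Example Trait Set A", "url": "https://example.org/a.csv"},
  {"name": "Example Trait Set B", "url": "https://example.org/b.csv"}
]'

# fromJSON() simplifies an array of same-shaped objects into a data frame.
datasets <- fromJSON(feed)

# Inspect the resulting columns and pull one out as a plain vector.
str(datasets)
dataset_names <- datasets$name
```

Since `fromJSON()` also accepts a URL, pointing it at the published feed address should work the same way once the feed exists.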

@iimog curious to hear your thoughts!

@jhpoelen added the "help wanted" label Apr 13, 2020
@fdschneider
Contributor

I think I could help with the munching.

For rendering the data into standardised data.frames, the traitdataform package could be useful. It applies the ETS terms to an input dataset and returns a long-table output, while maintaining metadata elements in the attributes of the R object.

The feed would need to provide input recipes for all trait datasets, i.e.

  1. a URL of the source data (or rather a script that reads them successfully into R, as those will be provided in all kinds of formats),
  2. a vector of the exact trait names (partly available in Open Traits Registry, but in the future this could be a thesaurus table that also provides an unambiguous identifier from a public trait ontology),
  3. some other columns essential for interpretation of the input (species names, units, etc), and
  4. a valid bibliographic citation for the dataset.

Those can be parsed into a script such as this for the Arthropod Traits Set.

Then applying traitdataform::standardize() to these objects would standardize them into ETS terms and harmonize taxa to GBIF terminology. Afterwards, one can easily merge data of different origin into one table.
A list of all original bibliographic references will also be maintained in the attributes.
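
The four recipe ingredients listed above could be bundled roughly as follows. This is a minimal base R sketch, not the traitdataform API: the `recipe` fields, the `apply_recipe()` helper, and the toy values are all made up for illustration.

```r
# Hypothetical recipe for one dataset; every field name here is an assumption.
recipe <- list(
  # Stand-in for a script that reads the source data from its URL.
  read = function() {
    data.frame(
      species = c("Apis mellifera", "Bombus terrestris"),
      body_length = c(12.5, 17.0),  # made-up values
      stringsAsFactors = FALSE
    )
  },
  traits = "body_length",                 # exact trait column names
  taxa = "species",                       # column holding the species names
  units = c(body_length = "mm"),          # unit per trait
  citation = "Placeholder citation for the example dataset"
)

# Read the raw data and keep the recipe metadata in the object's attributes.
apply_recipe <- function(recipe) {
  raw <- recipe$read()
  absent <- setdiff(c(recipe$taxa, recipe$traits), names(raw))
  if (length(absent) > 0) {
    stop("columns named in the recipe are missing: ",
         paste(absent, collapse = ", "))
  }
  attr(raw, "traits") <- recipe$traits
  attr(raw, "taxa") <- recipe$taxa
  attr(raw, "units") <- recipe$units
  attr(raw, "citation") <- recipe$citation
  raw
}

dataset <- apply_recipe(recipe)
attributes(dataset)[c("traits", "taxa", "units", "citation")]
```

Keeping the reader as a function inside the recipe reflects the point that source data arrive in all kinds of formats, so each dataset may need its own reading script.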

@iimog
Member

iimog commented Apr 14, 2020

I like these suggestions. Making the information from the registry available as json is a good first step. Using jekyll to do this task automatically makes a lot of sense. Munching the data with the traitdataform package allows us to build on existing code rather than starting from scratch - so I'm all for it.

@rachaelgallagher
Contributor

Others who registered some interest via email are: Luca Santini, @willpearse, @hoganhaben, Jerome Mathieu and I've encouraged them to get involved via GitHub instead.

@willpearse
Contributor

This is super cool; thanks for the ping Rachael. I would love to help with this! My two quick thoughts are:

For making the registry accessible @iimog , I think your idea of working directly with what we've got now from Jekyll is a Very Good Idea. That will mean we aren't doing things twice, which is going to make everyone's lives easier, even if the initial set-up is hard.

For loading actual trait datasets... I'm perhaps a bit biased, but given my experiences with doing something like this with MADtraits (https://github.com/willpearse/MADtraits) I think focusing on only those datasets that use traitdataform is the best way to go. It's going to be a lot of work to do it otherwise (indeed, it was for MADtraits) and I think that this would really encourage people to use traitdataform. My hope is that, if everyone starts using it, then frankly I won't have to write any more MADtraits code and that would be fantastic for everyone (I get to spend more time with my daughter, and the field gets a fantastic resource :D)

@hoganhaben
Contributor

Yes, thank you @rachaelgallagher. I am happy to help with coding. I like @fdschneider's idea of using traitdataform::standardize() to standardize the datasets. Please let me know how I can help.

@jhpoelen
Member Author

Great to see excitement about making the OTN datasets easy to work with in R. Obviously there's a lot of work to do to make this happen (translation scripts, metadata improvements, OTN R package repository, picking a package name, etc) . . . but to get the party started, I just added the official(TM) OTN dataset feed via https://opentraits.org/datasets.json .

@willpearse
Contributor

@rachaelgallagher can I make an R package called 'ROpenTraits'? If so, I'll get @jhpoelen 's JSON into an R package and we will have started "something".

Apologies for asking before doing something so trivial, but as there are only two repos on the GitHub account right now I didn't want to do anything naughty! :D

@rachaelgallagher
Contributor

I think that's fine and can't see any issues @willpearse (though others may flag something!)

@willpearse
Contributor

willpearse commented Apr 23, 2020

It's alive! It's alliiiiiiiive!

Link to quick feature requests thread: open-traits-network/ROpenTraits#2

Get started with the package (and, simultaneously, learn everything you need to know about the package :p):

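# devtools provides install_github() for installing straight from GitHub
library(devtools)
install_github("open-traits-network/ROpenTraits")
library(ROpenTraits)
# load the OTN dataset registry and peek at the first entries
datasets <- rotn_datasets()
head(datasets)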

Could someone help me set up Travis?

@willpearse
Contributor

(I appreciate that this is a trivial package, and so thank you all for doing the hard work so I could just swoop in!)

@fdschneider
Contributor

Great! We have a package. The feature request thread is a very good idea, as it prompted me to think about the intention of the package.

A question that comes to my mind is: what is the relationship between ROpenTraits, MADtraits (which also makes lots of public trait data available computationally, right?) and traitdataform? Are the packages competing/duplicating work because many of the OpenTraits datasets are already in MADtraits (and harmonized)? Could the download and harmonization steps developed for MADtraits be useful for ROpenTraits? Or should we exclude any harmonisation from ROpenTraits and just return the raw data as provided by the authors? Then users (including the maintainers of MAD) must develop harmonization themselves (although traitdataform can help with that).

My take on this would be:

As a primary goal, ROpenTraits should provide all the registered data in their raw form as an R object (data.frame), automated through the JSON feed.

A secondary goal could be to provide recipes for harmonization with traitdataform. This might be facilitated by including the information required into the Registry table and json feed. The package would return data.frame objects with the required information stored into the object attributes. The user can then just use the data.frame as provided by the author, or apply standardize() to get ETS-conform data.
As far as I see, this would massively simplify the data integration on the MADtraits side and utilization of the Open Traits Registry for data synthesis. This could start with a minimal set (trait names, column for taxa, bibliographic citation), and in future development improve the layers and quality of information provided in the attributes (e.g. providing structured EML metadata, ontologies of traits, unifying units and factor levels, etc.).
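
To illustrate why that minimal set in the attributes would already be useful, here is a rough base R sketch of a generic wide-to-long conversion driven only by those attributes. It is not what traitdataform's standardize() does internally; the attribute names (`taxa`, `traits`, `citation`), the `to_long()` helper, and the toy values are assumptions.

```r
# Toy wide table as an author might provide it (all values made up),
# with the minimal metadata stored as attributes.
wide <- data.frame(
  species = c("Species a", "Species b"),
  body_length = c(1.5, 2.0),
  body_mass = c(10, 20),
  stringsAsFactors = FALSE
)
attr(wide, "taxa") <- "species"
attr(wide, "traits") <- c("body_length", "body_mass")
attr(wide, "citation") <- "Placeholder citation"

# Stack one block of rows per trait, using only the attributes to find columns.
to_long <- function(df) {
  taxa <- attr(df, "taxa")
  traits <- attr(df, "traits")
  pieces <- lapply(traits, function(tr) {
    data.frame(
      taxon = df[[taxa]],
      trait = tr,
      value = df[[tr]],
      stringsAsFactors = FALSE
    )
  })
  long <- do.call(rbind, pieces)
  # Carry the citation along so it survives the reshaping.
  attr(long, "citation") <- attr(df, "citation")
  long
}

long <- to_long(wide)
long
```

Because `to_long()` never hard-codes a column name, the same function would work for any dataset whose attributes are filled in, which is the point of shipping that information with the data.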

@jhpoelen
Member Author

Very nice to see the ideas around the R package floating around.

Some minor suggestions / comments:

  1. suggest to use lower case for repository (ROpenTraits -> ropentraits)
  2. I see an immediate benefit of the ropentraits package to be reviewing the existing dataset registries for things like: broken links, completeness, valid taxonomic names, check localities
  3. in line with 2. - enabling travis to run periodic automated tests would help us to keep the registry clean and . . . allow quality-controlled changes if/when we decide to make metadata changes
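
A minimal sketch of the kind of registry check that an automated test could run, in base R. The column names (`name`, `url`, `citation`) and the `check_registry()` helper are hypothetical, and the sketch only tests completeness and URL shape offline; detecting actually broken links would need HTTP requests, which are left out here.

```r
# Flag registry rows with blank required fields or URLs that do not look like http(s).
check_registry <- function(datasets, required = c("name", "url", "citation")) {
  absent <- setdiff(required, names(datasets))
  if (length(absent) > 0) {
    stop("registry is missing columns: ", paste(absent, collapse = ", "))
  }
  # TRUE where a value is NA or only whitespace.
  blank <- function(x) is.na(x) | trimws(as.character(x)) == ""
  # A row is incomplete if any required field is blank.
  incomplete <- Reduce(`|`, lapply(datasets[required], blank))
  # NA or malformed URLs both count as bad.
  bad_url <- !grepl("^https?://[^[:space:]]+$", as.character(datasets$url))
  data.frame(row = seq_len(nrow(datasets)), incomplete = incomplete, bad_url = bad_url)
}

# Toy registry with one clean entry and two problem entries (made-up values).
registry <- data.frame(
  name = c("A", "B", "C"),
  url = c("https://example.org/a.csv", "example.org/b.csv", "https://example.org/c.csv"),
  citation = c("Cite A", "Cite B", ""),
  stringsAsFactors = FALSE
)

report <- check_registry(registry)
problems <- report[report$incomplete | report$bad_url, ]
problems
```

A scheduled test could then simply fail whenever `problems` has any rows.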

Once this is in place, I think we are in a great position to do data integrations driven by individual research interests.

@jhpoelen
Member Author

Also . . . there's some issue backlog that we might want to look at first . . . they contain a bunch of legacy dataset registrations that have not been added yet.

@willpearse
Contributor

@fdschneider Thanks for bringing up overlaps et al. because these are important things. My understanding was the main idea of OTN was to bring people together and get people sharing data; thus I feel like your primary and secondary goals are more secondary and tertiary, if that makes sense? Most importantly, though, even with those goals I don't think MADtraits and OTN need to compete.

To my eyes, the main value MADtraits adds is (1) being part of the MADworld of other packages (e.g., MADneon, MADcomm, MADpandemic, etc.), and (2) synthesising existing datasets that have been missed because of their weird formats. The main value of OTN, to me, is making the field adopt consistent standards going forward - and the obvious one there, to me, is traitdataform. Thus I think having (R)OTN focusing on getting new datasets to be compliant with traitdataform, and then having all of those loaded in quickly and easily from within the R package, is a great idea.

This would allow MADtraits to focus on finding other, older datasets that 'fall through the cracks' and aren't compliant, and (R)OTN can focus on getting people to properly format all their new data. Eventually, MADtraits will have no new data to load in because it's all in (R)OTN and that will be fine by me! Perhaps, as we discussed at the meeting in New Orleans and as you mention above, MADtraits might then load some of those data in as well, but honestly I think the idea would be to get everyone so standardised that it doesn't really matter anymore because it's so trivial. If we can get data collectors on-board with OTN et al. then, fingers crossed, life becomes easier for everyone, right?

What do you think @rachaelgallagher ? I realise this has spiralled into a bit of a wider discussion!

6 participants