add trait dataset registry entries #4

jhpoelen · 2019-08-22T16:06:06Z

@caterinap suggested to re-use the existing google spreadsheet to populate the trait dataset registry entries.

rachaelgallagher · 2019-08-22T22:51:46Z

@jmadin created an earlier mock-up version of the registry in March using Ruby on Rails https://afternoon-tor-83256.herokuapp.com/

jmadin · 2019-08-23T01:48:26Z

I'd be happy to make this mock-up registry reflect the Google Sheets fields that @caterinap created if this helps, and to move the Rails project over here. The beauty of these modern web app approaches is that they are built on APIs, so it is easy to pipe information onto maps, into R, directly into HTML text, etc. (e.g., "Currently there are XX registered trait datasets."). My only concern, and someone else mentioned this earlier, is that there may already be existing registries out there, and there is no point re-inventing the wheel. Plus it would take some time to develop properly, although perhaps the barebones version is okay for this stage.

caterinap · 2019-08-23T08:55:06Z

The Google Sheet fields correspond to Table 1 in the paper and were created based on a discussion where all authors were invited to contribute, so maybe it would be worth updating the mock-up registry with those. But first a question: on the website, should we point to the web app or to the Google Form/CSV? If it's possible to add new datasets via the web app then it would be way better than the Google solution! (Right now it requires a log-in.)

jhpoelen · 2019-08-23T14:45:03Z

While I can see a webapp being useful in later stages of the project, I'd suggest keeping things as simple as possible at this stage and making it easy for most people, not just Ruby/web developers, to contribute. Rather than introducing a webserver and a database, I'd suggest sticking with static, version-controlled lists of files (or tables) managed in GitHub (or Google Sheets) for now. These lists can then be rendered in HTML using the same Jekyll templates (also see https://jekyllrb.com/docs/datafiles/) or some JavaScript.

The cost of maintaining a webapp and managing associated data in a Rails app on Heroku should not be underestimated. Also, since there are relatively few folks coding in Rails compared to those who can edit tables or make HTML pages, I expect that the number of folks who can review, suggest, and contribute functionality will be limited.

Instead, I'd favor the approach taken by https://github.com/OBOFoundry/OBOFoundry.github.io in which individual datasets (in this case ontologies) get their own file (see https://github.com/OBOFoundry/OBOFoundry.github.io/tree/master/ontology) and can be managed individually by those who maintain the datasets. In my mind, this promotes a sense of ownership and allows maintenance of dataset info to be delegated across a large group of folks.
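As a rough illustration of the one-file-per-dataset idea, the sketch below reads a folder of small text files, each starting with a front matter header, and collects them into a single list. The folder name `_datasets`, the field names, and the simplified `key: value` parser are assumptions made for this example; a Jekyll site would normally do this step through its own templates rather than through Python.

```python
from pathlib import Path


def parse_front_matter(text):
    """Parse a minimal 'key: value' header delimited by '---' lines.

    This is a simplified stand-in for a real YAML parser and only
    handles flat string fields.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        # Split on the first colon only, so URLs in values stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def load_registry(folder):
    """Return one dict per dataset file, sorted by file name."""
    entries = []
    for path in sorted(Path(folder).glob("*.md")):
        entry = parse_front_matter(path.read_text(encoding="utf-8"))
        entry["file"] = path.name
        entries.append(entry)
    return entries


if __name__ == "__main__":
    # "_datasets" is a hypothetical folder name used for illustration.
    for entry in load_registry("_datasets"):
        print(entry.get("name", "(unnamed)"), "-", entry["file"])
```

Because each dataset lives in its own file, a maintainer can edit one entry without touching the others, which is the delegation benefit described above.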

bmaitner · 2019-08-23T22:07:20Z

@jhpoelen The structure used by the OBOFoundry seems easy enough, and it's easy enough to create a new file in Github by clicking the "create new file" button that we shouldn't have too much trouble getting folks that aren't Github-savvy to contribute. I agree that a webapp would be great, but maybe that's something we build into funding applications?

rachaelgallagher · 2019-08-25T22:47:00Z

Yes Jorrit and Brian - I think that we need to keep it simple for now and apply for funds to employ a developer to support better solutions as we grow.

bmaitner · 2019-08-26T21:45:48Z

Could we also adopt a similar infrastructure for a registry for scientists?

jhpoelen · 2019-08-26T22:07:23Z

+1 totally! @bmaitner would it help to work on some examples?

bmaitner · 2019-08-26T22:48:03Z

@jhpoelen I think so. Perhaps we should start a new issue for that though? It would be good to discuss what information to include, the format, etc. Or perhaps this goes under the OTN member profiles and map issue?

jhpoelen · 2019-08-27T00:32:06Z

A separate issue sounds like a good plan. Should we re-use #3 or create a new one?

bmaitner · 2019-08-27T16:01:13Z

I think #3 is fine

jhpoelen · 2019-08-28T00:38:30Z

I've added placeholders for dataset entries at https://github.com/open-traits-network/open-traits-network.github.io/tree/master/_datasets . Also, I've created a placeholder dataset list page at https://github.com/open-traits-network/open-traits-network.github.io/tree/master/datasets.md which can be reached at https://opentraits.org/datasets .

I hope that others can help:

come up with minimal meta data elements for dataset registration
apply style to dataset pages and list
populate the dataset pages
add instructions on how to maintain dataset registry

bmaitner · 2019-08-28T16:06:57Z

Regarding metadata elements, the current trait registry mockup that @jmadin put together had these elements:

Dataset name
Contributor
Version
Dataset link
Dataset link type
Access mode
Dataset type
Reference
Reference link
Brief description
Terms of use
Terms of use type
Taxonomic group
Number of traits
Number of taxa
Number of observations
Start date
End date
Location
Latitude
Longitude
Taxa
Traits

To this list I think I'd add:
ID
Data standard
Code links for standardizing data (and the associated data standard)

I'd suggest that only a small subset of those fields would be mandatory, though. Perhaps:

ID
Contributor
Dataset name
Dataset link
Taxa
Brief description
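A minimal sketch of how the proposed mandatory subset could be checked when an entry is submitted. The field names are taken from the list above; representing an entry as a plain dictionary and the helper name `missing_required` are assumptions for illustration.

```python
# Mandatory fields as proposed in the comment above.
REQUIRED_FIELDS = [
    "ID",
    "Contributor",
    "Dataset name",
    "Dataset link",
    "Taxa",
    "Brief description",
]


def missing_required(entry):
    """Return the required field names that are absent or blank in `entry`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


if __name__ == "__main__":
    draft = {"ID": "example-id", "Dataset name": "Example dataset"}
    print(missing_required(draft))
```

All other fields stay optional, so a contributor can register a dataset with six values and fill in the rest later.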

jhpoelen · 2019-09-04T15:12:13Z

@bmaitner sounds good to me. Just curious - did you consider using EML-inspired structure https://knb.ecoinformatics.org/external//emlparser/docs/index.html ? What would really help me to provide feedback is a few examples along the lines of #3 . Let me know how I can help.

eflowerproject · 2019-09-06T01:28:50Z

Hi everyone, trying to register a dataset on the Google Form and have a couple of questions about traitList (a required field):

Can trait names contain spaces (and other symbols)? e.g., which of the following is best:
"100. Sex (D1) | 102. Ovary position (D1) | 201. Number of perianth parts (C1)"
"Sex | Ovary position | Number of perianth parts"
"Sex | Ovary_position | Number_of_perianth_parts"
Do we want to maintain internal IDs (100, etc. in my example) or trim them here?
In my own database (PROTEUS), which I used to generate this dataset, I distinguish primary characters (used to record data) from secondary characters (used in analyses): which do we want to register here?
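For comparison, the three variants above can be derived from one another mechanically. The sketch below splits a pipe-separated trait list and optionally trims the internal IDs or replaces spaces with underscores. It assumes that a leading number followed by a period and a trailing code in parentheses such as "(D1)" are internal identifiers; that reading of the example is an assumption, and the function name is hypothetical.

```python
import re

# Leading internal ID such as "100. " and trailing code such as " (D1)".
_LEADING_ID = re.compile(r"^\d+\.\s*")
_TRAILING_CODE = re.compile(r"\s*\([A-Z]\d+\)$")


def split_traits(trait_list, strip_ids=True, underscores=False):
    """Split a '|'-separated trait list into cleaned trait names."""
    names = []
    for raw in trait_list.split("|"):
        name = raw.strip()
        if not name:
            continue  # skip empty segments, e.g. from a trailing '|'
        if strip_ids:
            name = _LEADING_ID.sub("", name)
            name = _TRAILING_CODE.sub("", name)
        if underscores:
            name = name.replace(" ", "_")
        names.append(name)
    return names


if __name__ == "__main__":
    raw = "100. Sex (D1) | 102. Ovary position (D1) | 201. Number of perianth parts (C1)"
    print(split_traits(raw))
    # ['Sex', 'Ovary position', 'Number of perianth parts']
    print(split_traits(raw, underscores=True))
    # ['Sex', 'Ovary_position', 'Number_of_perianth_parts']
```

Since the short forms can be generated from the long one but not the reverse, registering the long form loses the least information.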

I had a look at the mockup (https://afternoon-tor-83256.herokuapp.com/) but could not find an example trait list.

Thanks a lot!

Hervé

P.S.: I am totally new to GitHub and the dataset was part of this paper: https://www.nature.com/articles/ncomms16047

bmaitner · 2019-09-08T17:27:47Z

> @bmaitner sounds good to me. Just curious - did you consider using EML-inspired structure https://knb.ecoinformatics.org/external//emlparser/docs/index.html ? What would really help me to provide feedback is a few examples along the lines of #3 . Let me know how I can help.

I like the idea of using existing standards (and the associated documentation) where possible, but I think we have to consider the trade-off of ease of entering data vs. ease of parsing/searching data. I wonder if relying on especially rigid formatting might discourage users from uploading datasets. However, perhaps some of the less-strict EML fields would be sufficient for our purposes? e.g. generalTaxonomicCoverage, geographicDescription, boundingCoordinates, etc? Possibly with a link to a full EML description?

caterinap · 2019-09-09T10:03:35Z

Just wanted to say that we had a first "discussion" (it was more a collaborative doc) about the fields of the registry. These are the ones reported in Table 1 of the paper. Concerning standards, the level of standardization is currently pretty low, but I based all the fields I could on Darwin Core (e.g. https://dwc.tdwg.org/terms/#decimalLongitude). Happy to change anything you want.

The app is not linked to the google form (they were created independently).
@eflowerproject : if you want to have a look at the datasets you can download them here.
Concerning your question 1, we did not really agree about standards in trait names or other fields, but it's surely something we should do (keeping in mind the trade-off mentioned by @bmaitner). For the moment we just recommend using existing ontologies (e.g. Plant Ontology or Flora Phenotype ontology) when they exist.
Concerning your question 2, you can register both primary and secondary characters and add this info in the usefulClasses field. You would put something like "inferred traits" to indicate your secondary traits.

eflowerproject · 2019-09-09T23:36:29Z

@caterinap thanks for your very helpful reply. Those example records are very useful. I agree we may need to clarify instructions for standardizing trait names further down the line. Unfortunately, most of my traits do not map to existing plant ontologies (this is a bigger issue that I'll have to solve separately).

jmadin · 2019-09-16T20:39:29Z

Hi @caterinap. I was just autogenerating the dataset markdown files for the website from your Google link (above). There doesn't seem to be a field for dataset name. I wanted to name the files with this name, and also display it on dataset webpages. Is this something that you have? Or perhaps should add?
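The autogeneration step might look roughly like the sketch below, which turns each row of a downloaded CSV into a front matter block for a dataset page. This is not the script actually used; the column name `datasetURL` and the function name are hypothetical. Values are quoted with `json.dumps` so that colons in URLs do not confuse a YAML parser.

```python
import csv
import io
import json


def rows_to_front_matter(csv_text):
    """Turn each CSV row into a front matter block, one string per row."""
    pages = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines = ["---"]
        for key, value in row.items():
            # JSON string quoting is also valid YAML double-quoted syntax.
            lines.append(f"{key}: {json.dumps(value)}")
        lines.append("---")
        pages.append("\n".join(lines) + "\n")
    return pages


if __name__ == "__main__":
    # Column names here are placeholders, not the real sheet headers.
    sample = "datasetURL,traitList\nhttps://example.org/a,Sex | Ovary position\n"
    for page in rows_to_front_matter(sample):
        print(page)
```

Each returned string could then be written to its own file, which is where a dataset name (or, failing that, the URL) is needed to choose the file name.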

jmadin · 2019-09-16T20:40:08Z

I'll use the data set URL for now.

jmadin · 2019-09-17T07:27:27Z

Should we change the menu item and page name from "Datasets" to "Registry" (or "Dataset Registry")? We can continue to call instances of this registry a "Dataset". Thoughts?

caterinap · 2019-09-17T11:51:30Z

@jmadin I added a datasetID field (https://docs.google.com/forms/d/e/1FAIpQLSdWL1hMzSGOfSSOGDFhjwipT1a1j9XSLpiDoI0ziTEMywsW7w/viewform?usp=sf_link). Is that OK, or would you prefer a more generic field with a dataset name?

jmadin · 2019-09-17T19:40:39Z

Excellent, thanks @caterinap
I guess we need both an ID and a name. To keep it simple, we could just require a name (e.g., "Coral Trait Database"), which would be used in the list of datasets and as a title for a dataset's page; and then generate the ID automatically for filenames and weblink (e.g., "coral-trait-database.md"). I'd say the dataset name is first priority and required, and you could create an (optional) field for ID as well. Is there something in Darwin Core for name?
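One possible way to generate the ID from the name automatically is sketched below. The function names are hypothetical, and the rule (lowercase, keep ASCII letters and digits, join with hyphens) is an assumption chosen to reproduce the "coral-trait-database.md" example; names with accented characters would need extra handling.

```python
import re


def dataset_slug(name):
    """Lowercase the name and join runs of ASCII letters/digits with hyphens."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    if not words:
        raise ValueError("dataset name contains no letters or digits")
    return "-".join(words)


def dataset_filename(name):
    """File name for a dataset page, derived from the dataset name."""
    return dataset_slug(name) + ".md"


if __name__ == "__main__":
    print(dataset_filename("Coral Trait Database"))  # coral-trait-database.md
```

Two different names can map to the same slug (for example, names that differ only in punctuation), so a real registry would also need a uniqueness check before writing the file.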

caterinap · 2019-09-18T12:17:12Z

Agree! I added datasetName (from Darwin Core http://rs.tdwg.org/dwc/terms/datasetName) and kept the datasetID as optional. We can always autogenerate a second ID specific for the registry.

caterinap · 2019-09-18T13:33:35Z

One question: what should we do when the dataset does not have an official name? Should we set a "standard" (format and content), something like Geography Taxon Author (trait type? year?)? Do we want spaces or underscores?
For example I put something like: "Global Passerine Morphology Ricklefs" for this dataset https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecy.1783
But it became more difficult/funny when having something like "Norfolk Carcinus maenas Pearse" for this one https://figshare.com/articles/Catching_crabs_a_case_study_in_local_scale_English_conservation/979288
Any thoughts?

(I agree that it's very frustrating to fill in datasets without a standard!)

bmaitner · 2019-09-18T16:21:11Z

+1 for a standard, even if it is one that we expect will need revision in the future. And I always prefer underscores. The dataset name is just a unique identifier, right? So it doesn't much matter if it sounds a bit odd. I think the more pressing concern is that it be clear so that folks don't accidentally add the same thing twice. For example, someone might also call this "England Decapod Pearse". So perhaps Author_Year_Geography_Taxon<_letter if more than one database present?>, with the constraints that 1) only the first author is used; 2) the smallest political division/taxonomic rank that encompasses all the records is used. I suggest placing author and year first since it might make it easier for folks to scroll through and see if a dataset has been entered already, since author name is less ambiguous than the geographical or taxonomic fields.
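The proposed Author_Year_Geography_Taxon pattern could be sketched as follows. How the optional letter suffix is assigned is not settled above; this sketch assumes the first entry keeps the bare name and later duplicates receive the first unused letter, starting at "a". The function name and the `existing` argument are hypothetical.

```python
import string


def dataset_name(author, year, geography, taxon, existing=()):
    """Build Author_Year_Geography_Taxon, adding a letter if the name is taken."""

    def clean(part):
        # Collapse internal whitespace to single underscores.
        return "_".join(str(part).split())

    base = "_".join(clean(p) for p in (author, year, geography, taxon))
    if base not in existing:
        return base
    for letter in string.ascii_lowercase:
        candidate = f"{base}_{letter}"
        if candidate not in existing:
            return candidate
    raise ValueError("no free letter suffix left for " + base)


if __name__ == "__main__":
    # "YEAR" is a placeholder; the actual year is not given in this thread.
    print(dataset_name("Pearse", "YEAR", "Norfolk", "Carcinus maenas"))
    # Pearse_YEAR_Norfolk_Carcinus_maenas
```

Putting the first author and year first means that an alphabetical listing groups entries by author, which is the scanning benefit described above.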

jhpoelen · 2019-09-18T17:24:50Z

About dataset IDs: most major registries I know (e.g., iDigBio, GBIF) are pushing for randomly generated UUIDs to identify specific datasets.

Also, from Nelson et al. 2018, https://doi.org/10.1002/aps3.1027:

[...]
Opacity
GUID values can be transparent or opaque (Page, 2009). Transparent values include those that contain human‐decipherable text or human‐meaningful strings. DwC triplets (e.g., Uconn:CONN:CONN00050395) and HTTP URI identifiers (e.g., http://herbarium.bio.fsu.edu/000002561) are examples of transparent identifiers that connote some sense of meaning, such as ownership, to the human user. Opaque identifiers (e.g., a universally unique identifer [UUID], e3ad9bb3‐cb8e‐475c‐aff5‐87f877b56120) contain no apparent human‐decipherable information and are construed strictly as meaningless strings. The lack of apparent meaning underscores their universality and reduces the likelihood that they will be altered or replaced (McMurry et al., 2017).
[...]

Personally, I think that the pragmatism of using transparent identifiers sometimes outweighs the benefits of opaque ones. However, mixing of meta-data and identifiers can be tricky. From my own experience, I've seen that "fixing" typos in transparent ids can lead to funky behavior, especially when considering a lifespan of > 10 years.
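A small sketch of the trade-off, using Python's standard `uuid` module: the opaque identifier is generated once and never changes, while the human-readable name lives in a separate field where typos can be fixed safely. The record layout is an assumption for illustration; `datasetID` and `datasetName` are the field names mentioned earlier in this thread.

```python
import uuid


def new_opaque_id():
    """Random (version 4) UUID as a string; it carries no metadata."""
    return str(uuid.uuid4())


def make_registry_record(dataset_name):
    """Pair an opaque, stable ID with an editable human-readable name."""
    return {"datasetID": new_opaque_id(), "datasetName": dataset_name}


def rename(record, new_name):
    """Fix the name without touching the identifier."""
    return {**record, "datasetName": new_name}


if __name__ == "__main__":
    record = make_registry_record("Coral Triat Database")  # typo on purpose
    fixed = rename(record, "Coral Trait Database")
    # The identifier survives the correction unchanged.
    print(record["datasetID"] == fixed["datasetID"])  # True
```

If the name itself were the identifier, the same typo fix would change the ID and break any links that used the old value.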

caterinap · 2019-09-20T11:44:54Z

Can we have both transparent and opaque identifiers? @bmaitner I like your proposal. What about things like TRY? If we follow the standard we might lose the original name for them. Should we just keep the database name when there is one?

bmaitner · 2019-09-20T16:17:41Z

@caterinap but database name is an existing field, so there shouldn't be an issue with losing it (I think?). However, I think the issue of keeping different versions of a data set linked is one we should think about (especially if the main author changes between versions). It might also be worth considering keeping track of which data sets contain/are contained by other data sets. Do we need fields for these relationships?

jmadin · 2019-09-26T20:43:42Z

@caterinap Is there some way to add the missing dataset names in the Google Form/Sheet that we download? I just re-downloaded the spreadsheet to transfer to the website, but don't want to have to re-enter the missing dataset names each time we do this.

jhpoelen · 2019-10-09T15:39:03Z

I've attempted to capture the current dataset registration process at https://github.com/open-traits-network/open-traits-network.github.io/tree/master/_datasets#readme . Please note that the google form is no longer used or referenced. Remaining google form entries have been copied into #45 .

Thanks for all the discussion and input. Please open a new issue if you have suggestions or ideas.

rachaelgallagher · 2019-10-09T23:36:45Z

Excellent - thanks very much Jorrit


jhpoelen mentioned this issue Aug 23, 2019

pick a framework (jekyll, hugo, none) if needed #5

Closed

jhpoelen pushed a commit that referenced this issue Aug 28, 2019

add placeholders for #3 (member profiles) and #4 (dataset metadata)

bd590c8

jhpoelen mentioned this issue Aug 28, 2019

apply consistent style to opentraits.org pages #11

Closed

jmadin mentioned this issue Sep 17, 2019

Tidy up here and there #22

Merged

jhpoelen closed this as completed Oct 9, 2019
