add trait dataset registry entries #4

Closed
jhpoelen opened this issue Aug 22, 2019 · 32 comments

Comments

@jhpoelen
Member

@caterinap suggested re-using the existing Google spreadsheet to populate the trait dataset registry entries.

@rachaelgallagher
Contributor

@jmadin created an earlier mock-up version of the registry in March using Ruby on Rails https://afternoon-tor-83256.herokuapp.com/

@jmadin
Collaborator

jmadin commented Aug 23, 2019

I'd be happy to make this mock-up registry reflect the Google Sheets fields that @caterinap created if this helps. And move the rails project over here. The beauty of these modern web app approaches is that they are built on APIs, and so easy to pipe information onto maps, into R, directly into html text, etc. (e.g., "Currently there are XX registered trait datasets."). My only concern is, and someone else mentioned this earlier, that there may already be existing registries out there, and no point re-inventing the wheel. Plus it would take some time to develop properly - although, perhaps the barebones is okay for this stage.

@caterinap
Member

The Google Sheet fields correspond to Table 1 in the paper and were created based on a discussion where all authors were invited to contribute, so maybe it would be worth updating the mock-up registry with those. But first a question: on the website, should we point to the web app or to the Google Form/CSV? If it's possible to add new datasets via the web app then it would be way better than the Google solution! (right now it requires a log-in).

@jhpoelen
Member Author

jhpoelen commented Aug 23, 2019

While I can see a webapp being useful in later stages of the project, I'd suggest keeping things as simple as possible at this stage and making it easy for most people, not just Ruby/web developers, to contribute. Rather than introducing a webserver and a database, I'd suggest sticking with static, version-controlled lists of files (or tables) managed in GitHub (or Google Sheets) for now. These lists can then be rendered in HTML using some Jekyll templates (also see https://jekyllrb.com/docs/datafiles/) or some JavaScript.

The cost of maintaining a webapp and managing associated data in a Rails app on Heroku should not be underestimated. Also, since there are relatively few folks coding in Rails compared to those who can edit tables or make HTML pages, I expect that the number of folks who can review, suggest and contribute functionality will be limited.

Instead, I'd favor an approach taken by https://github.com/OBOFoundry/OBOFoundry.github.io in which individual datasets (in this case ontologies) get their own file (see https://github.com/OBOFoundry/OBOFoundry.github.io/tree/master/ontology) and can be managed individually by those who maintain the datasets. In my mind, this promotes a sense of ownership and allows maintenance of dataset info to be delegated across a large group of folks.
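
For illustration only, the one-file-per-dataset idea above could be sketched as follows. This is a minimal sketch, not the OBO Foundry's or this project's actual tooling: the function names, the `title`/`layout` keys, and the `_datasets/<file_stem>.md` layout are assumptions chosen for the example.

```python
import json
from pathlib import Path


def render_dataset_page(metadata: dict, body: str = "") -> str:
    """Return a Jekyll-style page: YAML front matter followed by a body."""
    lines = ["---"]
    for key, value in metadata.items():
        # JSON quoting is also valid YAML for simple scalars and lists.
        lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines) + "\n" + body


def write_dataset_page(directory: Path, file_stem: str, metadata: dict) -> Path:
    """Write one dataset entry to its own file (hypothetical layout)."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{file_stem}.md"
    path.write_text(render_dataset_page(metadata), encoding="utf-8")
    return path
```

Because each entry lives in its own small text file, a dataset maintainer can edit or review a single file without touching anyone else's entry.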

@bmaitner
Contributor

@jhpoelen The structure used by the OBOFoundry seems easy enough, and it's easy enough to create a new file in Github by clicking the "create new file" button that we shouldn't have too much trouble getting folks that aren't Github-savvy to contribute. I agree that a webapp would be great, but maybe that's something we build into funding applications?

@rachaelgallagher
Contributor

rachaelgallagher commented Aug 25, 2019 via email

@bmaitner
Contributor

Could we also adopt a similar infrastructure for a registry for scientists?

@jhpoelen
Member Author

+1 totally! @bmaitner would it help to work on some examples?

@bmaitner
Contributor

bmaitner commented Aug 26, 2019

@jhpoelen I think so. Perhaps we should start a new issue for that though? It would be good to discuss what information to include, format, etc. Or perhaps this goes under the OTN member profiles and map issue?

@jhpoelen
Member Author

A separate issue sounds like a good plan. Should we re-use #3 or create a new one?

@bmaitner
Contributor

I think #3 is fine

@jhpoelen
Member Author

jhpoelen commented Aug 28, 2019

I've added placeholders for dataset entries at https://github.com/open-traits-network/open-traits-network.github.io/tree/master/_datasets . Also, I've created a placeholder dataset list page at https://github.com/open-traits-network/open-traits-network.github.io/tree/master/datasets.md which can be reached at https://opentraits.org/datasets .

I hope that others can help:

  • come up with minimal meta data elements for dataset registration
  • apply style to dataset pages and list
  • populate the dataset pages
  • add instructions on how to maintain dataset registry

@bmaitner
Contributor

Regarding metadata elements, the current trait registry mockup that @jmadin put together had these elements:

Dataset name
Contributor
Version
Dataset link
Dataset link type
Access mode
Dataset type
Reference
Reference link
Brief description
Terms of use
Terms of use type
Taxonomic group
Number of traits
Number of taxa
Number of observations
Start date
End date
Location
Latitude
Longitude
Taxa
Traits

To this list I think I'd add:
ID
Data standard
Code links for standardizing data (and the associated data standard)

I'd suggest that a small subset of those fields would be mandatory, though. Perhaps:

ID
Contributor
Dataset name
Dataset link
Taxa
Brief description
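
As a rough sketch of how the mandatory subset proposed above might be checked, assuming each registry entry is available as a plain dictionary keyed by the field labels listed here (the function name and the dictionary representation are assumptions for the example, not an agreed format):

```python
REQUIRED_FIELDS = (
    "ID",
    "Contributor",
    "Dataset name",
    "Dataset link",
    "Taxa",
    "Brief description",
)


def missing_required_fields(entry: dict) -> list:
    """Return the required fields that are absent or blank in an entry."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = entry.get(field)
        # Treat None, empty strings, and whitespace-only strings as missing.
        if value is None or not str(value).strip():
            missing.append(field)
    return missing
```

An empty result means the entry carries every mandatory field; optional fields are simply ignored by this check.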

@jhpoelen
Member Author

jhpoelen commented Sep 4, 2019

@bmaitner sounds good to me. Just curious - did you consider using EML-inspired structure https://knb.ecoinformatics.org/external//emlparser/docs/index.html ? What would really help me to provide feedback is a few examples along the lines of #3 . Let me know how I can help.

@eflowerproject
Contributor

Hi everyone, trying to register a dataset on the Google Form and have a couple of questions about traitList (a required field):

  1. Can trait names contain spaces (and other symbols)? e.g., which of the following is best:
    "100. Sex (D1) | 102. Ovary position (D1) | 201. Number of perianth parts (C1)"
    "Sex | Ovary position | Number of perianth parts"
    "Sex | Ovary_position | Number_of_perianth_parts"
    Do we want to maintain internal IDs (100, etc. in my example) or trim them here?
  2. In my own database (PROTEUS), which I used to generate this dataset, I distinguish primary characters (used to record data) from secondary characters (used in analyses): which do we want to register here?
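
To make the first question concrete: if the registry ended up preferring plain trait names, the longer form with internal IDs could be reduced to it mechanically. The sketch below assumes the `|` separator and the ID/code patterns seen in the example strings above; it is not an agreed convention, and both function names are made up for illustration.

```python
import re


def split_trait_list(trait_list: str) -> list:
    """Split a '|'-separated trait list into trimmed, non-empty names."""
    return [name.strip() for name in trait_list.split("|") if name.strip()]


def strip_internal_ids(trait_name: str) -> str:
    """Drop a leading numeric ID ('100. ') and a trailing code ('(D1)')."""
    name = re.sub(r"^\d+\.\s*", "", trait_name.strip())
    name = re.sub(r"\s*\([A-Z]\d+\)$", "", name)
    return name
```

Going the other way (recovering the internal IDs from plain names) is not possible, which is one argument for keeping the IDs in the source database even if they are trimmed in the registry.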

I had a look at the mockup (https://afternoon-tor-83256.herokuapp.com/) but could not find an example trait list.

Thanks a lot!

Hervé

P.S.: I am totally new to GitHub and the dataset was part of this paper: https://www.nature.com/articles/ncomms16047

@bmaitner
Contributor

bmaitner commented Sep 8, 2019

> @bmaitner sounds good to me. Just curious - did you consider using EML-inspired structure https://knb.ecoinformatics.org/external//emlparser/docs/index.html ? What would really help me to provide feedback is a few examples along the lines of #3 . Let me know how I can help.

I like the idea of using existing standards (and the associated documentation) where possible, but I think we have to consider the trade-off of ease of entering data vs. ease of parsing/searching data. I wonder if relying on especially rigid formatting might discourage users from uploading datasets. However, perhaps some of the less-strict EML fields would be sufficient for our purposes? e.g. generalTaxonomicCoverage, geographicDescription, boundingCoordinates, etc? Possibly with a link to a full EML description?

@caterinap
Member

Just wanted to say that we had a first "discussion" (it was more a collaborative doc) about the fields of the registry. These are the ones reported in Table 1 of the paper. The level of standardization is currently pretty low, but I based all the fields I could on Darwin Core (e.g. https://dwc.tdwg.org/terms/#decimalLongitude). Happy to change anything you want.

The app is not linked to the google form (they were created independently).
@eflowerproject : if you want to have a look at the datasets you can download them here.
Concerning your question 1, we did not really agree about standards in trait names or other fields, but it's surely something we should do (keeping in mind the trade-off mentioned by @bmaitner). For the moment we just recommend using existing ontologies (e.g. Plant Ontology or Flora Phenotype ontology) when they exist.
Concerning your question 2, you can register both primary and secondary characters and add this info in the usefulClasses field. You would put something like "inferred traits" to indicate your secondary traits.

@eflowerproject
Contributor

@caterinap thanks for your very helpful reply. Those example records are very useful. I agree we may need to clarify instructions for standardizing trait names further down the line. Unfortunately, most of my traits do not map to existing plant ontologies (this is a bigger issue that I'll have to solve separately).

@jmadin
Collaborator

jmadin commented Sep 16, 2019

Hi @caterinap. I was just autogenerating the dataset markdown files for the website from your Google link (above). There doesn't seem to be a field for dataset name. I wanted to name the files with this name, and also display it on dataset webpages. Is this something that you have? Or perhaps should add?

@jmadin
Collaborator

jmadin commented Sep 16, 2019

I'll use the data set URL for now.
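
The fallback described here (use the dataset name when there is one, otherwise the URL) might look roughly like this when reading a downloaded CSV. The column names `datasetName` and `datasetURL` are assumptions for the sketch; the real spreadsheet headers may differ.

```python
import csv
import io


def entry_title(row: dict, name_field: str = "datasetName",
                url_field: str = "datasetURL") -> str:
    """Prefer the dataset name; fall back to the dataset URL if it is blank."""
    name = (row.get(name_field) or "").strip()
    if name:
        return name
    return (row.get(url_field) or "").strip()


def titles_from_csv(csv_text: str) -> list:
    """Return one display title per row of a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [entry_title(row) for row in reader]
```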

@jmadin
Collaborator

jmadin commented Sep 17, 2019

Should we change the menu item and page name from "Datasets" to "Registry" (or "Dataset Registry")? We can continue to call instances of this registry a "Dataset". Thoughts?

@caterinap
Member

@jmadin I added a datasetID field (https://docs.google.com/forms/d/e/1FAIpQLSdWL1hMzSGOfSSOGDFhjwipT1a1j9XSLpiDoI0ziTEMywsW7w/viewform?usp=sf_link). Is that OK, or would you prefer a more generic field with a dataset name?

@jmadin
Collaborator

jmadin commented Sep 17, 2019

Excellent, thanks @caterinap
I guess we need both an ID and a name. To keep it simple, we could just require a name (e.g., "Coral Trait Database"), which would be used in the list of datasets and as a title for a dataset's page; and then generate the ID automatically for filenames and weblink (e.g., "coral-trait-database.md"). I'd say the dataset name is first priority and required, and you could create an (optional) field for ID as well. Is there something in Darwin Core for name?
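
A minimal sketch of the automatic ID generation suggested above, turning a dataset name into a filename-safe slug. The function name and the exact normalization rules (ASCII folding, hyphen separators) are assumptions; the thread only fixes the example mapping from "Coral Trait Database" to "coral-trait-database.md".

```python
import re
import unicodedata


def dataset_slug(name: str) -> str:
    """Turn a human-readable dataset name into a lowercase, hyphenated slug."""
    # Fold accented characters to their closest ASCII equivalents.
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of non-alphanumeric characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a slug from {name!r}")
    return slug
```

With this sketch, `dataset_slug("Coral Trait Database") + ".md"` gives `coral-trait-database.md`. Two different names can still collapse to the same slug, so a uniqueness check would be needed before writing files.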

@caterinap
Member

Agree! I added datasetName (from Darwin Core http://rs.tdwg.org/dwc/terms/datasetName) and kept the datasetID as optional. We can always autogenerate a second ID specific for the registry.

@caterinap
Member

One question: when a dataset does not have an official name, should we set a "standard" (format and content)? Something like Geography Taxon Author (trait type? year?)? Do we want spaces or underscores?
For example, I put something like "Global Passerine Morphology Ricklefs" for this dataset: https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecy.1783
But it gets more difficult (and funnier) with something like "Norfolk Carcinus maenas Pearse" for this one: https://figshare.com/articles/Catching_crabs_a_case_study_in_local_scale_English_conservation/979288
Any thoughts?

(I agree that it's very frustrating to fill in datasets without a standard!)

@bmaitner
Contributor

+1 for a standard, even if it is one that we expect will need revision in the future. And I always prefer underscores. The dataset name is just a unique identifier, right? So it doesn't much matter if it sounds a bit odd. I think the more pressing concern is that it be clear so that folks don't accidentally add the same thing twice. For example, someone might also call this "England Decapod Pearse". So perhaps Author_Year_Geography_Taxon<_letter if more than one database present?>, with the constraints that 1) only the first author is used; 2) the smallest political division/taxonomic rank that encompasses all the records is used. I suggest placing author and year first since it might make it easier for folks to scroll through and see if a dataset has been entered already, since author name is less ambiguous than the geographical or taxonomic fields.
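
The Author_Year_Geography_Taxon pattern proposed above could be assembled along these lines. This is only a sketch: the function name is hypothetical, and the choice to leave the first dataset unsuffixed and start later ones at "_b" is an assumption, since the proposal leaves the letter rule open.

```python
import string


def proposed_dataset_name(first_author: str, year: int, geography: str,
                          taxon: str, existing=frozenset()) -> str:
    """Build Author_Year_Geography_Taxon, adding a letter on collisions."""
    parts = (first_author, str(year), geography, taxon)
    # Replace internal whitespace with underscores inside each part.
    base = "_".join("_".join(part.split()) for part in parts)
    if base not in existing:
        return base
    # Assumption: the first entry keeps the bare name, later ones get _b, _c, ...
    for letter in string.ascii_lowercase[1:]:
        candidate = f"{base}_{letter}"
        if candidate not in existing:
            return candidate
    raise ValueError(f"too many datasets share the name {base}")
```

Passing the set of names already in the registry as `existing` is what guards against the accidental double entry mentioned above.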

@jhpoelen
Member Author

jhpoelen commented Sep 18, 2019

About dataset IDs: most major registries I know (e.g., iDigBio, GBIF) are pushing for using randomly generated UUIDs to identify specific datasets.

Also, from Nelson et al. 2018, https://doi.org/10.1002/aps3.1027:


[...]
Opacity
GUID values can be transparent or opaque (Page, 2009). Transparent values include those that contain human‐decipherable text or human‐meaningful strings. DwC triplets (e.g., Uconn:CONN:CONN00050395) and HTTP URI identifiers (e.g., http://herbarium.bio.fsu.edu/000002561) are examples of transparent identifiers that connote some sense of meaning, such as ownership, to the human user. Opaque identifiers (e.g., a universally unique identifer [UUID], e3ad9bb3‐cb8e‐475c‐aff5‐87f877b56120) contain no apparent human‐decipherable information and are construed strictly as meaningless strings. The lack of apparent meaning underscores their universality and reduces the likelihood that they will be altered or replaced (McMurry et al., 2017).
[...]

Personally, I think that the pragmatism of using transparent identifiers sometimes outweighs the benefits of opaque ones. However, mixing of meta-data and identifiers can be tricky. From my own experience, I've seen that "fixing" typos in transparent ids can lead to funky behavior, especially when considering a lifespan of > 10 years.
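
One way to picture the trade-off above is to keep an opaque identifier separate from the human-readable metadata, so that fixing a typo in a name never changes the identifier. The sketch below uses Python's standard `uuid` module; the `registryID` key and both function names are hypothetical, while `datasetName` follows the field name used earlier in this thread.

```python
import uuid


def new_registry_record(dataset_name: str) -> dict:
    """Create a record with a random (version 4) UUID as its opaque ID."""
    return {
        "registryID": str(uuid.uuid4()),  # opaque, never edited afterwards
        "datasetName": dataset_name,      # transparent, free to be corrected
    }


def rename_dataset(record: dict, new_name: str) -> dict:
    """Return a copy with a corrected name and the same opaque ID."""
    return {**record, "datasetName": new_name}
```

Links and cross-references would then point at the opaque ID, while listings and page titles use the name.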

@caterinap
Member

Can we have both transparent and opaque identifiers? @bmaitner I like your proposal. What about things like TRY? If we follow the standard we might lose the original name for them. Should we just keep the database name when there is one?

@bmaitner
Contributor

@caterinap but database name is an existing field, so there shouldn't be an issue with losing it (I think?). However, I think the issue of keeping different versions of a data set linked is one we should think about (especially if the main author changes between versions). It might also be worth considering keeping track of which data sets contain/are contained by other data sets. Do we need fields for these relationships?

@jmadin
Collaborator

jmadin commented Sep 26, 2019

@caterinap Is there some way to add the missing dataset names in the Google Form/Sheet that we download? I just re-downloaded the spreadsheet to transfer to the website, but don't want to have to re-enter the missing dataset names each time we do this.

@jhpoelen
Member Author

jhpoelen commented Oct 9, 2019

I've attempted to capture the current dataset registration process at https://github.com/open-traits-network/open-traits-network.github.io/tree/master/_datasets#readme . Please note that the google form is no longer used or referenced. Remaining google form entries have been copied into #45 .

Thanks for all the discussion and input. Please open a new issue if you have suggestions or ideas.

@jhpoelen jhpoelen closed this as completed Oct 9, 2019
@rachaelgallagher
Contributor

rachaelgallagher commented Oct 9, 2019 via email
