provide a method to indicate the status of a taxon name #132

Closed
jhpoelen opened this Issue Apr 20, 2015 · 22 comments

Owner

jhpoelen commented Apr 20, 2015

In personal communication, @ahhurlbert suggested providing a method to assign a state to a possibly incorrect or outdated taxon name. This state would indicate that someone has looked at the name (likely after a no-match against external taxonomies) and would report the status of that name.

For instance, when an invalid or outdated name is used in a data source, we'd like to have a way to indicate that the name is indeed invalid or outdated without necessarily having to submit a name correction (see https://github.com/jhpoelen/eol-globi-data/wiki/Taxonomy-Matching#submitting-name-corrections).

Possible states might include: invalid, recently published or misspelled.

Collaborator

jhammock commented Apr 20, 2015

@dimus can you suggest a names reconciliation service for this? I expect it's out there, and probably either in or touching Global Names...

Owner

jhpoelen commented Apr 24, 2015

@ahhurlbert I was thinking of allowing annotation of the state of the name in the data source. For instance, you'd be able to have a field like "taxon status" next to the fields you already provide to describe the subject/object (source/target) taxon occurrence. This way, you can use GloBI to exclude or flag names that have known issues, which avoids having to recheck the same names over and over. What do you think about this?
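
A minimal sketch of what such an annotation could look like on the data provider side, assuming a tab-separated source file with a status column next to the taxon columns. The column names (`source_taxon`, `target_taxon`, `target_taxon_status`), the status terms and the sample rows are made up for illustration; they are not an existing GloBI format.

```python
import csv
import io

# Made-up, tab-separated sample; column names and rows are illustrative only.
SAMPLE = (
    "source_taxon\ttarget_taxon\ttarget_taxon_status\n"
    "bird A\tLumbricus terrestris\tverified\n"
    "bird A\tAvez\tmisspelled\n"
    "bird B\tBombidae\toutdated\n"
)

# Hypothetical status terms that mark a name as having a known issue.
KNOWN_ISSUES = {"invalid", "outdated", "misspelled"}


def split_by_status(text, status_column="target_taxon_status"):
    """Split rows into (clean, flagged) based on the status column."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    clean, flagged = [], []
    for row in reader:
        status = (row.get(status_column) or "").strip().lower()
        if status in KNOWN_ISSUES:
            flagged.append(row)
        else:
            clean.append(row)
    return clean, flagged


clean, flagged = split_by_status(SAMPLE)
print([row["target_taxon"] for row in flagged])
```

Rows with a known-issue status can then be skipped or reported separately, so an already reviewed name does not need to be checked again.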

dimus commented Apr 25, 2015

It sounds like an interesting use case. Can we chat about it on Monday?

ahhurlbert commented Apr 27, 2015

@jhpoelen Sounds possibly feasible but I'm a bit unclear on what the workflow would look like.

  1. Students enter data from an old research paper. Many of the names are obsolete, but they have no idea.
  2. Once in GloBI, names get checked and a list of invalid names is returned.
  3. Then a student goes through that list, finds each occurrence of an offending name in the database, and then assigns it a status. (which sounds potentially cumbersome, especially if certain names occur multiple times)
  4. Depending on the status, some names will cease to be on the flagged list in future checks. (but what happens the next time data are added with one of those invalid names--will we have to flag it again?)

Just trying to picture how this would work and whether it would be as efficient as simply having a separate names database (or relying on a names reconciliation service).

Owner

jhpoelen commented Apr 27, 2015

@dimus - thanks for your message. Please contact me by email to set up a time to talk: jhpoelen at xs4all dot nl.

@ahhurlbert - thanks for sharing the use case. At this point, I can imagine three use cases: transcription mistake, outdated name, and not-sure-what-this-is name. How about something like:

transcription mistake

  1. student makes a mistake in transcribing a name, e.g. "Avez"
  2. GloBI cannot find a match for "Avez"
  3. avian diet database curator requests / receives a name report from GloBI
  4. student / curator reviews the name "Avez" and checks it against the data source
  5. the data source mentions "Aves" instead of "Avez"
  6. the occurrences including "Avez" are corrected in the data source (e.g. AvianDietDatabase.txt)

outdated names
The data source contains a name that is no longer used, and this outdated name is not available through (meta-)taxonomic services such as ITIS or EOL.

  1. name is transcribed correctly by student
  2. GloBI can't find a match
  3. after review of the name list, the student double-checks that the name is the same as in the source
  4. the current name for the taxon is determined and added to a specific taxon correction list (perhaps something like, or actually re-using, the GloBI general taxon correction list). The correction is described as "outdated name" or similar, using a controlled vocabulary of naming terms.
  5. student submits the outdated name to a naming authority (e.g. ITIS) and suggests adding the name to the list of previously valid names.

not-sure-what-this-is name
Similar to the outdated name, except that for this unknown taxon name the correction code is something like "undetermined" or "unknown" and no suggestion is provided. Alternatively, a higher-order taxon can be given to provide some information about the taxon (if available). When GloBI provides a name report, the reason for the correction (or non-correction) is included so that the student / curator can easily exclude "undetermined" or "unknown" names.

In short - fix the transcription errors (e.g. typos) in the source and introduce a way to annotate and correct outdated taxon names using a dedicated taxon correction list.
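
To make the correction list idea a bit more concrete, here is a rough sketch, not an existing GloBI structure. The `Correction` class, the `resolve` helper and the vocabulary terms are hypothetical, and "unidentified larva" is a made-up entry. The list is keyed by name rather than by occurrence, so a name only has to be annotated once no matter how often it appears in the data.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical controlled vocabulary, using terms floated in this thread.
STATUS_TERMS = {"outdated name", "undetermined", "unknown"}


@dataclass(frozen=True)
class Correction:
    provided_name: str                     # name as it appears in the data source
    status: str                            # term from STATUS_TERMS
    corrected_name: Optional[str] = None   # current name, if one was determined

    def __post_init__(self):
        if self.status not in STATUS_TERMS:
            raise ValueError(f"unknown status term: {self.status!r}")


def build_index(corrections):
    """Index corrections by the provided name (one entry per name)."""
    return {c.provided_name: c for c in corrections}


def resolve(name, index):
    """Return (name_to_use, status); status is None if the name is not listed."""
    correction = index.get(name)
    if correction is None:
        return name, None
    return correction.corrected_name or name, correction.status


index = build_index([
    Correction("Bombidae", "outdated name", "Apidae"),
    Correction("unidentified larva", "undetermined"),
])
print(resolve("Bombidae", index))
```

A name report could then skip or group the names whose status is "undetermined" or "unknown", while names with a corrected name are resolved automatically.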

Ideally the avian diet database (or any other data source) should be publishable (e.g. as a data paper in ESA pubs) by itself without having to rely on GloBI. GloBI is just a way to integrate, link and access this rich source of information within a larger body of interaction datasets: software comes and goes, but data is forever.

I'd be willing to discuss more over phone / Skype if necessary (or organize a workshop?). In my mind, data peer review and access methods (which is what I believe this is) can be super useful, but it might take some back and forth to figure out the most efficient way to implement them. Curious to hear your thoughts.

dimus commented Apr 28, 2015

Hi Jorrit, I am available today in the second half of the day, and about
any time tomorrow. Google Hangout or Skype are good for me -- my skype is
dimus62

Cheers

Dima

dimus commented Apr 28, 2015

Oops, sorry, I forgot I do have a meeting in the second half of today, but
tomorrow is still free for me.

Dima

Owner

jhpoelen commented Apr 28, 2015

@dimus - I'll try to contact you tomorrow, Wed 29 April, at 11:00 am Eastern. Please let me know if you'd like to chat at another time.

@ahhurlbert - please let me know if you'd like to join.

ahhurlbert commented Apr 29, 2015

@jhpoelen Sorry can't make it. Maybe we can skype in a week or so?

dimus commented Apr 29, 2015

11:00 is good with me

Owner

jhpoelen commented Apr 29, 2015

Created GlobalNamesArchitecture/gni#38 after today's discussion with @dimus. Hopefully, GloBI data providers can use Global Names to help detect (and potentially correct) problematic names in a way that others can also benefit from.

Owner

jhpoelen commented May 4, 2015

Here's a list of taxon name descriptions I stumbled across:
https://en.wikipedia.org/wiki/Glossary_of_scientific_naming#Latin_descriptions_of_names_or_taxa

I imagine allowing the data sources to annotate names with their known status using terms from this list.

@ahhurlbert I am able to Skype this week. Let me know a good time for you.

ahhurlbert commented May 5, 2015

How about 12 pm EST?

Owner

jhpoelen commented May 5, 2015

Sounds good. Talk to you tomorrow (Tue) at 12p EST.

ahhurlbert commented May 5, 2015

I've added a Name_Status field to reflect the current taxonomic status of the prey name. I'm using 'verified' to indicate a presumably valid name that did not flag in GloBI, and 'unknown' for names that were flagged as invalid. I've gone through and fixed ~10 typos.

Also, I'm surprised that 'Bombidae' did not match any outdated taxonomies. It is an old family name for bumblebees, which have since been incorporated into 'Apidae' within the tribe 'Bombini'. The only extant genus of this tribe is 'Bombus', so I went ahead and changed all diet database entries with Prey_Family == 'Bombidae' to Prey_Family 'Apidae' and Prey_Genus 'Bombus'.
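
For reference, the Bombidae change described above could be expressed roughly as follows. This is only a sketch: the rows are represented as plain dictionaries, the `reassign_bombidae` helper is hypothetical, and the sample rows are made up. Only the Prey_Family and Prey_Genus field names come from the diet database.

```python
def reassign_bombidae(rows):
    """Return copies of the rows with Prey_Family 'Bombidae' rewritten
    as Prey_Family 'Apidae' and Prey_Genus 'Bombus'."""
    updated = []
    for row in rows:
        row = dict(row)  # copy, so the input rows stay untouched
        if row.get("Prey_Family") == "Bombidae":
            row["Prey_Family"] = "Apidae"
            row["Prey_Genus"] = "Bombus"
        updated.append(row)
    return updated


# Made-up rows for illustration.
rows = [
    {"Prey_Family": "Bombidae", "Prey_Genus": ""},
    {"Prey_Family": "Formicidae", "Prey_Genus": ""},
]
updated = reassign_bombidae(rows)
print(updated[0])
```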

Owner

jhpoelen commented May 5, 2015

@ahhurlbert Nice! Question - is the Name_Status associated with the predator or the prey name?

Owner

jhpoelen commented May 5, 2015

After our discussion, I figured that adding two columns, such as Name_Status and Prey_Name_Status, would probably make it clear which name each status relates to.

ahhurlbert commented May 5, 2015

The predator (bird) names are being checked as part of our workflow (and the taxonomic authority they are based on is listed in the Taxonomy field), so there should be no invalid names. That is, if I come across an old paper that uses an outdated bird name, the first thing I do is figure out what the currently accepted name is (using Avibase.org) and that's what is put in the table. Certainly there is the possibility for typos, but as those will be corrected as soon as they are identified I don't think there's a need to add a separate field for this status.

I've changed Name_Status to Prey_Name_Status to clarify which entity this describes.

dimus commented May 6, 2015

To keep you up to date: Wencan, our GSoC student, has started working on an algorithm for GN to figure out name status from existing data/metadata.

Owner

jhpoelen commented May 6, 2015

@dimus thanks for sharing!

Owner

jhpoelen commented May 27, 2015

@ahhurlbert Hey Allen - I've prepared a new version of the taxon name report for you using the Prey_Name_Status data that you provide. You can find the current list of unmatched or suspicious names, ordered by status, by following http://tinyurl.com/hurlbertTaxonNameReportV4 and clicking on the magnifying glass (i.e. execute) button.

I've attached the result that came out. Note that the name status is treated as a controlled vocabulary term. In this case it would be the "Hurlbert Name Status" vocabulary. I suspect that we'll figure out a mapping to other name status vocabs at some point.
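
As a rough illustration of what a report ordered by status boils down to (this is a sketch, not the query behind the link above), the snippet below groups names by their status term and counts occurrences. The `name_report` helper, the `Prey_Name` column and the sample records are assumptions; only Prey_Name_Status and its 'verified' / 'unknown' terms come from the diet database.

```python
from collections import Counter, defaultdict


def name_report(records, name_field="Prey_Name",
                status_field="Prey_Name_Status", exclude=("verified",)):
    """Group names by status, skipping excluded statuses.

    Returns {status: [(name, occurrence_count), ...]} with names sorted.
    """
    counts = Counter()
    for rec in records:
        status = rec.get(status_field) or "unknown"
        if status in exclude:
            continue
        counts[(status, rec[name_field])] += 1

    report = defaultdict(list)
    for (status, name), n in sorted(counts.items()):
        report[status].append((name, n))
    return dict(report)


# Made-up records for illustration.
records = [
    {"Prey_Name": "Bombidae", "Prey_Name_Status": "unknown"},
    {"Prey_Name": "Aves", "Prey_Name_Status": "verified"},
    {"Prey_Name": "Avez", "Prey_Name_Status": "unknown"},
    {"Prey_Name": "Bombidae", "Prey_Name_Status": "unknown"},
]
report = name_report(records)
print(report)
```

Because each name appears once per status with a count, a curator reviews a repeated name a single time instead of once per occurrence.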

Let me know if this name report will help you manage your names better. If so, let me know how you'd like to receive / manage reports like these (an ad hoc CSV download? a dedicated GitHub repo with name reports automatically updated per GloBI data source?).

[Attached image: screen shot 2015-05-27 at 12 58 38 pm]

Owner

jhpoelen commented Jul 6, 2015

GloBI now has a way to capture a taxonomic name status field.

@ahhurlbert please reopen issue if you feel the feature needs some more work.

jhpoelen closed this Jul 6, 2015
