managing duplicate occurrenceIDs #78

MathewBiddle · 2021-11-10T15:57:30Z

I'm working on making a unique occurrenceID for each row. Unfortunately, the best I could do was to concatenate the station, organism, life stage, and type. This still left me with 205 duplicated IDs (410 rows). Looking a little more closely at the duplicates (see screenshot below), it seems that there are duplicate observations (same methods, place, time, and species; stage is missing). Not sure what to do in this case; most likely we need to reach back to the PI and get their two cents.
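A minimal sketch of the kind of check described above, using pandas. The column names and rows here are made up for illustration and are not the ones in the notebook; the point is only how concatenated IDs can be built and how the duplicated ones can be counted.

```python
import pandas as pd

# Hypothetical rows; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "station": ["ML1", "ML1", "ML1", "DBO3"],
    "organism": ["Calanus glacialis", "Calanus glacialis",
                 "Oithona similis", "Calanus glacialis"],
    "life_stage": ["C5", "C5", "adult", "C5"],
    "net_type": ["150um", "150um", "150um", "150um"],
})

# Build a candidate occurrenceID by joining the key fields with underscores.
key_cols = ["station", "organism", "life_stage", "net_type"]
df["occurrenceID"] = df[key_cols].astype(str).agg("_".join, axis=1)

# keep=False flags every row that shares its ID with another row.
dupes = df[df["occurrenceID"].duplicated(keep=False)]
n_dup_ids = dupes["occurrenceID"].nunique()
n_dup_rows = len(dupes)

print(f"{n_dup_ids} duplicated IDs covering {n_dup_rows} rows")
```

Sorting `dupes` by `occurrenceID` puts each pair side by side, which makes it easier to see whether the remaining columns differ.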

In one case the observations are the same down to the life stage:

To make unique occurrenceIDs for each row, we can append the biomass value to the occurrenceID. I think this makes it globally unique and locally reproducible. See https://nbviewer.org/github/ioos/bio_data_guide/blob/main/datasets/AMBON_zooplankton/2017zooplanton_to_dwc(2)_mmb.ipynb#eventID-and-datasetID-and-OccurenceID-for-all-the-tables

For reference, the notebook can be found at https://github.com/ioos/bio_data_guide/blob/main/datasets/AMBON_zooplankton/2017zooplanton_to_dwc(2)_mmb.ipynb
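The biomass-suffix idea could look roughly like the sketch below. The `base_id` and `biomass` column names and the values are assumptions for illustration, not taken from the notebook. Note that this only helps if the duplicated rows really do have different biomass values, so it is worth checking uniqueness afterwards.

```python
import pandas as pd

# Hypothetical data: two rows share the same base ID but differ in biomass.
df = pd.DataFrame({
    "base_id": ["ML1_Calanus_C5", "ML1_Calanus_C5", "DBO3_Oithona_adult"],
    "biomass": [0.125, 0.5, 2.0],
})

# Append the measured value to the base ID.
df["occurrenceID"] = df["base_id"] + "_" + df["biomass"].astype(str)

# Confirm that the suffix actually resolved every collision.
all_unique = df["occurrenceID"].is_unique
print(df["occurrenceID"].tolist(), all_unique)
```

Because the ID is derived only from values already in the row, rerunning the notebook on the same input gives the same IDs.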

Do we leave the duplicates and make a note in the metadata that there are duplicates? Or, do we add the biomass value to the ID to ensure uniqueness?

Of course, reaching out to the PI is a step as well.

7yl4r · 2021-11-10T16:29:17Z

Definitely don't leave duplicate IDs. The only requirement on occurrenceID is that it be globally unique, although nothing actually checks that...

These are curious duplicates though. I wonder if these are actually distinct observations or a quirk of some data processing that duplicated a single observation at some point.

mobb · 2021-11-10T22:07:36Z

comments from today's meeting:

If you have constructed an ID from fields that should be presumed unique, and you still have duplicates, here is the suggested workflow:

1. Contact the provider. These fields should be unique within the dataset, unless we have missed something about the collection method (e.g., the two observations are replicates).
2. Is there a sample ID in the original dataset? This might indicate that it was a replicate, and could be added to the occurrenceID.
3. Find some way to keep both (rather than dropping one of each pair). Suggestions include:
   3.1. Construct an ID by adding the abundance value (assuming it will still be unique).
   3.2. Add an auto-increment instead of the value.
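For option 3.2, one possible sketch uses pandas `groupby(...).cumcount()` to number the rows within each group of identical base IDs. The column names and data are hypothetical. Sorting first is an added assumption so that the counter does not depend on the incoming row order.

```python
import pandas as pd

# Hypothetical data: three rows collide on the same base ID.
df = pd.DataFrame({
    "base_id": ["s1_sp", "s1_sp", "s2_sp", "s1_sp"],
    "abundance": [5.0, 2.0, 1.0, 5.0],
})

# Sort deterministically so the counter is reproducible between runs.
out = (
    df.sort_values(["base_id", "abundance"], kind="stable")
      .reset_index(drop=True)
)

# Number the rows within each base ID, starting at 1.
out["replicate"] = out.groupby("base_id").cumcount() + 1
out["occurrenceID"] = out["base_id"] + "_" + out["replicate"].astype(str)

print(out[["abundance", "occurrenceID"]])
```

Unlike the value-based suffix, the counter still works when two rows are identical in every column, but the resulting IDs are only stable as long as the sort order of the source data is.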

MathewBiddle · 2021-11-12T21:12:29Z

@albenson-usgs The processed files are updated to conform to the recommendation above.
https://github.com/ioos/bio_data_guide/tree/main/datasets/AMBON_zooplankton/data/processed

LMK if you need anything else.

albenson-usgs · 2021-11-15T15:58:15Z

Quick update: I uploaded the new versions to the OBIS-USA IPT, but EurOBIS republished all their data ahead of this, so it's going to take a while to see the update live in OBIS.

albenson-usgs · 2021-11-18T15:54:33Z

occurrenceID issue is resolved https://obis.org/dataset/bc01451e-d990-4ad1-8315-e3fb6e9cf461

albenson-usgs closed this as completed Nov 18, 2021

MathewBiddle added the data help questions specific to a dataset label Nov 19, 2021

MathewBiddle self-assigned this Nov 19, 2021

MathewBiddle mentioned this issue Nov 22, 2021

Add an example of walking through resolving data issues using this repo #83

Closed
