managing duplicate occurrenceIDs #78

Closed
MathewBiddle opened this issue Nov 10, 2021 · 5 comments
Labels
data help questions specific to a dataset

Comments

@MathewBiddle
Contributor

I'm trying to make a unique occurrenceID for each row. Unfortunately, the best I could do was to concatenate the station, organism, life stage, and type. This still left me with 205 duplicated IDs (410 rows). Looking a little more closely at the duplicates (see screenshot below), they appear to be duplicate observations (same methods, place, time, and species; the life stage is missing). I'm not sure what to do in this case; most likely we need to reach back to the PI and get their two cents.

image

In one case the observations are the same down to the life stage:

image
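For anyone following along, here is a minimal sketch of the check described above: build a candidate occurrenceID from the four fields and count which IDs occur more than once. The rows, field names, and underscore separator are made-up assumptions for illustration and are not taken from the AMBON files or the linked notebook.

```python
from collections import Counter

# Hypothetical rows; field names and values are illustrative only.
rows = [
    {"station": "ST1", "organism": "Calanus glacialis", "life_stage": "", "type": "net150"},
    {"station": "ST1", "organism": "Calanus glacialis", "life_stage": "", "type": "net150"},
    {"station": "ST2", "organism": "Oithona similis", "life_stage": "AF", "type": "net150"},
]

ID_FIELDS = ("station", "organism", "life_stage", "type")


def make_occurrence_id(row, fields=ID_FIELDS):
    """Join the chosen fields with underscores, replacing spaces inside values."""
    return "_".join(str(row[f]).replace(" ", "_") for f in fields)


def find_duplicate_ids(rows, fields=ID_FIELDS):
    """Return a dict mapping each repeated ID to the number of rows that share it."""
    counts = Counter(make_occurrence_id(r, fields) for r in rows)
    return {oid: n for oid, n in counts.items() if n > 1}


duplicates = find_duplicate_ids(rows)
print(duplicates)
```

A missing life stage simply leaves an empty slot in the ID, which is one way two otherwise distinct observations can end up sharing the same string.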

To make a unique occurrenceID for each row, we can append the biomass value to the occurrenceID. I think this makes it globally unique and locally reproducible. See https://nbviewer.org/github/ioos/bio_data_guide/blob/main/datasets/AMBON_zooplankton/2017zooplanton_to_dwc(2)_mmb.ipynb#eventID-and-datasetID-and-OccurenceID-for-all-the-tables

For reference, the notebook can be found at https://github.com/ioos/bio_data_guide/blob/main/datasets/AMBON_zooplankton/2017zooplanton_to_dwc(2)_mmb.ipynb

Do we leave the duplicates and make a note in the metadata that there are duplicates? Or, do we add the biomass value to the ID to ensure uniqueness?
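A rough sketch of the second option, assuming a numeric `biomass` field (the field name, values, and separator are hypothetical, not the notebook's actual code): append the value to the ID and then confirm that every ID is distinct.

```python
# Hypothetical rows; the first two collide on everything except biomass.
rows = [
    {"station": "ST1", "organism": "Calanus glacialis", "life_stage": "", "type": "net150", "biomass": 1.2},
    {"station": "ST1", "organism": "Calanus glacialis", "life_stage": "", "type": "net150", "biomass": 3.4},
    {"station": "ST2", "organism": "Oithona similis", "life_stage": "AF", "type": "net150", "biomass": 0.5},
]

BASE_FIELDS = ("station", "organism", "life_stage", "type")


def make_occurrence_id(row, fields):
    """Join the chosen fields with underscores, replacing spaces inside values."""
    return "_".join(str(row[f]).replace(" ", "_") for f in fields)


# Append the biomass value as an extra ID component.
ids = [make_occurrence_id(r, BASE_FIELDS + ("biomass",)) for r in rows]

# The IDs are unique only if no two colliding rows also share a biomass value.
is_unique = len(set(ids)) == len(ids)
print(ids, is_unique)
```

Two things may be worth checking with this approach: rows that tie on biomass as well would still collide, and the ID is only reproducible if the value is formatted the same way every time the script runs.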

Of course, reaching out to the PI is a step as well.

@7yl4r
Contributor

7yl4r commented Nov 10, 2021

Definitely don't leave duplicate IDs. The only requirement on occurrenceID is that it be globally unique, although nothing actually checks that...

These are curious duplicates though. I wonder if these are actually distinct observations or a quirk of some data processing that duplicated a single observation at some point.

@mobb

mobb commented Nov 10, 2021

Comments from today's meeting:

If you have constructed an ID from fields that should be unique in combination, and you still have duplicates, here is the suggested workflow:

1. Contact the provider. These fields should be unique within the dataset unless we have missed something about the collection method (e.g., the two observations are replicates).
2. Check whether there is a sample ID in the original dataset. This might indicate that the observation was a replicate, and it could be added to the occurrenceID.
3. Find some way to keep both rows (rather than dropping one of each pair). Suggestions include:
3.1. Construct an ID by adding the abundance value (assuming the result will still be unique).
3.2. Add an auto-increment instead of the value.
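Suggestion 3.2 could look something like the sketch below. The function name `add_increment` and the `_1`, `_2` suffix format are assumptions made for illustration; nothing here is prescribed by the meeting notes.

```python
from collections import Counter, defaultdict


def add_increment(base_ids):
    """Append _1, _2, ... to IDs that occur more than once; leave unique IDs alone."""
    totals = Counter(base_ids)   # how many rows share each base ID
    seen = defaultdict(int)      # running counter per repeated base ID
    result = []
    for oid in base_ids:
        if totals[oid] > 1:
            seen[oid] += 1
            result.append(f"{oid}_{seen[oid]}")
        else:
            result.append(oid)
    return result


# Hypothetical base IDs, in file order.
base_ids = ["ST1_Calanus", "ST1_Calanus", "ST2_Oithona"]
print(add_increment(base_ids))
```

Because the counter depends on row order, the resulting IDs are reproducible only if the rows are sorted the same way on every run, which is a trade-off compared with using a measured value as in 3.1.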

@MathewBiddle
Contributor Author

@albenson-usgs The processed files are updated to conform to the recommendation above.
https://github.com/ioos/bio_data_guide/tree/main/datasets/AMBON_zooplankton/data/processed

LMK if you need anything else.

@albenson-usgs
Contributor

albenson-usgs commented Nov 15, 2021

Quick update: I uploaded the new versions to the OBIS-USA IPT, but EurOBIS republished all of their data ahead of this, so it's going to take a while to see the update live in OBIS.

@albenson-usgs
Contributor

The occurrenceID issue is resolved: https://obis.org/dataset/bc01451e-d990-4ad1-8315-e3fb6e9cf461


4 participants