managing duplicate occurrenceIDs #78
Comments
Definitely don't leave duplicate IDs. The only requirement on These are curious duplicates though. I wonder if these are actually distinct observations or a quirk of some data processing that duplicated a single observation at some point.
Comments from today's meeting: If you have constructed an ID from fields that should be presumed to be unique and you still have duplicates, the suggested workflow is to contact the provider. These fields should be unique within the dataset, unless we have missed something about the collection method (e.g., these two observations are replicates).
@albenson-usgs The processed files are updated to conform to the recommendation above. LMK if you need anything else.
Quick update: I uploaded the new versions to the OBIS-USA IPT, but EurOBIS republished all their data ahead of this, so it's going to take a while to see the update live in OBIS.
The occurrenceID issue is resolved: https://obis.org/dataset/bc01451e-d990-4ad1-8315-e3fb6e9cf461
I'm trying to make a unique occurrenceID for each row. Unfortunately, the best I could do was to concatenate the station, organism, life stage, and type. This still left me with 205 duplicates (410 rows). Looking a little more closely at the duplicates (see screenshot below), it seems that there are duplicate observations (same methods, place, time, and species; stage is missing). Not sure what to do in this case. Most likely we need to reach back to the PI and get their 2 cents.
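For anyone wanting to reproduce the duplicate check, here is a minimal sketch of building the ID from those four fields and counting collisions. The column names (`station`, `organism`, `life_stage`, `type`), the `_` separator, and the sample rows are all assumptions for illustration. They are not taken from the actual AMBON files or the notebook.

```python
from collections import Counter

# Hypothetical rows: field names and values are illustrative only,
# not copied from the AMBON zooplankton dataset.
rows = [
    {"station": "ST1", "organism": "Calanus glacialis", "life_stage": "C5", "type": "net150"},
    {"station": "ST1", "organism": "Calanus glacialis", "life_stage": "C5", "type": "net150"},
    {"station": "ST1", "organism": "Calanus glacialis", "life_stage": "", "type": "net150"},
    {"station": "ST2", "organism": "Oithona similis", "life_stage": "adult", "type": "net150"},
]

# Assumed set of fields used to build the ID.
ID_FIELDS = ("station", "organism", "life_stage", "type")


def make_occurrence_id(row, fields=ID_FIELDS):
    """Join the chosen fields with '_' after trimming and replacing spaces."""
    return "_".join(str(row[f]).strip().replace(" ", "-") for f in fields)


def find_duplicate_ids(rows, fields=ID_FIELDS):
    """Return {occurrenceID: row count} for every ID shared by more than one row."""
    counts = Counter(make_occurrence_id(r, fields) for r in rows)
    return {oid: n for oid, n in counts.items() if n > 1}


duplicates = find_duplicate_ids(rows)
print(len(duplicates), "duplicated IDs covering", sum(duplicates.values()), "rows")
```

`len(duplicates)` is the number of IDs that collide and `sum(duplicates.values())` is the number of rows involved, which is the same distinction as the 205 duplicates versus 410 rows above. If the data are already in pandas, `DataFrame.duplicated(subset=..., keep=False)` gives the same rows.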
In one case the observations are the same down to the life stage:
To make unique occurrenceIDs for each row, we can concatenate the biomass value in the occurrenceID. I think this makes it globally unique and locally reproducible. See https://nbviewer.org/github/ioos/bio_data_guide/blob/main/datasets/AMBON_zooplankton/2017zooplanton_to_dwc(2)_mmb.ipynb#eventID-and-datasetID-and-OccurenceID-for-all-the-tables
For reference, the notebook can be found at https://github.com/ioos/bio_data_guide/blob/main/datasets/AMBON_zooplankton/2017zooplanton_to_dwc(2)_mmb.ipynb
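A rough sketch of the biomass idea, assuming hypothetical column names (`station`, `organism`, `life_stage`, `type`, `biomass`) and made-up values. This is not the code from the linked notebook. It appends the biomass value to every ID and then re-checks that no collisions remain.

```python
from collections import Counter

# Hypothetical rows: the first two share station, organism, life stage and type,
# and differ only in the (made-up) biomass value.
rows = [
    {"station": "ST1", "organism": "Calanus glacialis", "life_stage": "C5",
     "type": "net150", "biomass": "0.125"},
    {"station": "ST1", "organism": "Calanus glacialis", "life_stage": "C5",
     "type": "net150", "biomass": "0.310"},
    {"station": "ST2", "organism": "Oithona similis", "life_stage": "adult",
     "type": "net150", "biomass": "0.042"},
]

# Assumed ID fields, with biomass appended as the tiebreaker.
ID_FIELDS = ("station", "organism", "life_stage", "type", "biomass")


def make_occurrence_id(row, fields=ID_FIELDS):
    """Join the chosen fields with '_' after trimming and replacing spaces."""
    return "_".join(str(row[f]).strip().replace(" ", "-") for f in fields)


def assign_occurrence_ids(rows, fields=ID_FIELDS):
    """Build one ID per row and list any IDs that still occur more than once."""
    ids = [make_occurrence_id(r, fields) for r in rows]
    counts = Counter(ids)
    still_duplicated = sorted(oid for oid, n in counts.items() if n > 1)
    return ids, still_duplicated


ids, still_duplicated = assign_occurrence_ids(rows)
if still_duplicated:
    print("Still not unique:", still_duplicated)
```

The biomass is kept as the string read from the source file so that the ID is reproducible and does not depend on float formatting. One caveat with this approach: if a biomass value is later corrected, the occurrenceID for that row changes too, and two true replicates with identical biomass would still collide.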
Do we leave the duplicates and make a note in the metadata that there are duplicates? Or, do we add the biomass value to the ID to ensure uniqueness?
Of course, reaching out to the PI is a step as well.