Geolocation assignments fail for duplicate location names #135

Open
joverlee521 opened this issue Dec 16, 2022 · 2 comments

@joverlee521
Contributor

Current Behavior

  1. Locations defined in source-data/geo_synonyms.tsv are ingested into three dicts (location, division, and country), each keyed by the label column. If the TSV file defines multiple locations with the same label, only the last one in the file is used, because it overwrites all previous entries.

  2. In vdb/flu_upload specifically, the ingested location is derived from the single location label in the strain name, so there is no way to tell which of several same-named locations is meant.
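
The overwrite described in point 1 can be reproduced with a minimal sketch. The rows below are hypothetical, and the column names other than `label` are assumptions rather than the confirmed layout of geo_synonyms.tsv; the helper names are illustrative and are not taken from fauna.

```python
import csv
import io
from collections import Counter

# Hypothetical rows; every column name except "label" is an assumption.
SAMPLE_TSV = (
    "label\tcountry\tdivision\tlocation\n"
    "springfield\tusa\tillinois\tspringfield\n"
    "boston\tusa\tmassachusetts\tboston\n"
    "springfield\tusa\tmassachusetts\tspringfield\n"
)


def read_rows(tsv_text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))


def load_by_label(tsv_text):
    """Key divisions by label only, so a repeated label replaces the earlier row."""
    label_to_division = {}
    for row in read_rows(tsv_text):
        label_to_division[row["label"]] = row["division"]
    return label_to_division


def duplicate_labels(tsv_text):
    """Return the labels that appear on more than one row."""
    counts = Counter(row["label"] for row in read_rows(tsv_text))
    return sorted(label for label, n in counts.items() if n > 1)


print(load_by_label(SAMPLE_TSV)["springfield"])  # massachusetts (the last row wins)
print(duplicate_labels(SAMPLE_TSV))              # ['springfield']
```

A check like `duplicate_labels` could also serve as a first pass at finding which labels are affected.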

Possible solution

Use a hierarchical location curation process similar to what we have in ncov-ingest or monkeypox ingest.
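
As a rough illustration of what hierarchical curation could look like, the sketch below matches on the full (country, division, location) tuple instead of a single label, with `*` as a wildcard. The rule format, the rule values, and the function name are all hypothetical; this is not the actual rule format or code used by ncov-ingest or monkeypox ingest.

```python
# Hypothetical rules: (pattern, replacement), each a (country, division, location) tuple.
# "*" in a pattern matches any value; "*" in a replacement keeps the original value.
HYPOTHETICAL_RULES = [
    (("usa", "il", "springfield"), ("usa", "illinois", "springfield")),
    (("usa", "ma", "springfield"), ("usa", "massachusetts", "springfield")),
    (("united states", "*", "*"), ("usa", "*", "*")),
]


def curate(record, rules=HYPOTHETICAL_RULES):
    """Apply the first matching rule to a (country, division, location) tuple."""
    for pattern, replacement in rules:
        if all(p == "*" or p == value for p, value in zip(pattern, record)):
            return tuple(
                value if r == "*" else r for r, value in zip(replacement, record)
            )
    return record  # no rule matched, so leave the record unchanged


print(curate(("usa", "il", "springfield")))  # ('usa', 'illinois', 'springfield')
print(curate(("usa", "ma", "springfield")))  # ('usa', 'massachusetts', 'springfield')
```

Because the key includes the division and country, two locations with the same name no longer collide. This only helps when the record carries that extra context, which the strain name alone does not provide.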

@joverlee521
Contributor Author

After updating the upload scripts to use the hierarchical location curation, we should also audit the metadata in fauna to check for any erroneous location assignments.

I believe any such records can be fixed by re-uploading them with the --overwrite flag to update the location info.

@joverlee521
Contributor Author

Thank you @huddlej for pointing out repeat_location in flu_upload. This appears to be how these specific duplicate location names are handled in the current upload.

Status: In Progress