resource.fs_filename and resource.url are sometimes unsynced #2544

Open
abulte opened this issue Oct 1, 2020 · 2 comments · Fixed by #2545

abulte commented Oct 1, 2020

Sometimes, especially on community resources from transport.data.gouv.fr, our fs_filename is not synced with url. After our cleanup, this led to purge-datasets failing because it tried to remove a nonexistent file.

Tests done so far

On demo.data.gouv.fr:

CommunityResources

300+ occurrences, cf complete list https://gist.github.com/abulte/f283a2c2e3dc9102d8f767f0c908637e.

It happened again this morning, cf

offre-de-transport-du-reseau-lio-arc-en-ciel-gtfs,08422838-434a-4a7d-907a-7d43b57b8639,https://static.data.gouv.fr/resources/offre-de-transport-du-reseau-lio-arc-en-ciel-gtfs/20201001-081210/reseau-lr-gtfs-20200924.zip.netex.zip,offre-de-transport-du-reseau-lio-arc-en-ciel-gtfs/20201001-081632/reseau-lr-gtfs-20200924.zip.netex.zip

Resources

>>> for d in Dataset.objects:
...     for r in d.resources:
...         if r.fs_filename and not r.url.endswith(r.fs_filename):
...             print(d.id, r.id, r.url, r.fs_filename)
...
5d13a8b6634f41070a43dff3 1ac234c7-1da4-49cf-a122-646b21d64b43 https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074922/export-tag-20200926-074922.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074919/export-tag-20200822-074919.csv
5d13a8b6634f41070a43dff3 970aafa0-3778-4d8b-b9d1-de937525e379 https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074920/export-reuse-20200926-074920.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074917/export-reuse-20200822-074917.csv
5d13a8b6634f41070a43dff3 b7bbfedc-2448-4135-a6c7-104548d396e7 https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074909/export-organization-20200926-074909.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074906/export-organization-20200822-074906.csv
5d13a8b6634f41070a43dff3 d77705e1-4ecd-461c-8c24-662d47c4c2f9 https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074906/export-discussion-20200926-074906.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074903/export-discussion-20200822-074903.csv
5d13a8b6634f41070a43dff3 4babf5f2-6a9c-45b5-9144-ca5eae6a7a6d https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074811/export-resource-20200926-074811.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074809/export-resource-20200822-074809.csv
5d13a8b6634f41070a43dff3 f868cca6-8da1-4369-a78d-47463f19a9a3 https://static.data.gouv.fr/resources/catalogue-des-donnees-de-data-gouv-fr/20200926-074505/export-dataset-20200926-074505.csv catalogue-des-donnees-de-data-gouv-fr/20200822-074502/export-dataset-20200822-074502.csv
5448d3e0c751df01f85d0572 50625621-18bd-43cb-8fde-6b8c24bdabb3 https://static.data.gouv.fr/resources/fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/20200920-224338/bornes-irve-20200920.csv fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/20200820-224444/bornes-irve-20200820.csv

Cf #2542 for catalogue-des-donnees-de-data-gouv-fr. No idea why it failed for fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques; this is a pretty standard API upload.
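
For reference, the same check can be pulled out as a standalone function so it can be tested without a database. This is only a sketch: `is_synced` is a hypothetical helper name, not something that exists in udata, and the sample values are copied from the last row of the output above.

```python
from typing import Optional


def is_synced(url: str, fs_filename: Optional[str]) -> bool:
    """Return True when the URL ends with the stored file path.

    Resources without an fs_filename are treated as synced, mirroring
    the `r.fs_filename and ...` guard used in the shell session.
    """
    if not fs_filename:
        return True
    return url.endswith(fs_filename)


# Values taken from the unexplained row reported in this issue.
url = (
    "https://static.data.gouv.fr/resources/"
    "fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/"
    "20200920-224338/bornes-irve-20200920.csv"
)
fs_filename = (
    "fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/"
    "20200820-224444/bornes-irve-20200820.csv"
)

print(is_synced(url, fs_filename))  # False
```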

abulte commented Oct 1, 2020

About community resources: this was caused by transport's script, which forced the URL to a previous value when doing the PUT to update metadata. It's been fixed, but there are still 87 dangling resources: https://gist.github.com/abulte/f283a2c2e3dc9102d8f767f0c908637e#file-cr-unsynced-fs-filename-v2-csv. They can be removed.

Going further: we probably should not allow setting the URL from the API when a resource is of type file (i.e. not remote). This would have prevented this whole mess.
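
A rough sketch of what such a guard could look like, to make the idea concrete. Everything here is an assumption for illustration: the `Resource` dataclass is a stand-in for the real model, the `filetype` field and its `"file"` / `"remote"` values are assumed, and `apply_url_update` and `URLChangeNotAllowed` are hypothetical names, not existing udata API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Resource:
    # Hypothetical stand-in for the real resource model.
    url: str
    filetype: str = "file"  # assumed values: "file" (hosted upload) or "remote"
    fs_filename: Optional[str] = None


class URLChangeNotAllowed(ValueError):
    """Raised when an API client tries to change the URL of a hosted file."""


def apply_url_update(resource: Resource, new_url: str) -> Resource:
    """Apply a URL coming from an API payload, refusing changes on hosted files.

    Sending back the unchanged URL is accepted, so a client that echoes the
    full resource in a PUT keeps working.
    """
    if resource.filetype == "file" and new_url != resource.url:
        raise URLChangeNotAllowed(
            "url cannot be changed for a resource of type 'file'"
        )
    resource.url = new_url
    return resource


hosted = Resource(
    url="https://static.example.org/resources/ds/20201001-081632/data.zip",
    filetype="file",
    fs_filename="ds/20201001-081632/data.zip",
)

try:
    # Simulates a PUT that forces the URL back to a previous upload.
    apply_url_update(
        hosted, "https://static.example.org/resources/ds/20201001-081210/data.zip"
    )
except URLChangeNotAllowed as exc:
    print("rejected:", exc)
```

With a guard like this, the url of a hosted file could only change through a new upload, which would keep it in step with fs_filename.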

abulte commented Oct 5, 2020

Keeping this open since this one is still unexplained:

5448d3e0c751df01f85d0572 50625621-18bd-43cb-8fde-6b8c24bdabb3 https://static.data.gouv.fr/resources/fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/20200920-224338/bornes-irve-20200920.csv fichier-consolide-des-bornes-de-recharge-pour-vehicules-electriques/20200820-224444/bornes-irve-20200820.csv

and #2542 must be fixed too.

abulte reopened this Oct 5, 2020