Remove or move entries with non normalized codes in mongodb #7249

Open
alexgarel opened this issue Aug 25, 2022 · 2 comments
Labels:

  • 🧽 Data quality (https://wiki.openfoodfacts.org/Quality)
  • MongoDB (We have 2 mongodb collections: one for current products, and one for obsolete products)
  • ⏰ Stale (This issue hasn't seen activity in a while. You can try documenting more to unblock it.)
  • ✅ Task

Comments

@alexgarel
Member

We have entries on the file system or in MongoDB with a non-normalized code (see Products.pm::normalize_code). This confuses people using the data dump (see #7244).

Run a script to clean those entries. For each entry with a non-normalized code:

  • always remove the non-normalized version in MongoDB
  • for the sto file:
    • if an entry with the normalized code exists: remove the non-normalized one
    • else move the entry to its normalized code

Note: be sure to search for non-normalized codes in both places: in MongoDB and on the file system.
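
The rules above can be sketched as a planning function that only lists the actions, in the spirit of a dry run. This is a minimal sketch in Python, not the actual script: `plan_cleanup` and `toy_normalize` are hypothetical names, and `toy_normalize` is only a stand-in that keeps digits; it does not reproduce what Products.pm::normalize_code really does.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Action = Tuple[str, str, Optional[str]]


def toy_normalize(code: str) -> str:
    # Stand-in for the real normalization (assumption): keep digits only.
    return "".join(ch for ch in code if ch.isdigit())


def plan_cleanup(
    mongo_codes: Iterable[str],
    sto_codes: Iterable[str],
    normalize: Callable[[str], str],
) -> List[Action]:
    """List the cleanup actions without applying them."""
    mongo = set(mongo_codes)
    sto = set(sto_codes)
    actions: List[Action] = []
    # Search both sources, so a code present in only one of them is not missed.
    for code in sorted(mongo | sto):
        normalized = normalize(code)
        if normalized == code:
            continue
        if code in mongo:
            # The non-normalized version is always removed from MongoDB.
            actions.append(("remove_from_mongodb", code, None))
        if code in sto:
            if normalized in sto:
                # A normalized entry already exists: drop the other one.
                actions.append(("remove_sto", code, None))
            else:
                # No normalized entry yet: move this one to its normalized code.
                actions.append(("move_sto", code, normalized))
                sto.discard(code)
                sto.add(normalized)
    return actions


if __name__ == "__main__":
    for action in plan_cleanup({"1 23", "123"}, {"1 23", "123", "4 56"}, toy_normalize):
        print(action)
```

Tracking the moved entry in `sto` means that if two non-normalized codes map to the same normalized code, the first is moved and the second is removed.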

@teolemon added the MongoDB label Oct 22, 2022
@CharlesNepote added the 🧽 Data quality label Jan 10, 2023
alexgarel added a commit that referenced this issue Jan 19, 2023
and int codes:

- on file system (sto files)
- in mongodb

fixes: Remove or move entries with non normalized codes in mongodb #7249
@alexgarel self-assigned this Jan 23, 2023
@alexgarel
Member Author

The first dry run on preprod was really, really slow. I need to add some progress indicators to understand which part is the slowest.

real    3608m29.575s
user    11m10.066s
sys     34m11.942s

The gap between user and real time indicates that it spends a lot of time waiting for something (the file system). I suspect we have a problem with the ZFS mount on preprod (see also #6164). Maybe we just need to avoid being on an old version of the ZFS, and for that we need to do something (maybe unmount and remount the file system, I'm not sure).
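
To put a number on that gap, here is a small sketch in Python that converts the three `time` values above to seconds and computes how much of the wall-clock time was actually spent on the CPU. `parse_duration` is a hypothetical helper written for this comment; it only handles the `<minutes>m<seconds>s` format shown above.

```python
import re


def parse_duration(text: str) -> float:
    """Convert a duration such as '3608m29.575s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unexpected duration format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


real = parse_duration("3608m29.575s")
user = parse_duration("11m10.066s")
sys_time = parse_duration("34m11.942s")

# Share of the elapsed time during which the process was on the CPU.
cpu_fraction = (user + sys_time) / real

print(f"wall clock: {real / 3600:.1f} h, CPU share: {cpu_fraction:.1%}")
```

That is roughly 60 hours of wall-clock time with only about 1.3% of it on the CPU, so the process was blocked waiting for the rest of the time.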

In the end, it died in the MongoDB part.

MongoDB::DatabaseError: Executor error during find command :: caused by :: operation exceeded time limit

I attach the log of the changes announced by the first phase (it's a dry run, so nothing has been applied yet).

I will:

  • add logs to report progress (enabled by a specific option)
  • try to remount ZFS and see whether it changes anything
  • add options to run only part of the script (where that makes sense), so I can jump straight to one part; this will help with debugging
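
As an illustration of the first and last points, here is a minimal sketch of what such options could look like, written in Python with `argparse`. The option names (`--progress-every`, `--phase`) and the phase names are assumptions made for this example; they are not the options of the actual script.

```python
import argparse

# Hypothetical phase names, matching the two places where codes are searched.
PHASES = ("filesystem", "mongodb")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Clean non-normalized codes (sketch).")
    parser.add_argument("--dry-run", action="store_true",
                        help="only log the changes, do not apply them")
    parser.add_argument("--progress-every", type=int, default=0, metavar="N",
                        help="log progress every N entries (0 disables it)")
    parser.add_argument("--phase", choices=PHASES, action="append",
                        help="run only this phase; repeat to select several")
    return parser


def selected_phases(args: argparse.Namespace) -> tuple:
    # Without --phase, every phase runs, in the default order.
    return tuple(args.phase) if args.phase else PHASES


def should_report(done: int, every: int) -> bool:
    # True when a progress line is due after `done` processed entries.
    return every > 0 and done % every == 0


if __name__ == "__main__":
    args = build_parser().parse_args(["--dry-run", "--phase", "mongodb",
                                      "--progress-every", "1000"])
    print(selected_phases(args), args.dry_run, should_report(2000, args.progress_every))
```

Selecting a single phase makes it possible to rerun only the part that failed, without paying again for the slow file system scan.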

Contributor

github-actions bot commented Jan 3, 2024

This issue has been open 90 days with no activity. Can you give it a little love by linking it to a parent issue, adding relevant labels and projects, creating a mockup if applicable, adding code pointers from https://github.com/openfoodfacts/openfoodfacts-server/blob/main/.github/labeler.yml, giving it a priority, editing the original issue to have a more comprehensive description… Thank you very much for your contribution to 🍊 Open Food Facts

@github-actions bot added the ⏰ Stale label Jan 3, 2024