Improve ingredients parsing and taxonomy, reduce number of unknown ingredients #2023
Comments
Scanning through the list, many entries are not ingredients but things like traces, so we should try to parse those out first. The first one is at 71: crème pasteurisé. I can start with these. It is a nice list for testing the parser. The real issues appear at the bottom.
PS: can you accept a few pull requests? I am waiting for them to be accepted before continuing.
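As a rough illustration of parsing traces out before ingredient matching, here is a minimal Python sketch. The trace markers in `TRACE_PATTERN` and the function name `split_traces` are assumptions made for this example; they are not taken from the project's actual parser.

```python
import re

# Hypothetical trace markers; the real parser's patterns may differ.
TRACE_PATTERN = re.compile(
    r"^(?:peut contenir des )?traces?(?: éventuelles)? (?:de |d')",
    re.IGNORECASE,
)


def split_traces(entries):
    """Separate trace mentions from ordinary ingredient entries."""
    ingredients, traces = [], []
    for entry in entries:
        text = entry.strip()
        match = TRACE_PATTERN.match(text)
        if match:
            # Keep only the substance named after the trace marker.
            traces.append(text[match.end():].strip())
        else:
            ingredients.append(text)
    return ingredients, traces
```

Entries that start with a trace marker end up in the second list, so they no longer count as unknown ingredients.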
proteines de {allergene}lait{/allergene}
In fact it is not too bad. The important ingredients that occur often seem to be covered; only some synonyms need to be added.
I added the ingredients here: #2027 (still work in progress)
Why is this one not recognised, even though it is in the taxonomy? Perhaps the accent is removed too early:
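One way a miss like this can happen is when the product string and the taxonomy entry are normalised differently before comparison. The Python sketch below applies the same accent-stripping to both sides. The names `normalize`, `taxonomy_index` and `lookup`, and the two sample taxonomy entries, are hypothetical and only illustrate the idea; the actual lookup code may work differently.

```python
import unicodedata


def normalize(name):
    """Lowercase, trim, and strip combining accents from a name."""
    decomposed = unicodedata.normalize("NFD", name.lower().strip())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Hypothetical taxonomy entries, indexed by their normalised form.
taxonomy_index = {
    normalize(entry): entry
    for entry in ["crème pasteurisée", "protéines de lait"]
}


def lookup(name):
    """Return the canonical taxonomy entry, or None if unknown."""
    return taxonomy_index.get(normalize(name))
```

Because both the index keys and the query go through `normalize`, an unaccented spelling such as "proteines de lait" still resolves to the accented taxonomy entry.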
Current stats:
Almost 95%. :-) A lot of the remaining strings are actually not ingredients, but mentions related to ingredients that we could try to parse into labels.
I'll filter out sentences like "percentages are expressed on the total product":
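A filter of this kind could be sketched in Python as follows. The single pattern comes from the sentence quoted above. The list name, the function name and the sentence-splitting rule are assumptions for this example, not the project's implementation.

```python
import re

# Patterns for sentences that describe the list rather than name an ingredient.
# Only the first pattern is taken from the thread; more could be added.
NON_INGREDIENT_PATTERNS = [
    re.compile(r"percentages? (?:are )?expressed on the total product", re.IGNORECASE),
]


def drop_non_ingredient_sentences(text):
    """Remove sentences matching a known non-ingredient pattern."""
    # Split on a period followed by whitespace or end of text,
    # so decimals such as "12.5%" stay intact.
    sentences = [s.strip() for s in re.split(r"\.(?:\s+|$)", text) if s.strip()]
    kept = [
        s for s in sentences
        if not any(p.search(s) for p in NON_INGREDIENT_PATTERNS)
    ]
    return ". ".join(kept)
```

Filtering whole sentences before splitting on commas keeps such mentions from being counted as unknown ingredients.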
Changes above applied to production on scamark products:
It's getting much better :)
Overall stats for fr, before updating all products: https://fr.openfoodfacts.org/ingredients?stats=1
Corresponding ingredient analysis: Présence d'huile de palme inconnue (palm oil presence unknown): 129786
I'm going to try to significantly reduce the number of unknown ingredients for products that have correct ingredients lists (e.g. product data from manufacturers). This will allow better NOVA computation, but also new vegetarian / vegan recognition etc.
I'm going to use the Scamark / Leclerc import as a benchmark.
Today we have 5220 products with 5764 ingredients; 4035 of those ingredients (about 70%) are unknown (not in the taxonomy):
https://fr.openfoodfacts.org/editeur/scamark/ingredients
There are 2 main ways to improve this: