Improve ingredients parsing and taxonomy, reduce number of unknown ingredients #2023

Open
stephanegigandet opened this issue Jun 27, 2019 · 10 comments
Assignees
Labels
✨ Feature Features or enhancements to Open Food Facts server 🥗 Ingredients 🧬 Taxonomies https://wiki.openfoodfacts.org/Global_taxonomies

Comments

@stephanegigandet
Contributor

I'm going to try to significantly reduce the number of unknown ingredients for products that have correct ingredients lists (e.g. product data from manufacturers). This will allow better NOVA computation, but also new vegetarian / vegan recognition etc.

I'm going to use the Scamark / Leclerc import as a benchmark.

Today we have 5220 products with 5764 ingredients; 4035 of those ingredients are unknown (not in the taxonomy):
https://fr.openfoodfacts.org/editeur/scamark/ingredients

There are 2 main ways to improve this:

  1. adding relevant ingredients to the taxonomy
  2. improving ingredient parsing (decompounding compound ingredients, extracting ingredient properties like origin, quality etc.)
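
As an illustration of the second point, here is a minimal Python sketch of extracting an origin property from an ingredient string. The `origine <place>` pattern and the `split_origin` helper are assumptions made for this example; this is not the parser used by the Open Food Facts server.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern: a trailing "origine <place>" mention, optionally in parentheses.
ORIGIN_RE = re.compile(
    r"\s*\(?\borigine\s+(?P<origin>[^,()]+)\)?\s*$", re.IGNORECASE
)

def split_origin(ingredient: str) -> Tuple[str, Optional[str]]:
    """Return (ingredient without the origin mention, origin or None)."""
    match = ORIGIN_RE.search(ingredient)
    if match is None:
        return ingredient.strip(), None
    # Everything before the match is the ingredient name to look up in the taxonomy.
    return ingredient[: match.start()].strip(), match.group("origin").strip()

print(split_origin("farine de blé origine France"))
print(split_origin("miel (origine Espagne)"))
print(split_origin("sucre"))
```

Once the origin is split off, the remaining name ("farine de blé") has a better chance of matching an existing taxonomy entry.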
@aleene
Contributor

aleene commented Jun 27, 2019

Scanning through the list, many entries are not ingredients but things like traces, so try to parse those out first.

The first one is at position 71: crème pasteurisé. I can start with these. It is a nice list for testing the parser. The issues really appear at the bottom of the list.
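
A rough Python sketch of parsing trace mentions out before the taxonomy lookup. The French wordings in `TRACES_RE` are assumptions for this example (real labels vary a lot), and `partition_traces` is a hypothetical helper, not existing server code.

```python
import re

# Assumed French wordings for trace mentions; real labels vary widely.
TRACES_RE = re.compile(
    r"^(?:peut contenir des )?traces?(?: [ée]ventuelles?| possibles?)? (?:de |d')",
    re.IGNORECASE,
)

def partition_traces(entries):
    """Split parsed entries into (ingredients, trace_mentions)."""
    ingredients, traces = [], []
    for entry in entries:
        entry = entry.strip()
        target = traces if TRACES_RE.match(entry) else ingredients
        target.append(entry)
    return ingredients, traces

ingredients, traces = partition_traces(
    ["crème pasteurisé", "traces de fruits à coque", "Traces éventuelles d'arachides"]
)
print(ingredients)
print(traces)
```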

@aleene
Contributor

aleene commented Jun 27, 2019

PS: can you accept a few pull requests? I am waiting for them to be accepted before continuing.

@aleene
Contributor

aleene commented Jun 27, 2019

proteines de {allergene}lait{/allergene}
farine de {allergene}ble{/allergene}
amidon modifié de maïs et/ou de pomme de terre
proteines de {allergene}soja{/allergene} rehydratees
fibres de {allergene}ble{/allergene}
{allergene}oeufs{/allergene} entiers frais
https://fr.openfoodfacts.org/editeur/scamark/ingredient/alcohol-denat
https://fr.openfoodfacts.org/editeur/scamark/ingredient/parfum
https://fr.openfoodfacts.org/editeur/scamark/ingredient/lt
https://fr.openfoodfacts.org/editeur/scamark/ingredient/zu:eau
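
The `{allergene}…{/allergene}` markers in the strings above could be stripped before the taxonomy lookup while keeping the allergen information. A minimal Python sketch, assuming the markers always come in well-formed pairs; `strip_allergen_markup` is a hypothetical helper name:

```python
import re

# Matches one {allergene}...{/allergene} pair, capturing the marked word(s).
ALLERGEN_RE = re.compile(r"\{allergene\}(.*?)\{/allergene\}")

def strip_allergen_markup(text):
    """Return (text without the markers, list of marked allergens)."""
    allergens = ALLERGEN_RE.findall(text)
    clean = ALLERGEN_RE.sub(r"\1", text)
    return clean, allergens

print(strip_allergen_markup("proteines de {allergene}lait{/allergene}"))
print(strip_allergen_markup("{allergene}oeufs{/allergene} entiers frais"))
```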

@aleene
Contributor

aleene commented Jun 27, 2019

In fact it is not too bad. The important ingredients that occur often seem to be covered; just some synonyms need to be added.

@aleene
Contributor

aleene commented Jun 28, 2019

I added the ingredients here: #2027 (still work in progress)

@aleene
Contributor

aleene commented Jun 28, 2019

Why is this one not recognised? It is in the taxonomy. Is the accent removed too early?
purée de tomate
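
Whether accent stripping is really the cause here is not confirmed. The Python sketch below only illustrates the general idea: if the taxonomy names and the parsed ingredient go through the same normalisation, accented and unaccented spellings end up on the same tag. `to_tag` and `lookup` are hypothetical helpers, not the server's functions.

```python
import re
import unicodedata

def to_tag(name):
    """Lowercase, strip accents, and join words with hyphens."""
    decomposed = unicodedata.normalize("NFD", name.lower())
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "-", unaccented).strip("-")

# Index the taxonomy with the same function used for lookups,
# so that both sides are normalised identically.
taxonomy_names = ["purée de tomate"]
index = {to_tag(name): name for name in taxonomy_names}

def lookup(raw_ingredient):
    return index.get(to_tag(raw_ingredient))

print(to_tag("purée de tomate"))
print(lookup("Puree de tomate"))
```

A mismatch appears when only one side is normalised, for example an index keyed on the accented form and a lookup on the unaccented one.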

@stephanegigandet
Contributor Author

Current stats:
https://fr.openfoodfacts.org/editeur/scamark/ingredients?stats=1

| Type | Unique tags | Occurrences |
| --- | --- | --- |
| known | 1982 (37.00%) | 110227 (94.99%) |
| unknown | 3374 (62.98%) | 5808 (5.01%) |
| all | 5357 (100.00%) | 116035 (100.00%) |

Almost 95%. :-) A lot of the remaining strings are actually not ingredients, but mentions related to ingredients that we could try to parse into labels.
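
For reference, a small Python sketch of how a known/unknown breakdown like the one above could be computed from per-tag occurrence counts. The toy data and the `tag_stats` helper are made up for illustration; the real figures come from the stats page linked above.

```python
from collections import Counter

def tag_stats(tag_counts, known_tags):
    """Return {label: (unique, unique_pct, occurrences, occurrences_pct)}."""
    total_unique = len(tag_counts)
    total_occurrences = sum(tag_counts.values())
    stats = {}
    for label, want_known in (("known", True), ("unknown", False)):
        selected = [n for tag, n in tag_counts.items() if (tag in known_tags) == want_known]
        unique, occurrences = len(selected), sum(selected)
        stats[label] = (
            unique,
            100 * unique / total_unique if total_unique else 0.0,
            occurrences,
            100 * occurrences / total_occurrences if total_occurrences else 0.0,
        )
    return stats

# Toy data: a frequent known tag dominates occurrences, as in the table above.
counts = Counter({"sucre": 8, "sel": 1, "exprime-sur-la-sauce": 1})
for label, row in tag_stats(counts, known_tags={"sucre", "sel"}).items():
    print(f"{label}: {row[0]} ({row[1]:.2f}%) {row[2]} ({row[3]:.2f}%)")
```

This is why the occurrence share of unknown tags can be small while their share of unique tags stays large: most unknown strings appear only once or twice.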

@stephanegigandet
Contributor Author

stephanegigandet commented Aug 18, 2019

I'll filter out sentences like "percentages are expressed on the total product":
(e.g. for Scamark :)

les-pourcentages-sont-exprimes-sur-le-produit-total 135 *
pourcentages-exprimes-sur-le-produit-total 52 *
pourcentages-exprimes-sur-le-total-de-la-recette 19 *
pourcentages-exprimes-sur-la-recette-au-total 14 *
les-pourcentages-sont-exprimes-sur-le-produit-total-avant-friture 11 *
pourcentage-exprime-sur-la-sauce 9 *
pourcentages-exprimes-sur-le-produit-total-avant-friture 7 *
exprime-sur-la-sauce 7 *
exprimes-sur-la-salade-composee 5 *
pourcentages-exprimes-sur-les-nems 5 *
les-pourcentages-sont-exprimes-sur-le-produit-fini 4 *
exprimes-sur-le-mini-quatre-quarts 4 *
exprimes-sur-le-produit-total 4 *
les-pourcentages-sont-exprimes-sur-le-produit-total-avant-precuisson 4 *
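
A possible way to catch all of the tags above with one rule, sketched in Python. The regular expression is an assumption derived only from the 14 tags listed here, and `is_percentage_note` is a hypothetical helper; the actual filter may be written differently.

```python
import re

# Optional "les-", optional "pourcentage(s)-", optional "sont-",
# then "exprime(s)-sur-" followed by whatever the percentages refer to.
PERCENT_NOTE_RE = re.compile(
    r"^(?:les-)?(?:pourcentages?-)?(?:sont-)?exprimes?-sur-"
)

def is_percentage_note(tag):
    """True if the tag looks like a 'percentages expressed on ...' sentence."""
    return PERCENT_NOTE_RE.match(tag) is not None

tags = [
    "les-pourcentages-sont-exprimes-sur-le-produit-total",
    "pourcentage-exprime-sur-la-sauce",
    "exprimes-sur-le-mini-quatre-quarts",
    "puree-de-tomate",
]
print([tag for tag in tags if not is_percentage_note(tag)])
```

Anchoring the pattern at the start of the tag keeps it from matching real ingredients that merely contain "sur".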

@stephanegigandet
Contributor Author

Changes above applied to production on scamark products:

| Type | Unique tags | Occurrences |
| --- | --- | --- |
| known | 2019 (45.62%) | 111302 (96.63%) |
| unknown | 2406 (54.36%) | 3876 (3.37%) |
| all | 4426 (100.00%) | 115178 (100.00%) |

It's getting much better :)

@stephanegigandet
Contributor Author

stephanegigandet commented Aug 21, 2019

Overall stats for fr, before updating all products:

https://fr.openfoodfacts.org/ingredients?stats=1

| Type | Unique tags | Occurrences |
| --- | --- | --- |
| known | 3561 (0.74%) | 3396667 (80.86%) |
| unknown | 478100 (99.26%) | 804162 (19.14%) |
| all | 481662 (100.00%) | 4200829 (100.00%) |

Corresponding ingredient analysis:

| Analysis | Count |
| --- | --- |
| Palm oil content unknown | 129786 |
| Vegetarian status unknown | 127237 |
| Non-vegan | 97613 |
| Vegan status unknown | 84865 |
| Palm oil free | 56846 |
| Non-vegetarian | 38902 |
| Vegetarian | 31865 |
| Vegan | 23730 |
| Palm oil | 13407 |
| Maybe vegetarian | 13379 |
| May contain palm oil | 11344 |
| Maybe vegan | 5175 |

@teolemon teolemon removed the Q1-2019 label Nov 10, 2023
Projects
Status: In progress
Status: To discuss and validate
Development

No branches or pull requests

4 participants