
Extract automatically ingredient list from OCR #242

Open
raphael0202 opened this issue Dec 13, 2022 · 1 comment


raphael0202 commented Dec 13, 2022

Why is it important

Knowing the ingredients of products is really important in Open Food Facts: the ingredient list is used to compute the NOVA group (food processing score) and to inform users with allergies or intolerances that some products are not suitable for them.
It's also likely that the ingredient list will be used in future versions of the Ecoscore, the environmental score used on Open Food Facts.

The current process for ingredient extraction is the following:

  • a user uploads an image, crops it to keep only the ingredient list, and selects it as the ingredient image
  • the user extracts the ingredient list by clicking a button (OCR is performed)
  • the user fixes any OCR errors in the text and saves the product

This manual approach takes time, and most contributors don't extract the ingredients. As of December 2022, out of 2.7M products, 1.9M (~70%) don't have a completed ingredient list.

Proposal

We would like to automatically extract the ingredient list from image OCR results. As OCR is performed on all images, we already have the text; what remains to be done is to find the beginning and end of the ingredient list.

A sequence tagger (NER-like model) can be trained to detect the beginning and end of the ingredient list (if any).
Open Food Facts is a global food database, so don't expect a single language on the photos: the detector should work on at least the most common languages (FR, EN, ES, DE, IT, NL, ...).
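To make the sequence-tagger idea concrete, here is a minimal sketch of the post-processing step: turning token-level predictions into ingredient-list text spans. The BIO label scheme (`B-ING` / `I-ING` / `O`) and the function name are assumptions for illustration, not a decision already made in this issue:

```python
def extract_ingredient_spans(tokens, tags):
    """Merge BIO tags into text spans.

    `tags[i]` labels `tokens[i]`: "B-ING" starts an ingredient list,
    "I-ING" continues one, "O" is anything else. The label names are
    hypothetical; any BIO-style scheme works the same way.
    """
    spans = []  # list of (start, end) token index pairs
    start = None
    for i, tag in enumerate(tags):
        if tag == "B-ING":
            if start is not None:          # close a span left open
                spans.append((start, i))
            start = i
        elif tag == "I-ING" and start is not None:
            continue                        # still inside a span
        else:
            if start is not None:           # "O" (or stray "I-ING") ends the span
                spans.append((start, i))
                start = None
    if start is not None:                   # span running to the end of the text
        spans.append((start, len(tags)))
    return [" ".join(tokens[s:e]) for s, e in spans]


# Illustrative example: only the tail of the OCR text is tagged as ingredients.
tokens = "Best before 2023 Ingredients: wheat flour, sugar, salt".split()
tags = ["O", "O", "O", "O", "B-ING", "I-ING", "I-ING", "I-ING"]
print(extract_ingredient_spans(tokens, tags))  # → ['wheat flour, sugar, salt']
```

Whatever framework produces the per-token logits, a decoding step of this shape is needed to map predictions back to the text stored on the product.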

We don't have any labeled dataset for this task.

Google Cloud Vision (the service we use for OCR) doesn't always handle line continuation well (i.e., linking detected words into sentences), but based on a manual analysis of ingredient list images, this issue only occurs in ~4.1% of cases (9/217).
We therefore rely on Cloud Vision block detection, keeping in mind that the ingredient list may occasionally be split into several parts due to incorrect block detection.
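For readers unfamiliar with the OCR payload: Cloud Vision's `fullTextAnnotation` nests pages → blocks → paragraphs → words → symbols. A minimal sketch of recovering one text string per block (the unit the tagger would consume) could look like this; joining words with a plain space is a simplification, since the real response encodes break types on each symbol:

```python
def block_texts(full_text_annotation):
    """Return one joined string per detected block.

    `full_text_annotation` is the dict form of a Cloud Vision
    `fullTextAnnotation` response. Word separators are approximated
    with a single space for simplicity.
    """
    texts = []
    for page in full_text_annotation.get("pages", []):
        for block in page.get("blocks", []):
            words = []
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    # Each word is a list of single-character symbols.
                    words.append("".join(s["text"] for s in word.get("symbols", [])))
            texts.append(" ".join(words))
    return texts


# Tiny illustrative response: one page, one block, two words.
response = {"pages": [{"blocks": [{"paragraphs": [{"words": [
    {"symbols": [{"text": "Ingredients"}, {"text": ":"}]},
    {"symbols": [{"text": "sugar"}]},
]}]}]}]}
print(block_texts(response))  # → ['Ingredients: sugar']
```

When block detection splits the ingredient list, the tagger would see it spread over two or more of these strings, which is the ~4% failure mode mentioned above.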

Documentation about OCR process: https://wiki.openfoodfacts.org/OCR

Requirements

You can use any framework you like, but the model must be exportable to either ONNX or SavedModel format (we use Triton to serve ML models).
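For orientation, a Triton model repository for an ONNX export typically looks like the sketch below. Every name, shape, and dtype here is a hypothetical example for a token-classification model, not this project's actual configuration:

```
model_repository/
└── ingredient_ner/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
```

```
# config.pbtxt — illustrative only
name: "ingredient_ner"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ -1, 3 ]
  }
]
```

The practical implication for contributors: whatever training setup you choose, check early that your architecture survives the ONNX (or SavedModel) export step.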

raphael0202 (author) commented:

The annotation campaign has started here: https://annotate.openfoodfacts.org/projects/1/data

@teolemon teolemon added the ✨ enhancement New feature or request label May 11, 2024