Quality facet - "quantity-contains-e" has many false positives #2037

Closed
Tracked by #10273
CharlesNepote opened this issue Jun 30, 2019 · 6 comments
Labels
🐛 bug This is a bug, not a feature request. 🧽 Data quality - Measure - Quality facets One of the facets available in Open Food Facts is /quality & allows us to spot products w/ bad data 🧽 Data quality https://wiki.openfoodfacts.org/Quality ⚖️ Quantity

Comments

@CharlesNepote
Member

CharlesNepote commented Jun 30, 2019

What

See: https://world.openfoodfacts.org/quality/quantity-contains-e

Example: "1 litre" in https://world.openfoodfacts.org/product/3464660002434/jus-d-aloe-vera-pur-aloe

The regexp matches far too many things:

and ($product_ref->{quantity} =~ /(?:.*e$)|(?:[0-9]+\s*[kmc]?[gl]?\s*e)/i)

I replaced:

  • /(?:.*e$)|(?:[0-9]+\s*[kmc]?[gl]?\s*e)/i by
  • /(?:^e\s)|(?:\se[^a-z])|(?:\se$)/i

This removed more than 1,500 false positives (see the attached file).
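
To make the difference between the two patterns concrete, here is a minimal sketch that ports both regexps to Python (the Perl `/i` flag becomes `re.IGNORECASE`) and runs them on a few quantity strings. Only "1 litre" comes from the example above; the other strings and the helper name `is_flagged` are illustrative assumptions, not taken from the attached file or from the Open Food Facts code.

```python
import re

# Python ports of the two Perl patterns quoted above; /i becomes re.IGNORECASE.
OLD = re.compile(r"(?:.*e$)|(?:[0-9]+\s*[kmc]?[gl]?\s*e)", re.IGNORECASE)
NEW = re.compile(r"(?:^e\s)|(?:\se[^a-z])|(?:\se$)", re.IGNORECASE)

# "1 litre" is the example from this issue; the rest are made-up test strings.
SAMPLES = ["1 litre", "4 eggs", "500 g", "500 g e"]


def is_flagged(pattern, quantity):
    """Return True if the pattern matches anywhere in the quantity (like Perl's =~)."""
    return pattern.search(quantity) is not None


for quantity in SAMPLES:
    print(f"{quantity!r}: old={is_flagged(OLD, quantity)} new={is_flagged(NEW, quantity)}")
```

The old pattern flags "1 litre" (it ends in "e") and "4 eggs" (a digit followed by an "e" with every unit part optional), while the new one only flags a standalone "e" such as in "500 g e".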

4.txt

Part of

@CharlesNepote CharlesNepote added 🐛 bug This is a bug, not a feature request. 🧽 Data quality https://wiki.openfoodfacts.org/Quality 🧽 Data quality - Measure - Quality facets One of the facets available in Open Food Facts is /quality & allows us to spot products w/ bad data ⚖️ Quantity labels Jun 30, 2019
@stephanegigandet
Contributor

I agree that the regexp is broken, but the proposed change also deletes true positives.

What we really want to identify is cases like this:

500 ge
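
As a rough illustration of this point, the sketch below checks that the proposed replacement does not match "500 ge", and tries one hypothetical alternative that requires a digit, a mass or volume unit, and then a standalone "e". The `CANDIDATE` pattern is an assumption made for this example only; it is not the pattern used in the Open Food Facts code, and it has not been run against the real data.

```python
import re

# Python port of the replacement proposed in the first comment.
PROPOSED = re.compile(r"(?:^e\s)|(?:\se[^a-z])|(?:\se$)", re.IGNORECASE)

# Hypothetical alternative (an assumption, not the project's pattern):
# a digit, optional spaces, a unit such as g, kg, ml, cl or l,
# optional spaces, then an "e" that ends at a word boundary.
CANDIDATE = re.compile(r"\d\s*[kmc]?[gl]\s*e\b", re.IGNORECASE)

for quantity in ["500 ge", "500 g e", "1 litre", "4 eggs"]:
    print(
        f"{quantity!r}: "
        f"proposed={PROPOSED.search(quantity) is not None} "
        f"candidate={CANDIDATE.search(quantity) is not None}"
    )
```

Requiring the unit (instead of making it optional) is what keeps "1 litre" and "4 eggs" out, while the trailing `\b` keeps the "e" from being the first letter of a longer word.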

@VaiTon
Member

VaiTon commented Aug 28, 2019

In the first place, why do we want to identify quantities containing "e"? @stephanegigandet

@teolemon teolemon changed the title Quality facet "quantity-contains-e" has many false positives Quality facet - "quantity-contains-e" has many false positives Oct 11, 2021
@benbenben2
Collaborator

@stephanegigandet
Contributor

> In the first place, why do we want to identify quantities containing "e"? @stephanegigandet

That's a good question. :) (sorry @VaiTon I missed it 4 years ago).

The "e" symbol is present on most products, so I didn't see the point in including it in the quantity field.

@benbenben2
Collaborator

Should we close this issue @CharlesNepote?

@alexgarel
Member

I think we can close.
