Quality Facets - Create a quality facet with detected errors #912

Open
2 of 44 tasks
Tracked by #10273
teolemon opened this issue Oct 14, 2017 · 7 comments
Labels
🧽 Data quality - Measure - Quality facets: one of the facets available in Open Food Facts is /quality; it allows us to spot products with bad data
🧽 Data quality: https://wiki.openfoodfacts.org/Quality

Comments

@teolemon
Member

teolemon commented Oct 14, 2017

TODO INGREDIENTS

  • several-languages
  • detected-language-does-not-match-field-language
  • ingredient-word-in-ingredient-list
  • recycling-instructions-in-ingredients
  • phone-number-in-ingredients
  • Quality facet: url-in-ingredients #9584
  • digit-within-a-word (unt8denglekhen)
  • vegan-label-non-vegan-ingredient
  • halal-label-non-halal-ingredient
  • kosher-label-non-kosher-ingredient
  • gluten-free-label-gluten-ingredient
  • traces-in-ingredients-but-not-in-field
  • allergens-in-ingredients-but-not-in-field
  • storage-instructions-in-ingredients
  • unexpected-characters # $ € £ ! °
  • question-marks ?
  • various-dictionary-errors "antioxy- dant" "acide ci- rique"
  • country-name-in-ingredients/ (can be combined with origin-of-ingredients state)
  • packager-code-detected
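
A few of the TODO items above are simple pattern checks. The following is a minimal sketch of three of them, not the project's implementation; the function name `todo_checks`, the regular expressions, and the choice to ignore E-numbers are assumptions made for illustration.

```python
import re

# Letters on both sides of one or more digits, e.g. "unt8denglekhen".
# Assumption: tokens such as "E330" are not flagged, because the digits
# must be followed by at least one more letter.
DIGIT_WITHIN_WORD = re.compile(r"[^\W\d_]+\d+[^\W\d_]+")

# Characters named in the TODO item: # $ € £ ! °
UNEXPECTED_CHARS = re.compile(r"[#$€£!°]")


def todo_checks(ingredients_text):
    """Return the (hypothetical) quality tags triggered by the text."""
    tags = []
    if DIGIT_WITHIN_WORD.search(ingredients_text):
        tags.append("digit-within-a-word")
    if UNEXPECTED_CHARS.search(ingredients_text):
        tags.append("unexpected-characters")
    if "?" in ingredients_text:
        tags.append("question-marks")
    return tags
```

For example, `todo_checks("unt8denglekhen, sugar")` flags only `digit-within-a-word`, while a clean list such as `"water, E330, sugar"` produces no tag.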

TODO Photo safety

  • world.off.org/quality/face-detected-in-photo/

TODO Additives consistency

TODO Various

DONE

  • Ingredients-ingredient-tag-length-greater-than-50
  • Ingredients-ingredient-tag-length-greater-than-100
  • Ingredients-en-4-consonants
  • Ingredients-en-ending-comma
  • Ingredients-en-unexpected-chars-exclamation-mark
  • Ingredients-en-unexpected-chars-question-mark
  • Ingredients-en-includes-fr-nutrition-facts
  • Ingredients-en-unexpected-chars-currencies
  • Ingredients-en-unexpected-chars-arobase
  • Ingredients-en-4-repeated-chars
  • Ingredients-en-includes-fr-instructions
  • Ingredients-en-5-vowels
@teolemon teolemon added the 🧽 Data quality https://wiki.openfoodfacts.org/Quality label Oct 14, 2017
@teolemon
Member Author

By default, they should be sorted from shortest to longest ingredient list. That way, the easier products get fixed quickly.
Courageous people can start with the longest ingredient lists.
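
As a rough illustration of that ordering, here is a small sketch. It assumes each product is a dict whose ingredient list is stored under an `ingredients_text` key; the function name is hypothetical.

```python
def sort_by_ingredient_list_length(products, longest_first=False):
    """Order products by the length of their ingredient text.

    Products without an ingredient text count as length 0.
    """
    return sorted(
        products,
        key=lambda product: len(product.get("ingredients_text", "")),
        reverse=longest_first,
    )
```

Passing `longest_first=True` gives the order suggested for the courageous.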

@stephanegigandet
Contributor

That sounds very good. Ideally, we would make the rules reasonably easy to write and add (i.e. without needing to change real code), perhaps something like the edit alerts we already have.
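
One way to keep rules out of the code is to store them as data. The sketch below assumes a hypothetical JSON document mapping a tag name to a regular expression; the rule names, patterns, and function names are illustrative and do not describe how the existing edit alerts work.

```python
import json
import re

# Hypothetical rule file content: tag name -> regular expression.
# A raw string keeps the JSON escapes ("\\S" decodes to the regex \S).
RULES_JSON = r"""
{
  "url-in-ingredients": "https?://\\S+",
  "question-marks": "\\?"
}
"""


def load_rules(rules_json):
    """Compile every (tag, pattern) pair found in the JSON document."""
    return {
        tag: re.compile(pattern, re.IGNORECASE)
        for tag, pattern in json.loads(rules_json).items()
    }


def apply_rules(text, rules):
    """Return the tags whose pattern matches somewhere in the text."""
    return [tag for tag, regex in rules.items() if regex.search(text)]
```

Adding a rule then means adding one line to the JSON document, with no change to `load_rules` or `apply_rules`.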

@teolemon
Member Author

teolemon commented Oct 21, 2017

From Stéphane:

  • what would be simple to do: we could imagine a batch Python script that runs at night, fetches the products from MongoDB one by one, and adds the facets you want in MongoDB
  • the only catch is that the extra facets would not be kept if the product is then edited through Product Opener, but they would be restored the following night
  • the big advantage is that it is completely independent of the Perl code: you read the document in your Python script, do whatever you want however you want, and just add a quality_tags field to the document; if you changed something, you write it to Mongo
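
The per-document step of such a batch could look like the sketch below. It only covers the pure part (compute `quality_tags`, report whether the document changed); reading from and writing to MongoDB is left out. The `ingredients_text` key, the `CHECKS` table, and the function name are assumptions.

```python
def add_quality_tags(product, checks):
    """Return (updated_document, changed) without mutating the input."""
    text = product.get("ingredients_text", "")
    tags = [tag for tag, check in checks.items() if check(text)]
    updated = dict(product, quality_tags=tags)
    return updated, updated != product


# Hypothetical checks: tag name -> predicate on the ingredients text.
CHECKS = {
    "url-in-ingredients": lambda text: "http://" in text or "https://" in text,
    "question-marks": lambda text: "?" in text,
}

# In the nightly batch, each document read from MongoDB would go through
# add_quality_tags() and be written back only when `changed` is True.
doc = {"code": "123", "ingredients_text": "sugar? http://example.com"}
updated, changed = add_quality_tags(doc, CHECKS)
```

Because the tags are recomputed from scratch on every run, tags lost after an edit in Product Opener come back on the next run, as described above.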

@teolemon
Member Author

teolemon commented Oct 21, 2017

Check for nutrition keywords in ingredients

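def parse_nutrition(self):
    # `self` is the ingredients text. Return the tag name when the text
    # contains a nutrition-table keyword (case-insensitive), else None.
    nutrition_statements = (
        "glucides",
        "sucres",
    )
    if any(x in self.lower() for x in nutrition_statements):
        return "nutrition_statements"
    return None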

Check for URLs in ingredients

import re

def parse_url(self):
    # `self` is the ingredients text. Return every http(s) URL found in it.
    return re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', self)

Check for emails in ingredients

import re

def parse_email(self):
    # `self` is the ingredients text. Return every email-like token found in it.
    return re.findall(r'[\w\.-]+@[\w\.-]+', self)

Check for phone numbers in ingredients

import re

def parse_phonenumber(self):
    # `self` is the ingredients text. Return a flat list of phone-number-like
    # digit groups: NNN NNN NNNN, (NNN) NNN NNNN or NNN NNNN, where the
    # separator is optional and may be "-", "." or whitespace.
    return re.findall(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}', self)

@teolemon
Member Author

teolemon commented Oct 22, 2017

Taxonomy based solution for checks

en:Pork-based dishes
is_vegan:en:no
is_vegetarian:en:no
is_kosher:en:no
is_halal:en:no
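
A check built on such properties might look like the following sketch. The in-memory dictionaries stand in for the taxonomy and cover only a subset of the properties; the tag spellings, the label-to-property mapping, and the function name are assumptions.

```python
# Hypothetical in-memory view of the taxonomy entry above (subset).
CATEGORY_PROPERTIES = {
    "en:pork-based-dishes": {
        "is_vegan": "no",
        "is_vegetarian": "no",
        "is_halal": "no",
    },
}

# Assumed mapping from a label tag to the property it must not contradict.
LABEL_TO_PROPERTY = {
    "en:vegan": "is_vegan",
    "en:vegetarian": "is_vegetarian",
    "en:halal": "is_halal",
}


def label_conflicts(categories, labels):
    """Return (category, label) pairs where the category says 'no'."""
    conflicts = []
    for category in categories:
        properties = CATEGORY_PROPERTIES.get(category, {})
        for label in labels:
            prop = LABEL_TO_PROPERTY.get(label)
            if prop and properties.get(prop) == "no":
                conflicts.append((category, label))
    return conflicts
```

A non-empty result could then be turned into a tag such as vegan-label-non-vegan-ingredient from the TODO list.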

Product/Image/OCR check

https://fr.openfoodfacts.org/produit/4003247203353/creme-amandes-olives-vitaquell
https://static.openfoodfacts.org/images/products/400/324/720/3353/5.json

{u'responses': [{u'logoAnnotations': [{u'score': 0.16612875, u'description': u'AB', u'boundingPoly': {u'vertices': [{u'y': 1507, u'x': 1946}, {u'y': 1507, u'x': 2014}, {u'y': 1630, u'x': 2014}, {u'y': 1630, u'x': 1946}]}}], u'textAnnotations': [{u'locale': u'fr', u'description': u"itaquell\nTartine\nCuisine\nsuggestion\nde pr\xe9sentation\nCreme Amandes\nno avec 12% d'huile dolive\nvierge extra\nNon ouvert, Stocker au frais en-dessous de +10\xb0C\nC E R\nconsommer de pr\xe9f\xe9rence avant le: voir dessous de barquette\nFauser Vitaquell DE-22506 Hamburg\ninfo@vitaquell.de lwww.vitaquell.de\nAGRICULTURE CERTIFIE PAR DE 0K0-013\nDistribution certifi\xe9e par FR-BIO-01\nBIOLOGIQUE AGRICUITUREUEINON UE\ne 125 g\nArt.-Nr: 20335\n4 003247 203353\nVEGAN\n", u'boundingPoly': {u'vertices': [{u'y': 377, u'x': 777}, {u'y': 377, u'x': 2487}, {u'y': 2117, u'x': 2487}, {u'y': 2117, u'x': 777}]}}]}]} 
\nVEGAN\n
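
To use that OCR output for a check, the text can be pulled out of the response and scanned line by line. This is a sketch based only on the structure visible in the JSON above (`responses`, `textAnnotations`, `description`); the function names and the exact-line matching rule are assumptions.

```python
def ocr_text(response):
    """Return the description of the first text annotation, or ''."""
    for item in response.get("responses", []):
        annotations = item.get("textAnnotations", [])
        if annotations:
            return annotations[0].get("description", "")
    return ""


def labels_seen_in_ocr(response, keywords=("VEGAN",)):
    """Return the keywords that appear alone on a line of the OCR text.

    Matching whole lines avoids flagging text such as "NON VEGAN".
    """
    lines = {line.strip().upper() for line in ocr_text(response).splitlines()}
    return [keyword for keyword in keywords if keyword.upper() in lines]
```

The result could be compared with the product's labels: a VEGAN mention on the photo without the matching label, or the reverse, is worth a review.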

@teolemon
Member Author

teolemon commented Nov 27, 2017

Add detection of storage instructions in the ingredients field

Instructions keyword list

A conserver dans un endroit frais
Fabriqué dans un atelier
à conserver l'abri de chaleur et de Ihurnidité.
Conditions de conservation :
Conseils de préparation :
À CONSOMMER AVEC MODÉRATION
À consommer de préférence avant
dont sucres
Matières grasses : 
dont acides gras saturés
Glucides
dont sucres
kcal
Plus d'infos sur
Poids net:
consigne
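
A keyword list like this one can be matched without worrying about case or accents. The sketch below uses a few of the entries above; the normalisation strategy (casefold plus stripping combining marks) and the function names are assumptions.

```python
import unicodedata

# A subset of the keyword list above.
INSTRUCTION_KEYWORDS = (
    "a conserver dans un endroit frais",
    "conditions de conservation",
    "à consommer de préférence avant",
    "poids net",
)


def normalize(text):
    """Lower-case the text and strip accents (À -> a, é -> e)."""
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def find_instruction_keywords(ingredients_text, keywords=INSTRUCTION_KEYWORDS):
    """Return the keywords found in the text, ignoring case and accents."""
    haystack = normalize(ingredients_text)
    return [keyword for keyword in keywords if normalize(keyword) in haystack]
```

With this normalisation, "À conserver dans un endroit frais" in an ingredient list matches the unaccented keyword "a conserver dans un endroit frais".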

@hangy
Member

hangy commented Jan 31, 2018

Taxonomy based solution for checks

en:Pork-based dishes
is_vegan:en:no
is_vegetarian:en:no
is_kosher:en:no
is_halal:en:no

What about a more generic approach? We already see attempts of cross-linking taxonomies. That could be used to denote that a category is incompatible with a label. For example, in the categories taxonomy, we could have

en:Pork-based dishes
!label:en:vegan
!label:en:vegetarian
!label:en:kosher
!label:en:halal

to say that a product in that category cannot be labelled as those. At the same time, it would enable us to support cross-label linking ("auto tagging"?) as already suggested in the categories taxonomy:

<en:Olive oils
en:Olive oils from France
fr:Huiles d'olive de France
nl:Franse olijfoliën
country:en:France
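
To make the idea concrete, here is a sketch of how one taxonomy entry in that style could be read. The `!label:` prefix is the syntax proposed in this comment and the `country:` line comes from the example above; the parser, its output keys, and its name are hypothetical.

```python
def parse_entry(block):
    """Split one taxonomy entry into names, parents and cross-links."""
    entry = {"names": {}, "parents": [], "incompatible_labels": [], "implied": {}}
    for raw_line in block.strip().splitlines():
        line = raw_line.strip()
        if line.startswith("<"):
            # "<en:Olive oils" declares a parent entry.
            entry["parents"].append(line[1:])
        elif line.startswith("!label:"):
            # Proposed syntax: a label this category cannot carry.
            entry["incompatible_labels"].append(line[len("!label:"):])
        elif line.startswith("country:"):
            # Cross-link that could drive auto tagging.
            entry["implied"]["country"] = line[len("country:"):]
        else:
            # "fr:Huiles d'olive de France" -> language code and name.
            lang, _, name = line.partition(":")
            entry["names"][lang] = name
    return entry
```

A product in a category could then be flagged when one of its labels appears in `incompatible_labels`, and tagged automatically with whatever `implied` contains.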

@hangy hangy added this to the Data Quality Checks milestone Mar 25, 2018
@stephanegigandet stephanegigandet added 🧽 Data quality - Measure - Quality facets and removed 🎯 P1 labels Jun 17, 2019
@teolemon teolemon changed the title Create a quality facet with detected errors Quality Facets - Create a quality facet with detected errors Oct 11, 2021