Expiration date format #22

Open
Tracked by #600
hangy opened this issue Apr 7, 2019 · 7 comments

Comments

@hangy
Member

hangy commented Apr 7, 2019

I noticed that in revision 29, robotoff added the expiration date 14/06/2019. In the JSON file for my uploaded picture, the date is written as 14.06.2019 (dd.mm.yyyy). Clearly, there's some kind of processing going on. In my opinion, the date should either be written in the format used by the language of the uploaded picture, or normalized to ISO 8601, so that consumers don't need to guess which number is the day and which is the month. I prefer ISO 8601 for all languages.
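
To illustrate the ISO 8601 option, here is a minimal Python sketch. The function name `to_iso` and the assumption that the input is day-first with `.` separators are illustrative only and do not describe robotoff's actual code.

```python
from datetime import datetime

def to_iso(text, fmt="%d.%m.%Y"):
    """Parse a date string with an explicit format and return ISO 8601 (YYYY-MM-DD)."""
    # Hypothetical helper: the default format assumes a day-first date with dots.
    return datetime.strptime(text, fmt).date().isoformat()

print(to_iso("14.06.2019"))  # 2019-06-14
```

With a year-month-day output, a consumer no longer has to know which locale the label came from.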

@raphael0202
Collaborator

Indeed, there is preprocessing going on: first to check that the candidate is really a date, and then to normalize the date format. However, only dates of format %d/%m/%Y are currently recognized, so mm.dd.yyyy dates are not recognized (unless the month value is also a valid day and vice versa).
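
The parenthetical above (a month value that is also a valid day) can be made concrete with a small sketch. It is not taken from robotoff; `parse_candidates` is a hypothetical helper that tries a day-first and a month-first reading of the same string and keeps every reading that is a valid calendar date.

```python
from datetime import datetime

def parse_candidates(text):
    """Return every valid reading of a dotted numeric date, as ISO 8601 strings."""
    results = {}
    # Hypothetical labels and formats, chosen only to show the ambiguity.
    for label, fmt in (("day-first", "%d.%m.%Y"), ("month-first", "%m.%d.%Y")):
        try:
            results[label] = datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            # This reading is not a valid calendar date (e.g. month 14).
            pass
    return results

print(parse_candidates("14.06.2019"))  # only the day-first reading survives
print(parse_candidates("06.07.2019"))  # both readings are valid, so it is ambiguous
```

When both numbers are 12 or lower, the string alone cannot settle the order; some outside hint, such as the label's language, is needed.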

See https://github.com/openfoodfacts/robotoff/blob/master/robotoff/insights/ocr/expiration_date.py for more info.
I like the idea of normalizing to ISO 8601, but the format will be inconsistent between robotoff- and user-annotated products.

@hangy
Member Author

hangy commented Apr 8, 2019

The regex full_digits_long does match 14.06.2019 if I test it: https://regexr.com/4br3a Otherwise, robotoff wouldn't have added 14/06/2019 as the product's date.
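
For readers without access to the linked test, here is a hedged sketch of a pattern that accepts several separators. It is a hypothetical regex written for illustration, not the actual `full_digits_long` expression from robotoff. It also captures the separator, so the original notation is not lost.

```python
import re

# Hypothetical pattern for illustration; NOT the actual full_digits_long regex.
DATE_RE = re.compile(
    r"(?<!\d)(?P<day>\d{2})(?P<sep>[./-])(?P<month>\d{2})(?P=sep)(?P<year>\d{4})(?!\d)"
)

def find_dates(text):
    """Return (day, month, year, separator) tuples for each numeric date found."""
    return [
        (m["day"], m["month"], m["year"], m["sep"])
        for m in DATE_RE.finditer(text)
    ]

print(find_dates("MHD 14.06.2019"))  # [('14', '06', '2019', '.')]
```

The backreference `(?P=sep)` requires the same separator on both sides, so a mixed string such as `14.06/2019` is rejected.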

@raphael0202
Collaborator

Yes indeed, as I said above robotoff matches dates of format dd.mm.yyyy

@hangy
Member Author

hangy commented Apr 8, 2019

Sorry, I must've misread. In that case, robotoff shouldn't replace the separators.

> I like the idea of normalizing to ISO 8601, but the format will be inconsistent between robotoff- and user-annotated products.

I usually use ISO when editing manually, because it's the only consistent syntax. 😁 But yes, there may be differences. There's an old openfoodfacts-server issue about normalizing the date, but work on it hasn't started yet.

@hangy
Member Author

hangy commented Jun 29, 2019

> Yes indeed, as I said above robotoff matches dates of format dd.mm.yyyy

In https://de.openfoodfacts.org/produkt/4311501619872/harzer-minis-gut-gunstig?rev=20 the expiration date was updated to 06/07/2019 based on the text 06.07.19 in this image: https://de.openfoodfacts.org/images/products/431/150/161/9872/4.jpg The product's main language is German, and dd/mm/yyyy is not a known date pattern in Germany.
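
The label in this example uses a two-digit year (06.07.19). A minimal sketch of handling both year lengths for dotted, day-first dates follows; `parse_dotted_date` is a hypothetical helper, not robotoff code. Note that Python's `%y` maps 00 to 68 onto 2000 to 2068 and 69 to 99 onto 1969 to 1999.

```python
from datetime import datetime

def parse_dotted_date(text):
    """Try dd.mm.yyyy first, then dd.mm.yy; return a date or None."""
    for fmt in ("%d.%m.%Y", "%d.%m.%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            # Format did not match; fall through to the next one.
            continue
    return None

print(parse_dotted_date("06.07.19").isoformat())  # 2019-07-06
```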

@raphael0202
Collaborator

I've fixed the normalization issue by normalizing dates to ISO 8601 in 836b4eb.
Thanks for the report!

Regarding the other issue you mention, I think we could change the detection pattern given the detected language on the image. If most words returned by the OCR are detected as German, the dd/mm/yyyy pattern would not be used.
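
One way this language-dependent selection could look is sketched below. Everything here is an assumption for illustration: the mapping `FORMATS_BY_LANG`, the formats listed per language, and the function `parse_for_language` are hypothetical, and language detection itself is taken as an input rather than implemented.

```python
from datetime import datetime

# Hypothetical mapping; the formats listed per language are illustrative assumptions.
FORMATS_BY_LANG = {
    "de": ("%d.%m.%Y", "%d.%m.%y"),
    "fr": ("%d/%m/%Y", "%d.%m.%Y"),
}
# Fallback when the detected language has no entry (also an assumption).
DEFAULT_FORMATS = ("%d/%m/%Y",)

def parse_for_language(text, lang):
    """Parse `text` using only the formats allowed for `lang`; return ISO 8601 or None."""
    for fmt in FORMATS_BY_LANG.get(lang, DEFAULT_FORMATS):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(parse_for_language("06.07.19", "de"))    # 2019-07-06
print(parse_for_language("06/07/2019", "de"))  # None, slash pattern disabled for German
```

Keeping the formats in a plain table makes it easy to add further patterns per language later without touching the parsing loop.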

@hangy
Member Author

hangy commented Nov 18, 2019

> I've fixed the normalization issue by normalizing dates to ISO 8601 in 836b4eb.

Looks good, thank you!

> Regarding the other issue you mention, I think we could change the detection pattern given the detected language on the image. If most words returned by the OCR are detected as German, the dd/mm/yyyy pattern would not be used.

That's probably a good idea. Wikipedia lists several interesting date patterns that could be used for parsing.
