Load packager codes for Slovenia #8958

Closed
4 tasks
Tracked by #338
benbenben2 opened this issue Sep 5, 2023 · 9 comments · Fixed by #10124
Assignees
Labels
📍🏭 Packager codes https://blog.openfoodfacts.org/en/news/discover-what-food-products-are-made-near-you-with-made-near- 🇸🇮 Slovenia

Comments

@benbenben2
Collaborator

benbenben2 commented Sep 5, 2023

What

Part of

@rkiddy

rkiddy commented Sep 23, 2023

I use tabula, a Java tool, to take apart tables in a PDF from a government report here. I do not use the latest version because I do not like the way it has to be called, so I use tabula-1.0.4-SNAPSHOT-jar-with-dependencies.jar. I can try something with it later today, unless someone else gets to it first.

@rkiddy

rkiddy commented Sep 23, 2023

I can tell that there will be a few issues:

[Screenshot from 2023-09-23 13-43-31]

  • The 4th row, 3rd column looks like 3 lines but should be 2.
  • The 9th row, 2nd column looks like 3 lines but should be 2.
  • The 10th row, 3rd column looks like 4 lines but should be 2.

Yes?

Well, if it is always 2 lines, that would be ok.

@rkiddy

rkiddy commented Sep 23, 2023

I downloaded the pdf file and ran:

 $ java -jar ~/Projects/tabula/tabula-1.0.4-SNAPSHOT-jar-with-dependencies.jar \
     --batch /home/ray/Projects/OFF/slovenia_packaging \
     --lattice --format TSV --pages all

I got a tsv file, which I uploaded (with a copy of the pdf) to https://opencalaccess.org/OFF/slovenia_packaging/ and you can download it from there. But I will look to see what fixes need to be made to the tsv file.

If this is a workable approach, how can this be invoked? Will it be something that is done only once, or will it need to be repeated? Will it need to be done automatically?

@benbenben2
Collaborator Author

This is great @rkiddy !
Thank you!

@rkiddy

rkiddy commented Sep 24, 2023

Is the TSV sufficient? Are you able to process that? Is anything else needed? Are there going to be updates to the PDF that will have to be tracked? Or is this ticket done for now?

@benbenben2
Collaborator Author

Remaining tasks:

  • clean the file (remove duplicated headers; each packaging code should have one address, but the file has two, so we need to find out which one to keep when they differ)
  • create script to add geo coordinates
  • update front to add address for slovenian packaging codes
  • recreate .sto files for packaging

Last two steps are pretty easy, similar to previous pull request for Croatia.
Hopefully for geo coordinates, we can also reuse parts of the script used for Croatia
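
For the first remaining task, a minimal sketch of dropping the repeated header rows could look like the following. It is written in Python, one of the languages already used in scripts/packager-codes/. The function name, column names, and sample rows are made up for illustration and are not taken from the actual Slovenian file.

```python
import csv
import io


def drop_repeated_headers(tsv_text):
    """Return (header, rows), skipping blank rows and any row identical to the header."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row and row != header]
    return header, rows


# Hypothetical sample: the header repeats where a new PDF page started.
sample = (
    "code\tname\taddress\n"
    "SI 1\tEXAMPLE A\tSTREET 1\n"
    "code\tname\taddress\n"
    "SI 2\tEXAMPLE B\tSTREET 2\n"
)
header, rows = drop_repeated_headers(sample)
print(header)
print(rows)
```

This only handles headers that repeat verbatim. If tabula splits a header cell differently from page to page, the comparison would need to be loosened.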

@rkiddy

rkiddy commented Sep 25, 2023

The addresses seem as though they usually have a relationship to each other, but they are different. Can we ask someone to review the Slovenian documentation here? It seems as though one address could be a physical location, such as where something gets delivered, and the other could be the address of the managing business, or something like that. If there are two different addresses there, and they represent different kinds of something, which one do we want to keep? There is no way for us to guess.

@Ban3
Contributor

Ban3 commented Sep 26, 2023

If there are two different addresses there, and they represent different kinds of something, which one do we want to keep?

The lists usually have the company headquarters address and the manufacturing location address. The codes are per manufacturing location so that's the one you want.

I'm not Slovenian and don't speak the language, but the second address seems to be the actual location. I'm basing this on the fact that companies that appear multiple times have the same first address but different second addresses. For example, 'MERCATOR D.O.O.' always has 'DUNAJSKA CESTA 107, 1000 LJUBLJANA' as the first address, but the second varies. The same order seems to apply to the title field.
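
One way to check this observation across the whole file is sketched below in Python. It assumes each row has already been split into a company name, a first address, and a second address. The function name and the placeholder site addresses are hypothetical; only the MERCATOR headquarters address is quoted from the file.

```python
from collections import defaultdict


def distinct_addresses_per_company(rows):
    """Map company name -> (set of first addresses, set of second addresses)."""
    seen = defaultdict(lambda: (set(), set()))
    for name, first_address, second_address in rows:
        firsts, seconds = seen[name]
        firsts.add(first_address)
        seconds.add(second_address)
    return dict(seen)


# The second addresses below are placeholders, not real data.
rows = [
    ("MERCATOR D.O.O.", "DUNAJSKA CESTA 107, 1000 LJUBLJANA", "PLACEHOLDER SITE 1"),
    ("MERCATOR D.O.O.", "DUNAJSKA CESTA 107, 1000 LJUBLJANA", "PLACEHOLDER SITE 2"),
]
for name, (firsts, seconds) in distinct_addresses_per_company(rows).items():
    # Print how many distinct first and second addresses each company has.
    print(name, len(firsts), len(seconds))
```

If the first count stays at 1 while the second grows for most companies with several sites, that supports keeping the second address.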

@benbenben2
Collaborator Author

benbenben2 commented Sep 26, 2023

There are some duplicated packaging codes (SI 873, for example).
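
A quick way to list such duplicates, as a Python sketch. Only SI 873 comes from the file; the other codes are invented, and the whitespace and case normalisation is an assumption about how the codes might vary.

```python
from collections import Counter


def duplicated_codes(codes):
    """Return codes that appear more than once, after normalising whitespace and case."""
    normalised = [" ".join(code.split()).upper() for code in codes]
    counts = Counter(normalised)
    return sorted(code for code, n in counts.items() if n > 1)


# Hypothetical input, apart from SI 873.
codes = ["SI 873", "SI 1", "si  873", "SI 2"]
print(duplicated_codes(codes))
```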

@rkiddy, to explain a bit more about how it works:

  1. In lib/ProductOpener/Display.pm, the information about the packaging is looked up in the %packager_codes hash, which is defined in PackagerCodes.pm as follows, then exported and used in Display.pm:
		my $packager_codes_ref = retrieve("$data_root/packager-codes/packager_codes.sto");
		%packager_codes = %{$packager_codes_ref};

Hence this issue is about a) updating the packager_codes.sto and geocode_adresses.sto files, and b) updating the Display.pm file.

  2. To update the files, if you have cloned the project locally (and installed Docker, etc.), you just run:
docker exec -it po-backend-1 bash
./scripts/update_packager_codes.pl

The first line opens a shell inside the Docker container and the second recreates the files (the script runs inside Docker).
Note that this script is in the repository, but you only run it locally. It is not used by Open Food Facts itself.
Before running the script, you have to update it to include Slovenia and to state which column of the TSV file contains the packaging code. The TSV file should go in the packager-codes folder (this folder is not used by Open Food Facts itself; it is only used when someone recreates the .sto files).

  3. To create the TSV file (it can be a CSV as well), we can start from the current TSV file and rework it. To rework it, we can create a script, which we will put in the scripts/packager-codes/ folder. This script will also be in the repository, but we will only run it locally. We will be able to reuse it in the future if we want to update the packaging codes for Slovenia.
    Most scripts there are in Perl, but not all: there is also a Python script and a shell script. I would suggest creating a script based on the Croatian one (hr-packagers-refresh.pl) because it uses OpenStreetMap instead of Google Maps (Google Maps needs an API key and is not free).

See here for example:
https://github.com/openfoodfacts/openfoodfacts-server/pull/8921/files#diff-ee6a80b0f8a2d4d57b39ccd1eb9c35d82141642119110aa3d752615ae5f40ccc
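
For the geo coordinates step, a rough Python sketch of an OpenStreetMap lookup follows. It assumes the public Nominatim search endpoint, which may not be exactly what hr-packagers-refresh.pl does. The function names and the user agent value passed by the caller are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Nominatim instance is used for geocoding.
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(address, country_code="si"):
    """Build a Nominatim search URL that returns at most one JSON result."""
    params = {"q": address, "format": "json", "limit": 1, "countrycodes": country_code}
    return NOMINATIM_SEARCH + "?" + urllib.parse.urlencode(params)


def geocode(address, user_agent):
    """Return (lat, lon) for the address, or None if nothing was found."""
    request = urllib.request.Request(
        build_search_url(address), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)
    if not results:
        return None
    return float(results[0]["lat"]), float(results[0]["lon"])


# Only builds the URL; no network request is made here.
print(build_search_url("DUNAJSKA CESTA 107, 1000 LJUBLJANA"))
```

The public Nominatim service asks for an identifying user agent and at most one request per second, so a real script would need to sleep between lookups and cache results rather than re-query addresses it has already resolved.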

@teolemon teolemon added 📍🏭 Packager codes https://blog.openfoodfacts.org/en/news/discover-what-food-products-are-made-near-you-with-made-near- 🇸🇮 Slovenia labels Sep 30, 2023
@benbenben2 benbenben2 self-assigned this Apr 17, 2024