docs: update docs about downloading images (#8415)
Add reference to AWS dataset
raphael0202 committed May 15, 2023
1 parent 0a73218 commit dbf1da6
Showing 2 changed files with 99 additions and 12 deletions.
44 changes: 44 additions & 0 deletions docs/api/aws-images-dataset.md
@@ -0,0 +1,44 @@
# Open Food Facts AWS images dataset

The Open Food Facts images dataset contains all images uploaded to Open Food
Facts and the OCR results on these images obtained using Google Cloud Vision.

The dataset is stored in the `openfoodfacts-images` bucket hosted in the
`eu-west-3` region. All data is stored in a single `/data` folder.

Data is synchronized between the Open Food Facts server and the S3 bucket every
month, so some recent images are likely to be missing. Do not assume that every
image is present in the S3 bucket.

To find the bucket key associated with an image of the product with barcode
'4012359114303', first split the barcode as follows:
`/401/235/911/4303`.

This splitting only applies to EAN13 barcodes (13 digits). For barcodes with
fewer digits (such as EAN8), the directory path is not split: `/20065034`.

To get the raw image '1' for barcode '4012359114303', simply add the image ID:
`/401/235/911/4303/1.jpg`. Here, you will get the "raw" image, as sent by the
contributor. If you don't need the full resolution image, a 400px resized
version is also available, by adding the `.400` suffix after the image ID:
`/401/235/911/4303/1.400.jpg`.

The OCR result for an image is a gzipped JSON file with the same file name as
the raw image, but with the `.json.gz` extension: `/401/235/911/4303/1.json.gz`.
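
The key-building rules above can be sketched in Python as follows. This is a
minimal illustration, not an official client: the function names are
hypothetical, and it assumes that object keys start with `data/` (as in the
download URL used in this document) and that any barcode that is not 13 digits
long is left unsplit, which the text only states explicitly for shorter
barcodes such as EAN8.

```python
S3_BASE_URL = "https://openfoodfacts-images.s3.eu-west-3.amazonaws.com"


def barcode_to_path(barcode: str) -> str:
    """Return the directory path for a barcode, e.g. '/401/235/911/4303'."""
    if len(barcode) == 13:
        # EAN13: three groups of three digits, then the remaining four digits.
        return f"/{barcode[0:3]}/{barcode[3:6]}/{barcode[6:9]}/{barcode[9:]}"
    # Assumption: every other length is used as-is (documented for EAN8).
    return f"/{barcode}"


def image_keys(barcode: str, image_id: str) -> dict[str, str]:
    """Return the keys of the raw image, its 400px version and its OCR file."""
    prefix = f"data{barcode_to_path(barcode)}/{image_id}"
    return {
        "raw": f"{prefix}.jpg",
        "400": f"{prefix}.400.jpg",
        "ocr": f"{prefix}.json.gz",
    }


def key_to_url(key: str) -> str:
    """Return the HTTP URL of an object key in the bucket."""
    return f"{S3_BASE_URL}/{key}"
```

For example, `key_to_url(image_keys("4012359114303", "1")["raw"])` builds the
URL of the raw image `1` of that product.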

To download images, you can either use AWS CLI, or perform an HTTP request
directly:

`wget https://openfoodfacts-images.s3.eu-west-3.amazonaws.com/data/401/235/911/4303/1.jpg`

You can list all existing objects (images, OCR results) in the bucket by
downloading the gzipped text file `s3://openfoodfacts-images/data/data_keys.gz`:

`wget https://openfoodfacts-images.s3.eu-west-3.amazonaws.com/data/data_keys.gz`

Then you can easily filter the files you want using `grep` (raw images, OCR
JSON) before downloading them. For example, to keep only 400px versions of all
images:

`zcat data_keys.gz | grep '\.400\.jpg$'`
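
The same filtering can be done without leaving Python. The sketch below
assumes that `data_keys.gz` has already been downloaded and that it contains
one object key per line; the function name `iter_keys` is hypothetical.

```python
import gzip
from typing import Iterator


def iter_keys(path: str, suffix: str) -> Iterator[str]:
    """Yield the keys listed in a gzipped text file that end with `suffix`."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            key = line.strip()
            # A plain suffix check avoids any regex escaping issue.
            if key.endswith(suffix):
                yield key
```

Calling `iter_keys("data_keys.gz", ".400.jpg")` keeps only the 400px versions,
and `iter_keys("data_keys.gz", ".json.gz")` keeps only the OCR results.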
67 changes: 55 additions & 12 deletions docs/api/how-to-download-images.md
@@ -1,11 +1,44 @@
# How to download product images

The preferred method of downloading Open Food Facts images depends on what you
wish to achieve.

If you want to download a limited number of images, especially if these images
have been uploaded recently, you should [download the image from Open Food
Facts
server](./how-to-download-images.md#download-from-open-food-facts-server).

If you plan to download a large number of images, you should instead
[use the Open Food Facts images dataset hosted on
AWS](./how-to-download-images.md#download-from-aws).

## Download from AWS

If you want to download a large number of images, this is the recommended
option: AWS S3 is faster and allows concurrent downloads, unlike the Open Food
Facts server, where you should preferably download images one at a time. See
[AWS Images dataset](./aws-images-dataset.md) for more information about how to
download images from the AWS dataset.
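
As an illustration of concurrent downloads from the bucket, here is a minimal
sketch using only the Python standard library. The function names, the
`max_workers` default and the `fetch_fn` parameter are hypothetical choices
made for this example, and the keys are assumed to have the form
`data/401/235/911/4303/1.jpg`.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable
from urllib.request import urlopen

S3_BASE_URL = "https://openfoodfacts-images.s3.eu-west-3.amazonaws.com"


def fetch(url: str) -> bytes:
    """Download a URL and return the response body."""
    with urlopen(url, timeout=30) as response:
        return response.read()


def download_keys(
    keys: Iterable[str],
    dest_dir: str,
    max_workers: int = 8,
    fetch_fn: Callable[[str], bytes] = fetch,
) -> list[Path]:
    """Download several bucket keys in parallel, mirroring the key layout."""
    dest = Path(dest_dir)

    def download_one(key: str) -> Path:
        target = dest / key
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(fetch_fn(f"{S3_BASE_URL}/{key}"))
        return target

    # Threads are enough here: the work is dominated by network I/O.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_one, keys))
```

The `fetch_fn` parameter exists only so that the download step can be replaced,
for instance by a version that retries on failure.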

## Download from Open Food Facts server

All images can be found at
[https://images.openfoodfacts.org/images/products/](https://static.openfoodfacts.org/images/products/).
Images of a product are stored in a single directory. The path of this
directory can be inferred easily from the product barcode. If the product
barcode length is less than or equal to 8 (e.g. "22222222"), the directory path
is simply the barcode: all images can be found at
`https://images.openfoodfacts.org/images/products/{barcode}`.

Otherwise, the following regex is used to split the barcode into subfolders:
`r"^(...)(...)(...)(.*)$"`. For example, the barcode `3435660768163` is split as
follows: `343/566/076/8163`, and all images of the product can be found at
[https://images.openfoodfacts.org/images/products/343/566/076/8163](https://images.openfoodfacts.org/images/products/343/566/076/8163).
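
The two cases above can be combined into one small helper. This is a sketch
under stated assumptions: the name `product_image_dir` is hypothetical, and
dropping an empty last group (for a barcode of exactly 9 characters) is a
choice made here, not something the text specifies.

```python
import re

IMAGES_BASE_URL = "https://images.openfoodfacts.org/images/products"
BARCODE_SPLIT_RE = re.compile(r"^(...)(...)(...)(.*)$")


def product_image_dir(barcode: str) -> str:
    """Return the URL of the directory holding all images of a product."""
    if len(barcode) <= 8:
        # Short barcodes are used as-is.
        return f"{IMAGES_BASE_URL}/{barcode}"
    match = BARCODE_SPLIT_RE.match(barcode)
    if match is None:
        raise ValueError(f"cannot split barcode: {barcode!r}")
    # Assumption: an empty trailing group is dropped rather than kept.
    parts = [group for group in match.groups() if group]
    return f"{IMAGES_BASE_URL}/{'/'.join(parts)}"
```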

To get the image file names, use the database dump or the API. All image
information is stored in the `images` field. For product
[3168930010883](https://world.openfoodfacts.org/api/v0/product/3168930010883.json),
we have:

```json
{
@@ -180,21 +213,31 @@ To get the image file names, we have to use the database dump or the API. All im

The keys of the map are the keys of the images. These keys can be:

- digits: the image is the raw image sent by the contributor (full resolution).
- selected images: `front_{lang}`, `nutrition_{lang}` and
`ingredients_{lang}`, selected as front, nutrition and ingredients images
respectively for `lang`. Here, `lang` is a 2-letter ISO 639-1 language code
(fr, en, es, ...).

Each image is available in different resolutions: `100`, `200`, `400` or
`full`, each corresponding to image height (`full` means not resized). The
available resolutions can be found in the `sizes` subfield.

Selected images have additional fields:

- `rev` (as revision) indicates the revision number of the image to use (each
time a new image is selected, cropped or rotated, a new image with an
incremented rev is generated).
- `imgid`, the image ID of the raw image used to generate the selected image.
- `angle`, `x1`, `x2`, `y1`, `y2`: rotation angle and cropping coordinates.

For selected images, the file name is the image key followed by the revision
number and the resolution: `front_fr.1.400.jpg`. For raw images, the file name
is either the image ID (`1.jpg`) or the image ID followed by the resolution
(`1.100.jpg`).

To get the full URL, simply concatenate the product directory path and the
image name. Examples:

- [https://images.openfoodfacts.org/images/products/343/566/076/8163/1.jpg](https://images.openfoodfacts.org/images/products/343/566/076/8163/1.jpg)
- [https://images.openfoodfacts.org/images/products/343/566/076/8163/1.400.jpg](https://images.openfoodfacts.org/images/products/343/566/076/8163/1.400.jpg)

If you want to download a significant number of images, let us know beforehand
on our [Slack](https://slack.openfoodfacts.org/), and don't be too eager, so
that our servers stay safe!
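
The file-naming rules described above can be sketched as follows. The function
names are hypothetical, and `info` is assumed to be one entry of the `images`
field (only its `rev` value is used, and only for selected images).

```python
def image_file_name(key: str, info: dict, resolution: str = "full") -> str:
    """Return the file name of an image for the requested resolution."""
    if key.isdigit():
        # Raw image: '1.jpg' at full resolution, '1.100.jpg' when resized.
        if resolution == "full":
            return f"{key}.jpg"
        return f"{key}.{resolution}.jpg"
    # Selected image: key, then revision number, then resolution.
    return f"{key}.{info['rev']}.{resolution}.jpg"


def image_url(product_dir: str, file_name: str) -> str:
    """Concatenate the product directory URL and the image file name."""
    return f"{product_dir.rstrip('/')}/{file_name}"
```

For instance, `image_file_name("front_fr", {"rev": "1"}, "400")` gives
`front_fr.1.400.jpg`, matching the example in the text.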
