
Investigate database export as parquet file #7660

Open
CharlesNepote opened this issue Nov 7, 2022 · 12 comments
Labels: Data export (We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data), delta exports, ✅ Task

Comments

@CharlesNepote (Member) commented Nov 7, 2022

Apache Parquet is a file format designed to make data easier to manipulate.

  • allows VERY fast queries (milliseconds versus dozens of seconds with grep/csvgrep)
  • allows native internal compression
  • allows very high compression ratios due to the nature of the storage (column-based)
  • it might make it easier to update an old export with new data

Our current CSV (2023-03): 7.5 GB.
Parquet file generated from this CSV (thanks to csv2parquet): 643 MB, with internal zstd compression. Here are the steps to reproduce:

  • wget -c https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv # get CSV file
  • csvclean -t en.openfoodfacts.org.products.csv # clean up the CSV and convert it from TSV to CSV -- ~6'30s on my laptop
  • ./csv2parquet --header true -n en.openfoodfacts.org.products_out.csv products.pqt > parquet.schema
  • you might edit the generated parquet.schema file
    • for the code field => "data_type": "Utf8", instead of "data_type": "Int64",
    • for the serving_quantity field => "data_type": "Float64", instead of "data_type": "Utf8",
  • time ./csv2parquet --header true -c zstd -s parquet.schema en.openfoodfacts.org.products_out.csv products_zstd.pqt # ~1'45s on my laptop

It's a "young" format (born in 2013). Some tools are already reading/writing Parquet files. The easiest way to read/write Parquet files is to use duckdb (as easy to install as sqlite), natively reading or writing it (without import).
To give a simple example:

time ./duckdb test-duck.db "select * FROM
  (select count(data_quality_errors_tags) as products_with_issues 
    from read_parquet('products_zstd.pqt') where data_quality_errors_tags != ''),
  (select count(data_quality_errors_tags) as products_with_issues_but_without_images 
    from read_parquet('products_zstd.pqt') where data_quality_errors_tags != '' and last_image_datetime == '');"
┌──────────────────────┬─────────────────────────────────────────┐
│ products_with_issues │ products_with_issues_but_without_images │
├──────────────────────┼─────────────────────────────────────────┤
│ 156427               │ 13974                                   │
└──────────────────────┴─────────────────────────────────────────┘

real	0m0,250s
user	0m0,695s
sys	0m0,140s

The same query on a SQLite DB built from the same CSV export takes more than 10 s:

time sqlite3 products.db "select * FROM 
  (select count(data_quality_errors_tags) as products_with_issues
     from [all] where data_quality_errors_tags != ''),
  (select count(data_quality_errors_tags) as products_with_issues_but_without_images
     from [all] where data_quality_errors_tags != '' and last_image_datetime == '');"
156427|13974

real	0m11.038s
user	0m1.016s
sys	0m10.010s

You can reproduce this query on our Datasette instance.

Due to the nature of the format (column-based), there is an overhead for some queries: select *, for example, needs to iterate over all the columns (dozens if not hundreds in our case). Here is an example extracting all columns for the first two products.

time ./duckdb test-duck.db "select * FROM read_parquet('products_zstd.pqt') limit 2;" > /dev/null

real	0m4,377s
user	0m7,068s
sys	0m3,642s

That said, it should be a far better format than CSV for all operations that don't touch every column, and even for those that do when the queries are complex.

(Thanks @moreymat for suggesting I explore it.)

[EDIT]
Repeated the last test with 10,000 and 100,000 products:

time ./duckdb test-duck.db "select * FROM read_parquet('products_zstd.pqt') limit 10000;" > /dev/null

real	0m5,876s
user	0m8,830s
sys	0m3,670s

charles@barbapapa2020:~/ajeter2$ time ./duckdb test-duck.db "select * FROM read_parquet('products_zstd.pqt') limit 100000;" > /dev/null

real	0m21,756s
user	0m25,963s
sys	0m4,618s
@CharlesNepote (Member Author)

As a Parquet file can be manipulated with SQL queries thanks to duckdb, it should be possible to merge new data with older data in a single SQL query (tests to be done); see the sketch below.
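
A minimal sketch of what such a merge could look like (not tested; file names are placeholders, and it assumes last_modified_datetime is a reliable change marker):

COPY (
  SELECT * FROM read_parquet('products_old.parquet')
  UNION ALL
  SELECT * FROM read_parquet('products_new.parquet')
  WHERE last_modified_datetime >
    (SELECT max(last_modified_datetime) FROM read_parquet('products_old.parquet'))
) TO 'products_merged.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');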

@moreymat commented Nov 7, 2022

@CharlesNepote thanks a lot for following up on our conversation on this topic, and for investigating the gains!

The gains in file size and query speed are impressive.

Two questions:

  1. Your latest query is slow, but I expect the execution time to increase very slowly whether you select 2, 100, 1000, 10000, 1000000 or all products (i.e. the execution time is clearly dominated by the number of columns). Would you be able to confirm this, or correct my assumption?

  2. Parquet makes it possible to split a dataset into files corresponding to subsets defined by criteria (e.g. days for time series with sub-hourly data). I was thinking that the barcode could be a good key to split the dataset: e.g. one file for products with an EAN-8 code, another for products with EAN-13, and another for all other codes; or more fine-grained splits based on GS1 code prefixes. I assume these are natural subsets that are frequently queried separately; e.g. 99% of scans in France would be on products within 1 to 5 GS1 prefixes, so the split could be used to cache results if relevant, or you could apply different processing routines or define different quality checks on these subsets. (See the partitioning sketch below.)
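
A minimal sketch of that kind of split, using DuckDB's hive-partitioned COPY (the code_prefix column and the file names are purely illustrative, not an agreed scheme):

COPY (
  SELECT *, substr(code, 1, 3) AS code_prefix  -- crude prefix, for illustration only
  FROM read_parquet('products.parquet')
) TO 'products_by_prefix' (FORMAT PARQUET, PARTITION_BY (code_prefix), COMPRESSION 'ZSTD');

-- consumers can then read a single partition:
SELECT count(*) FROM read_parquet('products_by_prefix/code_prefix=303/*.parquet');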

@github-actions bot (Contributor) commented Feb 6, 2023

This issue is stale because it has been open 90 days with no activity.

@ericemc3 commented Sep 5, 2023

A parquet version would be very welcome indeed! Currently, parsing en.openfoodfacts.org.products.csv is difficult because of, apparently, 15 lines whose 'ingredients_text' contains \"
(for instance: \"ce produit contient des sulfites\").

Anyway, this DuckDB query seems to do the job of converting to a parquet version (840 MB):

COPY (
FROM read_csv_auto('C:/.../en.openfoodfacts.org.products.csv', quote='\\"') 
) TO 'C:/.../en.openfoodfacts.org.products.parquet';

And then, this sample query

SELECT last_modified_by, count(*) FROM 'C:/.../en.openfoodfacts.org.products.parquet' 
GROUP BY 1
ORDER BY 2 DESC LIMIT 10;

will run in 200 ms

@linogaliana
Hi @CharlesNepote and other usual partners in (open data) crime!

This is a very encouraging discussion!

I am 💯% in favor of having a parquet file alongside the csv (since parquet is not yet universally known).

Parquet files at Insee

We aim to use parquet more and more in the French statistical system:

  • For internal data use, there are now guidelines for sharing data as parquet between producers and users.
  • For external data dissemination, Insee published its first parquet file on datagouv a few months ago. We hope that more and more datasets can be disseminated as parquet 🤞

In Insee's innovation teams, we advocate a lot for a more general use of parquet files.

  • We have recommendations about parquet in our good-practices training (we don't totally discard CSV there; innovation is a step-by-step process). Next week I should publish a blog post about handling parquet files with DuckDB.
  • In a Python tutorial on dataviz I built (here), I propose improving performance by using DuckDB for data wrangling. I have also created an Observable notebook to test it with WASM. I only spent a few minutes on the notebook, so it is full of bugs, but it shows that it would be technically possible to read parquet tables in the browser if they were produced.

Other sources of inspiration:

  • The HuggingFace viewer is, behind the scenes, based on parquet + duckdb WASM
  • SSPCloud is going to implement a file explorer based on parquet + duckdb
  • The latest duckdb release might be a game changer since it will make it easier to use duckdb in third-party software. For instance, Tad is quite nice and its long-standing issue about proxies is now solved.
  • @ericemc3 wrote nice blog posts about parquet files
  • A paper about parquet published in one of the journals edited by Insee.

Challenge

I think some variables might be quite challenging if the goal is a small parquet file that can be handled easily. Among them: categories and ingredients.

These columns are sparse, but when information is present it consists of very long concatenated strings.
This is, I think, the potential limit of parquet for your use case: you don't always have flat information in your database.

Maybe these columns could be kept out of a lighter parquet file that would store basic nutritional facts alongside basic product information (name, EAN, etc.). This would give a lightweight file allowing most users to handle basic nutritional data. A star schema joining variables across several files could then do the trick.

I don't know if this idea is a good one. However, in my opinion, if you want to foster external web apps or dataviz connected to your dataset, this could help (see the sketch below).
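
A rough illustration of that split (column choices are illustrative only; the heavy text columns go to a second file that can be joined back on code):

COPY (SELECT code, product_name, brands, quantity, nutriscore_grade, energy_100g, fat_100g
      FROM read_parquet('products.parquet'))
TO 'products_core.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');

COPY (SELECT code, categories, categories_tags, ingredients_text, ingredients_tags
      FROM read_parquet('products.parquet'))
TO 'products_text.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');

-- users who need both simply join:
SELECT c.code, c.product_name, t.ingredients_text
FROM read_parquet('products_core.parquet') c
JOIN read_parquet('products_text.parquet') t USING (code);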

@CharlesNepote (Member Author) commented Nov 25, 2023

I have run some tests to evaluate:

  • whether duckdb is a good option to generate parquet files
  • the differences in file size and processing speed depending on some duckdb options.

(@ericemc3: the import seems to be OK if you provide quote=''.)

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='') 
) TO 'en.openfoodfacts.org.products.2023-11-24.parquet';
EOF

real 2m4,673s

$ ls -lh en.openfoodfacts.org.products.2023-11-24.parquet

-rw-r--r-- 1 charles charles 839M 25 nov.  10:45 en.openfoodfacts.org.products.2023-11-24.parquet

Does a smaller ROW_GROUP_SIZE than the default of 122880 change things? With 50000, the file is 17 MB bigger (2%) and the conversion is 11 s faster (9%).

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='') 
) TO 'en.openfoodfacts.org.products.2023-11-24.gps50000.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 50000);
EOF

real    1m53,181s

$ ls -lh en.openfoodfacts.org.products.2023-11-24.gps50000.parquet

-rw-r--r-- 1 charles charles 856M 25 nov.  11:30 en.openfoodfacts.org.products.2023-11-24.gps50000.parquet

Does a bigger ROW_GROUP_SIZE than the default of 122880 change things? With 300000, the file is 13 MB smaller (1.5%) and the conversion is 15 s slower (12%).

time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='') 
) TO 'en.openfoodfacts.org.products.2023-11-24.gps300000.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 300000);
EOF

real    2m19,405s

$ ls -lh en.openfoodfacts.org.products.2023-11-24.gps300000.parquet

-rw-r--r-- 1 charles charles 826M 25 nov.  11:22 en.openfoodfacts.org.products.2023-11-24.gps300000.parquet

Is compression useful? Clearly yes: the file is 313 MB smaller (37%), with a small impact on speed (13 s, 10%).

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='') 
) TO 'en.openfoodfacts.org.products.2023-11-24.zstd.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF

real    2m17,476s

$  ls -lh en.openfoodfacts.org.products.2023-11-24.zstd.parquet

-rw-r--r-- 1 charles charles 527M 25 nov.  11:18 en.openfoodfacts.org.products.2023-11-24.zstd.parquet

Compression + ROW_GROUP_SIZE 50000 doesn't make much of a difference: roughly the same time as with no options, and a size of 541 MB. In our context, priority should be given to file size, to lower the barrier to usage. Time savings during parquet creation are low (~10%) whatever the duckdb options. In conclusion, I would tend to use zstd compression without any other option.

In my next tests:

  • I will check whether duckdb has detected the right type for each column.
  • I will evaluate whether it's possible to produce data diffs, in order to 1. store snapshots of the DB in the same parquet file, and/or 2. provide small parquet diffs allowing the whole DB to be rebuilt as of a certain date (see the sketch below).
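
A minimal sketch of such a diff export with DuckDB (not tested; file names are placeholders, and it assumes that (code, last_modified_datetime) identifies a product revision):

COPY (
  SELECT n.*
  FROM read_parquet('products_new.parquet') n
  LEFT JOIN read_parquet('products_prev.parquet') p
    ON n.code = p.code AND n.last_modified_datetime = p.last_modified_datetime
  WHERE p.code IS NULL  -- keep only rows that are new or modified since the previous snapshot
) TO 'products_diff.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');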

@CharlesNepote (Member Author) commented Dec 12, 2023

If we add the products that have been modified to the parquet file, we are able to query only the latest version of each product with: select *, max(last_modified_date) from "products.parquet" group by code;

See: http://sqlfiddle.com/#!5/5be4d/1/0

TODO: see if it's possible to create views or other mechanisms (sketched below) to:

  • query only the latest version of the products
  • query the products as of a certain date
  • compare the products from a certain date with their latest version
  • compare the products from two different dates
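
A minimal sketch of the first two as DuckDB views over a historized file (assuming one row per product revision, keyed by code + last_modified_datetime; names are placeholders):

CREATE VIEW products_latest AS
  SELECT * FROM read_parquet('products.parquet')
  QUALIFY row_number() OVER (PARTITION BY code ORDER BY last_modified_datetime DESC) = 1;

-- "as of" a given date: keep the most recent revision not newer than that date
CREATE VIEW products_as_of_2023_12_01 AS
  SELECT * FROM read_parquet('products.parquet')
  WHERE last_modified_datetime <= TIMESTAMP '2023-12-01 00:00:00'
  QUALIFY row_number() OVER (PARTITION BY code ORDER BY last_modified_datetime DESC) = 1;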

@ericemc3
Is this useful?
SELECT code, arg_max(id, ldate::date) FROM tbl GROUP BY code;

@CharlesNepote (Member Author)
Side note: for the Open Food Facts CSV it is very important to use sample_size = 3000000 to allow correct column type detection.
Otherwise, many *_100g fields are detected as BIGINT or VARCHAR instead of DOUBLE.

All *_100g fields should be detected as DOUBLE except nutrition-score-fr_100g and nutrition-score-uk_100g, which are BIGINT. With sample_size = 3000000 the import takes longer, but type detection is impressive.

$ ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.csv', quote='', sample_size = 3000000) 
) TO 'en.openfoodfacts.org.products.zstd.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF

$ ./duckdb x.db "DESCRIBE SELECT * FROM 'en.openfoodfacts.org.products.zstd.parquet';" -box
┌───────────────────────────────────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│                      column_name                      │ column_type │ null │ key │ default │ extra │
├───────────────────────────────────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ code                                                  │ VARCHAR     │ YES  │     │         │       │
│ url                                                   │ VARCHAR     │ YES  │     │         │       │
│ creator                                               │ VARCHAR     │ YES  │     │         │       │
│ created_t                                             │ BIGINT      │ YES  │     │         │       │
│ created_datetime                                      │ TIMESTAMP   │ YES  │     │         │       │
│ last_modified_t                                       │ BIGINT      │ YES  │     │         │       │
│ last_modified_datetime                                │ TIMESTAMP   │ YES  │     │         │       │
│ last_modified_by                                      │ VARCHAR     │ YES  │     │         │       │
│ product_name                                          │ VARCHAR     │ YES  │     │         │       │
│ abbreviated_product_name                              │ VARCHAR     │ YES  │     │         │       │
│ generic_name                                          │ VARCHAR     │ YES  │     │         │       │
│ quantity                                              │ VARCHAR     │ YES  │     │         │       │
│ packaging                                             │ VARCHAR     │ YES  │     │         │       │
│ packaging_tags                                        │ VARCHAR     │ YES  │     │         │       │
│ packaging_en                                          │ VARCHAR     │ YES  │     │         │       │
│ packaging_text                                        │ VARCHAR     │ YES  │     │         │       │
│ brands                                                │ VARCHAR     │ YES  │     │         │       │
│ brands_tags                                           │ VARCHAR     │ YES  │     │         │       │
│ categories                                            │ VARCHAR     │ YES  │     │         │       │
│ categories_tags                                       │ VARCHAR     │ YES  │     │         │       │
│ categories_en                                         │ VARCHAR     │ YES  │     │         │       │
│ origins                                               │ VARCHAR     │ YES  │     │         │       │
│ origins_tags                                          │ VARCHAR     │ YES  │     │         │       │
│ origins_en                                            │ VARCHAR     │ YES  │     │         │       │
│ manufacturing_places                                  │ VARCHAR     │ YES  │     │         │       │
│ manufacturing_places_tags                             │ VARCHAR     │ YES  │     │         │       │
│ labels                                                │ VARCHAR     │ YES  │     │         │       │
│ labels_tags                                           │ VARCHAR     │ YES  │     │         │       │
│ labels_en                                             │ VARCHAR     │ YES  │     │         │       │
│ emb_codes                                             │ VARCHAR     │ YES  │     │         │       │
│ emb_codes_tags                                        │ VARCHAR     │ YES  │     │         │       │
│ first_packaging_code_geo                              │ VARCHAR     │ YES  │     │         │       │
│ cities                                                │ VARCHAR     │ YES  │     │         │       │
│ cities_tags                                           │ VARCHAR     │ YES  │     │         │       │
│ purchase_places                                       │ VARCHAR     │ YES  │     │         │       │
│ stores                                                │ VARCHAR     │ YES  │     │         │       │
│ countries                                             │ VARCHAR     │ YES  │     │         │       │
│ countries_tags                                        │ VARCHAR     │ YES  │     │         │       │
│ countries_en                                          │ VARCHAR     │ YES  │     │         │       │
│ ingredients_text                                      │ VARCHAR     │ YES  │     │         │       │
│ ingredients_tags                                      │ VARCHAR     │ YES  │     │         │       │
│ ingredients_analysis_tags                             │ VARCHAR     │ YES  │     │         │       │
│ allergens                                             │ VARCHAR     │ YES  │     │         │       │
│ allergens_en                                          │ VARCHAR     │ YES  │     │         │       │
│ traces                                                │ VARCHAR     │ YES  │     │         │       │
│ traces_tags                                           │ VARCHAR     │ YES  │     │         │       │
│ traces_en                                             │ VARCHAR     │ YES  │     │         │       │
│ serving_size                                          │ VARCHAR     │ YES  │     │         │       │
│ serving_quantity                                      │ DOUBLE      │ YES  │     │         │       │
│ no_nutrition_data                                     │ VARCHAR     │ YES  │     │         │       │
│ additives_n                                           │ BIGINT      │ YES  │     │         │       │
│ additives                                             │ VARCHAR     │ YES  │     │         │       │
│ additives_tags                                        │ VARCHAR     │ YES  │     │         │       │
│ additives_en                                          │ VARCHAR     │ YES  │     │         │       │
│ nutriscore_score                                      │ BIGINT      │ YES  │     │         │       │
│ nutriscore_grade                                      │ VARCHAR     │ YES  │     │         │       │
│ nova_group                                            │ BIGINT      │ YES  │     │         │       │
│ pnns_groups_1                                         │ VARCHAR     │ YES  │     │         │       │
│ pnns_groups_2                                         │ VARCHAR     │ YES  │     │         │       │
│ food_groups                                           │ VARCHAR     │ YES  │     │         │       │
│ food_groups_tags                                      │ VARCHAR     │ YES  │     │         │       │
│ food_groups_en                                        │ VARCHAR     │ YES  │     │         │       │
│ states                                                │ VARCHAR     │ YES  │     │         │       │
│ states_tags                                           │ VARCHAR     │ YES  │     │         │       │
│ states_en                                             │ VARCHAR     │ YES  │     │         │       │
│ brand_owner                                           │ VARCHAR     │ YES  │     │         │       │
│ ecoscore_score                                        │ DOUBLE      │ YES  │     │         │       │
│ ecoscore_grade                                        │ VARCHAR     │ YES  │     │         │       │
│ nutrient_levels_tags                                  │ VARCHAR     │ YES  │     │         │       │
│ product_quantity                                      │ DOUBLE      │ YES  │     │         │       │
│ owner                                                 │ VARCHAR     │ YES  │     │         │       │
│ data_quality_errors_tags                              │ VARCHAR     │ YES  │     │         │       │
│ unique_scans_n                                        │ BIGINT      │ YES  │     │         │       │
│ popularity_tags                                       │ VARCHAR     │ YES  │     │         │       │
│ completeness                                          │ DOUBLE      │ YES  │     │         │       │
│ last_image_t                                          │ BIGINT      │ YES  │     │         │       │
│ last_image_datetime                                   │ TIMESTAMP   │ YES  │     │         │       │
│ main_category                                         │ VARCHAR     │ YES  │     │         │       │
│ main_category_en                                      │ VARCHAR     │ YES  │     │         │       │
│ image_url                                             │ VARCHAR     │ YES  │     │         │       │
│ image_small_url                                       │ VARCHAR     │ YES  │     │         │       │
│ image_ingredients_url                                 │ VARCHAR     │ YES  │     │         │       │
│ image_ingredients_small_url                           │ VARCHAR     │ YES  │     │         │       │
│ image_nutrition_url                                   │ VARCHAR     │ YES  │     │         │       │
│ image_nutrition_small_url                             │ VARCHAR     │ YES  │     │         │       │
│ energy-kj_100g                                        │ DOUBLE      │ YES  │     │         │       │
│ energy-kcal_100g                                      │ DOUBLE      │ YES  │     │         │       │
│ energy_100g                                           │ DOUBLE      │ YES  │     │         │       │
│ energy-from-fat_100g                                  │ DOUBLE      │ YES  │     │         │       │
│ fat_100g                                              │ DOUBLE      │ YES  │     │         │       │
[...]
│ nutrition-score-fr_100g                               │ BIGINT      │ YES  │     │         │       │
│ nutrition-score-uk_100g                               │ BIGINT      │ YES  │     │         │       │
[...]
│ sulphate_100g                                         │ DOUBLE      │ YES  │     │         │       │
│ nitrate_100g                                          │ DOUBLE      │ YES  │     │         │       │
└───────────────────────────────────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘

@CharlesNepote (Member Author) commented Dec 18, 2023

I have started to build the bash script that creates the parquet conversion AND updates the parquet file with new products. It is included below for those who are curious (it's simple bash with many comments; you can try it, and it should work out of the box on Linux).

I'm still facing three issues.

  1. I'm getting errors from DuckDB complaining about invalid UTF-8, but I'm not sure I can reproduce them reliably.
    The issue, and its solution (used below), seems to be well known: https://til.simonwillison.net/linux/iconv
    Still, I find it annoying to convert the CSV and waste time and 8 GB+ of disk space for this.

  2. I'm adding updated products to the previous parquet file, building a fully historized file (see Store historical datasets on Amazon Datasets #9355).
    But the SQL to query only the relevant data is not so easy: what works with SQLite doesn't work with DuckDB. I made it work with another query (see request 7 below), but it's more complex (and maybe bad for performance).

  3. Last but not least, historization is based on last_modified_datetime, which does not take into account "silent" updates made by Product Opener. See Add last_updated_t for silent updates (in addition to existing last_modified_t for new product revisions) #8860.

#!/usr/bin/env bash

# duckdb executable path (without trailing slash)
DP=~

# TODO: gather stats and save them a log file; save info about main operations in the log file

# 0. Find out whether there is an old parquet export in the directory
[[ -f "en.openfoodfacts.org.products.parquet" ]] && OLD=1 || OLD=0

# 1. download latest CSV
wget -c https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv # get CSV file

# Discard invalid characters
# duckdb doesn't like invalid UTF-8. It refused to read the resulting parquet file, with the following error:
# Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
# (occurring notably on this product: https://world.openfoodfacts.org/product/9900109008673?rev=4 )
# The issue, and its solution below, seems to be well-known: https://til.simonwillison.net/linux/iconv
iconv -f utf-8 -t utf-8 -c en.openfoodfacts.org.products.csv -o en.openfoodfacts.org.products.converted.csv

# 2. Create new temporary parquet file. From 2 to 5 minutes, depending on your machine
[[ -f "en.openfoodfacts.org.products.tmp.parquet" ]] && rm en.openfoodfacts.org.products.tmp.parquet
$DP/duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.converted.csv', quote='', sample_size=3000000, delim='\t') 
) TO 'en.openfoodfacts.org.products.tmp.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF

# 3. If a parquet file is already present, then merge the new data
if [[ $OLD = 1 ]]; then
  # Find the latest last_modified_datetime in the old parquet file
  LATEST_PRODUCT_MODIFICATION=$(
  $DP/duckdb :memory: -noheader -ascii -newline '' <<EOF
  SELECT last_modified_datetime FROM read_parquet('en.openfoodfacts.org.products.parquet')
    ORDER BY last_modified_datetime DESC LIMIT 1;
EOF
  )
  # TODO: if the last_modified_date is from today, do not update?

  # Create a temporary duckdb DB to merge current parquet file with the new data
  # (duckdb is not able to merge parquet files directly)
  [[ -f "tempo.db" ]] && rm tempo.db
  $DP/duckdb tempo.db <<EOF
  CREATE TABLE products AS
    SELECT * FROM read_parquet('en.openfoodfacts.org.products.parquet');
EOF

  # Find all the products that have been modified, and insert them in the temporary duckdb DB
  $DP/duckdb tempo.db <<EOF
  INSERT INTO products
    SELECT * FROM read_parquet('en.openfoodfacts.org.products.tmp.parquet')
    WHERE last_modified_datetime > strptime('$LATEST_PRODUCT_MODIFICATION', '%Y-%m-%d %H:%M:%S'); -- %H: 24-hour clock
EOF

  # Create the new parquet file based on the temporary duckdb DB
  $DP/duckdb tempo.db <<EOF
  COPY (SELECT * FROM products) TO 'en.openfoodfacts.org.products.new.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD')
EOF

else # if there isn't a previous parquet file, the temporary parquet file becomes the target result
  mv en.openfoodfacts.org.products.tmp.parquet en.openfoodfacts.org.products.parquet
fi

# TODO: gather some stats (nb of products, last modified product, nb of new products, etc.)

<<COMMENTS

# Requests to verify all is OK: you can try them manually by copy/pasting

# Variable to find duckdb paths:
DP=~

# 0. Request to verify the total number of products (including updated ones)
$DP/duckdb :memory: -noheader -ascii -newline '' "
select count(*) from read_parquet('en.openfoodfacts.org.products.new.parquet');"

# 1. Request to verify the parquet file is coherent with good data in the right columns
# vd is Visidata, an awesome tool you have to know if you're dealing with serious data: https://www.visidata.org/
$DP/duckdb :memory: -csv "
select * from read_parquet('en.openfoodfacts.org.products.new.parquet')
where completeness > 0.99 -- products with a good level of completeness
order by last_modified_datetime limit 10;
" | vd -f csv

# 2. Request to verify the parquet file is coherent with good data in the right columns
# (too long!!!) request, due to the random() sort order, but useful to verify the data are OK
$DP/duckdb :memory: -csv "
select * from read_parquet('en.openfoodfacts.org.products.new.parquet')
where completeness > 1.2 -- products with a good level of completeness
order by random() limit 10;
" | vd -f csv

# Previous tests before using iconv
  # 3. simple request
  $DP/duckdb :memory: <<EOF
  select * from read_parquet('en.openfoodfacts.org.products.new.parquet') order by last_modified_datetime limit 5;
  EOF
  => KO! Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
  => apparently because of this product: https://world.openfoodfacts.org/cgi/product.pl?type=edit&code=9900109008673
  I tried again adding delim='\t' to see => same error.

  # 4. request with a "where" clause, to check performance
  $DP/duckdb :memory: <<EOF
  select * from read_parquet('en.openfoodfacts.org.products.new.parquet') where countries_en like '%Germany' limit 5;
  EOF
  => KO! Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
  => apparently because of this product: https://world.openfoodfacts.org/cgi/product.pl?type=edit&code=9900109008673
  I tried again adding delim='\t' to see => same error.


# Requests to be tested again

# 5. multiple select + aggregation to see performance
time $DP/duckdb :memory: "select * FROM
  (select count(data_quality_errors_tags) as products_with_issues 
    from read_parquet('en.openfoodfacts.org.products.new.parquet') where data_quality_errors_tags != ''),
  (select count(data_quality_errors_tags) as products_with_issues_but_without_images 
    from read_parquet('en.openfoodfacts.org.products.new.parquet') where data_quality_errors_tags != '' and last_image_datetime is null);"

# 6. request to read the latest (i.e. current) version of the database
$DP/duckdb :memory: <<EOF
select *, max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet')
  group by code
  order by last_modified_datetime desc
  limit 5
  ;
EOF
=> KO! Error: near line 1: Binder Error: column "creator" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(creator)" if the exact value of "creator" is not important.

# 7. request to read the latest (i.e. current) version of the database (tried another way)
# TODO: create a view?
$DP/duckdb :memory: <<EOF
select * from read_parquet('en.openfoodfacts.org.products.new.parquet') t1
  where last_modified_datetime = 
    (select max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet') t2
     where t1.code = t2.code)
  order by last_modified_datetime desc
  limit 5
  ;
EOF

# 8. same request as 7, but counting the number of products
$DP/duckdb :memory: <<EOF
select count(*) from read_parquet('en.openfoodfacts.org.products.new.parquet') t1
  where last_modified_datetime = 
    (select max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet') t2
     where t1.code = t2.code)
  ;
EOF

  # 9. same as 7, but done another way
$DP/duckdb :memory: <<EOF
SELECT DISTINCT ON (code)
code, * 
FROM read_parquet('en.openfoodfacts.org.products.new.parquet')
ORDER BY code, last_modified_datetime DESC;
EOF
  => KO: uses an enormous amount of memory


COMMENTS

@ericemc3 commented Jan 6, 2024

For request 7, I can suggest:

SET threads = 6; -- reducing the default number of threads speeds things up a bit

FROM 'c:\...\en.openfoodfacts.org.products.parquet'
QUALIFY rank() OVER(PARTITION BY code ORDER BY last_modified_datetime DESC) = 1;

50 seconds on my Windows laptop.

Ideally, the parquet file would have been built pre-sorted on code and last_modified_datetime.

@ericemc3 commented Jan 6, 2024

And for request 8:

WITH last_modified AS (
     SELECT code FROM 'en.openfoodfacts.org.products.new.parquet'
     QUALIFY max(last_modified_datetime) OVER(PARTITION BY code) = last_modified_datetime
)
SELECT count(*) FROM last_modified ;

1 second.

Or more simply, especially if last_modified_datetime is sometimes duplicated for the same code:

SELECT count(DISTINCT code) from 'en.openfoodfacts.org.products.new.parquet'
