Investigate database export as parquet file #7660
Comments
As a Parquet file can be manipulated with SQL queries thanks to duckdb, it should be possible to merge new data with older data with a single SQL query (tests to be done). |
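The single-SQL-query merge idea above could be sketched as follows. This uses Python's built-in sqlite3 as a stand-in for duckdb, with a tiny invented schema (code, last_modified_t, product_name), so it only illustrates the shape of the query, not the real export:

```python
import sqlite3

# Toy stand-in: sqlite3 instead of duckdb, and a made-up three-column schema.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE old_export (code TEXT, last_modified_t INT, product_name TEXT)")
con.execute("CREATE TABLE new_export (code TEXT, last_modified_t INT, product_name TEXT)")
con.executemany("INSERT INTO old_export VALUES (?,?,?)",
                [("001", 100, "Water"), ("002", 150, "Milk")])
con.executemany("INSERT INTO new_export VALUES (?,?,?)",
                [("001", 100, "Water"), ("002", 200, "Milk 1L"), ("003", 180, "Tea")])

# Single query merging older data with anything newer than the old export's
# most recent modification.
con.execute("""
    INSERT INTO old_export
    SELECT * FROM new_export
    WHERE last_modified_t > (SELECT max(last_modified_t) FROM old_export)
""")
rows = con.execute(
    "SELECT code, last_modified_t FROM old_export ORDER BY code, last_modified_t"
).fetchall()
print(rows)  # both revisions of 002 are kept; 003 is added
```

Note that this appends new revisions rather than replacing old ones, so a "latest version only" query is still needed on top.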
@CharlesNepote thanks a lot for following up on our conversation on this topic, and investigating the gains ! The gains in file size and query speed are impressive. Two questions:
|
This issue is stale because it has been open 90 days with no activity. |
A parquet version would be very welcome indeed! Currently, parsing en.openfoodfacts.org.products.csv is difficult due to, apparently, 15 lines with 'ingredients_text' containing [...]. Anyway, this DuckDB query seems to do the job, converting to a parquet version (840 MB).
And then, this sample query
|
Hi @CharlesNepote and other usual partners in (open data) crime! This is a very encouraging discussion! I am 💯% in favor of having a parquet file alongside the CSV (since parquet is not yet universally known).
Parquet files at Insee
We aim to use parquet more and more in the French statistical system:
In Insee's innovation teams, we advocate a lot for a more general use of parquet files.
Other sources of inspiration :
Challenge
I think there might be some variables that could be quite challenging if the goal is a small parquet file that can be handled easily. Among them I see: categories and ingredients. They seem to be, at the same time, sparse columns that, when information is present, stack together very long strings. Maybe these columns could go in a separate parquet file, while another file would store basic nutritional facts alongside basic product information (name, EAN, etc.). This would give a lightweight file allowing most users to handle basic nutritional data. A star schema that would allow joining together multiple variables in different files could then do the trick. I don't know if this idea is a good one. However, in my opinion, if you want to encourage external web apps or dataviz connected to your dataset, this could help. |
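The star-schema split suggested above could be sketched like this, again with Python's sqlite3 standing in for parquet files and a made-up two-table split (lightweight core facts vs. heavy text columns), joined on the product code:

```python
import sqlite3

# Hypothetical split: a small "core" table with basic facts, a separate
# "texts" table for the long sparse strings, joined on code when needed.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE core (code TEXT PRIMARY KEY, product_name TEXT, energy_100g REAL)")
con.execute("CREATE TABLE texts (code TEXT PRIMARY KEY, ingredients_text TEXT)")
con.execute("INSERT INTO core VALUES ('001', 'Muesli', 380.0)")
con.execute("INSERT INTO texts VALUES ('001', 'oat flakes, raisins, nuts, ...')")

# Most users only touch the lightweight core table...
light = con.execute("SELECT product_name, energy_100g FROM core").fetchall()

# ...and the long strings are joined in only when actually needed.
full = con.execute("""
    SELECT c.product_name, t.ingredients_text
    FROM core c JOIN texts t USING (code)
""").fetchall()
print(light, full)
```

With parquet files, duckdb can run the same kind of join directly across two `read_parquet(...)` sources, so the split stays transparent for SQL users.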
I have made some tests to evaluate:
(@ericemc3 import seems to be ok if you provide `quote=''`.)

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='')
) TO 'en.openfoodfacts.org.products.2023-11-24.parquet';
EOF
real 2m4,673s
$ ls -lh en.openfoodfacts.org.products.2023-11-24.parquet
-rw-r--r-- 1 charles charles 839M 25 nov. 10:45 en.openfoodfacts.org.products.2023-11-24.parquet

Is a smaller ROW_GROUP_SIZE useful?

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='')
) TO 'en.openfoodfacts.org.products.2023-11-24.gps50000.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 50000);
EOF
real 1m53,181s
$ ls -lh en.openfoodfacts.org.products.2023-11-24.gps50000.parquet
-rw-r--r-- 1 charles charles 856M 25 nov. 11:30 en.openfoodfacts.org.products.2023-11-24.gps50000.parquet

Is a bigger ROW_GROUP_SIZE useful?

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='')
) TO 'en.openfoodfacts.org.products.2023-11-24.gps300000.parquet' (FORMAT PARQUET, ROW_GROUP_SIZE 300000);
EOF
real 2m19,405s
$ ls -lh en.openfoodfacts.org.products.2023-11-24.gps300000.parquet
-rw-r--r-- 1 charles charles 826M 25 nov. 11:22 en.openfoodfacts.org.products.2023-11-24.gps300000.parquet

Is compression useful? Clearly yes: the file is 313 MB smaller (37%), with a small impact on speed (13 s, ~10%).

$ time ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.2023-11-24.csv', quote='')
) TO 'en.openfoodfacts.org.products.2023-11-24.zstd.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF
real 2m17,476s
$ ls -lh en.openfoodfacts.org.products.2023-11-24.zstd.parquet
-rw-r--r-- 1 charles charles 527M 25 nov. 11:18 en.openfoodfacts.org.products.2023-11-24.zstd.parquet

Compression +

In my next tests:
|
If we add the products that have been modified to the parquet file, we are able to request only the last version of the products. See: http://sqlfiddle.com/#!5/5be4d/1/0

TODO: see if it's possible to create views or some mechanisms to:
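A minimal sketch of that "latest version only" query, using Python's sqlite3 in place of duckdb and an invented three-column schema (the sqlfiddle above shows the real approach): keep each row whose timestamp equals the per-code maximum.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (code TEXT, last_modified_t INT, product_name TEXT)")
con.executemany("INSERT INTO products VALUES (?,?,?)", [
    ("001", 100, "Water"), ("001", 200, "Spring water"),  # two revisions of 001
    ("002", 150, "Milk"),
])

# Latest revision of each product: correlated subquery on the per-code maximum.
latest = con.execute("""
    SELECT code, product_name FROM products p
    WHERE last_modified_t = (SELECT max(last_modified_t)
                             FROM products q WHERE q.code = p.code)
    ORDER BY code
""").fetchall()
print(latest)  # [('001', 'Spring water'), ('002', 'Milk')]
```

Wrapping this SELECT in a `CREATE VIEW` would give downstream users a table-like "current database" without duplicating any data.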
|
Is this useful? |
Side note: for the Open Food Facts CSV it is very important to use a large `sample_size`, so that the column types are detected correctly. All the detected columns:

$ ./duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.csv', quote='', sample_size = 3000000)
) TO 'en.openfoodfacts.org.products.zstd.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF
$ ./duckdb x.db "DESCRIBE SELECT * FROM 'en.openfoodfacts.org.products.zstd.parquet';" -box
┌───────────────────────────────────────────────────────┬─────────────┬──────┬─────┬─────────┬───────┐
│ column_name │ column_type │ null │ key │ default │ extra │
├───────────────────────────────────────────────────────┼─────────────┼──────┼─────┼─────────┼───────┤
│ code │ VARCHAR │ YES │ │ │ │
│ url │ VARCHAR │ YES │ │ │ │
│ creator │ VARCHAR │ YES │ │ │ │
│ created_t │ BIGINT │ YES │ │ │ │
│ created_datetime │ TIMESTAMP │ YES │ │ │ │
│ last_modified_t │ BIGINT │ YES │ │ │ │
│ last_modified_datetime │ TIMESTAMP │ YES │ │ │ │
│ last_modified_by │ VARCHAR │ YES │ │ │ │
│ product_name │ VARCHAR │ YES │ │ │ │
│ abbreviated_product_name │ VARCHAR │ YES │ │ │ │
│ generic_name │ VARCHAR │ YES │ │ │ │
│ quantity │ VARCHAR │ YES │ │ │ │
│ packaging │ VARCHAR │ YES │ │ │ │
│ packaging_tags │ VARCHAR │ YES │ │ │ │
│ packaging_en │ VARCHAR │ YES │ │ │ │
│ packaging_text │ VARCHAR │ YES │ │ │ │
│ brands │ VARCHAR │ YES │ │ │ │
│ brands_tags │ VARCHAR │ YES │ │ │ │
│ categories │ VARCHAR │ YES │ │ │ │
│ categories_tags │ VARCHAR │ YES │ │ │ │
│ categories_en │ VARCHAR │ YES │ │ │ │
│ origins │ VARCHAR │ YES │ │ │ │
│ origins_tags │ VARCHAR │ YES │ │ │ │
│ origins_en │ VARCHAR │ YES │ │ │ │
│ manufacturing_places │ VARCHAR │ YES │ │ │ │
│ manufacturing_places_tags │ VARCHAR │ YES │ │ │ │
│ labels │ VARCHAR │ YES │ │ │ │
│ labels_tags │ VARCHAR │ YES │ │ │ │
│ labels_en │ VARCHAR │ YES │ │ │ │
│ emb_codes │ VARCHAR │ YES │ │ │ │
│ emb_codes_tags │ VARCHAR │ YES │ │ │ │
│ first_packaging_code_geo │ VARCHAR │ YES │ │ │ │
│ cities │ VARCHAR │ YES │ │ │ │
│ cities_tags │ VARCHAR │ YES │ │ │ │
│ purchase_places │ VARCHAR │ YES │ │ │ │
│ stores │ VARCHAR │ YES │ │ │ │
│ countries │ VARCHAR │ YES │ │ │ │
│ countries_tags │ VARCHAR │ YES │ │ │ │
│ countries_en │ VARCHAR │ YES │ │ │ │
│ ingredients_text │ VARCHAR │ YES │ │ │ │
│ ingredients_tags │ VARCHAR │ YES │ │ │ │
│ ingredients_analysis_tags │ VARCHAR │ YES │ │ │ │
│ allergens │ VARCHAR │ YES │ │ │ │
│ allergens_en │ VARCHAR │ YES │ │ │ │
│ traces │ VARCHAR │ YES │ │ │ │
│ traces_tags │ VARCHAR │ YES │ │ │ │
│ traces_en │ VARCHAR │ YES │ │ │ │
│ serving_size │ VARCHAR │ YES │ │ │ │
│ serving_quantity │ DOUBLE │ YES │ │ │ │
│ no_nutrition_data │ VARCHAR │ YES │ │ │ │
│ additives_n │ BIGINT │ YES │ │ │ │
│ additives │ VARCHAR │ YES │ │ │ │
│ additives_tags │ VARCHAR │ YES │ │ │ │
│ additives_en │ VARCHAR │ YES │ │ │ │
│ nutriscore_score │ BIGINT │ YES │ │ │ │
│ nutriscore_grade │ VARCHAR │ YES │ │ │ │
│ nova_group │ BIGINT │ YES │ │ │ │
│ pnns_groups_1 │ VARCHAR │ YES │ │ │ │
│ pnns_groups_2 │ VARCHAR │ YES │ │ │ │
│ food_groups │ VARCHAR │ YES │ │ │ │
│ food_groups_tags │ VARCHAR │ YES │ │ │ │
│ food_groups_en │ VARCHAR │ YES │ │ │ │
│ states │ VARCHAR │ YES │ │ │ │
│ states_tags │ VARCHAR │ YES │ │ │ │
│ states_en │ VARCHAR │ YES │ │ │ │
│ brand_owner │ VARCHAR │ YES │ │ │ │
│ ecoscore_score │ DOUBLE │ YES │ │ │ │
│ ecoscore_grade │ VARCHAR │ YES │ │ │ │
│ nutrient_levels_tags │ VARCHAR │ YES │ │ │ │
│ product_quantity │ DOUBLE │ YES │ │ │ │
│ owner │ VARCHAR │ YES │ │ │ │
│ data_quality_errors_tags │ VARCHAR │ YES │ │ │ │
│ unique_scans_n │ BIGINT │ YES │ │ │ │
│ popularity_tags │ VARCHAR │ YES │ │ │ │
│ completeness │ DOUBLE │ YES │ │ │ │
│ last_image_t │ BIGINT │ YES │ │ │ │
│ last_image_datetime │ TIMESTAMP │ YES │ │ │ │
│ main_category │ VARCHAR │ YES │ │ │ │
│ main_category_en │ VARCHAR │ YES │ │ │ │
│ image_url │ VARCHAR │ YES │ │ │ │
│ image_small_url │ VARCHAR │ YES │ │ │ │
│ image_ingredients_url │ VARCHAR │ YES │ │ │ │
│ image_ingredients_small_url │ VARCHAR │ YES │ │ │ │
│ image_nutrition_url │ VARCHAR │ YES │ │ │ │
│ image_nutrition_small_url │ VARCHAR │ YES │ │ │ │
│ energy-kj_100g │ DOUBLE │ YES │ │ │ │
│ energy-kcal_100g │ DOUBLE │ YES │ │ │ │
│ energy_100g │ DOUBLE │ YES │ │ │ │
│ energy-from-fat_100g │ DOUBLE │ YES │ │ │ │
│ fat_100g │ DOUBLE │ YES │ │ │ │
[...]
│ nutrition-score-fr_100g │ BIGINT │ YES │ │ │ │
│ nutrition-score-uk_100g │ BIGINT │ YES │ │ │ │
[...]
│ sulphate_100g │ DOUBLE │ YES │ │ │ │
│ nitrate_100g │ DOUBLE │ YES │ │ │ │
└───────────────────────────────────────────────────────┴─────────────┴──────┴─────┴─────────┴───────┘
|
I have started to build the bash script that creates the parquet conversion AND updates the parquet file with new products. It's below for those who are curious (it's simple bash with many comments; you can try it, and it should work out of the box if you're using Linux). I'm still facing three issues.
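As an aside, the `iconv -c` cleanup step used in the script can also be done in Python if iconv is not available: decoding with `errors="ignore"` drops invalid byte sequences. A sketch (the sample bytes and function name are made up for illustration):

```python
# Equivalent of `iconv -f utf-8 -t utf-8 -c`: decode UTF-8 while silently
# dropping invalid byte sequences, then re-encode clean UTF-8.
raw = b"Coca-Cola\xffZero"  # \xff is not a valid UTF-8 byte
clean = raw.decode("utf-8", errors="ignore").encode("utf-8")
print(clean)  # b'Coca-ColaZero'

def strip_invalid_utf8(src_path: str, dst_path: str) -> None:
    # Stream line by line to avoid loading a multi-GB CSV into memory.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(line.decode("utf-8", errors="ignore").encode("utf-8"))
```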
#!/usr/bin/env bash
# duckdb executable path (without trailing slash)
DP=~
# TODO: gather stats and save them in a log file; save info about main operations in the log file
# 0. check whether there is an old parquet export in the directory
[[ -f "en.openfoodfacts.org.products.parquet" ]] && OLD=1 || OLD=0
# 1. download latest CSV
wget -c https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv # get CSV file
# Discard invalid characters
# duckdb doesn't like invalid UTF-8. It refused to read the resulting parquet file, with the following error:
# Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
# (occurring namely on this product: https://world.openfoodfacts.org/product/9900109008673?rev=4 )
# The issue, and its solution below, seem to be well known: https://til.simonwillison.net/linux/iconv
iconv -f utf-8 -t utf-8 -c en.openfoodfacts.org.products.csv -o en.openfoodfacts.org.products.converted.csv
# 2. Create new temporary parquet file. From 2 to 5 minutes, depending on your machine
[[ -f "en.openfoodfacts.org.products.tmp.parquet" ]] && rm en.openfoodfacts.org.products.tmp.parquet
$DP/duckdb <<EOF
COPY (
FROM read_csv_auto('en.openfoodfacts.org.products.converted.csv', quote='', sample_size=3000000, delim='\t')
) TO 'en.openfoodfacts.org.products.tmp.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD');
EOF
# 3. If a parquet file is already present, then merge the new data
if [[ $OLD = 1 ]]; then
# Find the latest last_modified_datetime in the old parquet file
LATEST_PRODUCT_MODIFICATION=$(
$DP/duckdb :memory: -noheader -ascii -newline '' <<EOF
SELECT last_modified_datetime FROM read_parquet('en.openfoodfacts.org.products.parquet')
ORDER BY last_modified_datetime DESC LIMIT 1;
EOF
)
# TODO: if the last_modified_date is from today, do not update?
# Create a temporary duckdb DB to merge current parquet file with the new data
# (duckdb is not able to merge parquet files directly)
[[ -f "tempo.db" ]] && rm tempo.db
$DP/duckdb tempo.db <<EOF
CREATE TABLE products AS
SELECT * FROM read_parquet('en.openfoodfacts.org.products.parquet');
EOF
# Find all the products that have been modified, and insert them in the temporary duckdb DB
$DP/duckdb tempo.db <<EOF
INSERT INTO products
SELECT * FROM read_parquet('en.openfoodfacts.org.products.tmp.parquet')
WHERE last_modified_datetime > strptime('$LATEST_PRODUCT_MODIFICATION', '%Y-%m-%d %H:%M:%S');
EOF
# Create the new parquet file based on the temporary duckdb DB
$DP/duckdb tempo.db <<EOF
COPY (SELECT * FROM products) TO 'en.openfoodfacts.org.products.new.parquet' (FORMAT PARQUET, COMPRESSION 'ZSTD')
EOF
else # if there isn't a previous parquet file, the temporary parquet file becomes the target result
mv en.openfoodfacts.org.products.tmp.parquet en.openfoodfacts.org.products.parquet
fi
# TODO: gather some stats (nb of products, last modified product, nb of new products, etc.)
<<COMMENTS
# Requests to verify all is OK: you can try them manually by copy/pasting
# Variable to find duckdb paths:
DP=~
# 0. Request to verify the total number of products (including updated ones)
$DP/duckdb :memory: -noheader -ascii -newline '' "
select count(*) from read_parquet('en.openfoodfacts.org.products.new.parquet');"
# 1. Request to verify the parquet file is coherent with good data in the right columns
# vd is Visidata, an awesome tool you have to know if you're dealing with serious data: https://www.visidata.org/
$DP/duckdb :memory: -csv "
select * from read_parquet('en.openfoodfacts.org.products.new.parquet')
where completeness > 0.99 -- products with a good level of completeness
order by last_modified_datetime limit 10;
" | vd -f csv
# 2. Request to verify the parquet file is coherent with good data in the right columns
# (very slow request, due to the random() sort order, but useful to verify the data are OK)
$DP/duckdb :memory: -csv "
select * from read_parquet('en.openfoodfacts.org.products.new.parquet')
where completeness > 1.2 -- products with a good level of completeness
order by random() limit 10;
" | vd -f csv
# Previous tests before using iconv
# 3. simple request
$DP/duckdb :memory: <<EOF
select * from read_parquet('en.openfoodfacts.org.products.new.parquet') order by last_modified_datetime limit 5;
EOF
=> KO! Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
=> apparently caused by this product: https://world.openfoodfacts.org/cgi/product.pl?type=edit&code=9900109008673
I tried again adding delim='\t' to see => same error.
# 4. request with a "where" clause, to check performance
$DP/duckdb :memory: <<EOF
select * from read_parquet('en.openfoodfacts.org.products.new.parquet') where countries_en like '%Germany' limit 5;
EOF
=> KO! Error: near line 1: Invalid Input Error: Invalid string encoding found in Parquet file: value "........."
=> apparently caused by this product: https://world.openfoodfacts.org/cgi/product.pl?type=edit&code=9900109008673
I tried again adding delim='\t' to see => same error.
# Requests to be tested again
# 5. multiple select + aggregation to see performance
time $DP/duckdb :memory: "select * FROM
(select count(data_quality_errors_tags) as products_with_issues
from read_parquet('en.openfoodfacts.org.products.new.parquet') where data_quality_errors_tags != ''),
(select count(data_quality_errors_tags) as products_with_issues_but_without_images
from read_parquet('en.openfoodfacts.org.products.new.parquet') where data_quality_errors_tags != '' and last_image_datetime is null);"
# 6. request to read the last (i.e. current) version of the database
$DP/duckdb :memory: <<EOF
select *, max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet')
group by code
order by last_modified_datetime desc
limit 5
;
EOF
=> KO! Error: near line 1: Binder Error: column "creator" must appear in the GROUP BY clause or must be part of an aggregate function.
Either add it to the GROUP BY list, or use "ANY_VALUE(creator)" if the exact value of "creator" is not important.
# 7. request to read the last (i.e. current) version of the database (tried another way)
# TODO: create a view?
$DP/duckdb :memory: <<EOF
select * from read_parquet('en.openfoodfacts.org.products.new.parquet') t1
where last_modified_datetime =
(select max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet') t2
where t1.code = t2.code)
order by last_modified_datetime desc
limit 5
;
EOF
# 8. same request as 7, but counting the number of products
$DP/duckdb :memory: <<EOF
select count(*) from read_parquet('en.openfoodfacts.org.products.new.parquet') t1
where last_modified_datetime =
(select max(last_modified_datetime) from read_parquet('en.openfoodfacts.org.products.new.parquet') t2
where t1.code = t2.code)
;
EOF
# 9. same as 7, but done another way
$DP/duckdb :memory: <<EOF
SELECT DISTINCT ON (code)
code, *
FROM read_parquet('en.openfoodfacts.org.products.new.parquet')
ORDER BY code, last_modified_datetime DESC;
EOF
=> KO! Uses a huge amount of memory
COMMENTS
|
For query 7, I can suggest:
50 seconds on my Windows laptop. Ideally, the parquet file would have been built after first sorting on |
And for query 8:
1 second, or more simply, and especially if
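One way to make queries 7 and 8 cheaper than the correlated-subquery version is a window function that ranks the revisions of each code; here is a sketch using Python's sqlite3 and toy data rather than duckdb and the real parquet file (duckdb supports the same window-function SQL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (code TEXT, last_modified_t INT, product_name TEXT)")
con.executemany("INSERT INTO products VALUES (?,?,?)", [
    ("001", 100, "Water"), ("001", 200, "Spring water"),
    ("002", 150, "Milk"),
])

# Query 7: rank the revisions of each product, newest first, keep rank 1 only.
latest = con.execute("""
    SELECT code, product_name FROM (
        SELECT code, product_name,
               row_number() OVER (PARTITION BY code
                                  ORDER BY last_modified_t DESC) AS rn
        FROM products)
    WHERE rn = 1
    ORDER BY code
""").fetchall()

# Query 8: counting current products is a simple count over the same set.
n_current = con.execute("""
    SELECT count(*) FROM (
        SELECT row_number() OVER (PARTITION BY code
                                  ORDER BY last_modified_t DESC) AS rn
        FROM products)
    WHERE rn = 1
""").fetchone()[0]
print(latest, n_current)
```

A single scan with one window sort is usually much lighter than re-reading the file once per product, which may explain the suggested timings.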
|
Apache Parquet is a file format that aims to make data easier to manipulate.
Our current CSV (2023-03): 7.5 GB
Parquet file generated from our current CSV (thanks to csv2parquet): 643 MB with internal zstd compression. Here are the few steps to reproduce:
wget -c https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv # get CSV file
csvclean -t en.openfoodfacts.org.products.csv # clean up the CSV and convert it from TSV to CSV -- ~6'30s on my laptop
./csv2parquet --header true -n en.openfoodfacts.org.products_out.csv products.pqt > parquet.schema
Then edit parquet.schema:
- code field => "data_type": "Utf8" instead of "data_type": "Int64"
- serving_quantity field => "data_type": "Float64" instead of "data_type": "Utf8"
time ./csv2parquet --header true -c zstd -s parquet.schema en.openfoodfacts.org.products_out.csv products_zstd.pqt # ~1'45s on my laptop
It's a "young" format (born in 2013). Some tools already read/write Parquet files. The easiest way to read/write Parquet files is to use duckdb (as easy to install as sqlite), which reads and writes them natively (without an import step).
To give a simple example:
The same query on a SQLite DB built from the same CSV export takes more than 10s:
You can reproduce this query thanks to our Datasette instance.
Due to the column-based nature of the format, there is an overhead for some queries. `select *`, for example, needs to iterate over all the columns (dozens if not hundreds in our case). Here is an example extracting all the columns of the first two products. That said, it should be a far better format than CSV for all operations not involving all the columns, and even for those ones in complex queries.
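To make the `select *` overhead concrete, here is a toy model of column-oriented storage in plain Python (real Parquet adds row groups, encodings and compression on top, so this is only illustrative, with invented sample values):

```python
# Toy columnar store: each column is its own list, a "row" must be
# reassembled from every column.
table = {
    "code": ["001", "002", "003"],
    "product_name": ["Water", "Milk", "Tea"],
    "energy_100g": [0.0, 64.0, 1.0],
}

# Reading a single column touches one contiguous list: cheap.
energies = table["energy_100g"]

# `select *` must visit every column to rebuild each row: the overhead grows
# with the number of columns (around 200 in the Open Food Facts export).
rows = list(zip(*table.values()))
print(energies, rows[0])
```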
(Thanks @moreymat for suggesting that I explore it.)
[EDIT]
Repeated the last test with 10,000 and 100,000 products: