
Allow to restore MongoDB dump without uncompressing it #7962

Open
CharlesNepote opened this issue Jan 8, 2023 · 8 comments
Labels: Data export (We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data), delta exports, export, ⏰ Stale (This issue hasn't seen activity in a while. You can try documenting more to unblock it.), 📝 story

Comments

@CharlesNepote
Member

CharlesNepote commented Jan 8, 2023

The MongoDB dump weighs 39 GB+, while the compressed file weighs 6 GB+.

Currently, it is necessary to uncompress the compressed file to restore it. Thus, a simple restoration of the Open Food Facts DB needs (6 + 39) + 39 = 84 GB of disk space.

But MongoDB can dump and restore compressed files directly, without uncompressing them, thanks to the --gzip and --archive arguments.

mongodump --gzip --archive="mongodump-test-db.gz" --db=off --collection products
mongorestore --gzip --archive="mongodump-test-db.gz"

We should consider using it, because:

  • Using --gzip and --archive for the dump is as fast as not using them.
  • Dumping is simpler (just one command).
  • Restoring is simpler (just one command).
  • Restoring does not need an extra 39 GB+ and might be faster.

We just have to change one line in mongodb_dump.sh.
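A sketch of what that change could look like (hypothetical — the actual variable names and paths in mongodb_dump.sh may differ):

```shell
#!/bin/sh
# Hypothetical sketch of the change in mongodb_dump.sh; filenames and the
# exact existing commands are assumptions, not the script's actual contents.

# Before: dump to a directory, then compress it in a second step
# mongodump --db off --collection products --out dump
# tar cvfz mongodbdump.tar.gz dump

# After: dump straight into a gzip-compressed archive, in one step
mongodump --db off --collection products \
    --gzip --archive="openfoodfacts-mongodbdump.gz"
```

The restore side then becomes a single `mongorestore --gzip --archive=…` call, with no intermediate uncompressed directory.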

Here are some benchmarks.

  1. Current situation.
$ time mongodump --collection products --db off
16:18:50.823+0000	writing off.products to dump/off/products.bson
16:21:12.286+0000	done dumping off.products (2742914 documents)
real	2m21.575s
user	0m30.970s
sys	0m57.142s

$ time tar cvfz mongodbdump.tar.gz dump
dump/
dump/off/
dump/off/products.metadata.json
dump/off/products.bson
real	20m16.142s
user	18m57.279s
sys	1m47.331s
=> ~23 minutes

$ ls -la
-rw-r--r-- 1 root          root          6535232274 Jan  7 22:52 mongodbdump.tar.gz

$ time mongorestore --drop ./dump                                                             
2023-01-09T07:53:50.376+0000	preparing collections to restore from
2023-01-09T07:53:50.376+0000	reading metadata for off.products from dump/off/products.metadata.json
2023-01-09T07:53:50.379+0000	dropping collection off.products before restoring
2023-01-09T07:53:50.464+0000	restoring off.products from dump/off/products.bson
[...]
2023-01-09T08:29:57.672+0000	2742914 document(s) restored successfully. 0 document(s) failed to restore.

real	36m7.309s
user	2m53.516s
sys	1m41.033s

  2. Using --gzip and --archive.
$ mongodump --collection products --db off --gzip --archive="mongodump-test-db.gz"
15:37:19	writing off.products to archive 'mongodump-test-db'
16:00:18	done dumping off.products (2742914 documents)
=> ~23 minutes

$ ls -la
-rw-r--r-- 1 root          root          6484659291 Jan  7 16:00 mongodump-test-db.gz

$ time mongorestore --drop --collection products --db off --gzip --archive="mongodump-test-db.gz"
2023-01-08T15:48:27.545+0000	The --db and --collection flags are deprecated for this use-case; please use --nsInclude instead, i.e. with --nsInclude=${DATABASE}.${COLLECTION}
[...]
2023-01-08T16:25:18.788+0000	2742914 document(s) restored successfully. 0 document(s) failed to restore.

real	36m51.258s
user	7m57.690s
sys	1m1.246s
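As a side note, the deprecation warning above suggests replacing --db/--collection with --nsInclude when restoring from an archive. An equivalent form of the restore command:

```shell
# Same restore as above, using the non-deprecated namespace filter
# (off.products = database "off", collection "products")
mongorestore --drop --gzip --archive="mongodump-test-db.gz" \
    --nsInclude="off.products"
```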

@stephanegigandet, @alexgarel, @hangy, @syl10100, @cquest ?

@CharlesNepote CharlesNepote added 📝 story export Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data delta exports labels Jan 8, 2023
@raphael0202
Contributor

Since @CharlesNepote asked me about it: this has no impact on Robotoff, which directly uses the MongoDB database + the daily JSONL export.

@alexgarel
Member

There is an impact on the import_prod_data target in the Makefile (used in a daily action to update the staging MongoDB).

Also, we must announce the change on the /data page.

Otherwise, it's really cool to have it :-)
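For illustration, the daily staging refresh could boil down to something like this (a minimal sketch, assuming the dump is published at the URL used elsewhere in this thread; the real import_prod_data target may do more):

```shell
# Hypothetical sketch of a staging refresh after the switch; the actual
# Makefile target, paths, and cleanup steps may differ.
set -e
wget -q https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
mongorestore --drop --gzip --archive="openfoodfacts-mongodbdump.gz"
rm openfoodfacts-mongodbdump.gz
```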

@CharlesNepote
Member Author

CharlesNepote commented Jan 10, 2023

I propose to proceed in several steps:

  • generate the new dump alongside the current one
  • test for a few days in production to check that all is OK
  • update some processes, as @alexgarel mentioned (being careful with Document and rethink export workflow #8050)
  • document the new format on the /data page + alert reusers that the old one will be removed in X weeks
  • then remove the old format (if all is OK with users and team processes)
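During the transition, the dump script could produce both formats side by side (a sketch — filenames are assumptions):

```shell
# Transition sketch: generate the new archive alongside the old tarball
# (filenames assumed) so reusers have time to migrate before removal.
mongodump --db off --collection products --out dump
tar cfz openfoodfacts-mongodbdump.tar.gz dump             # old format
mongodump --db off --collection products \
    --gzip --archive="openfoodfacts-mongodbdump.gz"       # new format
```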

@CharlesNepote
Member Author

All fine!

$ time wget https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
--2023-01-27 16:59:39--  https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
Resolving static.openfoodfacts.org (static.openfoodfacts.org)... 213.36.253.206
Connecting to static.openfoodfacts.org (static.openfoodfacts.org)|213.36.253.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6435921379 (6.0G) [application/octet-stream]
Saving to: 'openfoodfacts-mongodbdump.gz'

openfoodfacts-mongodbdump.gz                    100%[=====================================================================================================>]   5.99G  79.0MB/s    in 1m 40s  

2023-01-27 17:01:18 (61.5 MB/s) - 'openfoodfacts-mongodbdump.gz' saved [6435921379/6435921379]


real	1m39.910s

$ wget https://static.openfoodfacts.org/data/gz-sha256sum

$ time sha256sum --check gz-sha256sum
openfoodfacts-mongodbdump.gz: OK

real	0m29.388s

$ time mongorestore --drop --gzip --archive="openfoodfacts-mongodbdump.gz"
2023-01-27T17:43:08.031+0000	preparing collections to restore from
2023-01-27T17:43:08.067+0000	reading metadata for off.products from archive 'openfoodfacts-mongodbdump.gz'
2023-01-27T17:43:08.069+0000	dropping collection off.products before restoring
2023-01-27T17:43:08.153+0000	restoring off.products from archive 'openfoodfacts-mongodbdump.gz'
2023-01-27T17:43:10.982+0000	off.products  162MB
[...]
2023-01-27T18:23:08.270+0000	2784589 document(s) restored successfully. 0 document(s) failed to restore.

real	40m0.322s

@CharlesNepote
Member Author

I have run other tests: it works perfectly well.

We have to pay attention to #8050, to make sure it does not impact the global export workflow too much.

@CharlesNepote
Member Author

CharlesNepote commented Feb 17, 2023

@alexgarel: you mentioned:

There is an impact on import_prod_data target in Makefile (used in daily action to update stagging mongodb).

Could you take care of it? I'm not very comfortable with Makefiles... It seems that the GitHub action starts at 00:00, so it uses the archive from the previous day, and you don't need to worry about the file's creation time (see #8050).

When it's done, I suggest we test it during a few days before going further.

@CharlesNepote CharlesNepote reopened this Feb 17, 2023
alexgarel added a commit that referenced this issue Feb 27, 2023
Using new openfoodfacts-mongodbdump.gz dump to restore staging.

Also using a bind mounted directory to avoid copying files around.

+ fix a bug on daily refresh_products_tags on mongo_dev

Part of: #7962
Contributor

This issue has been open 90 days with no activity. Can you give it a little love by linking it to a parent issue, adding relevant labels and projects, creating a mockup if applicable, adding code pointers from https://github.com/openfoodfacts/openfoodfacts-server/blob/main/.github/labeler.yml, giving it a priority, or editing the original issue to have a more comprehensive description? Thank you very much for your contribution to 🍊 Open Food Facts

@github-actions github-actions bot added the ⏰ Stale This issue hasn't seen activity in a while. You can try documenting more to unblock it. label Dec 23, 2023
@stephanegigandet
Copy link
Contributor

@CharlesNepote @alexgarel I think it's time to remove the old mongodb dump: #9946

Projects
Status: To discuss and validate