
Allow to restore MongoDB dump without uncompressing it #7962

Open
CharlesNepote opened this issue Jan 8, 2023 · 8 comments
Labels: Data export (We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data), delta exports, export, ⏰ Stale (This issue hasn't seen activity in a while. You can try documenting more to unblock it.), 📝 story

Comments

@CharlesNepote
Member

CharlesNepote commented Jan 8, 2023

The MongoDB dump weighs 39 GB+, while the compressed file weighs 6 GB+.

Currently, it is necessary to uncompress the compressed file to restore it. Thus, a simple restoration of the Open Food Facts DB needs (6 + 39) + 39 = 84 GB of disk space.

But MongoDB can dump and restore compressed files directly, without uncompressing them, thanks to the --gzip and --archive arguments.

mongodump --gzip --archive="mongodump-test-db.gz" --db=off --collection products
mongorestore --gzip --archive="mongodump-test-db.gz"

We should consider using it, because:

  • Using --gzip and --archive for the dump is as fast as not using them.
  • Dumping is simpler (just one command).
  • Restoring is simpler (just one command).
  • Restoring does not need an extra 39 GB+ and might be faster.

We just have to change one line in mongodb_dump.sh.
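A sketch of what that change could look like (hypothetical — the actual variable names and paths in mongodb_dump.sh may differ):

```shell
#!/bin/sh
# Hypothetical sketch of the change in mongodb_dump.sh; filenames and the
# exact existing commands are assumptions, not the script's actual contents.

# Before: dump to a directory, then compress it in a second step
# mongodump --db off --collection products --out dump
# tar cvfz mongodbdump.tar.gz dump

# After: dump straight into a gzip-compressed archive, in one step
mongodump --db off --collection products \
    --gzip --archive="openfoodfacts-mongodbdump.gz"
```

The restore side then becomes a single `mongorestore --gzip --archive=…` call, with no intermediate uncompressed directory.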

Here are some benchmarks.

  1. Current situation.
$ time mongodump --collection products --db off
16:18:50.823+0000	writing off.products to dump/off/products.bson
16:21:12.286+0000	done dumping off.products (2742914 documents)
real	2m21.575s
user	0m30.970s
sys	0m57.142s

$ time tar cvfz mongodbdump.tar.gz dump
dump/
dump/off/
dump/off/products.metadata.json
dump/off/products.bson
real	20m16.142s
user	18m57.279s
sys	1m47.331s
=> ~23 minutes

$ ls -la
-rw-r--r-- 1 root          root          6535232274 Jan  7 22:52 mongodbdump.tar.gz

$ time mongorestore --drop ./dump                                                             
2023-01-09T07:53:50.376+0000	preparing collections to restore from
2023-01-09T07:53:50.376+0000	reading metadata for off.products from dump/off/products.metadata.json
2023-01-09T07:53:50.379+0000	dropping collection off.products before restoring
2023-01-09T07:53:50.464+0000	restoring off.products from dump/off/products.bson
[...]
2023-01-09T08:29:57.672+0000	2742914 document(s) restored successfully. 0 document(s) failed to restore.

real	36m7.309s
user	2m53.516s
sys	1m41.033s

  2. Using --gzip and --archive.
$ mongodump --collection products --db off --gzip --archive="mongodump-test-db.gz"
15:37:19	writing off.products to archive 'mongodump-test-db'
16:00:18	done dumping off.products (2742914 documents)
=> ~23 minutes

$ ls -la
-rw-r--r-- 1 root          root          6484659291 Jan  7 16:00 mongodump-test-db.gz

$ time mongorestore --drop --collection products --db off --gzip --archive="mongodump-test-db.gz"
2023-01-08T15:48:27.545+0000	The --db and --collection flags are deprecated for this use-case; please use --nsInclude instead, i.e. with --nsInclude=${DATABASE}.${COLLECTION}
[...]
2023-01-08T16:25:18.788+0000	2742914 document(s) restored successfully. 0 document(s) failed to restore.

real	36m51.258s
user	7m57.690s
sys	1m1.246s
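As a side note, the deprecation warning above suggests replacing --db/--collection with --nsInclude when restoring from an archive. An equivalent form of the restore command:

```shell
# Same restore as above, using the non-deprecated namespace filter
# (off.products = database "off", collection "products")
mongorestore --drop --gzip --archive="mongodump-test-db.gz" \
    --nsInclude="off.products"
```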

@stephanegigandet, @alexgarel, @hangy, @syl10100, @cquest ?

@CharlesNepote CharlesNepote added 📝 story export Data export We export data nightly as CSV, MongoDB… See: https://world.openfoodfacts.org/data delta exports labels Jan 8, 2023
@raphael0202
Contributor

Since @CharlesNepote asked me about it: this has no impact on Robotoff, which directly uses the MongoDB database + the daily JSONL export.

@alexgarel
Member

There is an impact on the import_prod_data target in the Makefile (used in a daily action to update the staging MongoDB).

Also, we must announce the change on the /data page.

Otherwise, it's really cool to have it :-)
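For illustration, the daily staging refresh could boil down to something like this (a minimal sketch, assuming the dump is published at the URL used elsewhere in this thread; the real import_prod_data target may do more):

```shell
# Hypothetical sketch of a staging refresh after the switch; the actual
# Makefile target, paths, and cleanup steps may differ.
set -e
wget -q https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
mongorestore --drop --gzip --archive="openfoodfacts-mongodbdump.gz"
rm openfoodfacts-mongodbdump.gz
```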

@CharlesNepote
Member Author

CharlesNepote commented Jan 10, 2023

I propose to proceed in several steps:

  • generate the new dump alongside the current one
  • test for a few days in production to check that all is OK
  • update some processes, as @alexgarel mentioned (being careful with Document and rethink export workflow #8050)
  • document the new format on the /data page + alert reusers that the old one will be removed in X weeks
  • then remove the old format (if all is OK with users and team processes)
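During the transition, the dump script could produce both formats side by side (a sketch — filenames are assumptions):

```shell
# Transition sketch: generate the new archive alongside the old tarball
# (filenames assumed) so reusers have time to migrate before removal.
mongodump --db off --collection products --out dump
tar cfz openfoodfacts-mongodbdump.tar.gz dump             # old format
mongodump --db off --collection products \
    --gzip --archive="openfoodfacts-mongodbdump.gz"       # new format
```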

@CharlesNepote
Member Author

All fine!

$ time wget https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
--2023-01-27 16:59:39--  https://static.openfoodfacts.org/data/openfoodfacts-mongodbdump.gz
Resolving static.openfoodfacts.org (static.openfoodfacts.org)... 213.36.253.206
Connecting to static.openfoodfacts.org (static.openfoodfacts.org)|213.36.253.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6435921379 (6.0G) [application/octet-stream]
Saving to: 'openfoodfacts-mongodbdump.gz'

openfoodfacts-mongodbdump.gz                    100%[=====================================================================================================>]   5.99G  79.0MB/s    in 1m 40s  

2023-01-27 17:01:18 (61.5 MB/s) - 'openfoodfacts-mongodbdump.gz' saved [6435921379/6435921379]


real	1m39.910s

$ wget https://static.openfoodfacts.org/data/gz-sha256sum

$ time sha256sum --check gz-sha256sum
openfoodfacts-mongodbdump.gz: OK

real	0m29.388s

$ time mongorestore --drop --gzip --archive="openfoodfacts-mongodbdump.gz"
2023-01-27T17:43:08.031+0000	preparing collections to restore from
2023-01-27T17:43:08.067+0000	reading metadata for off.products from archive 'openfoodfacts-mongodbdump.gz'
2023-01-27T17:43:08.069+0000	dropping collection off.products before restoring
2023-01-27T17:43:08.153+0000	restoring off.products from archive 'openfoodfacts-mongodbdump.gz'
2023-01-27T17:43:10.982+0000	off.products  162MB
[...]
2023-01-27T18:23:08.270+0000	2784589 document(s) restored successfully. 0 document(s) failed to restore.

real	40m0.322s

@CharlesNepote
Member Author

I have run other tests: it works perfectly well.

We have to pay attention to #8050, to make sure it does not impact the global export workflow too much.

@CharlesNepote
Member Author

CharlesNepote commented Feb 17, 2023

@alexgarel: you mentioned:

There is an impact on import_prod_data target in Makefile (used in daily action to update stagging mongodb).

Could you take care of it? I'm not very comfortable with Makefiles... It seems that the GitHub action starts at 00:00, so it uses the archive from the previous day, and you don't need to worry about the file's creation time (see #8050).

When it's done, I suggest we test it during a few days before going further.

@CharlesNepote CharlesNepote reopened this Feb 17, 2023
alexgarel added a commit that referenced this issue Feb 27, 2023
Using new openfoodfacts-mongodbdump.gz dump to restore staging.

Also using a bind mounted directory to avoid copying files around.

+ fix a bug on daily refresh_products_tags on mongo_dev

Part of: #7962
Contributor

This issue has been open 90 days with no activity. Can you give it a little love by linking it to a parent issue, adding relevant labels and projects, creating a mockup if applicable, adding code pointers from https://github.com/openfoodfacts/openfoodfacts-server/blob/main/.github/labeler.yml, giving it a priority, or editing the original issue to have a more comprehensive description? Thank you very much for your contribution to 🍊 Open Food Facts

@github-actions github-actions bot added the ⏰ Stale This issue hasn't seen activity in a while. You can try documenting more to unblock it. label Dec 23, 2023
@stephanegigandet
Copy link
Contributor

@CharlesNepote @alexgarel I think it's time to remove the old mongodb dump: #9946

Projects
Status: To discuss and validate