Switch MongoDB to RDMS for storage #317

rgaudin · 2019-11-15T17:06:50Z

MongoDB consumes too much RAM
our use case doesn't need/benefit from this data being in RAM
RAM is expensive on VPS and sloppy
disk storage is cheap

for all those reason, I believe we shall switch from mongo to mariadb. once done, memory usage should be lower and more controllable.

Just make sure that task creation is atomic to ensure only a single worker is assigned to a task.

pymysql and peewee should be sufficient.

The text was updated successfully, but these errors were encountered:

stale · 2020-01-14T17:17:43Z

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

kelson42 · 2020-11-14T06:12:31Z

Needed to proper backup.

stale · 2021-01-14T04:42:04Z

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

kelson42 · 2022-03-23T06:43:36Z

Backup is secured, but we still need to move the MongoDB content to the MariaDB instance.

stale · 2022-06-12T20:08:12Z

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

rgaudin · 2023-03-24T11:04:01Z

Refreshing this ticket more than 3 years later:

zimfarm has a single DB, for its API (dispatcher)
it's a MongoDB so it's fully in RAM and synced to disk
At this very moment, mongo consumes 6GiB of RAM and periodically gets restarted by k8s/kernel due to OOM
DB takes up 1GiB on disk

That's for the issue we face.

Now MongoDB was chosen by previous ZF developer and we never really questioned it until its RAM usage became problematic.

It's important to note that we have no MongoDB expertise. Our instance is not configured. We use the official image (4.2.9) and its defaults.

It's also important to note that we don't really need a schemaless DB. Most of what we store is code-driven and the main (only?) flexible part is the flags in the config. We could store that in a JSON field.

Using Mongo also means a different code, different tools, different backup scenarios, etc. And writing aggreation queries is difficult and not self-explanatory. So it doesn't make sense to use it if we don't benefit from its main feature and it's the only one in a large collection of projects (actually the cardshop API also uses mongo but will switch as well).

For maintenance sake, we want to use a single type of database across our projects (this doen't exclude KV store like redis where see fits) and we've decided to use PostgreSQL because of its wide support and reputation for saefty and performances.

Key tasks:

identify a db stack. We've tried thing with the cms but we're not happy with it.
we want an ORM because that's more convenient and we want to abstract SQL so we may switch to a different engine in the future
we want one that can play nicely with async frameworks such as fastapi
we want to be able to input SQL when needed
we want a schema migration tools
ORM schema
code must be ported to the new DB
update backup
migration plan
deployment

The most important task here being choosing the python stack because we will use those in other projects and we want this sorted once and for a good while.

benoit74 · 2023-04-12T12:38:55Z

Checklist of items to not forget :

fix documentation in CONTRIBUTING.md (add PostgreSQL, explain how to use docker to start PG DB)
pin requirements.txt versions
fix names of marshmallow mapping classes (see https://github.com/openzim/zimfarm/pull/760/files#r1160593142)
remove @url_objectid where it is not usefull + replace by @url_uuid where we need it (and return 400 if not ok)
rework nullable columns of tasks / requested tasks (debug, container, ...) as non-nullable

Post-migration :

fix documentation in CONTRIBUTING.md (remove reference to Mongo)
remove mongo_val + mongo_id columns
remove mongo and mongoexpress from dev compose
remove .github/workflows/db-migration-docker.yaml workflow

benoit74 · 2023-04-14T09:43:53Z

Some topics which have been discussed live with @rgaudin, I reproduce them here for history :)

requested_by and canceled_by on tasks / requested_tasks

In Mongo, tasks have a "requested_by" and "canceled_by" fields which is a user name but not all users exists.
"requested_by" has "satyam" and "period-scheduler" which are missing
"canceled_by" has "periodic-tasks" and "task shutdown (Terminated)" which are missing)
Same remark applies to requested_tasks which have a "requested_by" as well
The decision has been taken to keep these as simple strings in PG for now without any integrity constraints since these fields are used only for display, they will probably be revisited when we will work on external IdP integration and on historisation of users actions

tasks without worker or schedule

In Mongo, tasks have a "worker" and "schedule_name" property. Both are names, and in same rare cases the corresponding schedule and/or worker does not exists, probably because it has been renamed. 529 tasks out of 14432 are concerned.
The decision has been taken to get rid of these tasks since they are in any case already not visible in the UI today.

schedule durations for missing worker

In Mongo, there is a "durations" property on schedule where we store a duration per worker. This duration is associated to its worker by name. There is some schedule durations which are linked to a worker name which does not exists anymore. The decision has been taken to get rid of these durations.

schedule durations with missing tasks

In Mongo, there is a "durations" property on schedule where we store a duration per worker. This duration is associated to the task which has been used to compute this duration by ID. There is some tasks which does not exists anymore. The decision has been taken to keep these durations, since this is a normal behavior. We have to support durations per worker without anymore the corresponding task in DB (i.e. when we purge tasks, we must not delete the duration, just remove the link to the task).

rgaudin added the enhancement label Nov 15, 2019

rgaudin self-assigned this Nov 15, 2019

stale bot added the stale label Jan 14, 2020

stale bot removed the stale label Nov 14, 2020

stale bot added the stale label Jan 14, 2021

stale bot removed the stale label Mar 23, 2022

kelson42 changed the title ~~switch to mariadb for storage~~ Switch to mariadb for storage Mar 23, 2022

stale bot added the stale label Jun 12, 2022

kelson42 pinned this issue Feb 12, 2023

stale bot removed the stale label Feb 12, 2023

kelson42 unassigned rgaudin Feb 12, 2023

kelson42 changed the title ~~Switch to mariadb for storage~~ Switch MongoDB to RDMS for storage Mar 24, 2023

benoit74 mentioned this issue Apr 6, 2023

Migration to PG - Step 1 #760

Merged

rgaudin mentioned this issue Apr 14, 2023

Allow disabling users and workers #763

Closed

This was referenced Apr 23, 2023

Migration to PG - Step 2 #766

Merged

Issue 317 step 3 #768

Merged

benoit74 mentioned this issue May 12, 2023

Issue 317 step 5 #780

Merged

rgaudin linked a pull request May 12, 2023 that will close this issue

Mongo to PostgreSQL Migration #773

Merged

rgaudin closed this as completed in #773 May 12, 2023

kelson42 unpinned this issue May 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch MongoDB to RDMS for storage #317

Switch MongoDB to RDMS for storage #317

rgaudin commented Nov 15, 2019

stale bot commented Jan 14, 2020

kelson42 commented Nov 14, 2020

stale bot commented Jan 14, 2021

kelson42 commented Mar 23, 2022

stale bot commented Jun 12, 2022

rgaudin commented Mar 24, 2023

benoit74 commented Apr 12, 2023 •

edited by rgaudin

Loading

benoit74 commented Apr 14, 2023

Switch MongoDB to RDMS for storage #317

Switch MongoDB to RDMS for storage #317

Comments

rgaudin commented Nov 15, 2019

stale bot commented Jan 14, 2020

kelson42 commented Nov 14, 2020

stale bot commented Jan 14, 2021

kelson42 commented Mar 23, 2022

stale bot commented Jun 12, 2022

rgaudin commented Mar 24, 2023

benoit74 commented Apr 12, 2023 • edited by rgaudin Loading

benoit74 commented Apr 14, 2023

requested_by and canceled_by on tasks / requested_tasks

tasks without worker or schedule

schedule durations for missing worker

schedule durations with missing tasks

benoit74 commented Apr 12, 2023 •

edited by rgaudin

Loading