Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch MongoDB to RDMS for storage #317

Closed
rgaudin opened this issue Nov 15, 2019 · 8 comments · Fixed by #773
Closed

Switch MongoDB to RDMS for storage #317

rgaudin opened this issue Nov 15, 2019 · 8 comments · Fixed by #773

Comments

@rgaudin
Copy link
Member

rgaudin commented Nov 15, 2019

  • MongoDB consumes too much RAM
  • our use case doesn't need/benefit from this data being in RAM
  • RAM is expensive on VPS and sloppy
  • disk storage is cheap

for all those reason, I believe we shall switch from mongo to mariadb. once done, memory usage should be lower and more controllable.

Just make sure that task creation is atomic to ensure only a single worker is assigned to a task.

pymysql and peewee should be sufficient.

@rgaudin rgaudin self-assigned this Nov 15, 2019
@stale
Copy link

stale bot commented Jan 14, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jan 14, 2020
@kelson42
Copy link
Contributor

Needed to proper backup.

@stale stale bot removed the stale label Nov 14, 2020
@stale
Copy link

stale bot commented Jan 14, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jan 14, 2021
@kelson42
Copy link
Contributor

Backup is secured, but we still need to move the MongoDB content to the MariaDB instance.

@stale stale bot removed the stale label Mar 23, 2022
@kelson42 kelson42 changed the title switch to mariadb for storage Switch to mariadb for storage Mar 23, 2022
@stale
Copy link

stale bot commented Jun 12, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be now be reviewed manually. Thank you for your contributions.

@stale stale bot added the stale label Jun 12, 2022
@kelson42 kelson42 pinned this issue Feb 12, 2023
@stale stale bot removed the stale label Feb 12, 2023
@rgaudin
Copy link
Member Author

rgaudin commented Mar 24, 2023

Refreshing this ticket more than 3 years later:

  • zimfarm has a single DB, for its API (dispatcher)
  • it's a MongoDB so it's fully in RAM and synced to disk
  • At this very moment, mongo consumes 6GiB of RAM and periodically gets restarted by k8s/kernel due to OOM
  • DB takes up 1GiB on disk

That's for the issue we face.

Now MongoDB was chosen by previous ZF developer and we never really questioned it until its RAM usage became problematic.

It's important to note that we have no MongoDB expertise. Our instance is not configured. We use the official image (4.2.9) and its defaults.

It's also important to note that we don't really need a schemaless DB. Most of what we store is code-driven and the main (only?) flexible part is the flags in the config. We could store that in a JSON field.

Using Mongo also means a different code, different tools, different backup scenarios, etc. And writing aggreation queries is difficult and not self-explanatory. So it doesn't make sense to use it if we don't benefit from its main feature and it's the only one in a large collection of projects (actually the cardshop API also uses mongo but will switch as well).

For maintenance sake, we want to use a single type of database across our projects (this doen't exclude KV store like redis where see fits) and we've decided to use PostgreSQL because of its wide support and reputation for saefty and performances.

Key tasks:

  • identify a db stack. We've tried thing with the cms but we're not happy with it.
  • we want an ORM because that's more convenient and we want to abstract SQL so we may switch to a different engine in the future
  • we want one that can play nicely with async frameworks such as fastapi
  • we want to be able to input SQL when needed
  • we want a schema migration tools
  • ORM schema
  • code must be ported to the new DB
  • update backup
  • migration plan
  • deployment

The most important task here being choosing the python stack because we will use those in other projects and we want this sorted once and for a good while.

@kelson42 kelson42 changed the title Switch to mariadb for storage Switch MongoDB to RDMS for storage Mar 24, 2023
@benoit74
Copy link
Collaborator

benoit74 commented Apr 12, 2023

Checklist of items to not forget :

  • fix documentation in CONTRIBUTING.md (add PostgreSQL, explain how to use docker to start PG DB)
  • pin requirements.txt versions
  • fix names of marshmallow mapping classes (see https://github.com/openzim/zimfarm/pull/760/files#r1160593142)
  • remove @url_objectid where it is not usefull + replace by @url_uuid where we need it (and return 400 if not ok)
  • rework nullable columns of tasks / requested tasks (debug, container, ...) as non-nullable

Post-migration :

  • fix documentation in CONTRIBUTING.md (remove reference to Mongo)
  • remove mongo_val + mongo_id columns
  • remove mongo and mongoexpress from dev compose
  • remove .github/workflows/db-migration-docker.yaml workflow

@benoit74
Copy link
Collaborator

Some topics which have been discussed live with @rgaudin, I reproduce them here for history :)

requested_by and canceled_by on tasks / requested_tasks

In Mongo, tasks have a "requested_by" and "canceled_by" fields which is a user name but not all users exists.
"requested_by" has "satyam" and "period-scheduler" which are missing
"canceled_by" has "periodic-tasks" and "task shutdown (Terminated)" which are missing)
Same remark applies to requested_tasks which have a "requested_by" as well
The decision has been taken to keep these as simple strings in PG for now without any integrity constraints since these fields are used only for display, they will probably be revisited when we will work on external IdP integration and on historisation of users actions

tasks without worker or schedule

In Mongo, tasks have a "worker" and "schedule_name" property. Both are names, and in same rare cases the corresponding schedule and/or worker does not exists, probably because it has been renamed. 529 tasks out of 14432 are concerned.
The decision has been taken to get rid of these tasks since they are in any case already not visible in the UI today.

schedule durations for missing worker

In Mongo, there is a "durations" property on schedule where we store a duration per worker. This duration is associated to its worker by name. There is some schedule durations which are linked to a worker name which does not exists anymore. The decision has been taken to get rid of these durations.

schedule durations with missing tasks

In Mongo, there is a "durations" property on schedule where we store a duration per worker. This duration is associated to the task which has been used to compute this duration by ID. There is some tasks which does not exists anymore. The decision has been taken to keep these durations, since this is a normal behavior. We have to support durations per worker without anymore the corresponding task in DB (i.e. when we purge tasks, we must not delete the duration, just remove the link to the task).

This was referenced Apr 23, 2023
@rgaudin rgaudin linked a pull request May 12, 2023 that will close this issue
@kelson42 kelson42 unpinned this issue May 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants