Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ingest in transaction #822

Open
wants to merge 8 commits into
base: main
Choose a base branch
from
Open

Ingest in transaction #822

wants to merge 8 commits into from

Conversation

mcovalt
Copy link
Contributor

@mcovalt mcovalt commented Jan 2, 2023

This PR changes the ingestion to work inside of a transaction on a single database rather than via an A/B database.

Intro

Previously: We had two databases. The web server would operate on one database and ingestion would clear-then-load the other. Data that we persisted was copied from the live database to the other. We'd rotate the databases and re-deploy after a successful ingest.

With this PR: There is only one database. The ingestion opens a transaction and clears the database without locking reads. Changes from the ingestion are visible without intervention on the web server if ingestion succeeds. Changes are discarded if ingestion fails or disconnects.

Goals

The long-term goals of this change are:

  1. Supporting the ability to persist the mga_gene_function table without a full ingest. Re-creating the mga_gene_function table takes 48+ hours. While this PR will still clear the mga_gene_function table without a full ingest, it lays the groundwork necessary to build a solution that persists this table without a full ingest. This would allow this table to be stale, but refresh other data (a partial ingest only takes ~1 hour). Work is needed to selectively drop rows in the mga_gene_function that violate FK constraints in the presence of altered dependent data.
  2. Simplify operations. The A/B database is unconventional and difficult to reason about. This is particularly true in the presence of data that must be persisted which increases the risk of data loss.

Cons

This approach, however, has a con: database "bloat" is likely to accrue. With the A/B database, tables were truncated and therefore all indexes were rebuilt and no dead tuples remained after ingestion. Now, we DELETE FROM and we'll have dead tuples. Conventional tools can combat this. The scheduled VACUUM that's already active will take care of a lot of this. A VACUUM FULL will achieve the same results as the A/B approach, but does lock reads during it's process. There's pg_repack which can be used to VACUUM FULL without the lock.

@mcovalt
Copy link
Contributor Author

mcovalt commented Jan 2, 2023

The really big diff is mainly due to e55995a. There I'm making all foreign keys be deferrable. This allows us to DELETE FROM all tables without having to care about the order in which we do so.

Comment on lines +30 to +36
with session_maker.begin() as db: # type: ignore
database.get_ingest_lock(db)
alembic_cfg = Config(str(Path(__file__).parent / "alembic.ini"))
alembic_cfg.set_main_option("script_location", str(Path(__file__).parent / "migrations"))
alembic_cfg.set_main_option("sqlalchemy.url", database_uri)
alembic_cfg.attributes["configure_logger"] = True
command.upgrade(alembic_cfg, "head")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This invariably runs the migration which results in a no-op if no migrations are required. However, the migration command must acquire the ingest lock.

Since our container invokes the nmdc-server migrate command on startup, a currently running ingestion will block the container from even starting as the migrate command will throw an error if the lock cannot be acquired.

This is desirable behavior if migrations must be applied, but bad if the migration command would be a no-op. We need to change this to only conditionally apply migrations if required. Luckily, this has been implemented in the past.

@shreddd
Copy link
Contributor

shreddd commented Jan 3, 2023

Since this is a relatively significant update to the ingest process, I'd like to request that we discuss the change at the next Kitware / NMDC sync before rolling into production.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants