
Conversation

@karencfv (Contributor)

When running some manual tests with reconfigurator adding nodes to an existing cluster, I noticed that every time I added a new node, all of the other nodes would delete their existing oximeter database and create a new oximeter database with its tables from scratch. This is not the behaviour we want: the other nodes should continue as they are, and only the new node should have its database and schema initialised.

The unwanted behaviour when adding a new node can be seen in the following logs from a node that already existed and was part of the cluster.

  • In the first line of the logs we can see that initialisation is skipped because the database already exists and is at the correct version.
  • At some point before 23:11:44.738Z the new node is created; db-wipe.sql runs DROP DATABASE IF EXISTS oximeter ON CLUSTER oximeter_cluster SYNC; which deletes the oximeter database on every node.
  • At 23:11:44.738Z clickhouse-admin sees the schema version is less than the expected version (it assumes "0" if the database doesn't exist) and wipes all of the databases again.
  • At 23:11:51.510Z all of the databases are wiped yet again, presumably because another node was also wiping everything and the nodes entered some sort of DROP DATABASE loop.
  • Finally, at 23:12:27.277Z clickhouse-admin skips initialisation because the database is initialised and at the correct version.
23:11:22.614Z INFO clickhouse-admin-server (ClickhouseCli): skipping initialization of replicated ClickHouse cluster at version 13
    file = clickhouse-admin/src/http_entrypoints.rs:117
23:11:22.614Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
    latency_us = 2412
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::c]:51733
    req_id = a770bb5b-31cb-4bae-9647-cead050daec9
    response_code = 204
    uri = /init
23:11:44.590Z INFO clickhouse-admin-server (dropshot): accepted connection
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:1025
    local_addr = [fd00:1122:3344:101::26]:8888
    remote_addr = [fd00:1122:3344:101::c]:59101
23:11:44.729Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
    latency_us = 35047
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::c]:59101
    req_id = 6319e2c0-3f86-4e50-8565-457aa1f93654
    response_code = 201
    uri = /config
23:11:44.735Z INFO clickhouse-admin-server (ClickhouseCli): initializing replicated ClickHouse cluster to version 13
    file = clickhouse-admin/src/http_entrypoints.rs:102
23:11:44.735Z INFO clickhouse-admin-server (ClickhouseCli): reading db version
    file = oximeter/db/src/client/mod.rs:732
    id = 53f2d94d-0be6-4cc8-bc41-c4653ecc87ee
23:11:44.738Z INFO clickhouse-admin-server (ClickhouseCli): read oximeter database version
    file = oximeter/db/src/client/mod.rs:736
    id = 53f2d94d-0be6-4cc8-bc41-c4653ecc87ee
    version = 0
23:11:44.738Z INFO clickhouse-admin-server (ClickhouseCli): wiping and re-initializing oximeter schema
    file = oximeter/db/src/client/mod.rs:741
    id = 53f2d94d-0be6-4cc8-bc41-c4653ecc87ee
23:11:46.630Z INFO clickhouse-admin-server (dropshot): accepted connection
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:1025
    local_addr = [fd00:1122:3344:101::26]:8888
    remote_addr = [fd00:1122:3344:101::b]:44556
23:11:46.763Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
    latency_us = 39948
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::b]:44556
    req_id = 025a88a4-7fc6-420f-8301-59c2354946b7
    response_code = 201
    uri = /config
23:11:46.919Z INFO clickhouse-admin-server (dropshot): accepted connection
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:1025
    local_addr = [fd00:1122:3344:101::26]:8888
    remote_addr = [fd00:1122:3344:101::a]:38254
23:11:47.049Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
    latency_us = 43307
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::a]:38254
    req_id = 7c651e0d-d685-48c1-bed7-ad47a3dec2d4
    response_code = 201
    uri = /config
23:11:51.504Z INFO clickhouse-admin-server (dropshot): request completed
    error_message_external = Internal Server Error
    error_message_internal = can't initialize replicated ClickHouse cluster to version 13: Native protocol error
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:855
    latency_us = 6774798
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::c]:59101
    req_id = a7af1143-2b0e-44a3-80f6-c64b2ef9f294
    response_code = 500
    uri = /init
23:11:51.509Z WARN clickhouse-admin-server (ClickhouseCli): oximeter database does not exist, or is out-of-date
    file = oximeter/db/src/client/mod.rs:823
    id = 53f2d94d-0be6-4cc8-bc41-c4653ecc87ee
23:11:51.509Z INFO clickhouse-admin-server (ClickhouseCli): initializing replicated ClickHouse cluster to version 13
    file = clickhouse-admin/src/http_entrypoints.rs:102
23:11:51.509Z INFO clickhouse-admin-server (ClickhouseCli): reading db version
    file = oximeter/db/src/client/mod.rs:732
    id = 53f2d94d-0be6-4cc8-bc41-c4653ecc87ee
23:11:51.510Z WARN clickhouse-admin-server (ClickhouseCli): oximeter database does not exist, or is out-of-date
    file = oximeter/db/src/client/mod.rs:823
    id = 53f2d94d-0be6-4cc8-bc41-c4653ecc87ee
23:11:51.510Z INFO clickhouse-admin-server (ClickhouseCli): read oximeter database version
    file = oximeter/db/src/client/mod.rs:736
    id = 53f2d94d-0be6-4cc8-bc41-c4653ecc87ee
    version = 0
23:11:51.510Z INFO clickhouse-admin-server (ClickhouseCli): wiping and re-initializing oximeter schema
    file = oximeter/db/src/client/mod.rs:741
    id = 53f2d94d-0be6-4cc8-bc41-c4653ecc87ee
23:12:01.769Z WARN clickhouse-admin-server (dropshot): request handling cancelled (client disconnected)
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:801
    latency_us = 15003374
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::b]:44556
    req_id = 90380119-9271-488c-b6a4-904f58b5b081
    uri = /init
23:12:02.053Z WARN clickhouse-admin-server (dropshot): request handling cancelled (client disconnected)
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:801
    latency_us = 15003025
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::a]:38254
    req_id = 41ba9f0b-50bd-4115-9dad-6f7c7b6d83d7
    uri = /init
23:12:27.262Z INFO clickhouse-admin-server (ClickhouseCli): inserting current version
    file = oximeter/db/src/client/mod.rs:764
    id = 53f2d94d-0be6-4cc8-bc41-c4653ecc87ee
    version = 13
23:12:27.275Z WARN clickhouse-admin-server (dropshot): request completed after handler was already cancelled
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:943
    local_addr = [fd00:1122:3344:101::26]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::b]:44556
    req_id = 90380119-9271-488c-b6a4-904f58b5b081
    response_code = 204
    uri = /init
23:12:27.277Z INFO clickhouse-admin-server (ClickhouseCli): skipping initialization of replicated ClickHouse cluster at version 13
    file = clickhouse-admin/src/http_entrypoints.rs:117
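
Piecing those log lines together, the pre-fix flow inside initialize_db_with_version() looks roughly like the following. This is a paraphrase reconstructed from the logs, with hypothetical helper names on Client, not the actual code:

    // Paraphrase of the pre-fix logic, reconstructed from the logs above;
    // the helper names on `Client` are hypothetical.
    async fn initialize_db_with_version(
        client: &Client,
        expected_version: u64,
    ) -> anyhow::Result<()> {
        // "version = 0" is assumed when the oximeter database doesn't exist.
        let version = client.read_db_version().await?;
        if version < expected_version {
            // The destructive step: on a replicated cluster the wipe was
            // broadcast via `ON CLUSTER`, hitting every node rather than
            // just the new one.
            client.wipe_db().await?;
            client.init_db().await?;
            client.insert_version(expected_version).await?;
        }
        // Otherwise: "skipping initialization ... at version 13".
        Ok(())
    }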

To fix this issue, we now only wipe the database on the node we're targeting, instead of broadcasting the drop to the whole cluster.
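
The gist of the change: drop the ON CLUSTER clause so the DROP only affects the server the client is connected to. A minimal sketch, with a hypothetical helper name and an illustrative execute() method (the cluster-wide statement quoted earlier comes from db-wipe.sql; the local form here is illustrative):

    // Sketch of the targeted wipe; the helper name is hypothetical.
    // Without `ON CLUSTER`, the DROP only affects the server this client
    // is connected to, so the existing replicas keep their data.
    const WIPE_LOCAL_DB: &str = "DROP DATABASE IF EXISTS oximeter SYNC;";

    async fn wipe_single_node_db(client: &Client) -> anyhow::Result<()> {
        client.execute(WIPE_LOCAL_DB).await
    }

Here are the logs after applying the fix: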

23:47:52.506Z INFO clickhouse-admin-server (ClickhouseCli): skipping initialization of replicated ClickHouse cluster at version 13
    file = clickhouse-admin/src/http_entrypoints.rs:117
23:47:52.506Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
    latency_us = 2386
    local_addr = [fd00:1122:3344:101::28]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::b]:64010
    req_id = 027b52d9-e5c8-4e35-998c-9e62529fb34c
    response_code = 204
    uri = /init
23:47:54.394Z INFO clickhouse-admin-server (dropshot): accepted connection
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:1025
    local_addr = [fd00:1122:3344:101::28]:8888
    remote_addr = [fd00:1122:3344:101::a]:50268
23:47:54.436Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
    latency_us = 41111
    local_addr = [fd00:1122:3344:101::28]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::a]:50268
    req_id = 62e645c8-30bb-4508-b18d-7c70e378c8ee
    response_code = 201
    uri = /config
23:47:54.439Z INFO clickhouse-admin-server (ClickhouseCli): skipping initialization of replicated ClickHouse cluster at version 13
    file = clickhouse-admin/src/http_entrypoints.rs:117
23:47:54.439Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
    latency_us = 2249
    local_addr = [fd00:1122:3344:101::28]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::a]:50268
    req_id = 64303d02-dd4f-45ac-8b2d-9e7f3f7b1e69
    response_code = 204
    uri = /init
23:48:52.437Z INFO clickhouse-admin-server (dropshot): accepted connection
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:1025
    local_addr = [fd00:1122:3344:101::28]:8888
    remote_addr = [fd00:1122:3344:101::b]:61722
23:48:52.477Z INFO clickhouse-admin-server (dropshot): request completed
    file = /home/coatlicue/.cargo/registry/src/index.crates.io-6f17d22bba15001f/dropshot-0.15.1/src/server.rs:867
    latency_us = 39209
    local_addr = [fd00:1122:3344:101::28]:8888
    method = PUT
    remote_addr = [fd00:1122:3344:101::b]:61722
    req_id = 837ef3af-af33-4599-9459-06f0569018ad
    response_code = 201
    uri = /config

root@oxz_clickhouse_server_f5abb38c:~# cat /var/svc/log/oxide-clickhouse-admin-server:default.log | grep wiping | looker
root@oxz_clickhouse_server_f5abb38c:~# 

@andrewjstone (Contributor) left a comment:

Thanks for the find and fix @karencfv. Your solution looks like it should work to me, but I do wonder why we are wiping the databases from the client in the first place. That seems like a major faux pas to leave open as a possibility in production. We still need to solve schema updates for ClickHouse, but wiping is just scary. I wonder if instead we can just remove the wipes from the client altogether to fix this. I'd feel more comfortable with that.

Maybe we should wait for @bnaecker to come back from vacation and discuss.

@karencfv (Contributor, Author) commented Feb 3, 2025

> We still need to solve schema updates for ClickHouse, but wiping is just scary. I wonder if instead we can just remove the wipes from the client altogether to fix this. I'd feel more comfortable with that.

I'm 100% with you on this, yeah. I'll be on standby until @bnaecker returns.

@bnaecker (Collaborator) commented Feb 4, 2025

If we remove the ability to wipe from the client somehow, how would the replicated tests run? Those spin up a single cluster, and continually initialize / wipe the DB to ensure a clean slate between them.

@andrewjstone (Contributor):

> If we remove the ability to wipe from the client somehow, how would the replicated tests run? Those spin up a single cluster, and continually initialize / wipe the DB to ensure a clean slate between them.

Can we not add separate wiping code only to the tests?

@bnaecker (Collaborator) commented Feb 4, 2025

>> If we remove the ability to wipe from the client somehow, how would the replicated tests run? Those spin up a single cluster, and continually initialize / wipe the DB to ensure a clean slate between them.

> Can we not add separate wiping code only to the tests?

Probably! I didn't mean it wasn't possible, only that we need to make sure that still happens if we remove the actual public methods for doing it. There might be other things that rely on wiping, but we should be able to find the callsites of those methods to answer that.
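
For example, a test-only wipe could be gated so production code paths can't reach it. Purely a sketch; the cfg gate, feature name, and method are hypothetical:

    // Sketch: gate wiping behind a test-only cfg so production code paths
    // can't reach it. Names are hypothetical.
    #[cfg(any(test, feature = "testing"))]
    impl Client {
        /// Drop and recreate the oximeter database so each test starts
        /// from a clean slate.
        pub async fn wipe_db_for_tests(&self) -> anyhow::Result<()> {
            self.execute("DROP DATABASE IF EXISTS oximeter SYNC;").await?;
            self.execute("CREATE DATABASE oximeter;").await
        }
    }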

@karencfv (Contributor, Author) commented Feb 4, 2025

Sounds good! I think the objection was mainly to having initialize_db_with_version() wipe the databases when the version of the database is less than the expected version. @bnaecker, are you OK with me removing that bit of code from initialize_db_with_version() altogether? It must have been there for a good reason to begin with, so I'd like to make sure it's fine to move forward with this.

@bnaecker (Collaborator) commented Feb 4, 2025

I think I'm missing some context here. The method initialize_db_with_version() is currently called from two places. First, when the oximeter collector starts up and finds that there is no database at all:

https://github.com/oxidecomputer/omicron/blob/main/oximeter/collector/src/agent.rs#L90-L120

This all predates the admin server work. We needed a way to initialize the database somewhere. Initially, that was just in oximeter itself. Later we added the clickhouse-schema-updater binary to migrate non-destructively between schema versions.

The second callsite is in the clickhouse-admin server:

https://github.com/oxidecomputer/omicron/blob/main/clickhouse-admin/src/context.rs#L427

which is in the ClickhouseAdminServerApi::init_db endpoint handler.

These two seem to be doing very similar things. Are they conflicting now? Should we be using only one of them? That is, should we still have oximeter do initialization at all, or should it simply be waiting for the right version to appear? Or should it be doing initialization, but via the admin server? I just don't quite understand the desired division of responsibility between the collector and admin server now.

Related, the initialize_db_with_version() method has a comment on it. It's clear that, if you want to non-destructively update the ClickHouse database schema, one should use ensure_schema() instead. Is the ClickHouse admin server API supposed to expose that, either in addition to or instead of the initialize-by-wiping version?

@karencfv (Contributor, Author) commented Feb 4, 2025

Initialising the database via clickhouse-admin was introduced in #6903. I am unsure about the specifics as to why the database needs to be started by reconfigurator when it's a single node (@andrewjstone or @plotnick will have more insight into this).

For a replicated cluster the database needs to be initialised by reconfigurator because we add and remove nodes, and sadly ClickHouse does not copy over the schema when a new node is added to the cluster. With what you're saying, it appears we could have oximeter stop initialising the database and leave that all to clickhouse-admin? WDYT?

> Related, the initialize_db_with_version() method has a comment on it. It's clear that, if you want to non-destructively update the ClickHouse database schema, one should use ensure_schema() instead. Is the ClickHouse admin server API supposed to expose that, either in addition to or instead of the initialize-by-wiping version?

I considered using the ensure_schema() method for this fix, but that one requires the database to already exist. I think the main question here is: do we really need initialize_db_with_version() to have the ability to wipe a database? It feels a bit destructive. Also, if it does need to be able to wipe out the entire database, which do you think would be the best way forward for this fix?

@plotnick (Contributor) commented Feb 4, 2025

The original reason for putting schema initialization in the admin server is the scary note immediately preceding initialize_db_with_version:

    /// NOTE: This function is not safe for concurrent usage!

Since reconfigurator execution happens automatically and asynchronously from each nexus, we can't serialize initialization requests from the clients (nexus). So we force everyone to go to the admin server, and use an explicit mutex there to serialize requests.
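
As a sketch of that pattern (field and method names are hypothetical, and this reuses the paraphrased initializer from above):

    use tokio::sync::Mutex;

    // Sketch of the admin-server pattern; names are hypothetical. Every
    // /init request funnels through one mutex, so initialization never
    // runs concurrently even with multiple Nexus instances executing
    // blueprints at once.
    struct ServerContext {
        client: Client,
        init_lock: Mutex<()>,
    }

    impl ServerContext {
        async fn init_db(&self, expected_version: u64) -> anyhow::Result<()> {
            // Held for the whole initialization; a second PUT /init waits
            // here instead of racing the first.
            let _guard = self.init_lock.lock().await;
            initialize_db_with_version(&self.client, expected_version).await
        }
    }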

FWIW, I do think that splitting initialization from wiping makes sense. I tried to use the machinery that was there for single-node, but the multi-node case is clearly more complex. I do not have strong opinions on exactly what that refactor should look like.

@bnaecker (Collaborator) commented Feb 5, 2025

> With what you're saying, it appears we could have oximeter stop initialising the database and leave that all to clickhouse-admin?

I'm not sure, but that seems reasonable. oximeter was originally initializing the database when there was no admin server. Even now, the only time oximeter ever actually does anything is if the database doesn't exist at all, in which case dropping the database is a no-op. I think I'm fine removing the "drop database" part of initialize_db_with_version(), assuming we fix up any tests that rely on that behavior now.

It's also possible we want to get rid of the initialize_db_with_version() method entirely, and instead make ensure_schema() handle the case of a non-existent database by applying all the DDL in sequence. I don't have a strong opinion, but I do think we should be a bit more clear about who is responsible for making sure the database has the right schema.
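
One illustrative shape for that idea, with hypothetical helper names:

    // Sketch: treat a missing database as version 0 and walk forward
    // through every migration in order, so no destructive initializer
    // is needed. Helper names are hypothetical.
    async fn ensure_schema(client: &Client, target: u64) -> anyhow::Result<()> {
        let current = client.read_db_version().await.unwrap_or(0);
        for version in (current + 1)..=target {
            // Apply the DDL files for this schema version.
            client.apply_migration(version).await?;
        }
        Ok(())
    }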

@karencfv (Contributor, Author) commented Feb 6, 2025

> I think I'm fine removing the "drop database" part of initialize_db_with_version(), assuming we fix up any tests that rely on that behavior now.

Excellent! I've updated this PR with that change. Thankfully, the impact on the tests was minimal 😅

> It's also possible we want to get rid of the initialize_db_with_version() method entirely, and instead make ensure_schema() handle the case of a non-existent database by applying all the DDL in sequence. I don't have a strong opinion, but I do think we should be a bit more clear about who is responsible for making sure the database has the right schema.

I agree that we should be clearer about whose responsibility it is to make sure the database exists and has the right schema. Your proposal of tentatively removing initialize_db_with_version() sounds interesting, and I'd like to explore it. That said, I'd like to do so in a follow-up PR. Right now, adding a new server node to the cluster causes all the databases to be destroyed, so I'd like to get this fix in first. Are you OK with that @bnaecker? I'll write up a follow-up issue and assign it to myself.

@bnaecker (Collaborator) commented Feb 6, 2025

> That said, I'd like to do so in a follow-up PR. Right now, adding a new server node to the cluster causes all the databases to be destroyed, so I'd like to get this fix in first. Are you OK with that @bnaecker? I'll write up a follow-up issue and assign it to myself.

Yep, that's totally fine!

@karencfv (Contributor, Author) commented Feb 6, 2025

Thanks @bnaecker! Might I request a ✅ on this PR?

@bnaecker (Collaborator) left a comment:

Ask and ye shall receive @karencfv!


    -        // If we try to upgrade to a newer version, we'll drop old data.
    +        // If we try to upgrade to a newer version, we expect a failure when
    +        // re-initilaising the client.
@bnaecker (Collaborator):

Nit: spelling `reinitialising`

                 self.init_replicated_db().await?;
             }
    -    } else if version > expected_version {
    +    } else if version != expected_version {
@bnaecker (Collaborator):

No real action necessary, but this is part of why I think we probably want to move away from oximeter doing this work at all.

Suppose oximeter starts up, and is sitting at the loop waiting for the DB to be at its expected version. Meanwhile, we're separately upgrading the DB, past oximeter's expected version. This condition might feasibly be true only temporarily -- oximeter would then continue expecting the DB to be at its version, while the updater continues to move beyond it.
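
A sketch of that failure mode, as a hypothetical wait loop:

    // Hypothetical wait loop in oximeter. If the updater moves the schema
    // *past* EXPECTED_VERSION between polls, the equality check never
    // passes again and the collector waits forever.
    loop {
        let version = client.read_db_version().await?;
        if version == EXPECTED_VERSION {
            break; // safe to start collecting
        }
        tokio::time::sleep(std::time::Duration::from_secs(5)).await;
    }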

I'm not 100% sure what to do about this. We might need to commit to upgrading oximeter first, before applying any schema updates. We might also need some explicit way to release oximeter, so that it waits until the updater (the admin server, presumably) tells it to start operating.

@karencfv (Contributor, Author):

Ugh, yeah, those are really good points. I'll take this into account for #7488. Will work on that next btw

@bnaecker (Collaborator):

Sounds good, thanks @karencfv. Let me know if I can help or you need a rubber duck!

@karencfv (Contributor, Author):

Thanks! Will do!

@karencfv karencfv enabled auto-merge (squash) February 6, 2025 01:38
@karencfv karencfv merged commit 0d1dec0 into oxidecomputer:main Feb 6, 2025
16 checks passed
@karencfv karencfv deleted the fix-db-wipe branch February 6, 2025 08:07