Proxy release 2024-03-14 #7119

vipvap · 2024-03-14T06:01:52Z

Proxy release 2024-03-14

Please merge this Pull Request using 'Create a merge commit' button

## Problem

The storage controller binary still has its historic `attachment_service` name -- it will be painful to change this later because we can't atomically update this repo and the helm charts used to deploy.

Companion helm chart change: neondatabase/helm-charts#70

## Summary of changes

- Change the name of the binary to `storage_controller`
- Skipping renaming things in the source right now: this is just to get rid of the legacy name in external interfaces.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>

We have a benchmark for creating a lot of branches, but it does random things, and the branch count is not the largest maximum we aim to support. If this PR stabilizes the benchmark's total duration, it means that some structures are very much slower than others. Then we should add a seed-outputting variant to help find and reproduce such cases.

Additionally, record for the benchmark:

- shutdown duration
- startup metrics once done (on restart)
- duration of first compaction completion via debug logging

## Problem

When vectored get encountered a portion of the key range that could not be mapped to any layer in the current timeline it would incorrectly bail out of the current timeline. This is incorrect since we may have had layers queued for a visit in the fringe.

## Summary of changes

* Add a repro unit test
* Remove the early bail out path
* Simplify range search return value

## Problem

Closes: #6847
Closes: #7006

## Summary of changes

- Pageserver API calls are wrapped in timeout/retry logic: this prevents a reconciler getting hung on a pageserver API hang, and prevents reconcilers having to totally retry if one API call returns a retryable error (e.g. 503).
- Add a cancellation token to `Node`, so that when we mark a node offline we will cancel any API calls in progress to that node, and avoid issuing any more API calls to that offline node.
- If the dirty locations of a shard are all on offline nodes, then don't spawn a reconciler
- In re-attach, if we have no observed state object for a tenant then construct one with conf: None (which means "unknown"). Then in Reconciler, implement a TODO for scanning such locations before running, so that we will avoid spuriously incrementing a generation in the case of a node that was offline while we started (this is the case that tripped up #7006)
- Refactoring: make Node contents private (and thereby guarantee that updates to availability mode reliably update the cancellation token.)
- Refactoring: don't pass the whole map of nodes into Reconciler (and thereby remove a bunch of .expect() calls)

Some of this was discovered/tested with a new failure injection test that will come in a separate PR, once it is stable enough for CI.
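The retry and cancellation behaviour in the first two bullets can be sketched roughly as follows. This is a simplified, synchronous illustration and not the storage controller's code: `CallError`, `call_with_retry`, and the use of an `AtomicBool` in place of an async cancellation token are all assumptions made for the example, and timeouts are left out.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical error type for the sketch; the real API client differs.
#[derive(Debug, PartialEq, Eq)]
enum CallError {
    Retryable(u16), // e.g. 503
    Fatal(u16),     // anything we should not retry
    Cancelled,      // the node was marked offline
    Exhausted,      // ran out of attempts
}

/// Run `op` up to `max_attempts` times, retrying only retryable errors
/// and checking the per-node cancel flag before every attempt.
fn call_with_retry<T>(
    cancel: &AtomicBool,
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, CallError>,
) -> Result<T, CallError> {
    for _ in 0..max_attempts {
        // Stand-in for the node's cancellation token: once set, no
        // further calls are issued to that node.
        if cancel.load(Ordering::Relaxed) {
            return Err(CallError::Cancelled);
        }
        match op() {
            Ok(value) => return Ok(value),
            Err(CallError::Retryable(_)) => continue,
            Err(other) => return Err(other),
        }
    }
    Err(CallError::Exhausted)
}
```

The point of retrying inside the call wrapper is that a single 503 no longer forces the whole reconcile to start over.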

Gets upstream PR nical/rust_debug#3 , removes trailing "s from output.

## Problem

It seems that even though we have a retry on basebackup, it still sometimes fails to fetch it with the failpoint enabled, resulting in a test error.

## Summary of changes

If we fail to get the basebackup, disable the failpoint and try again.

## Summary of changes

Update rustls from 0.21 to 0.22. reqwest/tonic/aws-smithy still use rustls 0.21. No upgrade route available yet.

## Problem

We reverted #6661 a few days ago. The change led to OOMs in benchmarks followed by large WAL reingests.

The issue was that we removed [this code](https://github.com/neondatabase/neon/blob/d04af08567cc3ff94ff19a2f6b3f7a2a1e3c55d1/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs#L409-L417). That call may trigger a roll of the open layer due to the keepalive messages received from the safekeeper. Removing it meant that enforcing of checkpoint timeout became even more lax and led to using up large amounts of memory for the in memory layer indices.

## Summary of changes

Piggyback on keep alive messages to enforce checkpoint timeout. This is a hack, but it's exactly what the current code is doing.

## Alternatives

Christhian, Joonas and myself sketched out a timer based approach [here](#6940). While discussing it further, it became obvious that's also a bit of a hack and not the desired end state. I chose not to take that further since it's not what we ultimately want and it'll be harder to rip out.

Right now it's unclear what the ideal system behaviour is:

* early flushing on memory pressure, or ...
* detaching tenants on memory pressure
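As a rough illustration of the piggyback idea (not the actual walreceiver code), the check performed on each keepalive could look like the sketch below. `OpenLayer` and `should_roll_on_keepalive` are hypothetical names, and the current time is passed in explicitly so the logic is easy to test.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for the open in-memory layer.
struct OpenLayer {
    opened_at: Instant,
}

/// Called whenever a keepalive arrives: decide whether the open layer
/// has outlived the checkpoint timeout and should be rolled.
fn should_roll_on_keepalive(
    layer: &OpenLayer,
    now: Instant,
    checkpoint_timeout: Duration,
) -> bool {
    // Saturating variant avoids a panic if `now` is earlier than `opened_at`.
    now.saturating_duration_since(layer.opened_at) >= checkpoint_timeout
}
```

Because the check only runs when a keepalive is received, enforcement is only as frequent as the keepalives, which is why the description calls it a hack.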

## Problem

For the ephemeral endpoint feature, it's not really too helpful to keep them around in the connection pool. This isn't really pressing but I think it's still a bit better this way.

## Summary of changes

Add `is_ephemeral` function to `NeonOptions`. Allow `serverless::ConnInfo::endpoint_cache_key()` to return an `Option`. Handle that option appropriately.
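A minimal sketch of how an optional cache key keeps ephemeral endpoints out of the pool. The struct layouts, the `ephemeral` flag, and `should_pool` are assumptions for illustration; only the names `NeonOptions`, `is_ephemeral`, `ConnInfo`, and `endpoint_cache_key` come from the description above, and the real types in the proxy differ.

```rust
// Hypothetical, simplified shapes of the proxy types.
struct NeonOptions {
    ephemeral: bool, // assumption: the real check inspects the parsed options
}

impl NeonOptions {
    fn is_ephemeral(&self) -> bool {
        self.ephemeral
    }
}

struct ConnInfo {
    endpoint: String,
    options: NeonOptions,
}

impl ConnInfo {
    /// `None` means "do not cache this connection".
    fn endpoint_cache_key(&self) -> Option<String> {
        if self.options.is_ephemeral() {
            None
        } else {
            Some(self.endpoint.clone())
        }
    }
}

/// Hypothetical caller: only pool connections that have a cache key.
fn should_pool(conn: &ConnInfo) -> bool {
    conn.endpoint_cache_key().is_some()
}
```

Returning `Option` pushes the decision to the type level: every caller has to handle the no-key case explicitly.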

…#7037)

## Problem

Tenants created via the storage controller have a `PlacementPolicy` that defines their HA/secondary/detach intent. For backward compat we can just set it to Single, for onboarding tenants using /location_conf it is automatically set to Double(1) if there are at least two pageservers, but for freshly created tenants we didn't have a way to specify it.

This unblocks writing tests that create HA tenants on the storage controller and do failure injection testing.

## Summary of changes

- Add optional fields to TenantCreateRequest for specifying PlacementPolicy. This request structure is used both on pageserver API and storage controller API, but this method is only meaningful for the storage controller (same as existing `shard_parameters` attribute).
- Use the value from the creation request in tenant creation, if provided.
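The "use the value if provided" rule can be sketched as below. This is an assumption-laden illustration: the field name `placement_policy`, the helper `effective_policy`, and the fallback to `Single` when the field is absent are hypothetical, and only the `Single` and `Double(1)` variants mentioned above are modelled.

```rust
// Simplified model of the policy; the real enum has more to it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PlacementPolicy {
    Single,
    Double(usize),
}

// Hypothetical slice of the creation request: the field is optional so
// that existing callers (and the pageserver API) are unaffected.
struct TenantCreateRequest {
    placement_policy: Option<PlacementPolicy>,
}

/// Pick the policy for a newly created tenant.
/// Assumption: fall back to `Single` when the request does not say.
fn effective_policy(req: &TenantCreateRequest) -> PlacementPolicy {
    req.placement_policy.unwrap_or(PlacementPolicy::Single)
}
```

A test that wants an HA tenant would then send `Some(PlacementPolicy::Double(1))` in the creation request.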

## Problem

When we start compute with newer version of extension (i.e. 1.2) and then rollback the release, downgrading the compute version, next compute start will try to update extension to the latest version available in neon.control (i.e. 1.1).

Thus we need to provide downgrade scripts like neon--1.2--1.1.sql

These scripts must revert the changes made by the upgrade scripts in the reverse order. This is necessary to ensure that the next upgrade will work correctly.

In general, we need to write upgrade and downgrade scripts to be more robust and add IF EXISTS / CREATE OR REPLACE clauses to all statements (where applicable).

## Summary of changes

Adds downgrade scripts. Adds test cases for extension downgrade/upgrade.

fixes #7066

This is a follow-up for https://app.incident.io/neondb/incidents/167?tab=follow-ups

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Alex Chi Z <iskyzh@gmail.com>
Co-authored-by: Anastasia Lubennikova <anastasia@neon.tech>

## Problem

Currently users can cause problems with replication

## Summary of changes

Don't let them replicate

…7072) PR #6953 only excluded throttled time from the handle_pagerequests (aka smgr metrics). This PR implements the deduction for `basebackup ` queries. The other page_service methods either don't use Timeline::get or they aren't used in production. Found by manually inspecting in [staging logs](https://neonprod.grafana.net/explore?schemaVersion=1&panes=%7B%22wx8%22:%7B%22datasource%22:%22xHHYY0dVz%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bhostname%3D%5C%22pageserver-0.eu-west-1.aws.neon.build%5C%22%7D%20%7C~%20%60git-env%7CERR%7CWARN%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22xHHYY0dVz%22%7D,%22editorMode%22:%22code%22%7D%5D,%22range%22:%7B%22to%22:%221709919114642%22,%22from%22:%221709904430898%22%7D%7D%7D).

## Problem

Before this PR, it was possible that on-demand downloads were started after `Timeline::shutdown()`.

For example, we have observed a walreceiver-connection-handler-initiated on-demand download that was started after `Timeline::shutdown()`'s final `task_mgr::shutdown_tasks()` call.

The underlying issue is that `task_mgr::shutdown_tasks()` isn't sticky, i.e., new tasks can be spawned during or after `task_mgr::shutdown_tasks()`.

Cc: #4175 in lieu of a more specific issue for task_mgr. We already decided we want to get rid of it anyways.

Original investigation: https://neondb.slack.com/archives/C033RQ5SPDH/p1709824952465949

## Changes

- enter gate while downloading
- use timeline cancellation token for cancelling download

thereby, fixes #7054

Entering the gate might also remove recent "kept the gate from closing" in staging.

## Problem

We want to report metrics for the oldest user database.

## Problem

`422 Unprocessable Entity: compute time quota of non-primary branches is exceeded` being marked as a control plane error.

## Summary of changes

Add the manual checks to make this a user error that should not be retried.
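A minimal sketch of the kind of manual check described, assuming the classification is keyed on the status code plus the message text. `ErrorKind`, `classify`, and `may_retry` are hypothetical names, and the proxy's real retry logic takes more into account than this.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorKind {
    User,         // caused by the user's account state; retrying won't help
    ControlPlane, // a fault on the control plane side
}

// Message text taken from the problem statement above.
const QUOTA_MESSAGE: &str = "compute time quota of non-primary branches is exceeded";

/// Hypothetical classifier: a 422 carrying the quota message is a user error.
fn classify(status: u16, message: &str) -> ErrorKind {
    if status == 422 && message.contains(QUOTA_MESSAGE) {
        ErrorKind::User
    } else {
        ErrorKind::ControlPlane
    }
}

/// In this sketch the only rule is that user errors are never retried.
fn may_retry(kind: ErrorKind) -> bool {
    kind != ErrorKind::User
}
```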

- The type of heatmap_period in tenant config was wrong
- Secondary download and heatmap upload endpoints weren't in swagger.

Otherwise, it might happen that we never get to witness the same state on subsequent restarts, thus the time series will show the value from a few restarts ago. The actual case here was that "Activating" was showing `3` while I was doing tenant migration testing on staging. The number 3 was however from a startup that happened some time ago which had been interrupted by another deployment.

result_tx and compute_hook were in ServiceState (i.e. behind a sync mutex), but didn't need to be. Moving them up into Service removes a bunch of boilerplate clones. While we're here, create a helper `Service::maybe_reconcile_shard` which avoids writing out all the `&self.` arguments to `TenantState::maybe_reconcile` everywhere we call it.

All of production is using it now as of neondatabase/aws#1121

The change in `flaky_tests.py` resets the flakiness detection logic. The alternative would have been to repeat the choice of io engine in each test name, which would junk up the various test reports too much.

---------

Co-authored-by: Alexander Bayandin <alexander@neon.tech>

…tion (#7064)

Tenant::shutdown or Timeline::shutdown completes and becomes externally observable before the corresponding Tenant/Timeline object is dropped.

For example, after observing a Tenant::shutdown to complete, we could attach the same tenant_id again. The shut down Tenant object might still be around at the time of the attach.

The race is then the following:

- old object's metrics are still around
- new object uses with_label_values
- old object calls remove_label_values

The outcome is that the new object will have the metric objects (they're an Arc internally) but the metrics won't be part of the internal registry and hence they'll be missing in `/metrics`.

Later, when the new object gets shut down and tries to remove_label_values, it will observe an error because the metric was already removed by the old object.

Changes
-------

This PR moves metric removal to `shutdown()`.

An alternative design would be to multi-version the metrics using a distinguishing label, or, to use a better metrics crate that allows removing metrics from the registry through the locally held metric handle instead of interacting with the (globally shared) registry.

refs #7051
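The race can be replayed deterministically with a toy registry. This is a sketch under stated assumptions: `Registry` below is a hypothetical stand-in built on a `HashMap`, not the metrics crate's API; it only mimics the two relevant behaviours (handles are shared `Arc`s, and removal is by label through the shared registry).

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};

// Toy registry: label -> shared counter handle.
#[derive(Default)]
struct Registry {
    metrics: Mutex<HashMap<String, Arc<AtomicU64>>>,
}

impl Registry {
    /// Get or create the metric for `label`; handles are shared Arcs.
    fn with_label_values(&self, label: &str) -> Arc<AtomicU64> {
        let mut map = self.metrics.lock().unwrap();
        map.entry(label.to_string()).or_default().clone()
    }

    /// Remove the metric from the registry; errors if it is already gone.
    fn remove_label_values(&self, label: &str) -> Result<(), String> {
        let mut map = self.metrics.lock().unwrap();
        map.remove(label)
            .map(|_| ())
            .ok_or_else(|| format!("missing metric {label}"))
    }

    fn is_exported(&self, label: &str) -> bool {
        self.metrics.lock().unwrap().contains_key(label)
    }
}

/// Replays the sequence from the description and returns
/// (metric still exported?, did the new object's removal fail?).
fn replay_race() -> (bool, bool) {
    let registry = Registry::default();
    let old = registry.with_label_values("tenant-1"); // old Tenant object
    old.fetch_add(1, Ordering::Relaxed);
    let new = registry.with_label_values("tenant-1"); // re-attached Tenant
    registry.remove_label_values("tenant-1").unwrap(); // old object dropped late
    new.fetch_add(1, Ordering::Relaxed); // handle works, but is unregistered
    let exported = registry.is_exported("tenant-1");
    let removal_failed = registry.remove_label_values("tenant-1").is_err();
    (exported, removal_failed)
}
```

Moving removal into `shutdown()` closes the window because the old object removes its labels before the shutdown becomes externally observable, so a later attach always registers fresh metrics.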

…7082) This is a follow-up to #7051 where `LayerInner::drop` and `LayerInner::evict_blocking` were not noticed to require a gate before the file deletion. The lack of entering a gate opens up a similar possibility of deleting a layer file which a newer Timeline instance has already checked out to be resident in a similar case as #7051.

…poll-uring (#7090) Co-authored-by: Alexander Bayandin <alexander@neon.tech>

proceeding #7010, close #6188

## Summary of changes

This pull request (should) fix all warnings except `-Wdeclaration-after-statement` in the neon extension compilation.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem

Returning from PG_TRY is a bug, and we currently do that

## Summary of changes

Make it break and then return false. This should also help stabilize test_bad_connection.py

To avoid orphaned processes using wiped datadir with confusing logging.

The walproposer pretends to be a walsender in many ways. It has a WalSnd slot, it claims to be a walsender by calling MarkPostmasterChildWalSender() etc. But one difference from real walsenders was that the postmaster still treated it as a bgworker rather than a walsender. The difference is that at shutdown, walsenders are not killed until the very end, after the checkpointer process has written the shutdown checkpoint and exited.

As a result, the walproposer always got killed before the shutdown checkpoint was written, so the shutdown checkpoint never made it to safekeepers. That's fine in principle, we don't require a clean shutdown after all. But it also feels a bit silly not to stream the shutdown checkpoint. It could be useful for initializing hot standby mode in a read replica, for example.

Change postmaster to treat background workers that have called MarkPostmasterChildWalSender() as walsenders. That unfortunately requires another small change in postgres core.

After doing that, walproposers stay alive longer. However, it also means that the checkpointer will wait for the walproposer to switch to WALSNDSTATE_STOPPING state, when the checkpointer sends the PROCSIG_WALSND_INIT_STOPPING signal. We don't have the machinery in walproposer to receive and handle that signal reliably. Instead, we mark walproposer as being in WALSNDSTATE_STOPPING always.

In commit 568f914, I assumed that shutdown will wait for all the remaining WAL to be streamed to safekeepers, but before this commit that was not true, and the test became flaky. This should make it stable again.

Some tests wrongly assumed that no WAL could have been written between pg_current_wal_flush_lsn and quick pg stop after it. Fix them by introducing flush_ep_to_pageserver which first stops the endpoint and then waits till all committed WAL reaches the pageserver.

In passing extract safekeeper http client to its own module.

This test occasionally fails with a difference in "pg_xact/0000" file between the local and restored datadirs. My hypothesis is that something changed in the database between the last explicit checkpoint and the shutdown. I suspect autovacuum, it could certainly create transactions. To fix, be more precise about the point in time that we compare. Shut down the endpoint first, then read the last LSN (i.e. the shutdown checkpoint's LSN), from the local disk with pg_controldata. And use exactly that LSN in the basebackup. Closes #559

…#7087) Not a user-facing change, but can break any existing `.neon` directories created by neon_local, as the name of the database used by the storage controller changes. This PR changes all the locations apart from the path of `control_plane/attachment_service` (waiting for an opportune moment to do that one, because it's the most conflict-ish wrt ongoing PRs like #6676 )

## Problem

On HTTP query timeout, we should try and cancel the current in-flight SQL query.

## Summary of changes

Trigger a cancellation command in postgres once the timeout is reached.

## Summary of changes

The problem it fixes is when `request_lsn` is `u64::MAX-1` the `cont_lsn` becomes `u64::MAX` which is the same as `prev_lsn` which stops the loop.

Closes #6812
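To make the failure mode concrete, here is a toy model of a loop with a "no progress" check, not the pageserver's read path. It assumes that `prev_lsn` starts at a `u64::MAX` sentinel and that `cont_lsn` starts at `request_lsn + 1`; the function names, the `layer_starts` slice, and the `Option`-based variant are illustrative assumptions and not necessarily how the PR fixed it.

```rust
/// Sentinel variant: `prev_lsn` starts at `u64::MAX`, so a legitimate
/// `cont_lsn` of `u64::MAX` is mistaken for "no progress".
/// Assumes `request_lsn < u64::MAX` so the `+ 1` cannot overflow.
fn visits_with_sentinel(request_lsn: u64, layer_starts: &[u64]) -> usize {
    let mut prev_lsn = u64::MAX;
    let mut cont_lsn = request_lsn + 1;
    let mut visited = 0;
    for &start in layer_starts {
        if prev_lsn == cont_lsn {
            break; // loop believes it made no progress
        }
        prev_lsn = cont_lsn;
        cont_lsn = start; // continue the search below this layer
        visited += 1;
    }
    visited
}

/// Variant without a sentinel: "no previous value" is `None`, which can
/// never collide with a real LSN.
fn visits_with_option(request_lsn: u64, layer_starts: &[u64]) -> usize {
    let mut prev_lsn: Option<u64> = None;
    let mut cont_lsn = request_lsn + 1;
    let mut visited = 0;
    for &start in layer_starts {
        if prev_lsn == Some(cont_lsn) {
            break;
        }
        prev_lsn = Some(cont_lsn);
        cont_lsn = start;
        visited += 1;
    }
    visited
}
```

With `request_lsn = u64::MAX - 1` the sentinel variant stops before visiting anything, while the `Option` variant visits every layer in the slice.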

)

## Summary

- Currently we can set stripe size at tenant creation, but it doesn't mean anything until we have multiple shards
- When onboarding an existing tenant, it will always get a default shard stripe size, so we would like to be able to pick the actual stripe size at the point we split.

## Why do this inline with a split?

The alternative to this change would be to have a separate endpoint on the storage controller for setting the stripe size on a tenant, and only permit writes to that endpoint when the tenant has only a single shard. That would work, but be a little bit more work for a client, and not appreciably simpler (instead of having a special argument to the split functions, we'd have a special separate endpoint, and a requirement that the controller must sync its config down to the pageserver before calling the split API).

Either approach would work, but this one feels a bit more robust end-to-end: the split API is the _very last moment_ that the stripe size is mutable, so if we aim to set it before splitting, it makes sense to do it as part of the same operation.

## Problem

Missing error classification for SQL-over-HTTP queries. Not respecting `UserFacingError` for SQL-over-HTTP queries.

## Summary of changes

Adds error classification. Adds user facing errors.

…7104)

## Problem

* quotes in serialized string
* no status if connection is from local cache

## Summary of changes

* remove quotes
* report warm if connection is from local cache

## Problem

Currently cplane communication is a part of the latency monitoring. This makes it impossible to set up proper alerting based on proxy latency alone.

## Summary of changes

Added dimension to exclude cplane latency.

Currently, the flushing operation could flush multiple frozen layers to the disk and store the aggregate time in the histogram. The result is a bimodal distribution with short and over 1000-second flushes. Change it so that we record how long one layer flush takes.

The `tenant_id` in `TenantLocationConfigRequest` in the `location_config` endpoint was only used in the storage controller/attachment service, and there it was only used for assertions and the creation part.

## Problem

Before this PR, `Timeline::get_vectored` would be throttled twice if the sequential option was enabled or if validation was enabled.

Also, `pageserver_get_vectored_seconds` included the time spent in the throttle, which turns out to be undesirable for what we use that metric for.

## Summary of changes

Double-throttle:

* Add `Timeline::get0` method which is unthrottled.
* Use that method from within the `Timeline::get_vectored` code path.

Metric:

* return throttled time from `throttle()` method
* deduct the value from the observed time
* globally rate-limited logging of duration subtraction errors, like in all other places that do the throttled-time deduction from observations
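The deduction step under "Metric" can be sketched as below. This is an illustration only: `observed_without_throttle` is a hypothetical helper, and the rate-limited logging is reduced to returning an error string that a caller could log.

```rust
use std::time::Duration;

/// Subtract the time spent waiting in the throttle from the total
/// elapsed time, so the metric reflects only the actual work.
/// Returns an error instead of panicking or silently clamping when the
/// throttled time exceeds the elapsed time (a "subtraction error").
fn observed_without_throttle(
    elapsed: Duration,
    throttled: Duration,
) -> Result<Duration, String> {
    elapsed.checked_sub(throttled).ok_or_else(|| {
        format!("throttled time {throttled:?} exceeds elapsed time {elapsed:?}")
    })
}
```

For example, a request that took 120 ms in total and waited 20 ms in the throttle would be recorded as 100 ms.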

github-actions · 2024-03-14T06:47:37Z

2652 tests run: 2527 passed, 0 failed, 125 skipped (full report)

Code coverage* (full report)

functions: 28.4% (7031 of 24721 functions)
lines: 47.2% (43454 of 92101 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
192b49c at 2024-03-14T10:05:38.960Z

## Problem

hyper auto-cancels the request futures on connection close. `sql_over_http::handle` is not 'drop cancel safe', so we need to do some other work to make sure connections are queries in the right way.

## Summary of changes

1. tokio::spawn the request handler to resolve the initial cancel-safety issue
2. share a cancellation token, and cancel it when the request `Service` is dropped.
3. Add a new log span to be able to track the HTTP connection lifecycle.

## Problem

Shard splits worked, but weren't safe against failures (e.g. node crash during split) yet.

Related: #6676

## Summary of changes

- Introduce async rwlocks at the scope of Tenant and Node:
  - exclusive tenant lock is used to protect splits
  - exclusive node lock is used to protect new reconciliation process that happens when setting node active
  - exclusive locks used in both cases when doing persistent updates (e.g. node scheduling conf) where the update to DB & in-memory state needs to be atomic.
- Add failpoints to shard splitting in control plane and pageserver code.
- Implement error handling in control plane for shard splits: this detaches child shards and ensures parent shards are re-attached.
- Crash-safety for storage controller restarts requires little effort: we already reconcile with nodes over a storage controller restart, so as long as we reset any incomplete splits in the DB on restart (added in this PR), things are implicitly cleaned up.
- Implement reconciliation with offline nodes before they transition to active:
  - (in this context reconciliation means something like startup_reconcile, not literally the Reconciler)
  - This covers cases where split abort cannot reach a node to clean it up: the cleanup will eventually happen when the node is marked active, as part of reconciliation.
  - This also covers the case where a node was unavailable when the storage controller started, but becomes available later: previously this allowed it to skip the startup reconcile.
- Storage controller now terminates on panics. We only use panics for true "should never happen" assertions, and these cases can leave us in an un-usable state if we keep running (e.g. panicking in a shard split). In the unlikely event that we get into a crashloop as a result, we'll rely on kubernetes to back us off.
- Add `test_sharding_split_failures` which exercises a variety of failure cases during shard split.

danieltprice · 2024-03-21T21:33:21Z

Reviewed for changelog.

jcsp and others added 30 commits March 7, 2024 14:06

Update svg_fmt (#7049)

ce7a82d

Gets upstream PR nical/rust_debug#3 , removes trailing "s from output.

update rustls (#7048)

02358b2

## Summary of changes

Update rustls from 0.21 to 0.22. reqwest/tonic/aws-smithy still use rustls 0.21. No upgrade route available yet.

Revoke REPLICATION (#7052)

4834d22

## Problem

Currently users can cause problems with replication

## Summary of changes

Don't let them replicate

Export db size, deadlocks and changed row metrics (#7050)

d894d2b

## Problem

We want to report metrics for the oldest user database.

proxy: categorise new cplane error message (#7057)

cc5d6c6

## Problem

`422 Unprocessable Entity: compute time quota of non-primary branches is exceeded` being marked as a control plane error.

## Summary of changes

Add the manual checks to make this a user error that should not be retried.

pageserver: update swagger for HA APIs (#7070)

f8483cc

- The type of heatmap_period in tenant config was wrong
- Secondary download and heatmap upload endpoints weren't in swagger.

follow-up(#7077): adjust flaky-test-detection cutoff date for tokio-e…

17a3c90

…poll-uring (#7090) Co-authored-by: Alexander Bayandin <alexander@neon.tech>

fix: warnings when compiling neon extensions (#7053)

73a8c97

proceeding #7010, close #6188

## Summary of changes

This pull request (should) fix all warnings except `-Wdeclaration-after-statement` in the neon extension compilation.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

Don't return from inside PG_TRY (#7095)

9872384

## Problem

Returning from PG_TRY is a bug, and we currently do that

## Summary of changes

Make it break and then return false. This should also help stabilize test_bad_connection.py

SIGQUIT instead of SIGKILL prewarmed postgres.

0cf0731

To avoid orphaned processes using wiped datadir with confusing logging.

proxy: cancel http queries on timeout (#7031)

09699d4

## Problem

On HTTP query timeout, we should try and cancel the current in-flight SQL query.

## Summary of changes

Trigger a cancellation command in postgres once the timeout is reached.

jbajic and others added 8 commits March 12, 2024 16:32

pageserver: fix read path max lsn bug (#7007)

bac06ea

## Summary of changes

The problem it fixes is when `request_lsn` is `u64::MAX-1` the `cont_lsn` becomes `u64::MAX` which is the same as `prev_lsn` which stops the loop.

Closes #6812

proxy http error classification (#7098)

83855a9

## Problem

Missing error classification for SQL-over-HTTP queries. Not respecting `UserFacingError` for SQL-over-HTTP queries.

## Summary of changes

Adds error classification. Adds user facing errors.

proxy: Report warm cold start if connection is from the local cache (#…

0554bee

…7104)

## Problem

* quotes in serialized string
* no status if connection is from local cache

## Summary of changes

* remove quotes
* report warm if connection is from local cache

proxy: add new dimension to exclude cplane latency (#7011)

b0aff04

## Problem

Currently cplane communication is a part of the latency monitoring. This makes it impossible to set up proper alerting based on proxy latency alone.

## Summary of changes

Added dimension to exclude cplane latency.

Make tenant_id in TenantLocationConfigRequest optional (#7055)

5309711

The `tenant_id` in `TenantLocationConfigRequest` in the `location_config` endpoint was only used in the storage controller/attachment service, and there it was only used for assertions and the creation part.

vipvap requested review from a team as code owners March 14, 2024 06:01

vipvap requested review from conradludgate, petuhovskiy, problame and ololobus and removed request for a team March 14, 2024 06:01

conradludgate and others added 2 commits March 14, 2024 08:20

conradludgate force-pushed the rc/proxy/2024-03-14 branch from 354993b to 44f4262 Compare March 14, 2024 09:14

Merge branch 'release-proxy' into rc/proxy/2024-03-14

192b49c

conradludgate approved these changes Mar 14, 2024


khanova merged commit 27bc242 into release-proxy Mar 14, 2024
49 of 50 checks passed

khanova deleted the rc/proxy/2024-03-14 branch March 14, 2024 09:57
