storage hotfix release 2024-09-20 #9085
Merged
Conversation
Signed-off-by: Tristan Partin <tristan@neon.tech>
This also reduces the GRANT statements to one per created `_reset` function.
This should generally be faster when running tests, especially those that run with higher scales. test_lfc_resize is ignored since it seems like we are hitting a query timeout for some reason that I have yet to investigate. A little bit of improvement is better than none.

Signed-off-by: Tristan Partin <tristan@neon.tech>
This is in preparation for replacing neon_fixtures.get_dir_size with neon_fixtures.utils.get_dir_size() in the next commit.
There was another copy of it in utils.py. The only difference is that the version in utils.py tolerates files that are concurrently removed. That seems fine for the few callers in neon_fixtures.py too.
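The helper itself lives in the Python test suite, but the "tolerate concurrently removed files" behaviour is the interesting part. Purely as an illustration of that idea, here is a minimal sketch in Rust (the function name and error handling are illustrative, not the actual fixture code):

```rust
use std::{fs, io, path::Path};

/// Sum the sizes of all files under `path`, skipping entries that
/// disappear while we are walking the tree (e.g. temp files being
/// cleaned up concurrently by another process).
fn dir_size(path: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(path)? {
        let entry = match entry {
            Ok(e) => e,
            // Entry vanished between listing and inspecting it: ignore it.
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        };
        match entry.metadata() {
            Ok(md) if md.is_dir() => total += dir_size(&entry.path())?,
            Ok(md) => total += md.len(),
            Err(e) if e.kind() == io::ErrorKind::NotFound => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}
```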
pg_distrib_dir doesn't include the Postgres version and only depends on env variables which cannot change during a test run, so it can be marked as session-scoped. Similarly, the platform cannot change during a test run.
…eader (#8954)

## Problem

It turns out that we can't rely on external orchestration to promptly route traffic to the new leader. This is downtime inducing. Forwarding provides a safe way out.

## Safety

We forward when:
1. Request is not one of ["/control/v1/step_down", "/status", "/ready", "/metrics"]
2. Current instance is in [`LeadershipStatus::SteppedDown`] state
3. There is a leader in the database to forward to
4. Leader from step (3) is not the current instance

If a storcon instance is persisted in the database, then we know that it is the current leader. There's one exception: the time between handling the step-down request and the new leader updating the database. Let's treat the happy case first. The stepped-down node does not produce any side effects, since all request handling happens on the leader. As for the edge case, we are guaranteed to always have a maximum of two running instances. Hence, if we are in the edge-case scenario, the leader persisted in the database is the stepped-down instance that received the request. Condition (4) above covers this scenario.

## Summary of changes

* Conversion utilities for reqwest <-> hyper. I'm not happy with these, but I don't see a better way. Open to suggestions.
* Add request forwarding logic
* Update each request handler. Again, not happy with this. If anyone knows a nice way to wrap the handlers, let me know. Joonas and I tried :/
* Update each handler to maybe forward
* Tweak tests to showcase the new behaviour
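A hedged sketch of the forwarding decision described in the four conditions above. The names (`LeadershipStatus`, `should_forward`, the address parameters) are illustrative; the real handler wiring and the reqwest/hyper conversions live in the storage controller's HTTP layer:

```rust
/// Paths that must always be served locally, even by a stepped-down node.
const LOCAL_ONLY_PATHS: &[&str] =
    &["/control/v1/step_down", "/status", "/ready", "/metrics"];

#[derive(PartialEq)]
enum LeadershipStatus {
    Leader,
    SteppedDown,
}

/// Decide whether a request should be forwarded to the leader persisted in
/// the database. Returns the leader's address if all four conditions hold.
fn should_forward(
    request_path: &str,
    status: &LeadershipStatus,
    leader_address: Option<&str>,
    own_address: &str,
) -> Option<String> {
    // (1) Never forward the control/health/metrics endpoints.
    if LOCAL_ONLY_PATHS.contains(&request_path) {
        return None;
    }
    // (2) Only a stepped-down instance forwards.
    if *status != LeadershipStatus::SteppedDown {
        return None;
    }
    // (3) A leader must be recorded in the database, and
    // (4) it must not be this instance (covers the window between the
    //     step-down request and the new leader updating the database).
    match leader_address {
        Some(leader) if leader != own_address => Some(leader.to_string()),
        _ => None,
    }
}
```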
## Problem

We've got 2 non-blocking failures on the release pipeline:
- `promote-compatibility-data` job got skipped _presumably_ because one of the dependencies of the `deploy` job (`push-to-acr-dev`) got skipped (#8940)
- `coverage-report` job fails because we don't build debug artifacts in the release branch (#8561)

## Summary of changes

- Always run the `push-to-acr-dev` / `push-to-acr-prod` jobs, but add a `skip_if` parameter to the reusable workflow, which can skip the job internally, without skipping it externally
- Do not run `coverage-report` on release branches
Found by "cargo machete"
We added another migration in 5876c44, but didn't bump this value. This had no effect, but best to fix it anyway.

Signed-off-by: Tristan Partin <tristan@neon.tech>
In user-facing messages, such as log messages, we mentioned "S3", which is not entirely accurate now that we also support Azure blob storage.
## Problem

We use `actions/checkout` with `fetch-depth: 0` even when it's not required.

## Summary of changes

- Remove unneeded `fetch-depth: 0`
- Add a comment where `fetch-depth: 0` is required
Immediate benefit: easier to spot what's going on. Later benefit: use the extracted method in PR #8952, which adds a `ping` command to walredo. Found this useful during the investigation of neondatabase/cloud#16886.
… tests (#8948)

There's currently no way to just start/stop the broker from `neon_local`. This PR
* adds a sub-command
* uses that sub-command from the test suite instead of the pre-existing Python `subprocess`-based approach.

Found this useful during the investigation of neondatabase/cloud#16886.
Commit ca5390a made a similar change to DeltaLayerWriter. Stas and I bumped into this with our hackathon project: a standalone program that creates image layers directly from a Postgres data directory. It needs to create image layers without having a Timeline and other pageserver machinery.

This downgrades the "created image layer {}" message from INFO to TRACE level. TRACE is used for the corresponding message on delta layer creation too. The path logged in the message is now the temporary path, before the file is renamed to its final name. Again, commit ca5390a made the same change for the message on delta layer creation.
Signed-off-by: Tristan Partin <tristan@neon.tech>
(Found this useful during the investigation of neondatabase/cloud#16886.)

Problem
-------

Before this PR, `neon_local` sequentially does the following:
1. launch the storcon process
2. wait for storcon to signal readiness [here](https://github.com/neondatabase/neon/blob/75310fe441b87d399213e365f1364aa9f08aa40d/control_plane/src/storage_controller.rs#L804-L808)
3. start the pageserver
4. wait for the pageserver to become ready [here](https://github.com/neondatabase/neon/blob/c43e664ff577d4568722e4e7a2b2c6267b609607/control_plane/src/pageserver.rs#L343-L346)
5. etc.

The problem is that storcon's readiness waits for the [`startup_reconcile`](https://github.com/neondatabase/neon/blob/cbcd4058edb7a2c2bb3bfe1a6fc1ffb0d820b870/storage_controller/src/service.rs#L520-L523) to complete. But pageservers aren't started at this point. So, in the worst case we wait for `STARTUP_RECONCILE_TIMEOUT/2`, i.e., 15s. This is more than the 10s default timeout allowed by neon_local. The result is that `neon_local start` fails to start storcon and stops everything.

Solution
--------

In this PR I chose the radical solution of starting everything in parallel. It junks up the output because we do stuff like `print!(".")` to indicate progress. We should just abandon that and switch to `utils::logging` + `tracing` with separate spans for each component. I can do that in this PR or we can leave it as a follow-up.

Alternatives Considered
-----------------------

The Pageserver's `/v1/status`, or in fact any endpoint of the mgmt API, will not `accept()` on the mgmt API socket until after the `re-attach` call to storcon has returned success. So, it's insufficient to change the startup order to start pageservers first. We cannot easily change the pageserver startup order because `init_tenant_mgr` must complete before we start serving the mgmt API. Otherwise tenant detach calls et al. can race with `init_tenant_mgr`. We'd have to add a "loading" state to the tenant mgr and make all API endpoints except `/v1/status` wait for _that_ to complete.

Related
-------

- #6475
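A minimal sketch of the "start everything in parallel" approach described in the Solution section, assuming tokio and anyhow; the `start_*` helpers are hypothetical stand-ins, the actual `neon_local` code differs:

```rust
// Hypothetical async start helpers standing in for the real component launchers.
async fn start_storage_controller() -> anyhow::Result<()> { Ok(()) }
async fn start_pageserver(_id: u32) -> anyhow::Result<()> { Ok(()) }
async fn start_safekeeper(_id: u32) -> anyhow::Result<()> { Ok(()) }

async fn start_all() -> anyhow::Result<()> {
    // Launch every component concurrently, so that storcon's
    // startup_reconcile can observe the pageservers coming up instead of
    // storcon's readiness wait timing out while the pageservers sit behind it.
    let (storcon, ps, sk) = tokio::join!(
        start_storage_controller(),
        start_pageserver(1),
        start_safekeeper(1),
    );
    storcon?;
    ps?;
    sk?;
    Ok(())
}
```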
When I checked the log in Grafana I couldn't find the scrubber version. Then I realized that it should be logged after the logger gets initialized.

## Summary of changes

Log the version after initializing the logger for the scrubber.

Signed-off-by: Alex Chi Z <chi@neon.tech>
Dead code is generally useless, but with Postgres constants in particular, I'm also worried that if they're not used anywhere, we might fail to update them at a Postgres version update, and get very confused later when they have wrong values.
…is blocked on sleep (#9072)

## Problem

It seems that the pageserver might be too eager in reporting throttled tasks.

## Summary of changes

Introduce a sleep counter. If the sleep counter increases, then the acquire task was actually throttled.
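A sketch of the sleep-counter idea, assuming a throttle that sometimes makes the acquire path sleep; the names (`SLEEP_COUNT`, `acquire_with_throttle`) are illustrative, not the pageserver's actual throttle code:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

/// Counts how many times an acquire actually had to sleep.
static SLEEP_COUNT: AtomicU64 = AtomicU64::new(0);

/// Hypothetical acquire path: bump the counter only when we really sleep,
/// so a task is reported as throttled only if the counter increased,
/// not merely because it passed through the throttle.
async fn acquire_with_throttle(wait: Option<Duration>) {
    if let Some(wait) = wait {
        SLEEP_COUNT.fetch_add(1, Ordering::Relaxed);
        tokio::time::sleep(wait).await;
    }
}

/// Caller-side check: compare counter snapshots around the acquire.
async fn was_throttled(wait: Option<Duration>) -> bool {
    let before = SLEEP_COUNT.load(Ordering::Relaxed);
    acquire_with_throttle(wait).await;
    SLEEP_COUNT.load(Ordering::Relaxed) > before
}
```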
## Problem

When layer visibility was added, an info log was included for the situation where actual access to a layer disagrees with the visibility calculation. This situation is safe, but I was interested in seeing when it happens. The log is pretty high volume, so this PR refines it to fire less often.

## Summary of changes

- For cases where accessing non-visible layers is normal, don't log at all.
- Extend a unit test to increase confidence that the updates to visibility on access are working as expected.
- During compaction, only call the visibility calculation routine if some image layers were created: previously, frequent calls resulted in the visibility of layers getting reset every time we passed through create_image_layers (see the sketch below).
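A hedged sketch of the compaction-side guard described in the last bullet; the function names are illustrative, not the real compaction code:

```rust
/// Illustrative only: skip the visibility recalculation when compaction
/// produced no new image layers, so per-layer visibility is not reset on
/// every pass through image layer creation.
fn maybe_update_visibility(new_image_layers: usize, update_visibility: impl FnOnce()) {
    if new_image_layers > 0 {
        update_visibility();
    }
}
```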
…erstand throttle queue depth (#9077)
…/2024-09-20-hotfix
problame requested review from hlinnaka, arssher, cloneable, Omrigan and nikitakalyanov and removed the request for a team on September 20, 2024 at 18:52
problame requested review from arpad-m and removed the request for hlinnaka, arssher, cloneable, Omrigan and nikitakalyanov on September 20, 2024 at 18:56
koivunej approved these changes on Sep 20, 2024
This storage hotfix release adds valuable metrics to the pageserver.
We will only deploy this hotfix manually to a dedicated pageserver that is currently empty.
Context https://neondb.slack.com/archives/C07MU9ES6NP/p1726827244185729
Created using