Compute release 2024-10-02 #9228

lubennikovaav · 2024-10-01T19:11:32Z

No description provided.

Commit 263dfba introduced neon extension version 1.5, which included some new functions and views for metrics. It didn't bump the default neon extension number yet, so that we could still safely roll back to the old binary if necessary. This bumps the default version.

## Problem

Legacy functions that were called as `mgr::` and relied on the static TENANTS, see #5796

## Summary of changes

- Move the last stray function (immediate_gc) into TenantManager

Closes: #5796

## Problem

Neon components built locally and by the GitHub workflow have slightly different version prefixes (`git:` vs `git-env:`). This prevents tests from running correctly against local builds.

## Summary of changes

The regular expressions were changed to work with both prefixes.
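As a minimal sketch of a pattern that accepts both prefixes: the actual regular expressions changed in this commit are not shown here, so the pattern, the `extract_sha` helper, and the sample version strings below are all assumptions for illustration.

```python
import re
from typing import Optional

# Hypothetical pattern: accepts either "git:" or "git-env:" before a hex SHA.
VERSION_RE = re.compile(r"git(?:-env)?:(?P<sha>[0-9a-f]+)")


def extract_sha(version_line: str) -> Optional[str]:
    """Return the commit SHA from a version string, or None if no prefix matches."""
    match = VERSION_RE.search(version_line)
    return match.group("sha") if match else None


# Sample strings are made up; only the two prefixes come from the description.
print(extract_sha("pageserver git:5cabf32"))
print(extract_sha("pageserver git-env:263dfba"))
```

Making the `-env` part optional keeps a single pattern for both build flavours instead of maintaining two.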

## Problem

This test waits for a request to finish, and then expects deletion to complete almost immediately. The request completes, but with a 202: the timeline is still being deleted in the background, so we need to be more patient.

## Summary of changes

- Adjust iterations from 2 to 10 when waiting for deletion
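The effect of the larger iteration count can be sketched with a generic polling helper. This is not the test suite's own helper; `poll_until` and the simulated deletion check are hypothetical names used only to show why 2 iterations may be too few when deletion finishes in the background.

```python
import time
from typing import Callable


def poll_until(iterations: int, interval: float, check: Callable[[], bool]) -> int:
    """Call check() up to `iterations` times, sleeping `interval` seconds between tries.

    Returns the 1-based attempt on which check() succeeded, or raises TimeoutError.
    """
    for attempt in range(1, iterations + 1):
        if check():
            return attempt
        if attempt < iterations:
            time.sleep(interval)
    raise TimeoutError(f"condition not met after {iterations} attempts")


def make_deletion_check(done_on_attempt: int) -> Callable[[], bool]:
    """Simulate a background deletion that is only visible on the Nth check."""
    calls = {"n": 0}

    def check() -> bool:
        calls["n"] += 1
        return calls["n"] >= done_on_attempt

    return check
```

With a deletion that becomes visible on the fourth check, `poll_until(2, 0, ...)` raises `TimeoutError`, while `poll_until(10, 0, ...)` returns 4.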

These calls seem really out of place. We know what the initial tenant and branch are in these tests, just like in all other tests.

neon_cli.create_tenant() creates a new tenant *and* a timeline on the tenant, with name "main". In most tests, there's no need to create another timeline on the same tenant. There are some more tests that do that, but in the remaining cases, I wasn't 100% sure whether the presence of extra root timelines affects what the tests test, so I left them alone.

There is no 'pg_bin' in NeonEnv.

resolves neondatabase/cloud#18026

We haven't updated it for a while. Now I need the update to add quotas support to compute images (neondatabase/cloud#13127). Previous update: #7849

Opens http2 connection to local-proxy and forwards requests over with all headers and body closes neondatabase/cloud#16039

libcurl4-openssl-dev is needed to build pgxn/, but libcurl4 is enough at runtime.

… migration origin becomes unavailable (#9147)

## Problem

The live migration code waits forever for the compute notification hook, on the basis that until it succeeds, the compute is probably using the old location and we shouldn't detach it.

However, if a pageserver stops or restarts in the background, then this original location might no longer be available, so there is no point waiting. Waiting is also actively harmful, because it prevents other reconciliations happening for the tenant shard, such as during an upgrade where a stuck "drain" migration might prevent the later "fill" migration from moving the shard back to its original location.

## Summary of changes

- Refactor the notification wait loop into a function
- Add checks during the loop for the origin node's cancellation token, plus an explicit HTTP request to the origin node to confirm the shard is still attached there.

Closes: #8901
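The shape of such a wait loop can be sketched as follows. This is an illustration in Python rather than the storage controller's code: the function name, the three callables, and the use of `threading.Event` as a stand-in for a cancellation token are all assumptions.

```python
import threading
from typing import Callable


def await_compute_notification(
    notify: Callable[[], bool],
    origin_cancelled: threading.Event,
    origin_still_attached: Callable[[], bool],
    retry_interval: float = 1.0,
) -> bool:
    """Retry notify() until it succeeds or the origin location is gone.

    Returns True once the compute has been notified. Returns False when
    waiting is pointless: the origin node was cancelled, or it reports
    that the shard is no longer attached there.
    """
    while True:
        if notify():
            return True
        if origin_cancelled.is_set():
            return False
        if not origin_still_attached():
            return False
        # Sleep between retries, but wake early if the origin is cancelled.
        origin_cancelled.wait(retry_interval)
```

Returning instead of looping forever lets the caller proceed with other reconciliations for the shard once the origin can no longer be serving the compute.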

* tracing-utils now returns a `Layer` impl. Removes the need for crates to import OTel crates.
* Drop the /v1/traces URI check. Verified that the code does the right thing.
* Leave a TODO to hook in an error handler for OTel to log errors to when it assumes the regular pipeline cannot be used/is broken.

## Problem

`test_multi_attach` is sometimes failing with `invalid compute status for configuration request: Configuration`.

This is likely a result of the test attempting to reconfigure the compute at the same time as the storage controller is doing so. This test was originally written before the storage controller existed, and is not expecting anything else to be reconfiguring computes at the same time.

## Summary of changes

- Configure the tenant into scheduling policy `Stop` in the storage controller at the start of the test, so that it won't try to do anything to the tenant while the test is running.

## Problem

We need the [pg_session_jwt](https://github.com/neondatabase/pg_session_jwt/) extension in the compute image. This PR adds it.

## Summary of changes

I added the `pg_session_jwt` extension in a very similar way to how the pggraphql and pgtiktoken extensions were added (since they're all written with pgrx).

Then I tested this.

```
$ cd docker-compose/
$ PG_VERSION=16 TAG=10667533475 docker-compose up --build -d
$ psql postgresql://cloud_admin:cloud_admin@localhost:55433/postgres
cloud_admin@postgres=# create extension pg_session_jwt;
CREATE EXTENSION
Time: 43.048 ms
cloud_admin@postgres=# \df auth.*;
                              List of functions
┌────────┬──────────────────┬──────────────────┬─────────────────────┬──────┐
│ Schema │       Name       │ Result data type │ Argument data types │ Type │
├────────┼──────────────────┼──────────────────┼─────────────────────┼──────┤
│ auth   │ get              │ jsonb            │ s text              │ func │
│ auth   │ init             │ void             │ kid bigint, s jsonb │ func │
│ auth   │ jwt_session_init │ void             │ s text              │ func │
│ auth   │ user_id          │ text             │                     │ func │
└────────┴──────────────────┴──────────────────┴─────────────────────┴──────┘
(4 rows)

cloud_admin@postgres=# select auth.init(cast('1' as bigint), to_jsonb(TEXT '{ "kty": "EC", "kid": "571683be-33cf-4e67-bccc-8905c0ebb862", "crv": "P-521", "alg": "ES512", "x": "AM_GsnQvKML2yXdn_OsN8PdgO1Sf9XMXih5vQMKLmJkp-Iz_FFWJUt6uyR_qp4brr8Ji2kjGJgN4cQJpg2kskH7V", "y": "AZg-salw24lCmsBP-BCBa5jT6INkTwLtCOC7o0BIxDVvmIEH1-PQAJVYVJPTFvPMi_PLa0QlOm-ufJYkynwa2Mau" }'));
ERROR: called `Result::unwrap()` on an `Err` value: Error("invalid type: string \"{ \\\"kty\\\": \\\"EC\\\", \\\"kid\\\": \\\"571683be-33cf-4e67-bccc-8905c0ebb862\\\", \\\"crv\\\": \\\"P-521\\\", \\\"alg\\\": \\\"ES512\\\", \\\"x\\\": \\\"AM_GsnQvKML2yXdn_OsN8PdgO1Sf9XMXih5vQMKLmJkp-Iz_FFWJUt6uyR_qp4brr8Ji2kjGJgN4cQJpg2kskH7V\\\", \\\"y\\\": \\\"AZg-salw24lCmsBP-BCBa5jT6INkTwLtCOC7o0BIxDVvmIEH1-PQAJVYVJPTFvPMi_PLa0QlOm-ufJYkynwa2Mau\\\" }\", expected struct JwkEcKey", line: 0, column: 0)
Time: 6.991 ms
```

## Checklist before requesting a review

- [x] I have performed a self-review of my code.
- [ ] If it is a core feature, I have added thorough tests.
- [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
- [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

## Checklist before merging

- [ ] Move the download location to a proper URL
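One possible reading of the `auth.init` error in the test session: `to_jsonb` applied to a `text` value yields a jsonb *string*, not an object, which would match "invalid type: string ... expected struct JwkEcKey". The Python sketch below is only an analogy built on the standard `json` module; the shortened key and the variable names are illustrative and not taken from the PR. Casting the literal with `::jsonb` would presumably pass an object instead, though that was not tried in the session.

```python
import json

# Shortened stand-in for the JWK text used in the psql session.
jwk_text = '{"kty": "EC", "crv": "P-521", "alg": "ES512"}'

# Analogue of to_jsonb(TEXT '...'): the whole text becomes one JSON string value.
wrapped = json.loads(json.dumps(jwk_text))

# Analogue of '...'::jsonb: the text is parsed, giving a JSON object.
parsed = json.loads(jwk_text)

print(type(wrapped).__name__)  # str
print(type(parsed).__name__)   # dict
```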

## Problem

In automated tests running on AWS S3, we frequently see scrubber failures when it can't delete an index.

`location_conf_churn`: https://neon-github-public-dev.s3.amazonaws.com/reports/main/11076221056/index.html#/testresult/f89b1916b6a693e2

`scrubber_physical_gc`: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-9178/11074269153/index.html#/testresult/9885ed5aa0fe38b6

## Summary of changes

Wrap index deletion in a backoff::retry

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
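The idea of retrying a flaky deletion with backoff can be sketched generically. This is a Python illustration, not the signature or behaviour of the Rust `backoff::retry` helper named above; `retry_with_backoff`, its parameters, and the simulated failure are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    op: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run op(), retrying on any exception with exponentially growing delays.

    The delay before retry k (0-based) is min(max_delay, base_delay * 2**k).
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(min(max_delay, base_delay * 2**attempt))
    raise ValueError("max_attempts must be at least 1")


def make_flaky_delete(failures: int) -> Callable[[], str]:
    """Simulate an index deletion that fails `failures` times, then succeeds."""
    state = {"left": failures}

    def delete() -> str:
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("transient storage error")
        return "deleted"

    return delete
```

Passing `sleep` as a parameter lets a test record the delays instead of actually waiting.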

Microsoft exposes JWKs without the alg header. It's only included on the tokens. Not a problem.

Also noticed that wrt the `typ` header:

> It will typically not be used by applications when it is already known that the object is a JWT. This parameter is ignored by JWT implementations; any processing of this parameter is performed by the JWT application.

Since we know we are expecting JWTs only, I've followed the guidance and removed the validation.

On bookworm, 'cmake' is new enough that we can just use it. On bullseye, we can get a new-enough package from backports. By including 'cmake' in the build-deps stage, we don't need to install it separately in all the later build stages that need it. See #2699, where we switched to downloading and building a specific version.

These are the perf counters added in commit 263dfba.

Note: This relies on 'neon' extension version 1.5. The default was bumped to 1.5 in commit d696c41.

---------

Co-authored-by: Matthias van de Meent <matthias@neon.tech>

aux v2 migration is near the end and I rewrote the RFC based on what I proposed (several months before...) and what I actually implemented.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem

There is a wrong log message.

## Summary of changes

Fixed the log message.

Bring back post_apply_config() step that was accidentally removed in 78938d1

Following #7656, `TenantConfOpt::TryFrom<toml_edit::Item>` appears to be dead code. This patch removes `TenantConfOpt::TryFrom<toml_edit::Item>`. The code does appear to be dead, since the TOML config is deserialized into `TenantConfig` (via `LocationConfig`) and then converted into `TenantConfOpt`. This was verified by adding a panic to `try_from()` and running the pageserver unit tests as well as a local end-to-end cluster (including creating a new tenant and restarting the pageserver). This did not fail, so this is not used on the common happy path at least. No explicit `try_from` or `try_into` calls were found either. Resolves #8918.

Found while searching for other issues in shared memory. The bug should be benign, in that it over-allocates memory for this struct, but doesn't allow for out-of-bounds writes.

…9099)

When an endpoint is stopped in immediate mode and started again, there is a chance that an old connection delivers some WAL to safekeepers after the second start has already checked whether sync-safekeepers is needed and thus grabbed the basebackup LSN. This makes the basebackup unusable, so compute panics. Avoid flakiness by waiting for walreceivers on safekeepers to be gone in such cases. A better way would be to bump the term on safekeepers if sync-safekeepers is skipped, but that needs more infrastructure.

ref #9079

Previously we set the 'backpressure throttling' status, but that overwrote the current one and we never reset it back.

MaxBackends doesn't include auxiliary processes. Whenever an aux process made IO operations that updated the counters, it would scribble over shared memory beyond the end of the array. The relsize cache hash table comes after the array, so the symptom was an error about hash table corruption in the relsize cache hash.

github-actions · 2024-10-01T20:02:00Z

5022 tests run: 4864 passed, 0 failed, 158 skipped (full report)

Flaky tests (3)

Postgres 17

test_pageserver_compaction_smoke: release-x86-64, release-arm64

Postgres 16

test_subscriber_restart: release-x86-64

Code coverage* (full report)

functions: 31.4% (7490 of 23881 functions)
lines: 49.6% (60113 of 121224 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: 8861e8a at 2024-10-01T20:01:59.852Z

danieltprice · 2024-10-10T19:44:34Z

Reviewed for changelog

hlinnaka and others added 27 commits September 30, 2024 09:20

pageserver: refactor immediate_gc into TenantManager (#9183)

7cfd116

## Problem

Legacy functions that were called as `mgr::` and relied on the static TENANTS, see #5796

## Summary of changes

- Move the last stray function (immediate_gc) into TenantManager

Closes: #5796

tests: Remove some spurious list_timelines calls

4dc9cb7

These calls seem really out of place. We know what the initial tenant and branch are in these tests, just like in all other tests.

tests: Move comment to more appropriate place

0a567ac

There is no 'pg_bin' in NeonEnv.

add proxy-protocol header disable option (#9203)

a2e2362

resolves neondatabase/cloud#18026

Bump vm-builder v0.29.3 -> v0.35.0 (#9208)

c07cea8

We haven't updated it for a while. Now I need the update to add quotas support to compute images (neondatabase/cloud#13127). Previous update: #7849

proxy: auth broker (#8855)

94a5ca2

Opens http2 connection to local-proxy and forwards requests over with all headers and body closes neondatabase/cloud#16039

Remove unnecessary dev package from compute image (#9210)

65bda19

libcurl4-openssl-dev is needed to build pgxn/, but libcurl4 is enough at runtime.

Add new compute metrics to sql exporter (#9190)

0d500bb

These are the perf counters added in commit 263dfba.

Note: This relies on 'neon' extension version 1.5. The default was bumped to 1.5 in commit d696c41.

---------

Co-authored-by: Matthias van de Meent <matthias@neon.tech>

docs: add aux file v2 RFC (#9115)

49f99eb

aux v2 migration is near the end and I rewrote the RFC based on what I proposed (several months before...) and what I actually implemented.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

safekeeper: Fix a log message of HTTP worker (#9213)

b675997

## Problem

There is a wrong log message.

## Summary of changes

Fixed the log message.

Fix post_apply_config() (#9220)

ce73db9

Bring back post_apply_config() step that was accidentally removed in 78938d1

Fix small memory accounting bug in libpagestore (#9223)

6efdb1d

Found while searching for other issues in shared memory. The bug should be benign, in that it over-allocates memory for this struct, but doesn't allow for out-of-bounds writes.

Backpressure: reset ps display after it is done. (#8980)

62e22df

Previously we set the 'backpressure throttling' status, but that overwrote the current one and we never reset it back.

lubennikovaav requested review from a team as code owners October 1, 2024 19:11

lubennikovaav requested a review from a team as a code owner October 1, 2024 19:11

lubennikovaav requested review from problame, conradludgate, tristan957, chaporgin, hlinnaka and sharnoff and removed request for a team October 1, 2024 19:11

hlinnaka approved these changes Oct 1, 2024


sharnoff approved these changes Oct 1, 2024


lubennikovaav merged commit 5cabf32 into release Oct 1, 2024
150 of 152 checks passed

lubennikovaav deleted the releases/2024-10-01-compute-only branch October 1, 2024 20:36
