Storage & Compute release 2024-10-07 #9291

vipvap · 2024-10-07T06:05:13Z

Storage & Compute release 2024-10-07

Please merge this Pull Request using 'Create a merge commit' button

In PG17, there is this newfangled custom wait events system. This commit adds that feature to Neon, so that users can see what their backends may be waiting for when a PostgreSQL backend is playing the waiting game in Neon code.

```shell $ cargo run -p proxy --bin proxy -- --auth-backend=web --webauth-confirmation-timeout=5s ``` ``` $ psql -h localhost -p 4432 NOTICE: Welcome to Neon! Authenticate by visiting within 5s: http://localhost:3000/psql_session/e946900c8a9bc6e9 psql: error: connection to server at "localhost" (::1), port 4432 failed: Connection refused Is the server running on that host and accepting TCP/IP connections? connection to server at "localhost" (127.0.0.1), port 4432 failed: ERROR: Disconnected due to inactivity after 5s. ```

## Problem Recent change to avoid the "became visible" log messages from certain tasks missed a task: the logical size calculation that happens as a child of synthetic size calculation. Related: #9058 ## Summary of changes - Add OnDemandLogicalSize to the list of permitted tasks for reads making a covered layer visible - Tweak the log message to use layer name instead of key: this is more terse, and easier to use when debugging, as one can search for it elsewhere to see when the layer was written/downloaded etc.

## Problem If a timeline was deleted right before waiting for LSNs to catch up before the cut-over, then we would wait forever. ## Summary of changes Fix the issue and add a test for timeline deletions mid migration. Related #9144

## Problem We don't have an alert for long running reconciles. Stuck reconciles are problematic as we've seen in a recent incident. ## Summary of changes Add a new metric `storage_controller_reconcile_long_running_total` with labels: `{tenant_id, shard_number, seq}`. The metric is removed after the long running reconcile finishes. These events should be rare, so we won't break the bank on cardinality. Related #9150

close #9160 For whatever reason, pg17's WAL pattern seems different from others, which triggers some flaky behavior within the compaction smoke test. ## Summary of changes * Run L0 compaction before proceeding with the read benchmark. * So that we can ensure the num of L0 layers is 0 and test the compaction behavior only with L1 layers. We have a threshold for triggering L0 compaction. In some cases, the test case did not produce enough L0 layers to do a L0 compaction, therefore leaving the layer map with 3+ L0 layers above the L1 layers. This increases the average read depth for the timeline. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem We are seeing frequent pageserver startup timelines while it calls syncfs(). There is an existing fixture that syncs _after_ tests, but not before the first one. We hypothesize that some failures are happening on the first test in a job. ## Summary of changes - extend the existing sync_after_each_test to be a sync between all tests, including sync'ing before running the first test. That should remove any ambiguity about whether the sync is happening on the correct node. This is an alternative to #8957 -- I didn't realize until I saw Alexander's comment on that PR that we have an existing hook that syncs filesystems and can be extended.

Per convention, histogram buckets have the '_bucket' suffix. I got that wrong in commit 0d500bb. Fixes #9241

Some parentheses in conditional expressions are redundant or necessary for clarity conditional expressions in walproposer. ## Summary of changes Change some parentheses to clarify conditions in walproposer. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

## Problem ``` Warning: The file chosen for install of requests 2.32.0 (requests-2.32.0-py3-none-any.whl) is yanked. Reason for being yanked: Yanked due to conflicts with CVE-2024-35195 mitigation ``` ## Summary of changes - Update `requests` to fix the warning - Update `psycopg2-binary`

…PR (#9251) compute_ctl is an important part of the interfaces between the control plane and the compute, so it seems important to E2E test any changes there.

Fixes #9231 . Upgrade hyper to 1.4.0 and use hyper 1.4 instead of 0.14 in the storage broker, together with tonic 0.12. The two upgrades go hand in hand. Thanks to the broker being independent from other components, we can upgrade its hyper version without touching the other components, which makes things easier.

As seen in neondatabase/cloud#17335, during releases we can have ingest lags that are above the limits for warnings. However, such lags are part of normal pageserver startup. Therefore, calculate a certain cooldown timestamp until which we accept lags up to a certain size. The heuristic is chosen to grow the later we get to fully load the tenant, and we also add 60 seconds as a grace period after that term.

The apt install stage before this commit: 0 upgraded, 391 newly installed, 0 to remove and 9 not upgraded. Need to get 261 MB of archives. after: 0 upgraded, 367 newly installed, 0 to remove and 9 not upgraded. Need to get 220 MB of archives.

Address minor technical debt in Layer inspired by #9224: - layer usage as arg same as in spans - avoid one Weak::upgrade

Because: - it's nice to be up-to-date, - we already had axum 0.7 in our dependency tree, so this avoids having to compile two versions, and - removes one of the remaining dpendencies to hyper version 0 Also bumps the 'tokio-tungstenite' dependency, to avoid having two versions in the dependency tree.

Follow-up of #9234 to give hyper 1.0 the version-free name, and the legacy version of hyper the one with the version number inside. As we move away from hyper 0.14, we can remove the `hyper0` name piece by piece. Part of #9255

## Problem `Oversized vectored read [...]` logs are spewing in prod because we have a few keys that are unexpectedly large: * reldir/relblock - these are unbounded, so it's known technical debt * slru block - they can be a bit bigger than 128KiB due to storage format overhead ## Summary of changes * Bump threshold to 130KiB * Don't warn on oversized reldir and dbdir keys Closes #8967

Panic was triggered only when dump selected no timelines. sentry report: https://neondatabase.sentry.io/issues/5832368589/

…#9253) See [this comment](#8888 (comment)).

* I had to install `m4` in order to be able to run locally * The docs/docker.md was missing a pointer to where the compute node code is (Was originally on #8888 but I am pulling this out)

Add wrappers for a few commands that didn't have them before. Move the logic to generate tenant and timeline IDs from NeonCli to the callers, so that NeonCli is more purely just a type-safe wrapper around 'neon_local'.

In the passing, rename it to NeonLocalCli, to reflect that the binary is called 'neon_local'. Add wrapper for the 'timeline_import' command, eliminating the last raw call to the raw_cli() function from tests, except for a few in test_neon_cli.py which are about testing the 'neon_local' iteself. All the other calls are now made through the strongly-typed wrapper functions

…ds (#9195) This makes it more clear that the functions in NeonLocalCli are just typed wrappers around the corresponding 'neon_local' commands.

…9195)

## Problem The S3 tests couldn't use SSO authentication for local tests against S3. ## Summary of changes Enable the `sso` feature of `aws-config`. Also run `cargo hakari generate` which made some updates to `workspace_hack`.

## Problem Secondary tenant heatmaps were always downloaded, even when they hadn't changed. This can be avoided by using a conditional GET request passing the `ETag` of the previous heatmap. ## Summary of changes The `ETag` was already plumbed down into the heatmap downloader, and just needed further plumbing into the remote storage backends. * Add a `DownloadOpts` struct and pass it to `RemoteStorage::download()`. * Add an optional `DownloadOpts::etag` field, which uses a conditional GET and returns `DownloadError::Unmodified` on match.

## Problem Creation of a timelines during a reconciliation can lead to unavailability if the user attempts to start a compute before the storage controller has notified cplane of the cut-over. ## Summary of changes Create timelines on all currently attached locations. For the latest location, we still look at the database (this is a previously). With this change we also look into the observed state to find *other* attached locations. Related #9144

NeonWALReader needs to know LSN before which WAL is not available locally, that is, basebackup LSN. Previously it was taken from WalpropShmemState, but that's racy, as walproposer sets its there only after successfull election. Get it directly with GetRedoStartLsn. Should fix flakiness of test_ondemand_wal_download_in_replication_slot_funcs etc. ref #9201

1. Adds local-proxy to compute image and vm spec 2. Updates local-proxy config processing, writing PID to a file eagerly 3. Updates compute-ctl to understand local proxy compute spec and to send SIGHUP to local-proxy over that pid. closes neondatabase/cloud#16867

Fixes (#9020) - Use the compute::COULD_NOT_CONNECT for connection error message; - Eliminate logging for one connection attempt; - Typo fix.

If peer safekeeper needs garbage collected segment it will be fetched now from s3 using on-demand WAL download. Reduces danger of running out of disk space when safekeeper fails.

github-actions · 2024-10-07T06:56:40Z

5085 tests run: 4878 passed, 0 failed, 207 skipped (full report)

Flaky tests (2)

Postgres 17

test_pg_regress[None]: release-arm64

Postgres 16

test_subscriber_restart: release-x86-64

Code coverage* (full report)

functions: 31.4% (7509 of 23942 functions)
lines: 49.6% (60286 of 121592 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
eae4470 at 2024-10-07T06:56:40.096Z :recycle:}

…#9365) ## Problem Action `run-python-test-set` fails if it is not used for `regress_tests` on release PR, because it expects `test_compatibility.py::test_create_snapshot` to generate a snapshot, and the test exists only in `regress_tests` suite. For example, in #9291 [`test-postgres-client-libs`](https://github.com/neondatabase/neon/actions/runs/11209615321/job/31155111544) job failed. ## Summary of changes - Add `skip-if-does-not-exist` input to `.github/actions/upload` action (the same way we do for `.github/actions/download`) - Set `skip-if-does-not-exist=true` for "Upload compatibility snapshot" step in `run-python-test-set` action

MMeent and others added 30 commits October 2, 2024 11:12

Expose more granular wait event data to the user (#9163)

ea32f1d

In PG17, there is this newfangled custom wait events system. This commit adds that feature to Neon, so that users can see what their backends may be waiting for when a PostgreSQL backend is playing the waiting game in Neon code.

Fix metric name of the 'getpage_wait_seconds_bucket' metric (#9242)

d204489

Per convention, histogram buckets have the '_bucket' suffix. I got that wrong in commit 0d500bb. Fixes #9241

Add compute_tools/ to the list of paths that trigger an E2E run on a …

1dec93f

…PR (#9251) compute_ctl is an important part of the interfaces between the control plane and the compute, so it seems important to E2E test any changes there.

chore: smaller layer changes (#9247)

dbef1b0

Address minor technical debt in Layer inspired by #9224: - layer usage as arg same as in spans - avoid one Weak::upgrade

Rename hyper 1.0 to hyper and hyper 0.14 to hyper0 (#9254)

9d93dd4

Follow-up of #9234 to give hyper 1.0 the version-free name, and the legacy version of hyper the one with the version number inside. As we move away from hyper 0.14, we can remove the `hyper0` name piece by piece. Part of #9255

safekeeper: fix panic in debug_dump. (#9097)

d785fcb

Panic was triggered only when dump selected no timelines. sentry report: https://neondatabase.sentry.io/issues/5832368589/

Revert hyper and tonic updates (#9268)

e3d6eca

chore: remove unnecessary comments in compute/Dockerfile.compute-node (…

2fac0b7

…#9253) See [this comment](#8888 (comment)).

chore: makes some onboarding document improvements (#9216)

4e9b32c

* I had to install `m4` in order to be able to run locally * The docs/docker.md was missing a pointer to where the compute node code is (Was originally on #8888 but I am pulling this out)

tests: Rename NeonLocalCli functions to match the 'neon_local' comman…

8ef0c38

…ds (#9195) This makes it more clear that the functions in NeonLocalCli are just typed wrappers around the corresponding 'neon_local' commands.

tests: Add a comment explaining the rules of NeonLocalCli wrappers (#…

52232dd

…9195)

Cargo.toml: enable sso for aws-config (#9261)

60fb840

## Problem The S3 tests couldn't use SSO authentication for local tests against S3. ## Summary of changes Enable the `sso` feature of `aws-config`. Also run `cargo hakari generate` which made some updates to `workspace_hack`.

remote_storage: add head_object integration test (#9274)

04a6222

arssher and others added 4 commits October 4, 2024 14:56

proxy: exclude triple logging of connect compute errors (#9277)

2d248ae

Fixes (#9020) - Use the compute::COULD_NOT_CONNECT for connection error message; - Eliminate logging for one connection attempt; - Typo fix.

safekeeper: remove local WAL files ignoring peer_horizon_lsn. (#8900)

eae4470

If peer safekeeper needs garbage collected segment it will be fetched now from s3 using on-demand WAL download. Reduces danger of running out of disk space when safekeeper fails.

vipvap requested review from a team as code owners October 7, 2024 06:05

vipvap requested review from problame, conradludgate, lubennikovaav, sharnoff and chaporgin and removed request for a team October 7, 2024 06:05

arssher approved these changes Oct 7, 2024

View reviewed changes

arssher merged commit 2613769 into release Oct 7, 2024
273 of 284 checks passed

arssher deleted the rc/2024-10-07 branch October 7, 2024 15:20

bayandin mentioned this pull request Oct 11, 2024

CI(run-python-test-set): allow to skip missing compatibility snapshot #9365

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Storage & Compute release 2024-10-07 #9291

Storage & Compute release 2024-10-07 #9291

vipvap commented Oct 7, 2024

github-actions bot commented Oct 7, 2024

Postgres 17

Postgres 16

Storage & Compute release 2024-10-07 #9291

Storage & Compute release 2024-10-07 #9291

Conversation

vipvap commented Oct 7, 2024