Release 2024-05-13 #7713

vipvap · 2024-05-13T06:04:19Z

Release 2024-05-13

Please merge this Pull Request using 'Create a merge commit' button

This is used by safekeeper tests.

Once all the computes in production have restarted, we can remove protocol version 1 altogether. See issue #6211.

…7594) ## Problem In testing of the earlier fix for OOMs under heavy write load (#7218), we saw that the limit on ephemeral layer size wasn't being reliably enforced. That was diagnosed as being due to overwhelmed compaction loops: most tenants were waiting on the semaphore for background tasks, and thereby not running the function that proactively rolls layers frequently enough. Related: #6939 ## Summary of changes - Create a new per-tenant background loop for "ingest housekeeping", which invokes maybe_freeze_ephemeral_layer() without taking the background task semaphore. - Downgrade to DEBUG a log line in maybe_freeze_ephemeral_layer that had been INFO, but turns out to be pretty common in the field. There's some discussion on the issue (#6939 (comment)) about alternatives for calling this maybe_freeze_epemeral_layer periodically without it getting stuck behind compaction. A whole task just for this feels like kind of a big hammer, but we may in future find that there are other pieces of lightweight housekeeping that we want to do here too. Why is it okay to call maybe_freeze_ephemeral_layer outside of the background tasks semaphore? - this is the same work we would do anyway if we receive writes from the safekeeper, just done a bit sooner. - The period of the new task is generously jittered (+/- 5%), so when the ephemeral layer size tips over the threshold, we shouldn't see an excessively aggressive thundering herd of layer freezes (and only layers larger than the mean layer size will be frozen) - All that said, this is an imperfect approach that relies on having a generous amount of RAM to dip into when we need to freeze somewhat urgently. It would be nice in future to also block compaction/GC when we recognize resource stress and need to do other work (like layer freezing) to reduce memory footprint.

## Problem The current `tenant_slots` metric becomes less useful once we have lots of secondaries, because we can't tell how many tenants are really attached (without doing a sum() on some other metric). ## Summary of changes - Add a `mode` label to this metric - Update the metric with `slot_added` and `slot_removed` helpers that are called at all the places we mutate the tenants map. - Add a debug assertion at shutdown that checks the metrics add up to the right number, as a cheap way of validating that we're calling the metric hooks in all the right places.

s/temporary/temporarily --------- Co-authored-by: Barry Grenon <barry_grenon@yahoo.ca>

We don't actually use it. refs #7555

#7585 introduced test case for deletions while synthetic size is being calculated. The test has a race against deletion, but we only accept one outcome. Fix it to accept 404 as well, as we cannot control from outside which outcome happens. Evidence: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7456/8970595458/index.html#/testresult/32a5b2f8c4094bdb

Changes parameters to fix the flakiness of `test_ts_of_lsn_api`. Already now, the amount of flakiness of the test is pretty low. With this, it's even lower. cc #5768

## Problem There's allegedly a bug where if we connect a subscriber before WAL is downloaded from the safekeeper, it creates an error. ## Summary of changes Adds support for pausing safekeepers from sending WAL to computes, and then creates a compute and attaches a subscriber while it's in this paused state. Fails to reproduce the issue, but probably a good test to have --------- Co-authored-by: Arseny Sher <sher-ars@yandex.ru>

## Problem Usually, the connection itself is quite fast (bellow 10ms for p999: https://neonprod.grafana.net/goto/aOyn8vYIg?orgId=1). It doesn't make a lot of sense to wait for a lot of time for the lock, if it takes a lot of time to acquire it, probably, something goes wrong. We also spawn a lot of retries, but they are not super helpful (0 means that it was connected successfully, 1, most probably, that it was re-request of the compute node address https://neonprod.grafana.net/goto/J_8VQvLIR?orgId=1). Let's try to keep a small number of retries.

…7613)

## Problem After a shard split of a large existing tenant, child tenants can end up with oversized historic layers indefinitely, if those layers are prevented from being GC'd by branchpoints. This PR is followed by #7531 Related issue: #7504 ## Summary of changes - Add a new compaction phase `compact_shard_ancestors`, which identifies layers that are no longer needed after a shard split. - Add a Timeline->LayerMap code path called `rewrite_layers` , which is currently only used to drop layers, but will later be used to rewrite them as well in #7531 - Add a new test that compacts after a split, and checks that something is deleted. Note that this doesn't have much impact on a tenant's resident size (since unused layers would end up evicted anyway), but it: - Makes index_part.json much smaller - Makes the system easier to reason about: avoid having tenants which are like "my physical size is 4TiB but don't worry I'll never actually download it", instead have tenants report the real physical size of what they might download. Why do we remove these layers in compaction rather than during the split? Because we have existing split tenants that need cleaning up. We can add it to the split operation in future as an optimization.

## Problem Timelines cannot be deleted if they have children. In many production cases, a branch or a timeline has been created off the main branch for various reasons to the effect of having now a "new main" branch. This feature will make it possible to detach a timeline from its ancestor by inheriting all of the data before the branchpoint to the detached timeline and by also reparenting all of the ancestor's earlier branches to the detached timeline. ## Summary of changes - Earlier added copy_lsn_prefix functionality is used - RemoteTimelineClient learns to adopt layers by copying them from another timeline - LayerManager adds support for adding adopted layers - `timeline::Timeline::{prepare_to_detach,complete_detaching}_from_ancestor` and `timeline::detach_ancestor` are added - HTTP PUT handler Cc: #6994 Co-authored-by: Christian Schwarz <christian@neon.tech>

## Problem Ref https://neondb.slack.com/archives/C036U0GRMRB/p1688122168477729 ## Summary of changes - Add sha from postgres repo into postgres version string (via `--with-extra-version`) - Add a test that Postgres version matches the expected one - Remove build-time hard check and allow only related tests to fail

…7637) The `test_forward_compatibility` test runs the old production binaries, but is supposed to always run the latest neon_local binary. I think commit 6acbee2 broke that by accident because in that commit, `from_repo_dir` is introduced and runs an `init_start()` before the `test_forward_compatibility` gets a chance to patch up the neon_local_binpath.

RemoteTimelineClient has a lot of mandatory cloning. By using a single way of creating IndexPart out of UploadQueueInitialized we can simplify things and also avoid cloning the latest files for each `index_part.json` upload (the contents will still be cloned).

Preceding PR #7613 reduced the usage of `--pageserver-config-override`. This PR builds on top of that work and fully removes the `neon_local --pageserver-config-override`. Tests that need a non-default `pageserver.toml` control it using two options: 1. Specify `NeonEnvBuilder.pageserver_config_override` before `NeonEnvBuilder.init_start()`. This uses a new `neon_local init --pageserver-config` flag. 2. After `init_start()`: `env.pageserver.stop()` + `NeonPageserver.edit_config_toml()` + `env.pageserver.start()` A few test cases were using `env.pageserver.start(overrides=("--pageserver-config-override...",))`. I changed them to use one of the options above. Future Work ----------- The `neon_local init --pageserver-config` flag still uses `pageserver --config-override` under the hood. In the future, neon_local should just write the `pageserver.toml` directly. The `NeonEnvBuilder.pageserver_config_override` field should be renamed to `pageserver_initial_config`. Let's save this churn for a separate refactor commit.

This pull request adds the new basebackup read path + aux file write path. In the regression test, all logical replication tests are run with matrix aux_file_v2=false/true. Also fixed the vectored get code path to correctly return missing key error when being called from the unified sequential get code path. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem In #7531, we would like to be able to rewrite layers safely. One option is to make `Layer` able to rewrite files in place safely (e.g. by blocking evictions/deletions for an old Layer while a new one is created), but that's relatively fragile. It's more robust in general if we simply never overwrite the same local file: we can do that by putting the generation number in the filename. ## Summary of changes - Add `local_layer_path` (counterpart to `remote_layer_path`) and convert all locations that manually constructed a local layer path by joining LayerFileName to timeline path - In the layer upload path, construct remote paths with `remote_layer_path` rather than trying to build them out of a local path. - During startup, carry the full path to layer files through `init::reconcile`, and pass it into `Layer::for_resident` - Add a test to make sure we handle upgrades properly. - Comment out the generation part of `local_layer_path`, since we need to maintain forward compatibility for one release. A tiny followup PR will enable it afterwards. We could make this a bit simpler if we bulk renamed existing layers on startup instead of carrying literal paths through init, but that is operationally risky on existing servers with millions of layer files. We can always do a renaming change in future if it becomes annoying, but for the moment it's kind of nice to have a structure that enables us to change local path names again in future quite easily. We should rename `LayerFileName` to `LayerName` or somesuch, to make it more obvious that it's not a literal filename: this was already a bit confusing where that type is used in remote paths. That will be a followup, to avoid polluting this PR's diff.

## Problem Some HTTP client connections can stay open for quite a long time. ## Summary of changes When there are too many HTTP client connections, pick a random connection and gracefully cancel it.

A couple lines moved further down in main(), and one case of using Option<&str> instead of Option<&String>.

This commit is intentionally designed to have as small a diff as possible. To that end, the basic idea is that each distinct "chunk" of the previous main() has been wrapped in its own function, with the return values from each function being passed directly into the next. The structure of main() is now visible from its contents, which have a handful of smaller functions. There's a lot of other work that can / should(?) be done beyond this, but I figure that's more opinionated, and this should be a solid start. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

Part of applying the changes from #7600. This piece *technically* can change the semantics because now the context guard is held before process_cli, but... the difference is likely quite small. Co-authored-by: Heikki Linnakangas <heikki@neon.tech>

Introduced by refactorings from #7577. See an example check-macos-build failure here: https://github.com/neondatabase/neon/actions/runs/8992211409/job/24701531264

These log lines were repeated, and `test_detached_receives_flushes_while_being_detached` had an incomplete definition. Example failure: https://neon-github-public-dev.s3.amazonaws.com/reports/pr-7531/8989511410/index.html#suites/a1c2be32556270764423c495fad75d47/992897d3a3369210

…-init --config-override` (#7638) This does to `neon_local` what neondatabase/aws#1322 does to our production deployment. After both are merged, there are no users of `pageserver --init` / `pageserver --config-override` left, and we can remove those flags eventually.

## Problem If a permit cannot be acquired to connect to compute, the cache is invalidated. This had the observed affect of sending more traffic to ProxyWakeCompute on cplane. ## Summary of changes Make sure that permit acquire failures are marked as "should not invalidate cache".

…7679) This reverts commit 1173ee6. ## Problem It breaks autoscaling tests

…ts/csharp/npgsql (#7680)

## Problem #7637 breaks forward compat test. On commit ea531d4. https://neon-github-public-dev.s3.amazonaws.com/reports/main/8988324349/index.html ``` test_create_snapshot 2024-05-07T16:03:11.331883Z INFO version: git-env:ea531d448eb65c4f58abb9ef7d8cd461952f7c5f failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 16:03:11.316131763 UTC build_tag: build_tag-env:5159 test_forward_compatibility 2024-05-07T16:07:02.310769Z INFO version: git-env:ea531d448eb65c4f58abb9ef7d8cd461952f7c5f failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 16:07:02.294676183 UTC build_tag: build_tag-env:5159 ``` The forward compatibility test is actually using the same tag as the current build. The commit before that, https://neon-github-public-dev.s3.amazonaws.com/reports/main/8988126011/index.html ``` test_create_snapshot 2024-05-07T15:47:21.900796Z INFO version: git-env:2dbd1c1ed5cd0458933e8ffd40a9c0a5f4d610b8 failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 15:47:21.882784185 UTC build_tag: build_tag-env:5158 test_forward_compatibility 2024-05-07T15:50:48.828733Z INFO version: git-env:c4d7d5982553d2cf66634d1fbf85d95ef44a6524 failpoints: true, features: ["testing"] launch_timestamp: 2024-05-07 15:50:48.816635176 UTC build_tag: build_tag-env:release-5434 ``` This pull request patches the bin path so that the new neon_local will use the old binary. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem There is no global per-ep rate limiter in proxy. ## Summary of changes * Return global per-ep rate limiter back. * Rename weak compute rate limiter (the cli flags were not used anywhere, so it's safe to rename).

## Problem This caused a variation of the stats bug fixed by #7662. That PR also fixed this case, but we still shouldn't make redundant get calls. ## Summary of changes - Only call get in the create image layers loop at the end of a range if some keys have been accumulated

## Problem Move from aws based arm64 runners to bare-metal based ## Summary of changes Changes in GitHub action workflows where `runs-on: arm64` used. More parallelism added, build time for `neon with extra platform builds` workflow reduced from 45m to 25m

We didn't check permission in `"/v1/failpoints"` endpoint, it means that everyone with per-tenant token could modify the failpoints. This commit fixes that.

We had accidentally left two endpoints for `tenant`: `/synthetic_size` and `/size`. Size had the more extensive description but has returned 404 since renaming. Remove the `/size` in favor of the working one and describe the `text/html` output.

in addition to layer names, expand the input vocabulary to recognize lines in the form of: ${kind}:${lsn} where: - kind in `gc_cutoff` or `branch` - lsn is accepted in Lsn display format (x/y) or hex (as used in layer names) gc_cutoff and branch have different colors.

While switching to use nextest with the repository in f28bdb6, we had not noticed that it doesn't yet support running doctests. Run the doc tests before other tests.

@hlinnaka

The old test based on the immutable `target_file_size` that was a parameter to the function. It makes no sense to go further once `current_level_target_height` has reached `u64::MAX`, as lsn's are u64 typed. In practice, we should only run into this if there is a bug, as the practical lsn range usually ends much earlier. Testing on `target_file_size` makes less sense, it basically implements an invocation mode that turns off the looping and only runs one iteration of it. @hlinnaka agrees that `current_level_target_height` is better here. Part of #7554

Split checkpoint_stats into two separate metrics: checkpoints_req and checkpoints_timed Fixes commit 21e1a49 --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>

The first implementation #7456 did not include `index_part.json` changes in an attempt to keep amount of changes down. Tracks the historic reparentings and earlier detach in `index_part.json`. - `index_part.json` receives a new field `lineage: Lineage` - `Lineage` is queried through RemoteTimelineClient during basebackup, creating `PREV LSN: none` for the invalid prev record lsn just as it would had been created for a newly created timeline - as `struct IndexPart` grew, it is now boxed in places Cc: #6994

github-actions · 2024-05-13T06:49:44Z

3060 tests run: 2928 passed, 0 failed, 132 skipped (full report)

Flaky tests (4)

Postgres 15

test_partial_evict_tenant[relative_equal]: release
test_partial_evict_tenant[relative_spare]: release

Postgres 14

test_gc_aggressive: debug
test_synthetic_size_while_deleting: release

Code coverage* (full report)

functions: 31.4% (6329 of 20161 functions)
lines: 47.3% (47711 of 100932 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
6351313 at 2024-05-13T06:49:43.624Z :recycle:}

arssher and others added 30 commits May 6, 2024 11:08

Add option allowing running multiple endpoints on the same branch.

e6da7e2

This is used by safekeeper tests.

Make 'neon.protocol_version = 2' the default (#7616)

0115fe6

Once all the computes in production have restarted, we can remove protocol version 1 altogether. See issue #6211.

Fix grammar in provider.rs error message

69337be

s/temporary/temporarily --------- Co-authored-by: Barry Grenon <barry_grenon@yahoo.ca>

refactor(pageserver): remove --update-init flag (#7612)

df1def7

We don't actually use it. refs #7555

Fix test_ts_of_lsn_api flakiness (#7599)

f3af5f4

Changes parameters to fix the flakiness of `test_ts_of_lsn_api`. Already now, the amount of flakiness of the test is pretty low. With this, it's even lower. cc #5768

use less neon_local --pageserver-config-override / pageserver -c (#…

ac7dc82

…7613)

build(deps): bump werkzeug from 3.0.1 to 3.0.3 (#7625)

6e4e578

build(deps): bump jinja2 from 3.1.3 to 3.1.4 (#7626)

5a3d8e7

build(deps): bump flask-cors from 3.0.10 to 4.0.1 (#7633)

2dbd1c1

proxy: cull http connections (#7632)

0c99e5e

## Problem Some HTTP client connections can stay open for quite a long time. ## Summary of changes When there are too many HTTP client connections, pick a random connection and gracefully cancel it.

compute_ctl: Non-functional prep changes to reduce diff (#7577)

b158a5e

A couple lines moved further down in main(), and one case of using Option<&str> instead of Option<&String>.

compute_ctl: Fix unused variable on non-Linux (#7646)

b827e7b

Introduced by refactorings from #7577. See an example check-macos-build failure here: https://github.com/neondatabase/neon/actions/runs/8992211409/job/24701531264

skyzh and others added 13 commits May 9, 2024 15:15

Revert "chore(neon_test_utils): restrict installation to superuser" (#…

2682e02

…7679) This reverts commit 1173ee6. ## Problem It breaks autoscaling tests

build(deps): bump Npgsql from 8.0.2 to 8.0.3 in /test_runner/pg_clien…

5ea117c

…ts/csharp/npgsql (#7680)

Proxy added per ep rate limiter (#7636)

be1a88e

## Problem There is no global per-ep rate limiter in proxy. ## Summary of changes * Return global per-ep rate limiter back. * Rename weak compute rate limiter (the cli flags were not used anywhere, so it's safe to rename).

Fix permissions for safekeeper failpoints (#7669)

0b02043

We didn't check permission in `"/v1/failpoints"` endpoint, it means that everyone with per-tenant token could modify the failpoints. This commit fixes that.

build: run doctests (#7697)

6206f76

While switching to use nextest with the repository in f28bdb6, we had not noticed that it doesn't yet support running doctests. Run the doc tests before other tests.

Fix checkpoint metric (#7701)

95098c3

Split checkpoint_stats into two separate metrics: checkpoints_req and checkpoints_timed Fixes commit 21e1a49 --------- Co-authored-by: Peter Bendel <peterbendel@neon.tech>

vipvap requested review from a team as code owners May 13, 2024 06:04

vipvap requested review from save-buffer, arssher, conradludgate and mattpodraza and removed request for a team May 13, 2024 06:04

VladLazar mentioned this pull request May 13, 2024

Handmade Release 2024-05-13 #7735

Merged

VladLazar merged commit 6351313 into release May 13, 2024
157 of 159 checks passed

VladLazar deleted the rc/2024-05-13 branch May 13, 2024 15:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release 2024-05-13 #7713

Release 2024-05-13 #7713

vipvap commented May 13, 2024

github-actions bot commented May 13, 2024

Postgres 15

Postgres 14

Release 2024-05-13 #7713

Release 2024-05-13 #7713

Conversation

vipvap commented May 13, 2024