Proxy release 2024-03-27 #7254

vipvap · 2024-03-27T10:29:44Z

Proxy release 2024-03-27

Please merge this Pull Request using 'Create a merge commit' button

## Problem

As with the pageserver, we should fail tests that emit unexpected log errors/warnings.

## Summary of changes

- Refactor existing log checks to be reusable
- Run log checks for attachment_service
- Add allow lists as needed.

…cified config overrides (#7166)

e2e tests cannot run on macOS unless the file engine env var is supplied.

```
./scripts/pytest test_runner/regress/test_neon_superuser.py -s
```

will fail with tokio-epoll-uring not supported. This is because we persist the file engine config by default. In this pull request, we only persist when someone specifies it, so that it can use the default platform-variant config in the page server.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

Useful for other code paths which will handle zstd compression and decompression.

This is a mixed bag of changes split out for separate review while working on other things, and batched together to reduce load on CI runners.

Each commit stands alone for review purposes:

- do_tenant_shard_split was a long function and had a synchronous validation phase at the start that could readily be pulled out into a separate function. This also avoids the special casing of ApiError::BadRequest when deciding whether an abort is needed on errors
- Add a 'describe' API (GET on tenant ID) that will enable storcon-cli to see what's going on with a tenant
- the 'locate' API wasn't really meant for use in the field. It's for tests: demote it to the /debug/ prefix
- The `Single` placement policy was a redundant duplicate of Double(0), and Double was a bad name. Rename it Attached. (#7107)
- Some neon_local commands were added for debug/demos, which are now replaced by commands in storcon-cli (#7114). Even though that's not merged yet, we don't need the neon_local ones any more.

Closes #7107

## Backward compat of Single/Double -> `Attached(n)` change

A database migration is used to convert any existing values.

Warm-up (and the "tenant startup complete" metric update) happens in a background tokio task. The tenant map is eagerly updated (can happen before the task finishes). The test assumed that if the tenant map was updated, then the metric should reflect that. That's not the case, so we tweak the test to wait for the metric. Fixes #7158
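The "wait for the metric" tweak amounts to polling until a condition holds instead of asserting it right after the tenant map changes. A minimal sketch of such a polling helper follows; `wait_for` and its parameters are hypothetical names for illustration, not the test runner's actual helper.

```python
import time


def wait_for(predicate, timeout_s=10.0, interval_s=0.1,
             sleep=time.sleep, clock=time.monotonic):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Hypothetical helper: `sleep` and `clock` are injectable so the
    function can be exercised without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return
        if clock() >= deadline:
            raise TimeoutError("condition not met within timeout")
        sleep(interval_s)
```

In a test, the predicate would read the startup-complete metric and compare it against the expected tenant count, so the assertion tolerates the background task finishing late.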

## Problem faster sha2 hashing. ## Summary of changes enable asm feature for sha2. this feature will be default in sha2 0.11, so we might as well lean into it now. It provides a noticeable speed boost on macos aarch64. Haven't tested on x86 though

Since #6115 with more often used get_value_reconstruct_data and friends, we should not have needless INFO level span creation near hot paths. In our prod configuration, INFO spans are always created, but in practice, very rarely anything at INFO level is logged underneath. `ResidentLayer::load_keys` is only used during compaction so it is not that hot, but this aligns the access paths and their span usage. PR changes the span level to debug to align with others, and adds the layer name to the error which was missing. Split off from #7030.

The second part of work towards fixing `Layer::keep_resident` so that it does not need to repair the internal state. #7135 added a nicer API for initialization. This PR uses it to remove a few indentation levels and the loop construction. The next PR #7175 will use the refactorings done in this PR, and always initialize the internal state after a download. Cc: #5331

- Enable debug logs for this test - Add some debug logging detail in downloader.rs - Add an info-level message in scheduler.rs that makes it obvious if a command is waiting for an existing task rather than spawning a new one.

Before this PR, `LayerInner::get_or_maybe_download` could be canceled after the layer file had been downloaded to the filesystem but before the internal `LayerInner::inner` was set or the state initialized. With the detached init support introduced in #7135 and in place in #7152, we can now initialize the internal state after successfully downloading in the spawned task.

The next PR will fix the remaining problems that this PR leaves:

- `Layer::keep_resident` is still used because
  - `Layer::get_or_maybe_download` always cancels an eviction, even when canceled

Split off from #7030. Stacked on top of #7152. Cc: #5331.

The layer map json is an interesting file for that test, so dump it to make debugging easier.

`pgxn/` also contains WAL proposer code, so modifications to this directory should be able to be approved by the safekeeper team. Signed-off-by: Alex Chi Z <chi@neon.tech>

errno is not preserved in the signal handler. This pull request fixes it. Maybe related: #6969, but does not fix the flaky test problem. Signed-off-by: Alex Chi Z <chi@neon.tech>

Models a compute's lifetime.

Small fix to remove confusing `mut` bindings. Builds upon #7175, split off from #7030. Cc: #5331.

## Problem

The current implementation of struct Layer supports canceled read requests, but those will leave the internal state such that a following `Layer::keep_resident` call will need to repair the state. In pathological cases seen during generation numbers resetting in staging or with too many in-progress on-demand downloads, this repair activity will need to wait for the download to complete, which stalls disk usage-based eviction. Similar stalls have been observed in staging near disk-full situations, where downloads failed because the disk was full.

Fixes #6028 or the "layer is present on filesystem but not evictable" problems by:

1. not canceling pending evictions by a canceled `LayerInner::get_or_maybe_download`
2. completing post-download initialization of the `LayerInner::inner` from the download task

Not canceling evictions (case 1) and always initializing (case 2) lead to plain `LayerInner::inner` always having the up-to-date information, which leads to the old `Layer::keep_resident` never having to wait for downloads to complete. Finally, the `Layer::keep_resident` is replaced with `Layer::is_likely_resident`. These fix #7145.

## Summary of changes

- add a new test showing that a canceled get_or_maybe_download should not cancel the eviction
- switch to using a `watch` internally rather than a `broadcast` to avoid hanging eviction while a download is ongoing
- doc changes for new semantics and cleanup
- fix `Layer::keep_resident` to use just `self.0.inner.get()` as truth as `Layer::is_likely_resident`
- remove `LayerInner::wanted_evicted` boolean as no longer needed

Builds upon: #7185. Cc: #5331.

## Summary of changes Enforce LSN ordering of batch entries. Closes #6707

## Problem spawn_blocking in #7171 was a hack ## Summary of changes neondatabase/rust-postgres#29

## Problem

Storage controller had basically no metrics.

## Summary of changes

1. Migrate the existing metrics to use Conrad's [`measured`](https://docs.rs/measured/0.0.14/measured/) crate.
2. Add metrics for incoming http requests
3. Add metrics for outgoing http requests to the pageserver
4. Add metrics for outgoing pass through requests to the pageserver
5. Add metrics for database queries

Note that the metrics response for the attachment service does not use chunked encoding like the rest of the metrics endpoints. Conrad has kindly extended the crate such that it can now be done. Let's leave it for a follow-up since the payload shouldn't be that big at this point.

Fixes #6875

Stacks on:

- #7165

Fixes while working on background optimization of scheduling after a split:

- When a tenant has secondary locations, we weren't detaching the parent shards' secondary locations when doing a split
- When a reconciler detaches a location, it was feeding back a locationconf with `Detached` mode in its `observed` object, whereas it should omit that location. This could cause the background reconcile task to keep kicking off no-op reconcilers forever (harmless but annoying).
- During shard split, we were scheduling secondary locations for the child shards, but no reconcile was run for these until the next time the background reconcile task ran. Creating these ASAP is useful, because they'll be used shortly after a shard split as the destination locations for migrating the new shards to different nodes.

## Problem If a shutdown happens when a tenant is attaching, we were logging at ERROR severity and with a backtrace. Yuck. ## Summary of changes - Pass a flag into `make_broken` to enable quietening this non-scary case.

This change improves the resilience of the system to unclean restarts.

Previously, re-attach responses only included attached tenants

- If the pageserver had local state for a secondary location, it would remain, but with no guarantee that it was still _meant_ to be there. After this change, the pageserver will only retain secondary locations if the /re-attach response indicates that they should still be there.
- If the pageserver had local state for an attached location that was omitted from a re-attach response, it would be entirely detached. This is wasteful in a typical HA setup, where an offline node's tenants might have been re-attached elsewhere before it restarts, but the offline node's location should revert to a secondary location rather than being wiped. Including secondary tenants in the re-attach response enables the pageserver to avoid throwing away local state unnecessarily.

In this PR:

- The re-attach items are extended with a 'mode' field.
- Storage controller populates 'mode'
- Pageserver interprets it (default is attached if missing) to construct either a SecondaryTenant or a Tenant.
- A new test exercises both cases.
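To make the 'mode' handling concrete, here is a rough sketch of how a node could decide what to do with its local tenant state from a re-attach response. The item shape (`id`, `mode`) and the mode strings are assumptions for illustration; only the "missing mode means attached" default comes from the description above.

```python
def plan_local_locations(local_tenants, reattach_items):
    """Map each locally present tenant ID to an action.

    Assumed (hypothetical) item shape: {"id": ..., "mode": ...}.
    A missing "mode" is treated as "attached", matching the
    backward-compatible default described in the PR text.
    """
    modes = {item["id"]: item.get("mode", "attached") for item in reattach_items}
    plan = {}
    for tenant_id in local_tenants:
        mode = modes.get(tenant_id)
        # Tenants the response does not mention lose their local state.
        plan[tenant_id] = "drop" if mode is None else mode
    return plan
```

For example, a response listing `t1` without a mode and `t2` as secondary keeps `t1` attached, keeps `t2` as a secondary location, and drops any other local tenant.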

## Problem

for HTTP/WS/password hack flows we imitate SCRAM to validate passwords. This code was unnecessarily complicated.

## Summary of changes

Copy in the `pbkdf2` and 'derive keys' steps from the `postgres_protocol` crate in our `rust-postgres` fork. Derive the `client_key`, `server_key` and `stored_key` from the password directly. Use constant time equality to compare the `stored_key` and `server_key` with the ones we are sent from cplane.
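For readers unfamiliar with the derivation, the steps can be sketched with the Python standard library as below. This follows the SCRAM-SHA-256 key derivation from RFC 5802 and is not the proxy's Rust code; the function names are made up for illustration and password normalization (SASLprep) is left out.

```python
import hashlib
import hmac


def derive_scram_keys(password: bytes, salt: bytes, iterations: int):
    """Return (client_key, stored_key, server_key) per RFC 5802 with SHA-256."""
    # SaltedPassword is PBKDF2 with HMAC-SHA-256 over the password and salt.
    salted = hashlib.pbkdf2_hmac("sha256", password, salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hashlib.sha256).digest()
    stored_key = hashlib.sha256(client_key).digest()
    server_key = hmac.new(salted, b"Server Key", hashlib.sha256).digest()
    return client_key, stored_key, server_key


def password_matches(password, salt, iterations,
                     expected_stored_key, expected_server_key) -> bool:
    """Compare derived keys against the expected ones in constant time."""
    _, stored_key, server_key = derive_scram_keys(password, salt, iterations)
    # Run both comparisons before combining so neither is short-circuited.
    stored_ok = hmac.compare_digest(stored_key, expected_stored_key)
    server_ok = hmac.compare_digest(server_key, expected_server_key)
    return stored_ok and server_ok
```

Deriving the keys directly from the password avoids simulating a full SCRAM exchange when the cleartext password is already in hand.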

See the updated `bench_walredo.rs` module comment. tl;dr: we measure avg latency of single redo operations issued against a single redo manager from N tokio tasks. part of #6628

Release notes: https://blog.rust-lang.org/2024/03/21/Rust-1.77.0.html Thanks to #6886 the diff is reasonable, only for one new lint `clippy::suspicious_open_options`. I added `truncate()` calls to the places where it is obviously the right choice to me, and added allows everywhere else, leaving it for followups. I had to specify cargo install --locked because the build would fail otherwise. This was also recommended by upstream.

## Problem

Support of IAM Roles for Service Accounts for authentication.

## Summary of changes

* Obtain aws 15m-long credentials
* Retrieve redis password from credentials
* Update every 1h to keep connection for more than 12h
* For now, allow different endpoints for pubsub/stream redis.

TODOs:

* PubSub doesn't support credentials refresh, consider using stream instead.
* We need an AWS role for proxy to be able to connect to both: S3 and elasticache.

Credentials obtaining and connection refresh were tested on xenon preview.

neondatabase/cloud#10365

A test was added which exercises secondary locations more, and there was a location in the secondary downloader that warned on ephemeral files. This was intended to be fixed in this faulty commit: 8cea866

## Problem I noticed code coverage for auth_quirks was pretty bare ## Summary of changes Adds 3 happy path unit tests for auth_quirks * scram * cleartext (websockets) * cleartext (password hack)

## Problem We want to deploy releases to a preprod region first to perform required checks ## Summary of changes - Deploy `release-XXX` / `release-proxy-YYY` docker tags to a preprod region

## Problem

The service that receives consumption metrics has lower availability than S3. Writing metrics to S3 improves their availability.

Closes: neondatabase/cloud#9824

## Summary of changes

- The same data as consumption metrics POST bodies is also compressed and written to an S3 object with a timestamp-formatted path.
- Set `metric_collection_bucket` (same format as `remote_storage` config) to configure the location to write to

## Problem

We currently hold the layer map read lock while doing IO on the read path. This is not required for correctness.

## Summary of changes

Drop the layer map lock after figuring out which layer we wish to read from.

Why is this correct:

* `Layer` models the lifecycle of an on disk layer. In the event the layer is removed from local disk, it will be on demand downloaded
* `InMemoryLayer` holds the `EphemeralFile` which wraps the on disk file. As long as the `InMemoryLayer` is in scope, it's safe to read from it.

Related #6833

## Problem `test_bulk_insert` becomes too slow, and it fails constantly: #7124 ## Summary of changes - Skip `test_bulk_insert` until it's fixed

## Problem

Currently, we return 409 (Conflict) in two cases:

- Temporary: Timeline creation cannot proceed because another timeline with the same ID is being created
- Permanent: Timeline creation cannot proceed because another timeline exists with different parameters but the same ID.

Callers which time out a request and retry should be able to distinguish these cases.

Closes: #7208

## Summary of changes

- Expose `AlreadyCreating` errors as 429 instead of 409
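From a caller's point of view, the split lets retry logic branch on the status code alone. A small sketch of that decision follows; the enum and function names are hypothetical, and treating any 2xx as success is an assumption.

```python
from enum import Enum


class CreateOutcome(Enum):
    DONE = "done"          # timeline exists with the requested parameters
    RETRY = "retry"        # another creation is in flight, try again later
    CONFLICT = "conflict"  # same ID, different parameters: retrying won't help
    ERROR = "error"        # anything else, left to the caller's policy


def classify_timeline_create(status: int) -> CreateOutcome:
    """Classify an HTTP status from a timeline creation request."""
    if 200 <= status < 300:
        return CreateOutcome.DONE
    if status == 429:
        return CreateOutcome.RETRY
    if status == 409:
        return CreateOutcome.CONFLICT
    return CreateOutcome.ERROR
```

With the old behavior both of the middle cases arrived as 409, so a caller that timed out and retried could not tell a transient race from a permanent mismatch.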

## Problem

Follows: #7182

- Sufficient concurrent writes could OOM a pageserver from the size of indices on all the InMemoryLayer instances.
- Enforcement of checkpoint_period only happened if there were some writes.

Closes: #6916

## Summary of changes

- Add `ephemeral_bytes_per_memory_kb` config property. This controls the ratio of ephemeral layer capacity to memory capacity. The weird unit is to enable making the ratio less than 1:1 (set this property to 1024 to use 1MB of ephemeral layers for every 1MB of RAM, set it smaller to get a fraction).
- Implement background layer rolling checks in Timeline::compaction_iteration -- this ensures we apply layer rolling policy in the absence of writes.
- During background checks, if the total ephemeral layer size has exceeded the limit, then roll layers whose size is greater than the mean size of all ephemeral layers.
- Remove the tick() path from walreceiver: it isn't needed any more now that we do equivalent checks from compaction_iteration.
- Add tests for the above.

---------

Co-authored-by: Arpad Müller <arpad-m@users.noreply.github.com>
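The unit arithmetic and the "roll layers above the mean" rule can be illustrated with a short sketch. This only models the policy as described above; the function names and the dictionary of layer sizes are hypothetical, not the pageserver's data structures.

```python
from __future__ import annotations


def ephemeral_limit_bytes(total_memory_kb: int,
                          ephemeral_bytes_per_memory_kb: int) -> int:
    """Ephemeral layer budget: bytes allowed per KB of RAM, times KB of RAM."""
    return total_memory_kb * ephemeral_bytes_per_memory_kb


def layers_to_roll(layer_sizes: dict[str, int], limit_bytes: int) -> list[str]:
    """Pick layers to roll once the total ephemeral size exceeds the limit.

    Only layers strictly larger than the mean size are selected.
    """
    total = sum(layer_sizes.values())
    if not layer_sizes or total <= limit_bytes:
        return []
    mean = total / len(layer_sizes)
    return sorted(name for name, size in layer_sizes.items() if size > mean)
```

With 1024 KB of RAM and the property set to 1024, the budget is 1024 * 1024 bytes, i.e. the 1:1 ratio mentioned above; setting it to 512 would halve the budget.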

## Problem

- Creations were not idempotent (unique key violation)
- Creations waited for reconciliation, which control plane blocks while an operation is in flight

## Summary of changes

- Handle unique key constraint violation as an OK situation: if we're creating the same tenant ID and shard count, it's reasonable to assume this is a duplicate creation.
- Make the wait for reconcile during creation tolerate failures: this is similar to location_conf, where the cloud control plane blocks our notification calls until it is done with calling into our API (in future this constraint is expected to relax as the cloud control plane learns to run multiple operations concurrently for a tenant)
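The "unique key violation is OK" idea can be shown in miniature with an in-memory SQLite database. This is only an illustration of the pattern: SQLite stands in for the real database, and the table and column names are invented for the example.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical shard table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tenant_shards ("
        " tenant_id TEXT, shard_number INTEGER, shard_count INTEGER,"
        " PRIMARY KEY (tenant_id, shard_number, shard_count))"
    )
    return conn


def create_tenant_shards(conn, tenant_id: str, shard_count: int) -> bool:
    """Insert all shards for a tenant; return False if they already exist."""
    rows = [(tenant_id, n, shard_count) for n in range(shard_count)]
    try:
        with conn:  # commits on success, rolls back on exception
            conn.executemany(
                "INSERT INTO tenant_shards VALUES (?, ?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        # Same tenant ID and shard count: treat as a duplicate creation.
        return False
```

Calling `create_tenant_shards` twice with the same arguments succeeds both times from the caller's perspective, with the second call reporting that nothing new was inserted.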

## Problem

#7227 destabilized various tests in the performance suite, with log errors during shutdown. It's because we switched shutdown order to stop the storage controller before the pageservers.

## Summary of changes

- Tolerate "connection failed" errors from pageservers trying to validate their deletion queue.

## Problem

This is a refactor.

This PR was a precursor to a much smaller change e5bd602, where, as I was writing it, I found that we were not far from getting rid of the last non-deprecated code paths that use `mgr::` scoped functions to get at the TenantManager state.

We're almost done cleaning this up as per #5796. The only significant remaining mgr:: item is `get_active_tenant_with_timeout`, which is page_service's path for fetching tenants.

## Summary of changes

- Remove the bool argument to get_attached_tenant_shard: this was almost always false from API use cases, and in cases when it was true, it was readily replaceable with an explicit check of the returned tenant's status.
- Rather than letting the timeline eviction task query any tenant it likes via `mgr::`, pass an `Arc<Tenant>` into the task. This is still an ugly circular reference, but should eventually go away: either when we switch to exclusively using disk usage eviction, or when we change metadata storage to avoid the need to imitate layer accesses.
- Convert all the mgr::get_tenant call sites to use TenantManager::get_attached_tenant_shard
- Move list_tenants into TenantManager.

## Problem

neondatabase/cloud#9642

## Summary of changes

1. Make `EndpointRateLimiter` generic, renamed as `BucketRateLimiter`
2. Add support for claiming multiple tokens at once
3. Add `AuthRateLimiter` alias.
4. Check `(Endpoint, IP)` pair during authentication, weighted by how many hashes proxy would be doing.

TODO: handle ipv6 subnets. will do this in a separate PR.
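As a rough illustration of a keyed limiter that can claim several tokens in one call, here is a token-bucket sketch. It is not the proxy's `BucketRateLimiter` algorithm; the class name, parameters, and the explicit `now` argument are assumptions chosen to keep the example deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Tuple


@dataclass
class TokenBucketLimiter:
    """Per-key token bucket; keys can be any hashable, e.g. (endpoint, ip)."""
    capacity: float
    refill_per_sec: float
    # key -> (tokens remaining, timestamp of last update)
    _state: Dict[Hashable, Tuple[float, float]] = field(default_factory=dict)

    def check(self, key: Hashable, n: float, now: float) -> bool:
        """Try to claim `n` tokens for `key` at time `now` (seconds)."""
        tokens, last = self._state.get(key, (self.capacity, now))
        # Refill for the elapsed time, never exceeding capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if n > tokens:
            self._state[key] = (tokens, now)
            return False
        self._state[key] = (tokens - n, now)
        return True
```

Passing a larger `n` for a more expensive authentication attempt is how a "weighted by how many hashes" check could be expressed with this shape of API.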

github-actions · 2024-03-27T11:14:58Z

2730 tests run: 2590 passed, 0 failed, 140 skipped (full report)

Flaky tests (1)

Postgres 15

test_empty_branch_remote_storage_upload: release

Code coverage* (full report)

functions: 28.2% (6307 of 22367 functions)
lines: 47.0% (44289 of 94291 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
12512f3 at 2024-03-27T11:14:57.715Z :recycle:}

jcsp and others added 30 commits March 19, 2024 10:30

Move functions for creating/extracting tarballs into utils

64c6dfd

Useful for other code paths which will handle zstd compression and decompression.

proxy: enable sha2 asm support (#7184)

6d99642

## Problem faster sha2 hashing. ## Summary of changes enable asm feature for sha2. this feature will be default in sha2 0.11, so we might as well lean into it now. It provides a noticeable speed boost on macos aarch64. Haven't tested on x86 though

Dump layer map json in test_gc_feedback.py (#7179)

34fa34d

The layer map json is an interesting file for that test, so dump it to make debugging easier.

fix: add safekeeper team to pgxn codeowners (#7170)

5f0d9f2

`pgxn/` also contains WAL proposer code, so modifications to this directory should be able to be approved by the safekeeper team. Signed-off-by: Alex Chi Z <chi@neon.tech>

safekeeper: correctly handle signals (#7167)

55c4ef4

errno is not preserved in the signal handler. This pull request fixes it. Maybe related: #6969, but does not fix the flaky test problem. Signed-off-by: Alex Chi Z <chi@neon.tech>

Add state diagram for compute

041b653

Models a compute's lifetime.

fix(heavier_once_cell): take_and_deinit should take ownership (#7185)

a95c41f

Small fix to remove confusing `mut` bindings. Builds upon #7175, split off from #7030. Cc: #5331.

Enforce LSN ordering of batch entries (#7071)

94138c1

## Summary of changes Enforce LSN ordering of batch entries. Closes #6707

proxy: async aware password validation (#7176)

5ec6862

## Problem spawn_blocking in #7171 was a hack ## Summary of changes neondatabase/rust-postgres#29

pageserver: quieten log on shutdown-while-attaching (#7177)

bb47d53

## Problem If a shutdown happens when a tenant is attaching, we were logging at ERROR severity and with a backtrace. Yuck. ## Summary of changes - Pass a flag into `make_broken` to enable quietening this non-scary case.

walredo benchmark: throughput-oriented rewrite (#7190)

fb60278

See the updated `bench_walredo.rs` module comment. tl;dr: we measure avg latency of single redo operations issued against a single redo manager from N tokio tasks. part of #6628

Fix ephemeral file warning on secondaries (#7201)

62b318c

A test was added which exercises secondary locations more, and there was a location in the secondary downloader that warned on ephemeral files. This was intended to be fixed in this faulty commit: 8cea866

proxy: unit tests for auth_quirks (#7199)

77f3a30

## Problem I noticed code coverage for auth_quirks was pretty bare ## Summary of changes Adds 3 happy path unit tests for auth_quirks * scram * cleartext (websockets) * cleartext (password hack)

CI: deploy release version to a preprod region (#6811)

2668a1d

## Problem We want to deploy releases to a preprod region first to perform required checks ## Summary of changes - Deploy `release-XXX` / `release-proxy-YYY` docker tags to a preprod region

VladLazar and others added 8 commits March 26, 2024 14:35

test_runner/performance: skip test_bulk_insert (#7238)

3426619

## Problem `test_bulk_insert` becomes too slow, and it fails constantly: #7124 ## Summary of changes - Skip `test_bulk_insert` until it's fixed

vipvap requested review from a team as code owners March 27, 2024 10:29

vipvap requested review from knizhnik, arssher, conradludgate, jcsp, Omrigan and nikitakalyanov and removed request for a team March 27, 2024 10:29

conradludgate approved these changes Mar 27, 2024

khanova merged commit 2a88889 into release-proxy Mar 27, 2024
108 checks passed

khanova deleted the rc/proxy/2024-03-27 branch March 27, 2024 10:44
