Release 2024-05-06 #7615

vipvap · 2024-05-06T06:04:27Z

Release 2024-05-06

Please merge this Pull Request using 'Create a merge commit' button

## Problem

Sometimes we have test data in the form of S3 contents that we would like to run live in a neon_local environment.

## Summary of changes

- Add a storage controller API that imports an existing tenant. Currently this is equivalent to doing a create with a high generation number, but in future this would be something smarter to probe S3 to find the shards in a tenant and find generation numbers.
- Add a `neon_local` command that invokes the import API, and then inspects timelines in the newly attached tenant to create matching branches.

…7495)

## Problem

Previously, we tried to send compute notifications in startup_reconcile before completing that function, with a time limit. Any notifications that don't happen within the time limit result in tenants having their `pending_compute_notification` flag set, which causes them to spawn a Reconciler next time the background reconciler loop runs.

This causes two problems:

- Spawning a lot of reconcilers after startup caused a spike in memory (this is addressed in #7493)
- After #7493, spawning lots of reconcilers will block some other operations, e.g. a tenant creation might fail due to lack of reconciler semaphore units while the controller is busy running all the Reconcilers for its startup compute notifications.

When the code was first written, ComputeHook didn't have internal ordering logic to ensure that notifications for a shard were sent in the right order. Since that was added in #7088, we can use it to avoid waiting for notifications to complete in startup_reconcile.

Related to: #7460

## Summary of changes

- Add a `notify_background` method to ComputeHook.
- Call this from startup_reconcile instead of doing notifications inline
- Process completions from `notify_background` in `process_results`, and if a notification failed then set the `pending_compute_notification` flag on the shard.

The result is that we will only spawn lots of Reconcilers if the compute notifications _fail_, not just because they take some significant amount of time.

Test coverage for this case is in #7475

## Problem

Alerts fire if the connection to the compute is slow.

## Summary of changes

Exclude compute and retry from latencies.

## Problem

Downloading tenant data for analysis/debug with `aws s3 cp` works well for small tenants, but for larger tenants it is unlikely that one ends up with an index that matches layer files, due to the time taken to download.

## Summary of changes

- Add a `tenant-snapshot` command to the scrubber, which reads timeline indices and then downloads the layers referenced in the index, even if they were deleted.

The result is a snapshot of the tenant's remote storage state that should be usable when imported (#7399).

## Problem

Right now we always retry wake compute.

## Summary of changes

Create a list of errors for which we can avoid needless retries.
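A minimal sketch of the idea, assuming a hypothetical `WakeComputeError` enum (the variant names and the `retry_with` helper are invented for illustration and are not the proxy's real types): errors are classified up front, and the retry loop stops as soon as it sees one that retrying cannot fix.

```rust
/// Hypothetical error kinds for a wake-compute call (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WakeComputeError {
    /// Transient network failure: worth retrying.
    Transport,
    /// The remote side is temporarily overloaded: worth retrying.
    TooManyRequests,
    /// The endpoint does not exist: retrying cannot help.
    EndpointNotFound,
    /// A quota is exhausted: retrying cannot help.
    QuotaExceeded,
}

/// The "list of errors": which kinds are worth another attempt.
pub fn should_retry(err: WakeComputeError) -> bool {
    match err {
        WakeComputeError::Transport | WakeComputeError::TooManyRequests => true,
        WakeComputeError::EndpointNotFound | WakeComputeError::QuotaExceeded => false,
    }
}

/// Calls `op` with the attempt number (starting at 1) until it succeeds,
/// fails with a non-retryable error, or `max_attempts` is reached.
/// Returns the attempt number that succeeded, or the last error.
pub fn retry_with<F>(max_attempts: u32, mut op: F) -> Result<u32, WakeComputeError>
where
    F: FnMut(u32) -> Result<(), WakeComputeError>,
{
    let mut attempt = 1;
    loop {
        match op(attempt) {
            Ok(()) => return Ok(attempt),
            // Retry only when the error is retryable and attempts remain.
            Err(e) if should_retry(e) && attempt < max_attempts => attempt += 1,
            Err(e) => return Err(e),
        }
    }
}
```

With this shape, a non-retryable error costs exactly one call, while transient errors still get the full retry budget.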

@oruen

## Problem

It's not possible to get the duration of the session from proxy events.

## Summary of changes

* Added a separate events folder in s3, to record disconnect events.
* Disconnect events are exactly the same as normal events, but also have a non-empty `disconnect_timestamp` field.
* @oruen suggested filling it with the same information as the original events to avoid potentially heavy joins.

## Problem

Benchmarks don't use the vectored read path.

## Summary of changes

* Update the benchmarks to use the vectored read path for both singular and vectored gets.
* Disable validation for the benchmarks

Extracted from #7514, 9399 is the default port. We want to specify it b/c we will start a second sql exporter for autoscaling agent soon. Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem

Sequential get runs after vectored get, so it is possible for the latter to time out while waiting for its ancestor's Lsn to become ready and for the former to succeed (it essentially has a doubled wait time).

## Summary of Changes

Relax the validation to allow for such rare cases.

Updates the four azure SDK crates used by remote_storage to 0.19.

Previously in #7375, we observed that for in-memory layers, we will need to iterate every key in the key space in order to get the result. The operation can be more efficient if we use BTreeMap as the in-memory layer representation, even if we are doing vectored get in a dense keyspace. Imagine a case where the in-memory layer covers a very small part of the keyspace, and most of the keys need to be found in lower layers. Using a BTreeMap can significantly reduce probes for nonexistent keys.

## Summary of changes

* Use BTreeMap as in-memory layer representation.
* Optimize the vectored get flow to utilize the range scan functionality of BTreeMap.

Signed-off-by: Alex Chi Z <chi@neon.tech>
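The range-scan idea can be sketched with the standard library alone. This is a simplified model, not the pageserver's code: keys are plain `u64` values, values are byte vectors, and `get_in_ranges` is a hypothetical helper name. The point is that `BTreeMap::range` seeks directly to the stored keys inside each requested range, so keys that are absent from the layer are never probed one by one.

```rust
use std::collections::BTreeMap;
use std::ops::Range;

/// Simplified stand-in for a page key.
pub type Key = u64;

/// Returns every (key, value) pair stored in `layer` that falls inside any of
/// the requested ranges. The cost is proportional to the number of stored
/// keys that match, not to the width of the ranges.
pub fn get_in_ranges(
    layer: &BTreeMap<Key, Vec<u8>>,
    ranges: &[Range<Key>],
) -> Vec<(Key, Vec<u8>)> {
    let mut out = Vec::new();
    for r in ranges {
        // `BTreeMap::range` panics if start > end, so skip degenerate ranges.
        if r.start >= r.end {
            continue;
        }
        for (k, v) in layer.range(r.clone()) {
            out.push((*k, v.clone()));
        }
    }
    out
}
```

A layer holding two keys answers a query over millions of keys by touching only those two entries; the remaining keys are left for lower layers to resolve.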

## Problem

Followup to #6776

While #6776 makes compaction safe on sharded tenants, the logic for keyspace partitioning remains inefficient: it assumes that the size of data on a pageserver can be calculated simply as the range between start and end of a Range -- this is not the case in sharded tenants, where data within a range belongs to a variety of shards.

Closes: #6774

## Summary of changes

I experimented with using a sharding-aware range type in KeySpace to replace all the Range<Key> uses, but the impact on other code was quite large (many places use the ranges), and not all of them need this property of being able to approximate the physical size of data within a key range.

So I compromised on expressing this as a ShardedRange type, but only using that type selectively: during keyspace repartition, and in tiered compaction when accumulating key ranges.

- keyspace partitioning methods take sharding parameters as an input
- new `ShardedRange` type wraps a Range<Key> and a shard identity
- ShardedRange::page_count is the shard-aware replacement for key_range_size
- Callers that don't need to be shard-aware (e.g. vectored get code that just wants to count the number of keys in a keyspace) can use ShardedRange::raw_size to get the faster, shard-naive code (same as old `key_range_size`)
- Compaction code is updated to carry a shard identity so that it can use shard aware calculations
- Unit tests for the new fragmentation logic.
- Add a test for compaction on sharded tenants, that validates that we generate appropriately sized image layers (this fails before fixing keyspace partitioning)
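To illustrate why the shard-naive size overestimates, here is a toy model. It assumes, purely for illustration, that keys are page numbers and that stripes of `stripe_size` consecutive pages are assigned to shards round-robin; the real key-to-shard mapping may differ. The `Toy*` types are hypothetical, and only the method names `page_count` and `raw_size` mirror the ones mentioned above.

```rust
use std::ops::Range;

/// Toy shard identity: `number` is this shard's index, `count` the total
/// number of shards, `stripe_size` the number of consecutive pages per stripe.
#[derive(Debug, Clone, Copy)]
pub struct ToyShardIdentity {
    pub number: u64,
    pub count: u64,
    pub stripe_size: u64,
}

/// A key range paired with the shard whose share of it we want to measure.
pub struct ToyShardedRange {
    range: Range<u64>,
    shard: ToyShardIdentity,
}

impl ToyShardedRange {
    pub fn new(range: Range<u64>, shard: ToyShardIdentity) -> Self {
        assert!(shard.count > 0 && shard.stripe_size > 0 && shard.number < shard.count);
        Self { range, shard }
    }

    /// Shard-naive size: every key in the range is counted.
    pub fn raw_size(&self) -> u64 {
        self.range.end.saturating_sub(self.range.start)
    }

    /// Number of pages in `0..n` that belong to this shard.
    fn owned_below(&self, n: u64) -> u64 {
        let cycle = self.shard.stripe_size * self.shard.count;
        let full_cycles = n / cycle;
        let rem = n % cycle;
        // Within the partial cycle, this shard owns at most one stripe.
        let in_partial = rem
            .saturating_sub(self.shard.number * self.shard.stripe_size)
            .min(self.shard.stripe_size);
        full_cycles * self.shard.stripe_size + in_partial
    }

    /// Shard-aware size: only pages stored on this shard are counted.
    pub fn page_count(&self) -> u64 {
        if self.range.end <= self.range.start {
            return 0;
        }
        self.owned_below(self.range.end) - self.owned_below(self.range.start)
    }
}
```

In this model, with two shards and a stripe size of 4, the range `2..10` has a raw size of 8 but only 4 pages on either shard, which is the kind of gap that would lead a partitioner using raw sizes to misjudge how much data a key range holds on one pageserver.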

PR #7454 included a workaround that let any existing bugged databases start up. Having used that already, we may now Closes: #7480

Not sure if this should actually be a link pointing to the `persistence.rs` file but following the conventions of the rest of the file, change `persistence.rs` reference to simply be a file name mention.

extracted (and tested) from #7468, part of #7462.

The current codebase assumes the keyspace is dense -- which means that if we have a keyspace of 0x00-0x100, we assume every key (e.g., 0x00, 0x01, 0x02, ...) exists in the storage engine. However, the assumption does not hold any more in metadata keyspace. The metadata keyspace is sparse. It is impossible to do per-key check.

Ideally, we should not have the assumption of dense keyspace at all, but this would incur a lot of refactors. Therefore, we split the keyspaces we have to dense/sparse and handle them differently in the code for now. At some point in the future, we should assume all keyspaces are sparse.

## Summary of changes

* Split collect_keyspace to return dense+sparse keyspace.
* Do not allow generating image layers for sparse keyspace (for now -- will fix this next week, we need image layers anyways).
* Generate delta layers for sparse keyspace.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem

`init_tenant_mgr` blocks the rest of pageserver startup, including starting the admin API.

This was noticeable in #7475, where the init_tenant_mgr runtime could be long enough to trip the controller's 30 second heartbeat timeout.

## Summary of changes

- When detaching tenants during startup, spawn the background deletes as background tasks instead of doing them inline
- Write all configs before spawning any tenants, so that the config writes aren't fighting tenants for system resources
- Write configs with some concurrency (16) rather than writing them all sequentially.
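The bounded-concurrency pattern in the last bullet can be sketched as follows. This is an assumption-laden illustration rather than the pageserver's implementation: it uses plain OS threads from the standard library instead of an async runtime, and `for_each_concurrent` is a hypothetical helper. A fixed number of workers pull items from a shared counter, so at most `limit` operations are in flight at once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Runs `f` on every item using at most `limit` worker threads.
/// Returns the number of workers that were started.
pub fn for_each_concurrent<T, F>(items: &[T], limit: usize, f: F) -> usize
where
    T: Sync,
    F: Fn(&T) + Sync,
{
    // Index of the next unclaimed item, shared by all workers.
    let next = AtomicUsize::new(0);
    // Never start more workers than there are items (and always at least one).
    let workers = limit.max(1).min(items.len().max(1));
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Claim one item; stop when the list is exhausted.
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= items.len() {
                    break;
                }
                f(&items[i]);
            });
        }
    });
    workers
}
```

Calling it as `for_each_concurrent(&configs, 16, |c| write_config(c))`, where `write_config` is a placeholder for the per-tenant write, would cap the number of simultaneous writes at 16 while still finishing much sooner than a sequential loop on I/O-bound work.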

… compaction (#7551) Makes two of the tests work with the tiered compaction that I had to ignore in #7283. The issue was that tiered compaction actually created image layers, but the keys didn't appear in them as `collect_keyspace` didn't include them. Not a compaction problem, but due to how the test is structured. Fixes #7287

It works by listing a postgres table containing a memory dump of the safekeepers' state. The s3 contents for each timeline are then checked against timeline_start_lsn and backup_lsn. If an inconsistency is found, the timeline (branch) is checked at the control plane before complaining; it might have been deleted between taking the dump and the s3 check.

- pageserver_id in project details is now optional, fix it
- add active_timeline_count guard/stat similar to active_tenant_count
- fix safekeeper prefix
- count and log deleted keys

Last run with 128 created too much load on cplane.

As it turns out we have at least one case of the same timeline_id in different projects.

## Problem

Storage controller was observed to have unexpectedly large memory consumption when loaded with many thousands of shards.

This was recently fixed:

- #7493

...but we need a general test that the controller is well behaved with thousands of shards.

Closes: #7460
Closes: #7463

## Summary of changes

- Add test test_storage_controller_many_tenants to exercise the system's behaviour with a more substantial workload. This test measures memory consumption and reproduces #7460 before the other changes in this PR.
- Tweak reconcile_all's return value to make it nonzero if it spawns no reconcilers, but _would_ have spawned some reconcilers if they weren't blocked by the reconcile concurrency limit. This makes the test's reconcile_until_idle behave as expected (i.e. not complete until the system is nice and calm).
- Fix an issue where tenant migrations would leave a spurious secondary location when migrated to some location that was not already their secondary (this was an existing low-impact bug that tripped up the test's consistency checks).

On the test with 8000 shards, the resident memory per shard is about 20KiB. This is not really per-shard memory: the primary source of memory growth is the number of concurrent network/db clients we create.

With 8000 shards, the test takes 125s to run on my workstation.
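The `reconcile_all` return-value tweak can be modelled with a toy controller. Everything here is a hypothetical simplification: `ToyController` tracks only a count of shards needing reconciliation, spawned reconcilers are treated as finishing immediately, and the exact value the real function returns is not stated above, so returning "spawned plus blocked" is an assumption that merely satisfies the "nonzero when work was blocked" property.

```rust
/// Toy model of a controller with some shards waiting to be reconciled.
pub struct ToyController {
    pending: usize,
}

impl ToyController {
    pub fn new(pending: usize) -> Self {
        Self { pending }
    }

    /// One reconcile pass. `free_units` is how many reconcilers the
    /// concurrency limit currently allows us to spawn.
    ///
    /// Returns spawned + blocked, so the result is nonzero whenever work
    /// remains, even if nothing could be spawned on this pass.
    pub fn reconcile_all(&mut self, free_units: usize) -> usize {
        let spawned = self.pending.min(free_units);
        let blocked = self.pending - spawned;
        // Toy assumption: spawned reconcilers complete immediately.
        self.pending -= spawned;
        spawned + blocked
    }

    pub fn pending(&self) -> usize {
        self.pending
    }
}
```

A caller that loops until `reconcile_all` returns zero would, under this model, keep waiting while the concurrency limit is saturated instead of concluding early that the system is idle.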

## Problem

This test became flaky recently with failures like:

```
AssertionError: Log errors on storage_controller: (129, '2024-04-29T16:41:03.591506Z ERROR request{method=PUT path=/control/v1/tenant/b38c0447fbdbcf4e1c023f00b0f7c221/shard_split request_id=34df4975-2ef3-4ed8-b167-2956650e365c}: Error processing HTTP request: InternalServerError(Reconcile error on shard b38c0447fbdbcf4e1c023f00b0f7c221-0002: Cancelled\n')
```

Likely due to #7508 changing how errors are reported from Reconcilers.

## Summary of changes

- Tolerate `Reconcile error.*Cancelled` log errors

## Problem

The current Makefile assumes that homebrew is used on macOS. There are other ways to install dependencies on macOS (nix, macports, "manually"). It would be great to allow anyone who wants to use other options to disable homebrew integration.

## Summary of changes

It adds a DISABLE_HOMEBREW variable that, if set, skips extra homebrew-specific configuration steps.

We had an incident where pageserver requests timed out because pageserver couldn't fetch WAL from safekeepers. This incident was caused by a bug in safekeeper logic for timeline activation, which prevented pageserver from finding safekeepers.

This bug has since been fixed, but there is still a chance of a similar bug in the future due to overall complexity.

We add a new broker message to "signal interest" for a timeline. This signal will be sent by the pageserver's `wait_lsn`, and safekeepers will receive this signal to start broadcasting broker messages. Then every broker subscriber will be able to find the safekeepers and connect to them (to start fetching WAL).

This feature is not limited to pageservers, and any service that wants to download WAL from safekeepers will be able to use this discovery request.

This commit changes pageserver's connection_manager (walreceiver) to send a SafekeeperDiscoveryRequest when there is no information about safekeepers present in memory. The current implementation will send these requests only if there is an active wait_lsn() call and no more often than once per 10 seconds.

Add `test_broker_discovery` to test this: safekeepers started with `--disable-periodic-broker-push` will not push info to the broker, so the pageserver must use a discovery request to start fetching WAL.

Add task_stats in the safekeepers' broker module to log a warning if no message has been received from the broker for the last 10 seconds.

Closes #5471

---------

Co-authored-by: Christian Schwarz <christian@neon.tech>
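The sending conditions described above (no safekeeper information in memory, an active `wait_lsn()` call, and at most one request per 10 seconds) can be sketched as a small throttle. `DiscoveryThrottle` and its method are hypothetical names for illustration; the actual connection manager is structured differently.

```rust
use std::time::{Duration, Instant};

/// Decides when a discovery request may be sent.
pub struct DiscoveryThrottle {
    min_interval: Duration,
    last_sent: Option<Instant>,
}

impl DiscoveryThrottle {
    pub fn new(min_interval: Duration) -> Self {
        Self { min_interval, last_sent: None }
    }

    /// Returns true (and records `now` as the send time) only if:
    /// - nothing is known about safekeepers,
    /// - some caller is actively waiting for an LSN, and
    /// - at least `min_interval` has passed since the previous request.
    pub fn should_send(
        &mut self,
        now: Instant,
        have_safekeeper_info: bool,
        active_wait_lsn: bool,
    ) -> bool {
        if have_safekeeper_info || !active_wait_lsn {
            return false;
        }
        let due = match self.last_sent {
            None => true,
            Some(last) => now.saturating_duration_since(last) >= self.min_interval,
        };
        if due {
            self.last_sent = Some(now);
        }
        due
    }
}
```

Passing `now` in explicitly, rather than reading the clock inside the method, keeps the rate limit easy to test without sleeping.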

Instead of showing the full path of layer traversal, we now only show tenant (in tracing context)+timeline+filename. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

close #7391

## Summary of changes

Categorize basebackup error into two types: server error and client error. This makes it easier to set up alerts.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

…tress (#7281)

…migration (#7583)

## Problem

The logic in Service::optimize_all would sometimes choose to migrate a tenant to a secondary location that was only recently created, resulting in Reconciler::live_migrate hitting its 5 minute timeout warming up the location, and proceeding to attach a tenant to a location that doesn't have a warm enough local set of layer files for good performance.

Closes: #7532

## Summary of changes

- Add a pageserver API for checking download progress of a secondary location
- During `optimize_all`, connect to pageservers of candidate optimization secondary locations, and check they are warm.
- During shard split, do heatmap uploads and start secondary downloads, so that the new shards' secondary locations start downloading ASAP, rather than waiting minutes for background downloads to kick in.

I have intentionally not implemented this by continuously reading the status of locations, to avoid dealing with the scale challenge of efficiently polling & updating 10k-100k locations status. If we implement that in the future, then this code can be simplified to act based on latest state of a location rather than fetching it inline during optimize_all.

This pull request adds the scan interface. Scan operates on a sparse keyspace and retrieves all the key-value pairs from the keyspaces. Currently, scan only supports the metadata keyspace, and by default does not retrieve anything from the ancestor branch. This should be fixed in the future if we need to have some keyspaces that inherit from the parent.

The scan interface reuses the vectored get code path by disabling the missing key errors. This pull request also changes the behavior of vectored get on aux file v1/v2 key/keyspace: if the key is not found, it is simply not included in the result, instead of throwing a missing key error.

TODOs in future pull requests: limit memory consumption, ensure the search stops when all keys are covered by the image layer, remove `#[allow(dead_code)]` once the code path is used in basebackups / aux files, remove unnecessary fine-grained keyspace tracking in vectored get (or have another code path for scan) to improve performance.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

## Problem

Too many connect_compute attempts can overwhelm postgres, getting the connections stuck.

## Summary of changes

Limit number of connection attempts that can happen at a given time.
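One common way to cap in-flight attempts is a counting semaphore whose permits are released automatically when dropped. The sketch below builds one from the standard library's `Mutex` and `Condvar`; `ConnectLimiter` and `Permit` are hypothetical names, and the proxy itself is async and may use a different primitive.

```rust
use std::sync::{Condvar, Mutex};

/// Allows at most `max_concurrent` holders of a `Permit` at any time.
pub struct ConnectLimiter {
    free: Mutex<usize>,
    released: Condvar,
}

/// Held for the duration of one connection attempt; dropping it frees a slot.
pub struct Permit<'a> {
    limiter: &'a ConnectLimiter,
}

impl ConnectLimiter {
    pub fn new(max_concurrent: usize) -> Self {
        Self { free: Mutex::new(max_concurrent), released: Condvar::new() }
    }

    /// Blocks until a slot is free. With `max_concurrent == 0` this would
    /// block forever, so callers should configure a positive limit.
    pub fn acquire(&self) -> Permit<'_> {
        let mut free = self.free.lock().unwrap();
        while *free == 0 {
            free = self.released.wait(free).unwrap();
        }
        *free -= 1;
        Permit { limiter: self }
    }

    /// Non-blocking variant: returns `None` when the limit is reached.
    pub fn try_acquire(&self) -> Option<Permit<'_>> {
        let mut free = self.free.lock().unwrap();
        if *free == 0 {
            return None;
        }
        *free -= 1;
        Some(Permit { limiter: self })
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        // Return the slot and wake one waiter, if any.
        *self.limiter.free.lock().unwrap() += 1;
        self.limiter.released.notify_one();
    }
}
```

Tying the release to `Drop` means a slot is returned on every exit path from a connection attempt, including early returns on error.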

introduced by #7468 conflicting with #7584 Signed-off-by: Alex Chi Z <chi@neon.tech>

…local` (#7606) This is the first step towards representing all of Pageserver configuration as clean `serde::Serialize`able Rust structs in `pageserver_api`. The `neon_local` code will then use those structs instead of the crude `toml_edit` / string concatenation that it does today. refs #7555 --------- Co-authored-by: Alex Chi Z <iskyzh@gmail.com>

Part of neondatabase/cloud#12047. Resolves #7239.

In short, this PR:

1. Adds `ComputeSpec.swap_size_bytes: Option<u64>`
2. Adds a flag to compute_ctl: `--resize-swap-on-bind`
3. Implements running `/neonvm/bin/resize-swap` with the value from the compute spec before starting postgres, if both the value in the spec *AND* the flag are specified.
4. Adds `sudo` to the final image
5. Adds a file in `/etc/sudoers.d` to allow `compute_ctl` to resize swap

Various bits of reasoning about design decisions in the added comments.

In short: We have both a compute spec field and a flag to make rollout easier to implement. The flag will most likely be removed as part of cleanups for neondatabase/cloud#12047.
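The "both the spec value AND the flag" rule from step 3 reduces to a small decision function. `swap_resize_target` is a hypothetical helper written for illustration; it only decides whether a resize should happen and with what size, and deliberately does not model how `/neonvm/bin/resize-swap` is invoked.

```rust
/// Returns the swap size to apply before starting postgres, or `None` if no
/// resize should happen. A resize is requested only when the spec carries a
/// size AND the `--resize-swap-on-bind` flag was passed.
pub fn swap_resize_target(
    swap_size_bytes: Option<u64>,
    resize_swap_on_bind: bool,
) -> Option<u64> {
    if resize_swap_on_bind {
        swap_size_bytes
    } else {
        None
    }
}
```

Keeping the flag as a separate gate means a spec that already carries `swap_size_bytes` has no effect on computes started without the flag, which matches the rollout reasoning given above.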

- On a non-pooled start, do not reset the 'start_time' after launching the HTTP service. In a non-pooled start, it's fair to include that in the total startup time.
- When setting wait_for_spec_ms and resetting start_time, call Utc::now() only once. It's a waste of cycles to call it twice, but also, it ensures the time between setting wait_for_spec_ms and resetting start_time is included in one or the other time period.

These differences should be insignificant in practice, in the microsecond range, but IMHO it seems more logical and readable this way too. Also fix and clarify some of the surrounding comments. (This caught my eye while reviewing PR #7577)

The top reason for it being flaky.

Previously its segment header and page header of first record weren't initialized because compute streams data only since first record LSN. Also, fix a bug in the existing code for initialization: xlp_rem_len must not include page header. These changes make first segment pg_waldump'able.

To test it as well.

github-actions · 2024-05-06T06:41:25Z

2880 tests run: 2758 passed, 1 failed, 121 skipped (full report)

Failures on Postgres 14

test_delete_timeline_exercise_crash_safety_failpoints[Check.RETRY_WITHOUT_RESTART-timeline-delete-after-index-delete]: release

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_delete_timeline_exercise_crash_safety_failpoints[release-pg14-Check.RETRY_WITHOUT_RESTART-timeline-delete-after-index-delete]"

Flaky tests (3)

Postgres 15

test_partial_evict_tenant[relative_equal]: release
test_partial_evict_tenant[relative_spare]: release
test_lock_time_tracing: release

Test coverage report is not available

_{The comment gets automatically updated with the latest test results
0353a72 at 2024-05-06T06:41:24.526Z :recycle:}

problame

PS and SK are +1

danieltprice · 2024-05-09T08:59:24Z

Reviewed for changelog.

jcsp and others added 30 commits April 29, 2024 08:52

proxy: Exclude compute and retries (#7529)

24ce878

## Problem

Alerts fire if the connection to the compute is slow.

## Summary of changes

Exclude compute and retry from latencies.

proxy: Adjust retry wake compute (#7537)

90cadfa

## Problem

Right now we always retry wake compute.

## Summary of changes

Create a list of errors for which we can avoid needless retries.

pagserver: use vectored read path in benchmarks (#7498)

1f417af

## Problem

Benchmarks don't use the vectored read path.

## Summary of changes

* Update the benchmarks to use the vectored read path for both singular and vectored gets.
* Disable validation for the benchmarks

chore(vm-image): specify sql exporter listen port (#7526)

89cae64

Extracted from #7514, 9399 is the default port. We want to specify it b/c we will start a second sql exporter for autoscaling agent soon. Signed-off-by: Alex Chi Z <chi@neon.tech>

Update azure_* crates to 0.19 (#7539)

cddafc7

Updates the four azure SDK crates used by remote_storage to 0.19.

pageserver: remove workarounds from #7454 (#7550)

577982b

PR #7454 included a workaround that let any existing bugged databases start up. Having used that already, we may now Closes: #7480

docs: fix unintentional file link (#7506)

84b6b95

Not sure if this should actually be a link pointing to the `persistence.rs` file but following the conventions of the rest of the file, change `persistence.rs` reference to simply be a file name mention.

s3_scrubber: revive garbage collection for safekeepers.

ea37234

- pageserver_id in project details is now optional, fix it
- add active_timeline_count guard/stat similar to active_tenant_count
- fix safekeeper prefix
- count and log deleted keys

Decrease CONSOLE_CONCURRENCY.

7434674

Last run with 128 created too much load on cplane.

Recheck tenant_id in find_timeline_branch.

9f792f9

As it turns out we have at least one case of the same timeline_id in different projects.

Add retries to cloud_admin client.

4ac4b21

Add more context to s3 listing error.

3a2f107

chore(pageserver): concise error message for layer traversal (#7565)

26e6ff8

Instead of showing the full path of layer traversal, we now only show tenant (in tracing context)+timeline+filename. --------- Signed-off-by: Alex Chi Z <chi@neon.tech>

chore(pageserver): categorize basebackup errors (#7523)

5558457

close #7391

## Summary of changes

Categorize basebackup error into two types: server error and client error. This makes it easier to set up alerts.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>

Add retry loops and bump test timeout in test_pageserver_connection_stress (#7281)

d43d773

jcsp and others added 10 commits May 3, 2024 14:28

proxy: add connect compute concurrency lock (#7607)

9b65946

## Problem

Too many connect_compute attempts can overwhelm postgres, getting the connections stuck.

## Summary of changes

Limit number of connection attempts that can happen at a given time.

fix(pageserver): remove update_gc_info calls in tests (#7608)

ef03b38

introduced by #7468 conflicting with #7584 Signed-off-by: Alex Chi Z <chi@neon.tech>

Allow bad state (not active) pageserver error/warns in walcraft test.

5da3e21

The top reason for it being flaky.

pg_waldump segment on safekeeper in test_pg_waldump.

0353a72

To test it as well.

vipvap requested review from a team as code owners May 6, 2024 06:04

vipvap requested review from tristan957, arssher, conradludgate and mtyazici and removed request for a team May 6, 2024 06:04

problame approved these changes May 6, 2024


tristan957 approved these changes May 6, 2024


arpad-m approved these changes May 6, 2024


problame merged commit c4d7d59 into release May 7, 2024
103 of 107 checks passed

problame deleted the rc/2024-05-06 branch May 7, 2024 07:41
