Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release 2024-04-29 #7530

Merged
merged 48 commits into from
Apr 29, 2024
Merged

Release 2024-04-29 #7530

merged 48 commits into from
Apr 29, 2024

Conversation

vipvap
Copy link

@vipvap vipvap commented Apr 29, 2024

Release 2024-04-29

Please merge this Pull Request using 'Create a merge commit' button

hlinnaka and others added 30 commits April 22, 2024 10:58
As noted in the comment, the craft_internal() function fails if the
inserted WAL happens to land at page boundary. I bumped into that with
PR #7377; it changed the arguments of a few SQL functions in
neon_test_utils extension, which changed the WAL positions slightly, and
caused a test failure.
There were two issues with the test at page boundaries:

1. If the first logical message with 10 bytes payload crossed a page
boundary, the calculated 'base_size' was too large because it included
the page header.

2. If it was inserted near the end of a page so that there was not
enough room for another one, we did "remaining_lsn += XLOG_BLCKSZ" but
that didn't take into account the page headers either.

As a result, the test would fail if the WAL insert position at the
beginning of the test was too close to the end of a WAL page. Fix the
calculations by repeating the 10-byte logical message if the starting
position is not suitable.

I bumped into this with PR #7377; it changed the arguments of a few SQL
functions in neon_test_utils extension, which changed the WAL positions
slightly, and caused a test failure.


This is similar to #7436, but
for different test.
Currently, any `Timeline::schedule_uploads` will generate a fresh
`TimelineMetadata` instead of updating the values, which it means to
update. This makes it impossible for #6994 to work while `Timeline`
receives layer flushes by overwriting any configured new
`ancestor_timeline_id` and possible `ancestor_lsn`.

The solution is to only make full `TimelineMetadata` "updates" from one
place: branching. At runtime, update only the three fields, same as
before in `Timeline::schedule_updates`.
Before, we asserted that a layer would only be loaded by the timeline
that initially created it. Now, with the ancestor detach, we will want
to utilize remote copy as much as possible, so we will need to open
other timeline layers as our own.

Cc: #6994
## Problem

Currently we cannot configure retries, also, we don't really have
visibility of what's going on there.

## Summary of changes

* Added cli params
* Improved logging
* Decrease the number of retries: it feels like most of retries doesn't
help. Once there would be better errors handling, we can increase it
back.
## Problem

- #7451 

INIT_FORKNUM blocks must be stored on shard 0 to enable including them
in basebackup.

This issue can be missed in simple tests because creating an unlogged
table isn't sufficient -- to repro I had to create an _index_ on an
unlogged table (then restart the endpoint).

Closes: #7451 

## Summary of changes

- Add a reproducer for the issue.
- Tweak the condition for `key_is_shard0` to include anything that isn't
a normal relation block _and_ any normal relation block whose forknum is
INIT_FORKNUM.
- To enable existing databases to recover from the issue, add a special
case that omits relations if they were stored on the wrong INITFORK.
This enables postgres to start and the user to drop the table and
recreate it.
…7455)

fixes #7452

Also, drive-by improve the usage instructions with commands I found
useful during that incident.

The patch in the fork of `svg_fmt` is [being
upstreamed](nical/rust_debug#4), but, in the
meantime,
let's commit what we have because it was useful during the incident.
As part of #7375 and to improve
the current vectored get implementation, we separate the missing key
error out. This also saves us several Box allocations in the get page
implementation.

## Summary of changes

* Create a caching field of layer traversal id for each of the layer.
* Remove box allocations for layer traversal id retrieval and implement
MissingKey error message as before. This should be a little bit faster.
* Do not format error message until `Display`.
* For in-mem layer, the descriptor is different before/after frozen. I'm
using once lock for that.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Instead of trusting that a request with latest == true means that the
request LSN was at least last_record_lsn, remember explicitly when the
relation cache was initialized.

Incidentally, this allows updating the relation size cache also on reads
from read-only endpoints, when the endpoint is at a relatively recent
LSN (more recent than the end of the timeline when the timeline was
loaded in the pageserver).

Add a comment to wait_or_get_last_lsn() that it might be better to use
an older LSN when possible. Note that doing that would be unsafe,
without the relation cache changes in this commit!
…format changes (#7458)

## Problem
The `export_import_between_pageservers` script us to do major storage format changes
in the past. If we have to do such breaking changes in the future this approach
wouldn't be suitable because:
1. It doesn't scale to the current size of the fleet
2. It loses history

## Summary of changes
Remove the script and its associated test.
Keep `fullbasebackup` and friends because it's useful for debugging.

Closes neondatabase/cloud#11648
…ler (#7476)

Tested this locally via a simple patch, the `tenant_id` is now gone from
the json.

Follow-up of #7055, prerequisite for #7469.
## Problem
Vectored get would descend into ancestor timelines for aux files.
This is not the behaviour of the legacy read path and blocks cutting
over to the vectored read path.

Fixes #7379

## Summary of Changes
Treat non inherited keys specially in vectored get. At the point when
we want to descend into the ancestor mark all pending non inherited keys
as errored out at the key level. Note that this diverges from the
standard vectored get behaviour for missing keys which is a top level
error. This divergence is required to avoid blocking compaction in case
such an error is encountered when compaction aux files keys. I'm pretty
sure the bug I just described predates the vectored get implementation,
but it's still worth fixing.
## Problem
We recently went through an incident where compaction was inhibited by a
bug. We didn't observe this until quite late because we did not have alerting
on deep reads.

## Summary of changes
+ Tweak an existing metric that tracks the depth of a read on the
non-vectored read path:
  * Give it a better name
  * Track all layers
  * Larger buckets
+ Add a similar metric for the vectored read path
+ Add a compaction smoke test which uses these metrics. This test would
have caught
the compaction issue mentioned earlier.

Related #7428
## Problem

In some dev/test environments, there aren't health checks to guarantee
the database is available before starting the controller. This creates
friction for the developer.

## Summary of changes

- Wait up to 5 seconds for the database to become available on startup
Extracted from #7375. We assume
everything >= 0x80 are metadata keys. AUX file keys are part of the
metadata keys, and we use `0x90` as the prefix for AUX file keys.

The AUX file encoding is described in the code comment. We use xxhash128
as the hash algorithm. It seems to be portable according to the
introduction,

> xxHash is an Extremely fast Hash algorithm, processing at RAM speed
limits. Code is highly portable, and produces hashes identical across
all platforms (little / big endian).

...though whether the Rust version follows the same convention is
unknown and might need manual review of the library. Anyways, we can
always change the hash algorithm before rolling it out in
staging/end-user, and I made a quick decision to use xxhash here because
it generates 128b hash + portable. We can save the discussion of which
hash algorithm to use later.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem

Split off from #7399, which is
the first piece of code that does a WithDelimiter object listing using a
prefix that isn't a full directory name.

## Summary of changes

- Revise list function to not append a `/` to the prefix -- prefixes
don't have to end with a slash.
- Fix local_fs implementation of list to not assume that WithDelimiter
case will always use a directory as a prerfix.
- Remove `list_files`, `list_prefixes` wrappers, as they add little
value and obscure the underlying list function -- we need callers to
understand the semantics of what they're really calling (listobjectsv2)
## Problem

We already made a change in #6407 to make pitr_interval authoritative
for synthetic size calculations (do not charge users for data retained
due to gc_horizon), but that change didn't cover the case where someone
entirely disables time-based retention by setting pitr_interval=0

Relates to: #6374

## Summary of changes

When pitr_interval is zero, do not set `pitr_cutoff` based on
gc_horizon.

gc_horizon is still enforced, but separately (its value is passed
separately, there was never a need to claim pitr_cutoff to gc_horizon)

## More detail

### Issue 1
Before this PR, we would skip the update_gc_info for timelines with
last_record_lsn() < gc_horizon.
Let's call such timelines "tiny".

The rationale for that presumably was that we can't GC anything in the
tiny timelines, why bother to call update_gc_info().

However, synthetic size calculation relies on up-to-date
update_gc_info() data.

Before this PR, tiny timelines would never get an updated
GcInfo::pitr_horizon (it remained Lsn(0)).
Even on projects with pitr_interval=0d.

With this PR, update_gc_info is always called, hence
GcInfo::pitr_horizon is always updated, thereby
providing synthetic size calculation with up-to-data data.

### Issue 2
Before this PR, regardless of whether the timeline is "tiny" or not,
GcInfo::pitr_horizon was clamped to at least last_record_lsn -
gc_horizon, even if the pitr window in terms of LSN range was shorter
(=less than) the gc_horizon.

With this PR, that clamping is removed, so, for pitr_interval=0, the
pitr_horizon = last_record_lsn.
There was an edge case where
`get_lsn_by_timestamp`/`find_lsn_for_timestamp` could have returned an
lsn that is before the limits we enforce: when we did find SLRU entries
with timestamps before the one we search for.

The API contract of `get_lsn_by_timestamp` is to not return something
before the anchestor lsn.

cc https://neondb.slack.com/archives/C03F5SM1N02/p1713871064147029
As seen with a recent incident, eviction tasks can cause pageserver-wide
permit starvation on the background task semaphore when synthetic size
calculation takes a long time for a tenant that has more than our permit
number of timelines or multiple tenants that have slow synthetic size
and total number of timelines exceeds the permits. Metric links can be
found in the internal [slack thread].

As a solution, release the permit while waiting for the state guarding
the synthetic size calculation. This will most likely hurt the eviction
task eviction performance, but that does not matter because we are
hoping to get away from it using OnlyImitiate policy anyway and rely
solely on disk usage-based eviction.

[slack thread]:
https://neondb.slack.com/archives/C06UEMLK7FE/p1713810505587809?thread_ts=1713468604.508969&cid=C06UEMLK7FE
## Problem

## Summary of changes

By default, it's 5s retry.
## Problem
Vectored and non-vectored read paths don't publish the same set of
metrics. Metrics parity is needed for coalescing the read paths.

## Summary of changes
* Publish reconstruct time and fetching data for reconstruct time from
the vectored read path
* Remove pageserver_getpage_reconstruct_seconds{res="err"} - wasn't used
anyway
## Problem
If the previous step of the vectored left no further keyspace to
investigate (i.e. keyspace remains empty after removing keys completed in the previous step),
then we'd still grab the layers lock, potentially add an in-mem layer to the fringe
and at some further point read its index without reading any values from it.

## Summary of changes
If there's nothing left in the current keyspace, then skip the search
and just select the next item from the fringe as usual.

When running `test_pg_regress[release-pg16]` with the vectored read path
for singular gets this improved perf drastically (see PR cover letter).

## Correctness
Since no keys remained from the previous range (i.e. we are on a leaf
node) there's nothing that search can find in deeper nodes.
macOS max_rss is in bytes, while Linux is in kilobytes.
https://stackoverflow.com/a/59915669

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
#7461)

Currently we move data to the intended storage class via lifecycle
rules, but those are a daily batch job so data first spends up to a day
in standard storage.

Therefore, make it possible to specify the storage class used for
uploads to S3 so that the data doesn't have to be migrated
automatically.

The advantage of this is that it gives cleaner billing reports.

Part of neondatabase/cloud#11348
## Problem

Storage controller memory can spike very high if we have many tenants
and they all try to reconcile at the same time.

Related:
- #7463
- #7460

Not closing those issues in this PR, because the test coverage for them
will be in #7475

## Summary of changes

- Add a CLI arg `--reconciler-concurrency`, defaulted to 128
- Add a semaphore to Service with this many units
- In `maybe_reconcile_shard`, try to acquire semaphore unit. If we can't
get one, return a ReconcileWaiter for a future sequence number, and push
the TenantShardId onto a channel of delayed IDs.
- In `process_result`, consume from the channel of delayed IDs if there
are semaphore units available and call maybe_reconcile_shard again for
these delayed shards.

This has been tested in #7475,
but will land that PR separately because it contains other changes &
needs the test stabilizing. This change is worth merging sooner, because
it fixes a practical issue with larger shard counts.
## Problem

The `WithClientIp` AsyncRead/Write abstraction never filled me with much
joy. I would just rather read the protocol header once and then get the
remaining buf and reader.

## Summary of changes

* Replace `WithClientIp::wait_for_addr` with `read_proxy_protocol`.
* Replace `WithClientIp` with `ChainRW`.
* Optimise `ChainRW` to make the standard path more optimal.
## Problem

## Summary of changes

Decrease waiting time
## Problem

Cancellations were published to the channel, that was never read.

## Summary of changes

Fallback to global redis publishing.
## Problem
We are currently supporting two read paths. No bueno.

## Summary of changes
High level: use vectored read path to serve get page requests - gated by
`get_impl` config
Low level:
1. Add ps config, `get_impl` to specify which read path to use when
serving get page requests
2. Fix base cached image handling for the vectored read path. This was
subtly broken: previously we
would not mark keys that went past their cached lsn as complete. This is
a self standing change which
could be its own PR, but I've included it here because writing separate
tests for it is tricky.
3. Fork get page to use either the legacy or vectored implementation 
4. Validate the use of vectored read path when serving get page requests
against the legacy implementation.
Controlled by `validate_vectored_get` ps config.
5. Use the vectored read path to serve get page requests in tests (with
validation).

## Note
Since the vectored read path does not go through the page cache to read
buffers, this change also amounts to a removal of the buffer page cache. Materialized page cache
is still used.
## Problem

Start switching from the global redis to the regional one

## Summary of changes

* Publish cancellations to the regional redis
* Listen notifications from both: global and regional
hlinnaka and others added 14 commits April 25, 2024 20:45
Before PR #7377, on-demand SLRU download always used the basebackup's
LSN in the SLRU download, but that LSN might get garbage-collected away
in the pageserver. We should request the latest LSN, like with GetPage
requests, with the LSN just indicating that we know that the page hasn't
been changed since the LSN (since the basebackup in this case).

Add test to demonstrate the problem. Without the fix, it fails with
"tried to request a page version that was garbage collected" error from
the pageserver.

I wrote this test as part of earlier PR #6693, but that fell through
the cracks and was never applied. PR #7377 superseded the fix from
that older PR, but the test is still valid.
These are testability/logging improvements spun off from #7475

- Don't log warnings for shutdown errors in compute hook
- Revise logging around heartbeats and reconcile_all so that we aren't
emitting such a large volume of INFO messages under normal quite
conditions.
- Clean up the `last_error` of TenantShard to hold a ReconcileError
instead of a String, and use that properly typed error to suppress
reconciler cancel errors during reconcile_all_now. This is important for
tests that iteratively call that, as otherwise they would get 500 errors
when some reconciler in flight was cancelled (perhaps due to a state
change on the tenant shard starting a new reconciler).
part of #7124

Changes
-------

This PR replaces the `EphemeralFile::write_blob`-specifc `struct Writer`
with re-use of `owned_buffers_io::write::BufferedWriter`.

Further, it restructures the code to cleanly separate

* the high-level aspect of EphemeralFile's write_blob / read_blk API
* the page-caching aspect
* the aspect of IO
  * performing buffered write IO to an underlying VirtualFile
* serving reads from either the VirtualFile or the buffer if it hasn't
been flushed yet
* the annoying "feature" that reads past the end of the written range
are allowed and expected to return zeroed memory, as long as one remains
within one PAGE_SZ
Linter error from a merge collision
part of #7124

# Problem

(Re-stating the problem from #7124 for posterity)

The `test_bulk_ingest` benchmark shows about 2x lower throughput with
`tokio-epoll-uring` compared to `std-fs`.
That's why we temporarily disabled it in #7238.

The reason for this regression is that the benchmark runs on a system
without memory pressure and thus std-fs writes don't block on disk IO
but only copy the data into the kernel page cache.
`tokio-epoll-uring` cannot beat that at this time, and possibly never.
(However, under memory pressure, std-fs would stall the executor thread
on kernel page cache writeback disk IO. That's why we want to use
`tokio-epoll-uring`. And we likely want to use O_DIRECT in the future,
at which point std-fs becomes an absolute show-stopper.)

More elaborate analysis:
https://neondatabase.notion.site/Why-test_bulk_ingest-is-slower-with-tokio-epoll-uring-918c5e619df045a7bd7b5f806cfbd53f?pvs=4

# Changes

This PR increases the buffer size of `blob_io` and `EphemeralFile` from
PAGE_SZ=8k to 64k.

Longer-term, we probably want to do double-buffering / pipelined IO.

# Resource Usage

We currently do not flush the buffer when freezing the InMemoryLayer.
That means a single Timeline can have multiple 64k buffers alive, esp if
flushing is slow.
This poses an OOM risk.

We should either bound the number of frozen layers
(#7317).

Or we should change the freezing code to flush the buffer and drop the
allocation.

However, that's future work.

# Performance

(Measurements done on i3en.3xlarge.)

The `test_bulk_insert.py` is too noisy, even with instance storage. It
varies by 30-40%. I suspect that's due to compaction. Raising amount of
data by 10x doesn't help with the noisiness.)

So, I used the `bench_ingest` from @jcsp 's #7409  .
Specifically, the `ingest-small-values/ingest 128MB/100b seq` and
`ingest-small-values/ingest 128MB/100b seq, no delta` benchmarks.

|     |                   | seq | seq, no delta |
|-----|-------------------|-----|---------------|
| 8k  | std-fs            | 55  | 165           |
| 8k  | tokio-epoll-uring | 37  | 107           |
| 64k | std-fs            | 55  | 180           |
| 64k | tokio-epoll-uring | 48  | 164           |

The `8k` is from before this PR, the `64k` is with this PR.
The values are the throughput reported by the benchmark (MiB/s).

We see that this PR gets `tokio-epoll-uring` from 67% to 87% of `std-fs`
performance in the `seq` benchmark. Notably, `seq` appears to hit some
other bottleneck at `55 MiB/s`. CC'ing #7418 due to the apparent
bottlenecks in writing delta layers.

For `seq, no delta`, this PR gets `tokio-epoll-uring` from 64% to 91% of
`std-fs` performance.
…7420)

## Problem
PR #7230 attempted to introduce a WAL ingest threshold for checking
whether enough deltas are stacked to warrant creating a new image layer.
However, this check was incorrectly performed at the compaction
partition level instead of the timeline level. Hence, it inhibited GC
for any keys outside of the first partition.

## Summary of Changes
Hoist the check up to the timeline level.
…7520)

Implements an approach different from the one #7488 chose: We now return
`past` instead of `present` (or`future`) when encountering the edge case
where commit_lsn < min_lsn. In my opinion, both `past` and `present` are
correct responses, but past is slightly better as the lsn returned by
`present` with #7488 is one too "new". In practice, this shouldn't
matter much, but shrug.

We agreed in slack that this is the better approach:
https://neondb.slack.com/archives/C03F5SM1N02/p1713871064147029
Changing metadata format is not easy. This pull request adds a
tenant-level flag on whether to enable aux file v2. As long as we don't
roll this out to the user and guarantee our staging projects can persist
tenant config correctly, we can test the aux file v2 change with setting
this flag. Previous discussion at
#7424.

Signed-off-by: Alex Chi Z <chi@neon.tech>
extracted from #7468, part of
#7462.

In the page server, we use i128 (instead of u128) to do the integer
representation of the key, which indicates that the highest bit of the
key should not be 1. This constraints our keyspace to <= 0x7F.

Also fix the bug of `to_i128` that dropped the highest 4b. Now we keep
3b of them, dropping the sign bit.

And on that, we shrink the metadata keyspace to 0x60-0x7F for now, and
once we add support for u128, we can have a larger metadata keyspace.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
starting at commit
dbb0c96,
macOS reports warning for a few functions in the virtual file module.

Signed-off-by: Alex Chi Z <chi@neon.tech>
We know that updating gc info can take a very long time from [recent
incident], and holding `Tenant::gc_cs` affects many per-tenant
operations in the system. We need a direct way to observe the time it
takes. The solution is to add metrics so that we know when this happens:
- 2 new per-timeline metric
- 1 new global histogram

Verified that the buckets are okay-ish in [dashboard]. In our current
state, we will see a lot more of `Inf,` but that is probably okay; at
least we can learn which timelines are having issues.

Can we afford to add these metrics? A bit unclear, see [another
dashboard] with top pageserver `/metrics` response sizes.

[dashboard]:
https://neonprod.grafana.net/d/b7a5a5e2-1276-4bb0-9e3a-b4528adb6eb6/storage-operations-histograms-in-prod?orgId=1&var-datasource=ZNX49CDVz&var-instance=All&var-operation=All&from=now-7d&to=now

[another dashboard]:
https://neonprod.grafana.net/d/MQx4SN-Vk/metric-sizes-on-prod-and-some-correlations?orgId=1

[recent incident]:
https://neondb.slack.com/archives/C06UEMLK7FE/p1713817696580119?thread_ts=1713468604.508969&cid=C06UEMLK7FE
@vipvap vipvap requested review from a team as code owners April 29, 2024 06:04
@vipvap vipvap requested review from petuhovskiy, khanova and hlinnaka and removed request for a team April 29, 2024 06:04
Copy link

2796 tests run: 2675 passed, 0 failed, 121 skipped (full report)


Flaky tests (1)

Postgres 15

  • test_vm_bit_clear_on_heap_lock: debug

Code coverage* (full report)

  • functions: 28.3% (6548 of 23146 functions)
  • lines: 47.0% (46238 of 98480 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
3695a1e at 2024-04-29T06:47:49.437Z :recycle:

@arpad-m arpad-m merged commit 0de1e1d into release Apr 29, 2024
107 checks passed
@arpad-m arpad-m deleted the rc/2024-04-29 branch April 29, 2024 13:09
@danieltprice
Copy link
Contributor

Reviewed for changelog.

1 similar comment
@danieltprice
Copy link
Contributor

Reviewed for changelog.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet