Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compute <-> pageserver protocol version 2 #7377

Merged
merged 5 commits into from
Apr 25, 2024
Merged

Conversation

hlinnaka
Copy link
Contributor

@hlinnaka hlinnaka commented Apr 14, 2024

This is the same functionality as #6718, with a bunch of cleanups and refactoring based on comments there. Main changes:

  • The new protocol version is now indicated by using pagestream_v2 command when the pagestream is started. The message tags of individual request are unchanged and are the same in both versions (per Send LSN range in getpage request #6718 (comment))
  • The requests now carry two LSNs: request_lsn and not_modified_since. This is more clear than the concept of a horizon from that previous PR. There are no magic values, a request must provide a valid LSN for both. Lsn::MAX is valid for request_lsn as long as `not_modified_since is a valid value, however. That can be used to fetch the latest version of the page. It's used by some tests, but not by the actual compute. The actual compute always passes a valid LSN for both.
  • This PR builds on top of Refactor updating relation size cache on reads #7376. It was awkward to push down both LSNs to all the functions in pgdatadir_mapping.rs, and the old logic for when the relsize cache could be updated depended on the 'latest' flag, so it was easier to refactor it away first.

Copy link

github-actions bot commented Apr 14, 2024

2820 tests run: 2699 passed, 0 failed, 121 skipped (full report)


Flaky tests (4)

Postgres 16

  • test_compute_pageserver_connection_stress: release
  • test_vm_bit_clear_on_heap_lock: debug

Postgres 15

  • test_compute_pageserver_connection_stress: debug
  • test_partial_evict_tenant[relative_equal]: release

Code coverage* (full report)

  • functions: 28.1% (6482 of 23061 functions)
  • lines: 46.9% (46065 of 98202 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
36c2a88 at 2024-04-25T17:53:08.701Z :recycle:

@knizhnik knizhnik mentioned this pull request Apr 16, 2024
5 tasks
@knizhnik knizhnik self-requested a review April 18, 2024 07:27
@hlinnaka hlinnaka changed the title Refactor how the request LSNs are tracked in compute Compute <-> pageserver protocol version 2 Apr 18, 2024
@hlinnaka hlinnaka changed the base branch from main to rel_size_cache-update April 18, 2024 21:39
@hlinnaka hlinnaka marked this pull request as ready for review April 18, 2024 21:40
@hlinnaka hlinnaka requested review from a team as code owners April 18, 2024 21:40
@hlinnaka hlinnaka requested review from arssher and koivunej and removed request for a team April 18, 2024 21:40
@hlinnaka
Copy link
Contributor Author

TODOs:

@hlinnaka hlinnaka force-pushed the getpage_lsn_range-client branch 4 times, most recently from 8afdacd to f6c9fa8 Compare April 19, 2024 07:52
hlinnaka added a commit that referenced this pull request Apr 19, 2024
As noted in the comment, the craft_internal() function fails if the
inserted WAL happens to land at page boundary. I bumped into that with
PR #7377; it changed the arguments of a few SQL functions in
neon_test_utils extension, which changed the WAL positions slightly,
and caused a test failure.
hlinnaka added a commit that referenced this pull request Apr 19, 2024
As noted in the comment, the craft_internal() function fails if the
inserted WAL happens to land at page boundary. I bumped into that with
PR #7377; it changed the arguments of a few SQL functions in
neon_test_utils extension, which changed the WAL positions slightly,
and caused a test failure.
hlinnaka added a commit that referenced this pull request Apr 19, 2024
As noted in the comment, the craft_internal() function fails if the
inserted WAL happens to land at page boundary. I bumped into that with
PR #7377; it changed the arguments of a few SQL functions in
neon_test_utils extension, which changed the WAL positions slightly,
and caused a test failure.
hlinnaka added a commit that referenced this pull request Apr 19, 2024
As noted in the comment, the craft_internal() function fails if the
inserted WAL happens to land at page boundary. I bumped into that with
PR #7377; it changed the arguments of a few SQL functions in
neon_test_utils extension, which changed the WAL positions slightly,
and caused a test failure.
@hlinnaka
Copy link
Contributor Author

Added one more test, to demonstrate the original error: 34f75f5.

@hlinnaka
Copy link
Contributor Author

hlinnaka commented Apr 24, 2024

Replying to @problame's comment on the original PR:

Please add a test case that demonstrates the problem / regress-tests your solution.

added: 34f75f5

Also, you are making a breaking change to the protocol.

Please provide a plan to orchestrate the rollout.

  1. Push this PR
  2. Wait for the new storage version to be rolled out everywhere, and for ~ one week to pass so that we are confident that we will not need to roll it back anymore
  3. Change the neon.protocol_version=2 compute option on one region
  4. If no issues, make neon.protocol_version=2 default so that it gets rolled out everywhere

Removing the old version:

  1. Wait for all old computes that are still using protocol version 1 to expire
  2. Remove any neon.protocol_version from all endpoint, project, and region configs.
  3. Remove the code for protocol version 1 from storage and the neon extension.

A Slack channel (e..g, #proj-page_service-lsn-range-requests ) is probably most appropriate, as this touches multiple components.

We are coordinating this on the #escalation-2024-04-12-tried-to-request-a-page-version-that-was-garbage-coll channel now.

@hlinnaka hlinnaka requested a review from problame April 24, 2024 13:48
hlinnaka added a commit that referenced this pull request Apr 24, 2024
The 'latest' argument was passed to the functions in
pgdatadir_mapping.rs to know when they can update the relsize
cache. Commit e69ff3f changed how the relsize cache is updated,
making the 'latest' argument unused.
hlinnaka added a commit that referenced this pull request Apr 24, 2024
In the old protocol version, the client sent with each request:

- latest: bool. If true, the client requested the latest page
  version, and the 'lsn' was just a hint of when the page was last
  modified
- lsn: Lsn, the page version to return

This protocol didn't allow requesting a page at a particular
non-latest LSN and *also* sending a hint on when the page was last
modified. That put a read only compute into an awkward position where
it had to either request each page at the replay-LSN, which could be
very close to the last LSN written in the primary and therefore
require the pageserver to wait for it to arrive, or an older LSN which
could already be garbage collected in the pageserver, resulting in an
error. The new protocol version fixes that by allowing a read only
compute to send both LSNs.

To use the new protocol version, use "pagestream_v2" command instead
of just "pagestream". The old protocol version is still supported, for
compatibility with old computes (and in fact there is no client
support yet, it is added by the next commit).
hlinnaka added a commit that referenced this pull request Apr 24, 2024
Instead of thinking in terms of 'latest' and 'lsn' of the request,
each request has two LSNs: the request LSN and 'not_modified_since'
LSN. The request is nominally made at the request LSN, that determines
what page version we want to see. But as a hint, we also include
'not_modified_since'. It tells the pageserver that the page has not
been modified since that LSN, which allows the pageserver to skip
waiting for newer WAL to arrive, and could allow more optimizations in
the future.

Refactor the internal functions to calculate the request LSN to
calculate both LSNs.

Sending two LSNs to the pageserver requires using the new protocol
version 2. The previous commit added the server support for it, but we
still default to the old protocol for compatibility with old
pageservers. The 'neon.protocol_version' GUC can be used to use the
new protocol.

The new protocol addresses one cause of issue #6211, although you can
still get the same error if you have a standby that is lagging behind
so that the page version it needs is genuinely GC'd away.
hlinnaka added a commit that referenced this pull request Apr 24, 2024
Before PR #7377, on-demand SLRU download always used the basebackup's
LSN in the SLRU download, but that LSN might get garbage-collected away
in the pageserver. We should request the latest LSN, like with GetPage
requests, with the LSN just indicating that we know that the page hasn't
been changed since the LSN (since the basebackup in this case).

Add test to demonstrate the problem. Without the fix, it fails with
"tried to request a page version that was garbage collected" error from
the pageserver.

I wrote this test as part of earlier PR #6693, but that fell through
the cracks and was never applied. PR #7377 superseded the fix from
that older PR, but the test is still valid.
Copy link
Member

@petuhovskiy petuhovskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed on behalf of safekeepers group. Someone with deeper pageserver knowledge still should review to check for potential bugs.

Copy link
Contributor

@VladLazar VladLazar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks for the informative commit messages.

pageserver/src/page_service.rs Show resolved Hide resolved
The 'latest' argument was passed to the functions in
pgdatadir_mapping.rs to know when they can update the relsize
cache. Commit e69ff3f changed how the relsize cache is updated,
making the 'latest' argument unused.
In the old protocol version, the client sent with each request:

- latest: bool. If true, the client requested the latest page
  version, and the 'lsn' was just a hint of when the page was last
  modified
- lsn: Lsn, the page version to return

This protocol didn't allow requesting a page at a particular
non-latest LSN and *also* sending a hint on when the page was last
modified. That put a read only compute into an awkward position where
it had to either request each page at the replay-LSN, which could be
very close to the last LSN written in the primary and therefore
require the pageserver to wait for it to arrive, or an older LSN which
could already be garbage collected in the pageserver, resulting in an
error. The new protocol version fixes that by allowing a read only
compute to send both LSNs.

To use the new protocol version, use "pagestream_v2" command instead
of just "pagestream". The old protocol version is still supported, for
compatibility with old computes (and in fact there is no client
support yet, it is added by the next commit).
Instead of thinking in terms of 'latest' and 'lsn' of the request,
each request has two LSNs: the request LSN and 'not_modified_since'
LSN. The request is nominally made at the request LSN, that determines
what page version we want to see. But as a hint, we also include
'not_modified_since'. It tells the pageserver that the page has not
been modified since that LSN, which allows the pageserver to skip
waiting for newer WAL to arrive, and could allow more optimizations in
the future.

Refactor the internal functions to calculate the request LSN to
calculate both LSNs.

Sending two LSNs to the pageserver requires using the new protocol
version 2. The previous commit added the server support for it, but we
still default to the old protocol for compatibility with old
pageservers. The 'neon.protocol_version' GUC can be used to use the
new protocol.

The new protocol addresses one cause of issue #6211, although you can
still get the same error if you have a standby that is lagging behind
so that the page version it needs is genuinely GC'd away.
Before PR #7377, on-demand SLRU download always used the basebackup's
LSN in the SLRU download, but that LSN might get garbage-collected away
in the pageserver. We should request the latest LSN, like with GetPage
requests, with the LSN just indicating that we know that the page hasn't
been changed since the LSN (since the basebackup in this case).

Add test to demonstrate the problem. Without the fix, it fails with
"tried to request a page version that was garbage collected" error from
the pageserver.

I wrote this test as part of earlier PR #6693, but that fell through
the cracks and was never applied. PR #7377 superseded the fix from
that older PR, but the test is still valid.
@hlinnaka hlinnaka enabled auto-merge (rebase) April 25, 2024 17:44
@hlinnaka hlinnaka merged commit ca8fca0 into main Apr 25, 2024
53 checks passed
@hlinnaka hlinnaka deleted the getpage_lsn_range-client branch April 25, 2024 17:45
hlinnaka added a commit that referenced this pull request Apr 25, 2024
The 'latest' argument was passed to the functions in
pgdatadir_mapping.rs to know when they can update the relsize
cache. Commit e69ff3f changed how the relsize cache is updated,
making the 'latest' argument unused.
hlinnaka added a commit that referenced this pull request Apr 25, 2024
In the old protocol version, the client sent with each request:

- latest: bool. If true, the client requested the latest page
  version, and the 'lsn' was just a hint of when the page was last
  modified
- lsn: Lsn, the page version to return

This protocol didn't allow requesting a page at a particular
non-latest LSN and *also* sending a hint on when the page was last
modified. That put a read only compute into an awkward position where
it had to either request each page at the replay-LSN, which could be
very close to the last LSN written in the primary and therefore
require the pageserver to wait for it to arrive, or an older LSN which
could already be garbage collected in the pageserver, resulting in an
error. The new protocol version fixes that by allowing a read only
compute to send both LSNs.

To use the new protocol version, use "pagestream_v2" command instead
of just "pagestream". The old protocol version is still supported, for
compatibility with old computes (and in fact there is no client
support yet, it is added by the next commit).
hlinnaka added a commit that referenced this pull request Apr 25, 2024
Instead of thinking in terms of 'latest' and 'lsn' of the request,
each request has two LSNs: the request LSN and 'not_modified_since'
LSN. The request is nominally made at the request LSN, that determines
what page version we want to see. But as a hint, we also include
'not_modified_since'. It tells the pageserver that the page has not
been modified since that LSN, which allows the pageserver to skip
waiting for newer WAL to arrive, and could allow more optimizations in
the future.

Refactor the internal functions to calculate the request LSN to
calculate both LSNs.

Sending two LSNs to the pageserver requires using the new protocol
version 2. The previous commit added the server support for it, but we
still default to the old protocol for compatibility with old
pageservers. The 'neon.protocol_version' GUC can be used to use the
new protocol.

The new protocol addresses one cause of issue #6211, although you can
still get the same error if you have a standby that is lagging behind
so that the page version it needs is genuinely GC'd away.
hlinnaka added a commit that referenced this pull request Apr 25, 2024
Before PR #7377, on-demand SLRU download always used the basebackup's
LSN in the SLRU download, but that LSN might get garbage-collected away
in the pageserver. We should request the latest LSN, like with GetPage
requests, with the LSN just indicating that we know that the page hasn't
been changed since the LSN (since the basebackup in this case).

Add test to demonstrate the problem. Without the fix, it fails with
"tried to request a page version that was garbage collected" error from
the pageserver.

I wrote this test as part of earlier PR #6693, but that fell through
the cracks and was never applied. PR #7377 superseded the fix from
that older PR, but the test is still valid.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants