
pageserver management API doesn't poll request handlers to completion if client hangs up [P:1] [S:1] #3478

Closed
problame opened this issue Jan 30, 2023 · 8 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

@problame (Contributor) commented Jan 30, 2023

Steps to reproduce

  1. Modify a management API endpoint to tokio::time::sleep().await for 10 seconds, and log something before and after.
  2. Make a request to that endpoint using curl.
  3. Ctrl-C curl

Expected result

Log output:

T+00 the message you logged before sleep
T+10 the message you logged after sleep

Actual result

T+00 the message you logged before sleep
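The missing second log line is exactly what dropping a future looks like: once the client hangs up, the webserver stops polling the handler's future and drops it, so the code after the await point never runs. A minimal std-only sketch of that mechanism (the hand-rolled waker and `YieldOnce` stand in for the real runtime and `tokio::time::sleep`; all names are illustrative):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A no-op waker, just enough to poll a future by hand.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Yields once before completing; stands in for `tokio::time::sleep`.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

// Poll the "handler" once, then drop it, like the webserver does when the
// client hangs up mid-request.
fn run_and_drop() -> Vec<&'static str> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let log2 = log.clone();
    let mut fut = Box::pin(async move {
        log2.lock().unwrap().push("before sleep");
        YieldOnce(false).await; // the handler suspends here
        log2.lock().unwrap().push("after sleep");
    });
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Pending);
    drop(fut); // client hung up: the future is dropped mid-await
    Arc::try_unwrap(log).unwrap().into_inner().unwrap()
}

fn main() {
    // "after sleep" is never logged: the code after the await point never ran.
    assert_eq!(run_and_drop(), vec!["before sleep"]);
}
```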

Environment

All


Why Is This A Problem?

Some management API handlers are not cancel-safe. In fact, I suspect most of them aren't.
One example is /detach.
If the client hangs up after we've transitioned the tenant to Stopping, but before mgr::detach_tenant returns, the following happens:

  • the detach operation isn't finished
  • the tenant remains in Stopping state
  • subsequent detach operations fail because we fail them if the tenant is Stopping (rightly so)
  • after pageserver restart, the tenant will come up as attached again
    • In the worst case, this can result in split brain. Right now this is not an issue, but once we add automatic tenant relocation, it will be (cc @LizardWizzard )
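A schematic of the stuck-in-Stopping scenario described above (hypothetical types; the real pageserver state machine has more states and transitions):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum TenantState {
    Active,
    Stopping,
    Detached,
}

struct Tenant {
    state: TenantState,
}

// Illustrative two-step detach: transition to Stopping, do the work, then Detached.
fn detach(t: &mut Tenant) -> Result<(), &'static str> {
    match t.state {
        // Rejecting detach for a Stopping tenant is correct in itself...
        TenantState::Stopping => Err("tenant is Stopping"),
        TenantState::Detached => Err("tenant is already detached"),
        TenantState::Active => {
            t.state = TenantState::Stopping;
            // ...but if the request future is dropped right here, the tenant
            // stays Stopping forever and every retry hits the arm above.
            t.state = TenantState::Detached;
            Ok(())
        }
    }
}

fn main() {
    // Simulate a client hang-up after only the first transition ran:
    let mut stuck = Tenant { state: TenantState::Stopping };
    assert_eq!(detach(&mut stuck), Err("tenant is Stopping"));
    // A detach that is polled to completion works fine:
    let mut ok = Tenant { state: TenantState::Active };
    assert_eq!(detach(&mut ok), Ok(()));
}
```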

Solution

Spawn all management API tasks as task_mgr tasks so that they run to completion, even if the webserver stops polling the futures.

The implementation should create some kind of middleware to do this, and that middleware should also pass a RequestContext to the task. Once RequestContext learns cancellation, the middleware should cancel the RequestContext when the request-handling future (not the task that is spawned!) is dropped.

If this issue is worked on before RequestContext supports cancellation, use a CancellationToken instead.
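A thread-based sketch of that shape (the actual fix would use `tokio::spawn` plus a `CancellationToken`; here `std::thread` and an `AtomicBool` stand in, and all names are illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Spawn the handler body as its own unit of work and hand back a join handle
// plus a cancellation flag. The work runs to completion even if nobody joins.
fn handle_request<F>(work: F) -> (thread::JoinHandle<String>, Arc<AtomicBool>)
where
    F: FnOnce(&AtomicBool) -> String + Send + 'static,
{
    let cancel = Arc::new(AtomicBool::new(false));
    let c = cancel.clone();
    (thread::spawn(move || work(&c)), cancel)
}

fn main() {
    let (handle, cancel) = handle_request(|cancel| {
        // A cooperative handler may check the flag to finish early, but it
        // is never silently abandoned mid-way.
        if cancel.load(Ordering::Relaxed) {
            return "cancelled early".to_string();
        }
        "detach completed".to_string()
    });
    // "Client hangs up": signal cancellation, but still let the work finish.
    cancel.store(true, Ordering::Relaxed);
    let outcome = handle.join().unwrap();
    assert!(outcome == "detach completed" || outcome == "cancelled early");
}
```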

Consider reducing the control-plane timeout once this issue is fixed: https://github.com/neondatabase/cloud/commit/ff390331f9bdb80d7120d30f8d6d1ea4ee588f37. Maybe some other tweaks will be needed too.

@LizardWizzard (Contributor)

Hmm, yeah, the whole thing doesn't look right to me. This is definitely a problem.

There are several problems with detach. Initially I thought about it in a way that it shouldn't block for a long time, and should be resumable. I.e., when it fails on something, it is safe to retry the request, and the retry continues from the thing that failed, not from the beginning of the operation. So it may make sense to make it cancel-safe. Detach also should work correctly through a power cycle. (This should already be solved with the mark file.)

From the other side, we can go to a completely async model: let detach return 202 and let the client poll it until completion. That will make error reporting harder, because we need to keep failed operations in memory to give them back during the poll phase. The client will also have to handle other states, e.g. if the pageserver was restarted while detach was in progress.
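That 202-style flow could look roughly like this (hypothetical names; a real implementation would also have to keep status across restarts, as noted above):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum DetachStatus {
    InProgress,
    Done,
}

// The handler only records the request and answers immediately; a background
// step does the actual work, and the client polls until a terminal status.
struct Detacher {
    status: Option<DetachStatus>,
}

impl Detacher {
    fn request_detach(&mut self) -> u16 {
        if self.status.is_none() {
            self.status = Some(DetachStatus::InProgress);
        }
        202 // Accepted: the work continues in the background
    }

    fn poll_status(&self) -> Option<DetachStatus> {
        self.status
    }

    // Stand-in for the background task completing the detach.
    fn background_step(&mut self) {
        self.status = Some(DetachStatus::Done);
    }
}

fn main() {
    let mut d = Detacher { status: None };
    assert_eq!(d.request_detach(), 202);
    assert_eq!(d.poll_status(), Some(DetachStatus::InProgress));
    d.background_step();
    assert_eq!(d.poll_status(), Some(DetachStatus::Done));
}
```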

@neondatabase-bot neondatabase-bot bot added this to the 2023/03 milestone Feb 19, 2023
@problame problame changed the title pageserver management API doesn't poll request handlers to completion if client hangs up pageserver management API doesn't poll request handlers to completion if client hangs up [P:1] [S:1] Feb 20, 2023
@shanyp shanyp added the triaged bugs that were already triaged label Mar 23, 2023
@LizardWizzard (Contributor)

Thoughts: see how this relates to changing the HTTP framework; think about using spawn for each connection handler. Beware of potential concurrency issues, e.g. when a retry happens while the previous call hasn't finished.

@koivunej (Contributor) commented Apr 7, 2023

The title is wrong; it's a feature, not a bug, that this happens: if the default were the other way around, there would be no way to get the current behavior.

I thought we had already spawned a task for detach, for example, but we don't, and this is causing some issues in prod: https://neondb.slack.com/archives/C03H1K0PGKH/p1680849458972369. For at least one tenant, upload task shutdown took that long.

@ololobus (Member) commented Apr 7, 2023

Thoughts: see how this relates to changing the HTTP framework; think about using spawn for each connection handler.

If I got it correctly: if you spawn a new thread and respond with 200 OK to the caller, there is still a chance that the pageserver restarts before the spawned thread finishes its job, but the caller (control plane) won't retry the call in that case because it already got an OK.

Another point to consider: the handler could only persistently set a flag that detach was requested and process it in the background. The caller would then need to re-check that the tenant was actually fully detached.

@LizardWizzard (Contributor)

There is still a chance that the pageserver restarts before the spawned thread finishes its job, but the caller (control plane) won't retry the call in that case because it already got an OK.

Obviously it should return 202 Accepted or something like that, so it would be polled again.

The original detach idea is that it should be cancel-safe, i.e. a retry request should be able to continue a half-finished operation.

@problame (Contributor, Author)

We can run the detach in a task_mgr task yet still keep the request blocking, e.g. by communicating the detach result to the request handler through a oneshot channel.

We could also do this systematically in the RequestSpan type.
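The oneshot idea, sketched with threads and `std::sync::mpsc` standing in for task_mgr and `tokio::sync::oneshot` (illustrative only):

```rust
use std::sync::mpsc;
use std::thread;

// Run "detach" as a spawned unit of work, but keep the request blocking by
// waiting on the result channel. Dropping the receiver would abandon the
// *wait*, not the work itself.
fn spawn_detach() -> Result<(), String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let result: Result<(), String> = Ok(()); // pretend the detach succeeded
        // If the requester is gone, send fails, but the work already ran.
        let _ = tx.send(result);
    });
    rx.recv().expect("worker hung up")
}

fn main() {
    assert_eq!(spawn_detach(), Ok(()));
}
```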

This doesn't solve the underlying issue of detach being not cancel & crash-safe.
But, in practice, it will make the issue much less likely to occur.
So, it might be a good interim mitigation.

In today's pageserver meeting we decided that @LizardWizzard will scope out what it takes to make detach cancel- and crash-safe. Depending on what comes out of that, we should act on my proposal in the previous paragraph.
(Do we have an issue for that?)

In general, I think our code is not written with cancel-by-drop-safety in mind, so, the default should be that request handlers are run to completion in a task, even if the client hangs up.
Request handlers should check for cancellation (mechanism TBD) so that they complete faster if the client hangs up.
That's what this issue was supposed to be about, and I don't get why @koivunej thinks it's ill-named.

@koivunej (Contributor) commented May 8, 2023

Oops, it seems I have failed to post a reply here, most likely twice now.

I understood the title I objected to as "Rust is at fault with cancellable futures", from an earlier Slack discussion. My PR #4159 is very much related to this topic.

I guess in the past comment, which I didn't post or posted elsewhere, I was thinking that we have two choices:

  • return busy error code
  • coalesce the request handling

With #4159's approach, we would get cancellation-free futures by default, which we could then modify to be cancellable. Of course, we should still handle crash safety, or at least consider it. I'm struggling to come up with a good reason to work through each crash-safety concern individually, because we could just have a persistent log of all accepted operations and, upon recovery, apply the changes locally, reducing the crash-safety surface to just the log.

I don't mean that the log would ever be used to roll back operations, just to finish the started ones.
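The operation-log idea could look roughly like this (an in-memory stand-in; a real log would be persistent and crash-safe on its own):

```rust
// Accepted operations are appended to a log; completion is recorded too.
// On recovery, any accepted-but-incomplete operation is finished, so only
// the log itself needs to be crash-safe.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Detach { tenant_id: u32 },
}

struct OpLog {
    accepted: Vec<Op>,
    completed: Vec<Op>,
}

impl OpLog {
    fn accept(&mut self, op: Op) {
        self.accepted.push(op);
    }

    fn complete(&mut self, op: &Op) {
        self.completed.push(op.clone());
    }

    // On startup, return every operation that still has to be re-applied.
    fn recover(&self) -> Vec<Op> {
        self.accepted
            .iter()
            .filter(|op| !self.completed.contains(op))
            .cloned()
            .collect()
    }
}

fn main() {
    let mut log = OpLog { accepted: vec![], completed: vec![] };
    log.accept(Op::Detach { tenant_id: 1 });
    // Crash before completion; after restart the detach is re-applied:
    assert_eq!(log.recover(), vec![Op::Detach { tenant_id: 1 }]);
    // Once re-applied and marked complete, recovery has nothing to do:
    log.complete(&Op::Detach { tenant_id: 1 });
    assert!(log.recover().is_empty());
}
```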

hlinnaka added a commit that referenced this issue May 23, 2023
We now spawn a new task for every HTTP request, and wait on the
JoinHandle. If Hyper drops the Future, the spawned task will keep
running. This protects the rest of the pageserver code from unexpected
async cancellations.

This creates a CancellationToken for each request and passes it to the
handler function. If the HTTP request is dropped by the client, the
CancellationToken is signaled. None of the handler functions make use
of the CancellationToken currently, but now they could.

The CancellationToken arguments also work like documentation. When
you're looking at a function signature and you see that it takes a
CancellationToken as argument, it's a nice hint that the function
might run for a long time, and won't be async cancelled. The default
assumption in the pageserver is now that async functions are not
cancellation-safe anyway, unless explicitly marked as such, but this is
a nice extra reminder.

Spawning a task for each request is OK from a performance point of
view because spawning is very cheap in Tokio, and none of our HTTP
requests are very performance critical anyway.

Fixes issue #3478
hlinnaka added a commit that referenced this issue May 23, 2023
If the timeline is already being deleted, return an error. We used to
notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it
earlier. Add an explicit flag for the deletion.

Note: This doesn't do anything about the async cancellation problem
(github issue #3478): if the original HTTP request is dropped because
the client disconnected, the timeline deletion stops halfway through
the operation. That needs to be fixed too, but that's a separate
story.
hlinnaka added a commit that referenced this issue May 27, 2023
If the timeline is already being deleted, return an error. We used to
notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it
earlier. Add an explicit lock for the deletion.

Note: This doesn't do anything about the async cancellation problem
(github issue #3478): if the original HTTP request is dropped because the
client disconnected, the timeline deletion stops halfway through the
operation. That needs to be fixed too, but that's a separate story.

(This is a simpler replacement for PR #4194. I'm also working on the
cancellation shielding, see PR #4314.)
@hlinnaka (Contributor)

This was fixed in PR #4314.
