
pageserver management API doesn't poll request handlers to completion if client hangs up [P:1] [S:1] #3478

Closed
problame opened this issue Jan 30, 2023 · 8 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug), triaged (bugs that were already triaged)

@problame (Contributor) commented Jan 30, 2023

Steps to reproduce

  1. Modify a management API endpoint to tokio::time::sleep().await for 10 seconds, and log something before and after.
  2. Make a request to that endpoint using curl.
  3. Ctrl-C curl

Expected result

Log output:

T+00 the message you logged before sleep
T+10 the message you logged after sleep

Actual result

T+00 the message you logged before sleep
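The missing second log line is exactly what dropping a future looks like: once the client hangs up, the webserver stops polling the handler's future and drops it, so the code after the await point never runs. A minimal std-only sketch of that mechanism (the hand-rolled waker and `YieldOnce` stand in for the real runtime and `tokio::time::sleep`; all names are illustrative):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// A no-op waker, just enough to poll a future by hand.
fn noop_waker() -> Waker {
    fn clone(_: *const ()) -> RawWaker {
        RawWaker::new(std::ptr::null(), &VTABLE)
    }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &VTABLE)) }
}

// Yields once before completing; stands in for `tokio::time::sleep`.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

// Poll the "handler" once, then drop it, like the webserver does when the
// client hangs up mid-request.
fn run_and_drop() -> Vec<&'static str> {
    let log = Arc::new(Mutex::new(Vec::new()));
    let log2 = log.clone();
    let mut fut = Box::pin(async move {
        log2.lock().unwrap().push("before sleep");
        YieldOnce(false).await; // the handler suspends here
        log2.lock().unwrap().push("after sleep");
    });
    let waker = noop_waker();
    let mut cx = Context::from_waker(&waker);
    assert_eq!(fut.as_mut().poll(&mut cx), Poll::Pending);
    drop(fut); // client hung up: the future is dropped mid-await
    Arc::try_unwrap(log).unwrap().into_inner().unwrap()
}

fn main() {
    // "after sleep" is never logged: the code after the await point never ran.
    assert_eq!(run_and_drop(), vec!["before sleep"]);
}
```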

Environment

All


Why Is This A Problem?

Some management API handlers are not cancel-safe. In fact, I suspect most of them aren't.
One example is /detach.
If the client hangs up after we've transitioned the tenant to Stopping, but before mgr::detach_tenant returns, the following happens:

  • the detach operation isn't finished
  • the tenant remains in Stopping state
  • subsequent detach operations fail because we fail them if the tenant is Stopping (rightly so)
  • after pageserver restart, the tenant will come up as attached again
    • In the worst case, this can result in split brain. Right now this is not an issue, but once we add automatic tenant relocation, it will be (cc @LizardWizzard )
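A schematic of the stuck-in-Stopping scenario described above (hypothetical types; the real pageserver state machine has more states and transitions):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum TenantState {
    Active,
    Stopping,
    Detached,
}

struct Tenant {
    state: TenantState,
}

// Illustrative two-step detach: transition to Stopping, do the work, then Detached.
fn detach(t: &mut Tenant) -> Result<(), &'static str> {
    match t.state {
        // Rejecting detach for a Stopping tenant is correct in itself...
        TenantState::Stopping => Err("tenant is Stopping"),
        TenantState::Detached => Err("tenant is already detached"),
        TenantState::Active => {
            t.state = TenantState::Stopping;
            // ...but if the request future is dropped right here, the tenant
            // stays Stopping forever and every retry hits the arm above.
            t.state = TenantState::Detached;
            Ok(())
        }
    }
}

fn main() {
    // Simulate a client hang-up after only the first transition ran:
    let mut stuck = Tenant { state: TenantState::Stopping };
    assert_eq!(detach(&mut stuck), Err("tenant is Stopping"));
    // A detach that is polled to completion works fine:
    let mut ok = Tenant { state: TenantState::Active };
    assert_eq!(detach(&mut ok), Ok(()));
}
```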

Solution

Spawn all management API tasks as task_mgr tasks so that they run to completion, even if the webserver stops polling the futures.

The implementation should create some kind of middleware to do this, and that middleware should also pass a RequestContext to the task. Once RequestContext learns cancellation, the middleware should cancel the RequestContext when the request-handling future (not the task that is spawned!) is dropped.

If this issue is worked on before RequestContext supports cancellation, use a CancellationToken instead.
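A thread-based sketch of that shape (the actual fix would use `tokio::spawn` plus a `CancellationToken`; here `std::thread` and an `AtomicBool` stand in, and all names are illustrative):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Spawn the handler body as its own unit of work and hand back a join handle
// plus a cancellation flag. The work runs to completion even if nobody joins.
fn handle_request<F>(work: F) -> (thread::JoinHandle<String>, Arc<AtomicBool>)
where
    F: FnOnce(&AtomicBool) -> String + Send + 'static,
{
    let cancel = Arc::new(AtomicBool::new(false));
    let c = cancel.clone();
    (thread::spawn(move || work(&c)), cancel)
}

fn main() {
    let (handle, cancel) = handle_request(|cancel| {
        // A cooperative handler may check the flag to finish early, but it
        // is never silently abandoned mid-way.
        if cancel.load(Ordering::Relaxed) {
            return "cancelled early".to_string();
        }
        "detach completed".to_string()
    });
    // "Client hangs up": signal cancellation, but still let the work finish.
    cancel.store(true, Ordering::Relaxed);
    let outcome = handle.join().unwrap();
    assert!(outcome == "detach completed" || outcome == "cancelled early");
}
```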

Consider reducing the control-plane timeout once this issue is fixed: https://github.com/neondatabase/cloud/commit/ff390331f9bdb80d7120d30f8d6d1ea4ee588f37. Maybe some other tweaks will be needed too.

@LizardWizzard (Contributor)

Hmm, yeah, the whole thing doesn't look right to me. This is definitely a problem.

There are several problems with detach. Initially I thought about it in a way that it shouldn't block for a long time, and should be resumable. I.e., when it fails on something, it is safe to retry the request, and the retry continues from the thing that failed, not from the beginning of the operation. So it may make sense to make it cancel-safe. Detach also should work correctly through a power cycle. (This should already be solved with the mark file.)

From the other side, we can go to a completely async model: let detach return 202 and let the client poll it until completion. That will make error reporting harder, because we need to keep failed operations in memory to give them back during the poll phase. The client will also have to handle other states, e.g. if the pageserver was restarted while detach was in progress.
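That 202-style flow could look roughly like this (hypothetical names; a real implementation would also have to keep status across restarts, as noted above):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum DetachStatus {
    InProgress,
    Done,
}

// The handler only records the request and answers immediately; a background
// step does the actual work, and the client polls until a terminal status.
struct Detacher {
    status: Option<DetachStatus>,
}

impl Detacher {
    fn request_detach(&mut self) -> u16 {
        if self.status.is_none() {
            self.status = Some(DetachStatus::InProgress);
        }
        202 // Accepted: the work continues in the background
    }

    fn poll_status(&self) -> Option<DetachStatus> {
        self.status
    }

    // Stand-in for the background task completing the detach.
    fn background_step(&mut self) {
        self.status = Some(DetachStatus::Done);
    }
}

fn main() {
    let mut d = Detacher { status: None };
    assert_eq!(d.request_detach(), 202);
    assert_eq!(d.poll_status(), Some(DetachStatus::InProgress));
    d.background_step();
    assert_eq!(d.poll_status(), Some(DetachStatus::Done));
}
```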

@neondatabase-bot neondatabase-bot bot added this to the 2023/03 milestone Feb 19, 2023
@problame problame changed the title pageserver management API doesn't poll request handlers to completion if client hangs up pageserver management API doesn't poll request handlers to completion if client hangs up [P:1] [S:1] Feb 20, 2023
@shanyp shanyp added the triaged bugs that were already triaged label Mar 23, 2023
@LizardWizzard (Contributor)

Thoughts: see how this relates to changing the HTTP framework; think about using spawn for each connection handler. Beware of potential concurrency issues, e.g. when a retry happens while the previous call hasn't finished.

@koivunej (Contributor) commented Apr 7, 2023

The title is wrong; it's a feature, not a bug, that this happens: if the default were the other way around, there would be no way to get the current behavior.

I thought we had already spawned a task for detach, for example, but we don't, and this is causing some issues in prod: https://neondb.slack.com/archives/C03H1K0PGKH/p1680849458972369. For at least one tenant, upload task shutdown took that long.

@ololobus (Member) commented Apr 7, 2023

Thoughts: see how this relates to changing the HTTP framework; think about using spawn for each connection handler.

If I got it correctly: if you spawn a new thread and respond with 200 OK to the caller, there is still a chance that the pageserver restarts before the spawned thread finishes its job, but the caller (control plane) won't retry the call in that case because it already got an OK.

Another point to consider: the handler could only persistently set a flag that detach was requested and process it in the background. The caller would then need to re-check that the tenant was actually fully detached.

@LizardWizzard (Contributor)

There is still a chance that the pageserver restarts before the spawned thread finishes its job, but the caller (control plane) won't retry the call in that case because it already got an OK.

Obviously it should return 202 Accepted or something like that, so it would be polled again.

The original detach idea is that it should be cancel-safe, i.e. a retry request should be able to continue a half-finished operation.

@problame (Contributor, Author)

We can run the detach in a task_mgr task yet still keep the request blocking, e.g. by communicating the detach result to the request handler through a oneshot channel.

We could also do this systematically in the RequestSpan type.
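The oneshot idea, sketched with threads and `std::sync::mpsc` standing in for task_mgr and `tokio::sync::oneshot` (illustrative only):

```rust
use std::sync::mpsc;
use std::thread;

// Run "detach" as a spawned unit of work, but keep the request blocking by
// waiting on the result channel. Dropping the receiver would abandon the
// *wait*, not the work itself.
fn spawn_detach() -> Result<(), String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let result: Result<(), String> = Ok(()); // pretend the detach succeeded
        // If the requester is gone, send fails, but the work already ran.
        let _ = tx.send(result);
    });
    rx.recv().expect("worker hung up")
}

fn main() {
    assert_eq!(spawn_detach(), Ok(()));
}
```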

This doesn't solve the underlying issue of detach being not cancel & crash-safe.
But, in practice, it will make the issue much less likely to occur.
So, it might be a good interim mitigation.

In today's pageserver meeting we decided that @LizardWizzard will scope out what it takes to make detach cancel- and crash-safe. Depending on what comes out of that, we should act on my proposal in the previous paragraph.
(Do we have an issue for that?)

In general, I think our code is not written with cancel-by-drop-safety in mind, so, the default should be that request handlers are run to completion in a task, even if the client hangs up.
Request handlers should check for cancellation (mechanism TBD) so that they complete faster if the client hangs up.
That's what this issue was supposed to be about, and I don't get why @koivunej thinks it's ill-named.

@koivunej (Contributor) commented May 8, 2023

Oops, it seems I have failed to post a reply here, most likely twice now.

I understood the title I objected to as "Rust is at fault with cancellable futures", from an earlier Slack discussion. My PR #4159 is very much related to this topic.

I guess in the past comment, which I didn't post or posted elsewhere, I was thinking that we have two choices:

  • return busy error code
  • coalesce the request handling

With #4159's approach, we would get cancellation-free futures by default, which we could then modify to be cancellable. Of course, we should still handle crash safety, or at least consider it. I'm struggling to come up with a good reason to work through each crash-safety concern individually, because we could just have a persistent log of all accepted operations and, upon recovery, apply the changes locally, reducing the crash-safety surface to just the log.

I don't mean that the log would ever be used to roll back operations, just to finish the started ones.
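The operation-log idea could look roughly like this (an in-memory stand-in; a real log would be persistent and crash-safe on its own):

```rust
// Accepted operations are appended to a log; completion is recorded too.
// On recovery, any accepted-but-incomplete operation is finished, so only
// the log itself needs to be crash-safe.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Detach { tenant_id: u32 },
}

struct OpLog {
    accepted: Vec<Op>,
    completed: Vec<Op>,
}

impl OpLog {
    fn accept(&mut self, op: Op) {
        self.accepted.push(op);
    }

    fn complete(&mut self, op: &Op) {
        self.completed.push(op.clone());
    }

    // On startup, return every operation that still has to be re-applied.
    fn recover(&self) -> Vec<Op> {
        self.accepted
            .iter()
            .filter(|op| !self.completed.contains(op))
            .cloned()
            .collect()
    }
}

fn main() {
    let mut log = OpLog { accepted: vec![], completed: vec![] };
    log.accept(Op::Detach { tenant_id: 1 });
    // Crash before completion; after restart the detach is re-applied:
    assert_eq!(log.recover(), vec![Op::Detach { tenant_id: 1 }]);
    // Once re-applied and marked complete, recovery has nothing to do:
    log.complete(&Op::Detach { tenant_id: 1 });
    assert!(log.recover().is_empty());
}
```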

hlinnaka added a commit that referenced this issue May 23, 2023
We now spawn a new task for every HTTP request, and wait on the
JoinHandle. If Hyper drops the Future, the spawned task will keep
running. This protects the rest of the pageserver code from unexpected
async cancellations.

This creates a CancellationToken for each request and passes it to the
handler function. If the HTTP request is dropped by the client, the
CancellationToken is signaled. None of the handler functions make use
of the CancellationToken currently, but now they could.

The CancellationToken arguments also work like documentation. When
you're looking at a function signature and you see that it takes a
CancellationToken as argument, it's a nice hint that the function
might run for a long time, and won't be async cancelled. The default
assumption in the pageserver is now that async functions are not
cancellation-safe anyway, unless explicitly marked as such, but this is
a nice extra reminder.

Spawning a task for each request is OK from a performance point of
view because spawning is very cheap in Tokio, and none of our HTTP
requests are very performance critical anyway.

Fixes issue #3478
hlinnaka added a commit that referenced this issue May 23, 2023
If the timeline is already being deleted, return an error. We used to
notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it
earlier. Add an explicit flag for the deletion.

Note: This doesn't do anything about the async cancellation problem
(github issue #3478): if the original HTTP request is dropped because
the client disconnected, the timeline deletion stops halfway through
the operation. That needs to be fixed too, but that's a separate
story.
hlinnaka added a commit that referenced this issue May 27, 2023
If the timeline is already being deleted, return an error. We used to
notice the duplicate request and error out in
persist_index_part_with_deleted_flag(), but it's better to detect it
earlier. Add an explicit lock for the deletion.

Note: This doesn't do anything about the async cancellation problem
(github issue #3478): if the original HTTP request is dropped because the
client disconnected, the timeline deletion stops halfway through the
operation. That needs to be fixed too, but that's a separate story.

(This is a simpler replacement for PR #4194. I'm also working on the
cancellation shielding, see PR #4314.)
@hlinnaka (Contributor)

This was fixed in PR #4314.
