Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(server): prevent intermittent hanging when requesting a tarball worker #7041

Merged
merged 1 commit into from
Sep 5, 2023

Conversation

gluxon
Copy link
Member

@gluxon gluxon commented Sep 4, 2023

Problem

Fixes #7038. When running pnpm server start, the tarball fetcher worker pool (introduced in #6850) is immediately marked as "finishing" by this statement.

await global.finishWorkers?.()

The server continues running and handles requests after this point, which causes a few problems:

  1. It results in a degraded state where the worker pool will think it's exiting and therefore never reuses workers. When the server is asked to do work it'll create a new worker, but immediately discard it.
  2. In addition to the degraded state, there's an edge case in the worker pool finishing state's implementation that can cause hanging.

The exact conditions of the hanging were observed in #7038 (comment). In summary, the hanging happens if a worker is requested in between the time that a different worker has been marked for cleanup, but before the exit event for that other worker has fired.

It's a bit surprising to me that the worker pool will accept new work after finishAsync is called. I'm not sure if that's intentional.

Reproducing the hanging

The hanging can be reproduced consistently by forcing some contrived settings:

  1. Adding a 10s sleep to exit event handler's logic. WorkerPool.ts#L209-L214
  2. Forcibly setting the worker pool max size to 1. This is the value CI on this repo evaluates to.
    const maxWorkers = Math.max(2, os.cpus().length - Math.abs(process.env.PNPM_WORKERS ? parseInt(process.env.PNPM_WORKERS) : 0)) - 1

After performing above, running pnpm server start and pnpm install is-positive@1.0.0 will reproduce the hanging.

Although this is fairly contrived, this was happening very consistently on pnpm's CI jobs.

Changes

The main idea behind this PR is to delay the finishAsync call until the server is exiting through Ctrl-C or a POST /exit request.

await global.finishWorkers?.()

I'm not sure if this is the best approach. I'd welcome suggestions around an alternative fix.

Alternatives

We could also set up a check around the global.finishWorkers?.() call and avoid running it if the current process is driving a pnpm server. This felt a bit more hacky to me though. To check if the current process is a pnpm server, information needs to be propagated back up the command handler chain or some kind of sideways information passing. In theory there are other situations where we'd want to avoid global.finishWorkers?.() from running as well.

@gluxon
Copy link
Member Author

gluxon commented Sep 4, 2023

One problem the current approach is revealing is that it can sometimes take 10s for the worker pool to shut down. This is after the server has finished handling an installation request.

❯ pnpm server start
fetching_started registry.npmjs.org/is-positive/3.1.0
Got request to stop the server
Server stopped
Shut down workers in: 10009.397833999246

This causes the 5s graceful shutdown timer to kill the server.

❯ pnpm install is-positive@3.1.0 && pnpm server stop
A store server is running. All store manipulations are delegated to it.
Already up to date
Progress: resolved 1, reused 0, downloaded 0, added 0, done
Done in 1.1s
 WARN  Graceful shutdown failed
Server process terminated

I wonder if we should bump up the 5s. I think it makes sense to keep the worker pool around and avoid shutting down it down until the server exits, but it's surprising that it takes ~10s regularly.

@gluxon gluxon marked this pull request as ready for review September 4, 2023 05:31
@gluxon gluxon requested a review from zkochan as a code owner September 4, 2023 05:31
// Intentionally avoid returning control back to the caller until the server
// exits. This defers cleanup operations that should not run before the server
// finishes.
await server.waitForClose
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main pnpm command handler will wait for this promise to resolve here:

result = await result

Everything after this main.ts line looks safe to run on server close. I could use another reviewer to gut check this change in behavior though.

@gluxon gluxon changed the title fix(server): prevent an intermittent deadlock when requesting a tarball worker fix(server): prevent an intermittent hanging when requesting a tarball worker Sep 4, 2023
@gluxon gluxon changed the title fix(server): prevent an intermittent hanging when requesting a tarball worker fix(server): prevent intermittent hanging when requesting a tarball worker Sep 4, 2023
@zkochan zkochan merged commit 548768e into pnpm:main Sep 5, 2023
8 checks passed
@gluxon gluxon deleted the server-workers-finishing branch September 9, 2023 20:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CI tests hanging on pool.checkoutWorkerAsync(true)
2 participants