fix(server): prevent intermittent hanging when requesting a tarball worker #7041

gluxon · 2023-09-04T05:09:23Z

Problem

Fixes #7038. When running pnpm server start, the tarball fetcher worker pool (introduced in #6850) is immediately marked as "finishing" by this statement.

pnpm/pnpm/src/main.ts

Line 300 in 131c723

await global.finishWorkers?.()

The server continues running and handles requests after this point, which causes a few problems:

It results in a degraded state where the worker pool will think it's exiting and therefore never reuses workers. When the server is asked to do work it'll create a new worker, but immediately discard it.
In addition to the degraded state, there's an edge case in the worker pool finishing state's implementation that can cause hanging.

The exact conditions of the hanging were observed in #7038 (comment). In summary, the hanging happens if a worker is requested in between the time that a different worker has been marked for cleanup, but before the exit event for that other worker has fired.

It's a bit surprising to me that the worker pool will accept new work after finishAsync is called. I'm not sure if that's intentional.

Reproducing the hanging

The hanging can be reproduced consistently by forcing some contrived settings:

Adding a 10s sleep to exit event handler's logic. WorkerPool.ts#L209-L214
Forcibly setting the worker pool max size to 1. This is the value CI on this repo evaluates to.

pnpm/worker/src/index.ts

Line 22 in 494f875

const maxWorkers = Math.max(2, os.cpus().length - Math.abs(process.env.PNPM_WORKERS ? parseInt(process.env.PNPM_WORKERS) : 0)) - 1

After performing above, running pnpm server start and pnpm install is-positive@1.0.0 will reproduce the hanging.

Although this is fairly contrived, this was happening very consistently on pnpm's CI jobs.

Changes

The main idea behind this PR is to delay the finishAsync call until the server is exiting through Ctrl-C or a POST /exit request.

pnpm/pnpm/src/main.ts

Line 300 in 131c723

await global.finishWorkers?.()

I'm not sure if this is the best approach. I'd welcome suggestions around an alternative fix.

Alternatives

We could also set up a check around the global.finishWorkers?.() call and avoid running it if the current process is driving a pnpm server. This felt a bit more hacky to me though. To check if the current process is a pnpm server, information needs to be propagated back up the command handler chain or some kind of sideways information passing. In theory there are other situations where we'd want to avoid global.finishWorkers?.() from running as well.

gluxon · 2023-09-04T05:29:15Z

One problem the current approach is revealing is that it can sometimes take 10s for the worker pool to shut down. This is after the server has finished handling an installation request.

❯ pnpm server start
fetching_started registry.npmjs.org/is-positive/3.1.0
Got request to stop the server
Server stopped
Shut down workers in: 10009.397833999246

This causes the 5s graceful shutdown timer to kill the server.

pnpm/store/plugin-commands-server/src/stop.ts

Line 49 in 4ca53b0

await delay(5000)

❯ pnpm install is-positive@3.1.0 && pnpm server stop
A store server is running. All store manipulations are delegated to it.
Already up to date
Progress: resolved 1, reused 0, downloaded 0, added 0, done
Done in 1.1s
 WARN  Graceful shutdown failed
Server process terminated

I wonder if we should bump up the 5s. I think it makes sense to keep the worker pool around and avoid shutting down it down until the server exits, but it's surprising that it takes ~10s regularly.

gluxon · 2023-09-04T05:13:00Z

store/plugin-commands-server/src/start.ts

+  // Intentionally avoid returning control back to the caller until the server
+  // exits. This defers cleanup operations that should not run before the server
+  // finishes.
+  await server.waitForClose


The main pnpm command handler will wait for this promise to resolve here:

pnpm/pnpm/src/main.ts

Line 283 in 131c723

result = await result

Everything after this main.ts line looks safe to run on server close. I could use another reviewer to gut check this change in behavior though.

fix(server): fix a potential deadlock when requesting a tarball worker

1e39fbf

gluxon marked this pull request as ready for review September 4, 2023 05:31

gluxon requested a review from zkochan as a code owner September 4, 2023 05:31

gluxon commented Sep 4, 2023

View reviewed changes

gluxon mentioned this pull request Sep 4, 2023

CI tests hanging on pool.checkoutWorkerAsync(true) #7038

Closed

4 tasks

gluxon changed the title ~~fix(server): prevent an intermittent deadlock when requesting a tarball worker~~ fix(server): prevent an intermittent hanging when requesting a tarball worker Sep 4, 2023

gluxon changed the title ~~fix(server): prevent an intermittent hanging when requesting a tarball worker~~ fix(server): prevent intermittent hanging when requesting a tarball worker Sep 4, 2023

zkochan approved these changes Sep 5, 2023

View reviewed changes

zkochan merged commit 548768e into pnpm:main Sep 5, 2023
8 checks passed

gluxon deleted the server-workers-finishing branch September 9, 2023 20:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(server): prevent intermittent hanging when requesting a tarball worker #7041

fix(server): prevent intermittent hanging when requesting a tarball worker #7041

gluxon commented Sep 4, 2023 •

edited

gluxon commented Sep 4, 2023 •

edited

gluxon Sep 4, 2023

fix(server): prevent intermittent hanging when requesting a tarball worker #7041

fix(server): prevent intermittent hanging when requesting a tarball worker #7041

Conversation

gluxon commented Sep 4, 2023 • edited

Problem

Reproducing the hanging

Changes

Alternatives

gluxon commented Sep 4, 2023 • edited

gluxon Sep 4, 2023

Choose a reason for hiding this comment

gluxon commented Sep 4, 2023 •

edited

gluxon commented Sep 4, 2023 •

edited