Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disgraceful shutdown, request before signal interrupted with ECONNREFUSED #11416

Open
6 of 15 tasks
soryy708 opened this issue Apr 2, 2023 · 25 comments
Open
6 of 15 tasks

Comments

@soryy708
Copy link

soryy708 commented Apr 2, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Current behavior

When the application is shutdown (e.g. with SIGTERM shutdown hook), requests that started before shutdown fail with ECONNREFUSED.

Minimum reproduction code

https://github.com/soryy708/nestjs-disgraceful-shutdown-mcve

Steps to reproduce

  1. Install dependencies by running yarn install
  2. Run the tests by running yarn test

Expected behavior

Requests that started before shutdown should succeed, only new requests after shutdown should fail.

Package

  • I don't know. Or some 3rd-party package
  • @nestjs/common
  • @nestjs/core
  • @nestjs/microservices
  • @nestjs/platform-express
  • @nestjs/platform-fastify
  • @nestjs/platform-socket.io
  • @nestjs/platform-ws
  • @nestjs/testing
  • @nestjs/websockets
  • Other (see below)

Other package

No response

NestJS version

9.3.12

Packages versions

{
"@nestjs/axios": "^2.0.0",
"@nestjs/common": "^9.3.12",
"@nestjs/core": "^9.3.12",
"@nestjs/platform-express": "^9.3.12",
"@nestjs/terminus": "^9.2.2",
"@nestjs/testing": "^9.3.12",
}

Node.js version

16.19.0

In which operating systems have you tested?

  • macOS
  • Windows
  • Linux

Other

No response

@soryy708 soryy708 added the needs triage This issue has not been looked into label Apr 2, 2023
@josephaw1022
Copy link

I believe the error lies somewhere here or in one these methods. Somewhere in one of these, the request is being stopped

Screenshot from 2023-04-05 19-49-50

@josephaw1022
Copy link

Screenshot from 2023-04-06 15-18-11

So, I could be wrong, but what I have found is that the callDestroyHook method is iteratively calling callModuleDestroyHook and that is what is removing the modules via the shift method on the list of providers. So, what my intuition is telling me, is that the provider is being removed while the controller is utilizing it and thus the request is just being stopped. Again, I could be wrong, but I think this is a potential root cause.

@Ekott2006
Copy link

Hello, what's the update on this issue??

@coderhammer
Copy link

I think that the problem lies on the express adapter itself, on the close method.

  public close() {
    this.closeOpenConnections();

    if (!this.httpServer) {
      return undefined;
    }
    return new Promise(resolve => this.httpServer.close(resolve));
  }

As you can see, this close method closes all the openConnections before calling this.httpServer.close.

As you can see in the express documentation (https://expressjs.com/en/advanced/healthcheck-graceful-shutdown.html), the close method on the httpServer is supposed to wait for all connections to be terminated. So technically I think that if we remove this.closeOpenConnections(), it would just work as expected.

https://github.com/nestjs/nest/blob/master/packages/platform-express/adapters/express-adapter.ts#L159-L166

@kamilmysliwiec
Copy link
Member

kamilmysliwiec commented Jun 17, 2023

cc @tolgap 👀

@coderhammer this method call won't have any impact if forceCloseConnections is false (which is the default)

@coderhammer
Copy link

coderhammer commented Jun 17, 2023

Honestly I'm not sure, I just tried on a new project and it closes open connections with the default. Here is the current implementation of the closeOpenConnections method:

  private closeOpenConnections() {
    for (const socket of this.openConnections) {
      socket.destroy();
      this.openConnections.delete(socket);
    }
  }

https://github.com/nestjs/nest/blob/f676276979e651f6258b5bc42666e9aa7fe3dbf4/packages/platform-express/adapters/express-adapter.ts#LL429C1-L434C4

A quick fix for me for now is to implement my own OnModuleDestroy function to wait for all the connections to be closed. This will wait before calling nest's implementation on the http adapter.

import { Injectable, Logger, OnModuleDestroy } from '@nestjs/common';
import { HttpAdapterHost } from '@nestjs/core';

@Injectable()
export class AppService implements OnModuleDestroy {
  constructor(private readonly httpAdapter: HttpAdapterHost) {}

  async onModuleDestroy() {
    Logger.debug('Module destroyed called!');
    const httpServer = this.httpAdapter.httpAdapter.getHttpServer();
    await new Promise((resolve) => httpServer.close(resolve));
    Logger.debug('Waited for all connections to be closed!');
  }
}

EDIT: The behaviour is happening when activating app.enableShutdownHooks() I don't know if this is relevant

@kamilmysliwiec
Copy link
Member

Honestly I'm not sure, I just tried on a new project and it closes open connections with the default. Here is the current implementation of the closeOpenConnections method

this.openConnections will always be empty unless trackOpenConnections (https://github.dev/nestjs/nest/blob/da27730cd4ef37fdf7f29aafee53ed8ac2a8b4b8/packages/platform-express/adapters/express-adapter.ts#L421) is called and it's executed only if forceCloseConnections is true https://github.dev/nestjs/nest/blob/da27730cd4ef37fdf7f29aafee53ed8ac2a8b4b8/packages/platform-express/adapters/express-adapter.ts#L234

@coderhammer
Copy link

I'm so sorry, it works exactly as expected when trying again on a fresh project. I was trying something with a special configuration. Have a nice day! This is not an issue for me anymore.

@soryy708
Copy link
Author

@coderhammer what did you change? Is there a workaround?

@coderhammer
Copy link

Honestly, everything works fine on my side when testing manually. I tried your reproduction repo but I think that it's more a test configuration that is failing :/

@soryy708
Copy link
Author

@coderhammer If you think the issue is with test configuration, where do you think the issue is?

I thought it could be that Test.createTestingModule from @nestjs/testing is bugged, so I added a test suite that doesn't use it and instead uses NestFactory from @nestjs/core, but the bug still happens.

I don't think it's feasible to reproduce this manually (without an automated test), because the timeframe in which this bug occurs is very small, but is enough to be experienced by users in a production system under load.

@soryy708
Copy link
Author

Also I don't think that app.enableShutdownHooks is related, because in the MCVE tests I linked if I comment out enableShutdownHooks it still fails with the same error. (clearly that means my MCVE wasn't minimal. Whoops!)

@coderhammer
Copy link

After investigating a little bit on your test repo, it seems to me that the problem comes from supertest itself. I'll try to make a reproduction later this week

@soryy708
Copy link
Author

soryy708 commented Jun 19, 2023

@coderhammer Looks like Supertest may be related, but is not the only issue.
I added to the MCVE a suite that does the same tests but without Supertest.

The same test case (shutting down while a request is in progress) fails on ECONNREFUSED.

However so does making a request after the server shut down (about which I don't care if it's 503 or ECONNREFUSED, the server finished shutting down).

@kamilmysliwiec
Copy link
Member

@soryy708 could you spin up a pure express application (non-NestJS-driven) just to see what's the original behavior for your test?

@soryy708
Copy link
Author

With the simplest possible Express application (without even Terminus) it acts the same as with Nest not using Supertest.

const express = require('express');

function createServer() {
  const app = express();
  app.get('/health', async (_req, res) => {
    res.status(200).send();
  })
  return new Promise((resolve, reject) => {
    const server = app.listen(0, error => {
      if (error) {
        reject(error);
        return;
      }
      resolve([server.address().port, () => server.close()]);
    });
  })
}

describe("API main (no Supertest)", () => {
  describe("Lifecycle", () => {
    let port;
    let close;

    beforeEach(async () => {
      [port, close] = await createServer();
    });

    afterEach(async () => {
      await close();
    });

    const probeReadiness = async () => {
      const request = new Request(`http://localhost:${port}/health`, { method: 'GET' });
      const response = await fetch(request);
      return { statusCode: response.status };
    }

    describe("When bootstrapped", () => {
      describe("When running", () => {
        it("Should respond successfully", async () => {
          const response = await probeReadiness();
          expect(response.statusCode).toBe(200);
        });
      });

      describe("When shutting down", () => {
        describe("When a request is already in progress", () => {
          it("Should respond successfully", async () => {
            const responsePromise = probeReadiness();
            const closePromise = close();
            const [response] = await Promise.all([
              responsePromise,
              closePromise,
            ]);
            expect(response.statusCode).toBe(200);
          });
        });

        describe("When a new request is made", () => {
          it("Should respond with 503 Service Unavailable", async () => {
            await close();
            const response = await probeReadiness();
            expect(response.statusCode).toBe(503);
          });
        });
      });
    });
  });
});

Result:

 FAIL  ./index.test.js
  Lifecycle
    When bootstrapped
      When running
        √ Should respond successfully (49 ms)
      When shutting down
        When a request is already in progress
          × Should respond successfully (8 ms)
        When a new request is made
          × Should respond with 503 Service Unavailable (3 ms)

Both on ECONNREFUSED.

I expect that if I use Terminus, it will postpone Express'es shutdown long enough to flush the in-progress requests.

@kamilmysliwiec
Copy link
Member

If this behavior can be reproduced with the pure express application, then I don't think there's anything we can/should do given we don't really want to alter the default driver's behavior. Perhaps reporting this issue in the terminus package would make more sense then?

@soryy708
Copy link
Author

Are we sure that it's an issue with the Terminus package?

I'm opening an issue in https://github.com/nestjs/terminus.

@kamilmysliwiec
Copy link
Member

Given that this is the default behavior of HTTP drivers we rely on (Nest is built upon), I wouldn't even call this an "issue" but rather a "feature request". Whether it should be implemented & exposed from the terminus package, or an entirely new package that extends the default driver and adds some extra functionality on top, is a different question.

@soryy708
Copy link
Author

Whatever you want to call it.

It's the default behavior of Express without anything, because Express is not a framework like Nest is.

Express docs tell you that if you want graceful shutdown, you need to add another thing to Express (e.g. @godaddy/terminus, which @nestjs/terminus shares a name with but doesn't depend on).

https://expressjs.com/en/advanced/healthcheck-graceful-shutdown.html

The idea is to observe for SIGINT signals, and when it happens, to have the health check respond with failures, and delay Express'es server.close() until after all pending requests were handled.

@kamilmysliwiec
Copy link
Member

It's the default behavior of Express without anything, because Express is not a framework like Nest is.

This doesn't imply we should be changing the driver's default behavior, especially by default.

Express docs tell you that if you want graceful shutdown, you need to add another thing to Express (e.g. @godaddy/terminus, which @nestjs/terminus shares a name with but doesn't depend on).

So I'd say exposing this feature from the @nestjs/terminus package would make sense.

@tolgap
Copy link
Contributor

tolgap commented Jun 21, 2023

@kamilmysliwiec I have done some investigation on this.

It seems Fastify has an "option" for handling things like this. They introduce an option called return503OnClose: https://github.com/fastify/fastify/blob/main/lib/route.js#L437-L449

Then before Fastify performs the close logic, they return a 503 on any open request in their route handler.

I can see how this issue pops up for regular health checks. But this can also be useful for SSE! I think we should consider keeping this logic in @nestjs/platform-express, but gate it behind an option. Just like Fastify does. We can then introduce this option on the application level. But if you're of the opinion that we should handle this in @nestjs/terminus, we can keep it that way!

@kamilmysliwiec
Copy link
Member

I can see how this issue pops up for regular health checks. But this can also be useful for SSE! I think we should consider keeping this logic in @nestjs/platform-express, but gate it behind an option. Just like Fastify does. We can then introduce this option on the application level. But if you're of the opinion that we should handle this in @nestjs/terminus, we can keep it that way!

Either way works for me. Since it can be useful for SSE too (and we expose abstractions for SSE from the core) then having it in the platform-specific packages would make sense too!

@cmdlucas
Copy link

Hey @kamilmysliwiec
Do you have an ETA for this feature request?

@kamilmysliwiec
Copy link
Member

Have you tried using this package https://www.npmjs.com/package/nestjs-graceful-shutdown?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants