feat(infra): graceful shutdown for bull mq #3326

p-fernandez · 2023-05-04T08:07:17Z

What change does this PR introduce?

Introduces graceful shutdown for BullMQ and all the services dependent on it.

Why was this change needed?

Reliability when the system scales up or down, so that jobs are not lost while closing. These changes should help us avoid losing information when shutting down an instance.

Other information (Screenshots)

linear · 2023-05-04T08:07:20Z

NV-2137 Graceful shutdown and bootstrap for workers

Why? (Context)

ECS sends a SIGTERM signal before removing a task from the service. We need to make sure the worker gracefully stops receiving new events and finishes the events it is currently processing before exiting.

https://docs.nestjs.com/fundamentals/lifecycle-events

https://docs.bullmq.io/guide/workers/graceful-shutdown

https://aws.amazon.com/blogs/containers/graceful-shutdowns-with-ecs/

What?

On SIGTERM with NestJS, call the worker.close method and wait until the returned promise resolves.
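A minimal sketch of that idea, assuming a service that owns a worker whose `close()` returns a promise (as BullMQ's `Worker.close()` does). `ClosableWorker`, `FakeWorker` and `WorkerOwner` are hypothetical stand-ins so the snippet runs without NestJS or Redis; in the real service the hook would be a NestJS lifecycle hook such as `onApplicationShutdown`.

```typescript
// Hypothetical stand-in for the part of a BullMQ Worker used here.
interface ClosableWorker {
  close(): Promise<void>;
}

// Fake worker: close() resolves only after the in-flight job has finished.
class FakeWorker implements ClosableWorker {
  public closed = false;
  public finishedJobs: string[] = [];

  constructor(private readonly inFlightJob: Promise<string>) {}

  async close(): Promise<void> {
    this.finishedJobs.push(await this.inFlightJob);
    this.closed = true;
  }
}

// Mirrors the shape of a NestJS provider implementing OnApplicationShutdown.
class WorkerOwner {
  constructor(private readonly worker: ClosableWorker) {}

  // NestJS passes the received signal (e.g. 'SIGTERM') to this hook
  // once app.enableShutdownHooks() has been called.
  async onApplicationShutdown(_signal?: string): Promise<void> {
    // Await the promise so shutdown does not continue mid-job.
    await this.worker.close();
  }
}
```

The important part is the `await`: if the hook returned without waiting, the process could exit while a job is still being processed.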

Definition of Done

apps/api/src/app/health/health.controller.ts

p-fernandez · 2023-05-04T08:19:24Z

apps/api/src/app/inbound-parse/services/inbound-parse.queue.service.ts

@@ -4,7 +4,7 @@ import { getRedisPrefix } from '@novu/shared';
 import { InboundEmailParse } from '../usecases/inbound-email-parse/inbound-email-parse.usecase';
 import { InboundEmailParseCommand } from '../usecases/inbound-email-parse/inbound-email-parse.command';
 import { ConnectionOptions } from 'tls';
-import { BullmqService } from '@novu/application-generic';
+import { BullMqService } from '@novu/application-generic';


Renaming based on pattern.

p-fernandez · 2023-05-04T08:19:45Z

apps/api/src/app/shared/constants.ts

@@ -1,3 +1,2 @@
-export const QUEUE_SERVICE = 'QueueService';


Unused. Leftover from refactor to application-generic.

p-fernandez · 2023-05-04T08:19:56Z

apps/api/src/app/shared/shared.module.ts

@@ -118,7 +120,6 @@ const PROVIDERS = [
      return dalService;
    },
  },
-  cacheService,


Duplication.

p-fernandez · 2023-05-04T08:20:37Z

apps/worker/src/app/workflow/services/workflow-queue.service.ts

-  public async gracefulShutdown() {
-    // Right now we only want this for testing purposes
-    if (process.env.NODE_ENV === 'test') {
-      await this.bullMqService.queue.drain();
-      await this.bullMqService.worker.close();
-    }
-  }


Used just for testing. Now this will be implemented in the BullMqService that this service depends on.

p-fernandez · 2023-05-04T08:20:46Z

apps/ws/src/shared/constants.ts

@@ -1,2 +1 @@
-export const QUEUE_SERVICE = 'QueueService';


Unused. Leftover from refactor to application-generic.

p-fernandez · 2023-05-04T08:21:14Z

packages/application-generic/src/health/cache.health-indicator.ts

Moved to application generic for reusability.

packages/application-generic/src/health/queue.health-indicator.ts

p-fernandez · 2023-05-04T08:22:02Z

packages/application-generic/src/health/trigger-queue.health-indicator.ts

+    const runningStatus =
+      await this.triggerQueueService.bullMqService.getRunningStatus();
+
+    if (!runningStatus.queueIsPaused) {


We only create a queue for the TriggerQueueService.

I guess that our only indication that the service is up is if it is not paused, right?

I wasn't able to find any other way in the BullMQ API to know whether the queue is running, besides this property and checking whether the queue was instantiated.

We should also check here whether the worker is running, by using the flag that you return.

Unfortunately we can't do it, as only a queue is created in TriggerQueueService and therefore the worker comes back as undefined. 🙁

packages/application-generic/src/health/ws-queue.health-indicator.ts

packages/application-generic/src/services/bull-mq.service.ts

p-fernandez · 2023-05-04T08:25:21Z

packages/application-generic/src/services/ws-queue.service.ts

  constructor() {
    super('ws_socket_queue');
  }

-  public readonly bullMqService: BullmqService;


This is set by QueueService so this shouldn't be needed here.

BiswaViraj

🎉

p-fernandez · 2023-05-05T15:52:45Z

apps/worker/src/app/workflow/services/metric-queue.service.ts

-    }
+        return undefined;
+      })
+      .catch((error) => Logger.error('Metric Job Exists function errored', LOG_CONTEXT, error));


Catching to avoid unexpected unhandled exceptions.

p-fernandez · 2023-05-05T15:53:12Z

apps/worker/src/app/workflow/services/metric-queue.service.ts

+
+          return resolve();
+        } catch (error) {
+          return reject(error);


Catching to avoid unexpected unhandled exceptions. This could be logged only.

Should we log it here, or do we log it somewhere on the outer scope?

I expect this rejection to be caught by the failed event (line 35) and to be logged through jobHasFailed function. But not 100% sure.

@p-fernandez @djabarovgeorge yes, this is what will happen after reject
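A minimal sketch of the flow discussed above: a processor that rejects, and a worker that turns the rejection into a `failed` event where it can be logged. BullMQ workers do emit a `failed` event when the processor rejects, but `MiniWorker` and `processJob` below are hypothetical stand-ins, not the BullMQ implementation or the actual metric job code.

```typescript
type Job = { id: string };
type FailedListener = (job: Job, error: Error) => void;

// Hypothetical, heavily simplified worker: a rejected processor promise
// is caught and forwarded to the 'failed' listeners.
class MiniWorker {
  private readonly failedListeners: FailedListener[] = [];

  constructor(private readonly processor: (job: Job) => Promise<void>) {}

  on(_event: 'failed', listener: FailedListener): this {
    this.failedListeners.push(listener);
    return this;
  }

  async run(job: Job): Promise<void> {
    try {
      await this.processor(job);
    } catch (error) {
      const err = error instanceof Error ? error : new Error(String(error));
      for (const listener of this.failedListeners) {
        listener(job, err);
      }
    }
  }
}

// Processor shaped like the one in the diff: errors become a rejection.
function processJob(job: Job): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    try {
      if (job.id === 'bad') {
        throw new Error('metric job failed');
      }
      return resolve();
    } catch (error) {
      return reject(error);
    }
  });
}
```

With this shape the processor itself does not need to log; the single `failed` listener is the place where every rejection ends up.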

djabarovgeorge

Looks like a really big chunk of amazing work :) I left a couple of comments. Apologies for the number of questions; I am a bit out of context, so I was not sure about a couple of things.

The most concerning comment was related to the initialization of InMemoryProviderService in InMemoryProviderServiceHealthIndicator.

djabarovgeorge · 2023-05-08T07:23:14Z

packages/application-generic/src/health/in-memory.health-indicator.ts

+export class InMemoryProviderServiceHealthIndicator extends HealthIndicator {
+  private INDICATOR_KEY = 'inMemory';
+
+  constructor(private inMemoryProviderService: InMemoryProviderService) {


I am a bit behind on the current state of the project: how many instances of InMemoryProviderService do we have at the moment, and how are they injected?
My main concern here is what exactly this health check will be checking.

We have 2 instances: one for the CacheService and the other for the DistributedLockService.
Internally, the service configures the in-memory connection for each of the services. This will probably need to change when we implement the MemoryDB connection for the Worker.

djabarovgeorge · 2023-05-08T07:25:25Z

packages/application-generic/src/health/in-memory.health-indicator.ts

+
+    throw new HealthCheckError(
+      'In-memory Health',
+      this.getStatus(this.INDICATOR_KEY, false)


I wonder if we should add a third argument here, 'data', with the reason. In this case, we know that the client may not be ready.

Would be a good future improvement. 👍🏻
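A minimal sketch of what that future improvement could look like. Terminus' `HealthIndicator.getStatus` accepts an optional third `data` argument that is merged into the result; the `getStatus` and `HealthCheckError` below are simplified local stand-ins so the snippet runs on its own, and the `reason` field and `checkInMemory` function are hypothetical.

```typescript
type HealthIndicatorResult = Record<
  string,
  { status: 'up' | 'down'; [detail: string]: unknown }
>;

// Simplified stand-in for HealthIndicator.getStatus(key, isHealthy, data?).
function getStatus(
  key: string,
  isHealthy: boolean,
  data: Record<string, unknown> = {}
): HealthIndicatorResult {
  const status: 'up' | 'down' = isHealthy ? 'up' : 'down';

  // status is written last so that extra data can never overwrite it.
  return { [key]: { ...data, status } };
}

// Simplified stand-in for the Terminus HealthCheckError.
class HealthCheckError extends Error {
  constructor(message: string, public readonly causes: HealthIndicatorResult) {
    super(message);
  }
}

// Hypothetical check: the failure carries the reason in the extra data.
function checkInMemory(clientIsReady: boolean): HealthIndicatorResult {
  if (clientIsReady) {
    return getStatus('inMemory', true);
  }

  throw new HealthCheckError(
    'In-memory Health',
    getStatus('inMemory', false, { reason: 'client is not ready' })
  );
}
```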

djabarovgeorge · 2023-05-08T07:28:52Z

apps/api/src/app/health/health.controller.ts

+      () => this.dalHealthIndicator.isHealthy(),
+      () => this.inMemoryHealthIndicator.isHealthy(),


isHealthy is an async function, should we make this function async as well and await the result?

NestJS Terminus package takes care of that behind the scenes: https://github.com/nestjs/terminus/blob/410e07bf5e96d38285bc244225563d633ee1b2b5/lib/health-check/health-check-executor.service.ts#L64
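A simplified sketch of why the arrow functions do not need to be `async`: the executor calls each function and awaits whatever it returns. This is a hypothetical stand-in that runs the checks one by one, not the actual Terminus code linked above; `runChecks` and the type names are made up for illustration.

```typescript
type IndicatorResult = Record<string, { status: 'up' | 'down' }>;
type IndicatorFn = () => IndicatorResult | Promise<IndicatorResult>;

interface CheckSummary {
  results: IndicatorResult[];
  errors: unknown[];
}

// Hypothetical executor: awaits every indicator and collects failures.
async function runChecks(indicators: IndicatorFn[]): Promise<CheckSummary> {
  const results: IndicatorResult[] = [];
  const errors: unknown[] = [];

  for (const indicator of indicators) {
    try {
      // `await` handles both plain values and promises, so callers can pass
      // `() => indicator.isHealthy()` without awaiting it themselves.
      results.push(await indicator());
    } catch (error) {
      errors.push(error);
    }
  }

  return { results, errors };
}
```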

djabarovgeorge · 2023-05-08T07:31:18Z

packages/application-generic/src/health/trigger-queue.health-indicator.ts

+
+    throw new HealthCheckError(
+      'Trigger Queue Health',
+      this.getStatus(this.INDICATOR_KEY, false)


Same here regarding the extra data.

djabarovgeorge · 2023-05-08T07:42:46Z

packages/application-generic/src/health/trigger-queue.health-indicator.ts

+    const runningStatus =
+      await this.triggerQueueService.bullMqService.getRunningStatus();
+
+    if (!runningStatus.queueIsPaused) {


I guess that our only indication that the service is up is if it is not paused, right?

djabarovgeorge · 2023-05-08T07:54:54Z

apps/worker/src/app/workflow/services/metric-queue.service.ts

+
+          return resolve();
+        } catch (error) {
+          return reject(error);


Should we log it here, or do we log it somewhere on the outer scope?

djabarovgeorge · 2023-05-08T08:08:19Z

apps/ws/src/health/health.controller.ts


  @Get()
  @HealthCheck()
-  async healthCheck() {
+  async healthCheck(): Promise<HealthCheckResult> {


Should we add the WsQueueServiceHealthIndicator check here?

packages/application-generic/src/services/bull-mq.service.ts

djabarovgeorge · 2023-05-08T08:16:44Z

packages/application-generic/src/services/bull-mq.service.ts

+    workerIsRunning: boolean;
+  }> {
+    const queueIsPaused =
+      (this._queue && (await this._queue.isPaused())) || undefined;


It is a bit hard to read in the browser, but doesn't it mean that if the queue is falsy and isPaused is falsy then undefined is returned? I wonder if this is by design or if we should return 'false' here in the case above.

The problem is that the BullMQ service is used both for creating queues and for creating workers. We have some services that only create a queue and others that only create a worker. I wanted generic functionality that shows the running status of whichever service instantiates the BullMQ service, hence the undefined return values to handle those cases. Open to a better suggestion though!
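To make the question concrete, here is a small sketch of how the expression from the diff evaluates. `PausableQueue` is a hypothetical stand-in for the one queue method used (BullMQ's `Queue.isPaused()` returns a promise of a boolean), and `pausedWithTernary` is only one possible alternative, not something this PR implements.

```typescript
// Hypothetical stand-in for the part of a BullMQ Queue used here.
interface PausableQueue {
  isPaused(): Promise<boolean>;
}

// Same expression as in the diff: `false` is collapsed into `undefined`,
// so the result is either `true` or `undefined`.
async function pausedWithOr(queue?: PausableQueue): Promise<boolean | undefined> {
  return (queue && (await queue.isPaused())) || undefined;
}

// Possible alternative: `undefined` only when no queue was created.
async function pausedWithTernary(queue?: PausableQueue): Promise<boolean | undefined> {
  return queue ? await queue.isPaused() : undefined;
}
```

With the first form, a queue that is not paused and a queue that was never created both produce `undefined`, so a check like `!runningStatus.queueIsPaused` cannot tell them apart. The second form keeps `false` for a running queue.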

Obviously not for this cycle, but maybe we can think of splitting the logic into two classes that extend this service: one producer and one consumer.

IMO we always call createQueue when initializing any queue, and the workers are tied to the queue and won't do anything without it, so my suggestion is to move the queue creation to the constructor of the BullMqService. Of course, we can do this in a separate ticket during the cooldown.

I don't know the reasons why it was done this way and not the way you suggest. Maybe @davidsoderberg could give more context.
Though I can infer from the PR he did that the intention was to provide the flexibility to create multiple queues independently, as we will only have one worker that consumes them all. If we move all the queue creation to the BullMQ service, it would be harder to configure every queue independently, as we do right now.
It also lets us pass a groupId, which in the end is there to take advantage of the BullMQ grouping feature when consuming jobs. That was one of the performance improvements we made in the previous release.

djabarovgeorge · 2023-05-08T08:25:16Z

packages/application-generic/src/services/queue.service.ts

-  public async gracefulShutdown() {
-    // Right now we only want this for testing purposes
-    if (process.env.NODE_ENV === 'test') {
-      await this.bullMqService.queue.drain();


With the new change we will be missing the 'queue.drain()', is that ok on the testing env?

I was the one implementing that as a hack to make one Workflow Queue test pass. Seems that hack is not needed anymore.

Co-authored-by: George Djabarov <39195835+djabarovgeorge@users.noreply.github.com>

djabarovgeorge

💫

LetItRock · 2023-05-10T20:05:14Z

apps/api/src/app/shared/shared.module.ts

@@ -106,6 +107,7 @@ const distributedLockService = {
 const PROVIDERS = [
  cacheService,
  distributedLockService,
+  inMemoryProviderService,


Is there any reason why we are creating 2 instances of the InMemoryProviderService?
We can inject it like this:

{
  provide: CacheService,
  useFactory: (inMemoryProviderService: InMemoryProviderService) => {
    return new CacheService(inMemoryProviderService);
  },
  inject: [InMemoryProviderService],
},

We are configuring the provider with auto pipelining for the CacheService and without for the DistributedLockService. That is one of the reasons. 🙁

LetItRock · 2023-05-10T20:06:28Z

apps/api/src/app/shared/shared.module.ts

@@ -141,6 +142,7 @@ const PROVIDERS = [
      return analyticsService;
    },
  },
+  TriggerQueueService,


Isn't this done by mistake? We are using the TriggerHandlerQueueService in the events.module... we can change it to inject the TriggerQueueService instead if you wish ;)

I recall NestJS complaining that it was not set as a provider, because of the health indicators. Let me try to remove TriggerHandlerQueueService as a dependency and see whether it still complains. But I have a feeling that it will be required too.

I have just tried, and we need this because the HealthModule, where the health indicators are used, depends on this SharedModule, and TriggerQueueService is a dependency of TriggerQueueServiceHealthIndicator.
How would you suggest doing it instead? I can't think of another option right now, and I'm not sure that what I understand from your suggestion would work.

LetItRock · 2023-05-10T20:15:59Z

apps/worker/src/app/workflow/services/metric-queue.service.ts

+
+          return resolve();
+        } catch (error) {
+          return reject(error);


@p-fernandez @djabarovgeorge yes, this is what will happen after reject

LetItRock · 2023-05-10T20:16:30Z

apps/worker/src/bootstrap.ts

@@ -64,6 +64,9 @@ export async function bootstrap(): Promise<INestApplication> {
  app.use(bodyParser.json());
  app.use(bodyParser.urlencoded({ extended: true }));

+  // Starts listening for shutdown hooks
+  app.enableShutdownHooks();


you told me some time ago to remove this 😛 hahah

I don't remember it 🙈. This is needed for the graceful shutdown hooks to run after the SIGTERM, so I would need the context on why I could have said that. Maybe I mentioned what NestJS says about performance not being optimal when it is enabled?

LetItRock · 2023-05-10T20:33:49Z

packages/application-generic/src/services/bull-mq.service.ts

+    workerIsRunning: boolean;
+  }> {
+    const queueIsPaused =
+      (this._queue && (await this._queue.isPaused())) || undefined;


IMO we always call createQueue when initializing any queue, and the workers are tied to the queue and won't do anything without it, so my suggestion is to move the queue creation to the constructor of the BullMqService. Of course, we can do this in a separate ticket during the cooldown.

LetItRock · 2023-05-10T20:38:15Z

packages/application-generic/src/health/trigger-queue.health-indicator.ts

+    const runningStatus =
+      await this.triggerQueueService.bullMqService.getRunningStatus();
+
+    if (!runningStatus.queueIsPaused) {


We should also check here whether the worker is running, by using the flag that you return.

p-fernandez requested review from davidsoderberg, LetItRock, ainouzgali, scopsy, BiswaViraj, djabarovgeorge and Cliftonz May 4, 2023 08:07

github-actions bot added @novu/api @novu/worker @novu/ws labels May 4, 2023

p-fernandez force-pushed the nv-2137-graceful-shutdown-and-bootstrap-for branch from 2830897 to 13f78aa May 4, 2023 08:14