feat(infra): graceful shutdown for bull mq #3326

Merged: 6 commits, May 12, 2023


p-fernandez
Contributor

What change does this PR introduce?

Introduces graceful shutdown for BullMQ and all the services that depend on it.

Why was this change needed?

Reliability when the system scales up or down: we want to avoid losing jobs while an instance is closing. These changes should help us avoid losing information when shutting down an instance.

Other information (Screenshots)

linear bot commented May 4, 2023

NV-2137 Graceful shutdown and bootstrap for workers

Why? (Context)

ECS sends a SIGTERM signal before removing a task from the service. We need to make sure the worker gracefully stops receiving events and finishes the events it is currently processing before exiting.

https://docs.nestjs.com/fundamentals/lifecycle-events

https://docs.bullmq.io/guide/workers/graceful-shutdown

https://aws.amazon.com/blogs/containers/graceful-shutdowns-with-ecs/

What?

  • On SIGTERM with NestJS, call the worker.close method and wait until the promise resolves
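
A minimal sketch of the step above, under stated assumptions. The Closable interface and the GracefulQueueService class are hypothetical stand-ins, not the code in this PR: in the application the two members would be a BullMQ Worker and Queue, and onModuleDestroy would be the NestJS lifecycle hook, which NestJS only triggers on SIGTERM when app.enableShutdownHooks() has been called.

```typescript
// Hypothetical stand-in for anything that exposes an async close(),
// such as a BullMQ Worker or Queue.
interface Closable {
  close(): Promise<void>;
}

// Hypothetical service; the name and shape are illustrative only.
class GracefulQueueService {
  constructor(
    private readonly worker?: Closable,
    private readonly queue?: Closable
  ) {}

  // Assumed to be wired as a NestJS shutdown hook (e.g. OnModuleDestroy).
  async onModuleDestroy(): Promise<void> {
    // Close the worker first and wait for the promise, so that jobs
    // already being processed get the chance to finish.
    if (this.worker) {
      await this.worker.close();
    }
    // Then release the queue connection, if this service created one.
    if (this.queue) {
      await this.queue.close();
    }
  }
}
```

Both members are optional because, as discussed later in this conversation, some services only create a queue and others only create a worker.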

Definition of Done

@@ -4,7 +4,7 @@ import { getRedisPrefix } from '@novu/shared';
import { InboundEmailParse } from '../usecases/inbound-email-parse/inbound-email-parse.usecase';
import { InboundEmailParseCommand } from '../usecases/inbound-email-parse/inbound-email-parse.command';
import { ConnectionOptions } from 'tls';
-import { BullmqService } from '@novu/application-generic';
+import { BullMqService } from '@novu/application-generic';
Contributor Author

Renaming based on pattern.

@@ -1,3 +1,2 @@
export const QUEUE_SERVICE = 'QueueService';
Contributor Author

Unused. Leftover from refactor to application-generic.

@@ -118,7 +120,6 @@ const PROVIDERS = [
return dalService;
},
},
cacheService,
Contributor Author

Duplication.

Comment on lines -58 to -64
public async gracefulShutdown() {
// Right now we only want this for testing purposes
if (process.env.NODE_ENV === 'test') {
await this.bullMqService.queue.drain();
await this.bullMqService.worker.close();
}
}
Contributor Author

Used just for testing. Now this will be implemented in the BullMqService that this service depends on.

@@ -1,2 +1 @@
export const QUEUE_SERVICE = 'QueueService';
Contributor Author

Unused. Leftover from refactor to application-generic.

Contributor Author

Moved to application-generic for reusability.

const runningStatus =
await this.triggerQueueService.bullMqService.getRunningStatus();

if (!runningStatus.queueIsPaused) {
Contributor Author

We only create a queue for the TriggerQueueService.

Contributor

I guess that our only indication that the service is up is if it is not paused, right?

Contributor Author

I couldn't find any other way in the BullMQ API to know whether the queue is running, besides this property and checking whether the queue was instantiated.

Contributor

We should also check here whether the worker is running, using the flag that you return.

Contributor Author

Unfortunately we can't, because only a queue is created in TriggerQueueService, so the worker comes back as undefined. 🙁
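
One way to reconcile the two points in this thread, sketched under assumptions: treat an undefined workerIsRunning as "this service did not create a worker" and only fail the check when the flag is explicitly false. The RunningStatus shape and the isServiceHealthy helper are hypothetical and only mirror the field names visible in the diff.

```typescript
// Hypothetical shape mirroring the fields returned by getRunningStatus().
interface RunningStatus {
  queueIsPaused?: boolean;
  workerIsRunning?: boolean;
}

// Hypothetical helper: decides health from the running status.
function isServiceHealthy(status: RunningStatus): boolean {
  // Same queue check as in the PR: healthy unless the queue is paused.
  const queueOk = !status.queueIsPaused;
  // Assumption: undefined means no worker was created by this service,
  // so the worker check is skipped instead of failing.
  const workerOk = status.workerIsRunning !== false;
  return queueOk && workerOk;
}
```

With this rule a queue-only service such as TriggerQueueService still reports healthy, while a service whose worker exists but has stopped reports unhealthy.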

constructor() {
super('ws_socket_queue');
}

public readonly bullMqService: BullmqService;
Contributor Author

This is set by QueueService so this shouldn't be needed here.

p-fernandez force-pushed the nv-2137-graceful-shutdown-and-bootstrap-for branch 2 times, most recently from 09fe7b5 to cdbd2f8 (May 4, 2023 08:30)
Contributor

BiswaViraj left a comment

🎉

p-fernandez force-pushed the nv-2137-graceful-shutdown-and-bootstrap-for branch 2 times, most recently from 2de4455 to 169fe91 (May 4, 2023 13:24)
p-fernandez force-pushed the nv-2137-graceful-shutdown-and-bootstrap-for branch 2 times, most recently from 52844ed to b4bacf2 (May 5, 2023 15:48)
p-fernandez force-pushed the nv-2137-graceful-shutdown-and-bootstrap-for branch from b4bacf2 to 50b1316 (May 5, 2023 15:50)
}
return undefined;
})
.catch((error) => Logger.error('Metric Job Exists function errored', LOG_CONTEXT, error));
Contributor Author

Catching to avoid unexpected unhandled exceptions.


return resolve();
} catch (error) {
return reject(error);
Contributor Author

Catching to avoid unexpected unhandled exceptions. Alternatively, this could just be logged.

Contributor

Should we log it here, or do we log it somewhere on the outer scope?

Contributor Author

I expect this rejection to be caught by the failed event (line 35) and to be logged through the jobHasFailed function, but I'm not 100% sure.

Contributor

@p-fernandez @djabarovgeorge yes, this is what will happen after reject

Contributor

djabarovgeorge left a comment

Looks like a really big chunk of amazing work :) I left a couple of comments. Apologies for the number of questions; I am a bit out of context, so I was not sure about a couple of things.

The most concerning comment was related to the initialization of InMemoryProviderService in InMemoryProviderServiceHealthIndicator.

export class InMemoryProviderServiceHealthIndicator extends HealthIndicator {
private INDICATOR_KEY = 'inMemory';

constructor(private inMemoryProviderService: InMemoryProviderService) {
Contributor

I am not fully up to date on the current state of the project: how many instances of InMemoryProviderService do we have at the moment, and how are they injected?
My main concern is which instance's health we will actually be checking here.

Contributor Author

We have 2 instances: one for the CacheService and the other for the DistributionLockService.
Internally, the service configures the in-memory connection for each of the services. This will probably need to change when implementing the MemoryDB connection for the Worker.


throw new HealthCheckError(
'In-memory Health',
this.getStatus(this.INDICATOR_KEY, false)
Contributor

I wonder whether we should add a third argument here, 'data', with the reason. In this case, we know that the client could not get ready.

Contributor Author

Would be a good future improvement. 👍🏻
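
A rough sketch of what the suggested improvement could look like. The getStatus function below is a local, hypothetical stand-in written for this example, not the Terminus method itself; it assumes the real getStatus accepts an optional third argument with extra data, as the reviewer's wording suggests, and the message text is made up.

```typescript
// Hypothetical result shape: a status plus an optional reason.
interface IndicatorStatus {
  status: 'up' | 'down';
  message?: string;
}

// Local stand-in for the indicator's getStatus(key, isHealthy, data).
function getStatus(
  key: string,
  isHealthy: boolean,
  data: { message?: string } = {}
): Record<string, IndicatorStatus> {
  // Spread the extra data first so it can never overwrite the status.
  return { [key]: { ...data, status: isHealthy ? 'up' : 'down' } };
}

// Example: report why the in-memory check failed (message is illustrative).
const inMemoryDown = getStatus('inMemory', false, {
  message: 'In-memory client is not ready',
});
```

Whoever reads the health endpoint output would then see the reason next to the 'down' status instead of only the boolean outcome.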

Comment on lines +29 to +30
() => this.dalHealthIndicator.isHealthy(),
() => this.inMemoryHealthIndicator.isHealthy(),
Contributor

isHealthy is an async function; should we make this function async as well and await the result?
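
For context on the question above, a small sketch of why the wrappers may not need to be async: an arrow function that returns the promise from isHealthy() hands that promise to the caller, which can await it. The check function and FakeIndicator class are hypothetical stand-ins; whether the real health-check runner awaits the callbacks in the same way is an assumption here.

```typescript
// Hypothetical stand-ins for the health-check types.
type IndicatorResult = Record<string, { status: 'up' | 'down' }>;
type IndicatorFn = () => Promise<IndicatorResult> | IndicatorResult;

// Stand-in runner: awaits whatever each callback returns and merges the results.
async function check(indicators: IndicatorFn[]): Promise<IndicatorResult> {
  const results = await Promise.all(indicators.map((fn) => fn()));
  return Object.assign({}, ...results);
}

// Stand-in indicator with an async isHealthy, like the ones in the diff.
class FakeIndicator {
  constructor(private readonly key: string) {}

  async isHealthy(): Promise<IndicatorResult> {
    return { [this.key]: { status: 'up' } };
  }
}
```

Under that assumption, `() => indicator.isHealthy()` and `async () => await indicator.isHealthy()` behave the same from the runner's point of view.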


throw new HealthCheckError(
'Trigger Queue Health',
this.getStatus(this.INDICATOR_KEY, false)
Contributor

Same here regarding the extra data.

@Get()
@HealthCheck()
-async healthCheck() {
+async healthCheck(): Promise<HealthCheckResult> {
Contributor

Should we add the WsQueueServiceHealthIndicator check here?

workerIsRunning: boolean;
}> {
const queueIsPaused =
(this._queue && (await this._queue.isPaused())) || undefined;
Contributor

It is a bit hard to read in the browser, but doesn't this mean that undefined is returned when the queue is falsy, and also when isPaused is falsy? I wonder whether that is by design, or whether we should return 'false' in that case.

Contributor Author

The problem is that the BullMQ service is used both for creating queues and workers. We have some services that only create a queue and others that only create a worker. I wanted a generic function that shows the running status of whichever service instantiates the BullMQ service. Hence the undefined return values, to handle those cases. Open to a better suggestion though!

Contributor

Obviously not for this cycle, but maybe we can think of splitting the logic into two classes that extend this service: one producer and one consumer.

Contributor

IMO we always call createQueue when initializing any queue, and the workers are tied to the queue and won't do anything without it... so my suggestion is to move the queue creation to the constructor of the BullMqService, but of course we can do this in a separate ticket during the cooldown...

Contributor Author

I don't know why it was done this way rather than the way you suggest. Maybe @davidsoderberg could give more context.
Though I can infer from his PR that the intention was to provide the flexibility to create multiple queues independently, as we will only have one worker that consumes them all. If we moved all the queue creation into the BullMQ service, it would be harder to configure every queue independently, as we do right now.
It also lets us pass a groupId, which takes advantage of BullMQ's grouping feature when consuming jobs; that was one of the performance improvements we made in the previous release.
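
A note on the expression under discussion: `(queue && (await queue.isPaused())) || undefined` evaluates to undefined both when there is no queue and when the queue exists but is not paused, because `false || undefined` is undefined. A sketch of one way to keep the three states apart follows; the QueueLike and WorkerLike interfaces and the standalone getRunningStatus function are hypothetical simplifications, not the method in this PR.

```typescript
// Hypothetical minimal views of a queue and a worker.
interface QueueLike {
  isPaused(): Promise<boolean>;
}
interface WorkerLike {
  isRunning(): boolean;
}

interface RunningStatus {
  queueIsPaused: boolean | undefined;
  workerIsRunning: boolean | undefined;
}

// undefined = not created by this service; true/false = the actual state.
async function getRunningStatus(
  queue?: QueueLike,
  worker?: WorkerLike
): Promise<RunningStatus> {
  return {
    // The ternary keeps an explicit false instead of collapsing it to undefined.
    queueIsPaused: queue ? await queue.isPaused() : undefined,
    workerIsRunning: worker ? worker.isRunning() : undefined,
  };
}
```

This keeps the generic behaviour described above (queue-only and worker-only services) while letting callers tell "not paused" apart from "no queue".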

public async gracefulShutdown() {
// Right now we only want this for testing purposes
if (process.env.NODE_ENV === 'test') {
await this.bullMqService.queue.drain();
Contributor

With the new change we will be missing the 'queue.drain()', is that ok on the testing env?

Contributor Author

I was the one who implemented that, as a hack to make one Workflow Queue test pass. It seems that hack is not needed anymore.

p-fernandez and others added 2 commits May 9, 2023 14:51
p-fernandez force-pushed the nv-2137-graceful-shutdown-and-bootstrap-for branch from 168bc8b to 4484ece (May 9, 2023 13:51)
Contributor

djabarovgeorge left a comment

💫

@@ -106,6 +107,7 @@ const distributedLockService = {
const PROVIDERS = [
cacheService,
distributedLockService,
inMemoryProviderService,
Contributor

Is there any reason why we are creating 2 instances of the InMemoryProviderService?
We can inject it like this:

{
  provide: CacheService,
  useFactory: (inMemoryProviderService: InMemoryProviderService) => {
    return new CacheService(inMemoryProviderService);
  },
  inject: [InMemoryProviderService],
},

Contributor Author

We are configuring the provider with auto pipelining for the CacheService and without for the DistributedLockService. That is one of the reasons. 🙁

@@ -141,6 +142,7 @@ const PROVIDERS = [
return analyticsService;
},
},
TriggerQueueService,
Contributor

Isn't this done by mistake? We are using the TriggerHandlerQueueService in the events.module... we can change it to inject the TriggerQueueService instead if you wish ;)

Contributor Author

I recall NestJS complaining that it was not set as a provider, because of the health indicators. Let me try removing TriggerHandlerQueueService as a dependency and see whether it complains. But I have a feeling that it will require it too.

Contributor Author

I have just tried, and we need this because the HealthModule, where the health indicators are used, depends on this SharedModule, and TriggerQueueService is a dependency of TriggerQueueServiceHealthIndicator.
How would you suggest doing it instead? I can't think of another option right now, and I'm not sure that what I understand from your suggestion would work.


@@ -64,6 +64,9 @@ export async function bootstrap(): Promise<INestApplication> {
app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: true }));

// Starts listening for shutdown hooks
app.enableShutdownHooks();
Contributor

you told me some time ago to remove this 😛 hahah

Contributor Author

I don't remember it 🙈. This is needed for the graceful shutdown hooks to run after the SIGTERM, so I would need the context on why I could have said that. Maybe I mentioned what NestJS says about suboptimal performance when it is enabled?

p-fernandez added this pull request to the merge queue May 12, 2023
Merged via the queue into next with commit 87fc26e May 12, 2023
32 checks passed
p-fernandez deleted the nv-2137-graceful-shutdown-and-bootstrap-for branch May 12, 2023 16:20