Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api): performance service #2878

Merged
merged 4 commits into from
Mar 9, 2023
Merged

feat(api): performance service #2878

merged 4 commits into from
Mar 9, 2023

Conversation

p-fernandez
Copy link
Contributor

What change does this PR introduce?

Implements a service that allows to gather performance data and Node.js internals data for certain areas of the API.
Also adds a performance test to understand the duration of different executions under different loads.

Why was this change needed?

Will be helpful to study performance of the sensible parts of the Trigger Event and what is the overhead of the distributed locking mechanism. Also can help to set the base for load performance test suites (in a way).

Other information (Screenshots)

Screenshot 2023-02-21 at 12 32 20

@SebastianStehle
Copy link

Why are you not just using OTLP?

@p-fernandez
Copy link
Contributor Author

p-fernandez commented Feb 23, 2023

Why are you not just using OTLP?

I guess you are referring to OpenTelemetry, if not I don't know what you might be talking about. Acronyms are hard.

To be honest I didn't think about it because this had a clear purpose of measuring execution times and was not meant to be able to export data. Mostly to add a quick tool for @tsssdev and his request about how to be able to measure Trigger Event execution performance.
@Cliftonz is working on general analytics so I don't want to step on his work. In case he decides to integrate OpenTelemetry, downstream this implementation perf_hooks could be easily replaced and integrate it.

@SebastianStehle
Copy link

Yes, I am about open telemetry.

The most important primitive of OTLP is the span. It represents a unit of work and you annotate your code to track spans. For example method calls, db calls, http requests and so on. Spans have a correlation Ids to represent the parent span and there is a protocol to add this correlation IDs to http requests for distributed tracing.

Most implementations for like this:

  1. You have a way to create spans.
  2. You define one or many outputs (exporters), e.g. NewRelic, OTLP (Open Telemetry protocol) and so on.
  3. You define one or more processors, e.g. to enrich the spans or to filter them.

So what I would do:

  1. Configure Open Telemetry.
  2. Annotate your code with custom spans.
  3. Write a custom exporter that provides the performance data (e.g. for tests) or so.

You can then get rid of a lot of code and the performance data can also be used for self hosting, e.g. in Azure Application Insights or Google Stackdriver and so on.

@Cliftonz
Copy link
Contributor

Cliftonz commented Feb 23, 2023

@p-fernandez I agree, I think this will be fine for now.

However in the long run I agree with @SebastianStehle that we should be doing open-telemetry, However doing this in the trigger engine and mail parser is much harder then the nest based apps as there are easy OS modules for it.

https://github.com/pragmaticivan/nestjs-otel

@SebastianStehle I will adding open-telemetry into our backlog.
This would also be accompanied by setting up Prometheus metrics too.

Comment on lines +91 to +80
public buildDigestFilterStepsMark(
transactionId: string,
templateId: string,
notificationId: string,
subscriberId: string
): IMark {
const mark = {
id: `${MarkFunctionNameEnum.DIGEST_FILTER_STEPS}:event:${transactionId}:template:${templateId}:notification:${notificationId}:subscriber:${subscriberId}:steps`,
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to pass the arguments here if we only calculate the averages by the MarkFunctionNameEnum?
couldn't we just generate the random id if the concurrent requests is the problem here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are needed for having individual marks and to help to debug different executions if needed. And to have the right data as they are being accumulated I needed to make it unique, if not they would get overriden everytime it was executed. I saw your mention regarding generating a random UUID but that would make more difficult to debug any of the executions to be honest.
Right now the publishing of the individual marks is disabled as it was cluttering the log output and don't need it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't be the request scope random id the solution here?

@@ -69,6 +71,8 @@ export class EventsController {
@UserSession() user: IJwtPayload,
@Body() body: TriggerEventRequestDto
): Promise<TriggerEventResponseDto> {
const mark = this.performanceService.buildEndpointTriggerEventMark(body.transactionId as string);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should hold all this logic in the decorator? I can imagine it like this:

@MeasurePerformance('EventsController.trackEvent')
async trackEvent(

then I don't have to deal with the service

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@LetItRock This is something nest-otel can cover for us

}

@Injectable()
export class EventsPerformanceService {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we need to create an events-specific performance service, I feel like all this might be generic code, that can be reused in any context. This way we won't end up writing per context performance services. So I see it with the decorator as I mentioned before.

The publishResults function can be defined as the helper function that will be called from anywhere we need to get the stats.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I try to think of the different modules as if they were its own microservice, so making this domain related service allows to:

  • Keep isolated in the domain the specific mark generation.
  • Have their own kind of calculations and what publish and what not (think in a different domain we might want to have different calculations, instead of average other algorithm, publish different results, etc)

That doesn't mean that's going to happen, to be reused elsewhere, so I can understand why you are suggesting something more generic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know to me it would be more to maintain... but yeah maybe let's start with something simple and then adjust to the new requirements in the future...

return this.setStart(mark);
}

public buildEndpointTriggerEventMark(transactionId: string): IMark {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these all won't be needed if we will create generic function with mark like:

const mark = {
      id: `${functionName}:${uuid}`,
    };

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressing this comment here: #2878 (comment)

@tsssdev
Copy link
Contributor

tsssdev commented Feb 25, 2023

Why are you not just using OTLP?

I guess you are referring to OpenTelemetry, if not I don't know what you might be talking about. Acronyms are hard.

To be honest I didn't think about it because this had a clear purpose of measuring execution times and was not meant to be able to export data. Mostly to add a quick tool for @tsssdev and his request about how to be able to measure Trigger Event execution performance. @Cliftonz is working on general analytics so I don't want to step on his work. In case he decides to integrate OpenTelemetry, downstream this implementation perf_hooks could be easily replaced and integrate it.

@p-fernandez How we generate load? do we have load testing infrastructure in place? Under decent to high load only we can see real issues with locking or with out locking.. if we have Dev/QA type of environment that can take decent load then we can implement quick load tests with K6(https://github.com/grafana/k6 ) or any other similar tool.

@Cliftonz
Copy link
Contributor

@tsssdev Our testing bed is dev.web.novu.co and we are looking to add k6 load tests in the api e2e test folder. If you want to create a pr for that we would be more then happy to look at it and merge it.

@LetItRock LetItRock self-requested a review March 6, 2023 17:23
Copy link
Contributor

@Cliftonz Cliftonz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good start, lets get it merged in and we can go from here.

p-fernandez and others added 4 commits March 9, 2023 16:58
Co-authored-by: Paweł Tymczuk <LetItRock@users.noreply.github.com>
Co-authored-by: Paweł Tymczuk <LetItRock@users.noreply.github.com>
@p-fernandez p-fernandez added this pull request to the merge queue Mar 9, 2023
Merged via the queue into next with commit f827a89 Mar 9, 2023
@p-fernandez p-fernandez deleted the feat-performance-service branch March 9, 2023 17:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants