
Update telemetry for getDeltas from db #16348

Closed
wants to merge 5 commits

Conversation

shubhi1092 (Contributor) commented on Jul 12, 2023:

Description

  • Adding a metric to capture the getDeltas method.
  • Adding a "caller" property to the telemetry to identify internal vs. external calls (a sketch follows below).

@shubhi1092 shubhi1092 requested a review from a team as a code owner July 12, 2023 22:37
@shubhi1092 shubhi1092 requested review from zhangxin511 and kekachmar and removed request for a team July 12, 2023 22:37
@github-actions bot added labels "area: server" (Server related issues, routerlicious) and "base: main" (PRs targeted against main branch) on Jul 12, 2023
@shubhi1092 shubhi1092 marked this pull request as draft July 12, 2023 22:55
@@ -50,4 +50,6 @@ export enum LumberEventName {
 	ConnectionCountPerNode = "ConnectionCountPerNode",
 	RestoreFromCheckpoint = "RestoreFromCheckpoint",
 	ReprocessOps = "ReprocessOps",
+	GetDeltasFromDb = "GetDeltasFromDb",
+	GetDeltasFromStorage = "GetDeltasFromStorage",
zhangxin511 (Contributor) commented:

The changes make sense for OSS; however, FRS is not using this piece of logic, so I don't see the benefit of the metrics added here.

FRS has a file called fallbackDeltaService.ts, which already has almost all the metrics/status you need. Maybe take a look over there?

shubhi1092 (Author) replied:

Hey Xin, fallbackDeltaService in FRS extends OSS's deltaService and calls super.getDeltas to get the deltas from the db, so GetDeltasFromDb will be useful for both FRS and OSS. GetDeltasFromStorage also seemed similar, but I was not sure whether it is actually used; still, there's no harm in adding it as well.
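
A minimal sketch of the inheritance relationship described here; fallbackDeltaService is FRS-internal, so the class and method shapes below are simplified assumptions, not the actual code.

```typescript
// OSS: the base delta service reads ops straight from the db.
class DeltaService {
	public async getDeltas(
		tenantId: string,
		documentId: string,
		from?: number,
		to?: number,
	): Promise<unknown[]> {
		// Queries the deltas collection; this is the read the new
		// GetDeltasFromDb metric would time for OSS and FRS alike.
		return [];
	}
}

// FRS: the fallback service reuses the db read via super.getDeltas and
// only consults object storage on a miss (fallback path not shown).
class FallbackDeltaService extends DeltaService {
	public async getDeltas(
		tenantId: string,
		documentId: string,
		from?: number,
		to?: number,
	): Promise<unknown[]> {
		return super.getDeltas(tenantId, documentId, from, to);
	}
}
```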

zhangxin511 (Contributor) replied:

Oh I see, that makes sense. I still don't think you need to add a metric here, because it would double the metric/log count for every getDeltas call. Instead, you can add a property flag in fallbackDeltaService that indicates whether super.getDeltas succeeded; if it did not, fallbackDeltaService will still log the error, or you can just use a stage flag to indicate where the call got to. Doing so keeps the log volume low and keeps the analysis simple: you have everything in one metric, so you don't need to join or correlate different metrics. What do you think?

I don't think GetDeltasFromStorage is being used anywhere anymore. Its only usage is from one route, and there are no logs for that route in our Kusto for the past 30 days, so I think it is an obsolete endpoint.
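
A hedged sketch of this single-metric suggestion, reusing the DeltaService base class and imports from the sketches above; the GetDeltaObjectStoreFallback enum member is FRS-internal and the property names are assumptions.

```typescript
class FallbackDeltaService extends DeltaService {
	public async getDeltas(
		tenantId: string,
		documentId: string,
		from?: number,
		to?: number,
	): Promise<unknown[]> {
		// One metric for the whole call; a flag records the db stage's
		// outcome instead of emitting a separate GetDeltasFromDb event.
		const metric = Lumberjack.newLumberMetric(LumberEventName.GetDeltaObjectStoreFallback);
		try {
			const deltas = await super.getDeltas(tenantId, documentId, from, to);
			metric.setProperty("dbSucceeded", true);
			metric.success("Deltas served from db");
			return deltas;
		} catch (error) {
			metric.setProperty("dbSucceeded", false);
			// Fall back to object storage here (not shown); the error is still
			// logged on this single metric rather than on a second one.
			metric.error("Db stage failed; falling back", error);
			throw error;
		}
	}
}
```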

shubhi1092 (Author) replied:

I think adding this metric here would help us analyse the time taken by the db call within the overall getDeltas API duration. We could still do that outside in fallbackDeltaService, but I thought this would be cleaner, since we also have a similar metric, GetDeltasFromFallback, to capture that duration.

If the amount of logs is a concern, I can alternatively add the durations for both GetDeltasFromDb and GetDeltasFromFallback to the overall GetDeltaObjectStoreFallback metric and then remove this metric from OSS; what do you think?

I will remove the GetDeltasFromStorage metric, since it is obsolete and not needed.

zhangxin511 (Contributor) replied:

It is up to you; I prefer a single metric instead of doubling the metrics, because of:

  1. The amount of logs: we already have too many logs crashing the Geneva buffer, which leads to lost logs and to Kusto ingestion delay. @hedasilv, do you have any concerns, given that the GetDeltaObjectStoreFallback log is already number 5 on the list?
  2. The difficulty of correlating two metrics from the same call. But if you can figure out a better way to correlate them, I think it is fine.

Kusto query (run against cluster('frspme.westus2.kusto.windows.net').database('FRS')):

```kusto
FRSRouterlicious
| where eventTimestamp > ago(1d)
| summarize count() by eventName
| order by count_ desc
| take 10
```


Table0:

| eventName                   | count_        |
| --------------------------- | ------------- |
| LogMessage                  | 1,417,995,212 |
| RiddlerFetchTenantKey       |   926,002,068 |
| RunWithRetry                |   138,144,104 |
| ScriptoriumProcessBatch     |   127,897,144 |
| GetDeltaObjectStoreFallback |    84,785,599 |
| ReprocessOps                |    67,206,731 |
| DisconnectDocument          |    63,701,361 |
| ConnectDocument             |    58,862,813 |
| ConnectDocumentGetClients   |    58,857,795 |
| ConnectDocumentAddClient    |    52,860,490 |

shubhi1092 (Author) replied:

Hey @zhangxin511, these are my thoughts and reasoning:

  1. I see the benefit of adding a new metric here to analyse the GetDeltasFromDb trend: it is faster to check and plot a metric's duration directly than to extract it from another metric's properties.
  2. However, I also see the benefit of adding this latency to the overall GetDeltaObjectStoreFallback metric's properties, since that makes it easier to check and correlate a single call.

I am going to do option 2 anyway in FRS, since we already have that metric. And since the number of logs is a concern here, let's not do option 1 for now; we can reconsider it if needed after adding option 2.
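
A sketch of option 2 as settled on above: time both stages and attach the durations as properties of the single GetDeltaObjectStoreFallback metric, so one event per call carries everything needed for correlation. It reuses the imports and DeltaService base from the sketches above; the property names and the getDeltasFromFallback helper are illustrative, not the real FRS code.

```typescript
class FallbackDeltaService extends DeltaService {
	public async getDeltas(
		tenantId: string,
		documentId: string,
		from?: number,
		to?: number,
	): Promise<unknown[]> {
		const metric = Lumberjack.newLumberMetric(LumberEventName.GetDeltaObjectStoreFallback);

		// Stage 1: db read, timed and attached as a property.
		const dbStart = Date.now();
		let deltas = await super.getDeltas(tenantId, documentId, from, to);
		metric.setProperty("getDeltasFromDbDurationMs", Date.now() - dbStart);

		// Stage 2: object-store fallback on a miss, also timed on the same metric.
		if (deltas.length === 0) {
			const fallbackStart = Date.now();
			deltas = await this.getDeltasFromFallback(tenantId, documentId, from, to);
			metric.setProperty("getDeltasFromFallbackDurationMs", Date.now() - fallbackStart);
		}

		metric.success("getDeltas completed");
		return deltas;
	}

	// Illustrative object-store read for the fallback path.
	private async getDeltasFromFallback(
		tenantId: string,
		documentId: string,
		from?: number,
		to?: number,
	): Promise<unknown[]> {
		return [];
	}
}
```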

@shubhi1092 shubhi1092 changed the title Add metric for getDeltas from db Update telemetry for getDeltas from db Jul 14, 2023

@@ -24,23 +29,42 @@ export class DeltaService implements IDeltaService {
 		documentId: string,
 		from?: number,
 		to?: number,
+		caller?: string,
zhangxin511 (Contributor) commented:


Remember to pass this through from fallbackDeltaService.
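
A minimal sketch of this reminder, assuming the base getDeltas signature now includes the new caller parameter and reusing the DeltaService shape from the sketches above:

```typescript
class FallbackDeltaService extends DeltaService {
	public async getDeltas(
		tenantId: string,
		documentId: string,
		from?: number,
		to?: number,
		caller?: string,
	): Promise<unknown[]> {
		// Forward caller so the base implementation's telemetry can record it.
		return super.getDeltas(tenantId, documentId, from, to, caller);
	}
}
```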

@shubhi1092 shubhi1092 closed this Jul 17, 2023