
Worker lambda metrics #734

Merged: 15 commits merged into main from dlawes/worker-lambda-metrics, Aug 31, 2022
Conversation

@DavidLawes (Contributor) commented Aug 26, 2022

What does this change?

This adds metrics for the worker lambdas (the harvester and the senders) to give us a longer-term view of the performance of these services.

The change has been deployed to CODE and the metrics have been added to the notifications dashboard:

Harvester:
[Screenshot: harvester metrics on the notifications dashboard, 2022-08-26]
^ the type dimension lets us view the processingTime metric separately for breakingNews notifications and other types.

Sender lambdas:
[Screenshot: sender lambda metrics on the notifications dashboard, 2022-08-30]
^ the platform and type dimensions let us view processingTime metrics for ios and android, separated further into breakingNews notifications and other types.

@DavidLawes (Contributor, Author) commented Aug 26, 2022

cc @michaelwmcnamara @jacobwinch

This is a first attempt at adding embedded metrics to the worker lambdas. The format of the _aws log entry is pretty much copied and pasted from your PR.

The current state of this PR doesn't attempt to abstract the common functionality yet because I ran into a strange error:

  • we can see that the metrics are being parsed successfully by AWS (we see the metrics and their values in the AWS console)
  • in the CloudWatch logs I can see the corresponding log message
  • however, in Kibana I never see the log message that contains the _aws object (which is disconcerting as a dev, because certain information isn't being shown, and it made me second-guess whether something was wrong)

[Two screenshots, 2022-08-26 15:23 and 15:25]

Did you come across this situation when creating your original PR?

EDIT: I've tested again this morning with the same code and I now see the logs in Kibana and the metrics in AWS, so maybe there was a temporary issue somewhere in the ELK stack (or I was being impatient).
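For context, here is a minimal sketch of the kind of _aws (CloudWatch Embedded Metric Format) log object being discussed, written as the nested Map/List structure used in the lambdas; the helper and metric name are assumptions for illustration, not the exact code in this PR:

import java.time.Instant

// Minimal sketch of a CloudWatch Embedded Metric Format (EMF) payload. The
// namespace follows the diff below; the metric name is an assumption.
def emfLogEntry(stage: String, platform: String, processingTimeMs: Long, end: Instant): Map[String, Any] =
  Map(
    "_aws" -> Map(
      "Timestamp" -> end.toEpochMilli,
      "CloudWatchMetrics" -> List(Map(
        "Namespace" -> s"Notifications/$stage/workers",
        "Dimensions" -> List(List("platform")),
        "Metrics" -> List(Map("Name" -> "processingTime", "Unit" -> "Milliseconds"))
      ))
    ),
    // The declared dimension and metric names must also appear as top-level
    // keys so CloudWatch can resolve their values from the log line.
    "platform" -> platform,
    "processingTime" -> processingTimeMs
  )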

"Timestamp" -> end.toEpochMilli,
"CloudWatchMetrics" -> List(Map(
"Namespace" -> s"Notifications/${env.stage}/workers",
"Dimensions" -> List(List("platform")),
@DavidLawes (Contributor, Author):

@waisingyiu @frankie297
Dimensions allow us to view metrics at a finer granularity. For the sender lambdas, I wonder whether it could be useful to see the metric broken down by breakingNews/other? E.g. we could get metrics:
ios + breakingNews
ios + other
android + breakingNews
android + other

Or, maybe just knowing the processing time for android vs ios will be enough. What do you think?

@waisingyiu (Contributor):

I think it may be useful when we need to look at the performance of a particular breaking news notification. We could show metrics for breaking news only, and associate a metric with a particular breaking news notification based on its time. Please ignore me if CloudWatch metrics shouldn't be used for this purpose.

@DavidLawes (Contributor, Author):

Thanks @waisingyiu :) If I recall correctly, providing the notificationId as a metric dimension would incur additional cost (ref: https://docs.google.com/document/d/10AnOZ4MLjuTO7mXaySoVO2SwmroLmns4gGmsfY7egmw/edit#heading=h.ajh00ba9hm68). I think if we needed to analyse the performance of a specific notificationId we'd need to rely on our logs + Kibana dashboard (I could have misinterpreted this, though!).

For the metrics, I was hoping to define a minimal (static) set of dimensions that would still let us analyse performance and trends. Based on your response I think there would be value in providing an additional dimension for the sender lambda metrics (sketched below):

  • platform: ios or android
  • type: breakingNews or other
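In code, that would look something like the following (a sketch only, not the final diff), with the matching values appearing as top-level keys in the same log entry:

// Sketch: declare both dimensions so CloudWatch keeps a separate
// processingTime series per (platform, type) combination.
val senderDimensions: List[List[String]] = List(List("platform", "type"))

// The matching values then need to be present as top-level keys in the
// same log entry, for example:
val dimensionValues: Map[String, String] = Map(
  "platform" -> "ios",         // or "android"
  "type" -> "breakingNews"     // or "other"
)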

@jacobwinch (Contributor) commented Aug 30, 2022:

> I think if we needed to analyse the performance of a specific notificationId we'd need to rely on our logs + Kibana dashboard (I could have misinterpreted this, though!).

I think this is the correct approach 👍

I also agree that adding platform and type as dimensions would give us some extra benefits without increasing costs too much.

@jacobwinch (Contributor) commented:

> Did you come across this situation when creating your original PR?

I don't remember facing this problem. We'd expect all Lambda logs to show up in ELK pretty quickly, certainly within a minute or two.

> EDIT: I've tested again this morning with the same code and I now see the logs in Kibana and the metrics in AWS, so maybe there was a temporary issue somewhere in the ELK stack (or I was being impatient).

Let's keep an eye on things when this is merged and investigate further if it happens again!

"harvester.notificationProcessingEndTime.string" -> end.toString,
), "Finished processing notification event")
)
records.foreach {
@DavidLawes (Contributor, Author):

At the moment the harvester batchSize is 1. I think if we increased it, logging in the finally block wouldn't give us accurate per-notification metrics. Not something I think I need to address now, just a consideration for the future.
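If we did increase the batch size, one option would be to time each record inside the loop instead of in the surrounding finally block. A rough sketch, where the record type and the two helpers are hypothetical stand-ins for the real harvester code:

import java.time.{Duration, Instant}

// Sketch: per-record timing if the harvester batch size grows beyond 1.
// Record, process and logMetric are hypothetical stand-ins; the point is
// only where the timing lives relative to the loop.
case class Record(id: String)
def process(record: Record): Unit = ()
def logMetric(name: String, valueMs: Long): Unit = println(s"$name=$valueMs")

def processBatch(records: List[Record]): Unit =
  records.foreach { record =>
    val start = Instant.now()
    try process(record)
    finally {
      val elapsedMs = Duration.between(start, Instant.now()).toMillis
      logMetric("processingTime", elapsedMs) // emitted once per record, not once per batch
    }
  }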

@DavidLawes (Contributor, Author) commented:

I think for now I'd like to:

  • push these changes into prod so we can start collecting metrics
  • consider possible refactoring of the embedded CloudWatch metrics as a subsequent PR (just abstracting the CloudWatchMetrics Map into a common function didn't improve readability, for me at least, and I wonder whether we need a higher level of abstraction to ensure that the dimensions and metric names always have corresponding keys in the log object too; a rough sketch of what I mean follows)
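A rough sketch of that kind of helper (names assumed, not part of this PR): build both the _aws declaration and the matching top-level keys from the same inputs, so the two can't drift apart.

// Rough sketch of a possible future helper: derive the EMF declaration and
// the matching top-level keys from the same dimension/metric maps, so every
// declared name is guaranteed to have a value in the log object.
def embeddedMetrics(
  namespace: String,
  dimensions: Map[String, String], // e.g. Map("platform" -> "ios", "type" -> "other")
  metrics: Map[String, Long],      // e.g. Map("processingTime" -> 1234L)
  timestampMs: Long
): Map[String, Any] =
  Map(
    "_aws" -> Map(
      "Timestamp" -> timestampMs,
      "CloudWatchMetrics" -> List(Map(
        "Namespace" -> namespace,
        "Dimensions" -> List(dimensions.keys.toList),
        "Metrics" -> metrics.keys.toList.map(name => Map("Name" -> name, "Unit" -> "Milliseconds"))
      ))
    )
  ) ++ dimensions ++ metrics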

@DavidLawes marked this pull request as ready for review, August 30, 2022 10:56
@waisingyiu (Contributor) left a comment:

Thank you for your great work! I have just two queries:

  1. Will the app notifications dashboard in ELK continue to work after this PR?
  2. I notice that the startTime comes from the sentTimestamp attribute of the event record rather than being set at the start of the function. Does AWS populate this attribute, or does the upstream service? What time is it exactly?

Happy to approve. Thanks David.

@DavidLawes (Contributor, Author) commented Aug 30, 2022:

> Thank you for your great work! I have just two queries:
>
>   1. Will the app notifications dashboard in ELK continue to work after this PR?
>   2. I notice that the startTime comes from the sentTimestamp attribute of the event record rather than being set at the start of the function. Does AWS populate this attribute, or does the upstream service? What time is it exactly?
>
> Happy to approve. Thanks David.

Thanks!

The apps notifications dashboard in ELK will continue to work as expected after this PR.

About the sentTimestamp: this is set by AWS when the message lands on the queue. The suggestion was to use this time to measure the total time taken to process a message (i.e. total time = time spent on the queue before processing + time spent processing the message in the lambda). I think this way we'll get a better feel for how the overall system is performing (ref: https://docs.google.com/document/d/10AnOZ4MLjuTO7mXaySoVO2SwmroLmns4gGmsfY7egmw/edit#heading=h.n166j1upsg46).
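Concretely, the measurement is along these lines (a sketch with assumed names; the real code reads the attribute from the SQS event record):

import java.time.{Duration, Instant}

// Sketch: total processing time measured from the SQS SentTimestamp attribute
// (milliseconds since epoch, set by AWS when the message is enqueued) to the
// moment the lambda finishes handling the record. sentTimestampAttr stands in
// for the value read from the event record's attributes.
def totalProcessingMillis(sentTimestampAttr: String, end: Instant): Long = {
  val enqueued = Instant.ofEpochMilli(sentTimestampAttr.toLong)
  Duration.between(enqueued, end).toMillis // queue wait + lambda processing time
}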

Hope that makes sense!

@DavidLawes merged commit 6ca8a1b into main, Aug 31, 2022
@DavidLawes deleted the dlawes/worker-lambda-metrics branch, August 31, 2022 09:57