Add Prometheus Metrics extension for exposing and instrumenting LocalStack metrics#92
gregfurman merged 18 commits into main
dfangl left a comment
First review round :) I think this is already great work, I played around with the queries and it seems useful. Would be nice to simulate some load on ESMs, and try to create an example dashboard to show if the data we are collecting also tells us something in the bigger picture.
    event_target = get_event_target_from_procesor(self.processor)

    try:
        current_time_epoch = time.perf_counter()
Perf counter does not return the current time in any way, only relative values make sense with perf counter. I would rename this to "start" or something to be clearer.
I'd rather just use a with statement here:
    with LOCALSTACK_POLL_EVENTS_DURATION_SECONDS.time():
        fn(self)
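To illustrate the point about perf counters, here is a minimal stdlib-only sketch of timing a block as a difference between two time.perf_counter() readings. The names record_duration, durations, and poll_events are hypothetical stand-ins, not code from this PR; the list plays the role of a histogram's observe() sink.

```python
import time
from contextlib import contextmanager


@contextmanager
def record_duration(observe):
    """Time the enclosed block and pass the elapsed seconds to `observe`."""
    # perf_counter() has an arbitrary reference point, so a single reading is
    # not a timestamp; only the difference between two readings is meaningful.
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed polls are timed as well.
        observe(time.perf_counter() - start)


durations = []  # hypothetical sink standing in for a histogram


def poll_events():
    """Hypothetical poller body; sleeps briefly to simulate work."""
    time.sleep(0.01)


with record_duration(durations.append):
    poll_events()
```

Wrapping the call in a context manager keeps the start and end readings together, which is roughly what the .time() helper quoted above is being used for.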
I would not call the root module "prometheus" here - it might interfere with other installed packages in the future, and is not clear enough.
Let's just try localstack_prometheus_extension or something like this. Or even localstack_extension_prometheus_metrics if we want to match the pypi package name.
    # Record the start time
    context.start_time = time.perf_counter()
Why record here, if we do not try to read it if there is no service_operation set? Shouldn't this be moved after the if?
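A minimal sketch of the suggested ordering, assuming a simplified context object: the start time is only recorded once we know the request will be measured. FakeContext, on_request_start, and on_request_end are hypothetical names used for illustration and are not LocalStack's actual handler API.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeContext:
    """Hypothetical stand-in for the request context (not LocalStack's class)."""
    service_operation: Optional[str] = None
    start_time: Optional[float] = None


def on_request_start(context: FakeContext) -> None:
    # Bail out first: requests without a service operation are never measured,
    # so there is no reason to record a start time for them.
    if context.service_operation is None:
        return
    context.start_time = time.perf_counter()


def on_request_end(context: FakeContext) -> Optional[float]:
    # Only requests that were actually started produce a duration.
    if context.start_time is None:
        return None
    return time.perf_counter() - context.start_time


tracked = FakeContext(service_operation="lambda.Invoke")
untracked = FakeContext()
on_request_start(tracked)
on_request_start(untracked)
```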
    ).inc(total_events)

    try:
        result = fn(self, original_events)
Do we perhaps want to track the duration of the invocation as well, while we are at it? This would also be end to end of the lambda service (including startup delays etc.), in contrast to the extraction of the logs.
Lambda invocation metrics are already available via the localstack_request_processing_seconds metric, since all internal calls are tracked. So we could view this with:
localstack_request_processing_seconds{operation="Invoke", service="lambda"}
If we're OK with some redundancy, then I'm happy to track this as well. Wdyt?
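For anyone who wants to run the query above from a script, here is a minimal sketch that builds a request URL for the Prometheus HTTP API instant-query endpoint (/api/v1/query). The server address is an assumption, and build_instant_query_url is a hypothetical helper; the query string is copied from the comment above.

```python
from urllib.parse import urlencode

# Assumption: a Prometheus server scraping LocalStack is reachable here.
PROMETHEUS_URL = "http://localhost:9090"


def build_instant_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL for a PromQL expression."""
    # urlencode percent-escapes the braces, quotes, and commas in the selector.
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


QUERY = 'localstack_request_processing_seconds{operation="Invoke", service="lambda"}'
url = build_instant_query_url(PROMETHEUS_URL, QUERY)
```

The resulting URL can be fetched with any HTTP client once a Prometheus instance is running.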
localstack_request_processing_seconds is end to end (including polling) right? What metric would be only the invocation?
I ended up going with the below:
ee86b63 to 528fd60
dfangl left a comment
I think this is a good first iteration, and we can revisit certain parts. Looks good!
localstack_request_processing_seconds is end to end (including polling) right? What metric would be only the invocation?
@dfangl Alternatively, we could look at
Motivation
The previous Platform Observability extension does not support LocalStack v4+. In addition, LocalStack does not currently expose metrics in a standardised way, which makes it difficult to build monitoring and observability solutions.
To address these issues, we propose a new Prometheus Metrics extension which has the following advantages:
Metrics
Currently, all metrics reside in the metrics package, which contains:
/core - Core metrics used for general request handling information (latency, in-flight, count, etc.)
/event_polling - Metrics for tracking poller operation (poll durations, batching efficiency, etc.)
/event_processing - Metrics for tracking event processing (propagation delays, error tracking, total event info, etc.)
A list of all metrics and their descriptions can be found by hitting the /_extension/metrics endpoint -- otherwise see each metrics/*.py class for in-line documentation.
Instrumentation
RequestContext into the request handler chain.
event_processing - LambdaSender.send_events is patched to record propagation delays and processing errors.
event_polling - The poll_events methods of KinesisPoller, DynamoDBStreamsPoller, and SqsPoller are all patched to record poll miss events, e2e latencies, and processing errors.
get_records and handle_messages are patched to record event information as soon as it comes in.
Poller interface that is used to fetch all metrics in a poll_events call.
Testing
tests/aws/services/lambda_/event_source_mapping and localstack-pro-core/tests/aws/services/pipes were successfully run against a LocalStack pro container with the extension enabled.
ESM test output:
Performance