
Add Prometheus Metrics extension for exposing and instrumenting LocalStack metrics #92

Merged
gregfurman merged 18 commits into main from add/extension/prometheus on Mar 11, 2025

Conversation

@gregfurman (Contributor) commented Mar 7, 2025

Motivation

The previous Platform Observability extension does not support LocalStack v4 and above. In addition, LocalStack does not currently expose metrics in a standardised way, which makes building monitoring and observability solutions difficult.

To address these issues, we propose a new Prometheus Metrics extension which has the following advantages:

  1. Prometheus metrics have been standardised via OpenMetrics and are compatible with other monitoring systems (e.g. CloudWatch, OpenTelemetry).
  2. The Prometheus Docker container ships with a built-in user interface, making it easy to visualise and query metrics.
  3. OpenMetrics support for Exemplars makes it possible to add fine-grained event-tracing instrumentation.

Metrics

Currently, all metrics reside in the metrics package, which contains:

  • /core - Core metrics used for general request-handling information (latency, in-flight requests, request counts, etc.)
  • /event_polling - Metrics for tracking poller operation (poll durations, batching efficiency, etc.)
  • /event_processing - Metrics for tracking event processing (propagation delays, error tracking, total event info, etc.)

A list of all metrics and their descriptions is returned by the /_extension/metrics endpoint; otherwise, see each metrics/*.py module for in-line documentation.
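
As an illustration, a core request metric might be declared as follows. This is a minimal sketch using prometheus_client; the metric name localstack_request_processing_seconds appears later in this thread, but the exact label set and bucket boundaries here are assumptions.

    from prometheus_client import Histogram

    # Hypothetical declaration mirroring the core request-latency metric
    # discussed in this thread; label names and buckets are assumptions.
    LOCALSTACK_REQUEST_PROCESSING_SECONDS = Histogram(
        "localstack_request_processing_seconds",
        "Time spent processing a LocalStack service request, end to end.",
        ["service", "operation"],
        buckets=(0.005, 0.05, 0.5, 5, 30),
    )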

Instrumentation

  • Core metrics are instrumented by injecting a custom RequestContext into the request handler chain.
  • event_processing - LambdaSender.send_events is patched to record propagation delays and processing errors.
  • event_polling - The poll_events methods of KinesisPoller, DynamoDBStreamsPoller, and SqsPoller are patched to record poll-miss events, end-to-end latencies, and processing errors (see the sketch after this list).
    • In addition, get_records and handle_messages are patched to record event information as soon as it arrives.
    • TODO: this is not ideal; we should instead unify this behind a single abstract method on the Poller interface that collects all metrics in a poll_events call.
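
A rough sketch of the patching approach for poll_events. The decorator shape, the error counter, and its labels are illustrative assumptions; the extension's real patch utilities and metric names may differ.

    import functools

    from prometheus_client import Counter

    # Hypothetical error counter; the real extension's metric names may differ.
    LOCALSTACK_POLL_ERRORS_TOTAL = Counter(
        "localstack_poll_errors_total",
        "Total number of errors raised while polling for events.",
        ["poller"],
    )

    def instrument_poll_events(fn):
        """Wrap a Poller.poll_events implementation to record poll errors."""

        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            try:
                return fn(self, *args, **kwargs)
            except Exception:
                # Label by concrete poller class, e.g. KinesisPoller
                LOCALSTACK_POLL_ERRORS_TOTAL.labels(poller=type(self).__name__).inc()
                raise

        return wrapper

    # Applied to each poller class, e.g.:
    # KinesisPoller.poll_events = instrument_poll_events(KinesisPoller.poll_events)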

Testing

  • Refer to the following docker-compose gist to quickly spin up LocalStack with this metrics extension installed, alongside a Prometheus instance that periodically scrapes the extension's metrics endpoint (a sketch of a similar setup follows this list).
  • All tests within tests/aws/services/lambda_/event_source_mapping and localstack-pro-core/tests/aws/services/pipes were successfully run against a LocalStack Pro container with the extension enabled.
  • There is, however, currently no test suite to validate this extension itself.
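
For reference, a setup along the gist's lines might look roughly like the following. This is a sketch, not the gist's exact contents: the ports, scrape interval, and extension package name are assumptions, while the /_extension/metrics path is the endpoint described above.

    # docker-compose.yml (sketch)
    services:
      localstack:
        image: localstack/localstack-pro
        ports:
          - "4566:4566"
        environment:
          - LOCALSTACK_AUTH_TOKEN=${LOCALSTACK_AUTH_TOKEN}
          # Package name assumed; see the naming discussion later in this thread
          - EXTENSION_AUTO_INSTALL=localstack-extension-prometheus-metrics

      prometheus:
        image: prom/prometheus
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml

    # prometheus.yml (sketch)
    scrape_configs:
      - job_name: localstack
        metrics_path: /_extension/metrics
        scrape_interval: 5s
        static_configs:
          - targets: ["localstack:4566"]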

ESM test output:

============================= slowest 10 durations =============================
63.30s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_sqs.py::test_fifo_message_group_parallelism
28.94s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_kinesis.py::TestKinesisSource::test_disable_kinesis_event_source_mapping
27.73s call     tests/aws/services/lambda_/event_source_mapping/test_cfn_resource.py::test_adding_tags
26.23s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_sqs.py::TestSQSEventSourceMapping::test_sqs_event_source_mapping_batching_window_size_override
25.88s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_kinesis.py::TestKinesisSource::test_kinesis_event_source_trim_horizon
20.43s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_sqs.py::test_report_batch_item_failures
20.25s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_sqs.py::test_redrive_policy_with_failing_lambda
19.87s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_kinesis.py::TestKinesisSource::test_kinesis_event_source_mapping_with_async_invocation
18.98s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_kinesis.py::TestKinesisSource::test_create_kinesis_event_source_mapping_multiple_lambdas_single_kinesis_event_stream
18.53s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_dynamodbstreams.py::TestDynamoDBEventSourceMapping::test_dynamodb_event_source_mapping
============ 93 passed, 3 skipped, 2 warnings in 989.20s (0:16:29) =============

Performance

  • Initially, the dimensionality and cardinality of the Prometheus metrics were much higher, which caused performance degradation in the internal Prometheus server (not in LocalStack request processing).
  • Some follow-up changes were made to optimise this (e.g. fewer buckets in histograms); see the sketch below.
  • No noticeable performance issues were observed in LocalStack as a result of this instrumentation. However, we should consider multiprocess mode if this becomes a problem.
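
To illustrate the bucket reduction: each histogram bucket becomes its own time series per label combination, so fewer, coarser buckets directly shrink what the internal server has to serve. The boundary values below are assumptions, not the actual follow-up change.

    from prometheus_client import Histogram

    # Fewer, coarser buckets (assumed values) keep cardinality low while
    # still distinguishing fast, typical, and slow polls. Each bucket adds
    # one *_bucket series per label combination.
    LOCALSTACK_POLL_EVENTS_DURATION_SECONDS = Histogram(
        "localstack_poll_events_duration_seconds",
        "Duration of a single poll_events call.",
        buckets=(0.01, 0.1, 1, 10),
    )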

@gregfurman gregfurman self-assigned this Mar 7, 2025
@dfangl (Member) left a comment


First review round :) I think this is already great work, I played around with the queries and it seems useful. Would be nice to simulate some load on ESMs, and try to create an example dashboard to show if the data we are collecting also tells us something in the bigger picture.

event_target = get_event_target_from_processor(self.processor)

try:
    current_time_epoch = time.perf_counter()
Member

Perf counter does not return the current time in any way, only relative values make sense with perf counter. I would rename this to "start" or something to be clearer.

Contributor Author

Going to rather just use a with statement here:

        with LOCALSTACK_POLL_EVENTS_DURATION_SECONDS.time():
            fn(self)

Comment thread prometheus/pyproject.toml
Member

I would not call the root module "prometheus" here - it might interfere with other installed packages in the future, and is not clear enough.
Let's just try localstack_prometheus_extension or something like this. Or even localstack_extension_prometheus_metrics if we want to match the pypi package name.

Comment thread prometheus/prometheus/instruments/poller.py Outdated
Comment thread prometheus/prometheus/instruments/util.py Outdated
Comment thread prometheus/prometheus/metrics/event_polling.py Outdated
Comment thread prometheus/prometheus/server.py Outdated
Comment thread prometheus/prometheus/server.py Outdated
Comment on lines +26 to +27
# Record the start time
context.start_time = time.perf_counter()
Member

Why record here, if we do not read it when there is no service_operation set? Shouldn't this be moved after the if?
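
A minimal sketch of the suggested reordering; the handler signature and the service_operation check are assumptions based on the diff context.

    import time

    def on_request(chain, context, response):
        # Skip requests that will never be measured, so we only pay for
        # perf_counter() when a service operation has been resolved.
        if not getattr(context, "service_operation", None):
            return
        # Record the start time after the guard, per the review comment
        context.start_time = time.perf_counter()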

Comment thread prometheus/prometheus/instruments/sender.py Outdated
).inc(total_events)

try:
    result = fn(self, original_events)
Member

Do we perhaps want to track the duration of the invocation as well, while we are at it? This would also be end to end for the lambda service (including startup delays etc.), in contrast to extracting it from the logs.

Contributor Author

Lambda invocation metrics are already available via the localstack_request_processing_seconds metric, since all internal calls are tracked. So we could view this with:

localstack_request_processing_seconds{operation="Invoke", service="lambda"}

If we're OK with some redundancy, then I'm happy to track this as well. Wdyt?

Member

localstack_request_processing_seconds is end to end (including polling) right? What metric would be only the invocation?

Contributor Author

I ended up going with the below:

with LOCALSTACK_PROCESS_EVENT_DURATION_SECONDS.labels(
    event_source=event_source, event_target=event_target
).time():
    result = fn(self, original_events)

@gregfurman gregfurman force-pushed the add/extension/prometheus branch from ee86b63 to 528fd60 on March 11, 2025 at 10:18
@gregfurman gregfurman requested a review from dfangl March 11, 2025 13:55
@dfangl (Member) left a comment

I think this is a good first iteration, and we can revisit certain parts. Looks good!

Comment thread prometheus/prometheus/instruments/poller.py Outdated
).inc(total_events)

try:
    result = fn(self, original_events)
Member

localstack_request_processing_seconds is end to end (including polling) right? What metric would be only the invocation?

Comment thread prometheus/pyproject.toml Outdated
@gregfurman (Contributor Author)

> localstack_request_processing_seconds is end to end (including polling) right? What metric would be only the invocation?

@dfangl localstack_process_event_duration_seconds tracks the total duration for which the event is processed by the target. In the case of an ESM, this is the invocation time of the target lambda.

Alternatively, we could look at localstack_request_processing_duration_seconds{service="lambda", operation="Invoke"} if we wanted to (generally) see the performance of our Lambda invocations being processed by the gateway.
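
For illustration, a percentile view of target processing time could be derived from the histogram's buckets. This is a hypothetical PromQL query; the metric and label names follow this thread, but the _bucket suffix and label set assume a standard prometheus_client histogram.

    histogram_quantile(
      0.95,
      sum by (le, event_target) (
        rate(localstack_process_event_duration_seconds_bucket[5m])
      )
    )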

@gregfurman gregfurman merged commit b0e5961 into main on Mar 11, 2025
