
Add Prometheus Metrics extension for exposing and instrumenting LocalStack metrics #92

Merged
gregfurman merged 18 commits into main from add/extension/prometheus on Mar 11, 2025

Conversation

@gregfurman (Contributor) commented Mar 7, 2025

Motivation

The previous Platform Observability extension does not support LocalStack v4 and above. In addition, LocalStack does not currently expose metrics in a standardised way, which makes building monitoring and observability solutions difficult.

To address these issues, we propose a new Prometheus Metrics extension which has the following advantages:

  1. Prometheus metrics have been standardised via OpenMetrics and are compatible with other monitoring systems (e.g. CloudWatch, OpenTelemetry).
  2. The Prometheus Docker container ships with a built-in user interface, making it easy to visualise and query metrics.
  3. OpenMetrics support for Exemplars makes it possible to add fine-grained event-tracing instrumentation.

Metrics

Currently, all metrics reside in the metrics package, which contains:

  • /core - Core metrics used for general request-handling information (latency, in-flight requests, request counts, etc.)
  • /event_polling - Metrics for tracking poller operation (poll durations, batching efficiency, etc.)
  • /event_processing - Metrics for tracking event processing (propagation delays, error tracking, total event info, etc.)

A list of all metrics and their descriptions is returned by the /_extension/metrics endpoint; otherwise, see each metrics/*.py module for in-line documentation.
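
As an illustration, a core request metric might be declared as follows. This is a minimal sketch using prometheus_client; the metric name localstack_request_processing_seconds appears later in this thread, but the exact label set and bucket boundaries here are assumptions.

    from prometheus_client import Histogram

    # Hypothetical declaration mirroring the core request-latency metric
    # discussed in this thread; label names and buckets are assumptions.
    LOCALSTACK_REQUEST_PROCESSING_SECONDS = Histogram(
        "localstack_request_processing_seconds",
        "Time spent processing a LocalStack service request, end to end.",
        ["service", "operation"],
        buckets=(0.005, 0.05, 0.5, 5, 30),
    )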

Instrumentation

  • Core metrics are instrumented by injecting a custom RequestContext into the request handler chain.
  • event_processing - LambdaSender.send_events is patched to record propagation delays and processing errors.
  • event_polling - The poll_events methods of KinesisPoller, DynamoDBStreamsPoller, and SqsPoller are patched to record poll-miss events, end-to-end latencies, and processing errors (see the sketch after this list).
    • In addition, get_records and handle_messages are patched to record event information as soon as it arrives.
    • TODO: this is not ideal; we should instead unify this behind a single abstract method on the Poller interface that collects all metrics in a poll_events call.
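
A rough sketch of the patching approach for poll_events. The decorator shape, the error counter, and its labels are illustrative assumptions; the extension's real patch utilities and metric names may differ.

    import functools

    from prometheus_client import Counter

    # Hypothetical error counter; the real extension's metric names may differ.
    LOCALSTACK_POLL_ERRORS_TOTAL = Counter(
        "localstack_poll_errors_total",
        "Total number of errors raised while polling for events.",
        ["poller"],
    )

    def instrument_poll_events(fn):
        """Wrap a Poller.poll_events implementation to record poll errors."""

        @functools.wraps(fn)
        def wrapper(self, *args, **kwargs):
            try:
                return fn(self, *args, **kwargs)
            except Exception:
                # Label by concrete poller class, e.g. KinesisPoller
                LOCALSTACK_POLL_ERRORS_TOTAL.labels(poller=type(self).__name__).inc()
                raise

        return wrapper

    # Applied to each poller class, e.g.:
    # KinesisPoller.poll_events = instrument_poll_events(KinesisPoller.poll_events)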

Testing

  • Refer to the following docker-compose gist to quickly spin up LocalStack with this metrics extension installed, alongside a Prometheus instance that periodically scrapes the extension's metrics endpoint (a sketch of a similar setup follows this list).
  • All tests within tests/aws/services/lambda_/event_source_mapping and localstack-pro-core/tests/aws/services/pipes were successfully run against a LocalStack Pro container with the extension enabled.
  • There is, however, currently no test suite to validate this extension itself.
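
For reference, a setup along the gist's lines might look roughly like the following. This is a sketch, not the gist's exact contents: the ports, scrape interval, and extension package name are assumptions, while the /_extension/metrics path is the endpoint described above.

    # docker-compose.yml (sketch)
    services:
      localstack:
        image: localstack/localstack-pro
        ports:
          - "4566:4566"
        environment:
          - LOCALSTACK_AUTH_TOKEN=${LOCALSTACK_AUTH_TOKEN}
          # Package name assumed; see the naming discussion later in this thread
          - EXTENSION_AUTO_INSTALL=localstack-extension-prometheus-metrics

      prometheus:
        image: prom/prometheus
        ports:
          - "9090:9090"
        volumes:
          - ./prometheus.yml:/etc/prometheus/prometheus.yml

    # prometheus.yml (sketch)
    scrape_configs:
      - job_name: localstack
        metrics_path: /_extension/metrics
        scrape_interval: 5s
        static_configs:
          - targets: ["localstack:4566"]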

ESM test output:

============================= slowest 10 durations =============================
63.30s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_sqs.py::test_fifo_message_group_parallelism
28.94s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_kinesis.py::TestKinesisSource::test_disable_kinesis_event_source_mapping
27.73s call     tests/aws/services/lambda_/event_source_mapping/test_cfn_resource.py::test_adding_tags
26.23s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_sqs.py::TestSQSEventSourceMapping::test_sqs_event_source_mapping_batching_window_size_override
25.88s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_kinesis.py::TestKinesisSource::test_kinesis_event_source_trim_horizon
20.43s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_sqs.py::test_report_batch_item_failures
20.25s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_sqs.py::test_redrive_policy_with_failing_lambda
19.87s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_kinesis.py::TestKinesisSource::test_kinesis_event_source_mapping_with_async_invocation
18.98s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_kinesis.py::TestKinesisSource::test_create_kinesis_event_source_mapping_multiple_lambdas_single_kinesis_event_stream
18.53s call     tests/aws/services/lambda_/event_source_mapping/test_lambda_integration_dynamodbstreams.py::TestDynamoDBEventSourceMapping::test_dynamodb_event_source_mapping
============ 93 passed, 3 skipped, 2 warnings in 989.20s (0:16:29) =============

Performance

  • Initially, the dimensionality and cardinality of the Prometheus metrics were much higher, which caused performance degradation in the internal Prometheus server (not in LocalStack request processing).
  • Some follow-up changes were made to optimise this (e.g. fewer buckets in histograms); see the sketch below.
  • No noticeable performance issues were observed in LocalStack as a result of this instrumentation. However, we should consider multiprocess mode if this becomes a problem.
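
To illustrate the bucket reduction: each histogram bucket becomes its own time series per label combination, so fewer, coarser buckets directly shrink what the internal server has to serve. The boundary values below are assumptions, not the actual follow-up change.

    from prometheus_client import Histogram

    # Fewer, coarser buckets (assumed values) keep cardinality low while
    # still distinguishing fast, typical, and slow polls. Each bucket adds
    # one *_bucket series per label combination.
    LOCALSTACK_POLL_EVENTS_DURATION_SECONDS = Histogram(
        "localstack_poll_events_duration_seconds",
        "Duration of a single poll_events call.",
        buckets=(0.01, 0.1, 1, 10),
    )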

@gregfurman gregfurman self-assigned this Mar 7, 2025
@dfangl (Member) left a comment


First review round :) I think this is already great work, I played around with the queries and it seems useful. Would be nice to simulate some load on ESMs, and try to create an example dashboard to show if the data we are collecting also tells us something in the bigger picture.

event_target = get_event_target_from_processor(self.processor)

try:
    current_time_epoch = time.perf_counter()
Member

Perf counter does not return the current time in any way, only relative values make sense with perf counter. I would rename this to "start" or something to be clearer.

Contributor Author

Going to rather just use a with statement here:

        with LOCALSTACK_POLL_EVENTS_DURATION_SECONDS.time():
            fn(self)

Comment thread prometheus/pyproject.toml
Member

I would not call the root module "prometheus" here - it might interfere with other installed packages in the future, and is not clear enough.
Let's just try localstack_prometheus_extension or something like this. Or even localstack_extension_prometheus_metrics if we want to match the pypi package name.

Comment thread prometheus/prometheus/instruments/poller.py Outdated
Comment thread prometheus/prometheus/instruments/util.py Outdated
Comment thread prometheus/prometheus/metrics/event_polling.py Outdated
Comment thread prometheus/prometheus/server.py Outdated
Comment thread prometheus/prometheus/server.py Outdated
Comment on lines +26 to +27
# Record the start time
context.start_time = time.perf_counter()
Member

Why record here, if we do not read it when there is no service_operation set? Shouldn't this be moved after the if?
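
A minimal sketch of the suggested reordering; the handler signature and the service_operation check are assumptions based on the diff context.

    import time

    def on_request(chain, context, response):
        # Skip requests that will never be measured, so we only pay for
        # perf_counter() when a service operation has been resolved.
        if not getattr(context, "service_operation", None):
            return
        # Record the start time after the guard, per the review comment
        context.start_time = time.perf_counter()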

Comment thread prometheus/prometheus/instruments/sender.py Outdated
).inc(total_events)

try:
    result = fn(self, original_events)
Member

Do we perhaps want to track the duration of the invocation as well, while we are at it? This would also be end to end for the lambda service (including startup delays etc.), in contrast to extracting it from the logs.

Contributor Author

Lambda invocation metrics are already available via the localstack_request_processing_seconds metric, since all internal calls are tracked. So we could view this with:

localstack_request_processing_seconds{operation="Invoke", service="lambda"}

If we're OK with some redundancy, then I'm happy to track this as well. Wdyt?

Member

localstack_request_processing_seconds is end to end (including polling) right? What metric would be only the invocation?

Contributor Author

I ended up going with the below:

with LOCALSTACK_PROCESS_EVENT_DURATION_SECONDS.labels(
    event_source=event_source, event_target=event_target
).time():
    result = fn(self, original_events)

@gregfurman gregfurman force-pushed the add/extension/prometheus branch from ee86b63 to 528fd60 on March 11, 2025 at 10:18
@gregfurman gregfurman requested a review from dfangl March 11, 2025 13:55
@dfangl (Member) left a comment

I think this is a good first iteration, and we can revisit certain parts. Looks good!

Comment thread prometheus/prometheus/instruments/poller.py Outdated
).inc(total_events)

try:
    result = fn(self, original_events)
Member

localstack_request_processing_seconds is end to end (including polling) right? What metric would be only the invocation?

Comment thread prometheus/pyproject.toml Outdated
@gregfurman (Contributor Author)

> localstack_request_processing_seconds is end to end (including polling) right? What metric would be only the invocation?

@dfangl localstack_process_event_duration_seconds tracks the total duration for which the event is processed by the target. In the case of an ESM, this is the invocation time of the target lambda.

Alternatively, we could look at localstack_request_processing_duration_seconds{service="lambda", operation="Invoke"} if we wanted to (generally) see the performance of our Lambda invocations being processed by the gateway.
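
For illustration, a percentile view of target processing time could be derived from the histogram's buckets. This is a hypothetical PromQL query; the metric and label names follow this thread, but the _bucket suffix and label set assume a standard prometheus_client histogram.

    histogram_quantile(
      0.95,
      sum by (le, event_target) (
        rate(localstack_process_event_duration_seconds_bucket[5m])
      )
    )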

@gregfurman gregfurman merged commit b0e5961 into main on Mar 11, 2025
