Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -133,3 +133,4 @@ dmypy.json
.vscode

node_modules/
.DS_Store
15 changes: 8 additions & 7 deletions prometheus/README.md
@@ -115,15 +115,16 @@ services:
- "./prometheus_config.yml:/etc/prometheus/prometheus.yml" # Assumes prometheus_config.yml exists in your CWD
```

## Available Metrics
## Metrics

The Prometheus extension exposes various LocalStack metrics through the `/_extension/metrics` endpoint, including:
- Request counts by service
- Request latencies
- Resource utilization
- Error rates
The Prometheus extension exposes various LocalStack and system metrics through the `/_extension/metrics` endpoint.

For a complete list of available metrics, visit the endpoint directly at `localhost.localstack.cloud:4566/_extension/metrics` when LocalStack is running.
For a complete list of available metrics, view the:
- [LocalStack Metrics documentation](./docs/localstack_metrics.md)
- [System Metrics documentation](./docs/system_metrics.md)
- Otherwise, visit the endpoint directly at `localhost.localstack.cloud:4566/_extension/metrics` when LocalStack is running.

We've also included a [collection of PromQL queries](./docs/event_analysis.md) that are useful for analyzing LocalStack event source mappings performance.

## Licensing

104 changes: 104 additions & 0 deletions prometheus/docs/event_analysis.md
@@ -0,0 +1,104 @@
# PromQL Queries for Event Processing Statistics

The following queries can be used to analyse the performance of LocalStack's event processing capabilities.

## Average Propagation Delay from Event Source to Poller

The average time a record waited before being processed, measured over the last 5 minutes. A high propagation delay indicates that our event pollers are taking too long to ingest new events from an event source.

```
rate(localstack_event_propagation_delay_seconds_sum[5m]) / rate(localstack_event_propagation_delay_seconds_count[5m])
```
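
The `rate(..._sum) / rate(..._count)` pattern used here (and in several queries below) yields the mean observation over the window. A minimal Python sketch of that arithmetic follows; the `HistogramSample` container and all numbers are made up for illustration, and it ignores the counter-reset handling and extrapolation that Prometheus's `rate()` performs.

```python
from dataclasses import dataclass


@dataclass
class HistogramSample:
    """One scrape of a histogram's `_sum` and `_count` series (hypothetical container)."""
    timestamp: float  # scrape time in seconds
    total: float      # value of the ..._sum series
    count: float      # value of the ..._count series


def average_over_window(first: HistogramSample, last: HistogramSample) -> float:
    """Mean observation between two scrapes: rate(sum) / rate(count)."""
    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    sum_rate = (last.total - first.total) / elapsed
    count_rate = (last.count - first.count) / elapsed
    if count_rate == 0:
        return float("nan")  # no observations in the window, so no mean exists
    return sum_rate / count_rate


# Made-up numbers: 40 records observed in 5 minutes, waiting 12 seconds in total.
start = HistogramSample(timestamp=0.0, total=100.0, count=500.0)
end = HistogramSample(timestamp=300.0, total=112.0, count=540.0)
print(average_over_window(start, end))  # roughly 0.3 seconds per record
```

The elapsed time cancels out of the division, which is why the result is an average per observation rather than a per-second figure.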

Example:
![Average Propagation Delay](images/avg_propagation_delay.png)

## Batch Efficiency

A ratio showing how efficiently our pollers retrieve records from an event source relative to their maximum batch size. A higher number indicates that batch sizes could be increased.

```
rate(localstack_batch_size_efficiency_ratio_sum[1m]) / rate(localstack_batch_size_efficiency_ratio_count[1m])
```

Example:
![Batch Efficiency Ratio](images/batch_efficiency_ratio.png)

## Records Per Poll

The average number of records pulled in per poll by an event poller, measured over the last minute. Used in conjunction with batch efficiency, it helps you interpret the performance of your batching configuration.

```
rate(localstack_records_per_poll_sum[1m]) / rate(localstack_records_per_poll_count[1m])
```

Example:

![Records Per Poll](images/records_per_poll.png)

## In-Flight Events

Gauges how many events are being processed by a target at a given point in time. If event processing is taking a long time, this is a good way of measuring back-pressure on the system.

```
localstack_in_flight_events
```

Example:
![In-Flight Events](images/in_flight_events.png)

## Event Processing Duration

The average duration of event processing by targets, measured over the last minute.

```
rate(localstack_process_event_duration_seconds_sum[1m]) / rate(localstack_process_event_duration_seconds_count[1m])
```

Example:

![Event Processing Duration](images/event_processing_duration.png)

## High Latency Event Processing

Retrieve the 95th percentile of processing times in a 5m interval grouped by LocalStack service and operation. Useful for analysing the tail-latency of event processing since this is likely where bottlenecks in performance start to show.

```
histogram_quantile(0.95, sum by(service, operation, le) (rate(localstack_request_processing_duration_seconds_bucket[5m])))
```
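
For intuition, `histogram_quantile` estimates a quantile by locating the bucket that contains the target rank and interpolating linearly inside it. The sketch below is a simplified Python rendition of that idea for classic cumulative buckets; the bucket bounds and counts are invented, and it leaves out details that Prometheus handles (for example negative bounds and native histograms).

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations at all
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # lowest bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate inside the +Inf bucket
            if count == prev_count:
                return bound  # empty first bucket with rank 0
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan  # unreachable for well-formed cumulative buckets


# Made-up per-second bucket rates for a single (service, operation) pair.
example = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
print(histogram_quantile(0.95, example))  # about 0.81 seconds
```

Because the estimate is interpolated, its precision depends on how finely the bucket boundaries are spaced around the tail.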

Example:
![High Latency Event Processing](images/high_latency_event_processing.png)

## Empty Poll Responses

The approximate number of empty poll responses per minute, averaged over the last 5 minutes.

```
rate(localstack_poll_miss_total[5m]) * 60
```
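
Since `rate()` returns a per-second average, the multiplier decides the unit of the result: `* 60` gives a per-minute figure, while `* 300` would approximate the total over the whole 5-minute window. A small Python sketch of that arithmetic, with made-up counter values and ignoring counter resets:

```python
def per_second_rate(first_value: float, last_value: float, window_seconds: float) -> float:
    """Average per-second increase of a counter across a window."""
    return (last_value - first_value) / window_seconds


# Made-up counter readings: 150 empty polls during a 300-second window.
rate = per_second_rate(first_value=1000, last_value=1150, window_seconds=300)

per_minute = rate * 60    # what the query above reports
per_window = rate * 300   # total across the full 5 minutes

print(rate, per_minute, per_window)
```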

Example:
![Empty Poll Responses](images/empty_poll_responses.png)

## Number of LocalStack Requests Processed

The average number of requests processed by the LocalStack gateway per minute, grouped by service (e.g. SQS) and operation (e.g. ReceiveMessage).

```
sum by(service, operation) (rate(localstack_request_processing_duration_seconds_count[1m]) * 60)
```
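
The `sum by(service, operation)` aggregation collapses every other label (such as `status` and `status_code`) into one value per service and operation pair. A rough Python equivalent, using invented per-second rates and hypothetical label values:

```python
from collections import defaultdict


def requests_per_minute(series: list[tuple[dict[str, str], float]]) -> dict[tuple[str, str], float]:
    """Sum per-second rates by (service, operation) and scale to per-minute."""
    totals: dict[tuple[str, str], float] = defaultdict(float)
    for labels, per_second in series:
        key = (labels["service"], labels["operation"])
        totals[key] += per_second * 60
    return dict(totals)


# Each entry is (labels, per-second rate); all values are illustrative only.
series = [
    ({"service": "sqs", "operation": "ReceiveMessage", "status": "success", "status_code": "200"}, 0.5),
    ({"service": "sqs", "operation": "ReceiveMessage", "status": "error", "status_code": "500"}, 0.25),
    ({"service": "lambda", "operation": "Invoke", "status": "success", "status_code": "200"}, 0.1),
]

for key, value in requests_per_minute(series).items():
    print(key, round(value, 2))
```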

Example:
![Requests Processed](images/requests_processed.png)

## In-Flight Requests Against LocalStack Gateway

Sums the sampled number of in-flight requests for the Kinesis, SQS, DynamoDB, and Lambda services over a 1-minute window. Useful for seeing how hard a given service is being hit and by which operation type.

```
sum_over_time(localstack_in_flight_requests{service=~"dynamodb|kinesis|sqs|lambda"}[1m])
```
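
Note that `sum_over_time` adds up every sample of the gauge inside the window, so its magnitude grows with the number of scrapes per minute. The Python sketch below illustrates this with made-up samples for a constant load of 4 in-flight requests; an average over the same samples is shown for comparison.

```python
def sum_over_time(samples: list[float]) -> float:
    """Sum of all gauge samples that fall inside the window."""
    return sum(samples)


def avg_over_time(samples: list[float]) -> float:
    """Mean of all gauge samples that fall inside the window."""
    return sum(samples) / len(samples)


# Same constant load, sampled at two assumed scrape intervals over one minute.
every_15s = [4.0] * 4    # 4 scrapes per minute
every_5s = [4.0] * 12    # 12 scrapes per minute

print(sum_over_time(every_15s), sum_over_time(every_5s))  # differs with scrape interval
print(avg_over_time(every_15s), avg_over_time(every_5s))  # stays the same
```

When comparing dashboards across environments with different scrape intervals, an average is therefore easier to interpret than a sum.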

Example:
![In-Flight Requests](images/in_flight_requests.png)
Binary file added prometheus/docs/images/avg_propagation_delay.png
Binary file added prometheus/docs/images/batch_efficiency_ratio.png
Binary file added prometheus/docs/images/empty_poll_responses.png
Binary file added prometheus/docs/images/in_flight_events.png
Binary file added prometheus/docs/images/in_flight_requests.png
Binary file added prometheus/docs/images/records_per_poll.png
Binary file added prometheus/docs/images/requests_processed.png
75 changes: 75 additions & 0 deletions prometheus/docs/localstack_metrics.md
@@ -0,0 +1,75 @@
# LocalStack Metrics

## LocalStack Core/Request Handling Metrics

`localstack_request_processing_duration_seconds`

- **Description:** Time spent processing LocalStack service requests. Measured in the handler chain as the duration from the first *request handler* to the final *response handler*.
- **Labels:** `service`, `operation`, `status`, `status_code`
- **Type:** histogram

`localstack_in_flight_requests`

- **Description:** Number of requests currently in flight. This is a live value, so what Prometheus records depends on the scraping interval.
- **Labels:** `service`, `operation`
- **Type:** gauge
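
To inspect these series without a Prometheus server, the text served by the `/_extension/metrics` endpoint can be parsed directly. The sketch below is a deliberately small parser for the Prometheus text format, not a full implementation; the sample payload, its label values, and its numbers are invented for illustration.

```python
import re

# name{label="value",...} value [timestamp]
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> list[tuple[str, dict[str, str], float]]:
    """Return (metric_name, labels, value) for each sample line, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments carry no samples
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified pattern cannot read
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Invented payload in the shape of a metrics response.
payload = """\
# HELP localstack_in_flight_requests Total number of currently in-flight requests
# TYPE localstack_in_flight_requests gauge
localstack_in_flight_requests{service="sqs",operation="ReceiveMessage"} 3.0
localstack_poll_miss_total{event_source="sqs",event_target="lambda"} 12.0
"""

for name, labels, value in parse_samples(payload):
    print(name, labels, value)
```

In practice the payload would come from an HTTP GET against the endpoint while LocalStack is running.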

## LocalStack Event Poll Operation Metrics

`localstack_records_per_poll`

- **Description:** Number of records/events received in each poll operation
- **Labels:** `event_source`, `event_target`
- **Type:** histogram

`localstack_poll_events_duration_seconds`

- **Description:** Duration of each poll call in seconds
- **Labels:** `event_source`, `event_target`
- **Type:** histogram

`localstack_poll_miss_total`

- **Description:** Count of poll events with empty responses
- **Labels:** `event_source`, `event_target`
- **Type:** counter

`localstack_batch_size_efficiency_ratio`

- **Description:** Ratio of records received to configured maximum batch size
- **Labels:** `event_source`, `event_target`
- **Type:** histogram
- **Note:** This is useful for finding whether the configured batch size is efficiently pulling records. A higher number indicates that a configured `BatchSize` could be increased.
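
As a rough illustration of the ratio described above, the following Python sketch divides the records returned by one poll by the configured maximum. The function name is hypothetical and the exact computation inside the extension is assumed from the description, not confirmed.

```python
def batch_size_efficiency(records_received: int, max_batch_size: int) -> float:
    """Fraction of the configured maximum batch size that one poll actually filled."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    return records_received / max_batch_size


# Made-up polls against a source configured with a maximum batch size of 10.
print(batch_size_efficiency(10, 10))  # full batch
print(batch_size_efficiency(2, 10))   # mostly empty batch
```

Values that stay close to 1.0 suggest the poller regularly fills its batches.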

`localstack_batch_window_efficiency_ratio` (Not currently instrumented)

- **Description:** Ratio of poll duration to configured maximum batch window length
- **Labels:** `event_source`, `event_target`
- **Type:** histogram
- **Note:** Measures what proportion of the configured maximum batch window (set by `MaximumBatchingWindowInSeconds`) was actually used before returning. A lower ratio indicates that events were received quickly without needing to wait for the full window duration and that a window could be decreased.

## LocalStack Event Processing Metrics

`localstack_processed_events_total`

- **Description:** Total number of events processed
- **Labels:** `event_source`, `event_target`, `status`
- **Type:** counter

`localstack_in_flight_events`

- **Description:** Total number of event batches currently being processed by the target
- **Labels:** `event_source`, `event_target`
- **Type:** gauge

`localstack_event_propagation_delay_seconds`

- **Description:** End-to-end latency between event creation and processing
- **Labels:** `event_source`, `event_target`
- **Type:** histogram

`localstack_event_processing_errors_total`

- **Description:** Total number of event processing errors
- **Labels:** `event_source`, `event_target`, `error_type`
- **Type:** counter
67 changes: 67 additions & 0 deletions prometheus/docs/system_metrics.md
@@ -0,0 +1,67 @@
# System-level Metrics

## Garbage Collection Metrics

`_gc_objects_collected_total`

- **Description:** Number of objects collected during garbage collection
- **Labels:** `generation`
- **Type:** counter

`_gc_objects_uncollectable_total`

- **Description:** Number of uncollectable objects found during garbage collection
- **Labels:** `generation`
- **Type:** counter

`_gc_collections_total`

- **Description:** Number of times this generation was collected
- **Labels:** `generation`
- **Type:** counter

## Environment Metrics

`_info`

- **Description:** Platform information
- **Labels:** `implementation`, `major`, `minor`, `patchlevel`, `version`
- **Type:** gauge

## Process Metrics

`process_virtual_memory_bytes`

- **Description:** Virtual memory size in bytes
- **Labels:** none
- **Type:** gauge

`process_resident_memory_bytes`

- **Description:** Resident memory size in bytes
- **Labels:** none
- **Type:** gauge

`process_start_time_seconds`

- **Description:** Start time of the process since unix epoch in seconds
- **Labels:** none
- **Type:** gauge

`process_cpu_seconds_total`

- **Description:** Total user and system CPU time spent in seconds
- **Labels:** none
- **Type:** counter

`process_open_fds`

- **Description:** Number of open file descriptors
- **Labels:** none
- **Type:** gauge

`process_max_fds`

- **Description:** Maximum number of open file descriptors
- **Labels:** none
- **Type:** gauge