diff --git a/.gitignore b/.gitignore index f6f89d89..94664367 100644 --- a/.gitignore +++ b/.gitignore @@ -133,3 +133,4 @@ dmypy.json .vscode node_modules/ +.DS_Store \ No newline at end of file diff --git a/prometheus/README.md b/prometheus/README.md index 804ed856..c1c2d6aa 100644 --- a/prometheus/README.md +++ b/prometheus/README.md @@ -115,15 +115,16 @@ services: - "./prometheus_config.yml:/etc/prometheus/prometheus.yml" # Assumes prometheus_config.yml exists in your CWD ``` -## Available Metrics +## Metrics -The Prometheus extension exposes various LocalStack metrics through the `/_extension/metrics` endpoint, including: -- Request counts by service -- Request latencies -- Resource utilization -- Error rates +The Prometheus extension exposes various LocalStack and system metrics through the `/_extension/metrics` endpoint. -For a complete list of available metrics, visit the endpoint directly at `localhost.localstack.cloud:4566/_extension/metrics` when LocalStack is running. +For a complete list of available metrics, see: +- [LocalStack Metrics documentation](./docs/localstack_metrics.md) +- [System Metrics documentation](./docs/system_metrics.md) +- Alternatively, visit the endpoint directly at `localhost.localstack.cloud:4566/_extension/metrics` when LocalStack is running. + +We've also included a [collection of PromQL queries](./docs/event_analysis.md) that are useful for analyzing the performance of LocalStack event source mappings. ## Licensing diff --git a/prometheus/docs/event_analysis.md b/prometheus/docs/event_analysis.md new file mode 100644 index 00000000..17a95058 --- /dev/null +++ b/prometheus/docs/event_analysis.md @@ -0,0 +1,104 @@ +# PromQL Queries for Event Processing Statistics + +The following queries can be used to analyse the performance of LocalStack's event processing capabilities. + +## Average Propagation Delay from Event Source to Poller + +The average amount of time a record has to wait before being processed during the last 5 minutes. 
A high propagation delay indicates that our event pollers are taking too long to ingest new events from an event source. + +``` +rate(localstack_event_propagation_delay_seconds_sum[5m]) / rate(localstack_event_propagation_delay_seconds_count[5m]) +``` + +Example: +![Average Propagation Delay](images/avg_propagation_delay.png) + +## Batch Efficiency + +A ratio showing how efficiently our pollers retrieve records from an event source relative to their maximum batch size. A higher number indicates that batch sizes could be increased. + +``` +rate(localstack_batch_size_efficiency_ratio_sum[1m]) / rate(localstack_batch_size_efficiency_ratio_count[1m]) +``` + +Example: +![Batch Efficiency Ratio](images/batch_efficiency_ratio.png) + +## Records Per Poll + +The average number of records pulled in per poll operation over the last minute. Used in conjunction with batch efficiency, it lets you assess the performance of your batching configuration. + +``` +rate(localstack_records_per_poll_sum[1m]) / rate(localstack_records_per_poll_count[1m]) +``` + +Example: + +![Records Per Poll](images/records_per_poll.png) + +## In-Flight Events + +Gauges how many events are currently being processed by a target at a given point in time. If event processing is taking a long time, this is a good way of measuring back-pressure on the system. + +``` +localstack_in_flight_events +``` + +Example: +![In-Flight Events](images/in_flight_events.png) + +## Event Processing Duration + +The average duration of event processing by targets over the last minute. + +``` +rate(localstack_process_event_duration_seconds_sum[1m]) / rate(localstack_process_event_duration_seconds_count[1m]) +``` + +Example: + +![Event Processing Duration](images/event_processing_duration.png) + +## High Latency Event Processing + +Retrieves the 95th percentile of processing times over a 5-minute window, grouped by LocalStack service and operation. 
Useful for analysing the tail latency of event processing, since this is likely where performance bottlenecks start to show. + +``` +histogram_quantile(0.95, sum by(service, operation, le) (rate(localstack_request_processing_duration_seconds_bucket[5m]))) +``` + +Example: +![High Latency Event Processing](images/high_latency_event_processing.png) + +## Empty Poll Responses + +The approximate number of empty poll responses per minute, averaged over the last 5 minutes. + +``` +rate(localstack_poll_miss_total[5m]) * 60 +``` + +Example: +![Empty Poll Responses](images/empty_poll_responses.png) + +## Number of LocalStack Requests Processed + +The average number of requests processed by the LocalStack gateway per minute. This is grouped by service (e.g. SQS) and operation (e.g. ReceiveMessage). + +``` +sum by(service, operation) (rate(localstack_request_processing_duration_seconds_count[1m]) * 60) +``` + +Example: +![Requests Processed](images/requests_processed.png) + +## In-Flight Requests Against LocalStack Gateway + +Sums the in-flight request gauge samples for the Kinesis, SQS, DynamoDB, and Lambda services over a one-minute window, so the value depends on the scrape interval. Useful for seeing how hard a given service is being hit and by which operation type. 
+ +``` +sum_over_time(localstack_in_flight_requests{service=~"dynamodb|kinesis|sqs|lambda"}[1m]) +``` + +Example: +![In-Flight Requests](images/in_flight_requests.png) \ No newline at end of file diff --git a/prometheus/docs/images/avg_propagation_delay.png b/prometheus/docs/images/avg_propagation_delay.png new file mode 100644 index 00000000..9e5b0748 Binary files /dev/null and b/prometheus/docs/images/avg_propagation_delay.png differ diff --git a/prometheus/docs/images/batch_efficiency_ratio.png b/prometheus/docs/images/batch_efficiency_ratio.png new file mode 100644 index 00000000..9c1ae296 Binary files /dev/null and b/prometheus/docs/images/batch_efficiency_ratio.png differ diff --git a/prometheus/docs/images/empty_poll_responses.png b/prometheus/docs/images/empty_poll_responses.png new file mode 100644 index 00000000..e34a7d80 Binary files /dev/null and b/prometheus/docs/images/empty_poll_responses.png differ diff --git a/prometheus/docs/images/event_processing_duration.png b/prometheus/docs/images/event_processing_duration.png new file mode 100644 index 00000000..efd93143 Binary files /dev/null and b/prometheus/docs/images/event_processing_duration.png differ diff --git a/prometheus/docs/images/high_latency_event_processing.png b/prometheus/docs/images/high_latency_event_processing.png new file mode 100644 index 00000000..4c3edf3c Binary files /dev/null and b/prometheus/docs/images/high_latency_event_processing.png differ diff --git a/prometheus/docs/images/in_flight_events.png b/prometheus/docs/images/in_flight_events.png new file mode 100644 index 00000000..58e3a1bd Binary files /dev/null and b/prometheus/docs/images/in_flight_events.png differ diff --git a/prometheus/docs/images/in_flight_requests.png b/prometheus/docs/images/in_flight_requests.png new file mode 100644 index 00000000..78342ebb Binary files /dev/null and b/prometheus/docs/images/in_flight_requests.png differ diff --git a/prometheus/docs/images/records_per_poll.png 
b/prometheus/docs/images/records_per_poll.png new file mode 100644 index 00000000..f80a901c Binary files /dev/null and b/prometheus/docs/images/records_per_poll.png differ diff --git a/prometheus/docs/images/requests_processed.png b/prometheus/docs/images/requests_processed.png new file mode 100644 index 00000000..8152bbe9 Binary files /dev/null and b/prometheus/docs/images/requests_processed.png differ diff --git a/prometheus/docs/localstack_metrics.md b/prometheus/docs/localstack_metrics.md new file mode 100644 index 00000000..96e74455 --- /dev/null +++ b/prometheus/docs/localstack_metrics.md @@ -0,0 +1,75 @@ +# LocalStack Metrics + +## LocalStack Core/Request Handling Metrics + +`localstack_request_processing_duration_seconds` + +- **Description:** Time spent processing LocalStack service requests. This is measured at the handler chain, as the duration from the first *request handler* to the final *response handler*. +- **Labels:** `service`, `operation`, `status`, `status_code` +- **Type:** histogram + +`localstack_in_flight_requests` + +- **Description:** Total number of currently in-flight requests. This is a live value, so what you observe depends on the scrape interval. 
+- **Labels:** `service`, `operation` +- **Type:** gauge + +## LocalStack Event Poll Operation Metrics + +`localstack_records_per_poll` + +- **Description:** Number of records/events received in each poll operation +- **Labels:** `event_source`, `event_target` +- **Type:** histogram + +`localstack_poll_events_duration_seconds` + +- **Description:** Duration of each poll call in seconds +- **Labels:** `event_source`, `event_target` +- **Type:** histogram + +`localstack_poll_miss_total` + +- **Description:** Count of poll events with empty responses +- **Labels:** `event_source`, `event_target` +- **Type:** counter + +`localstack_batch_size_efficiency_ratio` + +- **Description:** Ratio of records received to configured maximum batch size +- **Labels:** `event_source`, `event_target` +- **Type:** histogram +- **Note:** This is useful for checking whether the configured batch size pulls records efficiently. A higher number indicates that the configured `BatchSize` could be increased. + +`localstack_batch_window_efficiency_ratio` (Not currently instrumented) + +- **Description:** Ratio of poll duration to configured maximum batch window length +- **Labels:** `event_source`, `event_target` +- **Type:** histogram +- **Note:** Measures what proportion of the configured maximum batch window (set by `MaximumBatchingWindowInSeconds`) was actually used before returning. A lower ratio indicates that events were received quickly without needing to wait for the full window duration, so the window could be decreased.
+ +## LocalStack Event Processing Metrics + +`localstack_processed_events_total` + +- **Description:** Total number of events processed +- **Labels:** `event_source`, `event_target`, `status` +- **Type:** counter + +`localstack_in_flight_events` + +- **Description:** Total number of event batches currently being processed by the target +- **Labels:** `event_source`, `event_target` +- **Type:** gauge + +`localstack_event_propagation_delay_seconds` + +- **Description:** End-to-end latency between event creation and processing +- **Labels:** `event_source`, `event_target` +- **Type:** histogram + +`localstack_event_processing_errors_total` + +- **Description:** Total number of event processing errors +- **Labels:** `event_source`, `event_target`, `error_type` +- **Type:** counter diff --git a/prometheus/docs/system_metrics.md b/prometheus/docs/system_metrics.md new file mode 100644 index 00000000..97b8c810 --- /dev/null +++ b/prometheus/docs/system_metrics.md @@ -0,0 +1,67 @@ +# System-level Metrics + +## Garbage Collection Metrics + +`_gc_objects_collected_total` + +- **Description:** Number of objects collected during garbage collection +- **Labels:** `generation` +- **Type:** counter + +`_gc_objects_uncollectable_total` + +- **Description:** Number of uncollectable objects found during garbage collection +- **Labels:** `generation` +- **Type:** counter + +`_gc_collections_total` + +- **Description:** Number of times this generation was collected +- **Labels:** `generation` +- **Type:** counter + +## Environment Metrics + +`_info` + +- **Description:** Platform information +- **Labels:** `implementation`, `major`, `minor`, `patchlevel`, `version` +- **Type:** gauge + +## Process Metrics + +`process_virtual_memory_bytes` + +- **Description:** Virtual memory size in bytes +- **Labels:** none +- **Type:** gauge + +`process_resident_memory_bytes` + +- **Description:** Resident memory size in bytes +- **Labels:** none +- **Type:** gauge + +`process_start_time_seconds` + 
+- **Description:** Start time of the process since unix epoch in seconds +- **Labels:** none +- **Type:** gauge + +`process_cpu_seconds_total` + +- **Description:** Total user and system CPU time spent in seconds +- **Labels:** none +- **Type:** counter + +`process_open_fds` + +- **Description:** Number of open file descriptors +- **Labels:** none +- **Type:** gauge + +`process_max_fds` + +- **Description:** Maximum number of open file descriptors +- **Labels:** none +- **Type:** gauge
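
Several queries in `prometheus/docs/event_analysis.md` share one pattern: the rate of a histogram's `_sum` series divided by the rate of its `_count` series. The following Python sketch mirrors that arithmetic for two scrapes of a single histogram. The `HistogramScrape` type, the `average_between` helper, and the sample values are hypothetical and not part of the extension. The sketch also ignores counter resets and the extrapolation that Prometheus applies inside `rate()`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HistogramScrape:
    """One scrape of a histogram's cumulative `_sum` and `_count` series (hypothetical type)."""

    timestamp_s: float  # scrape time in seconds
    sum_value: float    # cumulative sum of all observations so far
    count_value: float  # cumulative number of observations so far


def average_between(first: HistogramScrape, last: HistogramScrape) -> Optional[float]:
    """Mean observation between two scrapes: the rate of `_sum` divided by the rate of `_count`."""
    elapsed = last.timestamp_s - first.timestamp_s
    if elapsed <= 0:
        raise ValueError("scrapes must be in chronological order")

    # Per-second increase of each cumulative series over the window.
    sum_rate = (last.sum_value - first.sum_value) / elapsed
    count_rate = (last.count_value - first.count_value) / elapsed

    if count_rate == 0:
        # No observations in the window; PromQL would yield NaN (0 / 0) here.
        return None
    return sum_rate / count_rate


if __name__ == "__main__":
    # Made-up values: 60 records observed over 5 minutes, totalling 30 seconds of delay.
    first = HistogramScrape(timestamp_s=0.0, sum_value=12.0, count_value=40.0)
    last = HistogramScrape(timestamp_s=300.0, sum_value=42.0, count_value=100.0)
    print(average_between(first, last))
```

Because the elapsed time cancels out in the division, the result is the average value per observation in the window, not a per-minute quantity.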