Add AWS CloudWatch alerting for RDS, NAT Gateway, and Load Balancers by ian-flores · Pull Request #139 · posit-dev/ptd

ian-flores · 2026-02-23T21:10:35Z

Description

Add cloud-level Grafana alerting for AWS resources that were previously unmonitored.
PTD provisions significant AWS infrastructure but CloudWatch alerting only covered
FSx and EC2 network metrics. This closes the most critical gaps.

Issue

ptd-config#2778

New alert rules

Resource	Alert	Threshold	Duration
RDS	CPU Utilization High	>80%	10m
RDS	Free Storage Low	<5 GiB	5m
RDS	Freeable Memory Low	<256 MiB	10m
RDS	Read Latency High	>100ms	10m
RDS	Database Connections High	>80	5m
NAT Gateway	Port Allocation Errors	>0	1m
NAT Gateway	Packets Dropped	>100	5m
ALB	Target 5XX Errors High	>10	5m
ALB	Unhealthy Targets	>0	5m
NLB	Unhealthy Targets	>0	5m
ALB	Response Latency High	>2s	10m

Other changes

Add WriteLatency and Deadlocks to RDS Alloy scraping (dashboard visibility)
Add CloudWatch discovery blocks for NAT Gateway, ALB, and NLB
Tag ALBs/NLBs with PTD workload identifiers via LB Controller annotations
so Alloy discovery can scope metrics per workload
Fix existing cloudwatch.yaml threshold query.params referencing non-existent refId

Verification

just test-python-pulumi passes (111 tests)
Deploy to staging, verify metrics arrive in Mimir, check alert rules appear in
Grafana → Alerting → "Posit Alerts" folder

Category of change

New feature (non-breaking change which adds functionality)

Add alert rules for RDS (CPU, storage, memory, latency, connections), NAT Gateway (port allocation errors, dropped packets), and ALB/NLB (5XX errors, unhealthy targets, response latency). Add WriteLatency and Deadlocks metrics to RDS Alloy scraping, and add new CloudWatch discovery blocks for NAT Gateway, ALB, and NLB. Ref: ptd-config#2778

Add alb.ingress.kubernetes.io/tags annotation to workload ALBs and aws-load-balancer-additional-resource-tags to control room NLBs so Alloy CloudWatch discovery can scope metrics to the correct workload via search_tags instead of scraping all LBs in the region.

Fix RDS DatabaseConnections statistic from Sum to Average (it's a gauge, not a counter). Fix query.params referencing non-existent refId C in all alert YAML threshold conditions (including existing cloudwatch.yaml). Make NLB tagging unconditional to match ALB path. Add input validation helpers for LB tag annotation strings. Document NAT Gateway tag inheritance from VPC and dashboard-only metrics.

All tests pass. Here's a summary of the changes: --- Changes: - Extract `format_lb_tags` into `pulumi_resources/lib.py` to eliminate duplication between `aws_workload_helm.py` and `traefik.py` - Replace `_format_lb_tags` in `aws_workload_helm.py` with import from `lib.py` - Replace `_format_nlb_tags` in `traefik.py` with import from `lib.py` - Fix `KeyError` risk in `traefik.py`: use `.get()` with explicit `ValueError` if required cluster tags are missing - Raise `rds_database_connections_high` threshold from 80 to 500 with explanatory comment referencing instance type max - Update delete instructions in `rds.yaml`, `loadbalancer.yaml`, and `natgateway.yaml` to list all rule UIDs

All tests pass. Here's the summary: Changes: - Add empty-string validation for keys and values in `format_lb_tags` to prevent silent `key=` tags rejected by AWS LB Controller - Add unit tests for `format_lb_tags` covering normal output, comma/equals error paths, and empty-string edge cases - Raise `rds_freeable_memory_low` threshold from 256 MiB to 512 MiB to reduce noise on healthy mid-sized instances; update alert description to match - Update `rds_database_connections_high` comment to document that the 500-connection threshold is calibrated for `db.r5.large` and will never fire for small instances like `db.t3.small` - Add comment in `aws_workload_helm.py` documenting that `true_name`, `environment`, and `compound_name` are plain strings loaded from YAML config, not Pulumi Outputs

All 120 tests pass. Here's the summary: Changes: - Add empty-dict guard in `format_lb_tags`: raises `ValueError("tags must not be empty")` when called with `{}` - Add `test_format_lb_tags_empty_dict` test case covering the new guard - Add `self.cluster.tags is None` guard in `traefik.py` before calling `.get()`, preventing `AttributeError` from masking the descriptive `ValueError` - Add calibration comments to `rds_free_storage_low` (5 GiB) and `rds_freeable_memory_low` (512 MiB) thresholds in `rds.yaml` noting they are sized for specific instance classes and must be adjusted for others

Changes: - Extend `nat_gateway_port_allocation_errors` alert `for` duration from `1m` to `5m` to reduce transient noise (consistent with other NAT alert) - Add space validation to `format_lb_tags()` for both keys and values, since AWS tag keys must not contain spaces - Add `test_format_lb_tags_space_in_key` and `test_format_lb_tags_space_in_value` test cases for the new validation - Document in `rds_database_connections_high` description annotation that the 500-connection threshold will not fire for small instances (e.g. db.t3.small) - Add comment in `traefik.py` clarifying that `cluster.name`, `true_name`, and `environment` are plain `str` values (not Pulumi Outputs), so `format_lb_tags()` checks work correctly at graph-construction time

All 130 tests pass. Here's the summary: Changes: - `lib.py`: Extended whitespace validation in `format_lb_tags` to reject tabs (`\t`) and newlines (`\n`) in addition to spaces, for both keys and values - `traefik.py`: Extracted inline NLB tag-building logic from `_deploy()` into a module-level `_build_nlb_tag_string(tags, cluster_name)` function - `test_lib.py`: Updated space-check test match strings to `"whitespace"` (matching new message) and added 4 new tests for tab/newline in keys and values - `tests/test_traefik_nlb_tags.py`: New file with 4 tests covering the happy path, `tags is None`, missing `posit.team/true-name`, and missing `posit.team/environment`

All tests pass. Here's the summary: Changes: - `traefik.py`: Split the combined `true_name`/`environment` null check into separate checks so the error message identifies exactly which required tag is missing - `test_traefik_nlb_tags.py`: Added test covering invalid `cluster_name` (comma in name) passed to `_build_nlb_tag_string` - `grafana_alloy.py`: Added comments to ALB and NLB discovery blocks documenting that pre-existing LBs won't be discovered until the cluster is redeployed (migration gap)

All 131 tests pass. Changes: - Add `match="comma or equals"` to `test_build_nlb_tag_string_invalid_cluster_name` so the test verifies the specific error from `format_lb_tags`, not just any `ValueError` - Add FIXME comments in `grafana_alloy.py` for both ALB and NLB discovery blocks with a concrete AWS CLI command for tagging existing load balancers without redeploying

Changes: - `lib.py`: Add AWS tag length validation — raise `ValueError` when key > 128 chars or value > 256 chars, surfacing violations at deploy time with a clear message instead of an opaque AWS API error - `tests/test_lib.py`: Add tests for key and value length limit validation - `grafana_alloy.py`: Add migration comment on `DatabaseConnections` stat change noting the Prometheus metric name changed from `_sum` to `_average` and that dashboards querying the old name need to be updated

All 133 tests pass. The Go build also succeeds. Changes: - `natgateway.yaml`: Add calibration comment to `nat_gateway_packets_dropped` threshold of 100 packets, noting it may need adjustment for high-throughput gateways - `lib.py`: Update `format_lb_tags` docstring to explicitly document that whitespace rejection in tag values is a deliberate constraint (not an AWS limit), since AWS tag values do permit spaces

All 134 tests pass (including the new `test_format_lb_tags_aws_reserved_prefix` test). Changes: - Add `aws:` reserved prefix validation to `format_lb_tags` in `lib.py`, surfacing AWS API errors at graph-construction time - Add test for `aws:` prefix rejection in `test_lib.py` - Collect both `Sum` and `Average` statistics for `DatabaseConnections` in `grafana_alloy.py` to preserve the `_sum` metric during dashboard migration

All 134 tests pass. Changes: - `rds.yaml`: Add instance-class caveat to `rds_freeable_memory_low` description to warn about false positives on small instances (e.g. db.t3.micro, db.t3.small) - `lib.py`: Improve error message for None/empty tag values from "must not be empty" to "must not be None or empty" - `grafana_alloy.py`: Change `UnHealthyHostCount` collection period from `1m` to `5m` for both ALB and NLB to match the alert group interval - `test_lib.py`: Update `test_format_lb_tags_empty_value` assertion to match the updated error message

All 135 tests pass. Changes: - Replaced exact-string assertion in `test_build_nlb_tag_string_happy_path` with a parsed key-value dict assertion to avoid coupling the test to dict insertion order - Added `test_build_nlb_tag_string_empty_cluster_name` to cover the empty `cluster_name=""` edge case, which raises `ValueError` with "must not be None or empty"

All changes verified. 136 tests pass. --- Changes: - Increase `alb_unhealthy_targets` and `nlb_unhealthy_targets` alert `for` from `5m` to `10m` to reduce false positives during rolling deployments/node drains - Add `"\r"` (carriage return) to whitespace rejection checks in `format_lb_tags()` for both keys and values - Add test `test_build_nlb_tag_string_extra_tags_are_dropped` to document that extra tags passed to `_build_nlb_tag_string` are intentionally discarded

All tests pass. Here's a summary of the changes made: Changes: - Fix `traefik.py` `_build_nlb_tag_string` to use `msg = ...; raise ValueError(msg)` pattern consistently with `lib.py` (finding 4) - Extract `_build_alb_tag_string(true_name, environment, compound_name)` module-level helper from `_define_ingress_alb_annotations` in `aws_workload_helm.py` (finding 3) - Update `_define_ingress_alb_annotations` to call the new `_build_alb_tag_string` helper - Add `tests/test_alb_tags.py` with 4 tests covering happy path, tag key presence, invalid compound_name, and empty compound_name (finding 3)

Changes: - Set `noDataState: Alerting` on `rds_free_storage_low` and `rds_freeable_memory_low` in `rds.yaml` so metric scraping failures surface as alerts rather than silently hiding - Set `noDataState: Alerting` on `nat_gateway_port_allocation_errors` in `natgateway.yaml` for the same reason - Added comment on the `alb_target_5xx_errors_high` threshold clarifying it is an absolute count (not an error rate) and noting its traffic-volume sensitivity - Added `test_build_nlb_tag_string_invalid_true_name_value` and `test_build_nlb_tag_string_invalid_environment_value` tests to `test_traefik_nlb_tags.py` to cover invalid characters in tag values extracted from cluster tags - Changed `test_format_lb_tags_normal` in `test_lib.py` to use round-trip parse instead of exact insertion-order string assertion - Added a `TODO` comment to the `DatabaseConnections` dual-collection block in `grafana_alloy.py` to track the pending cleanup

Changes: - Validate `true_name` against `[a-zA-Z0-9._-]+` before interpolating into Alloy River config f-strings in `grafana_alloy.py`, preventing malformed config from special characters - Change `nat_gateway_packets_dropped` `noDataState` from `NoData` to `Alerting` in `natgateway.yaml` to match `nat_gateway_port_allocation_errors` (both use the same metric source) - Add `\r` (carriage return) test cases for both key and value in `test_lib.py` to complete whitespace rejection coverage

All 149 tests pass. Here's a summary of the changes: Changes: - `natgateway.yaml`: Revert both NAT gateway rules from `noDataState: Alerting` to `noDataState: NoData` to match RDS/ALB/NLB rules and prevent false positives on exporter restarts - `rds.yaml`: Add `instance_size_dependent: "true"` label to `rds_freeable_memory_low` so operators can create targeted silences for known-small instance classes - `grafana_alloy.py`: Extract `_validate_alloy_true_name()` module-level function from the inline `re.match` check in `__init__` - `test_grafana_alloy.py`: Add `TestValidateAlloyTrueName` class covering valid names and rejections of `"`, `{`, `}`, and space characters - `grafana_alloy.py`: Replace vague "TODO: Open a tracking issue" with an actionable comment describing exactly what condition triggers cleanup and the cost implication

All tests pass. Here's a summary: Changes: - Confirmed `import re` is already present in `grafana_alloy.py` (no change needed) - Added `noDataState` rationale comments to all 5 rules in `rds.yaml` (Alerting for storage/memory exhaustion; NoData for performance metrics) - Changed `nat_gateway_port_allocation_errors` in `natgateway.yaml` from `noDataState: NoData` to `noDataState: Alerting` to match its "critical" annotation - Added boundary tests `test_format_lb_tags_key_at_limit` (128 chars) and `test_format_lb_tags_value_at_limit` (256 chars) to `test_lib.py` - Added `test_empty_string_rejected` test to `TestValidateAlloyTrueName` in `test_grafana_alloy.py` - Added `# TODO: Open a GitHub issue to track removal of Sum once dashboards are updated.` to the `DatabaseConnections` dual-statistics comment in `grafana_alloy.py`

All 154 tests pass. Changes: - Added `instance_size_dependent: "true"` label to `rds_database_connections_high` in `rds.yaml` for consistency with `rds_freeable_memory_low`, enabling operators to silence the alert for known-small instance classes - Added `test_build_alb_tag_string_invalid_true_name_value` and `test_build_alb_tag_string_invalid_environment_value` tests to `test_alb_tags.py` to match the invalidity coverage already present in `test_traefik_nlb_tags.py`

All tests pass. Here's a summary of the changes made: Changes: - Paused `rds_freeable_memory_low` and `rds_database_connections_high` alerts (`isPaused: true`) until Grafana silence/Alertmanager inhibit config is in place for the `instance_size_dependent` label - Changed `nat_gateway_port_allocation_errors` `noDataState` from `Alerting` to `NoData` to prevent spurious pages during scrape outages, with a comment to add a dedicated "CloudWatch scrape down" alert - Extracted CloudWatch config generation from `_define_config_map` into a new `_define_cloudwatch_config` method to enable unit testing - Replaced the floating TODO comment about the DatabaseConnections Sum cost with a NOTE that is less likely to be overlooked - Added `TestDefineCloudwatchConfig` tests asserting that `AWS/NATGateway`, `AWS/ApplicationELB`, and `AWS/NetworkELB` discovery blocks appear in the rendered config for AWS, and that non-AWS returns an empty string

All 160 tests pass and no new lint issues were introduced. --- Changes: - Validate `compound_name` in `_define_cloudwatch_config` with `_validate_alloy_true_name` to close the injection gap alongside `true_name` - Add test `test_invalid_true_name_raises_value_error` and `test_invalid_compound_name_raises_value_error` in `TestDefineCloudwatchConfig` to cover the validation code path - Add provisioning false-positive note to `rds_free_storage_low` annotation documenting expected `noDataState=Alerting` behavior during CloudWatch scrape initialization - Replace open-ended "Create a GitHub issue" comment with `TODO:` prefix on the `DatabaseConnections` Sum metric to make it searchable - Fix line length of `_make_alloy_for_cloudwatch` test helper signature (split across multiple lines)

All 163 tests pass and Go builds cleanly. Changes: - `tests/test_lib.py`: Add test for `format_lb_tags` with `/` in key, exercising the real production key format (`posit.team/true-name`) - `tests/test_grafana_alloy.py`: Strengthen `TestDefineCloudwatchConfig` tests with assertions that interpolated `true_name` and `compound_name` values appear in the correct `search_tags` positions; add two dedicated tests for these - `aws_workload_helm.py`: Add docstring comment to `_build_alb_tag_string` clarifying that `format_lb_tags` validates for annotation safety only, and that Alloy River injection safety is enforced separately by `_validate_alloy_true_name` in `grafana_alloy.py` - `grafana_alerts/loadbalancer.yaml`, `natgateway.yaml`, `rds.yaml`: Add comment documenting that `{{$labels.cluster}}` in alert annotations is injected by the `prometheus.relabel.default` block in Alloy config, and will be blank if relabeling is absent

Set SupportsManagedByCondition to false for all EventBridge actions. The managed-by tag condition prevents managing EventBridge rules that don't already have the posit.team/managed-by tag, blocking operations like DescribeRule and TagResource. Account-scoping is sufficient.

ian-flores · 2026-02-25T20:35:33Z

Staging Verification

Deployed to ganso01-staging and main01-staging (control room).

Workload (`ganso01-staging`)

Ran ptd ensure ganso01-staging --only-steps helm --auto-apply
Alloy ConfigMap updated with new CloudWatch discovery blocks (NAT Gateway, ALB, NLB) and additional RDS metrics (WriteLatency, Deadlocks)
Ingress updated with alb.ingress.kubernetes.io/tags annotation — ALB now tagged with posit.team/true-name, posit.team/environment, and Name

Control Room (`main01-staging`)

Ran ptd ensure main01-staging --only-steps cluster --auto-apply
All alert configmaps deployed to Grafana namespace:

NAME                          DATA   AGE
grafana-applications-alerts   1      5d2h
grafana-cloudwatch-alerts     1      2m56s
grafana-healthchecks-alerts   1      20d
grafana-loadbalancer-alerts   1      3m1s
grafana-natgateway-alerts     1      3m2s
grafana-nodes-alerts          1      264d
grafana-pods-alerts           1      2m57s
grafana-rds-alerts            1      3m1s

EventBridge IAM fix

The clusters step on ganso01-staging was failing due to EventBridge SupportsManagedByCondition being set to true — this created a chicken-and-egg problem where rules without the posit.team/managed-by tag couldn't be described or tagged. After fixing lib/aws/iam.go and re-applying the CFN template, the clusters step completed successfully (4 EventBridge rules updated).

ian-flores added 27 commits February 23, 2026 12:55

Fix Python lint and formatting errors

fade18d

ian-flores marked this pull request as ready for review February 25, 2026 20:34

ian-flores requested a review from a team as a code owner February 25, 2026 20:34

ian-flores requested a review from amdove February 25, 2026 20:36

amdove approved these changes Feb 25, 2026

View reviewed changes

ian-flores added this pull request to the merge queue Feb 26, 2026

Merged via the queue into main with commit a4368ef Feb 26, 2026
6 checks passed

ian-flores deleted the aws-alerting-gaps branch February 26, 2026 16:16

This was referenced Feb 26, 2026

Add Azure Monitor alerting for cloud-level resources #143

Draft

refine CloudWatch alert thresholds and severity labels #152

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add AWS CloudWatch alerting for RDS, NAT Gateway, and Load Balancers#139

Add AWS CloudWatch alerting for RDS, NAT Gateway, and Load Balancers#139
ian-flores merged 27 commits intomainfrom
aws-alerting-gaps

ian-flores commented Feb 23, 2026

Uh oh!

ian-flores commented Feb 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ian-flores commented Feb 23, 2026

Description

Issue

New alert rules

Other changes

Verification

Category of change

Uh oh!

ian-flores commented Feb 25, 2026

Staging Verification

Workload (ganso01-staging)

Control Room (main01-staging)

EventBridge IAM fix

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Workload (`ganso01-staging`)

Control Room (`main01-staging`)