Skip to content

Add AWS CloudWatch alerting for RDS, NAT Gateway, and Load Balancers#139

Merged
ian-flores merged 27 commits intomainfrom
aws-alerting-gaps
Feb 26, 2026
Merged

Add AWS CloudWatch alerting for RDS, NAT Gateway, and Load Balancers#139
ian-flores merged 27 commits intomainfrom
aws-alerting-gaps

Conversation

@ian-flores
Copy link
Contributor

Description

Add cloud-level Grafana alerting for AWS resources that were previously unmonitored.
PTD provisions significant AWS infrastructure but CloudWatch alerting only covered
FSx and EC2 network metrics. This closes the most critical gaps.

Issue

ptd-config#2778

New alert rules

Resource Alert Threshold Duration
RDS CPU Utilization High >80% 10m
RDS Free Storage Low <5 GiB 5m
RDS Freeable Memory Low <256 MiB 10m
RDS Read Latency High >100ms 10m
RDS Database Connections High >80 5m
NAT Gateway Port Allocation Errors >0 1m
NAT Gateway Packets Dropped >100 5m
ALB Target 5XX Errors High >10 5m
ALB Unhealthy Targets >0 5m
NLB Unhealthy Targets >0 5m
ALB Response Latency High >2s 10m

Other changes

  • Add WriteLatency and Deadlocks to RDS Alloy scraping (dashboard visibility)
  • Add CloudWatch discovery blocks for NAT Gateway, ALB, and NLB
  • Tag ALBs/NLBs with PTD workload identifiers via LB Controller annotations
    so Alloy discovery can scope metrics per workload
  • Fix existing cloudwatch.yaml threshold query.params referencing non-existent refId

Verification

  • just test-python-pulumi passes (111 tests)
  • Deploy to staging, verify metrics arrive in Mimir, check alert rules appear in
    Grafana → Alerting → "Posit Alerts" folder

Category of change

  • New feature (non-breaking change which adds functionality)

Add alert rules for RDS (CPU, storage, memory, latency, connections),
NAT Gateway (port allocation errors, dropped packets), and ALB/NLB
(5XX errors, unhealthy targets, response latency). Add WriteLatency
and Deadlocks metrics to RDS Alloy scraping, and add new CloudWatch
discovery blocks for NAT Gateway, ALB, and NLB.

Ref: ptd-config#2778
Add alb.ingress.kubernetes.io/tags annotation to workload ALBs and
aws-load-balancer-additional-resource-tags to control room NLBs so
Alloy CloudWatch discovery can scope metrics to the correct workload
via search_tags instead of scraping all LBs in the region.
Fix RDS DatabaseConnections statistic from Sum to Average (it's a
gauge, not a counter). Fix query.params referencing non-existent refId
C in all alert YAML threshold conditions (including existing
cloudwatch.yaml). Make NLB tagging unconditional to match ALB path.
Add input validation helpers for LB tag annotation strings. Document
NAT Gateway tag inheritance from VPC and dashboard-only metrics.
All tests pass. Here's a summary of the changes:
---
Changes:
- Extract `format_lb_tags` into `pulumi_resources/lib.py` to eliminate duplication between `aws_workload_helm.py` and `traefik.py`
- Replace `_format_lb_tags` in `aws_workload_helm.py` with import from `lib.py`
- Replace `_format_nlb_tags` in `traefik.py` with import from `lib.py`
- Fix `KeyError` risk in `traefik.py`: use `.get()` with explicit `ValueError` if required cluster tags are missing
- Raise `rds_database_connections_high` threshold from 80 to 500 with explanatory comment referencing instance type max
- Update delete instructions in `rds.yaml`, `loadbalancer.yaml`, and `natgateway.yaml` to list all rule UIDs
All tests pass. Here's the summary:
Changes:
- Add empty-string validation for keys and values in `format_lb_tags` to prevent silent `key=` tags rejected by AWS LB Controller
- Add unit tests for `format_lb_tags` covering normal output, comma/equals error paths, and empty-string edge cases
- Raise `rds_freeable_memory_low` threshold from 256 MiB to 512 MiB to reduce noise on healthy mid-sized instances; update alert description to match
- Update `rds_database_connections_high` comment to document that the 500-connection threshold is calibrated for `db.r5.large` and will never fire for small instances like `db.t3.small`
- Add comment in `aws_workload_helm.py` documenting that `true_name`, `environment`, and `compound_name` are plain strings loaded from YAML config, not Pulumi Outputs
All 120 tests pass. Here's the summary:
Changes:
- Add empty-dict guard in `format_lb_tags`: raises `ValueError("tags must not be empty")` when called with `{}`
- Add `test_format_lb_tags_empty_dict` test case covering the new guard
- Add `self.cluster.tags is None` guard in `traefik.py` before calling `.get()`, preventing `AttributeError` from masking the descriptive `ValueError`
- Add calibration comments to `rds_free_storage_low` (5 GiB) and `rds_freeable_memory_low` (512 MiB) thresholds in `rds.yaml` noting they are sized for specific instance classes and must be adjusted for others
Changes:
- Extend `nat_gateway_port_allocation_errors` alert `for` duration from `1m` to `5m` to reduce transient noise (consistent with other NAT alert)
- Add space validation to `format_lb_tags()` for both keys and values, since AWS tag keys must not contain spaces
- Add `test_format_lb_tags_space_in_key` and `test_format_lb_tags_space_in_value` test cases for the new validation
- Document in `rds_database_connections_high` description annotation that the 500-connection threshold will not fire for small instances (e.g. db.t3.small)
- Add comment in `traefik.py` clarifying that `cluster.name`, `true_name`, and `environment` are plain `str` values (not Pulumi Outputs), so `format_lb_tags()` checks work correctly at graph-construction time
All 130 tests pass. Here's the summary:
Changes:
- `lib.py`: Extended whitespace validation in `format_lb_tags` to reject tabs (`\t`) and newlines (`\n`) in addition to spaces, for both keys and values
- `traefik.py`: Extracted inline NLB tag-building logic from `_deploy()` into a module-level `_build_nlb_tag_string(tags, cluster_name)` function
- `test_lib.py`: Updated space-check test match strings to `"whitespace"` (matching new message) and added 4 new tests for tab/newline in keys and values
- `tests/test_traefik_nlb_tags.py`: New file with 4 tests covering the happy path, `tags is None`, missing `posit.team/true-name`, and missing `posit.team/environment`
All tests pass. Here's the summary:
Changes:
- `traefik.py`: Split the combined `true_name`/`environment` null check into separate checks so the error message identifies exactly which required tag is missing
- `test_traefik_nlb_tags.py`: Added test covering invalid `cluster_name` (comma in name) passed to `_build_nlb_tag_string`
- `grafana_alloy.py`: Added comments to ALB and NLB discovery blocks documenting that pre-existing LBs won't be discovered until the cluster is redeployed (migration gap)
All 131 tests pass.
Changes:
- Add `match="comma or equals"` to `test_build_nlb_tag_string_invalid_cluster_name` so the test verifies the specific error from `format_lb_tags`, not just any `ValueError`
- Add FIXME comments in `grafana_alloy.py` for both ALB and NLB discovery blocks with a concrete AWS CLI command for tagging existing load balancers without redeploying
Changes:
- `lib.py`: Add AWS tag length validation — raise `ValueError` when key > 128 chars or value > 256 chars, surfacing violations at deploy time with a clear message instead of an opaque AWS API error
- `tests/test_lib.py`: Add tests for key and value length limit validation
- `grafana_alloy.py`: Add migration comment on `DatabaseConnections` stat change noting the Prometheus metric name changed from `_sum` to `_average` and that dashboards querying the old name need to be updated
All 133 tests pass. The Go build also succeeds.
Changes:
- `natgateway.yaml`: Add calibration comment to `nat_gateway_packets_dropped` threshold of 100 packets, noting it may need adjustment for high-throughput gateways
- `lib.py`: Update `format_lb_tags` docstring to explicitly document that whitespace rejection in tag values is a deliberate constraint (not an AWS limit), since AWS tag values do permit spaces
All 134 tests pass (including the new `test_format_lb_tags_aws_reserved_prefix` test).
Changes:
- Add `aws:` reserved prefix validation to `format_lb_tags` in `lib.py`, surfacing AWS API errors at graph-construction time
- Add test for `aws:` prefix rejection in `test_lib.py`
- Collect both `Sum` and `Average` statistics for `DatabaseConnections` in `grafana_alloy.py` to preserve the `_sum` metric during dashboard migration
All 134 tests pass.
Changes:
- `rds.yaml`: Add instance-class caveat to `rds_freeable_memory_low` description to warn about false positives on small instances (e.g. db.t3.micro, db.t3.small)
- `lib.py`: Improve error message for None/empty tag values from "must not be empty" to "must not be None or empty"
- `grafana_alloy.py`: Change `UnHealthyHostCount` collection period from `1m` to `5m` for both ALB and NLB to match the alert group interval
- `test_lib.py`: Update `test_format_lb_tags_empty_value` assertion to match the updated error message
All 135 tests pass.
Changes:
- Replaced exact-string assertion in `test_build_nlb_tag_string_happy_path` with a parsed key-value dict assertion to avoid coupling the test to dict insertion order
- Added `test_build_nlb_tag_string_empty_cluster_name` to cover the empty `cluster_name=""` edge case, which raises `ValueError` with "must not be None or empty"
All changes verified. 136 tests pass.
---
Changes:
- Increase `alb_unhealthy_targets` and `nlb_unhealthy_targets` alert `for` from `5m` to `10m` to reduce false positives during rolling deployments/node drains
- Add `"\r"` (carriage return) to whitespace rejection checks in `format_lb_tags()` for both keys and values
- Add test `test_build_nlb_tag_string_extra_tags_are_dropped` to document that extra tags passed to `_build_nlb_tag_string` are intentionally discarded
All tests pass. Here's a summary of the changes made:
Changes:
- Fix `traefik.py` `_build_nlb_tag_string` to use `msg = ...; raise ValueError(msg)` pattern consistently with `lib.py` (finding 4)
- Extract `_build_alb_tag_string(true_name, environment, compound_name)` module-level helper from `_define_ingress_alb_annotations` in `aws_workload_helm.py` (finding 3)
- Update `_define_ingress_alb_annotations` to call the new `_build_alb_tag_string` helper
- Add `tests/test_alb_tags.py` with 4 tests covering happy path, tag key presence, invalid compound_name, and empty compound_name (finding 3)
Changes:
- Set `noDataState: Alerting` on `rds_free_storage_low` and `rds_freeable_memory_low` in `rds.yaml` so metric scraping failures surface as alerts rather than silently hiding
- Set `noDataState: Alerting` on `nat_gateway_port_allocation_errors` in `natgateway.yaml` for the same reason
- Added comment on the `alb_target_5xx_errors_high` threshold clarifying it is an absolute count (not an error rate) and noting its traffic-volume sensitivity
- Added `test_build_nlb_tag_string_invalid_true_name_value` and `test_build_nlb_tag_string_invalid_environment_value` tests to `test_traefik_nlb_tags.py` to cover invalid characters in tag values extracted from cluster tags
- Changed `test_format_lb_tags_normal` in `test_lib.py` to use round-trip parse instead of exact insertion-order string assertion
- Added a `TODO` comment to the `DatabaseConnections` dual-collection block in `grafana_alloy.py` to track the pending cleanup
Changes:
- Validate `true_name` against `[a-zA-Z0-9._-]+` before interpolating into Alloy River config f-strings in `grafana_alloy.py`, preventing malformed config from special characters
- Change `nat_gateway_packets_dropped` `noDataState` from `NoData` to `Alerting` in `natgateway.yaml` to match `nat_gateway_port_allocation_errors` (both use the same metric source)
- Add `\r` (carriage return) test cases for both key and value in `test_lib.py` to complete whitespace rejection coverage
All 149 tests pass. Here's a summary of the changes:
Changes:
- `natgateway.yaml`: Revert both NAT gateway rules from `noDataState: Alerting` to `noDataState: NoData` to match RDS/ALB/NLB rules and prevent false positives on exporter restarts
- `rds.yaml`: Add `instance_size_dependent: "true"` label to `rds_freeable_memory_low` so operators can create targeted silences for known-small instance classes
- `grafana_alloy.py`: Extract `_validate_alloy_true_name()` module-level function from the inline `re.match` check in `__init__`
- `test_grafana_alloy.py`: Add `TestValidateAlloyTrueName` class covering valid names and rejections of `"`, `{`, `}`, and space characters
- `grafana_alloy.py`: Replace vague "TODO: Open a tracking issue" with an actionable comment describing exactly what condition triggers cleanup and the cost implication
All tests pass. Here's a summary:
Changes:
- Confirmed `import re` is already present in `grafana_alloy.py` (no change needed)
- Added `noDataState` rationale comments to all 5 rules in `rds.yaml` (Alerting for storage/memory exhaustion; NoData for performance metrics)
- Changed `nat_gateway_port_allocation_errors` in `natgateway.yaml` from `noDataState: NoData` to `noDataState: Alerting` to match its "critical" annotation
- Added boundary tests `test_format_lb_tags_key_at_limit` (128 chars) and `test_format_lb_tags_value_at_limit` (256 chars) to `test_lib.py`
- Added `test_empty_string_rejected` test to `TestValidateAlloyTrueName` in `test_grafana_alloy.py`
- Added `# TODO: Open a GitHub issue to track removal of Sum once dashboards are updated.` to the `DatabaseConnections` dual-statistics comment in `grafana_alloy.py`
All 154 tests pass.
Changes:
- Added `instance_size_dependent: "true"` label to `rds_database_connections_high` in `rds.yaml` for consistency with `rds_freeable_memory_low`, enabling operators to silence the alert for known-small instance classes
- Added `test_build_alb_tag_string_invalid_true_name_value` and `test_build_alb_tag_string_invalid_environment_value` tests to `test_alb_tags.py` to match the invalidity coverage already present in `test_traefik_nlb_tags.py`
All tests pass. Here's a summary of the changes made:
Changes:
- Paused `rds_freeable_memory_low` and `rds_database_connections_high` alerts (`isPaused: true`) until Grafana silence/Alertmanager inhibit config is in place for the `instance_size_dependent` label
- Changed `nat_gateway_port_allocation_errors` `noDataState` from `Alerting` to `NoData` to prevent spurious pages during scrape outages, with a comment to add a dedicated "CloudWatch scrape down" alert
- Extracted CloudWatch config generation from `_define_config_map` into a new `_define_cloudwatch_config` method to enable unit testing
- Replaced the floating TODO comment about the DatabaseConnections Sum cost with a NOTE that is less likely to be overlooked
- Added `TestDefineCloudwatchConfig` tests asserting that `AWS/NATGateway`, `AWS/ApplicationELB`, and `AWS/NetworkELB` discovery blocks appear in the rendered config for AWS, and that non-AWS returns an empty string
All 160 tests pass and no new lint issues were introduced.
---
Changes:
- Validate `compound_name` in `_define_cloudwatch_config` with `_validate_alloy_true_name` to close the injection gap alongside `true_name`
- Add test `test_invalid_true_name_raises_value_error` and `test_invalid_compound_name_raises_value_error` in `TestDefineCloudwatchConfig` to cover the validation code path
- Add provisioning false-positive note to `rds_free_storage_low` annotation documenting expected `noDataState=Alerting` behavior during CloudWatch scrape initialization
- Replace open-ended "Create a GitHub issue" comment with `TODO:` prefix on the `DatabaseConnections` Sum metric to make it searchable
- Fix line length of `_make_alloy_for_cloudwatch` test helper signature (split across multiple lines)
All 163 tests pass and Go builds cleanly.
Changes:
- `tests/test_lib.py`: Add test for `format_lb_tags` with `/` in key, exercising the real production key format (`posit.team/true-name`)
- `tests/test_grafana_alloy.py`: Strengthen `TestDefineCloudwatchConfig` tests with assertions that interpolated `true_name` and `compound_name` values appear in the correct `search_tags` positions; add two dedicated tests for these
- `aws_workload_helm.py`: Add docstring comment to `_build_alb_tag_string` clarifying that `format_lb_tags` validates for annotation safety only, and that Alloy River injection safety is enforced separately by `_validate_alloy_true_name` in `grafana_alloy.py`
- `grafana_alerts/loadbalancer.yaml`, `natgateway.yaml`, `rds.yaml`: Add comment documenting that `{{$labels.cluster}}` in alert annotations is injected by the `prometheus.relabel.default` block in Alloy config, and will be blank if relabeling is absent
Set SupportsManagedByCondition to false for all EventBridge actions.
The managed-by tag condition prevents managing EventBridge rules that
don't already have the posit.team/managed-by tag, blocking operations
like DescribeRule and TagResource. Account-scoping is sufficient.
@ian-flores ian-flores marked this pull request as ready for review February 25, 2026 20:34
@ian-flores ian-flores requested a review from a team as a code owner February 25, 2026 20:34
@ian-flores
Copy link
Contributor Author

Staging Verification

Deployed to ganso01-staging and main01-staging (control room).

Workload (ganso01-staging)

  • Ran ptd ensure ganso01-staging --only-steps helm --auto-apply
  • Alloy ConfigMap updated with new CloudWatch discovery blocks (NAT Gateway, ALB, NLB) and additional RDS metrics (WriteLatency, Deadlocks)
  • Ingress updated with alb.ingress.kubernetes.io/tags annotation — ALB now tagged with posit.team/true-name, posit.team/environment, and Name

Control Room (main01-staging)

  • Ran ptd ensure main01-staging --only-steps cluster --auto-apply
  • All alert configmaps deployed to Grafana namespace:
NAME                          DATA   AGE
grafana-applications-alerts   1      5d2h
grafana-cloudwatch-alerts     1      2m56s
grafana-healthchecks-alerts   1      20d
grafana-loadbalancer-alerts   1      3m1s
grafana-natgateway-alerts     1      3m2s
grafana-nodes-alerts          1      264d
grafana-pods-alerts           1      2m57s
grafana-rds-alerts            1      3m1s

EventBridge IAM fix

The clusters step on ganso01-staging was failing due to EventBridge SupportsManagedByCondition being set to true — this created a chicken-and-egg problem where rules without the posit.team/managed-by tag couldn't be described or tagged. After fixing lib/aws/iam.go and re-applying the CFN template, the clusters step completed successfully (4 EventBridge rules updated).

@ian-flores ian-flores requested a review from amdove February 25, 2026 20:36
@ian-flores ian-flores added this pull request to the merge queue Feb 26, 2026
Merged via the queue into main with commit a4368ef Feb 26, 2026
6 checks passed
@ian-flores ian-flores deleted the aws-alerting-gaps branch February 26, 2026 16:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants