Add AWS CloudWatch alerting for RDS, NAT Gateway, and Load Balancers#139
Merged
ian-flores merged 27 commits intomainfrom Feb 26, 2026
Merged
Add AWS CloudWatch alerting for RDS, NAT Gateway, and Load Balancers#139ian-flores merged 27 commits intomainfrom
ian-flores merged 27 commits intomainfrom
Conversation
Add alert rules for RDS (CPU, storage, memory, latency, connections), NAT Gateway (port allocation errors, dropped packets), and ALB/NLB (5XX errors, unhealthy targets, response latency). Add WriteLatency and Deadlocks metrics to RDS Alloy scraping, and add new CloudWatch discovery blocks for NAT Gateway, ALB, and NLB. Ref: ptd-config#2778
Add alb.ingress.kubernetes.io/tags annotation to workload ALBs and aws-load-balancer-additional-resource-tags to control room NLBs so Alloy CloudWatch discovery can scope metrics to the correct workload via search_tags instead of scraping all LBs in the region.
Fix RDS DatabaseConnections statistic from Sum to Average (it's a gauge, not a counter). Fix query.params referencing non-existent refId C in all alert YAML threshold conditions (including existing cloudwatch.yaml). Make NLB tagging unconditional to match ALB path. Add input validation helpers for LB tag annotation strings. Document NAT Gateway tag inheritance from VPC and dashboard-only metrics.
All tests pass. Here's a summary of the changes: --- Changes: - Extract `format_lb_tags` into `pulumi_resources/lib.py` to eliminate duplication between `aws_workload_helm.py` and `traefik.py` - Replace `_format_lb_tags` in `aws_workload_helm.py` with import from `lib.py` - Replace `_format_nlb_tags` in `traefik.py` with import from `lib.py` - Fix `KeyError` risk in `traefik.py`: use `.get()` with explicit `ValueError` if required cluster tags are missing - Raise `rds_database_connections_high` threshold from 80 to 500 with explanatory comment referencing instance type max - Update delete instructions in `rds.yaml`, `loadbalancer.yaml`, and `natgateway.yaml` to list all rule UIDs
All tests pass. Here's the summary: Changes: - Add empty-string validation for keys and values in `format_lb_tags` to prevent silent `key=` tags rejected by AWS LB Controller - Add unit tests for `format_lb_tags` covering normal output, comma/equals error paths, and empty-string edge cases - Raise `rds_freeable_memory_low` threshold from 256 MiB to 512 MiB to reduce noise on healthy mid-sized instances; update alert description to match - Update `rds_database_connections_high` comment to document that the 500-connection threshold is calibrated for `db.r5.large` and will never fire for small instances like `db.t3.small` - Add comment in `aws_workload_helm.py` documenting that `true_name`, `environment`, and `compound_name` are plain strings loaded from YAML config, not Pulumi Outputs
All 120 tests pass. Here's the summary:
Changes:
- Add empty-dict guard in `format_lb_tags`: raises `ValueError("tags must not be empty")` when called with `{}`
- Add `test_format_lb_tags_empty_dict` test case covering the new guard
- Add `self.cluster.tags is None` guard in `traefik.py` before calling `.get()`, preventing `AttributeError` from masking the descriptive `ValueError`
- Add calibration comments to `rds_free_storage_low` (5 GiB) and `rds_freeable_memory_low` (512 MiB) thresholds in `rds.yaml` noting they are sized for specific instance classes and must be adjusted for others
Changes: - Extend `nat_gateway_port_allocation_errors` alert `for` duration from `1m` to `5m` to reduce transient noise (consistent with other NAT alert) - Add space validation to `format_lb_tags()` for both keys and values, since AWS tag keys must not contain spaces - Add `test_format_lb_tags_space_in_key` and `test_format_lb_tags_space_in_value` test cases for the new validation - Document in `rds_database_connections_high` description annotation that the 500-connection threshold will not fire for small instances (e.g. db.t3.small) - Add comment in `traefik.py` clarifying that `cluster.name`, `true_name`, and `environment` are plain `str` values (not Pulumi Outputs), so `format_lb_tags()` checks work correctly at graph-construction time
All 130 tests pass. Here's the summary: Changes: - `lib.py`: Extended whitespace validation in `format_lb_tags` to reject tabs (`\t`) and newlines (`\n`) in addition to spaces, for both keys and values - `traefik.py`: Extracted inline NLB tag-building logic from `_deploy()` into a module-level `_build_nlb_tag_string(tags, cluster_name)` function - `test_lib.py`: Updated space-check test match strings to `"whitespace"` (matching new message) and added 4 new tests for tab/newline in keys and values - `tests/test_traefik_nlb_tags.py`: New file with 4 tests covering the happy path, `tags is None`, missing `posit.team/true-name`, and missing `posit.team/environment`
All tests pass. Here's the summary: Changes: - `traefik.py`: Split the combined `true_name`/`environment` null check into separate checks so the error message identifies exactly which required tag is missing - `test_traefik_nlb_tags.py`: Added test covering invalid `cluster_name` (comma in name) passed to `_build_nlb_tag_string` - `grafana_alloy.py`: Added comments to ALB and NLB discovery blocks documenting that pre-existing LBs won't be discovered until the cluster is redeployed (migration gap)
All 131 tests pass. Changes: - Add `match="comma or equals"` to `test_build_nlb_tag_string_invalid_cluster_name` so the test verifies the specific error from `format_lb_tags`, not just any `ValueError` - Add FIXME comments in `grafana_alloy.py` for both ALB and NLB discovery blocks with a concrete AWS CLI command for tagging existing load balancers without redeploying
Changes: - `lib.py`: Add AWS tag length validation — raise `ValueError` when key > 128 chars or value > 256 chars, surfacing violations at deploy time with a clear message instead of an opaque AWS API error - `tests/test_lib.py`: Add tests for key and value length limit validation - `grafana_alloy.py`: Add migration comment on `DatabaseConnections` stat change noting the Prometheus metric name changed from `_sum` to `_average` and that dashboards querying the old name need to be updated
All 133 tests pass. The Go build also succeeds. Changes: - `natgateway.yaml`: Add calibration comment to `nat_gateway_packets_dropped` threshold of 100 packets, noting it may need adjustment for high-throughput gateways - `lib.py`: Update `format_lb_tags` docstring to explicitly document that whitespace rejection in tag values is a deliberate constraint (not an AWS limit), since AWS tag values do permit spaces
All 134 tests pass (including the new `test_format_lb_tags_aws_reserved_prefix` test). Changes: - Add `aws:` reserved prefix validation to `format_lb_tags` in `lib.py`, surfacing AWS API errors at graph-construction time - Add test for `aws:` prefix rejection in `test_lib.py` - Collect both `Sum` and `Average` statistics for `DatabaseConnections` in `grafana_alloy.py` to preserve the `_sum` metric during dashboard migration
All 134 tests pass. Changes: - `rds.yaml`: Add instance-class caveat to `rds_freeable_memory_low` description to warn about false positives on small instances (e.g. db.t3.micro, db.t3.small) - `lib.py`: Improve error message for None/empty tag values from "must not be empty" to "must not be None or empty" - `grafana_alloy.py`: Change `UnHealthyHostCount` collection period from `1m` to `5m` for both ALB and NLB to match the alert group interval - `test_lib.py`: Update `test_format_lb_tags_empty_value` assertion to match the updated error message
All 135 tests pass. Changes: - Replaced exact-string assertion in `test_build_nlb_tag_string_happy_path` with a parsed key-value dict assertion to avoid coupling the test to dict insertion order - Added `test_build_nlb_tag_string_empty_cluster_name` to cover the empty `cluster_name=""` edge case, which raises `ValueError` with "must not be None or empty"
All changes verified. 136 tests pass. --- Changes: - Increase `alb_unhealthy_targets` and `nlb_unhealthy_targets` alert `for` from `5m` to `10m` to reduce false positives during rolling deployments/node drains - Add `"\r"` (carriage return) to whitespace rejection checks in `format_lb_tags()` for both keys and values - Add test `test_build_nlb_tag_string_extra_tags_are_dropped` to document that extra tags passed to `_build_nlb_tag_string` are intentionally discarded
All tests pass. Here's a summary of the changes made: Changes: - Fix `traefik.py` `_build_nlb_tag_string` to use `msg = ...; raise ValueError(msg)` pattern consistently with `lib.py` (finding 4) - Extract `_build_alb_tag_string(true_name, environment, compound_name)` module-level helper from `_define_ingress_alb_annotations` in `aws_workload_helm.py` (finding 3) - Update `_define_ingress_alb_annotations` to call the new `_build_alb_tag_string` helper - Add `tests/test_alb_tags.py` with 4 tests covering happy path, tag key presence, invalid compound_name, and empty compound_name (finding 3)
Changes: - Set `noDataState: Alerting` on `rds_free_storage_low` and `rds_freeable_memory_low` in `rds.yaml` so metric scraping failures surface as alerts rather than silently hiding - Set `noDataState: Alerting` on `nat_gateway_port_allocation_errors` in `natgateway.yaml` for the same reason - Added comment on the `alb_target_5xx_errors_high` threshold clarifying it is an absolute count (not an error rate) and noting its traffic-volume sensitivity - Added `test_build_nlb_tag_string_invalid_true_name_value` and `test_build_nlb_tag_string_invalid_environment_value` tests to `test_traefik_nlb_tags.py` to cover invalid characters in tag values extracted from cluster tags - Changed `test_format_lb_tags_normal` in `test_lib.py` to use round-trip parse instead of exact insertion-order string assertion - Added a `TODO` comment to the `DatabaseConnections` dual-collection block in `grafana_alloy.py` to track the pending cleanup
Changes: - Validate `true_name` against `[a-zA-Z0-9._-]+` before interpolating into Alloy River config f-strings in `grafana_alloy.py`, preventing malformed config from special characters - Change `nat_gateway_packets_dropped` `noDataState` from `NoData` to `Alerting` in `natgateway.yaml` to match `nat_gateway_port_allocation_errors` (both use the same metric source) - Add `\r` (carriage return) test cases for both key and value in `test_lib.py` to complete whitespace rejection coverage
All 149 tests pass. Here's a summary of the changes:
Changes:
- `natgateway.yaml`: Revert both NAT gateway rules from `noDataState: Alerting` to `noDataState: NoData` to match RDS/ALB/NLB rules and prevent false positives on exporter restarts
- `rds.yaml`: Add `instance_size_dependent: "true"` label to `rds_freeable_memory_low` so operators can create targeted silences for known-small instance classes
- `grafana_alloy.py`: Extract `_validate_alloy_true_name()` module-level function from the inline `re.match` check in `__init__`
- `test_grafana_alloy.py`: Add `TestValidateAlloyTrueName` class covering valid names and rejections of `"`, `{`, `}`, and space characters
- `grafana_alloy.py`: Replace vague "TODO: Open a tracking issue" with an actionable comment describing exactly what condition triggers cleanup and the cost implication
All tests pass. Here's a summary: Changes: - Confirmed `import re` is already present in `grafana_alloy.py` (no change needed) - Added `noDataState` rationale comments to all 5 rules in `rds.yaml` (Alerting for storage/memory exhaustion; NoData for performance metrics) - Changed `nat_gateway_port_allocation_errors` in `natgateway.yaml` from `noDataState: NoData` to `noDataState: Alerting` to match its "critical" annotation - Added boundary tests `test_format_lb_tags_key_at_limit` (128 chars) and `test_format_lb_tags_value_at_limit` (256 chars) to `test_lib.py` - Added `test_empty_string_rejected` test to `TestValidateAlloyTrueName` in `test_grafana_alloy.py` - Added `# TODO: Open a GitHub issue to track removal of Sum once dashboards are updated.` to the `DatabaseConnections` dual-statistics comment in `grafana_alloy.py`
All 154 tests pass. Changes: - Added `instance_size_dependent: "true"` label to `rds_database_connections_high` in `rds.yaml` for consistency with `rds_freeable_memory_low`, enabling operators to silence the alert for known-small instance classes - Added `test_build_alb_tag_string_invalid_true_name_value` and `test_build_alb_tag_string_invalid_environment_value` tests to `test_alb_tags.py` to match the invalidity coverage already present in `test_traefik_nlb_tags.py`
All tests pass. Here's a summary of the changes made: Changes: - Paused `rds_freeable_memory_low` and `rds_database_connections_high` alerts (`isPaused: true`) until Grafana silence/Alertmanager inhibit config is in place for the `instance_size_dependent` label - Changed `nat_gateway_port_allocation_errors` `noDataState` from `Alerting` to `NoData` to prevent spurious pages during scrape outages, with a comment to add a dedicated "CloudWatch scrape down" alert - Extracted CloudWatch config generation from `_define_config_map` into a new `_define_cloudwatch_config` method to enable unit testing - Replaced the floating TODO comment about the DatabaseConnections Sum cost with a NOTE that is less likely to be overlooked - Added `TestDefineCloudwatchConfig` tests asserting that `AWS/NATGateway`, `AWS/ApplicationELB`, and `AWS/NetworkELB` discovery blocks appear in the rendered config for AWS, and that non-AWS returns an empty string
All 160 tests pass and no new lint issues were introduced. --- Changes: - Validate `compound_name` in `_define_cloudwatch_config` with `_validate_alloy_true_name` to close the injection gap alongside `true_name` - Add test `test_invalid_true_name_raises_value_error` and `test_invalid_compound_name_raises_value_error` in `TestDefineCloudwatchConfig` to cover the validation code path - Add provisioning false-positive note to `rds_free_storage_low` annotation documenting expected `noDataState=Alerting` behavior during CloudWatch scrape initialization - Replace open-ended "Create a GitHub issue" comment with `TODO:` prefix on the `DatabaseConnections` Sum metric to make it searchable - Fix line length of `_make_alloy_for_cloudwatch` test helper signature (split across multiple lines)
All 163 tests pass and Go builds cleanly.
Changes:
- `tests/test_lib.py`: Add test for `format_lb_tags` with `/` in key, exercising the real production key format (`posit.team/true-name`)
- `tests/test_grafana_alloy.py`: Strengthen `TestDefineCloudwatchConfig` tests with assertions that interpolated `true_name` and `compound_name` values appear in the correct `search_tags` positions; add two dedicated tests for these
- `aws_workload_helm.py`: Add docstring comment to `_build_alb_tag_string` clarifying that `format_lb_tags` validates for annotation safety only, and that Alloy River injection safety is enforced separately by `_validate_alloy_true_name` in `grafana_alloy.py`
- `grafana_alerts/loadbalancer.yaml`, `natgateway.yaml`, `rds.yaml`: Add comment documenting that `{{$labels.cluster}}` in alert annotations is injected by the `prometheus.relabel.default` block in Alloy config, and will be blank if relabeling is absent
Set SupportsManagedByCondition to false for all EventBridge actions. The managed-by tag condition prevents managing EventBridge rules that don't already have the posit.team/managed-by tag, blocking operations like DescribeRule and TagResource. Account-scoping is sufficient.
Contributor
Author
Staging VerificationDeployed to Workload (
|
amdove
approved these changes
Feb 25, 2026
This was referenced Feb 26, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Add cloud-level Grafana alerting for AWS resources that were previously unmonitored.
PTD provisions significant AWS infrastructure but CloudWatch alerting only covered
FSx and EC2 network metrics. This closes the most critical gaps.
Issue
ptd-config#2778
New alert rules
Other changes
WriteLatencyandDeadlocksto RDS Alloy scraping (dashboard visibility)so Alloy discovery can scope metrics per workload
cloudwatch.yamlthresholdquery.paramsreferencing non-existent refIdVerification
just test-python-pulumipasses (111 tests)Grafana → Alerting → "Posit Alerts" folder
Category of change