
KMS-661: Add Monitoring and Metrics for event publishing (KMS)#102

Merged
cgokey merged 4 commits into main from KMS-661
Apr 22, 2026

Conversation

@cgokey
Contributor

@cgokey cgokey commented Apr 16, 2026

Overview

What is the feature?

Adds keyword sync observability to the publisher by emitting CloudWatch metrics for keyword change detection and SNS keyword event publishing. It also adds a CloudWatch alarm for publish failures, with optional email notifications when that alarm fires.

What is the Solution?

This branch adds keyword sync metrics under the KMS/KeywordSync namespace and emits them from the publisher flow at three points in the publish lifecycle:

  • after keyword diff analysis:
    • KeywordChangesDetected
  • after keyword event creation:
    • KeywordEventsGenerated
  • after SNS publish attempts complete:
    • KeywordEventsPublished
    • KeywordEventPublishFailures
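As a rough illustration of how the last two counters could be derived, the sketch below tallies fulfilled and rejected publish attempts from `Promise.allSettled`-style results. `summarizePublishResults` is a hypothetical helper and the result shape is an assumption; the publisher in this PR may track outcomes differently.

```javascript
// Hypothetical helper: tally publish outcomes from Promise.allSettled results.
const summarizePublishResults = (settledResults) => settledResults.reduce(
  (summary, { status }) => ({
    published: summary.published + (status === 'fulfilled' ? 1 : 0),
    failed: summary.failed + (status === 'rejected' ? 1 : 0)
  }),
  { published: 0, failed: 0 }
)

// Stand-in results (no SNS calls are made here).
const exampleResults = [
  { status: 'fulfilled', value: { MessageId: 'example-1' } },
  { status: 'rejected', reason: new Error('example failure') },
  { status: 'fulfilled', value: { MessageId: 'example-2' } }
]

const publishSummary = summarizePublishResults(exampleResults)
console.log(publishSummary) // { published: 2, failed: 1 }
```

In this sketch, `published` would feed KeywordEventsPublished and `failed` would feed KeywordEventPublishFailures.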

The implementation includes:

  • a new shared metric helper for emitting CloudWatch metrics from publisher
  • publisher changes to count and emit metrics without letting a metric emission failure block the publish
  • IAM updates to allow cloudwatch:PutMetricData
  • LocalStack CloudWatch support so the metrics path can be exercised locally
  • a CloudWatch alarm on KeywordEventPublishFailures
  • optional SNS email subscriptions for that alarm via KEYWORD_SYNC_ALARM_EMAILS
  • Bamboo/CDK wiring for the new monitoring configuration
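To illustrate the "metrics must not block publish" point, here is a minimal sketch of a wrapper that swallows emission errors. It is not the helper added in this PR: `emitMetricsSafely` and the injected `sendMetrics` function are hypothetical names, and the real code presumably calls CloudWatch directly.

```javascript
// Hypothetical wrapper: emit metrics, but never let a metrics failure fail the publish.
// `sendMetrics` stands in for whatever actually calls CloudWatch PutMetricData.
const emitMetricsSafely = async (sendMetrics, metrics, context) => {
  try {
    await sendMetrics(metrics)

    return true
  } catch (error) {
    // Log and continue; the caller's publish flow is unaffected.
    console.warn(`Failed to emit metrics for ${context}: ${error.message}`)

    return false
  }
}

// Usage with a sender that always fails, to show the error is swallowed.
const failingSender = async () => {
  throw new Error('CloudWatch unavailable')
}

const emitted = emitMetricsSafely(
  failingSender,
  [{ metricName: 'KeywordChangesDetected', value: 3 }],
  'keyword diff analysis'
)
emitted.then((ok) => console.log(`metrics emitted: ${ok}`))
```

Returning a boolean instead of rethrowing keeps the decision about what to do on failure with the caller, which in this sketch simply carries on publishing.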

What areas of the application does this impact?

  • Publisher runtime and tests
  • Shared serverless metric utilities
  • KMS IAM/CDK infrastructure
  • LocalStack startup for local verification
  • Bamboo deploy-time environment plumbing

Testing

  1. Local verification
    Start LocalStack and local KMS, create a keyword, publish, and then inspect the emitted metrics:
npm run localstack:start
npm run start-local
./scripts/local/create_unique_keyword.sh
curl -X POST 'http://127.0.0.1:3013/publish?name=v1.0.0'
aws --endpoint-url=http://localhost:4566 cloudwatch list-metrics --namespace KMS/KeywordSync
./scripts/local/show_keyword_sync_metrics.js
  2. AWS verification
    Create or update a draft keyword in the target environment.
    Publish through the normal AWS flow.
    In CloudWatch, open KMS/KeywordSync -> Metrics with no dimensions.
    Verify these metric names appear and receive datapoints:
      • KeywordChangesDetected
      • KeywordEventsGenerated
      • KeywordEventsPublished
      • KeywordEventPublishFailures
    Use Statistic = Sum and a time window that includes the publish run.

  3. Alarm verification
    If KEYWORD_SYNC_ALARM_EMAILS is configured, check that the SNS email subscriptions have been confirmed.
    Each address receives an email asking to confirm the subscription, so look for that first.
    Then force the alarm into the ALARM state to trigger a test notification:

aws cloudwatch set-alarm-state \
  --alarm-name kms-sit-keyword-event-publish-failures \
  --state-value ALARM \
  --state-reason "Manual KMS-661 email test"

Checklist

  • I have added automated tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings

@codecov-commenter

codecov-commenter commented Apr 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.67%. Comparing base (76ed137) to head (f28986a).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #102   +/-   ##
=======================================
  Coverage   99.67%   99.67%           
=======================================
  Files         153      154    +1     
  Lines        3070     3108   +38     
  Branches      737      741    +4     
=======================================
+ Hits         3060     3098   +38     
  Misses          9        9           
  Partials        1        1           


Comment thread serverless/src/shared/emitPublisherMetrics.js
Comment thread bin/deploy-bamboo.sh
--env "EDL_PASSWORD=$bamboo_EDL_PASSWORD" \
--env "CMR_BASE_URL=$bamboo_CMR_BASE_URL" \
--env "BLOCK_PUBLISH_ON_KEYWORD_DIFF_FAILURE=${bamboo_BLOCK_PUBLISH_ON_KEYWORD_DIFF_FAILURE:-false}" \
--env "KEYWORD_SYNC_ALARM_EMAILS=${bamboo_KEYWORD_SYNC_ALARM_EMAILS:-}" \

If we are going to do this, we should use the support email, which goes to the Zendesk platform. I don't think KMS has its own, but we could have one made, or otherwise use MMT's, at least for prod.

Contributor Author

Yes, that makes sense to me.
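For reference, the `:-` default in the Bamboo line above means an unset variable reaches the container as an empty string. A minimal sketch of how a consumer might turn that value into a list of addresses follows. It assumes a comma-separated format, which this PR does not state, and `parseAlarmEmails` is a hypothetical name.

```javascript
// Assumption: KEYWORD_SYNC_ALARM_EMAILS is a comma-separated list.
// An empty or whitespace-only value yields no subscriptions.
const parseAlarmEmails = (rawValue = '') => rawValue
  .split(',')
  .map((email) => email.trim())
  .filter((email) => email.length > 0)

console.log(parseAlarmEmails('')) // []
console.log(parseAlarmEmails('support@example.com, ops@example.com'))
```

With this shape, switching to a single support address for prod would only require changing the Bamboo variable, not the code.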

Comment thread cdk/app/lib/helper/KeywordSyncMonitoringSetup.ts
namespace: KEYWORD_SYNC_METRIC_NAMESPACE,
metricName: KEYWORD_EVENT_PUBLISH_FAILURES_METRIC,
statistic: 'Sum',
period: cdk.Duration.minutes(5)

This might be noisy at 5 minutes. Suggest:
cdk.Duration.days(1)
Shouldn't it match the alarm time anyway?

Contributor Author

Good call. I changed the failure alarm to aggregate over a one-day period instead of five minutes, and pulled the period into a named constant so the intended alarm window is clearer.
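To make the noise concern concrete, the sketch below sums made-up failure datapoints per period and counts how many periods would reach a threshold. It is a simplified model and not CloudWatch's exact alarm evaluation; `countBreachingPeriods`, the threshold of 1, and the sample data are all assumptions.

```javascript
// Simplified model (not CloudWatch's exact evaluation): sum datapoints per
// period bucket and count how many buckets reach the alarm threshold.
const countBreachingPeriods = (datapoints, periodSeconds, threshold = 1) => {
  const sumsByBucket = new Map()

  datapoints.forEach(({ timestampSeconds, value }) => {
    const bucket = Math.floor(timestampSeconds / periodSeconds)
    sumsByBucket.set(bucket, (sumsByBucket.get(bucket) || 0) + value)
  })

  return Array.from(sumsByBucket.values()).filter((sum) => sum >= threshold).length
}

// Three failures spread over roughly an hour of the same day (made-up data).
const failureDatapoints = [
  { timestampSeconds: 600, value: 1 },
  { timestampSeconds: 1500, value: 1 },
  { timestampSeconds: 3300, value: 1 }
]

console.log(countBreachingPeriods(failureDatapoints, 5 * 60)) // 3
console.log(countBreachingPeriods(failureDatapoints, 24 * 60 * 60)) // 1
```

Under this model, the five-minute period produces three separate breaching windows, and potentially three alarm transitions, where the one-day period produces one.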

Comment thread serverless/src/shared/emitPublisherMetrics.js
Comment on lines +37 to +44
const countKeywordChanges = (keywordChangesMap) => Array.from(keywordChangesMap.values()).reduce(
  (total, changes) => total
    + changes.addedKeywords.size
    + changes.removedKeywords.size
    + changes.changedKeywords.size,
  0
)


Suggested change
- const countKeywordChanges = (keywordChangesMap) => Array.from(keywordChangesMap.values()).reduce(
-   (total, changes) => total
-     + changes.addedKeywords.size
-     + changes.removedKeywords.size
-     + changes.changedKeywords.size,
-   0
- )
+ const countKeywordChanges = (keywordChangesMap) => Array.from(keywordChangesMap.values()).reduce(
+   (total, { addedKeywords, removedKeywords, changedKeywords }) => total
+     + addedKeywords.size
+     + removedKeywords.size
+     + changedKeywords.size,
+   0
+ )

I also tried a for...of loop, which I thought was easier to read, but the linter complained.

Contributor Author

Good suggestion. Made this change
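A quick usage example of the destructured version, run against made-up data. The map keys and the use of `Set` for each change collection are assumptions; the only thing the function relies on is that each collection exposes `.size`.

```javascript
const countKeywordChanges = (keywordChangesMap) => Array.from(keywordChangesMap.values()).reduce(
  (total, { addedKeywords, removedKeywords, changedKeywords }) => total
    + addedKeywords.size
    + removedKeywords.size
    + changedKeywords.size,
  0
)

// Made-up sample data: the scheme names and UUID placeholders are illustrative only.
const sampleChanges = new Map([
  ['sciencekeywords', {
    addedKeywords: new Set(['uuid-1', 'uuid-2']),
    removedKeywords: new Set(),
    changedKeywords: new Set(['uuid-3'])
  }],
  ['platforms', {
    addedKeywords: new Set(),
    removedKeywords: new Set(['uuid-4']),
    changedKeywords: new Set()
  }]
])

console.log(countKeywordChanges(sampleChanges)) // 4
```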

Comment thread serverless/src/publisher/handler.js Outdated
Comment on lines +510 to +518
await emitPublisherMetricsSafely(
  [
    {
      metricName: PUBLISHER_METRIC_NAMES.KEYWORD_EVENTS_GENERATED,
      value: keywordEventsGenerated
    }
  ],
  'keyword event generation'
)

Should we just batch these together so there is only a single call?

Contributor Author

Good call. I batched these into a single metric emission after the SNS publish summary is known, so all four keyword sync metrics are sent together in one CloudWatch call.
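A minimal sketch of what a single batched request body could look like. The namespace and metric names come from the PR description, and the `Namespace` / `MetricData` shape follows the CloudWatch PutMetricData input. The `Unit` of `Count`, the builder name, and its parameter names are assumptions, not the code merged here.

```javascript
// Namespace and metric names are taken from the PR description; Unit 'Count' is an assumption.
const KEYWORD_SYNC_NAMESPACE = 'KMS/KeywordSync'

// Hypothetical builder for a single PutMetricData request body (no dimensions).
const buildKeywordSyncMetricInput = ({
  changesDetected,
  eventsGenerated,
  eventsPublished,
  publishFailures
}) => ({
  Namespace: KEYWORD_SYNC_NAMESPACE,
  MetricData: [
    ['KeywordChangesDetected', changesDetected],
    ['KeywordEventsGenerated', eventsGenerated],
    ['KeywordEventsPublished', eventsPublished],
    ['KeywordEventPublishFailures', publishFailures]
  ].map(([MetricName, Value]) => ({ MetricName, Value, Unit: 'Count' }))
})

const metricInput = buildKeywordSyncMetricInput({
  changesDetected: 5,
  eventsGenerated: 5,
  eventsPublished: 4,
  publishFailures: 1
})

console.log(JSON.stringify(metricInput, null, 2))
```

An object of this shape is what `PutMetricDataCommand` from `@aws-sdk/client-cloudwatch` accepts, so all four datapoints go out in one call.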

@cgokey cgokey merged commit d38be2e into main Apr 22, 2026
6 checks passed
@cgokey cgokey deleted the KMS-661 branch April 22, 2026 19:43

4 participants