Fix API Error CloudWatch alarms by specifying API name correctly#103

Merged
ezhangy merged 4 commits into dev from fix-api-error-alarms
Oct 27, 2025

Conversation

@ezhangy ezhangy commented Oct 24, 2025

Description

Fixes the 4XXError and 5XXError alarms by specifying the ApiName correctly in the alarm config.

Steps to Test

  1. Deploy the changes to Innov-Platform-Dev
  2. Send invalid requests to the Feedback API (e.g. send a request to the /rating endpoint with an empty request body).
  3. Wait a few minutes, then open the CloudWatch alarm in the AWS console and check that errors appear on the graph.
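
Step 2 could be scripted along these lines. This is only a sketch under stated assumptions: the base URL is a placeholder, the POST method and JSON content type are guesses about the /rating endpoint, and `buildInvalidRatingRequest` is a hypothetical helper, not something from this repo.

```typescript
// Shape of a request that can be handed to fetch(url, init).
interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds a deliberately invalid request: the body is empty.
// ASSUMPTION: the endpoint accepts POST with a JSON body.
function buildInvalidRatingRequest(baseUrl: string): PreparedRequest {
  // Drop trailing slashes so the path is not doubled up.
  const url = `${baseUrl.replace(/\/+$/, "")}/rating`;
  return {
    url,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: "",
    },
  };
}

// Placeholder base URL; substitute the deployed stage URL of the dev API.
const prepared = buildInvalidRatingRequest("https://example.invalid/prod/");

// To actually send it (Node 18+ or a browser):
//   const response = await fetch(prepared.url, prepared.init);
//   console.log(response.status);
```

Sending this a handful of times should produce enough error responses for the alarm graph to show data points after a few minutes.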

Note: I was only able to verify the 5XXError alarm since the Feedback API currently only returns 500 or 400 responses. But since the alarms are configured identically besides the metric, I believe this is sufficient to ensure that the 4XXError alarm would work as well.

@ezhangy ezhangy changed the title from "Fix API Error CloudWatch alarms by specifying stage" to "Fix API Error CloudWatch alarms by specifying API name and stage" Oct 24, 2025
@ezhangy ezhangy changed the title from "Fix API Error CloudWatch alarms by specifying API name and stage" to "Fix API Error CloudWatch alarms by specifying API name correctly" Oct 24, 2025
  dimensionsMap: {
-   Name: 'ApiName',
-   Value: restApiName
+   ApiName: restApiName

[DUST] Question for my understanding.

Is the Dimension Key ApiName configured in a special way? From what I'm reading in the CDK docs, it seems like this is user defined and can be an arbitrary value. It looks like we were previously passing "restApiName" with the key "Value".

Now, we're passing "restApiName" with the key "ApiName", which is more specific. How does this affect (or "fix") the behavior of the alarm though?

Contributor Author

To be honest, I'm not quite sure either! My approach for this ticket was to find the right config by first creating the alarm in the AWS console. I did this by finding the 5XXError metric in the API Gateway monitoring dashboard and clicking the option to create an alarm based off that metric.

Once an alarm is created, you can view the "source" which gives the CloudFormation JSON, and this is where I found the dimensionsMap:

{
    "Type": "AWS::CloudWatch::Alarm",
    "Properties": {
        ...
        "MetricName": "5XXError",
        "Namespace": "AWS/ApiGateway",
        "Statistic": "Sum",
        "Dimensions": [
            {
                "Name": "ApiName",
                "Value": "Feedback API"
            },
            {
                "Name": "Stage",
                "Value": "prod"
            }
        ],
     ...
    }
}

Which I then used to create the alarm in CDK. Through trial-and-error (sending test requests to the API) I figured out that the alarm still works as long as you include the ApiName dimension.
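
The step from the console's CloudFormation source to the CDK config can be sketched as follows. This is an illustration, not code from this PR: `toDimensionsMap` and `consoleDimensions` are hypothetical names, and it assumes that CDK's `dimensionsMap` is a plain object whose keys are dimension names and whose values are dimension values.

```typescript
// One entry of the "Dimensions" array in the CloudFormation alarm source.
interface CfnDimension {
  Name: string;
  Value: string;
}

// Converts the CloudFormation-style array into a name -> value map,
// which is the shape a CDK dimensionsMap is assumed to take.
function toDimensionsMap(dimensions: CfnDimension[]): Record<string, string> {
  const map: Record<string, string> = {};
  for (const { Name, Value } of dimensions) {
    map[Name] = Value;
  }
  return map;
}

// The dimensions shown in the console-generated source above.
const consoleDimensions: CfnDimension[] = [
  { Name: "ApiName", Value: "Feedback API" },
  { Name: "Stage", Value: "prod" },
];

const dimensionsMap = toDimensionsMap(consoleDimensions);
console.log(dimensionsMap); // keys are ApiName and Stage
```

Read this way, each `Name`/`Value` pair in the JSON becomes a single key/value entry in the map, rather than two entries literally named `Name` and `Value`.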

I did some brief internet searching and didn't find much more that was useful. If I had to guess how this works, I believe the dimensions are user-defined, like you said, for custom metrics. However, since the 5XXError and 4XXError metrics are built-in, they come with predefined dimensions (how we're supposed to know what they are without inspecting the metric from the API Gateway dashboard, I'm not sure).

Sorry for the lack of a better answer here!

Got it, thanks for taking the time to explain! Definitely agree with the approach you took here of testing + validating to see if changes worked, I find that's often the only way to figure things out in AWS.

Here's some additional context from the AWS docs for CloudWatch Metric, in case it helps your understanding.

"The metric is a combination of a metric identifier (namespace, name and dimensions) and an aggregation function (statistic, period and unit)."

If the metric identifier is incorrect, AWS won't be able to match the alarm to the intended metric. Breaking down the different components:
- Namespace: the service the metric is associated with.
- Name: a metric name that can be either user-defined OR one of the default metrics published within the service namespace (5XXError and 4XXError are the latter, as you've noted).
- Dimension: additional attributes associated with the metric. Metrics published by AWS come with dimensions already defined (in this case, ApiName and Stage, as we can see in your view-source CloudFormation). If a user defines a custom metric, they can also tag it with custom dimensions.

The important part, in my understanding, is that the metric you specify in the alarm with Namespace, Name, and Dimension MUST EXIST. Previously, we were alarming on a metric that did not exist, so the alarm did not work. Here, we've fixed that.
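
To make the "must exist" point concrete, here is a small sketch of how the old and new configs expand into metric identifiers. It assumes each key in `dimensionsMap` becomes a dimension name and each value a dimension value; `toDimensions` and `metricKey` are hypothetical helpers for illustration, not CDK or CloudWatch APIs.

```typescript
type DimensionsMap = Record<string, string>;

interface Dimension {
  Name: string;
  Value: string;
}

// Expands a dimensionsMap into Name/Value pairs, sorted by name
// so that two equal maps always produce the same identifier.
function toDimensions(map: DimensionsMap): Dimension[] {
  return Object.entries(map)
    .map(([Name, Value]) => ({ Name, Value }))
    .sort((a, b) => a.Name.localeCompare(b.Name));
}

// Builds a readable identifier from namespace, metric name and dimensions.
function metricKey(namespace: string, metricName: string, map: DimensionsMap): string {
  const dims = toDimensions(map)
    .map((d) => `${d.Name}=${d.Value}`)
    .join(",");
  return `${namespace}/${metricName}{${dims}}`;
}

const restApiName = "Feedback API";

// Old config: two dimensions literally named "Name" and "Value".
const before = metricKey("AWS/ApiGateway", "5XXError", {
  Name: "ApiName",
  Value: restApiName,
});

// New config: one dimension named "ApiName".
const after = metricKey("AWS/ApiGateway", "5XXError", { ApiName: restApiName });

console.log(before);
console.log(after);
```

Under that assumption, the old identifier refers to dimensions named `Name` and `Value`, which API Gateway does not publish, while the new one matches the `ApiName` dimension seen in the console-generated source.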

I'm curious why Stage is defined as a Dimension for these metrics but doesn't seem necessary to create the Alarm. Perhaps since Namespace + Name + Dimension: ApiName is sufficient to identify a metric, no additional dimensions are necessary?

Anyway, hope this explainer helps.

Contributor Author

I see! This is super helpful, thank you so much for taking the time to write out an explanation!!

Contributor

Thanks for detailing how you tracked this down. Nice work!

I found this doc specific to Amazon API Gateway dimensions and metrics. Perhaps we can reference it for future use cases.

Contributor Author

Thanks so much for finding that AnJu! I think that answers John's question about why Stage isn't necessary: there are certain combinations of dimensions you can use to filter the metrics.
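
The "certain combinations" idea could be checked in code roughly like this. The list of dimension sets is an assumption based on a reading of the API Gateway metrics documentation for REST APIs and should be verified against the doc AnJu found; `REST_API_DIMENSION_SETS` and `isPublishedDimensionSet` are hypothetical names.

```typescript
// ASSUMPTION: dimension combinations that API Gateway publishes for REST API
// metrics. Verify against the API Gateway metrics and dimensions doc.
const REST_API_DIMENSION_SETS: string[][] = [
  ["ApiName"],
  ["ApiName", "Stage"],
  ["ApiName", "Method", "Resource", "Stage"],
];

// Returns true when the keys of the map exactly match one published set.
function isPublishedDimensionSet(map: Record<string, string>): boolean {
  const keys = Object.keys(map).sort();
  return REST_API_DIMENSION_SETS.some((set) => {
    const sorted = [...set].sort();
    return sorted.length === keys.length && sorted.every((k, i) => k === keys[i]);
  });
}

const apiNameOnly = isPublishedDimensionSet({ ApiName: "Feedback API" });
const apiNameAndStage = isPublishedDimensionSet({ ApiName: "Feedback API", Stage: "prod" });
const oldConfig = isPublishedDimensionSet({ Name: "ApiName", Value: "Feedback API" });

console.log({ apiNameOnly, apiNameAndStage, oldConfig });
```

If that list is right, `ApiName` alone and `ApiName` plus `Stage` are both valid ways to address the metric, which would explain why the alarm works without `Stage`.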

Yes! AnJu, thanks for finding this!

@ezhangy ezhangy merged commit d6e14bc into dev Oct 27, 2025
1 check passed
@ezhangy ezhangy deleted the fix-api-error-alarms branch October 27, 2025 15:00
AnJuHyppolite added a commit that referenced this pull request Nov 7, 2025
### Description

Merges the following PRs to main: 
- #101
- #103
- #104
- #107
- #106
3 participants