Fix API Error CloudWatch alarms by specifying API name correctly#103
Conversation
  dimensionsMap: {
-   Name: 'ApiName',
-   Value: restApiName
+   ApiName: restApiName
There was a problem hiding this comment.
[DUST] Question for my understanding.
Is the Dimension Key ApiName configured in a special way? From what I'm reading in the CDK docs, it seems like this is user defined and can be an arbitrary value. It looks like we were previously passing "restApiName" with the key "Value".
Now, we're passing "restApiName" with the key "ApiName", which is more specific. How does this affect (or "fix") the behavior of the alarm though?
There was a problem hiding this comment.
To be honest, I'm not quite sure either! My approach for this ticket was to find the right config by first creating the alarm in the AWS console. I did this by finding the 5XXError metric in the API Gateway monitoring dashboard and clicking the option to create an alarm based on that metric.
Once an alarm is created, you can view the "source" which gives the CloudFormation JSON, and this is where I found the dimensionsMap:
{
  "Type": "AWS::CloudWatch::Alarm",
  "Properties": {
    ...
    "MetricName": "5XXError",
    "Namespace": "AWS/ApiGateway",
    "Statistic": "Sum",
    "Dimensions": [
      {
        "Name": "ApiName",
        "Value": "Feedback API"
      },
      {
        "Name": "Stage",
        "Value": "prod"
      }
    ],
    ...
  }
}
Which I then used to create the alarm in CDK. Through trial-and-error (sending test requests to the API) I figured out that the alarm still works as long as you include the ApiName dimension.
I did some brief internet searching and didn't find much else that was useful. If I had to guess how this works, I believe the dimensions are user-defined, like you said, for custom metrics. However, since the 5XXError and 4XXError metrics are built-in, they come with predefined dimensions (how we're supposed to know what they are without inspecting the metric from the API Gateway dashboard, I'm not sure).
Sorry for the lack of a better answer here!
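For anyone reading later, here is a minimal sketch of why the old config pointed at a different metric. It assumes (as I understand CDK's behavior) that each key in `dimensionsMap` becomes one dimension name and its value the dimension value. `toCfnDimensions` is a hypothetical helper written for illustration only, not a CDK function, and "Feedback API" is simply the value from the console JSON in this thread.

```typescript
// Hypothetical helper: mimics how a `dimensionsMap` is assumed to be rendered
// into the CloudFormation "Dimensions" array (one entry per key in the map).
interface CfnDimension {
  Name: string;
  Value: string;
}

function toCfnDimensions(map: Record<string, string>): CfnDimension[] {
  return Object.keys(map).map((key) => ({ Name: key, Value: map[key] }));
}

const restApiName = "Feedback API"; // value copied from the console JSON in this thread

// Before the fix: two dimensions literally named "Name" and "Value".
const before = toCfnDimensions({ Name: "ApiName", Value: restApiName });

// After the fix: a single dimension named "ApiName", as in the console JSON.
const after = toCfnDimensions({ ApiName: restApiName });

console.log(JSON.stringify(before));
console.log(JSON.stringify(after));
```

Under that assumption, the old map asks for a metric with dimensions called "Name" and "Value", which API Gateway never publishes, while the new map asks for the "ApiName" dimension that shows up in the console output.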
There was a problem hiding this comment.
Got it, thanks for taking the time to explain! Definitely agree with the approach you took here of testing + validating to see if changes worked, I find that's often the only way to figure things out in AWS.
Here's some additional context from the AWS docs for the CloudWatch Metric, in case it helps your understanding.
"The metric is a combination of a metric identifier (namespace, name and dimensions) and an aggregation function (statistic, period and unit)."
If the metric identifier is incorrect, AWS won't be able to find/define the alarm properly. Breaking down the different components:
Namespace - the service the metric needs to be associated with
Name - A metric name that can be either user-defined OR one of the default metrics published within the service namespace (5XXError and 4XXError are the latter, as you've noted).
Dimension - Additional attributes associated with the metric. Metrics published by AWS come with their dimensions already defined (in this case, ApiName and Stage, as we can see in your view-source CloudFormation). If a user defines a custom metric, they can also tag it with custom dimensions.
The important part, in my understanding, is that the metric you specify in the alarm with Namespace, Name, and Dimension MUST EXIST. Previously, we were alarming on a metric that did not exist, so the alarm did not work. Here, we've fixed that.
I'm curious why Stage is defined as a Dimension for these metrics but doesn't seem necessary to create the Alarm. Perhaps since Namespace + Name + Dimension: ApiName is sufficient to identify a metric, no additional dimensions are necessary?
Anyway, hope this explainer helps.
There was a problem hiding this comment.
I see! This is super helpful, thank you so much for taking the time to write out an explanation!!
There was a problem hiding this comment.
Thanks for detailing how you tracked this down. Nice work!
I found this doc specific to Amazon API Gateway dimensions and metrics. Perhaps we can reference it for future use cases.
There was a problem hiding this comment.
Thanks so much for finding that AnJu! I think that answers John's question about why Stage isn't necessary — it's because there are certain combinations of dimensions you can use to filter the metrics
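To make the "combinations of dimensions" point concrete, here is a rough sketch. The list of combinations is my reading of the API Gateway docs for REST APIs and should be treated as an assumption, and `matchesPublishedMetric` is a hypothetical helper, not an AWS or CDK API. It only models the idea that an alarm's dimension set has to match one of the published sets exactly.

```typescript
// Dimension combinations assumed to be published for REST API metrics
// (assumption based on the API Gateway docs; verify against the doc linked above).
const PUBLISHED_COMBINATIONS: string[][] = [
  ["ApiName"],
  ["ApiName", "Stage"],
  ["ApiName", "Method", "Resource", "Stage"],
];

// Hypothetical check: does this set of dimension names exactly match
// one of the published combinations?
function matchesPublishedMetric(dimensions: Record<string, string>): boolean {
  const requested = Object.keys(dimensions).sort().join(",");
  return PUBLISHED_COMBINATIONS.some(
    (combo) => combo.slice().sort().join(",") === requested
  );
}

const apiNameOnly = matchesPublishedMetric({ ApiName: "Feedback API" });
const apiNameAndStage = matchesPublishedMetric({ ApiName: "Feedback API", Stage: "prod" });
const stageOnly = matchesPublishedMetric({ Stage: "prod" });
const oldConfig = matchesPublishedMetric({ Name: "ApiName", Value: "Feedback API" });

console.log({ apiNameOnly, apiNameAndStage, stageOnly, oldConfig });
```

In this model, ApiName alone and ApiName plus Stage both resolve to a metric, which would explain why Stage is optional, while the old Name/Value pair resolves to nothing.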
Description
Fixes the 4XXError and 5XXError alarms by specifying the ApiName correctly in the alarm config.
Steps to Test
Innov-Platform-Dev/ratingendpoint with an empty request body).
Note: I was only able to verify the 5XXError alarm since the Feedback API currently only returns 500 or 400 responses. But since the alarms are configured identically besides the metric, I believe this is sufficient to ensure that the 4XXError alarm would work as well.