There is a limit of 50 TPS on CreateLogStream operations, after which requests are throttled [1]; the resulting error is described in [2]:
ThrottlingException
The request was denied due to request throttling.
HTTP Status Code: 400
In the awslogs log driver, the errors handled for CreateLogStream / CreateLogGroup requests are listed here: [3]
In non-blocking mode, any error from createLogStream() that doesn't match the list of known errors is retried 5 times over the span of a minute [4]. However, in the default blocking mode, any throttling error received from the CreateLogStream/CreateLogGroup API causes the log driver to quit immediately:
"Task failed to start( CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: ThrottlingException: Rate exceeded status code: 400, request id: "
This is happening in batch-job situations where many containers are spun up in a short period of time, especially in conjunction with a bunch of Lambda functions. When seeing this message, the common strategy has been to request an increase to the CreateLogStream throttle limit - but doing so just delays/reduces the frequency of the error rather than preventing it. Another option is using non-blocking mode instead of blocking mode, but that's a hacky workaround (and doesn't actually work in Fargate 1.4 because of a second bug in the shim logger used by Fargate [5][6] - long story). Anyway, adding a retry loop to blocking mode would almost entirely eliminate these failures.
A colleague pointed out that the AWS SDKs implement retry logic by default, which can be configured when constructing the client object.
Upon further investigation, we found that the default 3 retries are happening, but they're too close together:
After the original request was throttled, the first retry attempt was made 0.867 seconds later. The final 2 retries followed delays of 1.79 seconds and 2.616 seconds. Three other log streams created by the same code similarly failed to create during this period.
I've opened an issue in aws-sdk-go to increase the default delay duration [1]; I'll leave this issue open, however, in case that one is declined, closed, or not actioned, or in case further retry configuration is deemed necessary for awslogs.
[1] https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_CreateLogStream.html
[2] https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/CommonErrors.html
[3] https://github.com/moby/moby/blob/master/daemon/logger/awslogs/cloudwatchlogs.go#L60-L63
[4] https://github.com/moby/moby/blob/master/daemon/logger/awslogs/cloudwatchlogs.go#L140-L199
[5] https://github.com/aws/amazon-ecs-shim-loggers-for-containerd/blob/master/logger/awslogs/logger.go#L84
[6] https://aws.amazon.com/blogs/containers/under-the-hood-fargate-data-plane/
Steps to reproduce the issue:
Launch a large batch of containers using the awslogs driver (in its default blocking mode) at roughly the same time, enough to exceed the 50 TPS CreateLogStream limit.
Describe the results you received:
"Task failed to start( CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: ThrottlingException: Rate exceeded status code: 400, request id: "
Describe the results you expected:
At least a few retries, as is done in non-blocking mode.
Additional information you deem important (e.g. issue happens only occasionally):
Output of docker version: N/A
Output of docker info: N/A
Additional environment details (AWS, VirtualBox, physical, etc.):
This bug is observable on anything that uses the awslogs driver for CloudWatch logging.