In awslogs blocking mode, throttles from CreateLogStream API requests are not retried #42088

Open
bbright024 opened this issue Feb 26, 2021 · 4 comments

Comments

bbright024 commented Feb 26, 2021

Description

There is a limit of 50 TPS on CreateLogStream operations, beyond which requests are throttled [1]; the resulting error is described in [2]:

ThrottlingException
The request was denied due to request throttling.
HTTP Status Code: 400

In the awslogs log driver, the errors handled for CreateLogStream/CreateLogGroup requests are listed here: [3]
In non-blocking mode, any error returned by createLogStream() that doesn't match the list of known errors is retried 5 times over the span of a minute [4]. However, in the default blocking mode, any throttle error received from the CreateLogStream/CreateLogGroup API causes the log driver to quit immediately.

"Task failed to start( CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: ThrottlingException: Rate exceeded status code: 400, request id: "

This happens in batch-job situations where many containers are spun up in a short period of time, especially in conjunction with a large number of Lambda functions. When this message appears, the common strategy has been to request a TPS increase for the CreateLogStream throttle limit - but doing so only delays/reduces the frequency of the error rather than preventing it. Another option is using non-blocking mode instead of blocking mode, but that's a hacky workaround (and doesn't actually work in Fargate 1.4 because of a second bug in the shim logger used by Fargate [5][6] - long story). Adding a retry loop to blocking mode would eliminate these failures almost entirely; a sketch of what that could look like follows the references below.

[1] https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_CreateLogStream.html
[2] https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/CommonErrors.html
[3] https://github.com/moby/moby/blob/master/daemon/logger/awslogs/cloudwatchlogs.go#L60-L63
[4] https://github.com/moby/moby/blob/master/daemon/logger/awslogs/cloudwatchlogs.go#L140-L199
[5] https://github.com/aws/amazon-ecs-shim-loggers-for-containerd/blob/master/logger/awslogs/logger.go#L84
[6] https://aws.amazon.com/blogs/containers/under-the-hood-fargate-data-plane/
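
For illustration, here is a minimal sketch of such a retry loop around CreateLogStream (assuming aws-sdk-go v1; the function name, retry budget, and backoff values are assumptions for illustration, not the actual moby implementation):

```go
package awslogsretry // hypothetical package name for illustration

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs/cloudwatchlogsiface"
)

const (
	maxRetries = 5               // assumed retry budget, mirroring non-blocking mode
	baseDelay  = 2 * time.Second // assumed initial backoff
)

// createLogStreamWithRetry keeps retrying CreateLogStream when the API
// throttles, instead of failing container start on the first 400.
func createLogStreamWithRetry(client cloudwatchlogsiface.CloudWatchLogsAPI, group, stream string) error {
	input := &cloudwatchlogs.CreateLogStreamInput{
		LogGroupName:  &group,
		LogStreamName: &stream,
	}
	var lastErr error
	for attempt := 0; attempt < maxRetries; attempt++ {
		_, err := client.CreateLogStream(input)
		if err == nil {
			return nil
		}
		if aerr, ok := err.(awserr.Error); ok {
			// An already-existing stream is fine; treat it as success.
			if aerr.Code() == cloudwatchlogs.ErrCodeResourceAlreadyExistsException {
				return nil
			}
			// Give up immediately on anything other than throttling.
			if aerr.Code() != "ThrottlingException" {
				return err
			}
		}
		lastErr = err
		if attempt < maxRetries-1 {
			time.Sleep(baseDelay * time.Duration(1<<attempt)) // 2s, 4s, 8s, ...
		}
	}
	return fmt.Errorf("CreateLogStream still throttled after %d attempts: %w", maxRetries, lastErr)
}
```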

Steps to reproduce the issue:

  1. Use blocking mode with awslogs log driver
  2. Spin up enough containers in a short period of time to overload the 50 TPS limit on CreateLogStream
  3. Observe which containers/log drivers randomly fail

Describe the results you received:

"Task failed to start( CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: ThrottlingException: Rate exceeded status code: 400, request id: "

Describe the results you expected:

At least a few retries, as is done in non-blocking mode

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

N/A

Output of docker info:

N/A

Additional environment details (AWS, VirtualBox, physical, etc.):

This bug is observable on anything that uses awslogs for CloudWatch logging

@bbright024 (Author)

A colleague pointed out that the AWS SDKs implement retry logic by default, which can be configured when constructing the client object.
Upon further investigation, we found that the default 3 retries are happening, but they're too close together:

| @timestamp | Throttle | StatusCode | Exception |
|---|---|---|---|
| 2021-02-03 19:02:17.142 | 1 | 400 | ThrottlingException |
| 2021-02-03 19:02:18.009 | 1 | 400 | ThrottlingException |
| 2021-02-03 19:02:19.799 | 1 | 400 | ThrottlingException |
| 2021-02-03 19:02:22.415 | 1 | 400 | ThrottlingException |

After the original request was throttled, the first retry attempt was made 0.867 seconds later. The final two retries were made after delays of 1.79 seconds and 2.616 seconds. Three other log streams using the same code similarly failed to be created during this time period.

I've opened an issue in aws-sdk-go to increase the default delay duration [1]; I'll leave this issue open, however, in case that one is denied/closed/not actioned, or in case further retry configuration is deemed necessary in awslogs.

[1] aws/aws-sdk-go#3803
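
Should retry configuration in awslogs itself turn out to be needed, here is a minimal sketch of lengthening the throttle backoff when constructing the CloudWatch Logs client (again assuming aws-sdk-go v1; the function name, retry count, and delay values are illustrative assumptions, not current defaults):

```go
package awslogsretry // hypothetical package name for illustration

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

// newLogsClient builds a CloudWatch Logs client whose retryer waits
// considerably longer between throttled attempts than the SDK default.
func newLogsClient() (*cloudwatchlogs.CloudWatchLogs, error) {
	sess, err := session.NewSession()
	if err != nil {
		return nil, err
	}
	cfg := aws.NewConfig()
	cfg.Retryer = client.DefaultRetryer{
		NumMaxRetries:    8,                // assumed budget; the SDK default is 3
		MinThrottleDelay: 1 * time.Second,  // assumed floor for throttle backoff
		MaxThrottleDelay: 30 * time.Second, // assumed ceiling for throttle backoff
	}
	return cloudwatchlogs.New(sess, cfg), nil
}
```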

@shughes-uk

Seems like this now falls under "not actioned"? The SDK issue has been closed as stale.

@thaJeztah (Member)

/cc @kzys @austinvazquez

@shughes-uk

For reference, when we booted a 400-node cluster today, we lost 19 nodes to this issue.
