In awslogs blocking mode, throttles from CreateLogStream API requests are not retried #42088

Open
bbright024 opened this issue Feb 26, 2021 · 4 comments

Comments

bbright024 commented Feb 26, 2021

Description

There is a limit of 50 TPS on CreateLogStream operations, beyond which requests are throttled [1]; the resulting error is described in [2]:

ThrottlingException
The request was denied due to request throttling.
HTTP Status Code: 400

In the awslogs log driver, the errors handled for CreateLogStream/CreateLogGroup requests are listed here: [3]
In non-blocking mode, any error returned by createLogStream() that doesn't match the list of known errors is retried 5 times over the span of a minute [4]. However, in the default blocking mode, any throttle error received from the CreateLogStream/CreateLogGroup API causes the log driver to quit immediately.

"Task failed to start( CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: ThrottlingException: Rate exceeded status code: 400, request id: "

This happens in batch-job situations where many containers are spun up in a short period of time, especially in conjunction with a large number of Lambda functions. When this message appears, the common strategy has been to request a TPS increase for the CreateLogStream throttle limit - but doing so only delays/reduces the frequency of the error rather than preventing it. Another option is using non-blocking mode instead of blocking mode, but that's a hacky workaround (and doesn't actually work in Fargate 1.4 because of a second bug in the shim logger used by Fargate [5][6] - long story). Adding a retry loop to blocking mode would eliminate these failures almost entirely; a sketch of what that could look like follows the references below.

[1] https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_CreateLogStream.html
[2] https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/CommonErrors.html
[3] https://github.com/moby/moby/blob/master/daemon/logger/awslogs/cloudwatchlogs.go#L60-L63
[4] https://github.com/moby/moby/blob/master/daemon/logger/awslogs/cloudwatchlogs.go#L140-L199
[5] https://github.com/aws/amazon-ecs-shim-loggers-for-containerd/blob/master/logger/awslogs/logger.go#L84
[6] https://aws.amazon.com/blogs/containers/under-the-hood-fargate-data-plane/
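
For illustration, here is a minimal sketch of such a retry loop around CreateLogStream (assuming aws-sdk-go v1; the function name, retry budget, and backoff values are assumptions for illustration, not the actual moby implementation):

```go
package awslogsretry // hypothetical package name for illustration

import (
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs/cloudwatchlogsiface"
)

const (
	maxRetries = 5               // assumed retry budget, mirroring non-blocking mode
	baseDelay  = 2 * time.Second // assumed initial backoff
)

// createLogStreamWithRetry keeps retrying CreateLogStream when the API
// throttles, instead of failing container start on the first 400.
func createLogStreamWithRetry(client cloudwatchlogsiface.CloudWatchLogsAPI, group, stream string) error {
	input := &cloudwatchlogs.CreateLogStreamInput{
		LogGroupName:  &group,
		LogStreamName: &stream,
	}
	var lastErr error
	for attempt := 0; attempt < maxRetries; attempt++ {
		_, err := client.CreateLogStream(input)
		if err == nil {
			return nil
		}
		if aerr, ok := err.(awserr.Error); ok {
			// An already-existing stream is fine; treat it as success.
			if aerr.Code() == cloudwatchlogs.ErrCodeResourceAlreadyExistsException {
				return nil
			}
			// Give up immediately on anything other than throttling.
			if aerr.Code() != "ThrottlingException" {
				return err
			}
		}
		lastErr = err
		if attempt < maxRetries-1 {
			time.Sleep(baseDelay * time.Duration(1<<attempt)) // 2s, 4s, 8s, ...
		}
	}
	return fmt.Errorf("CreateLogStream still throttled after %d attempts: %w", maxRetries, lastErr)
}
```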

Steps to reproduce the issue:

  1. Use blocking mode with awslogs log driver
  2. Spin up enough containers in a short period of time to overload the 50 TPS limit on CreateLogStream
  3. Observe which containers/log drivers randomly fail

Describe the results you received:

"Task failed to start( CannotStartContainerError: Error response from daemon: failed to initialize logging driver: failed to create Cloudwatch log stream: ThrottlingException: Rate exceeded status code: 400, request id: "

Describe the results you expected:

At least a few retries, as is done in non-blocking mode

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

N/A

Output of docker info:

N/A

Additional environment details (AWS, VirtualBox, physical, etc.):

This bug is observable on anything that uses awslogs for CloudWatch logging

@bbright024 (Author)

A colleague pointed out that the AWS SDKs implement retry logic by default, which can be configured when constructing the client object.
Upon further investigation, we found that the default 3 retries are happening, but they're too close together:

| @timestamp | Throttle | StatusCode | Exception |
|---|---|---|---|
| 2021-02-03 19:02:17.142 | 1 | 400 | ThrottlingException |
| 2021-02-03 19:02:18.009 | 1 | 400 | ThrottlingException |
| 2021-02-03 19:02:19.799 | 1 | 400 | ThrottlingException |
| 2021-02-03 19:02:22.415 | 1 | 400 | ThrottlingException |

After the original request was throttled, the first retry attempt was made 0.867 seconds later. The final two retries were made after delays of 1.79 seconds and 2.616 seconds. Three other log streams using the same code similarly failed to be created during this time period.

I've opened an issue in aws-sdk-go to increase the default delay duration [1]; I'll leave this issue open, however, in case that one is denied/closed/not actioned, or in case further retry configuration is deemed necessary in awslogs.

[1] aws/aws-sdk-go#3803
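
Should retry configuration in awslogs itself turn out to be needed, here is a minimal sketch of lengthening the throttle backoff when constructing the CloudWatch Logs client (again assuming aws-sdk-go v1; the function name, retry count, and delay values are illustrative assumptions, not current defaults):

```go
package awslogsretry // hypothetical package name for illustration

import (
	"time"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/client"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/cloudwatchlogs"
)

// newLogsClient builds a CloudWatch Logs client whose retryer waits
// considerably longer between throttled attempts than the SDK default.
func newLogsClient() (*cloudwatchlogs.CloudWatchLogs, error) {
	sess, err := session.NewSession()
	if err != nil {
		return nil, err
	}
	cfg := aws.NewConfig()
	cfg.Retryer = client.DefaultRetryer{
		NumMaxRetries:    8,                // assumed budget; the SDK default is 3
		MinThrottleDelay: 1 * time.Second,  // assumed floor for throttle backoff
		MaxThrottleDelay: 30 * time.Second, // assumed ceiling for throttle backoff
	}
	return cloudwatchlogs.New(sess, cfg), nil
}
```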

@shughes-uk

Seems like this now falls under "not actioned"? The SDK issue has been closed as stale.

@thaJeztah (Member)

/cc @kzys @austinvazquez

@shughes-uk

For reference, when we booted a 400-node cluster today, we lost 19 nodes to this issue.
