Stuck in RetryableSendRequest #1519

Closed
zrzka opened this issue May 16, 2018 · 5 comments
Labels
A-client (Area: client) · C-bug (Category: bug. Something is wrong. This is bad!)

Comments

@zrzka

zrzka commented May 16, 2018

Description

We have an application named recorder, which handles 0.5-1 Gbps of incoming data, processes it (basically creates TAR archives), and uploads it to S3. The network load is roughly the same for tx and rx. If anything goes wrong, logs are sent to Loggly via our open-source rust-slog-loggly crate. The application runs on EC2 instances (m5.large, r4.large).

This load is created by cameras and a fairly large number of requests:

  • download a playlist every 6 s (per camera; small, just a few bytes)
  • download segments (6 s duration, several MB each)
  • create a TAR archive from them (60 s duration, 10 segments)
  • upload the TAR archive

From time to time the application stops. It looks frozen ...

[screenshot: network]

... but it's still running, and CPU user time goes up while everything else goes down ...

[screenshot: cpu]

... more info from top -H -p ... ...

[screenshot: top]

... and here's what the app is doing ...

[screenshot: gdb]

Dependencies

Relevant dependencies.

futures = "0.1.21"
futures-cpupool = "0.1.8"
hyper = "0.11.26"
hyper-tls = "0.1.3"
tokio-core = "0.1.17"
tokio-timer = "0.1.2"

Tokio

99% of the processing is done on the main thread within one Tokio event loop. There's also a CPU pool to which slow operations (TAR archive creation, ...) are offloaded.

Every single request is wrapped in a Tokio timer timeout.
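For context, here's a minimal sketch (not our actual code) of that pattern, using the crate versions listed under Dependencies: a hyper 0.11 request raced against a tokio-timer sleep on a single tokio-core reactor, with CPU-heavy work offloaded to a futures-cpupool. The URL, the 10 s timeout, and the select2-based timeout wrapping are illustrative assumptions, not necessarily what the recorder does.

```rust
extern crate futures;
extern crate futures_cpupool;
extern crate hyper;
extern crate tokio_core;
extern crate tokio_timer;

use std::time::Duration;

use futures::Future;
use futures::future::Either;
use futures_cpupool::CpuPool;
use hyper::Client;
use tokio_core::reactor::Core;
use tokio_timer::Timer;

fn main() {
    let mut core = Core::new().unwrap();
    let handle = core.handle();
    let client = Client::new(&handle);
    let timer = Timer::default();
    let pool = CpuPool::new_num_cpus();

    // Hypothetical playlist URL, just for illustration.
    let uri = "http://example.com/playlist.m3u8".parse().unwrap();

    // Race the request against a 10 second sleep; whichever finishes first wins.
    let request = client.get(uri);
    let timeout = timer.sleep(Duration::from_secs(10));
    let with_timeout = request.select2(timeout).then(|res| match res {
        Ok(Either::A((resp, _))) => Ok(Some(resp.status())),
        Ok(Either::B((_, _))) => Ok(None), // timed out
        Err(Either::A((e, _))) => Err(format!("request error: {}", e)),
        Err(Either::B((e, _))) => Err(format!("timer error: {}", e)),
    });

    // Slow, CPU-bound work (e.g. building a TAR archive) is offloaded to the
    // CpuPool instead of blocking the reactor thread.
    let archive = pool.spawn_fn(|| -> Result<usize, String> { Ok(0) });

    let result = core.run(with_timeout.join(archive)).unwrap();
    println!("{:?}", result);
}
```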

Reproducibility

I can "reproduce" this on Linux machines (EC2, with kernel 4.4.0-1054-aws) and om my local macOS machine.

Why "reproduce" in quotes? It's hard. Sometimes the application runs for a day, for 5 hours or for just 5 minutes without issues.

Workaround

We had to disable keep_alive and retry_canceled_requests.

No more issues (for several hours) since this change. I'll confirm again later, once the app has been up for at least two days.
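For reference, a minimal sketch of what that change looks like with hyper 0.11's Client::configure() builder (illustrative, not our exact code):

```rust
extern crate hyper;
extern crate tokio_core;

use hyper::Client;
use tokio_core::reactor::Core;

fn main() {
    let core = Core::new().unwrap();
    let handle = core.handle();

    // Workaround: disable connection reuse and the retrying of requests that
    // were canceled on a reused (keep-alive) connection.
    let _client = Client::configure()
        .keep_alive(false)
        .retry_canceled_requests(false)
        .build(&handle);
}
```

In production we build the client with an HTTPS connector from hyper-tls, but the two options above are the relevant part here.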

@seanmonstar seanmonstar added C-bug Category: bug. Something is wrong. This is bad! A-client Area: client. labels May 16, 2018
@seanmonstar
Member

I think I may have seen this myself, but didn't realize the issue, or rather, assumed it would clean itself up pretty quickly. Seems not!

@seanmonstar
Member

You should be able to keep keep_alive enabled and simply disable retry_canceled_requests, right?

@zrzka
Author

zrzka commented May 16, 2018

I disabled both as a starting point, to test again and confirm what I found, but I think disabling retry_canceled_requests alone is enough. I'll test that (with keep-alive enabled) under some load as well, but I can't say when I'll be able to confirm it, because of the reproducibility issue.

BTW the app is still running (~5 h). We decided it will pass internally if it runs for at least a week, because it will be under heavier load later.

@seanmonstar
Member

I found that this had already been fixed on master, so I've backported the changes to the 0.11.x branch. A fix for this has just been published in v0.11.17.

@zrzka
Author

zrzka commented May 16, 2018

Thanks! For future readers, it was published as v0.11.27.
