-
Notifications
You must be signed in to change notification settings - Fork 3
fix: stream connections longer than 5 minutes are dropped #244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
This pull request has been linked to Shortcut Story #216849: Fix stream timing out after 5 minutes. |
| : level_(Defaults<AnySDK>::LogLevel()), tag_(Defaults<AnySDK>::LogTag()) {} | ||
| : level_(GetLogLevelEnum(std::getenv("LD_LOG_LEVEL"), | ||
| Defaults<AnySDK>::LogLevel())), | ||
| tag_(Defaults<AnySDK>::LogTag()) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we can just run LD_LOG_LEVEL=debug ./binary but the Console Logger constructed in the client passes in a default value, so the environment variable isn't inspected.
Now it works like you'd want.
| * The timestamp of the last read is initialized after receiving successful | ||
| * headers, but before initiating the first body read. Future reads then | ||
| * update the timestamp. | ||
| */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this helper so we can more easily tell how fast / when data is being received. For example, this clearly shows heartbeats arriving every 180 seconds.
| connect_timeout_(connect_timeout), | ||
| read_timeout_(read_timeout), | ||
| write_timeout_(write_timeout), | ||
| req_(std::move(req)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re-arranging to match declaration order.
| if (ec == boost::asio::error::operation_aborted) { | ||
| logger_("read HTTP response body aborted"); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's the key: if the parser indicates that the chunked encoding is finished, initiate backoff instead of attempting to keep reading the body.
| std::move(completion))); | ||
| } | ||
|
|
||
| void do_shutdown(std::function<void()> completion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was causing unnecessary hanging when the test service tells the client to close. By cancelling the backoff timer, that operation can complete immediately.
937b304 to
ed5ff7f
Compare
| boost::ignore_unused(amount); | ||
| if (ec == boost::asio::error::operation_aborted) { | ||
| logger_("read HTTP response body aborted"); | ||
| if (ec == boost::asio::error::operation_aborted && shutting_down_) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If on_read_body() can ever be executed from a context that is different from where do_shutdown() is executed then we may have a race between modifying & accessing shutting_down_.
Edit: Never mind, I see this was removed in the last commit :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good callout, especially since I had to add the bool back.
Although the public async_shutdown method will post's to the session_'s executor, the run routine did not - I've fixed that.
Since those are the only two public entrypoints, we should be safe.
01bdc8b to
5e28860
Compare
6d9ca86 to
24fcdfc
Compare
| client_builder.read_timeout( | ||
| std::chrono::milliseconds(*params.readTimeoutMs)); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This builder option was missing, but it's already implemented. Easy win to speed up tests.
23262e3 to
d8e7b1d
Compare
| */ | ||
| void stop(); | ||
|
|
||
| private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lints.
| return std::nullopt; | ||
| } | ||
|
|
||
| client->run(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed run to async_connect to make it clear this is kicking off an asynchronous operation.
| auto const kOutboxCapacity = 1023; | ||
|
|
||
| EventOutbox::EventOutbox(net::any_io_executor executor, | ||
| std::string callback_url) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lints.
|
|
||
| req.set(http::field::host, callback_host_); | ||
| req.method(http::verb::get); | ||
| req.method(http::verb::post); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get was just plain wrong, but accepted by the contract tests.
| srv.add_capability("report"); | ||
| srv.add_capability("post"); | ||
| srv.add_capability("reconnection"); | ||
| srv.add_capability("read-timeout"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We were already passing the parameter along, just hadn't advertised the capability.
| errors_(std::move(errors)) { | ||
| backoff_timer_(std::move(executor)), | ||
| last_read_(std::nullopt), | ||
| shutting_down_(false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Session construction now takes place in create_session(), so the logic can be reused for each new connection.
| } | ||
|
|
||
| void run() override { | ||
| session_.async_connect( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After renaming run, I wrapped it in a post so we can ensure there is only one thread accessing session. This is the same as async_shutdown.
🤖 I have created a release *beep* *boop* --- <details><summary>launchdarkly-cpp-client: 3.0.8</summary> ## [3.0.8](launchdarkly-cpp-client-v3.0.7...launchdarkly-cpp-client-v3.0.8) (2023-09-13) ### Bug Fixes * stream connections longer than 5 minutes are dropped ([#244](#244)) ([e12664f](e12664f)) ### Dependencies * The following workspace dependencies were updated * dependencies * launchdarkly-cpp-internal bumped from 0.1.8 to 0.1.9 * launchdarkly-cpp-common bumped from 0.3.5 to 0.3.6 * launchdarkly-cpp-sse-client bumped from 0.1.2 to 0.1.3 </details> <details><summary>launchdarkly-cpp-common: 0.3.6</summary> ## [0.3.6](launchdarkly-cpp-common-v0.3.5...launchdarkly-cpp-common-v0.3.6) (2023-09-13) ### Bug Fixes * stream connections longer than 5 minutes are dropped ([#244](#244)) ([e12664f](e12664f)) </details> <details><summary>launchdarkly-cpp-internal: 0.1.9</summary> ### Dependencies * The following workspace dependencies were updated * dependencies * launchdarkly-cpp-common bumped from 0.3.5 to 0.3.6 </details> <details><summary>launchdarkly-cpp-sse-client: 0.1.3</summary> ## [0.1.3](launchdarkly-cpp-sse-client-v0.1.2...launchdarkly-cpp-sse-client-v0.1.3) (2023-09-13) ### Bug Fixes * stream connections longer than 5 minutes are dropped ([#244](#244)) ([e12664f](e12664f)) </details> --- This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please). Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
This fixes streaming connection dropping after 5 minutes.
The problem was applying a 5 minute timeout to an
async_readoperation, when the actual intent was to apply it to the act of receiving any data.The timeout exists to prevent the client from hanging infinitely if the server stops responding. LaunchDarkly sends heartbeats to clear the timer.
With
async_read, the operation won't complete until the response body is finished. Since the body will never finish until the client shuts down (or server interrupts it), it's not possible to institute a timeout.The solution is to use
async_read_some, which completes whenever data arrives.This way, we can assign the 5 minute timeout as an upper bound on receiving any data, as was originally intended.
A side effect of this change was breaking the
reconnectionSSE contract tests, which proved to be somewhat of a rabbithole.The upshot is that we weren't handling chunked encoding properly when chunked encoding ends. The original code assumes we either terminate an async read with an error, or it goes on forever.
In fact it should perform the backoff algorithm if we detect the end of chunked encoding.
I've also included more comments, more debug logging, and renamed some symbols for clarity. This should aid in debugging issues like this more efficiently.