Raise NetworkError and reconnect when multi_response_nonblock is unexpectedly called #783

mervync · 2021-09-30T00:27:58Z

When an ElastiCache node is rebooted, it seems to send invalid response headers that cause Dalli to think the multiget response has completed while data is actually still queued up in the socket.

This causes dalli to raise RuntimeError, "multi_response has completed", which is acceptable behavior. However, the problem is that the data queued up in the socket then leaks into subsequent responses.
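The leak can be illustrated with a toy model. This is only a sketch under stated assumptions: the line-based protocol, the `ToyConnection` class, and its `read_multi` and `read_single` methods are all hypothetical and stand in for the real binary protocol and Dalli's internals. It shows how a premature terminator leaves bytes on the socket that a later, unrelated read then picks up.

```ruby
require 'stringio'

# Toy line protocol (an assumption, not memcached's real protocol):
# each hit is "key value\n" and a multiget is terminated by "END\n".
class ToyConnection
  def initialize(io)
    @io = io
  end

  # Reads hits until the terminator line is seen.
  def read_multi
    hits = {}
    while (line = @io.gets)
      line = line.chomp
      break if line == 'END'

      key, value = line.split(' ', 2)
      hits[key] = value
    end
    hits
  end

  # Reads one "key value" reply for a plain get and returns the value.
  def read_single
    line = @io.gets
    line && line.chomp.split(' ', 2).last
  end
end

# The simulated server ends the multiget early, then keeps sending data.
wire = StringIO.new("a 1\nEND\nb 2\nEND\nc 3\n")
conn = ToyConnection.new(wire)

multi = conn.read_multi   # stops at the premature END, so "b" is missing
stale = conn.read_single  # meant to read the reply for "c", but sees "b 2"
```

In this model the multiget returns only the `a` entry, and the following single read returns the value that belonged to `b`, which is the kind of cross-request leak described above.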

It is unclear whether this is an ElastiCache-specific bug. It may be related to the issue reported in #390.

To work around this issue, raise NetworkError and reconnect whenever we run into this state. The multiget response will be incomplete, but that appears to be addressed in #754.

Other possible solutions:

- Call down! - this also works, but may cause the server to be marked as down for a longer time than necessary.
- Attempt to drain the socket when this happens - seems like a rather more complicated fix.
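For illustration, the "raise a network error and reconnect" approach might look roughly like the following. This is a minimal sketch, not the code in this PR: `ToyServer`, `ToyNetworkError`, `multi_start`, `multi_read`, and `drop_connection!` are hypothetical names, and a `StringIO` stands in for the socket. It assumes that dropping the socket is enough to discard any stale bytes, and that the next operation opens a fresh connection.

```ruby
require 'stringio'

# Stand-in error class, defined here so the sketch runs without any gem.
class ToyNetworkError < StandardError; end

# Hypothetical connection wrapper; not Dalli's actual server class.
class ToyServer
  attr_reader :connect_count

  # connector: any callable that returns a fresh IO-like object.
  def initialize(connector)
    @connector = connector
    @connect_count = 0
    @sock = nil
    @multi_in_progress = false
  end

  # Marks a multiget as started, connecting first if there is no socket.
  def multi_start
    connect if @sock.nil?
    @multi_in_progress = true
  end

  # Reads the buffered reply. A call with no multiget in progress means the
  # socket contents cannot be trusted, so the connection is dropped.
  def multi_read
    drop_connection!('multi read called with no multiget in progress') unless @multi_in_progress
    data = @sock.read
    @multi_in_progress = false
    data
  end

  private

  def connect
    @sock = @connector.call
    @connect_count += 1
  end

  # Closes the socket, discarding stale bytes with it, then raises so the
  # caller sees a network failure. The next multi_start reconnects.
  def drop_connection!(message)
    @sock&.close
    @sock = nil
    raise ToyNetworkError, message
  end
end

server = ToyServer.new(-> { StringIO.new('fresh reply') })
server.multi_start
first = server.multi_read

# A second read without a new multi_start is the unexpected call.
error = begin
  server.multi_read
  nil
rescue ToyNetworkError => e
  e
end

server.multi_start            # opens a new connection
second = server.multi_read
```

Compared with marking the server as down, this keeps the outage limited to the one failed call: the error surfaces to the caller, and the next request starts from a clean socket.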

petergoldstein · 2021-09-30T15:42:18Z

@mervync in your opinion would we need to ensure that #754 is merged along with this submission for correct behavior? Or are the two independent?

And are there other known situations where we're likely to run into this error?

mervync · 2021-09-30T19:01:22Z

Other networking-related issues could cause the issue described in #754 to arise, so the two issues seem independent to me.

I didn't notice any other cases where this error could arise. That said, the code around multigets is quite complex...

petergoldstein · 2021-10-01T16:37:35Z

@mervync This looks right to me.

The issue seems to arise when either (i) there's a client-side abort or (ii) the server sends a bad key/value header. In either case the natural thing to do is reconnect the socket, since the data on the socket is corrupt or subject to a possible race condition. "Normal" operations should be unaffected by this change.

As your tests note, there's also the case where multi_response_start hasn't been called beforehand. One could argue this case should be handled with an error (as the socket isn't even connected), but given the usage of that method I'm satisfied with the proposed approach.

Thanks for the contribution.

mervync · 2021-10-01T20:24:56Z

@petergoldstein Thanks for reviewing and merging. Would greatly appreciate a patch release 🙏

petergoldstein · 2021-10-01T20:44:04Z

@mervync I need to review changes since last patch, and I want to see if we can get one more PR in, but I should do a patch release reasonably soon.

Raise NetworkError and reconnect when multi_response_nonblock is unexpectedly called (4e918d3)

petergoldstein merged commit 5215719 into petergoldstein:master Oct 1, 2021

mervync deleted the mervyn/reconnect_on_unexpected_multi_response branch October 1, 2021 20:20

tierra mentioned this pull request Nov 11, 2021

fix: resolve Dalli::Server deprecation in 3.0+ open-telemetry/opentelemetry-ruby#1015

Merged

mervync mentioned this pull request Mar 16, 2022

Revert "[CCORE-371] Fall back to database on 'multi_response has completed' RuntimeError" zendesk/kasket#69

Merged
