Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Raise NetworkError and reconnect when multi_response_nonblock is unexpectedly called #783

Conversation

mervync
Copy link
Contributor

@mervync mervync commented Sep 30, 2021

When an elasticache node is rebooted, it seems to send invalid response headers that cause dalli to think that the multiget response has completed, when there is actually still data queued up in the socket.

This causes dalli to raise RuntimeError, "multi_response has completed", which is acceptable behavior. However, the problem is that the data queued up in the socket then leaks into subsequent responses.

It is unclear if this is an elasticache-specific bug. This may be related to the issue reported in #390.

To work around this issue, raise NetworkError and reconnect whenever we run into this state. The multiget response will be incomplete but it seems like this is being addressed in #754.

Other possible solutions:

  • Call down! - this also works, but may cause the server to be marked as down for a longer time than necessary.
  • Attempt to drain the socket when this happens - seems like a rather more complicated fix.

@petergoldstein
Copy link
Owner

@mervync in your opinion would we need to ensure that #754 is merged along with this submission for correct behavior? Or are the two independent?

And are there other known situations where we're likely to run into this error?

@mervync
Copy link
Contributor Author

mervync commented Sep 30, 2021

Other networking-related issues could cause the issue described in #754 to arise, so the two issues seem independent to me.

I didn't notice any other cases where this error could arise. That said, the code around multi gets is quite complex...

@petergoldstein
Copy link
Owner

@mervync This looks right to me.

The issue seems to potentially arise when either i) There's a client side abort or ii) The server sends a bad key/value header. In either case the natural thing to do is reconnect the socket, since the state of data on the socket is corrupt or subject to a possible race condition. "Normal" operations should be unaffected by this change.

As your tests note, there's also the case where multi_response_start hasn't been called beforehand. One could argue this case should be handled with an error (as the socket isn't even connected), but given the usage of that method I'm satisfied with the proposed approach.

Thanks for the contribution.

@petergoldstein petergoldstein merged commit 5215719 into petergoldstein:master Oct 1, 2021
@mervync mervync deleted the mervyn/reconnect_on_unexpected_multi_response branch October 1, 2021 20:20
@mervync
Copy link
Contributor Author

mervync commented Oct 1, 2021

@petergoldstein Thanks for reviewing and merging. Would greatly appreciate a patch release 🙏

@petergoldstein
Copy link
Owner

@mervync I need to review changes since last patch, and I want to see if we can get one more PR in, but I should do a patch release reasonably soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants