Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce safe_get option which ensures key:value integrity even with socket corruption #959

Closed
wants to merge 4 commits into from

Conversation

cornu-ammonis
Copy link

@cornu-ammonis cornu-ammonis commented Mar 10, 2023

[Edit] - Note that while the maintainer is focused on errors in my original issue submission, I am focused on the fact that multiple users have reported that Dalli sometimes returns incorrect values, and the get implementation does not validate against this. It is a pretty unacceptable failure case so we have pursued an implementation similar to this PR which rules it out, see aha-app@f6da276

--

Thinking more about #956 - I have high confidence that socket corruption explains the incorrect behavior I observed. But I do not have high confidence I found the exact place socket corruption occurred (unfortunately I do not have the stack trace) or that there are no other potential places. It seems in general that connections do not get locked between write ops and reading responses; an error or timeout between any of those could potentially lead to socket corruption. That may be worth fixing and I think my first PR is still worth considering, but I have thought of a more robust approach.

With safe_get: true, we will issue getk instead of get ops to memcached, and ensure that the returned key matches the requested key. This guarantees that even in the case of socket corruption, we cannot return incorrect values for requested keys. The connection is closed if keys do not match, so that the connection manager will eventually re-establish a connection and recover gracefully.

Using getk vs get comes at the cost of some performance overhead due to key retrieval and comparison. This performance cost would be most significant when caching a large number of small values with comparatively large keys. That is why I lean towards making this an opt-in change - but certainly for our purposes and likely for many other teams, key:value integrity and safety would far outweigh the marginal performance cost.

@petergoldstein
Copy link
Owner

I'm not going to merge this. I don't think it's merited or architecturally sound.

@cornu-ammonis
Copy link
Author

I'm not going to merge this. I don't think it's merited or architecturally sound.

Could you elaborate on why? Do you have any thoughts on the original issue #956?

@petergoldstein
Copy link
Owner

See my notes on #956. I don't actually believe this is an issue based on the evidence provided.

That said, ultimately this is not architecturally sound because:

  1. It attempts to solve a supposed thread / connection issue with a work-around at the application level, breaking layer separation.
  2. It breaks protocol transparency (binary vs. meta) without any compelling reason
  3. It treats "socket corruption" as a bizarre special case. A socket that is corrupt is no different from one that can't reach the server anymore, that timed out during a request, or had any other network level error. It's just a network error.

@cornu-ammonis
Copy link
Author

  1. It attempts to solve a supposed thread / connection issue with a work-around at the application level, breaking layer separation.

Is there a lower-level place you'd find more appropriate to address this?

  1. It breaks protocol transparency (binary vs. meta) without any compelling reason

I can definitely address that - I just wanted to get feedback on the approach before implementing in both protocols.

  1. It treats "socket corruption" as a bizarre special case. A socket that is corrupt is no different from one that can't reach the server anymore, that timed out during a request, or had any other network level error. It's just a network error.

The issue is that the get operation does not detect this network error. If the get operation could be sure the socket is in a valid state before proceeding - for example checking that it is empty - then I agree my approach wouldn't be necessary. But it wasn't clear to me that approach would be feasible or preferable to this one.

@petergoldstein
Copy link
Owner

As noted in #965 this entire discussion seems to be predicated on a basic misunderstanding of the code/lifecycle for requests. I don't really think it's productive to discuss how to fix something that is conceptually in error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants