Introduce safe_get option which ensures key:value integrity even with socket corruption #959

cornu-ammonis · 2023-03-10T17:46:55Z

[Edit] - Note that while the maintainer is focused on errors in my original issue submission, I am focused on the fact that multiple users have reported that Dalli sometimes returns incorrect values, and the get implementation does not validate against this. That is an unacceptable failure case, so we have pursued an implementation similar to this PR that rules it out; see aha-app@f6da276.

--

Thinking more about #956 - I have high confidence that socket corruption explains the incorrect behavior I observed. But I do not have high confidence I found the exact place socket corruption occurred (unfortunately I do not have the stack trace) or that there are no other potential places. It seems in general that connections do not get locked between write ops and reading responses; an error or timeout between any of those could potentially lead to socket corruption. That may be worth fixing and I think my first PR is still worth considering, but I have thought of a more robust approach.

With safe_get: true, we will issue getk instead of get ops to memcached, and ensure that the returned key matches the requested key. This guarantees that even in the case of socket corruption, we cannot return incorrect values for requested keys. The connection is closed if keys do not match, so that the connection manager will eventually re-establish a connection and recover gracefully.
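A minimal sketch of the check described above, assuming the memcached binary protocol, where a getk response carries the key between the 4-byte flags extras and the value. The names `verified_getk_value`, `build_getk_response`, and `KeyMismatchError` are hypothetical and are not taken from this PR or from Dalli; a `StringIO` stands in for the socket.

```ruby
require 'stringio'

# Hypothetical names throughout; this is not Dalli's implementation.
class KeyMismatchError < StandardError; end

# magic, opcode, key length, extras length, data type, status,
# total body length, opaque, CAS (24 bytes in total)
HEADER_FORMAT = 'CCnCCnNNQ>'
HEADER_SIZE = 24
RESPONSE_MAGIC = 0x81
GETK_OPCODE = 0x0c

# Builds the bytes a server would send back for a successful getk.
def build_getk_response(key, value, flags: 0)
  extras = [flags].pack('N')
  body = extras + key.b + value.b
  header = [RESPONSE_MAGIC, GETK_OPCODE, key.bytesize, extras.bytesize,
            0, 0, body.bytesize, 0, 0].pack(HEADER_FORMAT)
  header + body
end

# Reads one getk response from io and returns the value only when the
# echoed key equals the requested key. On a mismatch the io is closed
# and an error is raised, so a stale response is never returned.
def verified_getk_value(io, requested_key)
  header = io.read(HEADER_SIZE)
  _magic, _opcode, key_len, extras_len, _type, status, body_len, _opaque, _cas =
    header.unpack(HEADER_FORMAT)
  body = io.read(body_len)
  return nil unless status.zero? # non-zero status, e.g. key not found

  returned_key = body.byteslice(extras_len, key_len)
  if returned_key != requested_key.b
    io.close # drop the connection so a fresh one is established later
    raise KeyMismatchError,
          "requested #{requested_key.inspect}, got #{returned_key.inspect}"
  end
  body.byteslice(extras_len + key_len, body_len - extras_len - key_len)
end

# A response left over from a different request is rejected.
stale = StringIO.new(build_getk_response('user:2', 'bob'))
begin
  verified_getk_value(stale, 'user:1')
rescue KeyMismatchError => e
  warn "rejected: #{e.message}"
end
```

With a plain get, the response carries no key, so the second case would be indistinguishable from a valid hit.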

Using getk vs get comes at the cost of some performance overhead due to key retrieval and comparison. This performance cost would be most significant when caching a large number of small values with comparatively large keys. That is why I lean towards making this an opt-in change - but certainly for our purposes and likely for many other teams, key:value integrity and safety would far outweigh the marginal performance cost.
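To put a rough number on that overhead, here is a back-of-the-envelope sketch. It assumes the binary protocol's 24-byte response header and 4-byte flags extras, counts only response bytes on the wire, and ignores the cost of the key comparison itself. The helper names are hypothetical.

```ruby
HEADER_BYTES = 24 # binary protocol response header
FLAGS_BYTES = 4   # extras carried by get and getk hits

# Response size for a plain get hit: header, flags, value.
def get_response_bytes(value_size)
  HEADER_BYTES + FLAGS_BYTES + value_size
end

# A getk hit additionally echoes the key.
def getk_response_bytes(key_size, value_size)
  get_response_bytes(value_size) + key_size
end

# Extra response bytes of getk, relative to the plain get response.
def getk_overhead_ratio(key_size, value_size)
  key_size.to_f / get_response_bytes(value_size)
end

# Small value, large key: the echoed key dominates the response.
puts getk_overhead_ratio(100, 10)
# Large value, small key: the echoed key is a small fraction.
puts getk_overhead_ratio(20, 4096)
```

Under these assumptions a 100-byte key with a 10-byte value grows the response from 38 to 138 bytes, while a 20-byte key with a 4 KiB value adds well under 1%.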

petergoldstein · 2023-03-10T17:56:04Z

I'm not going to merge this. I don't think it's merited or architecturally sound.

cornu-ammonis · 2023-03-10T17:57:16Z

> I'm not going to merge this. I don't think it's merited or architecturally sound.

Could you elaborate on why? Do you have any thoughts on the original issue #956?

petergoldstein · 2023-03-10T18:31:12Z

See my notes on #956. I don't actually believe this is an issue based on the evidence provided.

That said, ultimately this is not architecturally sound because:

- It attempts to solve a supposed thread / connection issue with a work-around at the application level, breaking layer separation.
- It breaks protocol transparency (binary vs. meta) without any compelling reason.
- It treats "socket corruption" as a bizarre special case. A socket that is corrupt is no different from one that can't reach the server anymore, that timed out during a request, or had any other network level error. It's just a network error.

cornu-ammonis · 2023-03-10T20:00:12Z

> It attempts to solve a supposed thread / connection issue with a work-around at the application level, breaking layer separation.

Is there a lower-level place you'd find more appropriate to address this?

> It breaks protocol transparency (binary vs. meta) without any compelling reason

I can definitely address that - I just wanted to get feedback on the approach before implementing in both protocols.

> It treats "socket corruption" as a bizarre special case. A socket that is corrupt is no different from one that can't reach the server anymore, that timed out during a request, or had any other network level error. It's just a network error.

The issue is that the get operation does not detect this network error. If the get operation could be sure the socket is in a valid state before proceeding - for example checking that it is empty - then I agree my approach wouldn't be necessary. But it wasn't clear to me that approach would be feasible or preferable to this one.
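For illustration only, one way the "check that it is empty" idea could look is a zero-timeout `IO.select` before writing a request. This is a sketch under assumptions, not something proposed in this PR or present in Dalli: `pending_input?` is a hypothetical helper, and `UNIXSocket.pair` (Unix-like systems only) stands in for a memcached connection.

```ruby
require 'socket'

# Returns true when the socket has unread bytes (or has hit EOF) right
# now. Either condition means a new request should not be written to it.
def pending_input?(sock)
  !IO.select([sock], nil, nil, 0).nil?
end

client, server = UNIXSocket.pair

puts pending_input?(client) # nothing has been sent yet

# Simulate a response that arrived after its request was abandoned.
server.write('leftover response bytes')
puts pending_input?(client) # unread data is now waiting
```

Such a check is inherently racy: a late response can still arrive after the check passes and before the new response is read, which is one reason it may not be preferable to validating the returned key.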

petergoldstein · 2023-03-10T20:22:02Z

As noted in #965 this entire discussion seems to be predicated on a basic misunderstanding of the code/lifecycle for requests. I don't really think it's productive to discuss how to fix something that is conceptually in error.

cornu-ammonis added 4 commits March 9, 2023 12:35

implement getk with key integrity check

1d43e4e

safe_get option using refined getk implementation

0a6574d

safe_get/getk improvements

cc4582b

replace present? and simplify condition

118cf5a

petergoldstein closed this Mar 10, 2023

cornu-ammonis mentioned this pull request Mar 10, 2023

Dalli sometimes returns incorrect values #956

Closed
