http.send cache concurrency issues #5359
Thanks for writing this up.
Yes, definitely 🙃 I can't pinpoint this right away; the only thing I noticed is that if you go from

```go
c.mtx.Lock()
defer c.mtx.Unlock()
return c.unsafeInsert(k, v)
```

to

```go
c.mtx.Lock()
defer c.mtx.Unlock()
fmt.Println(time.Now().Second(), "unsafeInsert", c.usage, dropped)
return c.unsafeInsert(k, v)
```

you will never see `dropped != 0`. Maybe it would be interesting to do this instead:

```go
c.mtx.Lock()
defer c.mtx.Unlock()
dropped = c.unsafeInsert(k, v)
fmt.Println(time.Now().Second(), "unsafeInsert", c.usage, dropped)
return
```
I think you're looking at the wrong line? Edit: I've updated the OP with a permalink just to be safe :)
So please correct me if I'm wrong, but there are two things here to think about, I suppose:
Yes, I was 😳 sorry about that. Please ignore.
I'm guessing there's no locking mechanism on identical requests. It would be nice if OPA didn't make these concurrent identical requests, but my big issue is that when the cache eventually becomes full, the cache barely seems to work at all.
So you got me thinking -- the two requests shouldn't each be stored in the cache, but only one of them. But looking at the code, that's just half the story: Lines 141 to 142 in e5bc273

L141 will overwrite, but L142 will append.
Similarly, when that has happened, the extra copy is never removed: the c.l.Remove() in L136 will only remove one copy of it from the list: Line 136 in e5bc273
But none of this should happen in the first place: Lines 101 to 105 in e5bc273

Even for concurrent requests, the mutex lock on the cache should mean that for two concurrent requests of the same key (i.e. http.send args), only one outgoing request should ever be made 🤔 Where does it go wrong...?
It should be removed when the cache is full here: https://github.com/open-policy-agent/opa/blob/main/topdown/cache/cache.go#L136
It only prevents them from accessing the cache at the same time, but they will both insert the cache item, which causes the size to be counted twice. When the cache item is eventually removed, though, the size is only decreased once. ...Right? Did I just figure it out? 😅
OK, I think I see the problem: while each Get and Insert method of the cache is guarded by a mutex, the caller doesn't hold a lock across the whole operation: in Line 1254 in e5bc273 we check the cache, make the request, and then we insert the result.... Your response has just popped up, and yes, I believe you've figured it out 😄
Now, how will we solve this? 😅
Personally I might prefer if
Of course, the very easy fix is to simply make sure we subtract the size of the old value before replacing it and add the size of the new value, as done in #5361. I tested with 10 concurrent requests, and while I'd still prefer that multiple HTTP requests not be made, at least this fixes a bad bug 🙂
@asleire thanks for an excellent bug report! We'll look into it 👍 EDIT: Oh, I see you've made a PR already. Great!
Let's try to fix this by investigating whether we can lock the whole "check cache - execute HTTP request - insert into cache" sequence. I can look into this.
This change adds a lock on per-key cache access to handle cases where concurrent queries could result in the same key being unnecessarily overwritten multiple times and fetched from the server even though a fresh entry existed in the cache.

Fixes: open-policy-agent#5359

Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
Signed-off-by: Aleksander <Alekken@live.no>
When will there be a new release containing this fix?
Tomorrow. 🤞
Short description

The `http.send` interquery cache breaks during concurrent OPA requests. Symptoms include:

- `http.send` requests being made by the policy; when the `usage` counter reaches max size, something breaks in a way that an outgoing HTTP request is made on every single request to OPA

Steps To Reproduce
In short:

- `http.send` requests force-cached for 1 second

In long:
I don't know Go so I wouldn't know how to reproduce this directly in the OPA code. Initially I ran load tests against our OPA server and detected through our metrics system that OPA suddenly started making a lot of outgoing requests.
My minimal reproduction is as follows:

- A debug print statement, `fmt.Println(time.Now().Second(), "unsafeInsert", c.usage, dropped)`, which would tell me when something was inserted into the cache, and the size of the cache.
- `http.send` requests, each with a 1 second forced duration. See below for my policy. Note that I hid the URLs as I don't want just anyone to spam them :) Any URL that returns some amount of JSON data should do.
- Running OPA with `go run main.go run policy -s --set=caching.inter_query_builtin_cache.max_size_bytes=1000000 --set=decision_logs.console=false --log-level error`
My OPA policy:
My locust script:
Locust test run with a single user:
This works just like one would expect. The output from OPA is as follows:
Every second, two items are inserted into the cache, and the cache size does not grow indefinitely.
Locust test run with two users:
The output from OPA is as follows:
4 requests per second. I guess that's fine, but the cache size keeps growing!
At some point the cache size reaches the configured max size. And at that point the logs start looking like this:
Expected behavior
Caching should work even with concurrent requests :)
Additional context
I've reproduced this in the docker images
0.45.0
and0.46.1
, as well as by compiling and running the main branch at this point.Let me know if you need any more info