http.send cache concurrency issues #5359

Closed · asleire opened this issue Nov 4, 2022 · 18 comments

Comments

asleire (Contributor) commented Nov 4, 2022

Short description

The http.send interquery cache breaks during concurrent OPA requests.

Symptoms include:

  • The cache usage counter keeps increasing even though the same requests are being made
  • Even though the policy makes only two unique http.send requests, once the usage counter reaches the configured max size something breaks such that an outgoing HTTP request is made on every single request to OPA

Steps To Reproduce

In short:

  1. Create a policy whose evaluation makes two http.send requests that are force-cached for 1 second
  2. Run many concurrent requests to evaluate the policy
  3. Look at outgoing requests

In long:

I don't know Go so I wouldn't know how to reproduce this directly in the OPA code. Initially I ran load tests against our OPA server and detected through our metrics system that OPA suddenly started making a lot of outgoing requests.

My minimal reproduction is as follows:

  1. I cloned the OPA repository and, just before this line, added fmt.Println(time.Now().Second(), "unsafeInsert", c.usage, dropped), which logs when something is inserted into the cache along with the current cache usage.
  2. I added a policy that makes two http.send requests, each with a forced cache duration of 1 second. See below for my policy. Note that I hid the URLs as I don't want just anyone to spam them :) Any URL that returns some amount of JSON data should do.
  3. I run the OPA server using the following command: go run main.go run policy -s --set=caching.inter_query_builtin_cache.max_size_bytes=1000000 --set=decision_logs.console=false --log-level error
  4. I run a load test on the OPA server using Locust and several users. See below for my script, and my observations.

My OPA policy:

package loadtest

req1 := http.send({
	"url": "<url1>",
	"method": "GET",
	"raise_error": false,
	"force_cache": true,
	"force_cache_duration_seconds": 1,
})

req2 := http.send({
	"url": "<url2>",
	"method": "GET",
	"raise_error": false,
	"force_cache": true,
	"force_cache_duration_seconds": 1,
})

result := "OK" {
	not req1.error
	not req2.error
} else := "ERROR" {
	true
}

My locust script:

from locust import HttpUser, task

class HelloWorldUser(HttpUser):
    @task
    def read(self):
        self.client.get("/v1/data/loadtest/result")

Locust test run with a single user:
This works just like one would expect. The output from OPA is as follows:

44 unsafeInsert 4757 0
44 unsafeInsert 10902 0
45 unsafeInsert 10902 0
45 unsafeInsert 10902 0
46 unsafeInsert 10902 0
46 unsafeInsert 10902 0
47 unsafeInsert 10902 0
47 unsafeInsert 10902 0
48 unsafeInsert 10902 0
48 unsafeInsert 10902 0
49 unsafeInsert 10902 0
49 unsafeInsert 10902 0

Every second, two items are inserted into the cache, and the cache size does not grow indefinitely.

Locust test run with two users:
The output from OPA is as follows:

56 unsafeInsert 4757 0
56 unsafeInsert 10902 0
57 unsafeInsert 10902 0
57 unsafeInsert 15659 0
57 unsafeInsert 15659 0
57 unsafeInsert 21804 0
58 unsafeInsert 21803 0
58 unsafeInsert 20415 0
58 unsafeInsert 26559 0
58 unsafeInsert 32704 0
59 unsafeInsert 32704 0
59 unsafeInsert 31316 0
59 unsafeInsert 37461 0
59 unsafeInsert 43606 0
... // truncated, it keeps growing!

4 requests per second. I guess that's fine, but the cache size keeps growing!

At some point the cache size reaches the configured max size. And at that point the logs start looking like this:

7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997022 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 995635 1
7 unsafeInsert 995634 1
7 unsafeInsert 995634 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 997022 1
7 unsafeInsert 997023 1
7 unsafeInsert 997022 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 995635 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 995635 1
7 unsafeInsert 997023 1
7 unsafeInsert 995635 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 995635 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
7 unsafeInsert 997023 1
Very, very many outgoing requests per second. The cache is effectively broken.

Expected behavior

Caching should work even with concurrent requests :)

Additional context

I've reproduced this in the docker images 0.45.0 and 0.46.1, as well as by compiling and running the main branch at this point.

Let me know if you need any more info.

asleire added the bug label Nov 4, 2022
asleire (Contributor, Author) commented Nov 4, 2022

You may find this interesting (I did!): When we first noticed this issue in our cluster, we saw that affected OPA containers had a huge drop in memory usage the moment they started making a lot of outgoing HTTP requests.
[screenshot: container memory usage graph]

srenatus (Contributor) commented Nov 4, 2022

Thanks for writing this up.

Caching should work even with concurrent requests :)

Yes, definitely 🙃

So I can't pinpoint this right away; the only thing I notice is that if you go from

	c.mtx.Lock()
	defer c.mtx.Unlock()
	return c.unsafeInsert(k, v)

to

	c.mtx.Lock()
	defer c.mtx.Unlock()
	fmt.Println(time.Now().Second(), "unsafeInsert", c.usage, dropped)
	return c.unsafeInsert(k, v)

you will never see dropped != 0. Maybe it would be interesting to do this instead

	c.mtx.Lock()
	defer c.mtx.Unlock()
	dropped = c.unsafeInsert(k, v)
	fmt.Println(time.Now().Second(), "unsafeInsert", c.usage, dropped)
	return

asleire (Contributor, Author) commented Nov 4, 2022

I think you're looking at the wrong line? Edit: I've updated OP with a permalink just to be safe :)

srenatus (Contributor) commented Nov 4, 2022

So please correct me if I'm wrong but there are two things here to think about, I suppose:

  1. Why does the same request keep getting made in the first place, when it shouldn't be based on the caching logic?
  2. What happens if the inter-query cache is full?

srenatus (Contributor) commented Nov 4, 2022

I think you're looking at the wrong line?

yes I was 😳 sorry about that. please ignore.

asleire (Contributor, Author) commented Nov 4, 2022

  1. Why does the same request keep getting made in the first place, when it shouldn't be based on the caching logic?

I'm guessing there's no locking mechanism on identical http.send requests, so when OPA processes two concurrent evals those two evals will both see that the currently cached item has expired and they both will perform a new http request and insert each response into the cache.

It would be nice if OPA didn't make these concurrent identical requests, but my main issue is that once the cache becomes full, it barely seems to work at all.

srenatus (Contributor) commented Nov 4, 2022

I'm guessing there's no locking mechanism on identical http.send requests, so when OPA processes two concurrent evals those two evals will both see that the currently cached item has expired and they both will perform a new http request and insert each response into the cache.

So you got me thinking -- the two requests shouldn't each be stored in the cache, but only one of them. But looking at the code, that's just half the story:

opa/topdown/cache/cache.go

Lines 141 to 142 in e5bc273

c.items[k.String()] = v
c.l.PushBack(k)

L141 will overwrite, but L142 will append.

srenatus (Contributor) commented Nov 4, 2022

Similarly, once that has happened, the extra copy is never removed: the c.l.Remove() in L136 only removes one copy of it from the list:

c.l.Remove(key)

playground
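For anyone who wants to see this list behaviour in isolation, here is a tiny standalone Go program (plain container/list, not OPA's code): two PushBack calls for the same key leave two elements behind, and a single Remove only takes one of them out again.

package main

import (
	"container/list"
	"fmt"
)

func main() {
	l := list.New()
	first := l.PushBack("cache-key")
	l.PushBack("cache-key") // a second insert of the same key appends a duplicate

	fmt.Println("elements after two inserts:", l.Len()) // 2

	l.Remove(first) // eviction removes a single element
	fmt.Println("elements after one remove:", l.Len()) // 1: a stale copy stays behind
}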

srenatus (Contributor) commented Nov 4, 2022

But none of this should happen in the first place:

opa/topdown/cache/cache.go

Lines 101 to 105 in e5bc273

func (c *cache) Get(k ast.Value) (InterQueryCacheValue, bool) {
c.mtx.Lock()
defer c.mtx.Unlock()
return c.unsafeGet(k)
}

Even for concurrent requests, the mutex lock on the cache should mean that for two concurrent requests of the same key (i.e. http.send args), only one should ever be made 🤔 Where does it go wrong...?

asleire (Contributor, Author) commented Nov 4, 2022

the extra copy is never removed

It should be removed when the cache is full here: https://github.com/open-policy-agent/opa/blob/main/topdown/cache/cache.go#L136

Even for concurrent requests, the mutex lock on the cache should mean that for two concurrent requests of the same key (i.e. http.send args), only one should ever be made

It only prevents them from accessing the cache at the same time, but they will both insert the cache item, which causes the size to increase twice. But when the cache item is eventually removed, the size is only decreased once. ... Right? Did I just figure it out? 😅

srenatus (Contributor) commented Nov 4, 2022

OK I think I see the problem: While each Get and Insert method of the cache is guarded by a mutex, the caller doesn't do that: in topdown/http.go, we first check if the response is cached,

resp, err := c.checkHTTPSendInterQueryCache()

and then we insert the result... Your response just popped up, and yes, I believe you've figured it out 😄
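To make that check-then-insert gap concrete, here is a standalone sketch with toy names (not the actual topdown/http.go or cache.go code): Get and Insert are each guarded by the mutex, but nothing serializes the miss, fetch, insert sequence, so two concurrent evaluations can both miss, and both inserts bump the usage counter.

package main

import (
	"fmt"
	"sync"
	"time"
)

// toyCache mimics the shape of the inter-query cache for illustration only:
// Get and Insert each take the mutex, but callers compose them unsafely.
type toyCache struct {
	mtx   sync.Mutex
	items map[string]string
	usage int
}

func (c *toyCache) Get(k string) (string, bool) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	v, ok := c.items[k]
	return v, ok
}

func (c *toyCache) Insert(k, v string) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	c.items[k] = v
	c.usage += len(v) // counted again when the same key is re-inserted
}

func main() {
	c := &toyCache{items: map[string]string{}}
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ { // two concurrent evaluations of the same policy
		wg.Add(1)
		go func() {
			defer wg.Done()
			if _, hit := c.Get("GET <url1>"); !hit {
				time.Sleep(10 * time.Millisecond) // simulated HTTP round-trip
				c.Insert("GET <url1>", "response-body")
			}
		}()
	}
	wg.Wait()
	fmt.Println("usage:", c.usage) // typically 26 rather than 13: both callers missed
}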

srenatus (Contributor) commented Nov 4, 2022

Now, how will we solve this? 😅

srenatus added this to Backlog in Open Policy Agent via automation Nov 4, 2022
asleire (Contributor, Author) commented Nov 4, 2022

Personally, I'd prefer it if http.send somehow acquired a lock on the cache key before it checks the cache or executes the HTTP request. That way http.send can perform different HTTP requests concurrently while making identical HTTP requests only once, which I'm guessing is the behavior most people would want from forced caching.
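A rough sketch of that idea (illustrative only; the actual fix is in the commit referenced further down): hand out one mutex per cache key and hold it across check, fetch, and insert, so identical concurrent calls collapse into a single outgoing request while different keys still run in parallel.

package main

import (
	"fmt"
	"sync"
)

// perKeyLocks hands out one mutex per cache key.
type perKeyLocks struct {
	mtx   sync.Mutex
	locks map[string]*sync.Mutex
}

func (p *perKeyLocks) get(key string) *sync.Mutex {
	p.mtx.Lock()
	defer p.mtx.Unlock()
	if p.locks[key] == nil {
		p.locks[key] = &sync.Mutex{}
	}
	return p.locks[key]
}

func main() {
	locks := &perKeyLocks{locks: map[string]*sync.Mutex{}}
	cache := map[string]string{}
	outgoing := 0

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // ten concurrent identical "http.send" calls
		wg.Add(1)
		go func() {
			defer wg.Done()
			m := locks.get("GET <url1>")
			m.Lock() // serializes check + fetch + insert for this key only
			defer m.Unlock()
			if _, hit := cache["GET <url1>"]; !hit {
				outgoing++ // only the first caller actually reaches the server
				cache["GET <url1>"] = "response-body"
			}
		}()
	}
	wg.Wait()
	fmt.Println("outgoing requests:", outgoing) // 1
}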

asleire (Contributor, Author) commented Nov 4, 2022

Of course, the very easy fix is to simply make sure we subtract the size of the old value before replacing it and adding the size of the new value, as done in #5361.
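A minimal toy sketch of that accounting idea (not the actual diff in #5361): when a key is overwritten, subtract the old value's contribution before adding the new one, so usage reflects what is actually stored.

package main

import (
	"fmt"
	"sync"
)

// sizedCache is a toy byte-accounting cache, for illustration only.
type sizedCache struct {
	mtx   sync.Mutex
	items map[string]string
	usage int
}

func (c *sizedCache) insert(k, v string) {
	c.mtx.Lock()
	defer c.mtx.Unlock()
	if old, ok := c.items[k]; ok {
		c.usage -= len(old) // drop the old value's contribution first
	}
	c.items[k] = v
	c.usage += len(v)
}

func main() {
	c := &sizedCache{items: map[string]string{}}
	c.insert("GET <url1>", "response-body")
	c.insert("GET <url1>", "response-body") // overwrite: usage stays at 13, not 26
	fmt.Println("usage:", c.usage)
}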

I tested with 10 concurrent requests, and while I'd still prefer multiple HTTP requests not be made, at least this fixes a bad bug 🙂

0 unsafeInsert 6145 0
0 unsafeInsert 6145 0
0 unsafeInsert 10902 0
0 unsafeInsert 10902 0
1 unsafeInsert 6145 0
1 unsafeInsert 6145 0
1 unsafeInsert 6145 0
1 unsafeInsert 10902 0
1 unsafeInsert 10902 0
1 unsafeInsert 10900 0
2 unsafeInsert 10900 0
2 unsafeInsert 6145 0
2 unsafeInsert 6145 0
2 unsafeInsert 6145 0
2 unsafeInsert 6145 0
2 unsafeInsert 6145 0
2 unsafeInsert 10902 0
2 unsafeInsert 10902 0
2 unsafeInsert 10902 0
2 unsafeInsert 10902 0
2 unsafeInsert 10902 0
2 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
3 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10900 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
4 unsafeInsert 10902 0
5 unsafeInsert 10901 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
5 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
6 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10901 0
7 unsafeInsert 10902 0
7 unsafeInsert 10901 0
7 unsafeInsert 10901 0
7 unsafeInsert 10901 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
7 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
8 unsafeInsert 10902 0
9 unsafeInsert 10901 0
9 unsafeInsert 10901 0
9 unsafeInsert 10901 0
9 unsafeInsert 10901 0
9 unsafeInsert 10902 0
9 unsafeInsert 10902 0
9 unsafeInsert 10902 0
9 unsafeInsert 10902 0
9 unsafeInsert 10902 0
9 unsafeInsert 10902 0
9 unsafeInsert 10902 0
9 unsafeInsert 10902 0
9 unsafeInsert 10902 0
9 unsafeInsert 10902 0
9 unsafeInsert 10901 0
9 unsafeInsert 10902 0
9 unsafeInsert 10901 0
9 unsafeInsert 10902 0
9 unsafeInsert 10901 0
9 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10901 0
10 unsafeInsert 10901 0
10 unsafeInsert 10901 0
10 unsafeInsert 10901 0
10 unsafeInsert 10901 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
10 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10901 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0
11 unsafeInsert 10902 0

anderseknert (Member) commented Nov 7, 2022

@asleire thanks for an excellent bug report! We'll look into it 👍

EDIT: Oh, I saw you've made a PR already. Great!

ashutosh-narkar (Member) commented

Let's try to fix this by investigating whether we can lock the whole check-cache, execute-HTTP-request, insert-into-cache sequence. I can look into this.

ashutosh-narkar added a commit to ashutosh-narkar/opa that referenced this issue Nov 10, 2022
This change adds a lock on per key cache access to handle
cases where concurrent queries could result in the same key
being unnecessarily overwritten multiple times and fetched from the
server even though a fresh entry existed in the cache.

Fixes: open-policy-agent#5359

Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
ashutosh-narkar pushed a commit to asleire/opa that referenced this issue Nov 22, 2022
Signed-off-by: Aleksander <Alekken@live.no>
Open Policy Agent automation moved this from Backlog to Done Nov 24, 2022
runes83 commented Dec 4, 2022

When will there be a new release containing this fix?

srenatus (Contributor) commented Dec 4, 2022

Tomorrow. 🤞
