fatal error: concurrent map iteration and map write #5868
Comments
Thanks for reporting that @jewee! Sure looks like a bug to me. We'll look into it.
@jewee thanks for reporting. Can you please try OPA …
We have recently upgraded to version …
We seem to be hitting the same error, "fatal error: concurrent map iteration and map write". Not sure if it is related; otherwise I can create a new ticket. We are running OPA version 0.52.0 in AWS Fargate.
Thanks @scarlier 👍 That's useful for prioritization.
@scarlier you seem to be using the …
This is our function for the http.send: …
I must say we did notice some strange behavior where the cache does not work. At least, it works for around an hour, and then we see the URL getting hit a lot; under normal use it would be a few calls each hour, but then we see peak usage of around 350 rpm while the policy is called at a rate of around 700 rpm. We call the policy with URL …
OPA is running in 2 Fargate containers behind an NLB, with 2048 CPU units on each container.
Thanks for sharing this @scarlier. Just wanted to point out that since you're using …
Is this for the same request being made to OPA multiple times w/o any changes to any of the …
Yes, I am aware of caching mode. It was a test to see if things would improve. Memory does not seem to be a problem, as the container is only using 10% of its memory. We don't really have large peaks in requests, and the rate varies within 100 rpm. But it seems to start giving issues around 30-60 minutes of runtime after a fresh reboot. Then we notice that calls are being made that we expected to be served from cache. Note that the container does not crash at this point, but it seems to (partly?) ignore the cache functionality of http.send; after a while (seemingly at random) the container crashes with a fatal error, and this happens around 6 times a day. The request is the same for all calls, since the kid does not change that often, and it is the only parameter we have. We removed the fetch_jwks function from our code and stored the contents of the URL inline in the policy, and we did not have a single crash after a day of running. So the issue has to be inside that function.
@jewee on the error you're seeing: you mentioned you're running OPA in Kubernetes, and from the logs you sent there seems to be a k8s health check configured. From the logs, it looks like there are 2 concurrent health checks. Each of them will eventually access the modules on the compiler, which are stored in a map, and that results in the panic. So it would be helpful to understand how such a situation can occur.
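For context, the failure mode in the issue title is a Go runtime check rather than an ordinary data race report. The standalone snippet below (purely illustrative, not OPA code) reproduces the same class of crash: one goroutine iterates over a map while another writes to it, analogous to two concurrent health check requests touching the compiler's module map.

```go
package main

import "time"

func main() {
	// Stand-in for a shared map such as the compiler's modules.
	modules := map[string]string{"a": "package a"}

	// Writer goroutine: simulates one request mutating the shared map.
	go func() {
		for {
			modules["b"] = "package b"
		}
	}()

	// Reader goroutine: simulates a concurrent request iterating the same map.
	go func() {
		for {
			for range modules {
			}
		}
	}()

	// With no synchronization, the Go runtime typically aborts with
	// "fatal error: concurrent map iteration and map write".
	time.Sleep(time.Second)
}
```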
@scarlier in your case it looks like there are two concurrent operations: a read on this line, dateHeader := headers.Get("date"), and a write operation on this line, respHeaders[strings.ToLower(headerName)] = respValues. Could …
The write operation is the creation of the map itself, I think.
Yeah …
I suppose so, yup. |
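As a general illustration of how this kind of shared header map can be made safe in Go (this is only a sketch of the pattern, not the actual change made in OPA), guarding the map with a sync.RWMutex, or finishing building the map before sharing it, avoids the runtime fatal. All names below are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// safeHeaders is a hypothetical wrapper used only to show the locking pattern.
type safeHeaders struct {
	mu sync.RWMutex
	m  map[string][]string
}

func (h *safeHeaders) set(name string, values []string) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.m[strings.ToLower(name)] = values
}

func (h *safeHeaders) get(name string) []string {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.m[strings.ToLower(name)]
}

func main() {
	h := &safeHeaders{m: map[string][]string{}}

	var wg sync.WaitGroup
	wg.Add(2)
	// The concurrent write and read are now serialized by the mutex.
	go func() { defer wg.Done(); h.set("Date", []string{"Mon, 01 Jan 2024 00:00:00 GMT"}) }()
	go func() { defer wg.Done(); _ = h.get("date") }()
	wg.Wait()

	fmt.Println(h.get("date"))
}
```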
@anderseknert we have both …
Are there any recommendations on how to configure k8s probes so as to avoid this panic risk?
I'm not aware of any way that'd prevent this, although sending less frequent pings might help. But really this bug should just be fixed, and I'm sure it will be, in time for the next release.
I have changed the definition of …
Currently we parse store modules irrespective of whether there are modules set on the Rego object. This results in the compilation of those modules, which triggers the bundle activation flow. As part of module compilation we interact with the compiler's modules and run compilation on the input modules. If there are concurrent health check requests (i.e. /v1/health), this can result in a race during the compilation process while working with the compiler's modules.

This change avoids that situation by skipping parsing of the store modules when none are set on the Rego object. The assumption this change makes is that while using the rego package the compiler and store are kept in sync.

Fixes: #5868

Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
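To make the shape of that change concrete, here is a rough sketch with hypothetical types and names (they do not correspond to the real rego package internals): store modules are only gathered for compilation when modules were actually set on the Rego object, so a plain health check query no longer re-parses or re-compiles anything.

```go
package main

import (
	"context"
	"fmt"
)

// rego is a stand-in for the Rego object; the fields and method below are
// illustrative only and do not match open-policy-agent/opa's actual API.
type rego struct {
	modules      []string // modules set directly on the Rego object
	storeModules []string // modules already persisted in the store
}

// modulesToCompile sketches the guard from the commit message: skip the store
// modules entirely when the caller set none on the Rego object, assuming the
// compiler and store are already in sync.
func (r *rego) modulesToCompile(ctx context.Context) []string {
	if len(r.modules) == 0 {
		// Nothing to (re)compile, so no concurrent access to the compiler's module map.
		return nil
	}
	return append(r.storeModules, r.modules...)
}

func main() {
	healthCheck := &rego{storeModules: []string{"bundle/policy.rego"}}
	fmt.Println(healthCheck.modulesToCompile(context.Background())) // [] for a plain health check
}
```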
Short description
We are running OPA Version: 0.48.0 in Kubernetes and occasionally we get pods restarting
with the following error:
Currently this happens to roughly one pod every 2-3 days.
I have attached the complete panic log below.
Expected behavior
OPA should not panic and exit.
fatal_error.log