
fatal error: concurrent map iteration and map write #5868

Closed
jewee opened this issue Apr 26, 2023 · 19 comments · Fixed by #6081
Labels
bug, investigating

Comments

@jewee

jewee commented Apr 26, 2023

Short description

We are running OPA version 0.48.0 in Kubernetes, and occasionally pods restart
with the following error:

fatal error: concurrent map iteration and map write

goroutine 23576821 [running]:
github.com/open-policy-agent/opa/bundle.compileModules(0xc1fcf4e140, {0x15a63a0, 0xc2ab0cea60}, 0x7f?, 0x1597040?, 0x0)
	/src/bundle/store.go:766 +0x19d
github.com/open-policy-agent/opa/bundle.activateBundles(0xc0040a4848)
	/src/bundle/store.go:453 +0x7b5

Currently this happens to roughly one pod every 2-3 days.

I have attached the complete panic log below.

Expected behavior

OPA should not panic and exit.

fatal_error.log

@jewee jewee added the bug label Apr 26, 2023
@anderseknert
Member

Thanks for reporting that @jewee! Sure looks like a bug to me. We'll look into it.

@anderseknert anderseknert added this to Backlog in Open Policy Agent via automation Apr 26, 2023
@ashutosh-narkar
Member

@jewee thanks for reporting. Can you please try OPA 0.51.0? Also, it would be helpful if you could provide some steps to reproduce the issue.

@jewee
Author

jewee commented May 10, 2023

We have recently upgraded to version 0.50.2 and still see this behavior (a restart every two days or so).
I am trying to find a way to reproduce the problem.
We will also start testing on the latest version, 0.52.0.

@scarlier

scarlier commented May 22, 2023

We seem to have the same error, "fatal error: concurrent map iteration and map write"; I'm not sure if it is related. Otherwise, I can create a new ticket.

Running in AWS Fargate with OPA version 0.52.0

fatal error: concurrent map read and map write
goroutine 253983 [running]:
net/textproto.MIMEHeader.Get(0xc0004f5260, {0x1e00c87?, 0xc000073800?})
	/usr/local/go/src/net/textproto/header.go:34 +0x45
net/http.Header.Get(...)
	/usr/local/go/src/net/http/header.go:50

dump.txt

@anderseknert
Member

Thanks @scarlier 👍 That's useful for prioritization.

@ashutosh-narkar
Member

@scarlier you seem to be using the http.send builtin in your policy. Can you share the policy? Also, how is OPA queried in your case? Are there concurrent requests being made to OPA? Any additional details you can provide to reproduce the issue would be helpful. Thanks!

@scarlier

@ashutosh-narkar

This is our function that uses http.send:

fetch_jwks(token, jwks_urls) = jwks {
  url := jwks_urls[opa.runtime().env.APPLICATION_ENVIRONMENT_NAME]

  jwt_unverified := io.jwt.decode(token)
  jwt_header := jwt_unverified[0]

  jwks := http.send({
    "url": concat("?", [url, urlquery.encode_object({"kid": jwt_header.kid})]),
    "method": "GET",
    "force_cache": true,
    "caching_mode": "deserialized",
    "force_cache_duration_seconds": 3600
  }).body

  jwt_header.kid == jwks.keys[_].kid
}

I must say that we did notice some strange behavior: the cache does not seem to work. At least, it works for around an hour and then we see the URL getting hit a lot; in normal use it would be a few calls each hour, but then we see peak usage of around 350 rpm while the policy is called at a rate of around 700 rpm.

We call the policy with url http://localhost:8181/v1/data/pcsauth/authenticated with the following input

  Input-Token:
    type: object
    properties:
      input:
        $ref: '#/components/schemas/Token'
  Token:
    type: object
    properties:
      token:
        type: string
        description: JWT token
      ip:
        type: string
        description: ip-address of the client
        example: 192.0.0.1
      service:
        type: string
        description: The service name.
        example: example-service
      resource:
        type: string
        description: The HTTP request path or other identifier should be used to specify the resource being accessed, to ensure proper identification and authorization.
        example: /very-secret/endpoint
    required: ["token", "ip", "service", "resource"]

OPA is running in two Fargate containers behind an NLB, with 2048 CPU units per container.

@ashutosh-narkar
Member

Thanks for sharing this @scarlier. Just wanted to point out that since you're using caching_mode: deserialized, an unlimited cache size will be assumed. So do you see the panic during peak time, i.e. when OPA is handling a bunch of requests?

we did notice some strange behavior: the cache does not seem to work. At least, it works for around an hour and then we see the URL getting hit a lot

Is this for the same request being made to OPA multiple times w/o any changes to any of the http.send request parameters?

@scarlier
Copy link

scarlier commented May 25, 2023

Yes, I am aware of the caching mode. It was a test to see if things would improve, but memory does not seem to be the problem, as the container is only using 10% of its memory.

We don't really have large peaks in requests; traffic varies within about 100 rpm. But it seems to start giving issues around 30-60 minutes of runtime after a fresh reboot. Then we notice calls being made that we expected to be served from the cache. Note that the container does not crash at this point, but it seems to (partly?) ignore the cache functionality of http.send. After a while (randomly?) the container crashes with a fatal error; this happens around 6 times a day.

The request is the same for all calls, since the kid, which is the only parameter we have, does not change that often.

We removed the fetch_jwks function from our code and stored the contents of the URL inline in the policy, and we did not have a single crash after a day of running. So the issue has to be inside that function.

@ashutosh-narkar ashutosh-narkar added the requires-investigation label May 25, 2023
@ashutosh-narkar ashutosh-narkar moved this from Backlog to Planning - v0.54 in Open Policy Agent May 25, 2023
@ashutosh-narkar ashutosh-narkar moved this from Planning - v0.54 to In Progress in Open Policy Agent Jun 1, 2023
@ashutosh-narkar ashutosh-narkar added the investigating label and removed the requires-investigation label Jun 1, 2023
@ashutosh-narkar
Member

@jewee on the error you're seeing fatal error: concurrent map iteration and map write:

You mentioned you're running OPA in Kubernetes and from the logs you sent there seems to be a k8s livenessProbe configured to query OPA's /health endpoint. Is this correct? If yes, how frequently are these checks run and also how are you using OPA in this environment?

From the logs, it looks like there are 2 concurrent health checks. Each of them will eventually access the modules on the compiler, which are stored in a map, and that results in the panic. So it would be helpful to understand how such a situation can occur.
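
For context on why this kills the process rather than surfacing as a recoverable error: the Go runtime deliberately aborts on unsynchronized map access. A minimal, standalone sketch (not OPA's code, just the same class of failure) where one goroutine iterates a map while another writes to it:

package main

// Minimal, standalone sketch (not OPA's code) of the failure class reported
// above: the Go runtime aborts the whole process with
// "fatal error: concurrent map iteration and map write" when one goroutine
// ranges over a map while another writes to it without synchronization.
func main() {
	modules := map[string]string{"a": "package a", "b": "package b"}

	// Writer: stands in for a request that (re)compiles and stores modules
	// in the shared map.
	go func() {
		for {
			modules["c"] = "package c"
		}
	}()

	// Iterator: stands in for a second, concurrent health check walking the
	// same compiler module map at the same time.
	for {
		for range modules {
		}
	}
}

Unlike an ordinary panic, this fault cannot be caught with recover(), which is why the pod exits and restarts.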

@ashutosh-narkar
Member

ashutosh-narkar commented Jun 8, 2023

@scarlier in your case it looks like there are two concurrent http.send requests, both of which are trying to retrieve a cached value from the inter-query cache. In the process they both access the HTTP header map, resulting in the fatal error: concurrent map read and map write error. So your case has a different cause than the fatal error: concurrent map iteration and map write error described above. More specifically, per the log there is a read operation here

dateHeader := headers.Get("date")

and write operation on this line

respHeaders[strings.ToLower(headerName)] = respValues

Could strings.ToLower mutate the input in some case, and would that be considered a write? That's a possibility, maybe.
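
To make the failure mode concrete, here is a minimal, standalone sketch (not OPA's actual code path) of the same class of race: one goroutine reads a shared header map while another writes a key into it, and the Go runtime aborts the process:

package main

import "net/http"

// Minimal, standalone sketch (not OPA's actual code path): one goroutine
// reads a shared header map (as the cache-hit path does when it looks up the
// "date" header) while another writes a key into the same map. The Go runtime
// reports this as "fatal error: concurrent map read and map write".
func main() {
	headers := http.Header{"Date": {"Thu, 08 Jun 2023 00:00:00 GMT"}}

	// Writer: stands in for the line that stores lowercased response headers.
	go func() {
		for {
			headers["x-cache"] = []string{"hit"}
		}
	}()

	// Reader: stands in for dateHeader := headers.Get("date").
	for {
		_ = headers.Get("Date")
	}
}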

@srenatus
Contributor

srenatus commented Jun 8, 2023

Could strings.ToLower mutate the input in some case, and would that be considered a write? That's a possibility, maybe.

The write operation is the creation of the map itself, I think.

@ashutosh-narkar
Member

The write operation is the creation of the map itself, I think.

Yeah, strings.ToLower performing mutation did not seem convincing. Do you mean this map creation is the write operation, or are you referring to something else? If yes, then the resolution is to either sync access to the headers map or we'll have to clone the header map somewhere else in the code path?

@srenatus
Contributor

srenatus commented Jun 8, 2023

Do you mean this map creation is the write operation, or are you referring to something else?

I guess setting the key to some value is the write operation, but that's a technicality. So, yes.

If yes, then the resolution is to either sync access to the headers map or we'll have to clone the header map somewhere else in the code path?

I suppose so, yup.
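
A rough sketch of the two options discussed, using hypothetical names rather than OPA's actual types: either serialize access to the shared header map with a mutex, or clone the map before it is shared so the cached copy is never written to again:

package main

import (
	"fmt"
	"net/http"
	"sync"
)

// Option 1: serialize every read and write of the shared header map.
type cachedHeaders struct {
	mu sync.Mutex
	h  http.Header
}

func (c *cachedHeaders) get(name string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.h.Get(name)
}

func (c *cachedHeaders) set(name, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.h.Set(name, value)
}

// Option 2: clone the header map before it is shared, so the copy handed to
// other goroutines is never written to again (http.Header.Clone is a deep copy).
func snapshot(h http.Header) http.Header {
	return h.Clone()
}

func main() {
	c := &cachedHeaders{h: http.Header{}}
	c.set("Date", "Thu, 08 Jun 2023 00:00:00 GMT")
	fmt.Println(c.get("Date"))

	shared := snapshot(c.h)
	fmt.Println(shared.Get("Date"))
}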

@jewee
Author

jewee commented Jun 12, 2023

@jewee on the error you're seeing fatal error: concurrent map iteration and map write:

You mentioned you're running OPA in Kubernetes and from the logs you sent there seems to be a k8s livenessProbe configured to query OPA's /health endpoint. Is this correct? If yes, how frequently are these checks run and also how are you using OPA in this environment?

From the logs, it looks like there are 2 concurrent health checks. Each of them will eventually access the modules on the compiler, which are stored in a map, and that results in the panic. So it would be helpful to understand how such a situation can occur.

@anderseknert we have both liveness and readiness probes configured to work with OPA's /health endpoint in the following way:

Liveness: http-get http://:8182/health delay=5s timeout=1s period=5s #success=1 #failure=3
Readiness: http-get http://:8182/health%3Fbundle=true delay=5s timeout=1s period=5s #success=1 #failure=3

Are there any recommendations on how to configure the k8s probes so as to avoid this panic risk?

@anderseknert
Member

I'm not aware of any way that would prevent this, although sending less frequent pings might help. But really, this bug should just be fixed, and I'm sure it will be in time for the next release.

@jewee
Author

jewee commented Jun 15, 2023

I have changed the liveness and readiness probe definitions to run less frequently and not on the same schedule, to minimize the moments when both are executing at the same time. Restarts have not happened in more than 40 hours now, so there is something here for sure.

@ashutosh-narkar
Member

@jewee as mentioned here, the checks result in a race on the compiler's modules, which leads to a panic. We're currently investigating this.

@ashutosh-narkar
Member

@scarlier we've merged this change which should resolve the issue you reported.

ashutosh-narkar added a commit to ashutosh-narkar/opa that referenced this issue Jul 7, 2023
Currently we parse store modules irrespective of whether
there are modules on the Rego object. This results in
the compilation of those modules, which triggers the bundle
activation flow. As part of the module compilation we
interact with the compiler's modules and run compilation
on the input modules. If, say, there are concurrent health
check requests (i.e. /v1/health), this can result in a race
during the compilation process while working with the
compiler's modules.

This change avoids this situation by skipping parsing of the store
modules when none are set on the Rego object. The assumption this
change makes is that while using the rego package the compiler and
store are kept in sync.

Fixes: open-policy-agent#5868

Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
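
A hypothetical sketch of the guard the commit message describes; the names below are illustrative only, not OPA's real internals:

package main

import "fmt"

// Sketch of the idea: skip parsing and compiling the modules already in the
// store when the caller supplied no modules of its own, so read-only queries
// (such as health checks) never touch the compiler's shared module map.
func modulesToCompile(callerModules, storeModules map[string]string) map[string]string {
	if len(callerModules) == 0 {
		// Nothing new to compile: assume the compiler and store are already
		// in sync and leave the compiler's modules untouched.
		return nil
	}
	// Otherwise merge the store modules with the caller's modules for compilation.
	merged := make(map[string]string, len(storeModules)+len(callerModules))
	for name, src := range storeModules {
		merged[name] = src
	}
	for name, src := range callerModules {
		merged[name] = src
	}
	return merged
}

func main() {
	store := map[string]string{"authz.rego": "package authz"}

	// A health-check style query with no caller modules: compilation is skipped.
	fmt.Println(modulesToCompile(nil, store))

	// A query that supplies its own module: store modules are parsed and merged.
	fmt.Println(modulesToCompile(map[string]string{"extra.rego": "package extra"}, store))
}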
ashutosh-narkar added a commit to ashutosh-narkar/opa that referenced this issue Jul 7, 2023
Open Policy Agent automation moved this from In Progress to Done Jul 7, 2023
ashutosh-narkar added a commit that referenced this issue Jul 7, 2023