
fatal error: concurrent map iteration and map write #5868

Closed
jewee opened this issue Apr 26, 2023 · 19 comments · Fixed by #6081
Labels
bug, investigating

Comments

@jewee

jewee commented Apr 26, 2023

Short description

We are running OPA version 0.48.0 in Kubernetes, and occasionally pods restart
with the following error:

fatal error: concurrent map iteration and map write

goroutine 23576821 [running]:
github.com/open-policy-agent/opa/bundle.compileModules(0xc1fcf4e140, {0x15a63a0, 0xc2ab0cea60}, 0x7f?, 0x1597040?, 0x0)
	/src/bundle/store.go:766 +0x19d
github.com/open-policy-agent/opa/bundle.activateBundles(0xc0040a4848)
	/src/bundle/store.go:453 +0x7b5

Currently this happens to roughly one pod every 2-3 days.

I have attached the complete panic log below.

Expected behavior

OPA should not panic and exit.

fatal_error.log

@jewee jewee added the bug label Apr 26, 2023
@anderseknert
Member

Thanks for reporting that @jewee! Sure looks like a bug to me. We'll look into it.

@anderseknert anderseknert added this to Backlog in Open Policy Agent via automation Apr 26, 2023
@ashutosh-narkar
Member

@jewee thanks for reporting. Can you please try OPA 0.51.0? Also, it would be helpful if you could provide some steps to reproduce the issue.

@jewee
Author

jewee commented May 10, 2023

We have recently upgraded to version 0.50.2 and still see this behavior (a restart every two days or so).
I am trying to find a way to reproduce the problem.
We will also start testing on the latest version, 0.52.0.

@scarlier

scarlier commented May 22, 2023

We seem to have the same error, "fatal error: concurrent map iteration and map write"; I'm not sure if it is related. Otherwise, I can create a new ticket.

Running in AWS Fargate with OPA version 0.52.0

fatal error: concurrent map read and map write
goroutine 253983 [running]:
net/textproto.MIMEHeader.Get(0xc0004f5260, {0x1e00c87?, 0xc000073800?})
	/usr/local/go/src/net/textproto/header.go:34 +0x45
net/http.Header.Get(...)
	/usr/local/go/src/net/http/header.go:50

dump.txt

@anderseknert
Member

Thanks @scarlier 👍 That's useful for prioritization.

@ashutosh-narkar
Member

@scarlier you seem to be using the http.send builtin in your policy. Can you share the policy? Also, how is OPA queried in your case? Are there concurrent requests being made to OPA? Any additional details you can provide to reproduce the issue would be helpful. Thanks!

@scarlier

@ashutosh-narkar

This is our function that uses http.send:

fetch_jwks(token, jwks_urls) = jwks {
  url := jwks_urls[opa.runtime().env.APPLICATION_ENVIRONMENT_NAME]

  jwt_unverified := io.jwt.decode(token)
  jwt_header := jwt_unverified[0]

  jwks := http.send({
    "url": concat("?", [url, urlquery.encode_object({"kid": jwt_header.kid})]),
    "method": "GET",
    "force_cache": true,
    "caching_mode": "deserialized",
    "force_cache_duration_seconds": 3600
  }).body

  jwt_header.kid == jwks.keys[_].kid
}

I must say that we did notice some strange behavior: the cache does not seem to work. At least, it works for around an hour and then we see the URL getting hit a lot; in normal use it would be a few calls each hour, but then we see peak usage of around 350 rpm while the policy is called at a rate of around 700 rpm.

We call the policy with url http://localhost:8181/v1/data/pcsauth/authenticated with the following input

  Input-Token:
    type: object
    properties:
      input:
        $ref: '#/components/schemas/Token'
  Token:
    type: object
    properties:
      token:
        type: string
        description: JWT token
      ip:
        type: string
        description: ip-address of the client
        example: 192.0.0.1
      service:
        type: string
        description: The service name.
        example: example-service
      resource:
        type: string
        description: The HTTP request path or other identifier should be used to specify the resource being accessed, to ensure proper identification and authorization.
        example: /very-secret/endpoint
    required: ["token", "ip", "service", "resource"]

OPA is running in two Fargate containers behind an NLB, with 2048 CPU units per container.

@ashutosh-narkar
Member

Thanks for sharing this @scarlier. Just wanted to point out that since you're using caching_mode: deserialized, an unlimited cache size will be assumed. So do you see the panic during peak time, i.e. when OPA is handling a bunch of requests?

we did notice some strange behavior: the cache does not seem to work. At least, it works for around an hour and then we see the URL getting hit a lot

Is this for the same request being made to OPA multiple times w/o any changes to any of the http.send request parameters?

@scarlier
Copy link

scarlier commented May 25, 2023

Yes, I am aware of the caching mode. It was a test to see if things would improve, but memory does not seem to be the problem, as the container is only using 10% of its memory.

We don't really have large peaks in requests; traffic varies within about 100 rpm. But it seems to start giving issues around 30-60 minutes of runtime after a fresh reboot. Then we notice calls being made that we expected to be served from the cache. Note that the container does not crash at this point, but it seems to (partly?) ignore the cache functionality of http.send. After a while (randomly?) the container crashes with a fatal error; this happens around 6 times a day.

The request is the same for all calls, since the kid, which is the only parameter we have, does not change that often.

We removed the fetch_jwks function from our code and stored the contents of the URL inline in the policy, and we did not have a single crash after a day of running. So the issue has to be inside that function.

@ashutosh-narkar ashutosh-narkar added the requires-investigation label May 25, 2023
@ashutosh-narkar ashutosh-narkar moved this from Backlog to Planning - v0.54 in Open Policy Agent May 25, 2023
@ashutosh-narkar ashutosh-narkar moved this from Planning - v0.54 to In Progress in Open Policy Agent Jun 1, 2023
@ashutosh-narkar ashutosh-narkar added the investigating label and removed the requires-investigation label Jun 1, 2023
@ashutosh-narkar
Member

@jewee on the error you're seeing fatal error: concurrent map iteration and map write:

You mentioned you're running OPA in Kubernetes and from the logs you sent there seems to be a k8s livenessProbe configured to query OPA's /health endpoint. Is this correct? If yes, how frequently are these checks run and also how are you using OPA in this environment?

From the logs, it looks like there are 2 concurrent health checks. Each of them will eventually access the modules on the compiler, which are stored in a map, and that results in the panic. So it would be helpful to understand how such a situation can occur.
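
For context on why this kills the process rather than surfacing as a recoverable error: the Go runtime deliberately aborts on unsynchronized map access. A minimal, standalone sketch (not OPA's code, just the same class of failure) where one goroutine iterates a map while another writes to it:

package main

// Minimal, standalone sketch (not OPA's code) of the failure class reported
// above: the Go runtime aborts the whole process with
// "fatal error: concurrent map iteration and map write" when one goroutine
// ranges over a map while another writes to it without synchronization.
func main() {
	modules := map[string]string{"a": "package a", "b": "package b"}

	// Writer: stands in for a request that (re)compiles and stores modules
	// in the shared map.
	go func() {
		for {
			modules["c"] = "package c"
		}
	}()

	// Iterator: stands in for a second, concurrent health check walking the
	// same compiler module map at the same time.
	for {
		for range modules {
		}
	}
}

Unlike an ordinary panic, this fault cannot be caught with recover(), which is why the pod exits and restarts.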

@ashutosh-narkar
Member

ashutosh-narkar commented Jun 8, 2023

@scarlier in your case it looks like there are two concurrent http.send requests, both of which are trying to retrieve a cached value from the inter-query cache. In the process they both access the HTTP header map, resulting in the fatal error: concurrent map read and map write error. So your case has a different cause than the fatal error: concurrent map iteration and map write error described above. More specifically, per the log there is a read operation here

dateHeader := headers.Get("date")

and write operation on this line

respHeaders[strings.ToLower(headerName)] = respValues

Could strings.ToLower mutate the input in some case, and would that be considered a write? That's a possibility, maybe.
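
To make the failure mode concrete, here is a minimal, standalone sketch (not OPA's actual code path) of the same class of race: one goroutine reads a shared header map while another writes a key into it, and the Go runtime aborts the process:

package main

import "net/http"

// Minimal, standalone sketch (not OPA's actual code path): one goroutine
// reads a shared header map (as the cache-hit path does when it looks up the
// "date" header) while another writes a key into the same map. The Go runtime
// reports this as "fatal error: concurrent map read and map write".
func main() {
	headers := http.Header{"Date": {"Thu, 08 Jun 2023 00:00:00 GMT"}}

	// Writer: stands in for the line that stores lowercased response headers.
	go func() {
		for {
			headers["x-cache"] = []string{"hit"}
		}
	}()

	// Reader: stands in for dateHeader := headers.Get("date").
	for {
		_ = headers.Get("Date")
	}
}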

@srenatus
Contributor

srenatus commented Jun 8, 2023

Could strings.ToLower mutate the input in some case, and would that be considered a write? That's a possibility, maybe.

The write operation is the creation of the map itself, I think.

@ashutosh-narkar
Member

The write operation is the creation of the map itself, I think.

Yeah, strings.ToLower performing mutation did not seem convincing. Do you mean this map creation is the write operation, or are you referring to something else? If yes, then the resolution is to either sync access to the headers map or we'll have to clone the header map somewhere else in the code path?

@srenatus
Contributor

srenatus commented Jun 8, 2023

Do you mean this map creation is the write operation, or are you referring to something else?

I guess setting the key to some value is the write operation, but that's a technicality. So, yes.

If yes, then the resolution is to either sync access to the headers map or we'll have to clone the header map somewhere else in the code path?

I suppose so, yup.
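
A rough sketch of the two options discussed, using hypothetical names rather than OPA's actual types: either serialize access to the shared header map with a mutex, or clone the map before it is shared so the cached copy is never written to again:

package main

import (
	"fmt"
	"net/http"
	"sync"
)

// Option 1: serialize every read and write of the shared header map.
type cachedHeaders struct {
	mu sync.Mutex
	h  http.Header
}

func (c *cachedHeaders) get(name string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.h.Get(name)
}

func (c *cachedHeaders) set(name, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.h.Set(name, value)
}

// Option 2: clone the header map before it is shared, so the copy handed to
// other goroutines is never written to again (http.Header.Clone is a deep copy).
func snapshot(h http.Header) http.Header {
	return h.Clone()
}

func main() {
	c := &cachedHeaders{h: http.Header{}}
	c.set("Date", "Thu, 08 Jun 2023 00:00:00 GMT")
	fmt.Println(c.get("Date"))

	shared := snapshot(c.h)
	fmt.Println(shared.Get("Date"))
}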

@jewee
Author

jewee commented Jun 12, 2023

@jewee on the error you're seeing fatal error: concurrent map iteration and map write:

You mentioned you're running OPA in Kubernetes and from the logs you sent there seems to be a k8s livenessProbe configured to query OPA's /health endpoint. Is this correct? If yes, how frequently are these checks run and also how are you using OPA in this environment?

From the logs, it looks like there are 2 concurrent health checks. Each of them will eventually access the modules on the compiler, which are stored in a map, and that results in the panic. So it would be helpful to understand how such a situation can occur.

@anderseknert we have both liveness and readiness probes configured to work with OPA's /health endpoint in the following way:

Liveness: http-get http://:8182/health delay=5s timeout=1s period=5s #success=1 #failure=3
Readiness: http-get http://:8182/health%3Fbundle=true delay=5s timeout=1s period=5s #success=1 #failure=3

Are there any recommendations on how to configure the k8s probes so as to avoid this panic risk?

@anderseknert
Member

I'm not aware of any way that would prevent this, although sending less frequent pings might help. But really, this bug should just be fixed, and I'm sure it will be in time for the next release.

@jewee
Author

jewee commented Jun 15, 2023

I have changed the liveness and readiness probe definitions to run less frequently and not on the same schedule, to minimize the moments when both are executing at the same time. Restarts have not happened in more than 40 hours now, so there is something here for sure.

@ashutosh-narkar
Member

@jewee as mentioned here, the checks result in a race on the compiler's modules, which leads to a panic. We're currently investigating this.

@ashutosh-narkar
Member

@scarlier we've merged this change which should resolve the issue you reported.

ashutosh-narkar added a commit to ashutosh-narkar/opa that referenced this issue Jul 7, 2023
Currently we parse store modules irrespective of whether
there are modules on the Rego object. This results in
the compilation of those modules, which triggers the bundle
activation flow. As part of the module compilation we
interact with the compiler's modules and run compilation
on the input modules. If, say, there are concurrent health
check requests (i.e. /v1/health), this can result in a race
during the compilation process while working with the
compiler's modules.

This change avoids this situation by skipping parsing of the store
modules when none are set on the Rego object. The assumption this
change makes is that while using the rego package the compiler and
store are kept in sync.

Fixes: open-policy-agent#5868

Signed-off-by: Ashutosh Narkar <anarkar4387@gmail.com>
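
A hypothetical sketch of the guard the commit message describes; the names below are illustrative only, not OPA's real internals:

package main

import "fmt"

// Sketch of the idea: skip parsing and compiling the modules already in the
// store when the caller supplied no modules of its own, so read-only queries
// (such as health checks) never touch the compiler's shared module map.
func modulesToCompile(callerModules, storeModules map[string]string) map[string]string {
	if len(callerModules) == 0 {
		// Nothing new to compile: assume the compiler and store are already
		// in sync and leave the compiler's modules untouched.
		return nil
	}
	// Otherwise merge the store modules with the caller's modules for compilation.
	merged := make(map[string]string, len(storeModules)+len(callerModules))
	for name, src := range storeModules {
		merged[name] = src
	}
	for name, src := range callerModules {
		merged[name] = src
	}
	return merged
}

func main() {
	store := map[string]string{"authz.rego": "package authz"}

	// A health-check style query with no caller modules: compilation is skipped.
	fmt.Println(modulesToCompile(nil, store))

	// A query that supplies its own module: store modules are parsed and merged.
	fmt.Println(modulesToCompile(map[string]string{"extra.rego": "package extra"}, store))
}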
ashutosh-narkar added a commit to ashutosh-narkar/opa that referenced this issue Jul 7, 2023
Open Policy Agent automation moved this from In Progress to Done Jul 7, 2023
ashutosh-narkar added a commit that referenced this issue Jul 7, 2023