
go build/test: 400% perf regression with Go 1.20 when triggering via os/exec in Docker container #59146

Closed
junedev opened this issue Mar 20, 2023 · 9 comments
Labels
FrozenDueToAge · NeedsInvestigation · Performance · WaitingForInfo
Milestone
Backlog

Comments

junedev commented Mar 20, 2023

What version of Go are you using (go version)?

$ go version
go version go1.20.2 linux/amd64

Does this issue reproduce with the latest release?

yes, afaik 1.20.2 is the latest release

What operating system and processor architecture are you using (go env)?

go env Output
Output from inside the Docker container:

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.2"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="0"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -fno-caret-diagnostics -Qunused-arguments -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3917626451=/tmp/go-build -gno-record-gcc-switches"

What did you do?

(The issue came up in a more complex production scenario, I'm describing the minimal setup to reproduce that we could boil it down to.)

I have this small program that executes go test and prints the time it took:

package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	start := time.Now()

	cmd := exec.Command("go", "test")
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}

	fmt.Println(time.Since(start))
}

I run it in a Docker container with the following Dockerfile:

FROM golang:1.20.2-alpine3.17

RUN adduser --disabled-password appuser

COPY . .

WORKDIR app
RUN go build -o main

USER appuser

CMD ["./main"]

Full runnable example here: https://github.com/junedev/testrunner-issue-repro
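For reference, a sketch of building and running the image; the tag matches the one used in the timing transcripts below, the rest is just the standard Docker workflow:

$ docker build -t test-runner-minimal:1.20.2 .
$ docker run test-runner-minimal:1.20.2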

What did you expect to see?

This exact same setup with Go 1.19 (golang:1.19.7-alpine3.17) reports that the program took ~0.7s. We expected similar performance for Go 1.20.

What did you see instead?

With the Go 1.20 image, the program consistently reports that it took >3s to run.
We are seeing this performance issue on various local machines, as well as on our production infrastructure.

Some other things we found out so far:

  • The performance regression only appears when using the new non-root user to execute the binary (which is often recommended for security reasons).
  • We speculated that the issue is related to the cgo handling changes that happened as part of 1.20, but we were not able to mitigate the issue by using an image that includes gcc (e.g. golang:1.20.2) and setting CGO_ENABLED.
  • Other commands like ls or go version do not seem to be affected by this issue.
  • By looking at the stdout result of the command with some more code, we double-checked that go test is indeed executed just fine. It is just very slow.
  • The regression is fully mitigated by making the new user the owner of the directory that contains the code via RUN chown -R appuser:appuser app. We nevertheless wanted to report this in case there is some underlying issue that needs fixing, and as a reference for others who run into it.

(Some background info on how we ran into this: On Exercism, students can learn Go and other languages for free; we run tests on the code they submit and show the results on the website. We have a 10s timeout for running the tests. With Go 1.19 the tests took around 4s. After the update to 1.20, our students started complaining that they see timeouts all the time, and we saw the tests now took ~20s to run.)

@bcmills bcmills added the Performance and NeedsInvestigation labels Mar 20, 2023
@bcmills bcmills added this to the Backlog milestone Mar 20, 2023

bcmills commented Mar 20, 2023

Can you narrow down where the time is being spent? Is it before the go test command is started, or somewhere within go test?

@bcmills bcmills added the WaitingForInfo label Mar 20, 2023
@andrerfcsantos

@bcmills We confirmed the extra time spent came from the cmd.Run() call. This happens both in the minimal reproduction example and in exercism/go-test-runner where we initially discovered this issue.

For instance, changing the example above to also time exec.Command():

package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	startCommand := time.Now()
	cmd := exec.Command("go", "test")
	fmt.Printf("exec.Command() took %v\n", time.Since(startCommand))

	startRun := time.Now()
	err := cmd.Run()
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("cmd.Run() took %v\n", time.Since(startRun))
}

Using the Dockerfile in the post (golang:1.20.2-alpine3.17, no chown), we get:

$ docker run test-runner-minimal:1.20.2
exec.Command() took 427.488µs
cmd.Run() took 7.874369137s

$ docker run test-runner-minimal:1.20.2
exec.Command() took 104.56µs
cmd.Run() took 5.086919508s

Using the exact same setup, but using golang:1.19.7-alpine3.17:

$ docker run test-runner-minimal:1.19.7
exec.Command() took 123.863µs
cmd.Run() took 1.095058008s

$ docker run test-runner-minimal:1.19.7
exec.Command() took 106.118µs
cmd.Run() took 1.143650535s

As mentioned above, to get similar timings with golang:1.20.2-alpine3.17, we can make appuser the owner of the folder. So, changing the Dockerfile to use Go 1.20.2 again, but running chown before switching users:

FROM golang:1.20.2-alpine3.17

RUN adduser --disabled-password appuser
COPY . .
RUN chown -R appuser:appuser app
USER appuser
WORKDIR app
RUN go build -o main

CMD ["./main"]

We get similar results to Go 1.19:

$ docker run test-runner-minimal:1.20.2-permissions
exec.Command() took 145.058µs
cmd.Run() took 990.321249ms

$ docker run test-runner-minimal:1.20.2-permissions
exec.Command() took 82.266µs
cmd.Run() took 1.13887994s

We did a similar print-based profiling in our test runner, and all the timings look similar to 1.19, except the cmd.Run() call which got massively slower unless we apply the chown command.

It's also worth mentioning that we could reproduce this problem in non-Alpine images, namely Buster and Bullseye. We also experimented with different Alpine versions, but what seems to make a difference is the Go version.


bcmills commented Mar 20, 2023

(*os/exec.Cmd).Run is a huge amount of code. Is the delay happening before the subprocess is started, or after? Is the CPU use during that time in the go test process itself, or in the parent process?
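One minimal way to split that up, sketched here as an illustration (only the go test argument comes from the original example), is to time cmd.Start() and cmd.Wait() separately instead of calling cmd.Run():

package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"
)

func main() {
	cmd := exec.Command("go", "test")

	// Time spent forking/execing the child process.
	startStart := time.Now()
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cmd.Start() took %v\n", time.Since(startStart))

	// Time spent while the child process runs to completion.
	startWait := time.Now()
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("cmd.Wait() took %v\n", time.Since(startWait))
}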

@bcmills bcmills added WaitingForInfo and removed WaitingForInfo labels Mar 20, 2023
@andrerfcsantos

I modified the example program to generate some traces (let us know if there's a better way to see this) and according to the blocking profile, it seems the delay is due to the time spent in os.(*Process).blockUntilWaitable, which comes after the process is started.

Blocking profile graph (1.19): [image]

Full trace file: trace_fast_19.out.zip

Blocking profile graph (1.20): [image]

Full trace file: trace_slow_20.out.zip
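The instrumentation used to produce these traces isn't shown; a minimal sketch using runtime/trace (the trace.out file name is an assumption) would look like this:

package main

import (
	"log"
	"os"
	"os/exec"
	"runtime/trace"
)

func main() {
	// Record an execution trace of the parent process to trace.out.
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	if err := exec.Command("go", "test").Run(); err != nil {
		log.Fatal(err)
	}
}

The resulting file can then be inspected with go tool trace trace.out.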


bcmills commented Mar 21, 2023

Ok, so that's pretty solid evidence that the CPU time is not being spent in os/exec in the parent process. Is the CPU usage in the go command itself, in the compiler or linker, or in the test process? (Most likely it's the test process, which you could investigate using the test profiling flags.)
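For reference, the test profiling flags mentioned here are typically invoked along these lines (the exact set of flags to use is a judgment call):

$ go test -cpuprofile cpu.prof -memprofile mem.prof
$ go tool pprof cpu.prof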

@bcmills bcmills added WaitingForInfo and removed WaitingForInfo labels Mar 21, 2023
@Merovius (Contributor)

There are no tests in the example repo, so AIUI the only process where time could be spent is the go tool itself.


Merovius commented Mar 21, 2023

The regression is fully mitigated by making the new user owner of the directory that contains the code via RUN chown -R appuser:appuser app.

I can't actually reproduce this. That is, even if I add this to the Dockerfile (before the WORKDIR app directive) the run still takes ~3s for me.

I modified the test program to run go build -v -o /dev/null instead of go test (as there are no tests anyways) and set cmd.Stdout = os.Stdout (and similar for Stderr). That shows that Go 1.20 has to rebuild large parts of the standard library, which Go 1.19 doesn't. Which seems to be WAI. So, I assume this is where the observed difference is coming from. It is surprising to me that it would disappear if you change ownership of the source directory, but I can't reproduce that anyways, so…?
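A minimal sketch of that modified test program, assuming the rest of the setup stays as in the original example:

package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	// Build verbosely instead of running tests, and forward the tool's
	// output so the list of rebuilt packages becomes visible.
	cmd := exec.Command("go", "build", "-v", "-o", "/dev/null")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		log.Fatal(err)
	}
}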

I'm not sure how to avoid that. I tried adding RUN go build -v std in the Dockerfile, hoping that would populate the build cache, but it doesn't help (presumably because aspects of the go.mod file are used in the build ID?). You might be able to mount a volume into the docker container that is used for the build cache, to at least amortize the cost over multiple runs. Apart from that, I'm not sure what to do.

But I'm afraid this issue is WAI.
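A sketch of the volume-mount idea mentioned above; the volume name and the choice to point GOCACHE at the mount are assumptions, and the mounted directory must be writable by appuser:

$ docker volume create gocache
$ docker run -e GOCACHE=/gocache -v gocache:/gocache test-runner-minimal:1.20.2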


Merovius commented Mar 21, 2023

Actually, this works:

FROM golang:1.20.2-alpine3.17

RUN adduser --disabled-password appuser
USER appuser
RUN go build -v std

ADD --chown=appuser:appuser . .
WORKDIR app
RUN go build -v -o main

CMD ["./main"]

This:

  1. Populates the build cache by running go build -v std
  2. Has to switch to the appuser before doing that, otherwise root's build cache is populated, which is pointless
  3. So it has to add the source code as the right user, otherwise go build main complains.
  4. Is sequenced so the layer with the build cache is deeper down than the layer with the source code, to speed up container builds

You can probably take it from there :)

@andrerfcsantos

@Merovius Thanks a lot for taking a look at this. Those conclusions make perfect sense. In our original case, we also had an ENV GOCACHE=/tmp line in the Dockerfile just before the container's entrypoint. We figured that removing that line was also necessary to fix the issue, and that would explain why.

In our case, we are also doing some other calls to go build before the entrypoint. We figured that switching users before those calls also helped, and this being a build cache issue also explains why that worked.

@junedev junedev closed this as not planned Won't fix, can't repro, duplicate, stale Mar 22, 2023
@junedev junedev changed the title os/exec: 400% perf regression with Go 1.20 when triggering "go test" in Docker container go build/test: 400% perf regression with Go 1.20 when triggering via os/exec in Docker container Mar 22, 2023
@golang golang locked and limited conversation to collaborators Mar 21, 2024