Prlimit Unit Test fails on s390x #3341

Closed
Davo911 opened this issue Jul 10, 2024 · 14 comments · Fixed by #3349

@Davo911
Contributor

Davo911 commented Jul 10, 2024

What happened:
In #3226 it was discovered that prlimit_test.go fails on s390x when using a specific Go version. The Go version was then bumped to 1.22.3 and the issue was closed, but the problem is not yet resolved.

• [FAILED] [0.010 seconds]
Process Limits exec [It] command success with real limits
/root/go/src/kubevirt.io/containerized-data-importer/pkg/system/prlimit_test.go:62

  [FAILED] Expected
      <string>: unexpected fault address 0x180000
       fatal error: fault
       [signal SIGSEGV: segmentation violation code=0x2 addr=0x180000 pc=0x18002a]
       goroutine 1 gp=0xc0000021c0 m=0 mp=0x8241a0 [running, locked to thread]:
       runtime.throw({0x4edfa2...

There are currently 2 possible fixes for this:

  1. omitting the -coverprofile=.coverprofile flag from the go-test command
    test_command="env OPERATOR_DIR=${CDI_DIR} go test -v -coverprofile=.coverprofile -test.timeout 180m ${pkgs} ${test_args:+-args $test_args}"
  2. increasing the memory limit from 1GB (1 << 30) to 2GB (1 << 31)
    Entry("command success with real limits", fakeCommandContext, nil, &ProcessLimitValues{1 << 30, 10}, "faker", "", "", "0", "", ""),

The second option seems to be the correct one, but I am not sure, since the two fixes do not seem related to each other.


What you expected to happen:
Unit tests succeeding

How to reproduce it (as minimally and precisely as possible):
On an s390x LPAR:

  1. clone main
  2. export BUILDER_IMAGE=<ported builder image for s390x>
  3. Either
    a). go test -coverprofile=.coverprofile ./pkg/system/...
    b). ./hack/build/bazel-docker.sh ./hack/build/run-unit-tests.sh ./pkg/system/...

Environment:

  • Go version: 1.22.3
  • OS (e.g. from /etc/os-release): RHEL 9.4
  • Kernel (e.g. uname -a): 5.14.0-427.22.1.el9_4.s390x
@akalenyu
Collaborator

akalenyu commented Jul 10, 2024

  • CDI version : v1.58.1

So this branch of CDI is using older golang, maybe that is the problem?

@Davo911
Contributor Author

Davo911 commented Jul 11, 2024

This was tested with Go 1.22.3 and 1.22.4.
The original issue (#3226) also required omitting the coverprofile flag as a workaround.

@akalenyu
Collaborator

The unit test run is containerized; this is why I am implying the branch matters.

@Davo911
Contributor Author

Davo911 commented Jul 11, 2024

Yep, thanks for pointing that out.
I noted the CDI version of the instance deployed on my cluster by mistake, but the branch I am on when running the unit tests here is the latest (1.59).

@akalenyu
Collaborator

Go 1.22 only made it to main, so it's not in 1.59

@Davo911
Contributor Author

Davo911 commented Jul 11, 2024

Oh, I meant main.
What could be the arguments against increasing that limit to 2GB, at least in an architecture-dependent way?
As far as I understand it, this 1GB stems from a recommendation made at OpenStack to prevent the allocation of an arbitrary amount of memory when using qemu-img.
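
For illustration only (not from the thread): the cap in question is an RLIMIT_AS-style address-space limit on the process running qemu-img. Below is a minimal sketch of how such a cap can be set via the prlimit syscall using golang.org/x/sys/unix; the capAddressSpace helper and the 1 GiB value are illustrative assumptions, not the actual CDI code.

    // Minimal sketch, not the CDI implementation.
    package main

    import (
        "fmt"
        "os"

        "golang.org/x/sys/unix"
    )

    // capAddressSpace caps the virtual address space (RLIMIT_AS) of the given
    // process; a hypothetical helper for illustration.
    func capAddressSpace(pid int, limitBytes uint64) error {
        rlim := unix.Rlimit{Cur: limitBytes, Max: limitBytes}
        return unix.Prlimit(pid, unix.RLIMIT_AS, &rlim, nil)
    }

    func main() {
        // Illustrative: cap this process's own address space at 1 GiB (1 << 30).
        if err := capAddressSpace(os.Getpid(), 1<<30); err != nil {
            fmt.Fprintln(os.Stderr, "prlimit failed:", err)
        }
    }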

@akalenyu
Collaborator

Yeah, that sounds about right, but ideally I would like to understand what exactly it is that s390x doesn't like in this particular case. Is there an illegitimate request in there? It seems completely fine to me.

@akalenyu
Collaborator

We discussed this in the sig-storage call; what we can do as a quick fix is to bump the memory limit to 2Gi
depending on the arch (a rough sketch of that follows below).

However, ideally we would like to understand specifically why this fails on s390x.
We have some ideas: maybe the coverage tracking is really expensive and badly optimized for s390x,
or maybe we were always close to exhausting the memory in the unit tests.
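
Purely as an illustration of that quick fix (the limitForArch helper is hypothetical and not necessarily what #3349 does): the table entry's address-space limit could be picked per GOARCH, for example:

    // Hypothetical helper for the test file (needs "runtime" imported):
    // keep the current 1 GiB limit everywhere, bump to 2 GiB on s390x.
    func limitForArch() uint64 {
        if runtime.GOARCH == "s390x" {
            return 1 << 31 // 2 GiB
        }
        return 1 << 30 // 1 GiB
    }

    // The existing entry would then read:
    Entry("command success with real limits", fakeCommandContext, nil,
        &ProcessLimitValues{limitForArch(), 10}, "faker", "", "", "0", "", ""),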

@Davo911
Contributor Author

Davo911 commented Jul 21, 2024

I haven't yet been successful in troubleshooting the issue through memory profiling or valgrind, but I found a value where the test gets flaky:
at a limit of 1509000000, the test succeeds roughly 40-50% of the time.

@Davo911
Contributor Author

Davo911 commented Aug 1, 2024

Additionally, the same error can happen under amd64: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_containerized-data-importer/3349/pull-cdi-unit-test/1818300000342380544#1:build-log.txt%3A1206-1240

Further investigation is necessary here.

@akalenyu
Collaborator

akalenyu commented Aug 4, 2024

Additionally, the same error can happen under amd64: https://prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_containerized-data-importer/3349/pull-cdi-unit-test/1818300000342380544#1:build-log.txt%3A1206-1240

Further investigation is necessary here.

Interesting... I would run the unit tests locally and follow the memory consumption of the container,
but if I'm not mistaken this simply means that the 8Gi limit on the CI job for unit tests does not cut it,
which is surprising to me even if this unit test alone is taking 1Gi.

@Davo911
Contributor Author

Davo911 commented Aug 5, 2024

Additionally, the same error can happen under amd64: prow.ci.kubevirt.io/view/gs/kubevirt-prow/pr-logs/pull/kubevirt_containerized-data-importer/3349/pull-cdi-unit-test/1818300000342380544#1:build-log.txt:1206-1240

Further investigation is necessary here.

Small correction here to pull back a little:
The one coming up on that flaky amd64 run is not the same error as for the original s390x one.

<string>: fatal error: runtime: cannot allocate memory
vs
<string>: unexpected fault address 0x180000\nfatal error: fault\n[signal SIGSEGV: segmentation violation

The first one can easily be reproduced by lowering the limit to 1<<29 (512 MiB) or below.

@akalenyu
Collaborator

Yeah, tbh today I noticed the unit test container spike over 4Gi even using docker stats.
As I see it, runtime: cannot allocate memory is not really about exhausting the available memory;
I think in that case we would have seen a segfault.

@akalenyu
Collaborator

So I have been chasing a similar issue recently... apparently we are wrong to be using RLIMIT_AS in the first place to try to limit Go programs, which are notorious for high virtual memory size.
golang/go#69105 (comment)

I am assuming the bit that makes it worse for your GOARCH is somewhere around the mmap here:
https://cs.opensource.google/go/go/+/master:src/runtime/malloc.go;l=508-529;bpv=1
You could just go build the cmd/cdi-importer app on both arches and use htop to get a general idea of the virtual memory size.
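
As an aside (my addition, not from the thread): the gap between virtual and resident memory can also be seen without htop by comparing Go's own heap stats with the kernel's VmSize/VmRSS accounting. A minimal sketch, assuming a Linux /proc filesystem:

    // Minimal sketch: print Go heap stats next to the kernel's view of this
    // process's virtual (VmSize) and resident (VmRSS) memory. Go reserves large
    // virtual arenas up front, so VmSize far exceeds the memory actually in use,
    // which is what makes an RLIMIT_AS cap easy to trip.
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "runtime"
        "strings"
    )

    func main() {
        var ms runtime.MemStats
        runtime.ReadMemStats(&ms)
        fmt.Printf("HeapAlloc: %d KiB, Sys: %d KiB\n", ms.HeapAlloc/1024, ms.Sys/1024)

        f, err := os.Open("/proc/self/status")
        if err != nil {
            return // /proc not available (non-Linux)
        }
        defer f.Close()

        sc := bufio.NewScanner(f)
        for sc.Scan() {
            line := sc.Text()
            if strings.HasPrefix(line, "VmSize") || strings.HasPrefix(line, "VmRSS") {
                fmt.Println(line)
            }
        }
    }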
