pillar: Release CPUs on domain activation failure. #3952

OhmSpectator
Member

This commit addresses an issue where CPUs assigned to a domain within doActivate() are not released if the domain activation fails. The new logic ensures that CPUs are properly released and the CPU mask in the status is updated accordingly. This is achieved by introducing the releaseCPUs function and calling it in the appropriate error handling blocks within doActivate().

It is common for doActivate to fail in scenarios such as switching application profiles that share the same adapter. In such cases, the second application will fail to activate until the first one releases the necessary adapter.

@OhmSpectator
Member Author

A strange failure on the PR build:

+ echo docker run -i --rm -u runner -w /go/src/github.com/lf-edge/eve/pkg/pillar -v /home/runner/actions-runner/_work/eve/eve/.go:/go:z -v /home/runner/actions-runner/_work/eve/eve/pkg/pillar:/go/src/github.com/lf-edge/eve/pkg/pillar:z -v /home/runner/actions-runner/_work/eve/eve/build-tools/bin:/go/bin:z -v /home/runner/actions-runner/_work/eve/eve/:/eve:z -v /home/runner:/home/runner:z -e GOOS -e GOARCH -e CGO_ENABLED -e BUILD=local eve-build-runner bash --noprofile --norc -c "unset GOFLAGS; rm -rf /tmp/linuxkit && git clone https://github.com/linuxkit/linuxkit.git /tmp/linuxkit && cd /tmp/linuxkit && git checkout e6b0ae05eb3a2b99e84d9ffc03a3a5c9c3e7e371 && if [ -e /eve/tools/linuxkit/patches ]; then     patch -p1 < /eve/tools/linuxkit/patches/*.patch; fi && cd /tmp/linuxkit/src/cmd/linuxkit && GO111MODULE=on CGO_ENABLED=0 go build -o /go/bin/linuxkit -mod=vendor . && cd && rm -rf /tmp/linuxkit"
+ docker run -i --rm -u runner -w /go/src/github.com/lf-edge/eve/pkg/pillar -v /home/runner/actions-runner/_work/eve/eve/.go:/go:z -v /home/runner/actions-runner/_work/eve/eve/pkg/pillar:/go/src/github.com/lf-edge/eve/pkg/pillar:z -v /home/runner/actions-runner/_work/eve/eve/build-tools/bin:/go/bin:z -v /home/runner/actions-runner/_work/eve/eve/:/eve:z -v /home/runner:/home/runner:z -e GOOS -e GOARCH -e CGO_ENABLED -e BUILD=local eve-build-runner bash --noprofile --norc -c unset GOFLAGS; rm -rf /tmp/linuxkit && git clone https://github.com/linuxkit/linuxkit.git /tmp/linuxkit && cd /tmp/linuxkit && git checkout e6b0ae05eb3a2b99e84d9ffc03a3a5c9c3e7e371 && if [ -e /eve/tools/linuxkit/patches ]; then     patch -p1 < /eve/tools/linuxkit/patches/*.patch; fi && cd /tmp/linuxkit/src/cmd/linuxkit && GO111MODULE=on CGO_ENABLED=0 go build -o /go/bin/linuxkit -mod=vendor . && cd && rm -rf /tmp/linuxkit
docker run -i --rm -u runner -w /go/src/github.com/lf-edge/eve/pkg/pillar -v /home/runner/actions-runner/_work/eve/eve/.go:/go:z -v /home/runner/actions-runner/_work/eve/eve/pkg/pillar:/go/src/github.com/lf-edge/eve/pkg/pillar:z -v /home/runner/actions-runner/_work/eve/eve/build-tools/bin:/go/bin:z -v /home/runner/actions-runner/_work/eve/eve/:/eve:z -v /home/runner:/home/runner:z -e GOOS -e GOARCH -e CGO_ENABLED -e BUILD=local eve-build-runner bash --noprofile --norc -c "unset GOFLAGS; rm -rf /tmp/linuxkit && git clone https://github.com/linuxkit/linuxkit.git /tmp/linuxkit && cd /tmp/linuxkit && git checkout e6b0ae05eb3a2b99e84d9ffc03a3a5c9c3e7e371 && if [ -e /eve/tools/linuxkit/patches ]; then     patch -p1 < /eve/tools/linuxkit/patches/*.patch; fi && cd /tmp/linuxkit/src/cmd/linuxkit && GO111MODULE=on CGO_ENABLED=0 go build -o /go/bin/linuxkit -mod=vendor . && cd && rm -rf /tmp/linuxkit"
Cloning into '/tmp/linuxkit'...
error: 264 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

@uncleDecart, is it runner-related?...

@OhmSpectator
Member Author

Once merged, the PR is to be backported into:

  • 12.0
  • 11.0-stable
  • 10.4-stable
  • 9.4-stable

@OhmSpectator force-pushed the bugfix/ev-1136-release-cpus-on-unsuccessful-activate branch from bc23738 to 56ede2c on May 28, 2024 16:53
@eriknordmark added the bug (Something isn't working) and stable (Should be backported to stable release(s)) labels on May 28, 2024
This commit addresses an issue where CPUs assigned to a domain within
doActivate() are not released if the domain activation fails. The new
logic ensures that CPUs are properly released and the CPU mask in the
status is updated accordingly. This is achieved by introducing the
releaseCPUs function and calling it in the appropriate error handling
blocks within doActivate().

It is common for doActivate to fail in scenarios such as switching
application profiles that share the same adapter. In such cases, the
second application will fail to activate until the first one releases
the necessary adapter.

Signed-off-by: Nikolay Martyanov <nikolay@zededa.com>
@OhmSpectator force-pushed the bugfix/ev-1136-release-cpus-on-unsuccessful-activate branch from 56ede2c to e5ebb10 on May 29, 2024 10:38
Contributor

@eriknordmark left a comment

LGTM

@OhmSpectator
Member Author

Not sure why the TMP tests fail...

/home/runner/actions-runner/_work/eve/eve/eden/dist/bin/eden sdn fwd eth0 2223 -- ssh -o ConnectTimeout=10 -o StrictHostKeyChecking=no -i /home/runner/actions-runner/_work/eve/eve/eden/dist/tests/eclient/image/cert/id_rsa root@FWD_IP -p FWD_PORT grep -q "before_restart" /etc/injected_file.txt
        Try 1
        time="2024-05-29T14:43:05Z" level=fatal msg="command ssh failed: exit status 255"
        Try 2
        time="2024-05-29T14:43:17Z" level=fatal msg="command ssh failed: exit status 255"
        Try 3
        time="2024-05-29T14:43:29Z" level=fatal msg="command ssh failed: exit status 255"
        Try 4
        [stderr]
        Connection timed out during banner exchange
        Connection to 127.0.0.1 port 2223 timed out
        Connection timed out during banner exchange
        Connection to 127.0.0.1 port 2223 timed out
        Connection timed out during banner exchange
        Connection to 127.0.0.1 port 2223 timed out
        [context deadline exceeded]
        FAIL: ../eclient/testdata/userdata.txt:41: command failure
