New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1964591: Remove AI Agent image in case of service failure #1836
Conversation
@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
1 similar comment
@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @flaper87 |
/bugzilla refresh |
@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
1 similar comment
@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest |
@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
With the change systemd unit looks as follows
Note that we are using
|
Could you try restarting the agent? Just want to do a sanity check :) |
Is it the only image that we pull? |
Those are all the images I have on the worker during the installation
|
I did it as a manual test, i.e.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the thorough tests. I have one comment that I think we should tackle before merging this PR.
internal/ignition/ignition.go
Outdated
@@ -139,7 +139,7 @@ const discoveryIgnitionConfigFormat = `{ | |||
"units": [{ | |||
"name": "agent.service", | |||
"enabled": true, | |||
"contents": "[Service]\nType=simple\nRestart=always\nRestartSec=3\nStartLimitInterval=0\nEnvironment=HTTP_PROXY={{.HTTPProxy}}\nEnvironment=http_proxy={{.HTTPProxy}}\nEnvironment=HTTPS_PROXY={{.HTTPSProxy}}\nEnvironment=https_proxy={{.HTTPSProxy}}\nEnvironment=NO_PROXY={{.NoProxy}}\nEnvironment=no_proxy={{.NoProxy}}{{if .PullSecretToken}}\nEnvironment=PULL_SECRET_TOKEN={{.PullSecretToken}}{{end}}\nTimeoutStartSec={{.AgentTimeoutStartSec}}\nExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin {{.AgentDockerImg}} cp /usr/bin/agent /hostbin\nExecStart=/usr/local/bin/agent --url {{.ServiceBaseURL}} --cluster-id {{.clusterId}} --agent-version {{.AgentDockerImg}} --insecure={{.SkipCertVerification}} {{if .HostCACertPath}}--cacert {{.HostCACertPath}}{{end}}\n\n[Unit]\nWants=network-online.target\nAfter=network-online.target\n\n[Install]\nWantedBy=multi-user.target" | |||
"contents": "[Service]\nType=simple\nRestart=always\nRestartSec=3\nStartLimitInterval=0\nEnvironment=HTTP_PROXY={{.HTTPProxy}}\nEnvironment=http_proxy={{.HTTPProxy}}\nEnvironment=HTTPS_PROXY={{.HTTPSProxy}}\nEnvironment=https_proxy={{.HTTPSProxy}}\nEnvironment=NO_PROXY={{.NoProxy}}\nEnvironment=no_proxy={{.NoProxy}}{{if .PullSecretToken}}\nEnvironment=PULL_SECRET_TOKEN={{.PullSecretToken}}{{end}}\nTimeoutStartSec={{.AgentTimeoutStartSec}}\nExecStartPre=-podman rmi --force {{.AgentDockerImg}}\nExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin {{.AgentDockerImg}} cp /usr/bin/agent /hostbin\nExecStart=/usr/local/bin/agent --url {{.ServiceBaseURL}} --cluster-id {{.clusterId}} --agent-version {{.AgentDockerImg}} --insecure={{.SkipCertVerification}} {{if .HostCACertPath}}--cacert {{.HostCACertPath}}{{end}}\n\n[Unit]\nWants=network-online.target\nAfter=network-online.target\n\n[Install]\nWantedBy=multi-user.target" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that I think about it a bit further, I think we should check if the image exists first.
In an environment with many concurrent nodes being deployed, this could add an extra load to the local (or remote) registry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it makes sense (however we are handling a failure scenario here, so extra load would mean that there were multiple machines with a corrupted podman storage).
I added a change so that we will check if the image exists and in case of any errors we will delete it. So just restarting agent.service
without breaking podman storage does not cause now the image to be deleted and pulled again.
@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
I slightly reorganized stuff here and as discussed in the comment above, restarting Currently the
Starting
|
/bugzilla refresh Recalculating validity in case the underlying Bugzilla bug has changed. |
@openshift-bot: This pull request references Bugzilla bug 1964591, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh |
@flaper87: This pull request references Bugzilla bug 1964591, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh Recalculating validity in case the underlying Bugzilla bug has changed. |
@openshift-bot: This pull request references Bugzilla bug 1964591, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh |
@mkowalski: This pull request references Bugzilla bug 1964591, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Bugzilla (yobshans@redhat.com), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/retest |
/retest Please review the full test history for this PR and help us cut down flakes. |
10 similar comments
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
@mkowalski: All pull requests linked via external trackers have merged: Bugzilla bug 1964591 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
…ift#1836) This PR adds a handler for a failure scenario of `agent.service` which removes the `assisted-installer-agent` container image. This is a workaround for an issue where symlinks in `/var/lib/containers/` are corrupted. Deleting an image in `ExecStartPre` means that every time agent.service starts we make sure the image is available. If it's the very first attempt to start `agent.service`, then the the image will be pulled as it would be in any other scenario. Any consecutive attempt to start `agent.service` will first check if the image is present and in case of errors will remove it so that it can be pulled again. We are not using the `OnFailure` directive because the unit defined there would only be started once all the restarts attempts are exhausted which is not a desired workflow in this scenario here. Closes: OCPBUGSM-29583
/cherry-pick ocm-2.3 Backport BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1971630 |
@mkowalski: new pull request created: #2031 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This PR adds a handler for a failure scenario of
agent.service
whichremoves the
assisted-installer-agent
container image.This is a workaround for an issue where symlinks in
/var/lib/containers/
are corrupted. Deleting an image in
ExecStartPre
means that every timeagent.service starts we make sure the image is available. If it's the
very first attempt to start
agent.service
, then the the image will bepulled as it would be in any other scenario. Any consecutive attempt to
start
agent.service
will first check if the image is present and incase of errors will remove it so that it can be pulled again.
We are not using the
OnFailure
directive because the unit definedthere would only be started once all the restarts attempts are exhausted
which is not a desired workflow in this scenario here.
Closes: OCPBUGSM-29583