
Bug 1964591: Remove AI Agent image in case of service failure #1836

Merged
merged 1 commit into openshift:master from bz1964591 on Jun 1, 2021

Conversation

mkowalski
Contributor

@mkowalski mkowalski commented May 26, 2021

This PR adds a handler for a failure scenario of agent.service which
removes the assisted-installer-agent container image.

This is a workaround for an issue where symlinks in /var/lib/containers/
are corrupted. Deleting an image in ExecStartPre means that every time
agent.service starts we make sure the image is available. If it's the
very first attempt to start agent.service, then the image will be
pulled as it would be in any other scenario. Any subsequent attempt to
start agent.service will first check if the image is present and, in
case of errors, will remove it so that it can be pulled again.

We are not using the OnFailure directive because the unit defined
there would only be started once all the restart attempts are exhausted,
which is not the desired workflow in this scenario.

Closes: OCPBUGSM-29583
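
In practice the fix boils down to an extra ExecStartPre hook that runs before the existing image pull; a minimal sketch of the final shape (the rendered unit and the exact script invocation are shown further down in this thread, the image reference is the one used in the dev environment):

[Service]
# New pre-start hook: check the local copy of the agent image and remove it
# if podman reports an error (e.g. corrupted storage under /var/lib/containers/),
# so the following pre-start pull fetches a fresh copy.
ExecStartPre=/usr/local/bin/agent-fix-bz1964591 quay.io/ocpmetal/assisted-installer-agent:latest
# Existing pre-start step: pull the image (if needed) and copy the agent binary out.
ExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin quay.io/ocpmetal/assisted-installer-agent:latest cp /usr/bin/agent /hostbin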

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 26, 2021
@openshift-ci

openshift-ci bot commented May 26, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

[WIP] Bug 1964591: Remove AI Agent image in case of service failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@openshift-ci openshift-ci bot requested review from eranco74 and ybettan May 26, 2021 11:24
@flaper87
Contributor

/assign @flaper87

@mkowalski mkowalski changed the title [WIP] Bug 1964591: Remove AI Agent image in case of service failure Bug 1964591: Remove AI Agent image in case of service failure May 26, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2021
@mkowalski
Contributor Author

/bugzilla refresh

@openshift-ci

openshift-ci bot commented May 26, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci

openshift-ci bot commented May 26, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1964591: Remove AI Agent image in case of service failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@mkowalski
Contributor Author

/retest

@openshift-ci

openshift-ci bot commented May 27, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1964591: Remove AI Agent image in case of service failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Contributor Author

mkowalski commented May 27, 2021

With this change, the systemd unit looks as follows:

[core@worker-0 ~]$ cat /etc/systemd/system/agent.service
[Service]
Type=simple
Restart=always
RestartSec=3
StartLimitInterval=0
Environment=HTTP_PROXY=
Environment=http_proxy=
Environment=HTTPS_PROXY=
Environment=https_proxy=
Environment=NO_PROXY=
Environment=no_proxy=
Environment=PULL_SECRET_TOKEN=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiN2MwNWI4YTMtNWRjMC00YjcwLWE0MDktOGYxNDk0OGVkZjJlIn0.ReyzNdQKiyvU7uiF7IhjRXc5ORzeInIZ3KmYYB-zOWGFqkU3vnJfw6U-EY5ypGEP6VK0wy_SZ2lDkuVR0K0oIA
TimeoutStartSec=180
ExecStartPre=-podman rmi --force quay.io/ocpmetal/assisted-installer-agent:latest
ExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin quay.io/ocpmetal/assisted-installer-agent:latest cp /usr/bin/agent /hostbin
ExecStart=/usr/local/bin/agent --url https://assisted-service-assisted-installer.apps.ostest.test.metalkube.org --cluster-id 7c05b8a3-5dc0-4b70-a409-8f14948edf2e --agent-version quay.io/ocpmetal/assisted-installer-agent:latest --insecure=false  --cacert /etc/assisted-service/service-ca-cert.crt

[Unit]
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

Note that we are using the - prefix in order to ignore the exit status of podman rmi. This is because on the very first run of agent.service there is no image to be removed yet, so podman rmi fails for a valid reason (see the sketch after the status output below).

[core@worker-0 ~]$ systemctl status agent.service 
● agent.service
   Loaded: loaded (/etc/systemd/system/agent.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-05-27 09:18:02 UTC; 32min ago
  Process: 1634 ExecStartPre=/usr/bin/podman run --privileged --rm -v /usr/local/bin:/hostbin quay.io/ocpmetal/assisted-installer-agent:latest cp /usr/bin/agent /hostbin (code=exited, status=0/SUCCESS)
  Process: 1565 ExecStartPre=/usr/bin/podman rmi --force quay.io/ocpmetal/assisted-installer-agent:latest (code=exited, status=1/FAILURE)
[...]
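
For reference, the - prefix on an ExecStartPre line tells systemd to record a non-zero exit status but treat it as success; in shell terms the removal step behaves roughly like this (image reference as used above):

# Rough shell equivalent of "ExecStartPre=-podman rmi --force IMAGE":
# the removal is attempted, but a failure (e.g. nothing to remove yet)
# does not abort the unit start-up.
podman rmi --force quay.io/ocpmetal/assisted-installer-agent:latest || true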

@flaper87
Contributor

Note that we are using the - prefix in order to ignore the exit status of podman rmi. This is because on the very first run of agent.service there is no image to be removed yet, so podman rmi fails for a valid reason.

Could you try restarting the agent? Just want to do a sanity check :)

@romfreiman
Contributor

Is it the only image that we pull?

@mkowalski
Contributor Author

Is it the only image that we pull?

Those are all the images I have on the worker during the installation

REPOSITORY                                      TAG     IMAGE ID      CREATED       SIZE
quay.io/ocpmetal/assisted-installer-agent       latest  f39447ddaf2f  6 hours ago   828 MB
quay.io/ocpmetal/assisted-installer             latest  8c10045f04c8  13 hours ago  273 MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev  <none>  147d9596fcf6  2 weeks ago   438 MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev  <none>  87bff3c8ae43  2 weeks ago   391 MB

@mkowalski
Contributor Author

Could you try restarting the agent? Just want to do a sanity check :)

I did it as a manual test (roughly sketched below), i.e.

  • started installation of the SNO
  • during the runtime of agent.service, manually killed the spawned processes (kill -9 XXX)
  • systemd noticed the failure and restarted agent.service
  • the podman image got removed
  • during the startup of agent.service the quay.io/ocpmetal/assisted-installer-agent image got pulled again
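
Roughly, the commands behind that check looked like this (illustrative; PIDs and exact output vary):

# kill the running agent processes to simulate a failure
pkill -9 -f /usr/local/bin/agent
# systemd notices the failure and restarts the unit after RestartSec
systemctl status agent.service
# confirm the image was removed and then pulled again during start-up
podman images quay.io/ocpmetal/assisted-installer-agent
journalctl -u agent.service --no-pager | tail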

Contributor

@flaper87 flaper87 left a comment

Thanks a lot for the thorough tests. I have one comment that I think we should tackle before merging this PR.

@@ -139,7 +139,7 @@ const discoveryIgnitionConfigFormat = `{
"units": [{
"name": "agent.service",
"enabled": true,
"contents": "[Service]\nType=simple\nRestart=always\nRestartSec=3\nStartLimitInterval=0\nEnvironment=HTTP_PROXY={{.HTTPProxy}}\nEnvironment=http_proxy={{.HTTPProxy}}\nEnvironment=HTTPS_PROXY={{.HTTPSProxy}}\nEnvironment=https_proxy={{.HTTPSProxy}}\nEnvironment=NO_PROXY={{.NoProxy}}\nEnvironment=no_proxy={{.NoProxy}}{{if .PullSecretToken}}\nEnvironment=PULL_SECRET_TOKEN={{.PullSecretToken}}{{end}}\nTimeoutStartSec={{.AgentTimeoutStartSec}}\nExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin {{.AgentDockerImg}} cp /usr/bin/agent /hostbin\nExecStart=/usr/local/bin/agent --url {{.ServiceBaseURL}} --cluster-id {{.clusterId}} --agent-version {{.AgentDockerImg}} --insecure={{.SkipCertVerification}} {{if .HostCACertPath}}--cacert {{.HostCACertPath}}{{end}}\n\n[Unit]\nWants=network-online.target\nAfter=network-online.target\n\n[Install]\nWantedBy=multi-user.target"
"contents": "[Service]\nType=simple\nRestart=always\nRestartSec=3\nStartLimitInterval=0\nEnvironment=HTTP_PROXY={{.HTTPProxy}}\nEnvironment=http_proxy={{.HTTPProxy}}\nEnvironment=HTTPS_PROXY={{.HTTPSProxy}}\nEnvironment=https_proxy={{.HTTPSProxy}}\nEnvironment=NO_PROXY={{.NoProxy}}\nEnvironment=no_proxy={{.NoProxy}}{{if .PullSecretToken}}\nEnvironment=PULL_SECRET_TOKEN={{.PullSecretToken}}{{end}}\nTimeoutStartSec={{.AgentTimeoutStartSec}}\nExecStartPre=-podman rmi --force {{.AgentDockerImg}}\nExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin {{.AgentDockerImg}} cp /usr/bin/agent /hostbin\nExecStart=/usr/local/bin/agent --url {{.ServiceBaseURL}} --cluster-id {{.clusterId}} --agent-version {{.AgentDockerImg}} --insecure={{.SkipCertVerification}} {{if .HostCACertPath}}--cacert {{.HostCACertPath}}{{end}}\n\n[Unit]\nWants=network-online.target\nAfter=network-online.target\n\n[Install]\nWantedBy=multi-user.target"
Contributor

Now that I think about it a bit further, I think we should check if the image exists first.

In an environment with many concurrent nodes being deployed, this could add extra load to the local (or remote) registry.

Contributor Author

Yeah, it makes sense (although we are handling a failure scenario here, so extra load would mean that multiple machines had corrupted podman storage).

I added a change so that we first check whether the image exists and delete it only in case of errors, roughly as sketched below. So just restarting agent.service without breaking the podman storage no longer causes the image to be deleted and pulled again.
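
A minimal sketch of the idea behind the pre-start check (hypothetical helper, not the exact script from this PR; the image name is passed as an argument):

#!/bin/bash
# List the local copy of the image; if podman itself errors out (e.g.
# corrupted storage), force-remove the image so the next pre-start pull
# fetches a fresh copy. A missing image is not an error for "podman images",
# so a plain restart leaves an intact image untouched.
IMAGE="$1"

if ! podman images "${IMAGE}"; then
    podman rmi --force "${IMAGE}"
fi

# Never fail the unit start-up because of this check.
exit 0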

@openshift-ci

openshift-ci bot commented May 27, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1964591: Remove AI Agent image in case of service failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Contributor Author

mkowalski commented May 27, 2021

I slightly reorganized things here; as discussed in the comment above, restarting agent.service no longer automatically removes the assisted-installer-agent image. The removal now happens only if there is a problem executing podman images.

Currently agent.service looks as follows:

[root@worker-0 ~]# cat /etc/systemd/system/agent.service
[Service]
Type=simple
Restart=always
[...]
ExecStartPre=/usr/local/bin/agent-fix-bz1964591 quay.io/ocpmetal/assisted-installer-agent:latest
ExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin quay.io/ocpmetal/assisted-installer-agent:latest cp /usr/bin/agent /hostbin
[...]

Starting agent.service with the image already pulled looks as below

[root@worker-0 ~]# systemctl status agent.service
● agent.service
   Loaded: loaded (/etc/systemd/system/agent.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-05-27 12:37:27 UTC; 23s ago
[...]
May 27 12:37:26 worker-0 systemd[1]: Starting agent.service...
May 27 12:37:26 worker-0 agent-fix-bz1964591[2514]: quay.io/ocpmetal/assisted-installer-agent  latest  f39447ddaf2f  12 hours ago  828 MB
May 27 12:37:27 worker-0 systemd[1]: agent.service: Found left-over process 2677 (conmon) in control group while starting unit. Ignoring.
[...]
May 27 12:37:27 worker-0 systemd[1]: Started agent.service.

@openshift-bot
Contributor

/bugzilla refresh

Recalculating validity in case the underlying Bugzilla bug has changed.

@openshift-ci

openshift-ci bot commented May 30, 2021

@openshift-bot: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Recalculating validity in case the underlying Bugzilla bug has changed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@flaper87
Contributor

/bugzilla refresh

@openshift-ci

openshift-ci bot commented May 30, 2021

@flaper87: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Contributor

/bugzilla refresh

Recalculating validity in case the underlying Bugzilla bug has changed.

@openshift-ci

openshift-ci bot commented May 31, 2021

@openshift-bot: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Recalculating validity in case the underlying Bugzilla bug has changed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Contributor Author

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 31, 2021
@openshift-ci

openshift-ci bot commented May 31, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (yobshans@redhat.com), skipping review request.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Contributor Author

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

10 similar comments

@openshift-merge-robot openshift-merge-robot merged commit c9d0529 into openshift:master Jun 1, 2021
@openshift-ci

openshift-ci bot commented Jun 1, 2021

@mkowalski: All pull requests linked via external trackers have merged:

Bugzilla bug 1964591 has been moved to the MODIFIED state.

In response to this:

Bug 1964591: Remove AI Agent image in case of service failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski mkowalski deleted the bz1964591 branch June 1, 2021 11:36
YuviGold pushed a commit to YuviGold/assisted-service that referenced this pull request Jun 9, 2021
…ift#1836)

This PR adds a handler for a failure scenario of `agent.service` which
removes the `assisted-installer-agent` container image.

This is a workaround for an issue where symlinks in `/var/lib/containers/`
are corrupted. Deleting an image in `ExecStartPre` means that every time
agent.service starts we make sure the image is available. If it's the
very first attempt to start `agent.service`, then the image will be
pulled as it would be in any other scenario. Any subsequent attempt to
start `agent.service` will first check if the image is present and, in
case of errors, will remove it so that it can be pulled again.

We are not using the `OnFailure` directive because the unit defined
there would only be started once all the restart attempts are exhausted,
which is not the desired workflow in this scenario.

Closes: OCPBUGSM-29583
@mkowalski
Contributor Author

/cherry-pick ocm-2.3

Backport BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1971630

@openshift-cherrypick-robot

@mkowalski: new pull request created: #2031

In response to this:

/cherry-pick ocm-2.3

Backport BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1971630

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
