
Bug 1964591: Remove AI Agent image in case of service failure #1836

Merged
merged 1 commit into openshift:master from bz1964591 on Jun 1, 2021

Conversation

mkowalski
Contributor

@mkowalski mkowalski commented May 26, 2021

This PR adds a handler for a failure scenario of agent.service which
removes the assisted-installer-agent container image.

This is a workaround for an issue where symlinks in /var/lib/containers/
are corrupted. Deleting an image in ExecStartPre means that every time
agent.service starts we make sure the image is available. If it's the
very first attempt to start agent.service, then the image will be
pulled as it would be in any other scenario. Any subsequent attempt to
start agent.service will first check if the image is present and, in
case of errors, will remove it so that it can be pulled again.

We are not using the OnFailure directive because the unit defined
there would only be started once all the restart attempts are exhausted,
which is not the desired workflow in this scenario.

Closes: OCPBUGSM-29583
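
In practice the fix boils down to an extra ExecStartPre hook that runs before the existing image pull; a minimal sketch of the final shape (the rendered unit and the exact script invocation are shown further down in this thread, the image reference is the one used in the dev environment):

[Service]
# New pre-start hook: check the local copy of the agent image and remove it
# if podman reports an error (e.g. corrupted storage under /var/lib/containers/),
# so the following pre-start pull fetches a fresh copy.
ExecStartPre=/usr/local/bin/agent-fix-bz1964591 quay.io/ocpmetal/assisted-installer-agent:latest
# Existing pre-start step: pull the image (if needed) and copy the agent binary out.
ExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin quay.io/ocpmetal/assisted-installer-agent:latest cp /usr/bin/agent /hostbin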

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 26, 2021
@openshift-ci

openshift-ci bot commented May 26, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

[WIP] Bug 1964591: Remove AI Agent image in case of service failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@openshift-ci openshift-ci bot requested review from eranco74 and ybettan May 26, 2021 11:24
@flaper87
Contributor

/assign @flaper87

@mkowalski mkowalski changed the title [WIP] Bug 1964591: Remove AI Agent image in case of service failure Bug 1964591: Remove AI Agent image in case of service failure May 26, 2021
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 26, 2021
@mkowalski
Contributor Author

/bugzilla refresh

@openshift-ci

openshift-ci bot commented May 26, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci

openshift-ci bot commented May 26, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1964591: Remove AI Agent image in case of service failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

1 similar comment

@mkowalski
Contributor Author

/retest

@openshift-ci

openshift-ci bot commented May 27, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1964591: Remove AI Agent image in case of service failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Contributor Author

mkowalski commented May 27, 2021

With this change, the systemd unit looks as follows:

[core@worker-0 ~]$ cat /etc/systemd/system/agent.service
[Service]
Type=simple
Restart=always
RestartSec=3
StartLimitInterval=0
Environment=HTTP_PROXY=
Environment=http_proxy=
Environment=HTTPS_PROXY=
Environment=https_proxy=
Environment=NO_PROXY=
Environment=no_proxy=
Environment=PULL_SECRET_TOKEN=eyJhbGciOiJFUzI1NiIsInR5cCI6IkpXVCJ9.eyJjbHVzdGVyX2lkIjoiN2MwNWI4YTMtNWRjMC00YjcwLWE0MDktOGYxNDk0OGVkZjJlIn0.ReyzNdQKiyvU7uiF7IhjRXc5ORzeInIZ3KmYYB-zOWGFqkU3vnJfw6U-EY5ypGEP6VK0wy_SZ2lDkuVR0K0oIA
TimeoutStartSec=180
ExecStartPre=-podman rmi --force quay.io/ocpmetal/assisted-installer-agent:latest
ExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin quay.io/ocpmetal/assisted-installer-agent:latest cp /usr/bin/agent /hostbin
ExecStart=/usr/local/bin/agent --url https://assisted-service-assisted-installer.apps.ostest.test.metalkube.org --cluster-id 7c05b8a3-5dc0-4b70-a409-8f14948edf2e --agent-version quay.io/ocpmetal/assisted-installer-agent:latest --insecure=false  --cacert /etc/assisted-service/service-ca-cert.crt

[Unit]
Wants=network-online.target
After=network-online.target

[Install]
WantedBy=multi-user.target

Note that we are using the - prefix in order to ignore the exit status of podman rmi. This is because on the very first run of agent.service there is no image to be removed yet, so podman rmi fails for a valid reason (see the sketch after the status output below).

[core@worker-0 ~]$ systemctl status agent.service 
● agent.service
   Loaded: loaded (/etc/systemd/system/agent.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-05-27 09:18:02 UTC; 32min ago
  Process: 1634 ExecStartPre=/usr/bin/podman run --privileged --rm -v /usr/local/bin:/hostbin quay.io/ocpmetal/assisted-installer-agent:latest cp /usr/bin/agent /hostbin (code=exited, status=0/SUCCESS)
  Process: 1565 ExecStartPre=/usr/bin/podman rmi --force quay.io/ocpmetal/assisted-installer-agent:latest (code=exited, status=1/FAILURE)
[...]
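
For reference, the - prefix on an ExecStartPre line tells systemd to record a non-zero exit status but treat it as success; in shell terms the removal step behaves roughly like this (image reference as used above):

# Rough shell equivalent of "ExecStartPre=-podman rmi --force IMAGE":
# the removal is attempted, but a failure (e.g. nothing to remove yet)
# does not abort the unit start-up.
podman rmi --force quay.io/ocpmetal/assisted-installer-agent:latest || true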

@flaper87
Contributor

Note that we are using the - prefix in order to ignore the exit status of podman rmi. This is because on the very first run of agent.service there is no image to be removed yet, so podman rmi fails for a valid reason.

Could you try restarting the agent? Just want to do a sanity check :)

@romfreiman
Contributor

Is it the only image that we pull?

@mkowalski
Contributor Author

Is it the only image that we pull?

Those are all the images I have on the worker during the installation

REPOSITORY                                      TAG     IMAGE ID      CREATED       SIZE
quay.io/ocpmetal/assisted-installer-agent       latest  f39447ddaf2f  6 hours ago   828 MB
quay.io/ocpmetal/assisted-installer             latest  8c10045f04c8  13 hours ago  273 MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev  <none>  147d9596fcf6  2 weeks ago   438 MB
quay.io/openshift-release-dev/ocp-v4.0-art-dev  <none>  87bff3c8ae43  2 weeks ago   391 MB

@mkowalski
Contributor Author

Could you try restarting the agent? Just want to do a sanity check :)

I did it as a manual test (roughly sketched below), i.e.

  • started installation of the SNO
  • during the runtime of agent.service, manually killed the spawned processes (kill -9 XXX)
  • systemd noticed the failure and restarted agent.service
  • the podman image got removed
  • during the startup of agent.service the quay.io/ocpmetal/assisted-installer-agent image got pulled again
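
Roughly, the commands behind that check looked like this (illustrative; PIDs and exact output vary):

# kill the running agent processes to simulate a failure
pkill -9 -f /usr/local/bin/agent
# systemd notices the failure and restarts the unit after RestartSec
systemctl status agent.service
# confirm the image was removed and then pulled again during start-up
podman images quay.io/ocpmetal/assisted-installer-agent
journalctl -u agent.service --no-pager | tail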

Contributor

@flaper87 flaper87 left a comment

Thanks a lot for the thorough tests. I have one comment that I think we should tackle before merging this PR.

@@ -139,7 +139,7 @@ const discoveryIgnitionConfigFormat = `{
"units": [{
"name": "agent.service",
"enabled": true,
"contents": "[Service]\nType=simple\nRestart=always\nRestartSec=3\nStartLimitInterval=0\nEnvironment=HTTP_PROXY={{.HTTPProxy}}\nEnvironment=http_proxy={{.HTTPProxy}}\nEnvironment=HTTPS_PROXY={{.HTTPSProxy}}\nEnvironment=https_proxy={{.HTTPSProxy}}\nEnvironment=NO_PROXY={{.NoProxy}}\nEnvironment=no_proxy={{.NoProxy}}{{if .PullSecretToken}}\nEnvironment=PULL_SECRET_TOKEN={{.PullSecretToken}}{{end}}\nTimeoutStartSec={{.AgentTimeoutStartSec}}\nExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin {{.AgentDockerImg}} cp /usr/bin/agent /hostbin\nExecStart=/usr/local/bin/agent --url {{.ServiceBaseURL}} --cluster-id {{.clusterId}} --agent-version {{.AgentDockerImg}} --insecure={{.SkipCertVerification}} {{if .HostCACertPath}}--cacert {{.HostCACertPath}}{{end}}\n\n[Unit]\nWants=network-online.target\nAfter=network-online.target\n\n[Install]\nWantedBy=multi-user.target"
"contents": "[Service]\nType=simple\nRestart=always\nRestartSec=3\nStartLimitInterval=0\nEnvironment=HTTP_PROXY={{.HTTPProxy}}\nEnvironment=http_proxy={{.HTTPProxy}}\nEnvironment=HTTPS_PROXY={{.HTTPSProxy}}\nEnvironment=https_proxy={{.HTTPSProxy}}\nEnvironment=NO_PROXY={{.NoProxy}}\nEnvironment=no_proxy={{.NoProxy}}{{if .PullSecretToken}}\nEnvironment=PULL_SECRET_TOKEN={{.PullSecretToken}}{{end}}\nTimeoutStartSec={{.AgentTimeoutStartSec}}\nExecStartPre=-podman rmi --force {{.AgentDockerImg}}\nExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin {{.AgentDockerImg}} cp /usr/bin/agent /hostbin\nExecStart=/usr/local/bin/agent --url {{.ServiceBaseURL}} --cluster-id {{.clusterId}} --agent-version {{.AgentDockerImg}} --insecure={{.SkipCertVerification}} {{if .HostCACertPath}}--cacert {{.HostCACertPath}}{{end}}\n\n[Unit]\nWants=network-online.target\nAfter=network-online.target\n\n[Install]\nWantedBy=multi-user.target"
Contributor

Now that I think about it a bit further, I think we should check if the image exists first.

In an environment with many concurrent nodes being deployed, this could add extra load to the local (or remote) registry.

Contributor Author

Yeah, it makes sense (although we are handling a failure scenario here, so extra load would mean that multiple machines had corrupted podman storage).

I added a change so that we first check whether the image exists and delete it only in case of errors, roughly as sketched below. So just restarting agent.service without breaking the podman storage no longer causes the image to be deleted and pulled again.
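
A minimal sketch of the idea behind the pre-start check (hypothetical helper, not the exact script from this PR; the image name is passed as an argument):

#!/bin/bash
# List the local copy of the image; if podman itself errors out (e.g.
# corrupted storage), force-remove the image so the next pre-start pull
# fetches a fresh copy. A missing image is not an error for "podman images",
# so a plain restart leaves an intact image untouched.
IMAGE="$1"

if ! podman images "${IMAGE}"; then
    podman rmi --force "${IMAGE}"
fi

# Never fail the unit start-up because of this check.
exit 0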

@openshift-ci

openshift-ci bot commented May 27, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1964591: Remove AI Agent image in case of service failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Contributor Author

mkowalski commented May 27, 2021

I slightly reorganized things here; as discussed in the comment above, restarting agent.service no longer automatically removes the assisted-installer-agent image. The removal now happens only if there is a problem executing podman images.

Currently agent.service looks as follows:

[root@worker-0 ~]# cat /etc/systemd/system/agent.service
[Service]
Type=simple
Restart=always
[...]
ExecStartPre=/usr/local/bin/agent-fix-bz1964591 quay.io/ocpmetal/assisted-installer-agent:latest
ExecStartPre=podman run --privileged --rm -v /usr/local/bin:/hostbin quay.io/ocpmetal/assisted-installer-agent:latest cp /usr/bin/agent /hostbin
[...]

Starting agent.service with the image already pulled looks as below

[root@worker-0 ~]# systemctl status agent.service
● agent.service
   Loaded: loaded (/etc/systemd/system/agent.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-05-27 12:37:27 UTC; 23s ago
[...]
May 27 12:37:26 worker-0 systemd[1]: Starting agent.service...
May 27 12:37:26 worker-0 agent-fix-bz1964591[2514]: quay.io/ocpmetal/assisted-installer-agent  latest  f39447ddaf2f  12 hours ago  828 MB
May 27 12:37:27 worker-0 systemd[1]: agent.service: Found left-over process 2677 (conmon) in control group while starting unit. Ignoring.
[...]
May 27 12:37:27 worker-0 systemd[1]: Started agent.service.

@openshift-bot
Contributor

/bugzilla refresh

Recalculating validity in case the underlying Bugzilla bug has changed.

@openshift-ci

openshift-ci bot commented May 30, 2021

@openshift-bot: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Recalculating validity in case the underlying Bugzilla bug has changed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@flaper87
Contributor

/bugzilla refresh

@openshift-ci

openshift-ci bot commented May 30, 2021

@flaper87: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot
Contributor

/bugzilla refresh

Recalculating validity in case the underlying Bugzilla bug has changed.

@openshift-ci

openshift-ci bot commented May 31, 2021

@openshift-bot: This pull request references Bugzilla bug 1964591, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.8-premerge" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Recalculating validity in case the underlying Bugzilla bug has changed.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Contributor Author

/bugzilla refresh

@openshift-ci openshift-ci bot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels May 31, 2021
@openshift-ci

openshift-ci bot commented May 31, 2021

@mkowalski: This pull request references Bugzilla bug 1964591, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (yobshans@redhat.com), skipping review request.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski
Contributor Author

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

10 similar comments

@openshift-merge-robot openshift-merge-robot merged commit c9d0529 into openshift:master Jun 1, 2021
@openshift-ci

openshift-ci bot commented Jun 1, 2021

@mkowalski: All pull requests linked via external trackers have merged:

Bugzilla bug 1964591 has been moved to the MODIFIED state.

In response to this:

Bug 1964591: Remove AI Agent image in case of service failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@mkowalski mkowalski deleted the bz1964591 branch June 1, 2021 11:36
YuviGold pushed a commit to YuviGold/assisted-service that referenced this pull request Jun 9, 2021
…ift#1836)

This PR adds a handler for a failure scenario of `agent.service` which
removes the `assisted-installer-agent` container image.

This is a workaround for an issue where symlinks in `/var/lib/containers/`
are corrupted. Deleting an image in `ExecStartPre` means that every time
agent.service starts we make sure the image is available. If it's the
very first attempt to start `agent.service`, then the image will be
pulled as it would be in any other scenario. Any subsequent attempt to
start `agent.service` will first check if the image is present and, in
case of errors, will remove it so that it can be pulled again.

We are not using the `OnFailure` directive because the unit defined
there would only be started once all the restart attempts are exhausted,
which is not the desired workflow in this scenario.

Closes: OCPBUGSM-29583
@mkowalski
Contributor Author

/cherry-pick ocm-2.3

Backport BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1971630

@openshift-cherrypick-robot

@mkowalski: new pull request created: #2031

In response to this:

/cherry-pick ocm-2.3

Backport BZ is https://bugzilla.redhat.com/show_bug.cgi?id=1971630

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
