
maintenance: remove pools and volumes #3620

Merged

Conversation

clnperez
Contributor

This will clean up all volumes under all non-default pools. The
openshift CI creates a pool for each cluster.

Signed-off-by: Christy Norman christy@linux.vnet.ibm.com

@clnperez clnperez changed the title remove pools and volumes maintenance: remove pools and volumes May 19, 2020
@clnperez
Contributor Author

/retest

@clnperez
Contributor Author

It doesn't seem like any of those failures are related. The tf-lint one says it passed. 🤷‍♀️
@jcpowermac @jhixson74 can you chime in?

if test "${POOL}" = default
then
continue
fi
Contributor

First time I've looked at this script, but it seems destroying all volumes, including the ones in the default pool, would be more consistent with what it's doing (which is removing all libvirt VMs/networks/...). Apart from this, looks good to me.
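For illustration, a minimal sketch of what that suggestion could look like: wipe the volumes in every pool, but only destroy and undefine the non-default ones. This is a hypothetical rework, not the committed patch, and the way pool and volume names are gathered from virsh output here is an assumption.

# hypothetical sketch, not the committed change
for POOL in $(virsh pool-list --all | tail -n +3 | awk '{print $1}')
do
    # remove every volume, default pool included
    for VOL in $(virsh vol-list "${POOL}" | tail -n +3 | awk '{print $1}')
    do
        virsh vol-delete --pool "${POOL}" "${VOL}"
    done
    # only tear down the pools that CI created
    if test "${POOL}" != default
    then
        virsh pool-destroy "${POOL}"
        virsh pool-undefine "${POOL}"
    fi
done

A real version would also need to handle inactive pools (vol-list only works on active ones), but the control flow above captures the behavior being discussed.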

This will clean up all volumes under all non-default pools. The
openshift CI creates a pool for each cluster.

Signed-off-by: Christy Norman <christy@linux.vnet.ibm.com>
@clnperez
Contributor Author

@cfergeau -- missed your review earlier, but just made that change and pushed it.

@cfergeau
Contributor

/lgtm /approve

@clnperez
Contributor Author

@cfergeau is mergebot not going to take this one?

@cfergeau
Contributor

@cfergeau is mergebot not going to take this one?

no idea how this bot works. Let's try again
/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 24, 2020
@cfergeau
Contributor

/assign @abhinavdahiya
Your approval is apparently needed here.

@clnperez
Contributor Author

clnperez commented Aug 3, 2020

ping @abhinavdahiya -- could this one get merged?

@clnperez
Contributor Author

clnperez commented Aug 3, 2020

/retest

@openshift-ci-robot
Contributor

@clnperez: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-ovirt 48fe5ed link /test e2e-ovirt
ci/prow/e2e-crc 48fe5ed link /test e2e-crc
ci/prow/e2e-metal-ipi 48fe5ed link /test e2e-metal-ipi

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya
Contributor

This will clean up all volumes under all non-default pools. The
openshift CI creates a pool for each cluster.

That is pretty dangerous, removing all the volume pools; I think the entire script is like that, though.

Why is openshift-install destroy cluster not enough?

@clnperez
Contributor Author

clnperez commented Sep 3, 2020

@abhinavdahiya are you suggesting it not remove the default pool, then? Or that this script not remove pools at all? It does give a warning at the top of the script that all resources are being destroyed. Without this change, we have leftover volume pools after some CI runs. The cluster destroy seems to not always be called, and this is the best thing we have to get a clean system back.
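For context, the teardown being discussed is the installer's own destroyer, which only works while the cluster's asset directory is still around (the directory path below is just an example):

# standard teardown; when it completes successfully nothing should be left behind
openshift-install destroy cluster --dir ./mycluster --log-level debug

When that step is skipped or interrupted, the maintenance script is the fallback for getting the libvirt host back to a clean state.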

@clnperez
Contributor Author

@abhinavdahiya @cfergeau ping. (BTW, I added a Slack reminder for myself so I won't miss the emails sent when you make comments and can hopefully speed this up. Apologies again for the long delays between my updates in the past.)

@clnperez
Contributor Author

@jcpowermac @abhinavdahiya @cfergeau ping again

@sdodson
Member

sdodson commented Sep 29, 2020

@clnperez Can the CI job not be amended so that it always runs openshift-install destroy cluster, or alternatively can this CI-specific logic move into the CI job rather than the installer repo?

@clnperez
Contributor Author

@sdodson it does run that. But we've seen times when not everything is cleaned up. I don't know that this is a CI-specific script per se.

@clnperez
Contributor Author

clnperez commented Sep 30, 2020

A little more context ... here's the place this is mentioned in the libvirt readme: https://github.com/openshift/installer/blob/master/docs/dev/libvirt/README.md#cleanup

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2020
@cfergeau
Contributor

cfergeau commented Jan 4, 2021

Given that this script is already fairly destructive, and has to be run manually, I'm fine with extending it.

@jaypoulz
Contributor

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 12, 2021
@jaypoulz
Contributor

In order to get this landed, we need an approval from one of the members in the OWNERS_ALIASES file.
See:
https://github.com/openshift/installer/blob/master/OWNERS_ALIASES#L4-L15

@jaypoulz
Contributor

@jcpowermac @jhixson74 @crawford @abhinavdahiya Can any of you take a look?

@jaypoulz
Contributor

@clnperez You will also need a Bugzilla bug targeted to the current release. :)
You can link it by naming the PR:
Bug BUG_NUMBER: TITLE_GOES_HERE
Finally, a /bugzilla refresh should link up the issue.

@jaypoulz
Contributor

Also, is this script used anywhere?
In CI, we're currently deleting domains, pools, and networks associated with the lease we've acquired. IMHO, this script should be reworked to target a specific cluster-name pattern.

I disagree with the point about destroying the volumes in the pool instead of the pool itself since the pool is created during deployment by the installer. Destroying the volumes before you destroy+undefine the pool is just destroying+undefining the pool with more steps.

We use a variation of this script in our dev environments with a similar "target", only it uses our user IDs to filter the clusters. This script destroys all domains and all networks as well, so it's already super dangerous, which is why I'd like to see it made more targeted.

To answer @abhinavdahiya's question:
If you run a libvirt destroy and something interrupts it, the libvirt resources won't get fully destroyed but all your cluster info will. This means the only way to restore a working environment (or to clean up the leaked resources) is to manually clean up the virsh resources. This happens much more often than we would like.
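A rough sketch of the name-based targeting described above might look like the following. The CLUSTER_PREFIX variable and the grep-based filtering are hypothetical; this is not what the script currently does.

# Hypothetical: only touch resources whose names match a given cluster prefix
CLUSTER_PREFIX="${1:?usage: $0 <cluster-name-prefix>}"

for DOM in $(virsh list --all --name | grep "^${CLUSTER_PREFIX}")
do
    # a real script would check domain state before calling destroy
    virsh destroy "${DOM}"
    virsh undefine "${DOM}"
done

for NET in $(virsh net-list --all | tail -n +3 | awk '{print $1}' | grep "^${CLUSTER_PREFIX}")
do
    virsh net-destroy "${NET}"
    virsh net-undefine "${NET}"
done

for POOL in $(virsh pool-list --all | tail -n +3 | awk '{print $1}' | grep "^${CLUSTER_PREFIX}")
do
    virsh pool-destroy "${POOL}"
    virsh pool-undefine "${POOL}"
done

Scoping everything to a prefix keeps the "nuke the whole host" behavior off shared machines while still cleaning up leaked clusters.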

@staebler
Contributor

If you run a libvirt destroy and something interrupts it, the libvirt resources won't get fully destroyed but all your cluster info will. This means the only way to restore a working environment (or to clean up the leaked resources) is to manually clean up the virsh resources. This happens much more often than we would like.

Why does an interrupted libvirt destroy delete the cluster info? Is the libvirt destroy that you are referring to the openshift-install destroy command? If so, can we fix the destroyer so that it does not delete the cluster info unless the destroy is successful?

@jaypoulz
Contributor

Why does an interrupted libvirt destroy delete the cluster info?

I am unsure. I suspect that the terraform teardown might not receive an error, or if it does receive one, it might ignore it.

Is the libvirt destroy that you are referring to the openshift-install destroy command?

Yes!

If so, can we fix the destroyer so that it does not delete the cluster info unless the destroy is successful?

That would be a great fix.

@crawford
Contributor

This seems fine to me. @jaypoulz can you file a BZ for the cluster destroy behavior you noted? That sounds like a bug.

/approve

This will also need a valid BZ or will need to wait until the branch opens again in a few weeks.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfergeau, crawford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 25, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

9 similar comments

@sdodson
Member

sdodson commented Jan 26, 2021

/skip
That job hasn't passed in months; maybe it's no longer required?

@sdodson sdodson added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 26, 2021
@sdodson
Member

sdodson commented Jan 26, 2021

I've overridden bugzilla/valid-bug since this only affects libvirt use cases and I don't expect us to attempt to backport this. If we do need to backport then we'll need to get a bug filed retroactively.

@openshift-ci
Contributor

openshift-ci bot commented Jan 26, 2021

@clnperez: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-crc 48fe5ed link /test e2e-crc

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@openshift-merge-robot openshift-merge-robot merged commit 95855a1 into openshift:master Jan 26, 2021
@clnperez clnperez deleted the libvirt-pool-cleanup branch February 11, 2022 17:07