
maintenance: remove pools and volumes #3620

Merged

Conversation

clnperez
Contributor

This will clean up all volumes under all non-default pools. The
openshift CI creates a pool for each cluster.

Signed-off-by: Christy Norman christy@linux.vnet.ibm.com

@clnperez clnperez changed the title remove pools and volumes maintenance: remove pools and volumes May 19, 2020
@clnperez
Contributor Author

/retest

@clnperez
Contributor Author

It doesn't seem like any of those failures are related. The tf-lint one says it passed. 🤷‍♀️
@jcpowermac @jhixson74 can you chime in?

if test "${POOL}" = default
then
continue
fi
Contributor

First time I've looked at this script, but it seems destroying all volumes, including the ones in the default pool, would be more consistent with what it's doing (which is removing all libvirt VMs/networks/...). Apart from this, looks good to me.
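For illustration, a minimal sketch of what that suggestion could look like: wipe the volumes in every pool, but only destroy and undefine the non-default ones. This is a hypothetical rework, not the committed patch, and the way pool and volume names are gathered from virsh output here is an assumption.

# hypothetical sketch, not the committed change
for POOL in $(virsh pool-list --all | tail -n +3 | awk '{print $1}')
do
    # remove every volume, default pool included
    for VOL in $(virsh vol-list "${POOL}" | tail -n +3 | awk '{print $1}')
    do
        virsh vol-delete --pool "${POOL}" "${VOL}"
    done
    # only tear down the pools that CI created
    if test "${POOL}" != default
    then
        virsh pool-destroy "${POOL}"
        virsh pool-undefine "${POOL}"
    fi
done

A real version would also need to handle inactive pools (vol-list only works on active ones), but the control flow above captures the behavior being discussed.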

This will clean up all volumes under all non-default pools. The
openshift CI creates a pool for each cluster.

Signed-off-by: Christy Norman <christy@linux.vnet.ibm.com>
@clnperez
Contributor Author

@cfergeau -- missed your review earlier, but just made that change and pushed it.

@cfergeau
Contributor

/lgtm /approve

@clnperez
Contributor Author

@cfergeau is mergebot not going to take this one?

@cfergeau
Contributor

@cfergeau is mergebot not going to take this one?

no idea how this bot works. Let's try again
/lgtm
/approve

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 24, 2020
@cfergeau
Contributor

/assign @abhinavdahiya
Your approval is apparently needed here.

@clnperez
Contributor Author

clnperez commented Aug 3, 2020

ping @abhinavdahiya -- could this one get merged?

@clnperez
Contributor Author

clnperez commented Aug 3, 2020

/retest

@openshift-ci-robot
Contributor

@clnperez: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-ovirt 48fe5ed link /test e2e-ovirt
ci/prow/e2e-crc 48fe5ed link /test e2e-crc
ci/prow/e2e-metal-ipi 48fe5ed link /test e2e-metal-ipi

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya
Contributor

This will clean up all volumes under all non-default pools. The
openshift CI creates a pool for each cluster.

That is pretty dangerous, removing all the volume pools; I think the entire script is like that, though.

Why is openshift-install destroy cluster not enough?

@clnperez
Contributor Author

clnperez commented Sep 3, 2020

@abhinavdahiya are you suggesting it not remove the default pool, then? Or that this script not remove pools at all? It does give a warning at the top of the script that all resources are being destroyed. Without this change, we have leftover volume pools after some CI runs. The cluster destroy seems to not always be called, and this is the best thing we have to get a clean system back.
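For context, the teardown being discussed is the installer's own destroyer, which only works while the cluster's asset directory is still around (the directory path below is just an example):

# standard teardown; when it completes successfully nothing should be left behind
openshift-install destroy cluster --dir ./mycluster --log-level debug

When that step is skipped or interrupted, the maintenance script is the fallback for getting the libvirt host back to a clean state.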

@clnperez
Contributor Author

@abhinavdahiya @cfergeau ping. (BTW, I added a Slack reminder for myself so I won't miss the emails sent when you make comments and can hopefully speed this up. Apologies again for the long delays between my updates in the past.)

@clnperez
Contributor Author

@jcpowermac @abhinavdahiya @cfergeau ping again

@sdodson
Member

sdodson commented Sep 29, 2020

@clnperez Can the CI job not be amended so that it always runs openshift-install destroy cluster, or alternatively can this CI-specific logic move into the CI job rather than the installer repo?

@clnperez
Contributor Author

@sdodson it does run that. But we've seen times when not everything is cleaned up. I don't know that this is a CI-specific script per se.

@clnperez
Contributor Author

clnperez commented Sep 30, 2020

A little more context ... here's the place this is mentioned in the libvirt readme: https://github.com/openshift/installer/blob/master/docs/dev/libvirt/README.md#cleanup

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 30, 2020
@cfergeau
Contributor

cfergeau commented Jan 4, 2021

Given that this script is already fairly destructive, and has to be run manually, I'm fine with extending it.

@jaypoulz
Contributor

/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 12, 2021
@jaypoulz
Contributor

In order to get this landed, we need an approval from one of the members in the OWNERS_ALIASES file.
See:
https://github.com/openshift/installer/blob/master/OWNERS_ALIASES#L4-L15

@jaypoulz
Contributor

@jcpowermac @jhixson74 @crawford @abhinavdahiya Can any of you take a look?

@jaypoulz
Contributor

@clnperez You will also need a Bugzilla bug targeted to the current release. :)
You can link it by naming the PR:
Bug BUG_NUMBER: TITLE_GOES_HERE
Finally, a /bugzilla refresh should link up the issue.

@jaypoulz
Contributor

Also, is this script used anywhere?
In CI, we're currently deleting domains, pools, and networks associated with the lease we've acquired. IMHO, this script should be reworked to target a specific cluster-name pattern.

I disagree with the point about destroying the volumes in the pool instead of the pool itself since the pool is created during deployment by the installer. Destroying the volumes before you destroy+undefine the pool is just destroying+undefining the pool with more steps.

We use a variation of this script in our dev environments with a similar "target", only it uses our user IDs to filter the clusters. This script destroys all domains and all networks as well, so it's already super dangerous, which is why I'd like to see it made more targeted.

To answer @abhinavdahiya's question:
If you run a libvirt destroy and something interrupts it, the libvirt resources won't get fully destroyed but all your cluster info will. This means the only way to restore a working environment (or to clean up the leaked resources) is to manually clean up the virsh resources. This happens much more often than we would like.
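A rough sketch of the name-based targeting described above might look like the following. The CLUSTER_PREFIX variable and the grep-based filtering are hypothetical; this is not what the script currently does.

# Hypothetical: only touch resources whose names match a given cluster prefix
CLUSTER_PREFIX="${1:?usage: $0 <cluster-name-prefix>}"

for DOM in $(virsh list --all --name | grep "^${CLUSTER_PREFIX}")
do
    # a real script would check domain state before calling destroy
    virsh destroy "${DOM}"
    virsh undefine "${DOM}"
done

for NET in $(virsh net-list --all | tail -n +3 | awk '{print $1}' | grep "^${CLUSTER_PREFIX}")
do
    virsh net-destroy "${NET}"
    virsh net-undefine "${NET}"
done

for POOL in $(virsh pool-list --all | tail -n +3 | awk '{print $1}' | grep "^${CLUSTER_PREFIX}")
do
    virsh pool-destroy "${POOL}"
    virsh pool-undefine "${POOL}"
done

Scoping everything to a prefix keeps the "nuke the whole host" behavior off shared machines while still cleaning up leaked clusters.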

@staebler
Contributor

If you run a libvirt destroy and something interrupts it, the libvirt resources won't get fully destroyed but all your cluster info will. This means the only way to restore a working environment (or to clean up the leaked resources) is to manually clean up the virsh resources. This happens much more often than we would like.

Why does an interrupted libvirt destroy delete the cluster info? Is the libvirt destroy that you are referring to the openshift-install destroy command? If so, can we fix the destroyer so that it does not delete the cluster info unless the destroy is successful?

@jaypoulz
Contributor

Why does an interrupted libvirt destroy delete the cluster info?

I am unsure. I suspect that the terraform teardown might not receive an error, or if it does receive one, it might ignore it.

Is the libvirt destroy that you are referring to the openshift-install destroy command?

Yes!

If so, can we fix the destroyer so that it does not delete the cluster info unless the destroy is successful?

That would be a great fix.

@crawford
Contributor

This seems fine to me. @jaypoulz can you file a BZ for the cluster destroy behavior you noted? That sounds like a bug.

/approve

This will also need a valid BZ or will need to wait until the branch opens again in a few weeks.

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cfergeau, crawford

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 25, 2021
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

9 similar comments

@sdodson
Member

sdodson commented Jan 26, 2021

/skip
That job hasn't passed in months; maybe it's no longer required?

@sdodson sdodson added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 26, 2021
@sdodson
Member

sdodson commented Jan 26, 2021

I've overridden bugzilla/valid-bug since this only affects libvirt use cases and I don't expect us to attempt to backport this. If we do need to backport then we'll need to get a bug filed retroactively.

@openshift-ci
Contributor

openshift-ci bot commented Jan 26, 2021

@clnperez: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-crc 48fe5ed link /test e2e-crc

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@openshift-merge-robot openshift-merge-robot merged commit 95855a1 into openshift:master Jan 26, 2021
@clnperez clnperez deleted the libvirt-pool-cleanup branch February 11, 2022 17:07