New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Destroy machine when last principal unit removed. #52
Conversation
func (m *Machine) ForceDestroy() error { | ||
// forceDestroyOps returns the transaction operations necessary to completely | ||
// remove the machine along with its units and containers. | ||
func (m *Machine) forceDestroyOps() ([]txn.Op, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm hesitant to use "forceDestroyOps" in something that should be a clean teardown. Maybe call this "destroyOps", and have a different one for forceDestroyOps which includes the regular cleanup?
I guess my concern is that we should only be tearing down the machine automatically when it is already empty (but dirty), so we shouldn't need to run all the extra "make sure everything really is gone". Because if everything isn't gone, we shouldn't be tearing down the machine.
So NOT LGTM as is, mostly because it has 0 tests. The logic seems like it would do what we want, though my personal feeling is that it is overly forceful. |
I've simplified this quite a bit by deferring Destroy() on the unit's assigned machine in Unit.Destroy -- if the machine is eligible for destruction, for which I've refactored logic out of Machine.Destroy(), into Machine.CheckDestroy(). PTAL, thanks! |
// machine is not voting) | ||
return fmt.Errorf("machine %s is required by the environment", m.doc.Id) | ||
} | ||
if m.doc.HasVote { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
grr, we shouldn't have both these checks. not actionable in this PR though.
NOT LGTM yet, we need to be obsessively careful about state changes. It might be helpful for you to refresh your memory of the "lifecycles", "death-and-destruction", and "hacking-state" documentation (in the doc/ dir); but you should also, definitely, come talk to me about anything that's not crystal clear. |
ops = append(ops, svcOp) | ||
|
||
if removeMachine { | ||
ops = append(ops, txn.Op{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not keen on having two ops referring to the same document. Might be able to expand on this comment usefully when I've seen more.
This is converging fast, but needs a couple of code tweaks (and much heavier testing than is currently in place). |
Still wip, need to add test cases with transaction hooks |
PTAL |
Update: bson.D{{"$set", bson.D{{"life", Dying}}}}, | ||
}) | ||
} else { | ||
assert = keepAssert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we'd want to $or together the various keepAssert possibilities, wouldn't we? ie if one of them no longer applied, but another did, we should be able to proceed as we did before.
PTAL |
ops = append(ops, txn.Op{ | ||
C: u.st.containerRefs.Name, | ||
Id: m.doc.Id, | ||
Assert: bson.D{{"children", containers}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd be checking for not-empty, rather than exact-match, here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Didn't see replies to earlier comments until now. |
var keepAsserts []bson.D | ||
m, err := u.st.Machine(u.doc.MachineId) | ||
if err != nil { | ||
if errors.IsNotFound(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is certainly an indication that something is up -- the machine doc doesn't match the unit. I think this should return txn.ErrTransientFailure -- it should rebuild the ops and try again in the hope of seeing consistent db state.
I'd ask for an explicit test for this situation, but I'm not sure we can construct one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this, but it caused a failure in CleanupSuite.TestCleanupForceDestroyedMachineWithContainer, cleanup_test.go:232.
PTAL |
result = append(result, tc) | ||
} | ||
{ | ||
tc := destroyMachineTestCase{desc: "principal has subordinate", destroyed: true} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think we should probably drop this test case. One reason is that unit destruction should take out the subordinates anyway...
LGTM, thank you for your patience :). |
|
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
Build failed: Tests failed |
Either way, assert that the state assumptions for being eligible or not-eligible hold true for the transaction. Improved state assertion in removeUnitOps More test cases, renamed destroyHostOps destroyHostOps fix, allow machine removal on retry If a unit's host cannot be removed, make that condition the only assertion on the transaction, so that we can re-evaluate the state completely on retry. Examples: - host container removed after check, before txn exec - host gives up voting rights after check, before txn exec Unit tests added for the above, and the converse (host qualifies on check, disqualified before txn exec). Fix import Add thrashing state testcase. Consolidate destroyHostOps machine ops. $or together all the "keep host alive" assertions, so that a transaction can succeed if the individual "keep" criteria change in a transaction but the outcome remains the same. Improve destroyHostOps test cases. Test coverage for removing an unassigned unit. Use assertLife helper function. More descriptive test names. More assertion checks. Improve non-empty containers assertion. MongoDB makes it kind of weird. https://stackoverflow.com/questions/14789684/find-mongodb-records-where-array-field-is-not-empty-using-mongoose Clean up tests, log removal of unassigned unit. Removed test cruft, .Refresh() from txn hooks. Colocated unit set up outside of the transaction hook in TestRemoveUnitMachineRetryOrCond. Remove TestDestroyCleanRelations. Check unit destroyed in TestRemoveUnitMachineRetryOrCond Consolidate & improve destroyHostOps assert conds. Refactor out unit test suite method setMachineVote. More comments. Remove subordinate unit test case.
Fixed service removal test by calling State.Cleanup.
|
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
Build failed: Tests failed |
Rolling again... |
|
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
Build failed: Tests failed |
|
Status: merge request accepted. Url: http://juju-ci.vapour.ws:8080/job/github-merge-juju |
Destroy machine when last principal unit removed. This change causes a machine to be automatically removed when the last principal unit running on that machine is removed (remove-unit). If the machine is a parent to any containers, it is left alone. Background: https://bugs.launchpad.net/juju-core/+bug/1206532 https://bugs.launchpad.net/juju-core/+bug/1183309/comments/8 This could use more test coverage, but I'd like to get a general feel for how it's looking so far before I go much further with it. This is my first time mucking around with state transactions. Please let me know if I've got it right! I haven't added an option to keep empty machines/containers around after removing the last unit. I'd like to see if we can avoid it. If we do need the option, I think it should be opt in. Let me know.
charm: allow for juju-info relation
Make model optional in blockdevice schema MAAS2 sometimes sends null for the "model" field in blockdevice.
This change causes a machine to be automatically removed when the last principal unit running on that machine is removed (remove-unit). If the machine is a parent to any containers, it is left alone.
Background:
https://bugs.launchpad.net/juju-core/+bug/1206532
https://bugs.launchpad.net/juju-core/+bug/1183309/comments/8
This could use more test coverage, but I'd like to get a general feel for how it's looking so far before I go much further with it. This is my first time mucking around with state transactions. Please let me know if I've got it right!
I haven't added an option to keep empty machines/containers around after removing the last unit. I'd like to see if we can avoid it. If we do need the option, I think it should be opt in. Let me know.