
Add tear down events #9

Merged
4 commits merged into openshift:master from sttts-tear-down-events on Jan 28, 2019

Conversation

@sttts (Contributor) commented Jan 24, 2019

  • Send out a bootstrap-success event before tear down.
  • Optionally wait for the event given to --tear-down-event before tearing down (both behaviours are sketched below).
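
For orientation, here is a minimal sketch of what the two behaviours could look like with client-go. It is illustrative only: the helper name, event message, and polling loop are assumptions, not the merged code, and the context-free Create/Get signatures match the client-go vintage this PR targets.

import (
    "time"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// signalAndWait sends the bootstrap-success event and, if a tear-down event
// name was given on the command line, blocks until that event appears.
func signalAndWait(client kubernetes.Interface, tearDownEventName string) error {
    // Tell the installer that the requested bootstrap pods are up.
    event := &corev1.Event{
        ObjectMeta: metav1.ObjectMeta{Name: "bootstrap-success", Namespace: "kube-system"},
        InvolvedObject: corev1.ObjectReference{
            Kind:      "Namespace",
            Name:      "kube-system",
            Namespace: "kube-system",
        },
        Message: "required control plane pods have been created",
        Type:    corev1.EventTypeNormal,
        Reason:  "BootstrapSuccess",
    }
    if _, err := client.CoreV1().Events("kube-system").Create(event); err != nil {
        return err
    }

    // Optionally hold off tear down until the installer answers with its own event.
    if tearDownEventName == "" {
        return nil
    }
    for {
        if _, err := client.CoreV1().Events("kube-system").Get(tearDownEventName, metav1.GetOptions{}); err == nil {
            return nil // event seen, safe to tear down the bootstrap control plane
        }
        time.Sleep(5 * time.Second)
    }
}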

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 24, 2019
@sttts (Contributor Author) commented Jan 24, 2019

Untested.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 24, 2019
@sttts (Contributor Author) commented Jan 24, 2019

/assign @wking @mfojtik

@openshift-ci-robot

@sttts: GitHub didn't allow me to assign the following users: wking.

Note that only openshift members and repo collaborators can be assigned and that issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @wking @mfojtik

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sttts sttts force-pushed the sttts-tear-down-events branch 3 times, most recently from dd61940 to b0f0b5f Compare January 24, 2019 21:34
@@ -72,6 +74,11 @@ func (b *startCommand) Run() error {
return err
}

// notify installer that we are ready to tear down the temporary bootstrap control plane
if _, err := client.CoreV1().Events("kube-system").Create(makeBootstrapSuccessEvent("kube-system", "bootstrap-success")); err != nil {
@wking (Member), Jan 25, 2019


The installer uses bootstrap-complete for this. But instead of hard-coding this, can we make it an option (--pods-available-event?)? That way the installer can gracefully transition from its direct bootstrap-complete injection to a cluster-bootstrap-injected bootstrap-complete without having runs doubling up on that event. If the option was not set, then cluster-bootstrap would not inject an event on pod completion.

@sttts (Contributor Author)

Sure, I can make that configurable. But note that the bootstrap-success event is always sent and is simply ignored today because the installer has no code to consume it. The only behaviour you have to enable explicitly (which changes cluster-bootstrap's behaviour) is --wait-for-event-before-tear-down.

@wking (Member)

With this change, I'm expecting this event to be able to replace the current installer-injected event. I have a WIP installer shuffle to make that work; we'll see how it goes ;).

@sttts (Contributor Author)

Technically you still have to wait for bootstrap-complete, because only then can you be sure that everything has been shut down.
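
For illustration, this is one way the optional event names discussed in this thread could be surfaced as flags. The --pods-available-event name comes from the proposal above, and the pflag wiring and type names are assumptions about the command setup, not the merged code.

import "github.com/spf13/pflag"

// startEventFlags holds the two optional event names. Empty values keep the
// current behaviour: no extra event is created and tear down is not delayed.
type startEventFlags struct {
    podsAvailableEvent string // e.g. "bootstrap-success", created once the requested pods are running
    tearDownEvent      string // event to wait for before tearing down the bootstrap control plane
}

func (f *startEventFlags) addFlags(fs *pflag.FlagSet) {
    fs.StringVar(&f.podsAvailableEvent, "pods-available-event", "",
        "name of an event to create in kube-system once the requested pods are running (optional)")
    fs.StringVar(&f.tearDownEvent, "tear-down-event", "",
        "name of an event in kube-system to wait for before tearing down the bootstrap control plane (optional)")
}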

@sttts sttts force-pushed the sttts-tear-down-events branch 2 times, most recently from 41032bc to 8564ef9 Compare January 25, 2019 09:49
pkg/start/start.go
bootstrapPodsRunningTimeout = 20 * time.Minute

// how long we wait for the installer to send a tear down event after we had sent the bootstrap-success event
tearDownEventWaitTimeout = 30 * time.Minute
@wking (Member)

Why have these timeouts? I'd expect cluster-bootstrap to wait forever in both cases, with timeouts being set in the installer and CI tooling.

@sttts (Contributor Author)

We can also wait indefinitely; I have no strong opinion. In either case we are in a bad state.

@wking (Member)

In either case we are in a bad state.

Yup. But in the wait-indefinitely case, live debugging may be easier, because you stay in a similar bad state while you look around (vs. having cluster-bootstrap keel over on you in addition to whatever else went wrong).
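
To make the trade-off concrete, here is a rough sketch of the two options using the k8s.io/apimachinery wait helpers; the condition function and the bounded/unbounded switch are illustrative assumptions, not the code under review.

import (
    "time"

    "k8s.io/apimachinery/pkg/util/wait"
)

const tearDownEventWaitTimeout = 30 * time.Minute

// waitForTearDownEvent polls sawEvent until it reports true. With bounded=true,
// cluster-bootstrap gives up after tearDownEventWaitTimeout and exits with an
// error; with bounded=false it waits forever and leaves the deadline to the
// installer or CI tooling, which keeps the bootstrap node around for debugging.
func waitForTearDownEvent(sawEvent wait.ConditionFunc, bounded bool) error {
    if bounded {
        return wait.PollImmediate(5*time.Second, tearDownEventWaitTimeout, sawEvent)
    }
    return wait.PollImmediateInfinite(5*time.Second, sawEvent)
}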

@sttts sttts force-pushed the sttts-tear-down-events branch 2 times, most recently from 8565f72 to d04e03c Compare January 25, 2019 10:15
@wking (Member) commented Jan 25, 2019

Looks good to me :)

@wking (Member) commented Jan 25, 2019

/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sttts, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

@wking: changing LGTM is restricted to assignees, and only openshift/cluster-bootstrap repo collaborators may be assigned issues.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sttts sttts added the lgtm Indicates that a PR is ready to be merged. label Jan 25, 2019
@sttts sttts closed this Jan 25, 2019
@sttts sttts deleted the sttts-tear-down-events branch January 25, 2019 16:36
@sttts sttts restored the sttts-tear-down-events branch January 25, 2019 17:00
@sttts (Contributor Author) commented Jan 25, 2019

@mfojtik @wking fixed typo in flag definition. Please retag.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

7 similar comments

@sttts (Contributor Author) commented Jan 26, 2019

/retest

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@sttts sttts added lgtm Indicates that a PR is ready to be merged. and removed lgtm Indicates that a PR is ready to be merged. labels Jan 26, 2019
@openshift-ci-robot

New changes are detected. LGTM label has been removed.

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jan 26, 2019
@sttts (Contributor Author) commented Jan 28, 2019

/retest

@sttts sttts added the lgtm Indicates that a PR is ready to be merged. label Jan 28, 2019
@openshift-merge-robot openshift-merge-robot merged commit e0cc400 into openshift:master Jan 28, 2019
wking added a commit to wking/openshift-installer that referenced this pull request Jan 29, 2019
…own-event

In master there's a bit of a window during the bootstrap-teardown
dance:

1. cluster-bootstrap sees the requested pods.
2. cluster-bootstrap shuts itself down.
3. openshift.sh pushes the OpenShift-specific manifests.
4. report-progress.sh pushes bootstrap-complete.
5. The installer sees bootstrap-complete and removes the bootstrap
   resources, including the bootstrap load-balancer targets.
6. subsequent Kubernetes API traffic hits the production control
   plane.

That leaves a fairly large window from 3 through 5 where Kubernetes
API requests could be routed to the bootstrap machine and dropped
because it no longer has anything listening on 6443.

With this commit, we take advantage of
openshift/cluster-bootstrap@d07548e3 (Add --tear-down-event flag to
delay tear down, 2019-01-24, openshift/cluster-bootstrap#9) to drop
step 2 (waiting for an event we never send).  That leaves the
bootstrap control-plane running until we destroy that machine.  We
take advantage of openshift/cluster-bootstrap@180599bc
(pkg/start/asset: Add support for post-pod-manifests, 2019-01-29,
openshift/cluster-bootstrap#13) to replace our previous openshift.sh
(with a minor change to the manifest directory).  And we take
advantage of openshift/cluster-bootstrap@e5095848 (Create
bootstrap-success event before tear down, 2019-01-24,
openshift/cluster-bootstrap#9) to replace our previous
report-progress.sh (with a minor change to the event name).

Also set --strict, because we want to fail-fast for these resources.
The user is unlikely to scrape them out of the installer state and
push them by hand if we fail to push them from the bootstrap node.

With these changes, the new transition is:

1. cluster-bootstrap sees the requested pods.
2. cluster-bootstrap pushes the OpenShift-specific manifests.
3. cluster-bootstrap pushes bootstrap-success.
4. The installer sees bootstrap-success and removes the bootstrap
   resources, including the bootstrap load-balancer targets.
5. subsequent Kubernetes API traffic hits the production control
   plane.

There's still a small window for lost Kubernetes API traffic:

* The Terraform tear-down could remove the bootstrap machine before it
  removes the bootstrap load-balancer target, leaving the target
  pointing into empty space.
* Bootstrap teardown does not allow existing client connections to
  drain after removing the load balancer target before removing the
  bootstrap machine.

Both of these could be addressed by:

1. Remove the bootstrap load-balancer targets.
2. Wait for the 30 seconds (healthy_threshold * interval for our
   aws_lb_target_group [1]) for the load-balancer to notice the
   production control-plane targets are live.  This assumes the
   post-pod manifests are all pushed in zero seconds, so it's overly
   conservative, but waiting an extra 30 seconds isn't a large cost.
3. Remove the remaining bootstrap resources, including the bootstrap
   machine.

But even without that delay, this commit reduces the window compared
to what we have in master.  I'll land the delay in follow-up work.

[1]: https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html
@sttts sttts deleted the sttts-tear-down-events branch April 2, 2019 12:10
wking added a commit to wking/cluster-bootstrap that referenced this pull request Apr 17, 2019
And plumb through contexts from runCmdStart so we can drop the
context.TODO() calls.

bootstrapPodsRunningTimeout was added in d07548e (Add
--tear-down-event flag to delay tear down, 2019-01-24, openshift#9),
although Stefan had no strong opinion on these timeouts at the time [1].
But as it stands, a
hung pod creates loops like [2]:

  $ tar xf log-bundle.tar.gz
  $ cd bootstrap/journals
  $ grep 'Started Bootstrap\|Error: error while checking pod status' bootkube.log
  Apr 16 17:46:23 ip-10-0-4-87 systemd[1]: Started Bootstrap a Kubernetes cluster.
  Apr 16 18:12:41 ip-10-0-4-87 bootkube.sh[1510]: Error: error while checking pod status: timed out waiting for the condition
  Apr 16 18:12:41 ip-10-0-4-87 bootkube.sh[1510]: Error: error while checking pod status: timed out waiting for the condition
  Apr 16 18:12:46 ip-10-0-4-87 systemd[1]: Started Bootstrap a Kubernetes cluster.
  Apr 16 18:33:02 ip-10-0-4-87 bootkube.sh[11418]: Error: error while checking pod status: timed out waiting for the condition
  Apr 16 18:33:02 ip-10-0-4-87 bootkube.sh[11418]: Error: error while checking pod status: timed out waiting for the condition
  Apr 16 18:33:07 ip-10-0-4-87 systemd[1]: Started Bootstrap a Kubernetes cluster.

Instead of having systemd keep kicking bootkube.sh (which in turn
keeps launching cluster-bootstrap), removing this timeout will just
leave cluster-bootstrap running while folks gather logs from the
broken cluster.  And the less spurious-restart noise there is in those
logs, the easier it will be to find what actually broke.

[1]: openshift#9 (comment)
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1700504#c14
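
As a rough sketch of the plumbing described above (the function names and signal handling here are hypothetical, not the repository's actual code), deriving one cancellable context at the entry point lets the callers drop their context.TODO() calls and removes the need for a hard-coded timeout:

import (
    "context"
    "os"
    "os/signal"
    "syscall"
)

func runCmdStart() error {
    // One context for the whole run; cancelled on SIGINT/SIGTERM instead of
    // expiring on a fixed bootstrapPodsRunningTimeout.
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, syscall.SIGINT, syscall.SIGTERM)
    go func() {
        <-sigs
        cancel()
    }()

    return waitForBootstrapPods(ctx)
}

// waitForBootstrapPods blocks until the pods are ready or ctx is cancelled;
// the caller, not this function, decides when to give up.
func waitForBootstrapPods(ctx context.Context) error {
    // ... pod status checks elided in this sketch ...
    <-ctx.Done()
    return ctx.Err()
}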