emit events for each new payload #411
Conversation
This is eminently reasonable
One comment, this lgtm assuming e2e and upgrade pass and output events
/lgtm Clayton's comment is reasonable to address :-)
That's tricksy. You can't see the events until the CVO doing the upgrade (current master) emits the events.
force-pushed from 576ed71 to 7d3621f
@@ -474,6 +480,8 @@ func (w *SyncWorker) syncOnce(ctx context.Context, work *SyncWork, maxWorkers int
	validPayload := w.payload
	if validPayload == nil || !equalUpdate(configv1.Update{Image: validPayload.ReleaseImage}, configv1.Update{Image: update.Image}) {
		klog.V(4).Infof("Loading payload")
		cvoObjectRef := &corev1.ObjectReference{APIVersion: "config.openshift.io/v1", Kind: "ClusterVersion", Name: "version"}
		w.eventRecorder.Eventf(cvoObjectRef, corev1.EventTypeNormal, "RetrievePayload", "retrieving payload version=%q image=%q", update.Version, update.Image)
it would be nice to have the version and image as structured data on the event. Is that possible, or are flat strings all we have available?
To my knowledge, flat strings are what you have.
Technically we could do that as annotation, but let's not for now.
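Since the message is a flat string, any consumer that wants the version and image back out has to parse them from the message. A minimal sketch of such a consumer-side helper (hypothetical, not part of this PR), keying off the version=%q image=%q format used in the diff above:

```go
package main

import (
	"fmt"
	"regexp"
)

// payloadEventRE matches the key=value pairs the CVO puts in its
// payload-event messages, e.g.
//   retrieving payload version="4.5.2" image="quay.io/...".
var payloadEventRE = regexp.MustCompile(`version="([^"]*)" image="([^"]*)"`)

// parsePayloadEvent extracts the version and image from an event
// message, returning ok=false when the message does not match.
func parsePayloadEvent(message string) (version, image string, ok bool) {
	m := payloadEventRE.FindStringSubmatch(message)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	v, img, ok := parsePayloadEvent(`retrieving payload version="4.5.2" image="quay.io/example/release@sha256:abc"`)
	fmt.Println(ok, v, img) // true 4.5.2 quay.io/example/release@sha256:abc
}
```

This is the usual downside of flat strings versus structured fields: the format string becomes a de facto contract for anyone scraping events.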
@@ -483,6 +491,7 @@ func (w *SyncWorker) syncOnce(ctx context.Context, work *SyncWork, maxWorkers int
	})
	info, err := w.retriever.RetrievePayload(ctx, update)
	if err != nil {
		w.eventRecorder.Eventf(cvoObjectRef, corev1.EventTypeWarning, "RetrievePayloadFailed", "retrieving payload failed version=%q image=%q failure=%v", update.Version, update.Image, err)
		reporter.Report(SyncWorkerStatus{
Seems like we could move the event recorder under reporter.Report?
that would be a significant change. It's used for all error calls and I'm not certain of the fanout that would cause. I'd like to solve the immediate "special" events and allow someone closer to the operator decide if and how to emit an event for each report.
Yeah, let's keep reporter separate for now.
/retest
Weird, none of these events are showing up in the monitor. Are events getting sent?
no, because the version of the CVO that would produce these is master:HEAD, since that's the version looking up the new payload. We won't see these until after this is merged. Same problem you had.
I launched a cluster-bot test with:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1286037162390720512/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .firstTimestamp + " " + .type + " " + .reason + ": " + .message' | grep 'Payload\|Preconditions\|version='

...no hits... Checking the source release, the build log has:
so not clear to me why I'm not seeing these events. Do we actually set a real
force-pushed from f8c7a6c to 42d74a0
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/411/pull-ci-openshift-cluster-version-operator-master-e2e/1286074666141618176/artifacts/e2e/gather-extra/pods/openshift-cluster-version_cluster-version-operator-b5c76fb4b-zrf76_cluster-version-operator.log | grep -A3 DEADS
#### DEADS patch "openshift-cluster-version"!!!
I0722 23:19:59.995655       1 payload.go:230] Loading updatepayload from "/"
E0722 23:19:59.998582       1 event.go:272] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
I0722 23:19:59.998691       1 event.go:281] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'Something' someevent

So some sort of hiccup around which namespace this is living in. I'm not familiar enough with Event publishing to know exactly what we're missing...
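The namespace error above is consistent with the legacy events API's placement rule as I understand it: an Event must live in the involved object's namespace, and a cluster-scoped object like ClusterVersion (empty Namespace in the ObjectReference) gets its events in "default". A rough illustrative sketch of that rule (my reading of the apiserver behavior, not client-go code):

```go
package main

import "fmt"

// eventNamespace approximates how the legacy events API decides
// where an Event for a given involved object must live: namespaced
// objects get events in their own namespace, while cluster-scoped
// objects (empty namespace) get events in "default". A recorder
// whose events client is scoped to any other namespace will fail
// with "can't create an event with namespace ...".
func eventNamespace(involvedObjectNamespace string) string {
	if involvedObjectNamespace == "" {
		return "default" // metav1.NamespaceDefault
	}
	return involvedObjectNamespace
}

func main() {
	// ClusterVersion is cluster-scoped, so its ObjectReference has
	// Namespace:"", and the event wants to land in "default" -- which
	// clashes with a recorder scoped to "openshift-cluster-version".
	fmt.Println(eventNamespace(""))                          // default
	fmt.Println(eventNamespace("openshift-cluster-version")) // openshift-cluster-version
}
```

So either the recorder's events client needs to be namespace-agnostic, or the object reference needs a namespace matching the recorder's scope.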
force-pushed from 42d74a0 to 041e96e
New:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1286331887161184256/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .firstTimestamp + " " + .type + " " + .reason + ": " + .message' | grep 'Payload\|Preconditions\|version='
2020-07-23T16:25:39Z Normal RetrievePayload: retrieving payload version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:25:39Z Normal VerifyPayload: verifying payload version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:25:40Z Normal PayloadLoaded: payload loaded version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:54:01Z Normal RetrievePayload: retrieving payload version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:54:01Z Normal VerifyPayload: verifying payload version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:54:01Z Normal PayloadLoaded: payload loaded version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:57:24Z Normal RetrievePayload: retrieving payload version="" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:bcdcc30fef79ad503c4931800554f1ab258c36f25ff646670e673ead68456db6"
2020-07-23T16:57:33Z Normal VerifyPayload: verifying payload version="" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:bcdcc30fef79ad503c4931800554f1ab258c36f25ff646670e673ead68456db6"
2020-07-23T16:57:33Z Normal PreconditionsPassed: preconditions passed for payload loaded version="" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:bcdcc30fef79ad503c4931800554f1ab258c36f25ff646670e673ead68456db6"
2020-07-23T16:57:33Z Normal PayloadLoaded: payload loaded version="" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:bcdcc30fef79ad503c4931800554f1ab258c36f25ff646670e673ead68456db6"

Do we want to include the current reconciliation mode in the events, so folks understand why we aren't running preconditions as the install-time release's CVO hops from node to node? And do we want to do anything about
Also in this space, we should teach the CVO that it does not need to download and fetch a release image pullspec that matches its current pod pullspec. That would cover the node-hopping angle.
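On the "CVO doesn't need to re-fetch its own image" idea, a minimal sketch of what such a short-circuit might look like (hypothetical helper, not part of this PR; it only handles by-digest pullspecs, and a real implementation would live in the sync worker and deal with tag references too):

```go
package main

import (
	"fmt"
	"strings"
)

// digestOf returns the digest of a by-digest pullspec
// ("repo@sha256:..."), or "" when the pullspec is not pinned.
func digestOf(pullspec string) string {
	if _, digest, ok := strings.Cut(pullspec, "@"); ok {
		return digest
	}
	return ""
}

// sameRelease sketches the suggested short-circuit: when both the
// target release image and the running CVO pod image are pinned by
// digest and the digests match, there is nothing new to retrieve.
func sameRelease(target, current string) bool {
	d := digestOf(target)
	return d != "" && d == digestOf(current)
}

func main() {
	cur := "registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe"
	fmt.Println(sameRelease("example.com/mirror/release@sha256:7e61fe", cur)) // true: same digest, different mirror
	fmt.Println(sameRelease("registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:bcdcc3", cur)) // false
}
```

Comparing by digest rather than by full pullspec also makes the check robust to mirrored registries.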
For my purposes, this current behavior works fine.
/retest
Amusingly, I need this in master to figure out why the latest upgrade for it failed.
force-pushed from 041e96e to e1de651
@deads2k assures me that there should be no downstream consumers that expect specific Event patterns as an API. I've filed kubernetes/kubernetes#93396 with an attempt to formalize that. With that understanding, I am fine landing this PR as it stands, knowing we are free to reroll any and all of it later without worrying about breaking compat.
/lgtm
force-pushed from e1de651 to 475e71f
New changes are detected. LGTM label has been removed.
simple rebase, retagged
/retest Please review the full test history for this PR and help us cut down flakes. |
4 similar comments
Only during updates, because:

* Install-time is a free-for-all, where the CVO doesn't block on anything. This would be a lot of "node complete" noise about nodes where we had only attempted to push manifests, and that's unlikely to be what event-readers expect TaskNodeComplete to imply.
* Reconcile-time hopefully has very few instances where the CVO needs to stomp on changes, block on a recently Available=False operator, etc. Eventing on each completed TaskNode would be lots of noise without much interesting signal.

During updates, we have the structured graph and blocking TaskNodes described in docs/user/reconciliation.md, and the flow through that graph is what the events from this commit will help shed light on. You could also achieve this by preserving logs from the CVO pods as they are repositioned throughout an update, but we don't have tooling in CI to do that conveniently today.

The hardcoded name and namespace for cvoObjectRef isn't great (for example, it won't work in pkg/start/start_integration_test.go, where the ClusterVersion's name and namespace are random). But it's the pattern we've used since we started eventing in 475e71f (emit events for each new payload, 2020-07-21, openshift#411), so I'm recycling it for now.

Also log ApplyFailed events when we fail in apply, to remove some of the guesswork in determining what manifest(s) had trouble.
…failure

For unforced updates, when signature, etc. verification fails, RetrievePayload returns an error, and we have emitted events for RetrievePayload errors since 475e71f (emit events for each new payload, 2020-07-21, openshift#411). However, when the user forces the update, we log but do not error out on verification failures. With this commit, we will also emit a warning event with an error message, which will make it easier to understand how signature verification failed, even when we don't have access to the logs of the outgoing cluster-version operator [1].

We should transition away from the maintenance-mode github.com/pkg/errors [2], but I'm deferring that to follow-up work.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2071998#c2
[2]: https://pkg.go.dev/github.com/pkg/errors#readme-roadmap
Since the CVO updates itself every time it updates, we always lose our logs for what happens when the update starts. Updates are infrequent enough that we can emit events for them without loading the kube-apiserver.