emit events for each new payload #411
Conversation
This is eminently reasonable
One comment, this lgtm assuming e2e and upgrade pass and output events
/lgtm Clayton's comment is reasonable to address :-)
That's tricksy. You can't see the events until the CVO doing the upgrade (current master) emits the events.
force-pushed from 576ed71 to 7d3621f
@@ -474,6 +480,8 @@ func (w *SyncWorker) syncOnce(ctx context.Context, work *SyncWork, maxWorkers int
	validPayload := w.payload
	if validPayload == nil || !equalUpdate(configv1.Update{Image: validPayload.ReleaseImage}, configv1.Update{Image: update.Image}) {
		klog.V(4).Infof("Loading payload")
		cvoObjectRef := &corev1.ObjectReference{APIVersion: "config.openshift.io/v1", Kind: "ClusterVersion", Name: "version"}
		w.eventRecorder.Eventf(cvoObjectRef, corev1.EventTypeNormal, "RetrievePayload", "retrieving payload version=%q image=%q", update.Version, update.Image)
it would be nice to have the version and image as structured data on the event. Is that possible, or are flat strings all we have available?
To my knowledge, flat strings are what you have.
Technically we could do that as annotation, but let's not for now.
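Since the message is a flat string, any consumer that wants the version and image back out has to parse them from the message. A minimal sketch of such a consumer-side helper (hypothetical, not part of this PR), keying off the version=%q image=%q format used in the diff above:

```go
package main

import (
	"fmt"
	"regexp"
)

// payloadEventRE matches the key=value pairs the CVO puts in its
// payload-event messages, e.g.
//   retrieving payload version="4.5.2" image="quay.io/...".
var payloadEventRE = regexp.MustCompile(`version="([^"]*)" image="([^"]*)"`)

// parsePayloadEvent extracts the version and image from an event
// message, returning ok=false when the message does not match.
func parsePayloadEvent(message string) (version, image string, ok bool) {
	m := payloadEventRE.FindStringSubmatch(message)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	v, img, ok := parsePayloadEvent(`retrieving payload version="4.5.2" image="quay.io/example/release@sha256:abc"`)
	fmt.Println(ok, v, img) // true 4.5.2 quay.io/example/release@sha256:abc
}
```

This is the usual downside of flat strings versus structured fields: the format string becomes a de facto contract for anyone scraping events.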
@@ -483,6 +491,7 @@ func (w *SyncWorker) syncOnce(ctx context.Context, work *SyncWork, maxWorkers int
	})
	info, err := w.retriever.RetrievePayload(ctx, update)
	if err != nil {
		w.eventRecorder.Eventf(cvoObjectRef, corev1.EventTypeWarning, "RetrievePayloadFailed", "retrieving payload failed version=%q image=%q failure=%v", update.Version, update.Image, err)
		reporter.Report(SyncWorkerStatus{
Seems like we could move the event recorder under reporter.Report?
that would be a significant change. It's used for all error calls and I'm not certain of the fanout that would cause. I'd like to solve the immediate "special" events and allow someone closer to the operator decide if and how to emit an event for each report.
Yeah, let's keep reporter separate for now.
/retest
Weird, none of these events are showing up in the monitor. Are events getting sent?
no, because the version of the CVO that would produce these is master:HEAD, since that's the version looking up the new payload. We won't see these until after this is merged. Same problem you had.
I launched a cluster-bot test with:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1286037162390720512/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .firstTimestamp + " " + .type + " " + .reason + ": " + .message' | grep 'Payload\|Preconditions\|version='

...no hits... Checking the source release, the build log has:
so not clear to me why I'm not seeing these events. Do we actually set a real
force-pushed from f8c7a6c to 42d74a0
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/411/pull-ci-openshift-cluster-version-operator-master-e2e/1286074666141618176/artifacts/e2e/gather-extra/pods/openshift-cluster-version_cluster-version-operator-b5c76fb4b-zrf76_cluster-version-operator.log | grep -A3 DEADS
#### DEADS patch "openshift-cluster-version"!!!
I0722 23:19:59.995655       1 payload.go:230] Loading updatepayload from "/"
E0722 23:19:59.998582       1 event.go:272] Unable to write event: 'can't create an event with namespace 'default' in namespace 'openshift-cluster-version'' (may retry after sleeping)
I0722 23:19:59.998691       1 event.go:281] Event(v1.ObjectReference{Kind:"ClusterVersion", Namespace:"", Name:"version", UID:"", APIVersion:"config.openshift.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'Something' someevent

So some sort of hiccup around which namespace this is living in. I'm not familiar enough with Event publishing to know exactly what we're missing...
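The namespace error above is consistent with the legacy events API's placement rule as I understand it: an Event must live in the involved object's namespace, and a cluster-scoped object like ClusterVersion (empty Namespace in the ObjectReference) gets its events in "default". A rough illustrative sketch of that rule (my reading of the apiserver behavior, not client-go code):

```go
package main

import "fmt"

// eventNamespace approximates how the legacy events API decides
// where an Event for a given involved object must live: namespaced
// objects get events in their own namespace, while cluster-scoped
// objects (empty namespace) get events in "default". A recorder
// whose events client is scoped to any other namespace will fail
// with "can't create an event with namespace ...".
func eventNamespace(involvedObjectNamespace string) string {
	if involvedObjectNamespace == "" {
		return "default" // metav1.NamespaceDefault
	}
	return involvedObjectNamespace
}

func main() {
	// ClusterVersion is cluster-scoped, so its ObjectReference has
	// Namespace:"", and the event wants to land in "default" -- which
	// clashes with a recorder scoped to "openshift-cluster-version".
	fmt.Println(eventNamespace(""))                          // default
	fmt.Println(eventNamespace("openshift-cluster-version")) // openshift-cluster-version
}
```

So either the recorder's events client needs to be namespace-agnostic, or the object reference needs a namespace matching the recorder's scope.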
force-pushed from 42d74a0 to 041e96e
New:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1286331887161184256/artifacts/launch/events.json | jq -r '.items[] | select(.metadata.namespace == "openshift-cluster-version") | .firstTimestamp + " " + .type + " " + .reason + ": " + .message' | grep 'Payload\|Preconditions\|version='
2020-07-23T16:25:39Z Normal RetrievePayload: retrieving payload version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:25:39Z Normal VerifyPayload: verifying payload version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:25:40Z Normal PayloadLoaded: payload loaded version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:54:01Z Normal RetrievePayload: retrieving payload version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:54:01Z Normal VerifyPayload: verifying payload version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:54:01Z Normal PayloadLoaded: payload loaded version="0.0.1-2020-07-23-161007" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe41593f01779cbb6e1960201a13ae08e254e3c86215ad9916e4cff32b51"
2020-07-23T16:57:24Z Normal RetrievePayload: retrieving payload version="" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:bcdcc30fef79ad503c4931800554f1ab258c36f25ff646670e673ead68456db6"
2020-07-23T16:57:33Z Normal VerifyPayload: verifying payload version="" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:bcdcc30fef79ad503c4931800554f1ab258c36f25ff646670e673ead68456db6"
2020-07-23T16:57:33Z Normal PreconditionsPassed: preconditions passed for payload loaded version="" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:bcdcc30fef79ad503c4931800554f1ab258c36f25ff646670e673ead68456db6"
2020-07-23T16:57:33Z Normal PayloadLoaded: payload loaded version="" image="registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:bcdcc30fef79ad503c4931800554f1ab258c36f25ff646670e673ead68456db6"

Do we want to include the current reconciliation mode in the events, so folks understand why we aren't running preconditions as the install-time release's CVO hops from node to node? And do we want to do anything about
Also in this space, we should teach the CVO that it does not need to download and fetch a release image pullspec that matches its current pod pullspec. That would cover the node-hopping angle.
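On the "CVO doesn't need to re-fetch its own image" idea, a minimal sketch of what such a short-circuit might look like (hypothetical helper, not part of this PR; it only handles by-digest pullspecs, and a real implementation would live in the sync worker and deal with tag references too):

```go
package main

import (
	"fmt"
	"strings"
)

// digestOf returns the digest of a by-digest pullspec
// ("repo@sha256:..."), or "" when the pullspec is not pinned.
func digestOf(pullspec string) string {
	if _, digest, ok := strings.Cut(pullspec, "@"); ok {
		return digest
	}
	return ""
}

// sameRelease sketches the suggested short-circuit: when both the
// target release image and the running CVO pod image are pinned by
// digest and the digests match, there is nothing new to retrieve.
func sameRelease(target, current string) bool {
	d := digestOf(target)
	return d != "" && d == digestOf(current)
}

func main() {
	cur := "registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:7e61fe"
	fmt.Println(sameRelease("example.com/mirror/release@sha256:7e61fe", cur)) // true: same digest, different mirror
	fmt.Println(sameRelease("registry.svc.ci.openshift.org/ci-ln-2iqzykk/release@sha256:bcdcc3", cur)) // false
}
```

Comparing by digest rather than by full pullspec also makes the check robust to mirrored registries.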
For my purposes, this current behavior works fine.
/retest
Amusingly, I need this in master to figure out why the latest upgrade for it failed.
force-pushed from 041e96e to e1de651
@deads2k assures me that there should be no downstream consumers that expect specific Event patterns as an API. I've filed kubernetes/kubernetes#93396 with an attempt to formalize that. With that understanding, I am fine landing this PR as it stands, knowing we are free to reroll any and all of it later without worrying about breaking compat.
/lgtm
force-pushed from e1de651 to 475e71f
New changes are detected. LGTM label has been removed.
simple rebase, retagged
/retest Please review the full test history for this PR and help us cut down flakes. |
4 similar comments
Only during updates, because:

* Install-time is a free-for-all, where the CVO doesn't block on anything. This would be a lot of "node complete" noise about nodes where we had only attempted to push manifests, and that's unlikely to be what event-readers expect TaskNodeComplete to imply.
* Reconcile-time hopefully has very few instances where the CVO needs to stomp on changes, block on a recently Available=False operator, etc. Eventing on each completed TaskNode would be lots of noise without much interesting signal.

During updates, we have the structured graph and blocking TaskNodes described in docs/user/reconciliation.md, and the flow through that graph is what the events from this commit will help shed light on. You could also achieve this by preserving logs from the CVO pods as they are repositioned throughout an update, but we don't have tooling in CI to do that conveniently today.

The hardcoded name and namespace for cvoObjectRef isn't great (for example, it won't work in pkg/start/start_integration_test.go, where the ClusterVersion's name and namespace are random). But it's the pattern we've used since we started eventing in 475e71f (emit events for each new payload, 2020-07-21, openshift#411), so I'm recycling it for now.

Also log ApplyFailed events when we fail in apply, to remove some of the guesswork in determining what manifest(s) had trouble.
…failure

For unforced updates, when signature, etc. verification fails, RetrievePayload returns an error, and we have emitted events for RetrievePayload errors since 475e71f (emit events for each new payload, 2020-07-21, openshift#411). However, when the user forces the update, we log but do not error out on verification failures. With this commit, we will also emit a warning event with an error message, which will make it easier to understand how signature verification failed, even when we don't have access to the logs of the outgoing cluster-version operator [1].

We should transition away from the maintenance-mode github.com/pkg/errors [2], but I'm deferring that to follow-up work.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2071998#c2
[2]: https://pkg.go.dev/github.com/pkg/errors#readme-roadmap
Since the CVO updates itself every time it updates, we always lose our logs for what happens when the update starts. Updates are infrequent enough that we can emit events for them without loading the kube-apiserver.