OCPBUGS-23377: to reduce log noise, only log sync_worker messages on retry or payload change #997
Conversation
pkg/cvo/sync_worker.go
Outdated
  continue
  }
  if err := w.builder.Apply(ctx, task.Manifest, payload.PrecreatingPayload); err != nil {
-	klog.V(2).Infof("Unable to precreate resource %s: %v", task, err)
+	klog.V(manifestVerbosity).Infof("Unable to precreate resource %s: %v", task, err)
I'd rather keep this at two, so we log the first error. As you have it, we'll presumably soon be in a new sync round with Attempt > 0, and we'll log any second precreation error then. But it still seems like it might be useful for incident timelining to have a timestamp logged for the first round's error.
I agree we should not change the logging level for the precreate resource error.
  if err := task.Manifest.Include(nil, nil, nil, &capabilities, work.Overrides); err != nil {
-	klog.V(2).Infof("Skipping %s: %s", task, err)
+	klog.V(manifestVerbosity).Infof("Skipping %s: %s", task, err)
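For context on why this diff quiets the line: `klog.V(n).Infof` only emits when the process runs with verbosity `n` or higher, so raising a message's level above the default threshold suppresses it in ordinary runs. A stdlib-only sketch of that gating (klog itself lives in `k8s.io/klog/v2`; `verbosity` and `v` here are stand-ins, not klog's real internals):

```go
package main

import "fmt"

// verbosity stands in for klog's --v flag; klog keeps this as
// process-wide state configured at startup.
var verbosity = 2

// v mirrors the klog.V(lvl) check at a very small scale: a message at
// level lvl is emitted only when the configured verbosity is at least lvl.
func v(lvl int) bool { return lvl <= verbosity }

func main() {
	for _, lvl := range []int{2, 4} {
		if v(lvl) {
			fmt.Printf("V(%d): Skipping manifest: disabled capabilities\n", lvl)
		} else {
			fmt.Printf("V(%d): suppressed at --v=%d\n", lvl, verbosity)
		}
	}
}
```

So switching the call from `V(2)` to `V(manifestVerbosity)` lets the worker pick a noisier or quieter level per sync round without touching the global flag.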
I have no problem being noisy when `overrides` is set. I don't think we run any `overrides`-setting CI, except during installs, where the installer fiddles with `overrides` to keep us from outracing cluster-bootstrap for resources the installer cares about.

Checking the no-capabilities job (TestGrid -> run) to see how noisy we are for capabilities filtering, it's not crazy-noisy:
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-no-capabilities/1724760694437449728/artifacts/e2e-aws-ovn-no-capabilities/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78698c8956-4wjjc_cluster-version-operator.log | grep -o '[.]go.*Skipping .*' | sort | uniq -c | sort -n | tail
48 .go:978] Skipping service "openshift-image-registry/image-registry-operator" (347 of 865): disabled capabilities: ImageRegistry
48 .go:978] Skipping service "openshift-insights/metrics" (629 of 865): disabled capabilities: Insights
48 .go:978] Skipping service "openshift-machine-api/cluster-baremetal-operator-service" (237 of 865): disabled capabilities: baremetal
48 .go:978] Skipping service "openshift-machine-api/cluster-baremetal-webhook-service" (238 of 865): disabled capabilities: baremetal
48 .go:978] Skipping service "openshift-marketplace/marketplace-operator-metrics" (682 of 865): disabled capabilities: marketplace
48 .go:978] Skipping service "openshift-operator-lifecycle-manager/catalog-operator-metrics" (655 of 865): disabled capabilities: OperatorLifecycleManager
48 .go:978] Skipping service "openshift-operator-lifecycle-manager/olm-operator-metrics" (654 of 865): disabled capabilities: OperatorLifecycleManager
48 .go:978] Skipping service "openshift-operator-lifecycle-manager/package-server-manager-metrics" (657 of 865): disabled capabilities: OperatorLifecycleManager
48 .go:978] Skipping subscription "openshift-operator-lifecycle-manager/packageserver" (668 of 865): disabled capabilities: OperatorLifecycleManager
48 .go:978] Skipping validatingwebhookconfiguration "performance-addon-operator" (442 of 865): disabled capabilities: NodeTuning
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-no-capabilities/1724760694437449728/artifacts/e2e-aws-ovn-no-capabilities/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78698c8956-4wjjc_cluster-version-operator.log | sed -n 's/^.* \([^ ]*[.]go:[0-9]*\).*/\1/p' | sort | uniq -c | sort -n | tail
389 cvo.go:634
424 payload.go:210
576 sync_worker.go:959
1007 sync_worker.go:966
2046 task_graph.go:478
3696 task_graph.go:481
14602 sync_worker.go:978
20305 sync_worker.go:701
26118 sync_worker.go:990
40833 sync_worker.go:975
But it is still the 4th most commonly hit log line. Maybe we can build something that throttles here, logging once an hour per manifest or so, or whenever the `err` string changes (e.g. because some capability was enabled, but not all the caps the resource needs).
I am fine with not printing this, as we do not recommend that customers use overrides. This is not going to impact our customers in a negative way. If we find something that convinces us otherwise, we can come back and fix it.
It's not just overrides; it's also excluding capabilities (the no-capabilities job I poked at above does not set `overrides`, does exclude capabilities, and sees a fair amount of volume from this line). We do expect some customers to exclude capabilities (that's why we built that feature). The resulting log volume may or may not be something those customers care about. The presence of the lines every few minutes per manifest is more than anyone needs. The presence of the lines at all is occasionally useful, e.g. I used it while internally triaging OCPBUGS-22266. But I haven't needed the line while triaging issues with very old releases, so if folks want it off by default most of the time, I don't expect that to cause me problems. And hopefully soon we'll have a tunable knob to bring it back when folks want it to help debugging.
I would suggest we go with the current PR, and if during 4.15 testing we feel we need the log line back, we can create a bug and add it back.
pkg/cvo/sync_worker.go
Outdated
  	manifestVerbosity = 2
  case work.State == payload.UpdatingPayload ||
  	work.State == payload.InitializingPayload ||
  	work.State == payload.PrecreatingPayload:
Our typing does not clarify this, but `PrecreatingPayload` is only used at a manifest level, and not on `SyncWorker` state as a whole:
$ git --no-pager grep PrecreatingPayload
pkg/cvo/cvo.go: case payload.PrecreatingPayload:
pkg/cvo/sync_worker.go: if err := w.builder.Apply(ctx, task.Manifest, payload.PrecreatingPayload); err != nil {
pkg/payload/payload.go: // PrecreatingPayload indicates we are selectively creating
pkg/payload/payload.go: PrecreatingPayload
It doesn't hurt to have this entry here, outside of a chance of confusing future devs about the possible worker-level states. I'm agnostic about whether we weight that possibility as riskier or not vs. the possibility that future code changes make `PrecreatingPayload` something that could apply to worker-level state.
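One way to make the typing carry that distinction would be to give manifest-level apply modes and worker-level states separate Go types, so the compiler rejects mixing them. A minimal sketch of the idea; `WorkerState`, `ApplyMode`, and the constant names are hypothetical, not the CVO's actual types, and the exact verbosity values are assumptions:

```go
package main

import "fmt"

// WorkerState enumerates states the sync worker as a whole can be in.
type WorkerState int

const (
	ReconcilingState WorkerState = iota
	UpdatingState
	InitializingState
)

// ApplyMode enumerates per-manifest apply modes; precreation lives
// here. Because it is a distinct type, the compiler rejects passing an
// ApplyMode where a WorkerState is expected, and vice versa.
type ApplyMode int

const (
	NormalApply ApplyMode = iota
	PrecreateApply
)

// verbosityFor picks a manifest log level from the worker-level state:
// chatty (V(2)) while installing or updating, quieter while
// reconciling. The exact quiet level is an assumption, not from the PR.
func verbosityFor(state WorkerState) int {
	if state == ReconcilingState {
		return 4
	}
	return 2
}

func main() {
	fmt.Println("reconciling verbosity:", verbosityFor(ReconcilingState))
	fmt.Println("updating verbosity:", verbosityFor(UpdatingState))
}
```

With a split like this, a `case`, such as the one in the diff above, could not list a manifest-only mode among worker states without a compile error.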
@deads2k: This pull request references Jira Issue OCPBUGS-23377, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/label qe-approved
@deads2k: This pull request references Jira Issue OCPBUGS-23377, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (jianl@redhat.com), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from 7d02961 to 86b1233 (Compare)
tweaked per comments.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, LalatenduMohanty The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Pathological
/override ci/prow/e2e-agnostic-ovn-upgrade-out-of-change Otherwise, we just need to wait for the
@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn-upgrade-out-of-change In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@deads2k: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/refresh |
This is stuck on a buggy Tide feature, pls keep this in this state, I'll investigate and either fix or revert the offending Tide config change. |
This should fix Tide: kubernetes/test-infra#31323 |
@deads2k: Jira Issue OCPBUGS-23377: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-23377 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
We got the fix in openshift/release#46195 and Tide merged this, looks good 🤞
[ART PR BUILD NOTIFIER] This PR has been included in build cluster-version-operator-container-v4.15.0-202311291533.p0.gd32dad9.assembly.stream for distgit cluster-version-operator.
Fix included in accepted release 4.15.0-0.nightly-2023-12-02-123536
initial reduction for https://issues.redhat.com/browse/OTA-923