Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-23377: to reduce log noise, only log sync_worker messages on retry or payload change #997

Merged
merged 1 commit into from Nov 29, 2023

Conversation

deads2k
Copy link
Contributor

@deads2k deads2k commented Nov 16, 2023

continue
}
if err := w.builder.Apply(ctx, task.Manifest, payload.PrecreatingPayload); err != nil {
klog.V(2).Infof("Unable to precreate resource %s: %v", task, err)
klog.V(manifestVerbosity).Infof("Unable to precreate resource %s: %v", task, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep this at two, so we log the first error. As you have it, we'll presumably soon be in a new sync round with Attempt > 0, and we'll log any second precreation error then. But it still seems like it might be useful for incident timinling to have a timestamp logged for the first round's error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should not change the logging level for the precreate resource error.


if err := task.Manifest.Include(nil, nil, nil, &capabilities, work.Overrides); err != nil {
klog.V(2).Infof("Skipping %s: %s", task, err)
klog.V(manifestVerbosity).Infof("Skipping %s: %s", task, err)
Copy link
Member

@wking wking Nov 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problems being noisy when overrides is set. I don't think we run any overrides-setting CI, except during installs where the installer fiddles with overrides to keep us from outracing cluster-bootstrap for resources the installer cares about.

Checking the no-capabilities job to see how noisy we are for capabilities filtering, TestGrid -> run, where it's not crazy-noisy:

$ $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-no-capabilities/1724760694437449728/artifacts/e2e-aws-ovn-no-capabilities/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78698c8956-4wjjc_cluster-version-operator.log | grep -o '[.]go.*Skipping .*' | sort | uniq -c | sort -n | tail
     48 .go:978] Skipping service "openshift-image-registry/image-registry-operator" (347 of 865): disabled capabilities: ImageRegistry
     48 .go:978] Skipping service "openshift-insights/metrics" (629 of 865): disabled capabilities: Insights
     48 .go:978] Skipping service "openshift-machine-api/cluster-baremetal-operator-service" (237 of 865): disabled capabilities: baremetal
     48 .go:978] Skipping service "openshift-machine-api/cluster-baremetal-webhook-service" (238 of 865): disabled capabilities: baremetal
     48 .go:978] Skipping service "openshift-marketplace/marketplace-operator-metrics" (682 of 865): disabled capabilities: marketplace
     48 .go:978] Skipping service "openshift-operator-lifecycle-manager/catalog-operator-metrics" (655 of 865): disabled capabilities: OperatorLifecycleManager
     48 .go:978] Skipping service "openshift-operator-lifecycle-manager/olm-operator-metrics" (654 of 865): disabled capabilities: OperatorLifecycleManager
     48 .go:978] Skipping service "openshift-operator-lifecycle-manager/package-server-manager-metrics" (657 of 865): disabled capabilities: OperatorLifecycleManager
     48 .go:978] Skipping subscription "openshift-operator-lifecycle-manager/packageserver" (668 of 865): disabled capabilities: OperatorLifecycleManager
     48 .go:978] Skipping validatingwebhookconfiguration "performance-addon-operator" (442 of 865): disabled capabilities: NodeTuning
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.15-e2e-aws-ovn-no-capabilities/1724760694437449728/artifacts/e2e-aws-ovn-no-capabilities/gather-extra/artifacts/pods/openshift-cluster-version_cluster-version-operator-78698c8956-4wjjc_cluster-version-operator.log | sed -n 's/^.* \([^ ]*[.]go:[0-9]*\).*/\1/p' | sort | uniq -c | sort -n | tail
    389 cvo.go:634
    424 payload.go:210
    576 sync_worker.go:959
   1007 sync_worker.go:966
   2046 task_graph.go:478
   3696 task_graph.go:481
  14602 sync_worker.go:978
  20305 sync_worker.go:701
  26118 sync_worker.go:990
  40833 sync_worker.go:975

But it is still the 4th most commonly hit log. Maybe we can build something that throttles here to log once an hour per manifest or something or when the err string changes (e.g. because some capability was enabled, but not all the caps the resource needs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with not printing this as we do not recommend customers using override. This is not something going to be impact our customers in a negative way. If we find something which convince us otherwise we can comeback and fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's not just override, it's also excluding capabilities (the no-capabilities jobs I poked at above does not set overrides, does exclude capabilities, and sees a fair amount of volume from this line). We do expect some customers to exclude capabilities (that's why we built that feature). The resulting log volume may or may not be something those customers care about. The presence of the lines every few minutes per manifest is more than anyone needs. The presence of the lines at all is occasionally useful, e.g. I used it while internally triaging OCPBUGS-22266. But I haven't needed the line while triaging issues with very old releases, so if folks want it off most of the time by default, I don't expect that to cause me problems. And hopefully soon we'll have a tunable knob to bring it back in when folks want it to help debugging.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest we go with current PR and if during the testing of 4.15 we feel that we need the log line back then we can create a bug and add it back.

continue
}
if err := w.builder.Apply(ctx, task.Manifest, payload.PrecreatingPayload); err != nil {
klog.V(2).Infof("Unable to precreate resource %s: %v", task, err)
klog.V(manifestVerbosity).Infof("Unable to precreate resource %s: %v", task, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree we should not change the logging level for the precreate resource error.


if err := task.Manifest.Include(nil, nil, nil, &capabilities, work.Overrides); err != nil {
klog.V(2).Infof("Skipping %s: %s", task, err)
klog.V(manifestVerbosity).Infof("Skipping %s: %s", task, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am fine with not printing this as we do not recommend customers using override. This is not something going to be impact our customers in a negative way. If we find something which convince us otherwise we can comeback and fix it.

manifestVerbosity = 2
case work.State == payload.UpdatingPayload ||
work.State == payload.InitializingPayload ||
work.State == payload.PrecreatingPayload:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our typing does not clarify this, but PrecreatingPayload is only used at a manifest level, and not on SyncWorker state as a whole:

$ git --no-pager grep PrecreatingPayload
pkg/cvo/cvo.go: case payload.PrecreatingPayload:
pkg/cvo/sync_worker.go:                 if err := w.builder.Apply(ctx, task.Manifest, payload.PrecreatingPayload); err != nil {
pkg/payload/payload.go: // PrecreatingPayload indicates we are selectively creating
pkg/payload/payload.go: PrecreatingPayload

It doesn't hurt to have this entry here, outside of a chance of confusing future devs about the possible worker-level states. I'm agnostic about whether we weight that possibility as riskier or not vs. the possibility that future code changes make PrecreatingPayload something that could apply to worker-level state.

@wking wking changed the title to reduce log noise, only log sync_worker messages on retry or payload change OCPBUGS-23377: to reduce log noise, only log sync_worker messages on retry or payload change Nov 16, 2023
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Nov 16, 2023
@openshift-ci-robot
Copy link
Contributor

@deads2k: This pull request references Jira Issue OCPBUGS-23377, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @jiajliu

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

initial reduction for https://issues.redhat.com/browse/OTA-923

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from jiajliu November 16, 2023 22:23
@JianLi-RH
Copy link

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Nov 17, 2023
@openshift-ci-robot
Copy link
Contributor

@deads2k: This pull request references Jira Issue OCPBUGS-23377, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.15.0) matches configured target version for branch (4.15.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (jianl@redhat.com), skipping review request.

In response to this:

initial reduction for https://issues.redhat.com/browse/OTA-923

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@deads2k
Copy link
Contributor Author

deads2k commented Nov 17, 2023

tweaked per comments.

Copy link
Member

@LalatenduMohanty LalatenduMohanty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Nov 17, 2023
Copy link
Contributor

openshift-ci bot commented Nov 17, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, LalatenduMohanty

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 17, 2023
@wking
Copy link
Member

wking commented Nov 17, 2023

Pathological cluster-baremetal-operator container restarts are unrelated to this change.

@petr-muller
Copy link
Member

Pathological cluster-baremetal-operator container restarts are unrelated to this change.

/override ci/prow/e2e-agnostic-ovn-upgrade-out-of-change

Otherwise, we just need to wait for the acknowledge... gate to be lifted.

Copy link
Contributor

openshift-ci bot commented Nov 24, 2023

@petr-muller: Overrode contexts on behalf of petr-muller: ci/prow/e2e-agnostic-ovn-upgrade-out-of-change

In response to this:

Pathological cluster-baremetal-operator container restarts are unrelated to this change.

/override ci/prow/e2e-agnostic-ovn-upgrade-out-of-change

Otherwise, we just need to wait for the acknowledge... gate to be lifted.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

openshift-ci bot commented Nov 24, 2023

@deads2k: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@deads2k
Copy link
Contributor Author

deads2k commented Nov 27, 2023

/refresh

@petr-muller
Copy link
Member

This is stuck on a buggy Tide feature, pls keep this in this state, I'll investigate and either fix or revert the offending Tide config change.

@petr-muller
Copy link
Member

This should fix Tide: kubernetes/test-infra#31323

@openshift-merge-bot openshift-merge-bot bot merged commit d32dad9 into openshift:master Nov 29, 2023
11 checks passed
@openshift-ci-robot
Copy link
Contributor

@deads2k: Jira Issue OCPBUGS-23377: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-23377 has been moved to the MODIFIED state.

In response to this:

initial reduction for https://issues.redhat.com/browse/OTA-923

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@petr-muller
Copy link
Member

We got the fix in openshift/release#46195 and Tide merged this, looks good 🤞

@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-version-operator-container-v4.15.0-202311291533.p0.gd32dad9.assembly.stream for distgit cluster-version-operator.
All builds following this will include this PR.

@openshift-merge-robot
Copy link
Contributor

Fix included in accepted release 4.15.0-0.nightly-2023-12-02-123536

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants