OCPBUGS-25079: Prevent NoRunningOvnControlPlane alert getting fired continuously #2208

Merged
merged 1 commit into openshift:master on Jan 27, 2024

Conversation

arghosh93

Prevent the NoRunningOvnControlPlane alert from firing continuously for managed clusters

In a managed cluster, kube-rbac-proxy is not present for scraping metrics from ovnkube-cluster-manager. Instead, the metrics endpoint of ovnkube-cluster-manager is exposed directly. This PR makes the following changes:

  • Makes port 9108 listen on all IPv4 interfaces instead of localhost only.
  • Changes the namespace used in the NoRunningOvnControlPlane alert expression to HostedClusterNamespace.
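
As an illustration of the first point (not part of this PR), here is a minimal Python sketch of why the bind address matters when the metrics endpoint is scraped directly. The helper name is hypothetical, `127.0.0.1:9108` is an assumed stand-in for the previous localhost-only setting, and the parsing only handles IPv4 `host:port` strings.

```python
import ipaddress


def accepts_non_local_connections(bind_address: str) -> bool:
    """Return True if a listener bound to an IPv4 "host:port" string
    is not restricted to the loopback interface.

    Hypothetical helper for illustration only.
    """
    host, _, _port = bind_address.rpartition(":")
    ip = ipaddress.ip_address(host)
    # A loopback bind only accepts connections that originate from the
    # same network namespace, so a scraper elsewhere cannot reach it.
    return not ip.is_loopback


# Assumed previous value (localhost only) versus the value after the change.
print(accepts_non_local_connections("127.0.0.1:9108"))  # False
print(accepts_non_local_connections("0.0.0.0:9108"))    # True
```

Whether the endpoint is actually reachable still depends on network policy and firewalling, which this sketch does not model.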

… managed cluster

In managed cluster kube-rbac-proxy is not there to scrape metrics from
ovnkube-cluster-manager. Instead metrics endpoint of ovnkube-cluster-manager
gets exposed directly. This PR takes care of following aspects:

- Makes 9108 port listen on all IPv4 interfaces instead of localhost only.
- Change expression used with NoRunningOvnControlPlane alert to correct
  namespace to HostedClusterNamespace.

Signed-off-by: Arnab Ghosh <arnabghosh89@gmail.com>
openshift-ci bot added the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) on Jan 17, 2024

openshift-ci bot commented Jan 17, 2024

Hi @arghosh93. Thanks for your PR.

I'm waiting for an openshift member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

arghosh93 changed the title from "Prevent NoRunningOvnControlPlane alert getting fired continuously" to "OCPBUGS-25079: Prevent NoRunningOvnControlPlane alert getting fired continuously" on Jan 17, 2024
openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/invalid-bug (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) labels on Jan 17, 2024
@openshift-ci-robot

@arghosh93: This pull request references Jira Issue OCPBUGS-25079, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

tssurya commented Jan 22, 2024

/ok-to-test

openshift-ci bot added the ok-to-test label (Indicates a non-member PR verified by an org member that is safe to test.) and removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test.) on Jan 22, 2024
@arghosh93

I followed the procedure below to verify the fix.

[arghosh@arghosh-thinkpadp1gen3 ~]$ oc get hostedcluster -A
NAMESPACE   NAME      VERSION   KUBECONFIG                 PROGRESS    AVAILABLE   PROGRESSING   MESSAGE
arghosh     arghosh   4.14.8    arghosh-admin-kubeconfig   Completed   True        False         The hosted control plane is available

[arghosh@arghosh-thinkpadp1gen3 ~]$ oc patch -n arghosh hostedclusters/arghosh -p '{"spec":{"pausedUntil":"true"}}' --type=merge
hostedcluster.hypershift.openshift.io/arghosh patched

[arghosh@arghosh-thinkpadp1gen3 ~]$ oc set image deploy/cluster-network-operator cluster-network-operator=quay.io/arghosh/alert:1004
deployment.apps/cluster-network-operator image updated

[arghosh@arghosh-thinkpadp1gen3 ~]$ oc get promrule master-rules -n arghosh-arghosh -oyaml|grep NoRunningOvnControlPlane -A8|grep expr -A1
      expr: |
        absent(up{job="ovnkube-control-plane", namespace="arghosh-arghosh"} == 1)

[arghosh@arghosh-thinkpadp1gen3 ~]$ oc get po ovnkube-control-plane-6fc66886d6-l54ck -oyaml|grep metrics-bind-address
        --metrics-bind-address "0.0.0.0:9108" \
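
For context on why the namespace in the expression matters, here is a minimal Python sketch (an illustration, not the operator's code) that mimics PromQL's `absent()`: it yields a single sample when its input is empty and nothing otherwise. The sample data and helper names are hypothetical, and it assumes the `up` series for the hosted cluster carries the `arghosh-arghosh` namespace label.

```python
def select_up(series, job, namespace):
    """Mimic `up{job=..., namespace=...} == 1`: keep matching samples with value 1."""
    return [
        s for s in series
        if s["job"] == job and s["namespace"] == namespace and s["value"] == 1
    ]


def absent(samples):
    """Mimic PromQL absent(): one sample with value 1 if the input is empty, else nothing."""
    return [] if samples else [1]


# Hypothetical scrape state: the target is up, labelled with the hosted
# control plane namespace rather than openshift-ovn-kubernetes.
series = [{"job": "ovnkube-control-plane", "namespace": "arghosh-arghosh", "value": 1}]

old_expr = absent(select_up(series, "ovnkube-control-plane", "openshift-ovn-kubernetes"))
new_expr = absent(select_up(series, "ovnkube-control-plane", "arghosh-arghosh"))

print(old_expr)  # [1] -> the alert condition holds even though the target is up
print(new_expr)  # []  -> the alert condition does not hold
```

Under these assumptions, a selector that names the wrong namespace never matches any series, so `absent()` keeps returning a sample and the alert keeps firing.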

Please find attached a screenshot showing that the alert stopped firing after applying the above changes.

If the alert is still firing, restart the Prometheus Operator pod in the openshift-user-workload-monitoring project. There is an open issue [1] for this in the prometheus-operator project.

[1] - prometheus-operator/prometheus-operator#6018 (comment)

@arghosh93

[Screenshot: NoRunningOvnControlPlane-details]

@tssurya tssurya left a comment

change looks safe to me.
/assign @kyrtapz

@@ -42,7 +42,7 @@ spec:
 Networking control plane is degraded. Networking configuration updates applied to the cluster will not be
 implemented while there are no OVN Kubernetes pods.
 expr: |
-absent(up{job="ovnkube-control-plane", namespace="openshift-ovn-kubernetes"} == 1)
+absent(up{job="ovnkube-control-plane", namespace="{{.HostedClusterNamespace}}"} == 1)
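
To show how the templated expression could end up as the rendered rule seen in the verification output, here is a minimal Python sketch. It stands in for the operator's real template rendering with a plain string replacement, so the function and constant names are hypothetical and only this single placeholder is handled.

```python
# Templated alert expression as it appears in the changed rule file.
RULE_EXPR_TEMPLATE = (
    'absent(up{job="ovnkube-control-plane", '
    'namespace="{{.HostedClusterNamespace}}"} == 1)'
)


def render_expr(template: str, hosted_cluster_namespace: str) -> str:
    """Substitute the HostedClusterNamespace placeholder (illustrative stand-in only)."""
    return template.replace("{{.HostedClusterNamespace}}", hosted_cluster_namespace)


# "arghosh-arghosh" is the hosted control plane namespace from the verification steps.
rendered = render_expr(RULE_EXPR_TEMPLATE, "arghosh-arghosh")
print(rendered)
```

With that namespace, the result matches the `expr` shown in the `oc get promrule` output from the verification comment.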

this change lgtm, PTAL @kyrtapz
when this is backported (if we want to), we probably want to take care of all the multizone files

kyrtapz commented Jan 24, 2024

/retest

kyrtapz commented Jan 24, 2024

/lgtm

openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Jan 24, 2024

kyrtapz commented Jan 24, 2024

/jira refresh

openshift-ci-robot added the jira/valid-bug label (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) on Jan 24, 2024
@openshift-ci-robot

@kyrtapz: This pull request references Jira Issue OCPBUGS-25079, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot removed the jira/invalid-bug label (Indicates that a referenced Jira bug is invalid for the branch this PR is targeting.) on Jan 24, 2024

tssurya commented Jan 24, 2024

/test e2e-hypershift-ovn

kyrtapz commented Jan 25, 2024

/retest-required

@jcaamano

/approve

openshift-ci bot commented Jan 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arghosh93, jcaamano, kyrtapz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jan 26, 2024
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 4ae9882 and 2 for PR HEAD cc3b303 in total

tssurya commented Jan 26, 2024

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD ba8677d and 1 for PR HEAD cc3b303 in total

openshift-ci bot commented Jan 27, 2024

@arghosh93: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                                        Commit   Details  Required  Rerun command
ci/prow/e2e-aws-hypershift-ovn-kubevirt          cc3b303  link     false     /test e2e-aws-hypershift-ovn-kubevirt
ci/prow/e2e-aws-live-migration-sdn-ovn-rollback  cc3b303  link     false     /test e2e-aws-live-migration-sdn-ovn-rollback
ci/prow/e2e-vsphere-ovn-dualstack-primaryv6      cc3b303  link     false     /test e2e-vsphere-ovn-dualstack-primaryv6
ci/prow/e2e-aws-sdn-upgrade                      cc3b303  link     false     /test e2e-aws-sdn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

tssurya commented Jan 27, 2024

/retest-required

openshift-merge-bot bot merged commit 855457c into openshift:master on Jan 27, 2024
37 of 41 checks passed
@openshift-ci-robot

@arghosh93: Jira Issue OCPBUGS-25079: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-25079 has been moved to the MODIFIED state.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-robot

Fix included in accepted release 4.16.0-0.nightly-2024-01-31-073538

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • ok-to-test: Indicates a non-member PR verified by an org member that is safe to test.