OCPBUGS-10318: [release-4.12] node: add node healthz server for cloud load balancers #1570

@ricky-rav ricky-rav commented Mar 9, 2023

4.12 backport of c8489e3 and 9a836e3

This is not a clean backport, because the 4.13 changes were made against the secondary network controller code in ovnk node, which we don't have in 4.12.

Also, in the backport of c8489e3, I moved the relevant bits of the 4.12 file https://github.com/openshift/ovn-kubernetes/blob/release-4.12/go-controller/pkg/node/healthcheck.go into the new source file pkg/node/openflow_manager.go, rather than taking the 4.13 version, where a few functions (checkForStaleOVSInternalPorts, checkForStaleOVSRepresentorInterfaces, checkForStaleOVSInterfaces) had already been removed by earlier commits.

The node healthz server is then enabled by CNO with this PR:
openshift/cluster-network-operator#1731

closes #OCPBUGS-10318

dcbw added 2 commits March 9, 2023 12:57
openflowManager code is only marginally health-check related, while the
service code provides healthz responses for local traffic policy
services on the node.

No code changes, just moving stuff around.

Signed-off-by: Dan Williams <dcbw@redhat.com>
(cherry picked from commit c8489e3)

For Cluster traffic policy services every node should accept
traffic and balance to a node with an endpoint for that service.
The cloud LB periodically health checks each node to know what
nodes it can send traffic to.

For Local traffic policy the cloud LB health-checks the specific
service's port on every node to determine whether that node has
any endpoints for the service, so no node-level health checks
are needed.

GCE's legacy cloud provider hardcodes port 10256 (the default
kube-proxy port) for its node-level health checks.

kube-proxy starts a healthz server on port 10256 on every node
for Cluster traffic policy services. ovnkube didn't do that,
so in some cases the cloud LB wouldn't consider nodes healthy.

Fixes: https://issues.redhat.com/browse/OCPBUGS-7158

Signed-off-by: Dan Williams <dcbw@redhat.com>
(cherry picked from commit 9a836e3)

openshift-ci bot commented Mar 9, 2023

@ricky-rav: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

[release-4.12] node: add node healthz server for cloud load balancers

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from dcbw and tssurya March 9, 2023 12:21
@ricky-rav

/retest-required

openshift-ci bot commented Mar 14, 2023

@ricky-rav: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-ovn-hybrid-step-registry | f2b6a8d | link | false | /test e2e-ovn-hybrid-step-registry |
| ci/prow/e2e-vsphere-ovn | f2b6a8d | link | false | /test e2e-vsphere-ovn |

dcbw commented Mar 14, 2023

@ricky-rav do we have a Jira for this backport yet?

@ricky-rav ricky-rav changed the title [release-4.12] node: add node healthz server for cloud load balancers OCPBUGS-10318: [release-4.12] node: add node healthz server for cloud load balancers Mar 15, 2023

openshift-ci bot commented Mar 15, 2023

@ricky-rav: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 15, 2023
@openshift-ci-robot

@ricky-rav: This pull request references Jira Issue OCPBUGS-10318, which is invalid:

  • expected the bug to target the "4.12.z" version, but it targets "4.13.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

@ricky-rav

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Mar 15, 2023
@openshift-ci-robot

@ricky-rav: This pull request references Jira Issue OCPBUGS-10318, which is valid. The bug has been moved to the POST state.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.z) matches configured target version for branch (4.12.z)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-10317 is in the state Closed (Done), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE))
  • dependent Jira Issue OCPBUGS-10317 targets the "4.13.0" version, which is one of the valid target versions: 4.13.0
  • bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

@ricky-rav

/retest-required

@openshift-ci-robot

@ricky-rav: This pull request references Jira Issue OCPBUGS-10318, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.z) matches configured target version for branch (4.12.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • dependent bug Jira Issue OCPBUGS-10317 is in the state Closed (Done), which is one of the valid states (VERIFIED, RELEASE PENDING, CLOSED (ERRATA), CLOSED (CURRENT RELEASE), CLOSED (DONE))
  • dependent Jira Issue OCPBUGS-10317 targets the "4.13.0" version, which is one of the valid target versions: 4.13.0
  • bug has dependents

Requesting review from QA contact:
/cc @anuragthehatter

openshift-ci bot commented Mar 16, 2023

@ricky-rav: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

Retaining the bugzilla/valid-bug label as it was manually added.

@dcbw dcbw added approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. labels Mar 16, 2023

openshift-ci bot commented Mar 16, 2023

[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval.

This pull-request has been approved by: ricky-rav

@anuragthehatter

/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Mar 16, 2023

dcbw commented Mar 16, 2023

/test e2e-gcp-ovn

@openshift-merge-robot openshift-merge-robot merged commit da42356 into openshift:release-4.12 Mar 16, 2023
21 checks passed
@openshift-ci-robot

@ricky-rav: Jira Issue OCPBUGS-10318: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-10318 has been moved to the MODIFIED state.

@ricky-rav

/cherry-pick release-4.11

@openshift-cherrypick-robot

@ricky-rav: #1570 failed to apply on top of branch "release-4.11":

Applying: node: split openflowManager and service health check code
Using index info to reconstruct a base tree...
M	go-controller/pkg/node/healthcheck.go
Falling back to patching base and 3-way merge...
Auto-merging go-controller/pkg/node/openflow_manager.go
CONFLICT (content): Merge conflict in go-controller/pkg/node/openflow_manager.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 node: split openflowManager and service health check code
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

@anuragthehatter

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Mar 20, 2023
Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • backport-risk-assessed: Indicates a PR to a release branch has been evaluated and considered safe to accept.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • cherry-pick-approved: Indicates a cherry-pick PR into a release branch has been approved by the release branch manager.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • qe-approved: Signifies that QE has signed off on this PR