Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add connectivity checker controller #856

Conversation

rcarrillocruz
Copy link
Contributor

@rcarrillocruz rcarrillocruz commented Oct 28, 2020

Add network diagnostics controller to CNO

This leverages PodNetworkConnectivityChecks (https://github.com/openshift/enhancements/blob/241bfa6388a2bde7d78c4be7f0479fe99784daed/enhancements/kube-apiserver/stability-point-to-point-network-check.md) to perform network checks periodically.
The checks are:

From regular pod to kubeapiserver/openshiftapiserver service and endpoints and LBs.
From regular pod to pods on every node running a hello-openshift service.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 28, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 28, 2020
@rcarrillocruz
Copy link
Contributor Author

@sanchezl can you please comment how EventRecorder gets initialized on other operators ?

Copy link
Contributor

@sanchezl sanchezl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please comment how EventRecorder gets initialized on other operators ?

We use a framework for most of our operator commands that creates the event.Recorder for us. I don't see a proper event.Recorder in use in any of the existing controllers, but the pki controller does have a loggingRecorder that simply writes to Stdout. You can start with that, or use events.NewRecorder() to create a new recorder that actually creates events.

@rcarrillocruz rcarrillocruz force-pushed the add_connectivity_check_controller branch 5 times, most recently from 6470dac to 8eea9d1 Compare October 29, 2020 19:12
@sanchezl
Copy link
Contributor

@rcarrillocruz See rcarrillocruz#1 for some thoughts on the context.

@rcarrillocruz rcarrillocruz force-pushed the add_connectivity_check_controller branch from 8eea9d1 to 99c45e6 Compare October 29, 2020 20:52
@rcarrillocruz
Copy link
Contributor Author

Running this patch with cluster network operator running locally it gets stuck with:

I1030 10:16:01.494974 252285 base_controller.go:66] Waiting for caches to sync for ConnectivityCheckController

@rcarrillocruz
Copy link
Contributor Author

/assign @sttts

Copy link
Contributor

@sanchezl sanchezl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like one more informer factory to start.

@rcarrillocruz rcarrillocruz force-pushed the add_connectivity_check_controller branch 3 times, most recently from 7895a89 to 44762c1 Compare October 30, 2020 16:49
@rcarrillocruz
Copy link
Contributor Author

We will need to tweak the period each check is run , right now is every second (in library-go)

@rcarrillocruz
Copy link
Contributor Author

I need to templatize the manifests, so the RBAC and check-endpoints is only added in case the flag $check_endpoints is enabled

@rcarrillocruz
Copy link
Contributor Author

I also need to put this on SDN

@rcarrillocruz rcarrillocruz force-pushed the add_connectivity_check_controller branch 3 times, most recently from dfd8206 to dc25805 Compare November 4, 2020 09:36
@squeed
Copy link
Contributor

squeed commented Nov 4, 2020

Yeah, it would be good to think about how this could be made more generic - do we even need to care about the underlying network provider?

@rcarrillocruz
Copy link
Contributor Author

/retest

This leverages PodNetworkConnectivityChecks to perform network checks periodically.
The checks are:

From regular pod to kubeapiserver/openshiftapiserver service and endpoints and LBs.
From regular pod to pods on every node.
From regular pod to a test hello-openshift service.
@rcarrillocruz rcarrillocruz force-pushed the add_connectivity_check_controller branch from bd6599c to 123b165 Compare December 4, 2020 07:47
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2020
@danielmellado
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danielmellado, danwinship, rcarrillocruz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [danwinship,rcarrillocruz]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rcarrillocruz
Copy link
Contributor Author

/retest

4 similar comments
@rcarrillocruz
Copy link
Contributor Author

/retest

@rcarrillocruz
Copy link
Contributor Author

/retest

@rcarrillocruz
Copy link
Contributor Author

/retest

@rcarrillocruz
Copy link
Contributor Author

/retest

@rcarrillocruz
Copy link
Contributor Author

/test e2e-gcp

@rcarrillocruz
Copy link
Contributor Author

/test e2e-aws-ovn-windows-custom-vxlan

1 similar comment
@rcarrillocruz
Copy link
Contributor Author

/test e2e-aws-ovn-windows-custom-vxlan

@rcarrillocruz
Copy link
Contributor Author

/test e2e-gcp

@rcarrillocruz
Copy link
Contributor Author

/retest

1 similar comment
@rcarrillocruz
Copy link
Contributor Author

/retest

@abhat
Copy link
Contributor

abhat commented Dec 4, 2020

/override ci/prow/e2e-openstack-ovn
/override ci/prow/e2e-ovn-ipsec-step-registry

@openshift-ci-robot
Copy link
Contributor

@abhat: Overrode contexts on behalf of abhat: ci/prow/e2e-openstack-ovn, ci/prow/e2e-ovn-ipsec-step-registry

In response to this:

/override ci/prow/e2e-openstack-ovn
/override ci/prow/e2e-ovn-ipsec-step-registry

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@abhat
Copy link
Contributor

abhat commented Dec 4, 2020

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2020
@abhat
Copy link
Contributor

abhat commented Dec 4, 2020

@rcarrillocruz noticed that the network operator reconcile loop is showing a bunch of errors with unregistered GVK

prow log for cno:

E1204 19:27:19.939188       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
E1204 19:27:20.031564       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
E1204 19:27:20.135990       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
E1204 19:27:23.402481       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
E1204 19:27:28.910603       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
E1204 19:27:28.918887       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
E1204 19:27:29.213138       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
E1204 19:27:29.221347       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
E1204 19:27:29.726810       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
E1204 19:27:30.635573       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
E1204 19:27:30.643651       1 base_controller.go:250] "ConnectivityCheckController" controller failed to sync "key", err: cannot decode "pkg/operator/connectivitycheckcontroller/manifests/controlplane.operator.openshift.io_podnetworkconnectivitychecks.yaml": no kind "CustomResourceDefinition" is registered for version "apiextensions.k8s.io/v1" in scheme "pkg/runtime/scheme.go:101"
I1204 19:31:11.621560       1 log.go:181] Reconciling update to openshift-multus/multus

Holding off for now till you can ack these are benign or we have a fix.

@abhat
Copy link
Contributor

abhat commented Dec 4, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2020
@abhat
Copy link
Contributor

abhat commented Dec 4, 2020

Looks like the openstack-ovn job has been failing all over the place. Nothing to do with the PR itself. Cancelling the hold.

@abhat
Copy link
Contributor

abhat commented Dec 4, 2020

/test e2e-agnostic-upgrade

@abhat
Copy link
Contributor

abhat commented Dec 5, 2020

/test e2e-gcp
/test e2e-gcp-ovn
/test e2e-agnostic-upgrade

@openshift-merge-robot openshift-merge-robot merged commit fe0d11b into openshift:master Dec 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet