point-to-point network check tool #846

sanchezl · 2020-05-01T18:34:24Z

deads2k · 2020-05-05T17:28:47Z

pkg/cmd/checkendpoint/cmd.go

+	cmd := &cobra.Command{
+		Use:   "check-endpoint",
+		Short: "Checks that a tcp connection can be opened to an endpoint.",
+		Args:  cobra.MinimumNArgs(1),


we take named flags, not positional args

nm, interesting. This is just a command where you explore the logic? ok.

sttts · 2020-05-28T08:24:41Z

bindata/v4.1.0/kube-apiserver/check-endpoint-config-cm.yaml

+    apiVersion: operator.openshift.io/v1
+    kind: GenericOperatorConfig
+    leaderElection:
+      disable: true


this manifest surprises me. Where is it needed?

The library-go based controllers have leader election enabled by default, limiting control loops to only one per namespace. I need this config to disable the leader election logic.

update library-go to have a WithoutLeaderElection

@sttts I've updated the ControllerCommandOptions to achieve the same result without the extra resources.

sttts · 2020-05-28T08:27:35Z

pkg/cmd/checkendpoint/cmd.go

+	config.Config()
+	cmd := config.NewCommandWithContext(context.Background())
+	cmd.Use = "check-endpoint"
+	cmd.Short = "Checks that a tcp connection can be opened to an endpoint."


nit: falling over the singular use of "endpoint" everywhere in the PR. Isn't this checking a number of them potentially?

Done. Reviewed the plurality of endpoint where used and made adjustments as needed to clarify.

sttts · 2020-05-28T08:30:13Z

pkg/cmd/checkendpoint/controller/controller.go

+	for _, check := range checks {
+		if updater := c.updaters[check.Name]; updater == nil {
+			c.updaters[check.Name] = NewStatusUpdater(c, check.Name, c.recorder)
+			c.updaters[check.Name].Start(ctx)


are we guaranteed that this context is long-living, i.e. over different Sync calls? cc @mfojtik
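
To illustrate the concern, here is a minimal sketch, assuming the context passed to Sync could be canceled when the call returns. All names (`worker`, `controller`, `runCtx`, `demoLifetimes`) are hypothetical and not taken from the PR or from library-go. The idea is to bind background workers to a context owned by the controller rather than to the per-call one.

```go
package main

import (
	"context"
	"fmt"
)

// worker is a hypothetical stand-in for a per-check status updater.
type worker struct {
	stopped chan struct{} // closed once the worker's context is canceled
}

// startWorker launches a goroutine that lives until ctx is canceled.
func startWorker(ctx context.Context) *worker {
	w := &worker{stopped: make(chan struct{})}
	go func() {
		<-ctx.Done()
		close(w.stopped)
	}()
	return w
}

// controller keeps its own long-lived context instead of relying on the
// context handed to each Sync call.
type controller struct {
	runCtx  context.Context
	workers map[string]*worker
}

// Sync starts a worker for every name that does not have one yet. Workers are
// bound to c.runCtx, so they survive cancellation of syncCtx.
func (c *controller) Sync(syncCtx context.Context, names []string) {
	for _, name := range names {
		if c.workers[name] == nil {
			c.workers[name] = startWorker(c.runCtx)
		}
	}
}

// demoLifetimes reports whether a worker is still running after its Sync
// context was canceled, and whether it stops once the run context is canceled.
func demoLifetimes() (aliveAfterSync, stoppedAfterRun bool) {
	runCtx, cancelRun := context.WithCancel(context.Background())
	c := &controller{runCtx: runCtx, workers: map[string]*worker{}}

	syncCtx, cancelSync := context.WithCancel(context.Background())
	c.Sync(syncCtx, []string{"check-a"})
	cancelSync() // the per-call context ends here

	w := c.workers["check-a"]
	select {
	case <-w.stopped:
		aliveAfterSync = false
	default:
		aliveAfterSync = true
	}

	cancelRun()
	<-w.stopped // blocks until the worker observes the cancellation
	return aliveAfterSync, true
}

func main() {
	alive, stopped := demoLifetimes()
	fmt.Println(alive, stopped)
}
```

Whether the real Sync context is long-lived is exactly the open question in this thread; the sketch only shows one way to avoid depending on the answer.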

sttts · 2020-05-28T08:31:53Z

pkg/cmd/checkendpoint/controller/controller.go

+	checksGetter operatorcontrolplaneclientv1alpha1.PodNetworkConnectivityCheckInterface
+	checkLister  v1alpha1.PodNetworkConnectivityCheckNamespaceLister
+	recorder     events.Recorder
+	sync.Mutex


comment what it protects and/or put in some empty lines to make it visually clear

and don't embed it, but give it a name.

Removed, as this was moved into the status updater.
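
For reference, the two suggestions above (say what the lock protects, and give it a name instead of embedding it) might look like the following sketch. The type and field names are hypothetical and not from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingUpdates is a hypothetical type used only for illustration.
type pendingUpdates struct {
	checkName string

	// lock protects updates. Because it is a named field rather than an
	// embedded sync.Mutex, Lock and Unlock do not leak into the method set
	// of pendingUpdates.
	lock    sync.Mutex
	updates []string
}

// Add appends one update under the lock.
func (p *pendingUpdates) Add(u string) {
	p.lock.Lock()
	defer p.lock.Unlock()
	p.updates = append(p.updates, u)
}

// Drain returns everything queued so far and resets the queue.
func (p *pendingUpdates) Drain() []string {
	p.lock.Lock()
	defer p.lock.Unlock()
	out := p.updates
	p.updates = nil
	return out
}

func main() {
	p := &pendingUpdates{checkName: "example-check"}
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			p.Add(fmt.Sprintf("update-%d", i))
		}(i)
	}
	wg.Wait()
	fmt.Println(len(p.Drain()))
}
```

With an embedded `sync.Mutex`, callers outside the type could call `Lock()` on the value directly, which is usually not intended.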

sttts · 2020-05-28T08:38:32Z

pkg/cmd/checkendpoint/controller/status_updater.go

+	"k8s.io/klog"
+)
+
+type StatusUpdater interface {


what is a status updater? Missing high level doc.

If I get it right, this queues up updateStatusFunc and calls them in batches every second?

@sttts, correct. I have updated the godoc.
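
A rough sketch of the batching behaviour described above: update functions are queued and applied together on a timer. This is not the PR's implementation. `batchingUpdater`, `updateFunc`, and the `apply` callback are hypothetical, and the retry-on-failure behaviour in `flush` is an assumption added for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// updateFunc mutates a status; here the status is just a slice of strings.
type updateFunc func(status *[]string)

// batchingUpdater is a hypothetical stand-in for the status updater.
type batchingUpdater struct {
	lock    sync.Mutex // protects pending
	pending []updateFunc

	// apply receives one batch; in a real controller this would be a single
	// API status update.
	apply func(batch []updateFunc) error
}

// Add queues an update without waiting for it to be applied.
func (b *batchingUpdater) Add(f updateFunc) {
	b.lock.Lock()
	defer b.lock.Unlock()
	b.pending = append(b.pending, f)
}

// flush applies everything queued so far as one batch. On failure the batch
// is put back in front of the queue so a later flush retries it (assumption).
func (b *batchingUpdater) flush() {
	b.lock.Lock()
	batch := b.pending
	b.pending = nil
	b.lock.Unlock()

	if len(batch) == 0 {
		return
	}
	if err := b.apply(batch); err != nil {
		b.lock.Lock()
		b.pending = append(batch, b.pending...)
		b.lock.Unlock()
	}
}

// Start flushes on every tick until ctx is canceled.
func (b *batchingUpdater) Start(ctx context.Context, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				b.flush()
			}
		}
	}()
}

func main() {
	var status []string
	u := &batchingUpdater{apply: func(batch []updateFunc) error {
		for _, f := range batch {
			f(&status)
		}
		return nil
	}}
	u.Add(func(s *[]string) { *s = append(*s, "check succeeded") })
	u.Add(func(s *[]string) { *s = append(*s, "check failed") })
	u.flush() // called directly here; Start would do this on a timer
	fmt.Println(status)
}
```

Because `Add` never waits on `apply`, the connectivity checks can keep running even while status updates are failing.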

sttts · 2020-05-28T08:38:52Z

pkg/cmd/checkendpoint/controller/status_updater.go

+}
+
+type statusUpdater struct {
+	sync.Mutex


don't embed

sttts · 2020-05-28T08:40:20Z

pkg/cmd/checkendpoint/controller/status_updater.go

+		_, _, err := v1alpha1helpers.UpdateStatus(ctx, s.client, s.checkName, s.updates...)
+		if err != nil {
+			klog.Warningf("Unable to update %s: %v", s.checkName, err)
+			s.recorder.Warningf("UpdateFailed", "Unable to update %s: %v", s.checkName, err)


do we really need an event for this? At worst this triggers every second.

Done. Removed.

sttts · 2020-05-28T08:46:04Z

pkg/cmd/checkendpoint/controller/controller.go

+	}
+
+	for _, check := range checks {
+		c.checkEndpoint(ctx, check)


what is the reason that we have the updater once-per-second-loops, but this here every 10 sec? Why do we decouple checks from status updates like that?

The discrepancy in loop frequencies was an oversight. The update is separate from the check so that check can continue even if the updates cannot be applied.

deads2k · 2020-06-05T23:01:25Z

bindata/v4.1.0/kube-apiserver/pod.yaml

@@ -161,6 +161,34 @@ spec:
      requests:
        memory: 50Mi
        cpu: 10m
+  - name: kube-apiserver-check-endpoints
+    image: ${OPERATOR_IMAGE}
+    imagePullPolicy: Always


IfNotPresent. Also, this is worth a bug or jira to write an e2e test for at some later date. cc @sttts

test for what?

@deads2k I think there actually is one, I've seen it fail.

deads2k · 2020-06-05T23:02:49Z

pkg/cmd/checkendpoints/cmd.go

+		<-ctx.Done()
+		return nil
+	})
+	config.DisableServing = true


you'll need metrics eventually.

deads2k · 2020-06-05T23:03:52Z

Update the clusteroperator relatedResources to gather the new api resources. We should be able to see these in our runs

deads2k · 2020-06-05T23:04:46Z

pkg/cmd/checkendpoints/controller/controller.go

+	}
+	c.Controller = factory.New().
+		WithSync(c.Sync).
+		WithInformers(checkInformer.Informer()).


This will cause self-triggering hot looping when you update status, won't it?

Done. Used WithBareInformers instead.
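
The hot loop mentioned above can be shown with a toy model, which does not use library-go at all. `simulateSyncs` is a hypothetical function: it treats the work queue as a counter and assumes every sync writes status.

```go
package main

import "fmt"

// simulateSyncs models a single-key work queue. Each sync writes status. When
// triggerOnOwnUpdate is true, that write produces an informer event that
// re-enqueues the key, so the controller keeps waking itself up. maxSyncs
// bounds the simulation so it always terminates.
func simulateSyncs(triggerOnOwnUpdate bool, maxSyncs int) int {
	queued := 1 // one initial event
	syncs := 0
	for queued > 0 && syncs < maxSyncs {
		queued--
		syncs++
		// ... the status update would happen here ...
		if triggerOnOwnUpdate {
			queued++ // our own write comes back as an update event
		}
	}
	return syncs
}

func main() {
	fmt.Println("informer triggers sync:", simulateSyncs(true, 1000))
	fmt.Println("informer only feeds cache:", simulateSyncs(false, 1000))
}
```

In the first case the loop only stops because of the artificial bound; in the second the single initial event is processed once.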

deads2k · 2020-06-05T23:14:41Z

pkg/cmd/checkendpoints/controller/controller.go

+
+// Returns a new PodNetworkConnectivityCheckController that performs network connectivity checks
+// as specified in the PodNetworkConnectivityChecks defined in the specified namespace, for the specified pod.
+func New(podName, podNamespace string, checksGetter operatorcontrolplaneclientv1alpha1.PodNetworkConnectivityChecksGetter, checkInformer alpha1.PodNetworkConnectivityCheckInformer, recorder events.Recorder) PodNetworkConnectivityCheckController {


the golang "standard" sucks. Name this something real. Same with the filename.

Done. renamed.

deads2k · 2020-06-11T21:02:18Z

pkg/cmd/checkendpoints/controller/pod_network_connectivity_check_controller.go

+	c.Controller = factory.New().
+		WithSync(c.Sync).
+		WithBareInformers(checkInformer.Informer()).
+		ResyncEvery(1*time.Second).


the new structure means you only need to trigger on check informer updating and you only need to handle one time.

At the very least, make this a sync every minute so we don't burn cpu

Done. Kept only the one-minute resync. Didn't want to self-trigger on status updates.

deads2k · 2020-06-11T21:02:40Z

pkg/cmd/checkendpoints/controller/pod_network_connectivity_check_controller.go

+}
+
+// UpdateStatus implements v1alpha1helpers.PodNetworkConnectivityCheckClient
+func (c *controller) UpdateStatus(ctx context.Context, check *operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, opts metav1.UpdateOptions) (*operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, error) {


is this used?

Yes. Implements PodNetworkConnectivityCheckClient client that ConnectionChecker uses.

deads2k · 2020-06-11T21:02:47Z

pkg/cmd/checkendpoints/controller/pod_network_connectivity_check_controller.go

+}
+
+// Get implements PodNetworkConnectivityCheckClient
+func (c *controller) Get(name string) (*operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, error) {


no need for a one liner

Implements PodNetworkConnectivityCheckClient client that ConnectionChecker uses.

deads2k · 2020-06-11T23:37:29Z

pkg/operator/targetconfigcontroller/targetconfigcontroller.go

@@ -341,6 +358,115 @@ func manageKubeAPIServerCABundle(lister corev1listers.ConfigMapLister, client co
 	return resourceapply.ApplyConfigMap(client, recorder, requiredConfigMap)
 }

+func managePodNetworkConnectivityChecks(ctx context.Context, client kubernetes.Interface,


Can we find a way to isolate this in another control loop?

deads2k · 2020-06-11T23:41:52Z

manifests/0000_20_kube-apiserver-operator_07_clusteroperator.yaml

@@ -39,3 +39,6 @@ status:
      resource: mutatingwebhookconfigurations
    - group: admissionregistration.k8s.io
      resource: validatingwebhookconfigurations
+    - group: controlplane.operator.openshift.io


you also need to add this to the bit in starter.go that duplicates this information. This is used before the operator starts, the other is used afterwards.

@sanchezl can you remind me that I want to fix up oc admin inspect to put all the instances in a single file?

well, I do and I don't :( I guess I want a visualizer for it. ugh.

deads2k · 2020-06-11T23:47:53Z

This is great! Even without must-gather, the upgrade shows outages on masters. I think some events may be missing. @mfojtik do we always use the unlimited event emitter?

@sanchezl I think a descriptive name in the events (maybe the name of the check object?) will help a lot. "Connectivity outage detected: Failed to establish a TCP connection to 10.130.0.54:8443: dial tcp 10.130.0.54:8443: connect: connection refused" is much better than what we had before, but I still don't know what that IP is.

The "no route to host" and "connection refused" show clearly. I suspect we may want one more sentence about which team to contact :)
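
As a sketch of the suggestion above, a TCP check could put the check's name into the message so that a reader can tell what the address belongs to. `checkTCP` and `demoLocalListener` are hypothetical helpers and the message wording is only an example; just `net.DialTimeout` and `net.Listen` are real standard-library calls.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkTCP tries to open a TCP connection to address and returns whether it
// succeeded, plus an event-style message that names the check.
func checkTCP(checkName, address string, timeout time.Duration) (bool, string) {
	conn, err := net.DialTimeout("tcp", address, timeout)
	if err != nil {
		return false, fmt.Sprintf("Connectivity outage detected: %s: failed to establish a TCP connection to %s: %v", checkName, address, err)
	}
	_ = conn.Close()
	return true, fmt.Sprintf("%s: TCP connection to %s succeeded", checkName, address)
}

// demoLocalListener checks a loopback listener while it is open and again
// after it has been closed.
func demoLocalListener() (up, down bool, err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, false, err
	}
	addr := ln.Addr().String()

	up, msg := checkTCP("example-check", addr, time.Second)
	fmt.Println(msg)

	if err := ln.Close(); err != nil {
		return up, false, err
	}
	down, msg = checkTCP("example-check", addr, time.Second)
	fmt.Println(msg)
	return up, down, nil
}

func main() {
	if _, _, err := demoLocalListener(); err != nil {
		fmt.Println("demo failed:", err)
	}
}
```

The underlying dial error (for example "connection refused" or "no route to host") is kept verbatim at the end of the message.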

deads2k · 2020-06-16T17:06:06Z

/refresh
/retest

deads2k · 2020-06-16T17:26:43Z

pkg/operator/connectivitycheckcontroller/connectivity_check_controller.go

+
+	var addresses []string
+	// each etcd
+	addresses = append(addresses, listAddressesForEtcd(operatorSpec, recorder)...)


In a future PR, I want some way to understand what is connecting to what, in plain English terms, from the names.

deads2k · 2020-06-16T17:27:21Z

pkg/operator/connectivitycheckcontroller/connectivity_check_controller.go

+		LabelSelector: labels.Set{"node-role.kubernetes.io/master": ""}.AsSelector().String(),
+	})
+	if err != nil {
+		recorder.Warningf("EndpointDetectionFailure", "failed to list master nodes: %v", err)


you can return right here

nm

deads2k · 2020-06-16T17:28:48Z

pkg/operator/connectivitycheckcontroller/connectivity_check_controller.go

+	return nil
+}
+
+func managePodNetworkConnectivityChecks(ctx context.Context, client kubernetes.Interface,


I don't see where you're cleaning up if a master is removed.

You can make a bug and fix it post-merge

and when you do, you'll need a test
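
The missing cleanup amounts to a set difference between the checks that exist and the checks that are still wanted. A minimal sketch follows; `staleChecks` and the check names in `main` are hypothetical, and the actual deletion through the API is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// staleChecks returns the names in existing that are absent from desired,
// sorted for stable output. These are the checks a cleanup pass would delete.
func staleChecks(existing, desired []string) []string {
	want := make(map[string]struct{}, len(desired))
	for _, name := range desired {
		want[name] = struct{}{}
	}
	var stale []string
	for _, name := range existing {
		if _, ok := want[name]; !ok {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	existing := []string{"to-master-0", "to-master-1", "to-master-2"}
	desired := []string{"to-master-0", "to-master-2"} // master-1 was removed
	fmt.Println(staleChecks(existing, desired))
}
```

A function of this shape is also easy to cover with the test requested above, since it needs no cluster.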

deads2k · 2020-06-16T17:30:47Z

pkg/operator/starter.go

@@ -72,6 +78,7 @@ func RunOperator(ctx context.Context, controllerContext *controllercmd.Controlle
 		operatorclient.TargetNamespace,
 		operatorclient.OperatorNamespace,
 		"openshift-etcd",
+		"openshift-apiserver",


explain why in followup

deads2k · 2020-06-16T17:32:42Z

there are follow-ups:

- cleanup of no longer needed checks
- add to names so we know what each one is logically checking
- add metrics for collection
- improve message in events so that we know what became unavailable

but this is a great start.

/lgtm

openshift-ci-robot · 2020-06-16T17:33:06Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sanchezl

Needs approval from an approver in each of these files:

~~OWNERS~~ [deads2k]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

deads2k · 2020-06-16T18:17:53Z

cleanup of no longer needed checks

Ah, I remember this. I wanted a lifespan on them so I could keep dead ones around for a while.

sanchezl · 2020-06-16T21:35:22Z

/test e2e-aws-upgrade

openshift-ci-robot added the do-not-merge/work-in-progress label May 1, 2020

openshift-ci-robot requested review from soltysh and sttts May 1, 2020 18:34

sanchezl mentioned this pull request May 4, 2020

stability: point-to-point network check prototype openshift/enhancements#304

deads2k reviewed May 5, 2020

sanchezl force-pushed the p2pcheck branch from 9059be6 to 9317f8d Compare May 8, 2020 23:34

openshift-ci-robot added the needs-rebase label May 12, 2020

sanchezl force-pushed the p2pcheck branch 2 times, most recently from dc0b0fb to 64de6aa Compare May 15, 2020 06:17

openshift-ci-robot removed the needs-rebase label May 15, 2020

sanchezl force-pushed the p2pcheck branch from 64de6aa to 44e19ac Compare May 21, 2020 03:59

openshift-ci-robot added the needs-rebase label May 26, 2020

sanchezl force-pushed the p2pcheck branch from d4f06fa to 9e4ab8c Compare May 27, 2020 13:15

openshift-ci-robot removed the needs-rebase label May 27, 2020

sttts reviewed May 28, 2020

sanchezl force-pushed the p2pcheck branch 3 times, most recently from 4ccfd76 to 655a514 Compare June 4, 2020 01:20

deads2k reviewed Jun 5, 2020

deads2k reviewed Jun 5, 2020

deads2k reviewed Jun 11, 2020

sanchezl force-pushed the p2pcheck branch 3 times, most recently from 2e7d73c to 33c0329 Compare June 16, 2020 04:48

openshift-ci-robot added the needs-rebase label Jun 16, 2020

sanchezl force-pushed the p2pcheck branch from 33c0329 to 379e03f Compare June 16, 2020 04:52

openshift-ci-robot removed the needs-rebase label Jun 16, 2020

sanchezl added 4 commits June 16, 2020 09:04

bump(*): pickup operatorcontrolplane/v1alpha1 client

4bde510

bump(*): update ControllerCommandConfig

point-to-point network check: add check-endpoints command

71bf449

point-to-point network check: add/update manifests

dcb8d6a

point-to-point network check: create target resources

7b16870

sanchezl force-pushed the p2pcheck branch from 379e03f to 7b16870 Compare June 16, 2020 13:06

deads2k reviewed Jun 16, 2020

openshift-ci-robot assigned deads2k Jun 16, 2020

openshift-ci-robot added the lgtm label Jun 16, 2020

openshift-ci-robot added the approved label Jun 16, 2020

openshift-merge-robot merged commit d647303 into openshift:master Jun 16, 2020
