Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

point-to-point network check tool #846

Merged
merged 4 commits into from Jun 16, 2020

Conversation

sanchezl
Copy link
Contributor

@sanchezl sanchezl commented May 1, 2020

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 1, 2020
cmd := &cobra.Command{
Use: "check-endpoint",
Short: "Checks that a tcp connection can be opened to an endpoint.",
Args: cobra.MinimumNArgs(1),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we take named flags, not positional args

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we take named flags, not positional args

nm, interesting. This is just a command where you explore the logic? ok.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 12, 2020
@sanchezl sanchezl force-pushed the p2pcheck branch 2 times, most recently from dc0b0fb to 64de6aa Compare May 15, 2020 06:17
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 15, 2020
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 26, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 27, 2020
apiVersion: operator.openshift.io/v1
kind: GenericOperatorConfig
leaderElection:
disable: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this manifest surprises me. Where is it needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The library-go based controllers have leader election enable by default, limiting control loops to only one per namespace. I need this config to disable the leader election logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The library-go based controllers have leader election enable by default, limiting control loops to only one per namespace. I need this config to disable the leader election logic.

update library-go to have a WithoutLeaderElection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sttts I've updated the ControllerCommandOptions to achieve the same result without the extra resources.

config.Config()
cmd := config.NewCommandWithContext(context.Background())
cmd.Use = "check-endpoint"
cmd.Short = "Checks that a tcp connection can be opened to an endpoint."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: falling over the singular use of "endpoint" everywhere in the PR. Isn't this checking a number of them potentially?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Reviewed the plurality of endpoint where used and made adjustments as needed to clarify.

for _, check := range checks {
if updater := c.updaters[check.Name]; updater == nil {
c.updaters[check.Name] = NewStatusUpdater(c, check.Name, c.recorder)
c.updaters[check.Name].Start(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we guaranteed that this context is long-living, i.e. over different Sync calls? cc @mfojtik

checksGetter operatorcontrolplaneclientv1alpha1.PodNetworkConnectivityCheckInterface
checkLister v1alpha1.PodNetworkConnectivityCheckNamespaceLister
recorder events.Recorder
sync.Mutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment what it protect and/or put in some empty lines to make it visually clear

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and don't embed it, but give it a name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, as this was moved into the status updater.

"k8s.io/klog"
)

type StatusUpdater interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is a status updater? Missing high level doc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I get it right, this queues up updateStatusFunc and calls them in batches every second?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sttts, correct. I have updated the godoc.

}

type statusUpdater struct {
sync.Mutex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't embed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

_, _, err := v1alpha1helpers.UpdateStatus(ctx, s.client, s.checkName, s.updates...)
if err != nil {
klog.Warningf("Unable to update %s: %v", s.checkName, err)
s.recorder.Warningf("UpdateFailed", "Unable to update %s: %v", s.checkName, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need an event for this? At worst this triggers every second.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Removed.

}

for _, check := range checks {
c.checkEndpoint(ctx, check)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the reason that we have the updater once-per-second-loops, but this here every 10 sec? Why do we decouple checks from status updates like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The discrepancy in loop frequencies was an oversight. The update is separate from the check so that check can continue even if the updates cannot be applied.

@sanchezl sanchezl force-pushed the p2pcheck branch 3 times, most recently from 4ccfd76 to 655a514 Compare June 4, 2020 01:20
@@ -161,6 +161,34 @@ spec:
requests:
memory: 50Mi
cpu: 10m
- name: kube-apiserver-check-endpoints
image: ${OPERATOR_IMAGE}
imagePullPolicy: Always
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IfNotPresent. Also, this is worth a bug or jira to write an e2e test for at some later date. cc @sttts

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test for what?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deads2k I think there actually is one, I've seen it fail.

<-ctx.Done()
return nil
})
config.DisableServing = true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you'll need metrics eventually.

@deads2k
Copy link
Contributor

deads2k commented Jun 5, 2020

Update the clusteroperator relatedResources to gather the new api resources. We should be able to see these in our runs

}
c.Controller = factory.New().
WithSync(c.Sync).
WithInformers(checkInformer.Informer()).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause self-triggering hot looping when you update status, won't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Used WithBareInformers instead.


// Returns a new PodNetworkConnectivityCheckController that performs network connectivity checks
// as specified in the PodNetworkConnectivityChecks defined in the specified namespace, for the specified pod.
func New(podName, podNamespace string, checksGetter operatorcontrolplaneclientv1alpha1.PodNetworkConnectivityChecksGetter, checkInformer alpha1.PodNetworkConnectivityCheckInformer, recorder events.Recorder) PodNetworkConnectivityCheckController {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the golang "standard" sucks. Name this something real. Same with the filename.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. renamed.

c.Controller = factory.New().
WithSync(c.Sync).
WithBareInformers(checkInformer.Informer()).
ResyncEvery(1*time.Second).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new structure means you only need to trigger on check informer updating and you only need to handle one time.

At the very least, make this a sync every minute so we don't burn cpu

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Kept only sync on one minute. Didn't want to self trigger on status updates.

}

// UpdateStatus implements v1alpha1helpers.PodNetworkConnectivityCheckClient
func (c *controller) UpdateStatus(ctx context.Context, check *operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, opts metav1.UpdateOptions) (*operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Implements PodNetworkConnectivityCheckClient client that ConnectionChecker uses.

}

// Get implements PodNetworkConnectivityCheckClient
func (c *controller) Get(name string) (*operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for a one liner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implements PodNetworkConnectivityCheckClient client that ConnectionChecker uses.

@@ -341,6 +358,115 @@ func manageKubeAPIServerCABundle(lister corev1listers.ConfigMapLister, client co
return resourceapply.ApplyConfigMap(client, recorder, requiredConfigMap)
}

func managePodNetworkConnectivityChecks(ctx context.Context, client kubernetes.Interface,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we find a way to isolate this in another control loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -39,3 +39,6 @@ status:
resource: mutatingwebhookconfigurations
- group: admissionregistration.k8s.io
resource: validatingwebhookconfigurations
- group: controlplane.operator.openshift.io
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you also need to add this to the bit in starter.go that duplicates this information. This is used before the operator starts, the other is used afterwards.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sanchezl can you remind me that I want to fix up oc admin inspect to put all the instances in a single file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well, I do and I don't :( I guess I want a visualizer for it. ugh.

@deads2k
Copy link
Contributor

deads2k commented Jun 11, 2020

This is great! Even without must-gather, the upgrade shows outages on masters. I think some events may be missing. @mfojtik do we always use the unlimited even emitter?

@sanchezl I think a descriptive name in the events (maybe the name of the check object?) will help a lot "Connectivity outage detected: Failed to establish a TCP connection to 10.130.0.54:8443: dial tcp 10.130.0.54:8443: connect: connection refused" is much better than we had before, but still don't know what that IP is.

The "no route to host" and "connection refused" show clearly. I suspect we may want one more sentence about which team to contact :)

@sanchezl sanchezl force-pushed the p2pcheck branch 3 times, most recently from 2e7d73c to 33c0329 Compare June 16, 2020 04:48
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2020
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 16, 2020
@deads2k
Copy link
Contributor

deads2k commented Jun 16, 2020

/refresh
/retest


var addresses []string
// each etcd
addresses = append(addresses, listAddressesForEtcd(operatorSpec, recorder)...)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a future PR, I want some way to understand what is connection to what in english terms in the names.

LabelSelector: labels.Set{"node-role.kubernetes.io/master": ""}.AsSelector().String(),
})
if err != nil {
recorder.Warningf("EndpointDetectionFailure", "failed to list master nodes: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can return right here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can return right here

nm

return nil
}

func managePodNetworkConnectivityChecks(ctx context.Context, client kubernetes.Interface,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where you're cleaning up if a master is removed.

You can make a bug and fix it post-merge

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make a bug and fix it post-merge

and when you do, you'll need a test

@@ -72,6 +78,7 @@ func RunOperator(ctx context.Context, controllerContext *controllercmd.Controlle
operatorclient.TargetNamespace,
operatorclient.OperatorNamespace,
"openshift-etcd",
"openshift-apiserver",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explain why in followup

@deads2k
Copy link
Contributor

deads2k commented Jun 16, 2020

there are follow-ups

  1. cleanup of no longer needed checks
  2. add to names so we know what each one is logically checking
  3. add metrics for collection
  4. improve message in events so that we know what became unavailable

but this is a great start.

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 16, 2020
@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, sanchezl

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 16, 2020
@deads2k
Copy link
Contributor

deads2k commented Jun 16, 2020

  1. cleanup of no longer needed checks

Ah, I remember this. I wanted a lifespan on them so I could keep dead ones around for a while.

@sanchezl
Copy link
Contributor Author

/test e2e-aws-upgrade

@openshift-merge-robot openshift-merge-robot merged commit d647303 into openshift:master Jun 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants