point-to-point network check tool #846
pkg/cmd/checkendpoint/cmd.go
Outdated
cmd := &cobra.Command{
	Use: "check-endpoint",
	Short: "Checks that a tcp connection can be opened to an endpoint.",
	Args: cobra.MinimumNArgs(1),
we take named flags, not positional args
nm, interesting. This is just a command where you explore the logic? ok.
dc0b0fb to 64de6aa
apiVersion: operator.openshift.io/v1
kind: GenericOperatorConfig
leaderElection:
  disable: true
this manifest surprises me. Where is it needed?
The library-go based controllers have leader election enabled by default, limiting control loops to only one per namespace. I need this config to disable the leader election logic.
update library-go to have a WithoutLeaderElection
@sttts I've updated the ControllerCommandOptions to achieve the same result without the extra resources.
pkg/cmd/checkendpoint/cmd.go
Outdated
config.Config()
cmd := config.NewCommandWithContext(context.Background())
cmd.Use = "check-endpoint"
cmd.Short = "Checks that a tcp connection can be opened to an endpoint."
nit: falling over the singular use of "endpoint" everywhere in the PR. Isn't this checking a number of them potentially?
Done. Reviewed the plurality of endpoint where used and made adjustments as needed to clarify.
for _, check := range checks {
	if updater := c.updaters[check.Name]; updater == nil {
		c.updaters[check.Name] = NewStatusUpdater(c, check.Name, c.recorder)
		c.updaters[check.Name].Start(ctx)
are we guaranteed that this context is long-living, i.e. over different Sync calls? cc @mfojtik
checksGetter operatorcontrolplaneclientv1alpha1.PodNetworkConnectivityCheckInterface
checkLister v1alpha1.PodNetworkConnectivityCheckNamespaceLister
recorder events.Recorder
sync.Mutex
comment what it protects and/or put in some empty lines to make it visually clear
and don't embed it, but give it a name.
Removed, as this was moved into the status updater.
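For readers following the two suggestions above (name the mutex instead of embedding it, and document what it protects), here is a minimal sketch of that convention. The `checkRegistry` type, its fields, and the `record` method are hypothetical and exist only for illustration; they are not taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// checkRegistry is a hypothetical type used only to illustrate the convention.
type checkRegistry struct {
	// namespace is set at construction and never changes, so it needs no lock.
	namespace string

	// mu protects counts. Naming the field (rather than embedding sync.Mutex)
	// keeps Lock and Unlock out of the type's exported method set.
	mu     sync.Mutex
	counts map[string]int
}

func newCheckRegistry(namespace string) *checkRegistry {
	return &checkRegistry{namespace: namespace, counts: map[string]int{}}
}

// record increments the counter for name and returns the new value.
func (r *checkRegistry) record(name string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[name]++
	return r.counts[name]
}

func main() {
	r := newCheckRegistry("example-namespace")

	// Many goroutines update the same key; mu keeps the map consistent.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			r.record("example-check")
		}()
	}
	wg.Wait()

	fmt.Println(r.record("example-check"))
}
```

With an embedded `sync.Mutex`, callers outside the package could call `Lock` on the struct directly; the named field avoids that and the blank line groups the lock with the data it guards.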
"k8s.io/klog" | ||
)

type StatusUpdater interface {
what is a status updater? Missing high level doc.
If I get it right, this queues up updateStatusFunc and calls them in batches every second?
@sttts, correct. I have updated the godoc.
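The behavior confirmed above (queue update functions, apply them in one batch on a timer) can be sketched roughly as follows. This is not the PR's `StatusUpdater`: the `batchingUpdater` type, the `updateFunc` signature, and the requeue-on-error behavior are assumptions made for illustration, and the status is modeled as a plain string slice.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// updateFunc mutates a status; the status here is just a slice of strings.
type updateFunc func(status *[]string)

// batchingUpdater is a hypothetical stand-in for a status updater.
type batchingUpdater struct {
	mu      sync.Mutex // protects pending
	pending []updateFunc

	// apply writes one batch; in a real controller this would be an API call.
	apply func(batch []updateFunc) error
}

// Add queues update functions for the next flush.
func (b *batchingUpdater) Add(fns ...updateFunc) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.pending = append(b.pending, fns...)
}

// flush applies everything queued so far in a single call. On failure the
// batch is put back in front of newer updates (an assumption of this sketch).
func (b *batchingUpdater) flush() {
	b.mu.Lock()
	batch := b.pending
	b.pending = nil
	b.mu.Unlock()

	if len(batch) == 0 {
		return
	}
	if err := b.apply(batch); err != nil {
		b.mu.Lock()
		b.pending = append(batch, b.pending...)
		b.mu.Unlock()
	}
}

// Start flushes once per interval until ctx is cancelled.
func (b *batchingUpdater) Start(ctx context.Context, interval time.Duration) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				b.flush()
			}
		}
	}()
}

func main() {
	var status []string
	applied := make(chan int, 1)
	u := &batchingUpdater{apply: func(batch []updateFunc) error {
		for _, fn := range batch {
			fn(&status)
		}
		applied <- len(batch)
		return nil
	}}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	u.Add(func(s *[]string) { *s = append(*s, "success") })
	u.Add(func(s *[]string) { *s = append(*s, "failure") })
	u.Start(ctx, 10*time.Millisecond)

	n := <-applied // wait for the first batch before reading status
	cancel()
	fmt.Println(n, status)
}
```

Because the ticker goroutine depends on the context it was started with, the question raised earlier in the review about the context outliving individual Sync calls matters for any design of this shape.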
}

type statusUpdater struct {
	sync.Mutex
don't embed
Done.
_, _, err := v1alpha1helpers.UpdateStatus(ctx, s.client, s.checkName, s.updates...)
if err != nil {
	klog.Warningf("Unable to update %s: %v", s.checkName, err)
	s.recorder.Warningf("UpdateFailed", "Unable to update %s: %v", s.checkName, err)
do we really need an event for this? At worst this triggers every second.
Done. Removed.
}

for _, check := range checks {
	c.checkEndpoint(ctx, check)
what is the reason that we have the updater once-per-second-loops, but this here every 10 sec? Why do we decouple checks from status updates like that?
The discrepancy in loop frequencies was an oversight. The update is separate from the check so that check can continue even if the updates cannot be applied.
4ccfd76 to 655a514
@@ -161,6 +161,34 @@ spec:
    requests:
      memory: 50Mi
      cpu: 10m
- name: kube-apiserver-check-endpoints
  image: ${OPERATOR_IMAGE}
  imagePullPolicy: Always
IfNotPresent. Also, this is worth a bug or jira to write an e2e test for at some later date. cc @sttts
test for what?
@deads2k I think there actually is one, I've seen it fail.
	<-ctx.Done()
	return nil
})
config.DisableServing = true
you'll need metrics eventually.
Update the clusteroperator relatedResources to gather the new api resources. We should be able to see these in our runs.
}
c.Controller = factory.New().
	WithSync(c.Sync).
	WithInformers(checkInformer.Informer()).
This will cause self-triggering hot looping when you update status, won't it?
Done. Used WithBareInformers instead.
// Returns a new PodNetworkConnectivityCheckController that performs network connectivity checks
// as specified in the PodNetworkConnectivityChecks defined in the specified namespace, for the specified pod.
func New(podName, podNamespace string, checksGetter operatorcontrolplaneclientv1alpha1.PodNetworkConnectivityChecksGetter, checkInformer alpha1.PodNetworkConnectivityCheckInformer, recorder events.Recorder) PodNetworkConnectivityCheckController {
the golang "standard" sucks. Name this something real. Same with the filename.
Done. renamed.
c.Controller = factory.New().
	WithSync(c.Sync).
	WithBareInformers(checkInformer.Informer()).
	ResyncEvery(1*time.Second).
the new structure means you only need to trigger on check informer updating and you only need to handle one time.
At the very least, make this a sync every minute so we don't burn cpu
Done. Kept only the resync, now every minute. Didn't want to self-trigger on status updates.
}

// UpdateStatus implements v1alpha1helpers.PodNetworkConnectivityCheckClient
func (c *controller) UpdateStatus(ctx context.Context, check *operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, opts metav1.UpdateOptions) (*operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, error) {
is this used?
Yes. It implements the PodNetworkConnectivityCheckClient interface that ConnectionChecker uses.
}

// Get implements PodNetworkConnectivityCheckClient
func (c *controller) Get(name string) (*operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, error) {
no need for a one liner
It implements the PodNetworkConnectivityCheckClient interface that ConnectionChecker uses.
@@ -341,6 +358,115 @@ func manageKubeAPIServerCABundle(lister corev1listers.ConfigMapLister, client co
	return resourceapply.ApplyConfigMap(client, recorder, requiredConfigMap)
}

func managePodNetworkConnectivityChecks(ctx context.Context, client kubernetes.Interface,
Can we find a way to isolate this in another control loop?
Done.
@@ -39,3 +39,6 @@ status:
  resource: mutatingwebhookconfigurations
- group: admissionregistration.k8s.io
  resource: validatingwebhookconfigurations
- group: controlplane.operator.openshift.io
you also need to add this to the bit in starter.go that duplicates this information. This is used before the operator starts, the other is used afterwards.
Done.
@sanchezl can you remind me that I want to fix up oc admin inspect to put all the instances in a single file?
well, I do and I don't :( I guess I want a visualizer for it. ugh.
This is great! Even without must-gather, the upgrade shows outages on masters. I think some events may be missing. @mfojtik do we always use the unlimited event emitter? @sanchezl I think a descriptive name in the events (maybe the name of the check object?) will help a lot. "Connectivity outage detected: Failed to establish a TCP connection to 10.130.0.54:8443: dial tcp 10.130.0.54:8443: connect: connection refused" is much better than what we had before, but we still don't know what that IP is. The "no route to host" and "connection refused" cases show clearly. I suspect we may want one more sentence about which team to contact :)
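To make the suggestion about descriptive names concrete, here is a minimal sketch of a TCP check whose error carries the name of the check as well as the address. The `checkTCP` function, its parameters, and the check name used in `main` are hypothetical; only `net.DialTimeout` is a real standard-library call, and the message wording is modeled loosely on the event quoted above.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkTCP tries to open a TCP connection to address and closes it right away.
// On failure the returned error is prefixed with checkName (a hypothetical,
// human-readable label) so the reader does not have to decode a bare IP.
func checkTCP(checkName, address string, timeout time.Duration) error {
	conn, err := net.DialTimeout("tcp", address, timeout)
	if err != nil {
		return fmt.Errorf("%s: failed to establish a TCP connection to %s: %w", checkName, address, err)
	}
	return conn.Close()
}

func main() {
	// Listen on an ephemeral local port so the example needs no external host.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	addr := ln.Addr().String()

	// The listener is up, so the check is expected to succeed.
	fmt.Println(checkTCP("example-client-to-local-listener", addr, time.Second))

	// After closing the listener the same check is expected to fail,
	// and the error names the check as well as the address.
	ln.Close()
	fmt.Println(checkTCP("example-client-to-local-listener", addr, time.Second))
}
```

Wrapping with `%w` keeps the underlying dial error available to `errors.Is` and `errors.As`, so callers can still distinguish cases such as a refused connection from a timeout.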
2e7d73c to 33c0329
/refresh
var addresses []string
// each etcd
addresses = append(addresses, listAddressesForEtcd(operatorSpec, recorder)...)
In a future PR, I want some way to understand from the names, in plain English terms, what is connecting to what.
	LabelSelector: labels.Set{"node-role.kubernetes.io/master": ""}.AsSelector().String(),
})
if err != nil {
	recorder.Warningf("EndpointDetectionFailure", "failed to list master nodes: %v", err)
you can return right here
nm
	return nil
}

func managePodNetworkConnectivityChecks(ctx context.Context, client kubernetes.Interface,
I don't see where you're cleaning up if a master is removed.
You can make a bug and fix it post-merge
and when you do, you'll need a test
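As a rough idea of what the follow-up cleanup could look like, the sketch below computes which existing checks no longer correspond to a desired target, for example after a master is removed. The `staleChecks` function and the check names are hypothetical, and the sketch only computes the set; deleting the objects through the API (and any grace period for keeping dead ones around) is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// staleChecks returns the names present in existing but absent from desired,
// sorted for stable output. Both inputs are plain check names.
func staleChecks(existing, desired []string) []string {
	want := make(map[string]struct{}, len(desired))
	for _, name := range desired {
		want[name] = struct{}{}
	}

	var stale []string
	for _, name := range existing {
		if _, ok := want[name]; !ok {
			stale = append(stale, name)
		}
	}
	sort.Strings(stale)
	return stale
}

func main() {
	// Hypothetical names: master-1 has been removed from the cluster.
	existing := []string{"to-master-0", "to-master-1", "to-master-2"}
	desired := []string{"to-master-0", "to-master-2"}

	for _, name := range staleChecks(existing, desired) {
		fmt.Println("would delete", name)
	}
}
```

Keeping the set computation as a pure function makes the requested test straightforward: it can be exercised with plain slices, without a fake client.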
@@ -72,6 +78,7 @@ func RunOperator(ctx context.Context, controllerContext *controllercmd.Controlle
	operatorclient.TargetNamespace,
	operatorclient.OperatorNamespace,
	"openshift-etcd",
	"openshift-apiserver",
explain why in followup
There are follow-ups, but this is a great start. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: deads2k, sanchezl The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Ah, I remember this. I wanted a lifespan on them so I could keep dead ones around for a while.
/test e2e-aws-upgrade
For use in openshift/enhancements#289