Improve LB session affinity tests #92427

Merged
merged 1 commit into from Jun 26, 2020

wojtek-t
Member

@wojtek-t commented Jun 23, 2020

In preparation for enabling these tests on large (5k-node) clusters.

Ref: #56138

Release note: NONE

@wojtek-t added the sig/network and priority/important-longterm labels Jun 23, 2020
@k8s-ci-robot added the release-note-none, size/S, needs-kind, and cncf-cla: yes labels Jun 23, 2020
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved, area/test, and sig/testing labels Jun 23, 2020
@wojtek-t
Member Author

The failure `service is not reachable within 2m0s timeout on endpoint affinity-clusterip-transition:80 over TCP protocol` doesn't seem to be related to my change.

/retest

Contributor

@danwinship left a comment

It looks like these tests were marked [DisabledForLargeClusters] because there were problems with NodePort services in very large clusters... and it doesn't seem like this patch addresses that problem? Why exactly is this needed for re-enabling the tests?

if execPod == nil {
timeout = e2eservice.GetServiceLoadBalancerPropagationTimeout(cs)
interval = 2 * time.Second
Contributor

ugh... this code is a mess; if execPod != nil then each iteration of the PollImmediate makes AffinityConfirmCount attempts to connect, but if execPod == nil then each iteration makes a single attempt to connect. So... the change to interval here is correct, but it would be better to rewrite this some more...
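A simplified model of the mismatch described in that comment, not the test's actual code: `attemptsPerIteration` and the `connect` callback are hypothetical names, and `affinityConfirmCount` stands in for `AffinityConfirmCount` with an assumed value of 15. It just counts how many requests one poll iteration sends on each path.

```go
package main

import "fmt"

// affinityConfirmCount stands in for AffinityConfirmCount in the e2e test;
// the value 15 is an assumption made for this sketch.
const affinityConfirmCount = 15

// attemptsPerIteration models one iteration of the poll loop as described in
// the review: with an exec pod the body sends affinityConfirmCount requests,
// without one it sends a single request. connect is a hypothetical callback
// standing in for one request to the service. It returns the request count.
func attemptsPerIteration(haveExecPod bool, connect func()) int {
	n := 1
	if haveExecPod {
		n = affinityConfirmCount
	}
	for i := 0; i < n; i++ {
		connect()
	}
	return n
}

func main() {
	const iterations = 10 // hypothetical number of poll iterations before the timeout
	for _, haveExecPod := range []bool{true, false} {
		requests := 0
		for it := 0; it < iterations; it++ {
			attemptsPerIteration(haveExecPod, func() { requests++ })
		}
		fmt.Printf("execPod=%v: %d requests in %d iterations\n", haveExecPod, requests, iterations)
	}
}
```

Under this model, the same interval and timeout give the exec-pod path `affinityConfirmCount` times as many requests as the direct path, so polling parameters tuned for one path do not carry over to the other.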

Member Author

@danwinship - would splitting these into two somewhat separate functions address this concern?

Contributor

That was my first thought, but there's a lot of code that's shared too so that might result in too much duplication? I dunno. Maybe just make the non-execPod case inside the poll also do AffinityConfirmCount tries?
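One possible reading of that suggestion, sketched under assumptions: both paths go through a single hypothetical `getHost` callback (which would hide whether the request is sent from the exec pod or directly from the test binary), so every poll iteration makes the same number of attempts. `checkAffinityOnce` and `getHost` are illustrative names, not the test's real helpers, and `affinityConfirmCount` again stands in for `AffinityConfirmCount` with an assumed value.

```go
package main

import "fmt"

// affinityConfirmCount stands in for AffinityConfirmCount; 15 is an assumed value.
const affinityConfirmCount = 15

// checkAffinityOnce is one poll iteration. getHost is a hypothetical callback
// returning the backend that answered a single request; because both the
// exec-pod and the direct path would hide behind it, both make the same
// number of attempts. It reports true only if all affinityConfirmCount
// responses came from one backend.
func checkAffinityOnce(getHost func() (string, error)) (bool, error) {
	first := ""
	for i := 0; i < affinityConfirmCount; i++ {
		host, err := getHost()
		if err != nil {
			// Let the caller's poll loop decide whether to retry.
			return false, err
		}
		if i == 0 {
			first = host
		} else if host != first {
			// A different backend answered: affinity is not (yet) in effect.
			return false, nil
		}
	}
	return true, nil
}

func main() {
	sticky := func() (string, error) { return "backend-a", nil }

	calls := 0
	alternating := func() (string, error) {
		calls++
		if calls%2 == 0 {
			return "backend-b", nil
		}
		return "backend-a", nil
	}

	ok, err := checkAffinityOnce(sticky)
	fmt.Println("sticky:", ok, err)
	ok, err = checkAffinityOnce(alternating)
	fmt.Println("alternating:", ok, err)
}
```

The outer poll loop would then only decide the interval and timeout, while the per-iteration logic stays identical for both cases.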

@wojtek-t
Member Author

It looks like these tests were marked [DisabledForLargeClusters] because there were problems with NodePort services in very large clusters... and it doesn't seem like this patch addresses that problem? Why exactly is this needed for re-enabling the tests?

We enabled these tests a couple of weeks ago and they were flaky back then (mostly timeouts). I would like to re-enable them, and I think this change may help reduce that flakiness.

@k8s-ci-robot added the size/M label and removed the size/S label Jun 24, 2020
}

interval, timeout, getHosts := pollingArgs()
Member Author

@danwinship - thanks a lot for the first pass of review; I refactored this test a bit and hopefully it's much cleaner now.
PTAL

@wojtek-t
Member Author

/retest

@wojtek-t
Member Author

@danwinship - friendly ping


return func() (time.Duration, time.Duration, func() []string) {
return interval, timeout, getHosts
}
Contributor

weird to have these return a function that returns those arguments, rather than just returning the arguments themselves
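A sketch of the shape the reviewer suggests, where the helper returns the polling parameters directly instead of a closure that returns them. `pollingArgs`, `lbPropagationTimeout`, and the exec-pod values are assumptions for illustration; in the test, the non-exec-pod timeout comes from `e2eservice.GetServiceLoadBalancerPropagationTimeout(cs)` and the interval is `2 * time.Second`, as in the diff excerpt earlier in this thread. `getHosts` is left out to keep the sketch small.

```go
package main

import (
	"fmt"
	"time"
)

// lbPropagationTimeout stands in for the value returned by
// e2eservice.GetServiceLoadBalancerPropagationTimeout(cs); 10 minutes is an
// assumed figure for this sketch.
var lbPropagationTimeout = 10 * time.Minute

// pollingArgs returns the poll interval and timeout directly, rather than
// returning a func() that yields them. The exec-pod values are assumptions.
func pollingArgs(haveExecPod bool) (interval, timeout time.Duration) {
	if !haveExecPod {
		// Requests go straight to the load balancer, so allow for propagation.
		return 2 * time.Second, lbPropagationTimeout
	}
	return 5 * time.Second, 2 * time.Minute // assumed exec-pod values
}

func main() {
	interval, timeout := pollingArgs(false)
	fmt.Println(interval, timeout)
}
```

The call site reads much like the closure version; the difference is that the helper's signature no longer wraps its results in an extra function type.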

Member Author

The function itself gets slightly longer then, but I'm fine with changing it.
Done

@@ -147,6 +192,7 @@ func checkAffinity(cs clientset.Interface, execPod *v1.Pod, serviceIP string, se
return false
}
if !trackerFulfilled {
serviceIPPort := net.JoinHostPort(serviceIP, strconv.Itoa(servicePort))
checkAffinityFailed(tracker, fmt.Sprintf("Connection to %s timed out or not enough responses.", serviceIPPort))
Contributor

(you could just remove serviceIPPort from the error message... it's obvious from context)
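For context on the `serviceIPPort` line under discussion: `net.JoinHostPort` is used instead of concatenating `ip + ":" + port` because it wraps IPv6 literals in brackets. `formatEndpoint` below is a hypothetical helper that mirrors that expression from the diff.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// formatEndpoint mirrors the serviceIPPort expression from the diff:
// net.JoinHostPort brackets IPv6 addresses, which naive string
// concatenation would not.
func formatEndpoint(serviceIP string, servicePort int) string {
	return net.JoinHostPort(serviceIP, strconv.Itoa(servicePort))
}

func main() {
	fmt.Println(formatEndpoint("10.0.0.10", 80)) // prints "10.0.0.10:80"
	fmt.Println(formatEndpoint("fd00::10", 80))  // prints "[fd00::10]:80"
}
```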

Member Author

done

Member Author

@wojtek-t left a comment

@danwinship - thanks; PTAL


@wojtek-t added the kind/cleanup label Jun 26, 2020
@k8s-ci-robot removed the needs-kind label Jun 26, 2020
@wojtek-t
Member Author

/retest

@danwinship
Contributor

/lgtm

Labels
approved, area/test, cncf-cla: yes, kind/cleanup, lgtm, priority/important-longterm, release-note-none, sig/network, sig/testing, size/M