tests: Add DaemonSet with LB rolling update test #114052
Conversation
Please note that we're already in Test Freeze for the release branch. Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Mon Nov 21 21:28:31 UTC 2022.
@ionutbalutoiu: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @ionutbalutoiu. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @andrewsykim
We had these tests in OpenShift, and it is not possible to guarantee zero errors in tests (nor in reality); you have to operate with a margin of error. Most of the flakes or errors come from the fact that the connection over the internet, i.e. from the client to the load balancer, cannot be guaranteed: the internet doesn't guarantee zero packet loss.
/ok-to-test I'd like to see the assertion as a percentage of successes that we can iterate on, let's say 95% to start with? We can increase it later.
Good point! I've taken this into consideration, and that's why I'm using retries when doing the HTTP requests. Basically, within the retry window, transient failures are tolerated. So, the test will fail only if the load balancer doesn't properly update the backend pods during rolling updates.
I like this suggestion as well. We could drop the retry logic for the HTTP requests and assert the percentage of failed requests out of the total requests made during the rolling update. I'm fine with either implementation for the test.
I don't have a strong preference, we just have to have granularity here or failing tests will be undebuggable; i.e. if you do a rolling update, you have to keep stats for each one, ...
Force-pushed from 886e699 to f26e460
I ended up implementing logic to assert a minimum success rate over the total number of HTTP requests (without retrying each HTTP request individually). There's a minimum threshold of 95% success rate across all load balancer HTTP requests; if this is not achieved, the test fails. This gives us granularity, as we record exactly how many requests failed or succeeded. When you get the chance, please take a look at the updated PR.
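For illustration, a minimal Go sketch of that accounting under stated assumptions (testDuration, pollInterval, lbAddress, and minSuccessRate are illustrative names, not necessarily the PR's identifiers; framework.Failf is the e2e framework's failure helper):

// Hit the load balancer address in a loop while the rolling update runs,
// then assert a minimum success rate over all requests made.
deadline := time.Now().Add(testDuration) // assumed duration covering the rolling update
successes, failures := 0, 0
for time.Now().Before(deadline) {
	resp, err := http.Get(lbAddress) // lbAddress: assumed load balancer URL
	if err != nil {
		failures++
	} else {
		resp.Body.Close()
		successes++
	}
	time.Sleep(pollInterval) // assumed, e.g. 1 * time.Second
}
total := successes + failures
if rate := float64(successes) / float64(total); rate < minSuccessRate {
	framework.Failf("LB success rate %.2f below minimum %.2f (%d of %d requests failed)",
		rate, minSuccessRate, failures, total)
}

Recording the raw success/failure counts is what gives the debuggable granularity discussed above: a failing run reports exactly how many requests were dropped, not just that "something failed".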
Force-pushed from f26e460 to 8ae45e7
Force-pushed from 195f5cf to 2f332c1
/hold cancel
Thanks!
test/e2e/network/loadbalancer.go
ns := f.Namespace.Name
name := "test-lb-rolling-update"
labels := map[string]string{"name": name}
gracePeriod := int64(10)
I recommend using a much higher value here, like 30. Even higher, like 60, might be better, although that might make the test significantly slower.
The reason is that this value has to be at least ("LB health check interval" x "unhealthy threshold") + some buffer space. The value ends up depending on the LB's health check configuration, so something like 30s or 60s would likely work across most default health check intervals. Extreme cases would be something like a 10s interval with a 3-5 unhealthy threshold.
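To make that sizing rule concrete, here is a small illustrative calculation in Go; the interval, threshold, and buffer values are assumptions for the example, not values taken from any particular load balancer:

// gracePeriod must cover the time the LB needs to mark a terminating pod unhealthy.
healthCheckInterval := 10 * time.Second                                           // assumed LB health check interval
unhealthyThreshold := 3                                                           // assumed failed probes before the LB drops a backend
buffer := 20 * time.Second                                                        // extra slack for propagation delays
minGracePeriod := time.Duration(unhealthyThreshold)*healthCheckInterval + buffer  // 30s + 20s = 50s, so 60s is comfortably above it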
If we increase the grace period to something like 60s, we might want to tweak the RollingUpdate strategy to use a higher maxUnavailable and maxSurge value, like 10%. This way the test time is still reasonable when testing against really large clusters.
It makes sense. I updated the PR to use:
gracePeriod set to 60 seconds
RollingUpdateDaemonSet.MaxUnavailable set to 10%
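For reference, a minimal Go sketch of what those two settings look like on a DaemonSet spec, using the apps/v1 and intstr packages; the variable names (ds, gracePeriod, maxUnavailable) are illustrative and may not match the PR's actual code:

// Assumes ds is an *appsv1.DaemonSet being built by the test.
maxUnavailable := intstr.FromString("10%")
ds.Spec.UpdateStrategy = appsv1.DaemonSetUpdateStrategy{
	Type: appsv1.RollingUpdateDaemonSetStrategyType,
	RollingUpdate: &appsv1.RollingUpdateDaemonSet{
		MaxUnavailable: &maxUnavailable,
	},
}
gracePeriod := int64(60)
ds.Spec.Template.Spec.TerminationGracePeriodSeconds = &gracePeriod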
cc @wojtek-t re: large scale scalability test -- I think 60s termination grace period is fine with maxUnavailable: 10% but I could be missing something
I didn't look into the PR, so I'm commenting only based on the comment thread.
Sorry - clicked send too quickly...
A 60s grace period with 10% maxUnavailable means the rolling update takes 10m+ (roughly 10 batches, each waiting up to the 60s grace period). It's somewhat large, but assuming we don't have multiple such updates, it seems acceptable.
Force-pushed from 2f332c1 to 48fc80e
Awesome :)
/approve
@aojea can you do another pass?
// We start with a low but reasonable threshold to analyze the results.
// The goal is to achieve 99% minimum success rate.
// TODO: We should do incremental steps toward the goal.
minSuccessRate := 0.95
Based on the success rate here, I'm still inclined to make this value higher, but I think we can increase it in a follow-up after we have some CI run data.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andrewsykim, ionutbalutoiu. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 48fc80e to f9cec3c
Add a test case with a DaemonSet behind a simple load balancer whose address is being constantly hit via HTTP requests. The test passes if there are no errors when doing HTTP requests to the load balancer address, during DaemonSet `RollingUpdate` operations. Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
Hello everyone, I've noticed this in some of my testing:
It seems that the polling logic run after the pods' environment was updated was sometimes buggy. I've updated the polling logic to also validate the updated pod containers' environment before considering a rolling update complete. Also, a rebase was needed to have this merged. @andrewsykim @aojea please take a look at the PR again when you get the chance.
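For illustration, a Go sketch of that kind of check under stated assumptions (cs is the test clientset, ns/name are the test namespace and label value, and the "TEST_ENV"/updatedValue pair stands in for whatever environment variable the rolling update actually changes; the PR's real code may differ):

// Poll until every DaemonSet pod is running with the updated container environment.
err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
	pods, err := cs.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: "name=" + name})
	if err != nil {
		return false, err
	}
	for _, pod := range pods.Items {
		if pod.Status.Phase != v1.PodRunning {
			return false, nil
		}
		updated := false
		for _, env := range pod.Spec.Containers[0].Env {
			if env.Name == "TEST_ENV" && env.Value == updatedValue { // assumed env var changed by the update
				updated = true
			}
		}
		if !updated {
			return false, nil
		}
	}
	return true, nil
})
framework.ExpectNoError(err, "DaemonSet pods were not updated in time")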
Force-pushed from f9cec3c to 3feea9d
If you still need this PR then please rebase; if not, please close the PR.
I already rebased. The branch behind the pull request is already in sync with the latest upstream.
This was an FYI that I already did it together with the mentioned changes.
/retest-required
/retest
/lgtm Thanks, great work, let's see how it evolves in CI.
What type of PR is this?
/sig network
What this PR does / why we need it:
Add a test case with a DaemonSet behind a simple load balancer whose address is being constantly hit via HTTP requests.
The test passes if there are no errors when doing HTTP requests to the load balancer address during DaemonSet RollingUpdate operations.
Which issue(s) this PR fixes:
N/A
Special notes for your reviewer:
Tested against an Azure (AKS) cluster:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: