
tests: Add DaemonSet with LB rolling update test #114052

Merged

Conversation


@ionutbalutoiu ionutbalutoiu commented Nov 21, 2022

What type of PR is this?

/sig network

What this PR does / why we need it:

Add a test case with a DaemonSet behind a simple load balancer whose address is being constantly hit via HTTP requests.

The test passes if there are no errors when doing HTTP requests to the load balancer address, during DaemonSet RollingUpdate operations.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

Tested against an Azure (AKS) cluster:

./e2e.test --ginkgo.focus="should not have connectivity disruption during rolling update" --kubeconfig $HOME/.kube/config --provider azure --cloud-config-file /workspace/azure.json

...

Ran 1 of 7070 Specs in 137.425 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 7069 Skipped
PASS

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/network Categorizes an issue or PR as relevant to SIG Network. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 21, 2022
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.26 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.26.0 release.

Fast forwards are scheduled to happen every 6 hours; the most recent run was: Mon Nov 21 21:28:31 UTC 2022.

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 21, 2022
@k8s-ci-robot
Contributor

@ionutbalutoiu: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 21, 2022
@k8s-ci-robot
Contributor

Hi @ionutbalutoiu. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ionutbalutoiu
Contributor Author

/cc: @andrewsykim

@aojea
Member

aojea commented Nov 21, 2022

We had these tests in OpenShift and it is not possible to guarantee zero errors in tests (nor in reality); you have to operate with a margin of error. Most of the flakes or errors come from the fact that the connection over the internet cannot be guaranteed, i.e. from the client to the load balancer: the internet doesn't guarantee zero packet loss.

@aojea
Member

aojea commented Nov 21, 2022

/ok-to-test

I'd like to see the assertion as a percentage of successes that we can iterate on. Let's say 95% to start with? We can start increasing it later.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 21, 2022
@ionutbalutoiu
Contributor Author

We had these tests in OpenShift and it is not possible to guarantee zero errors in tests (nor in reality); you have to operate with a margin of error. Most of the flakes or errors come from the fact that the connection over the internet cannot be guaranteed, i.e. from the client to the load balancer: the internet doesn't guarantee zero packet loss.

Good point!

I've taken this into consideration, and that's why I'm using wait.PollImmediate when doing the HTTP requests.

Basically, within e2eservice.LoadBalancerLagTimeoutDefault (which is set to 2 minutes), failed requests are retried before being counted as failures. Therefore, we cover the cases where we have transient networking errors (unrelated to the Kubernetes code).

So, the test will fail only if the load balancer doesn't properly update the backend pods during rolling updates.

@ionutbalutoiu
Contributor Author

/ok-to-test

I'd like to see the assertion as a percentage of successes that we can iterate on. Let's say 95% to start with? We can start increasing it later.

I like this suggestion as well.

We could drop the retry logic for the HTTP requests, and assert the percentage of failed requests out of the total requests made during the rolling update.

I'm fine with either implementation for the test.
Which one do you consider more robust?

@aojea
Member

aojea commented Nov 22, 2022

I don't have a strong preference; we just need granularity here, or failing tests will be undebuggable. I.e. if you do multiple rolling updates, you have to keep stats for each one.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 22, 2022
@ionutbalutoiu
Contributor Author

I don't have a strong preference; we just need granularity here, or failing tests will be undebuggable. I.e. if you do multiple rolling updates, you have to keep stats for each one.

I ended up implementing the logic to assert a minimum success rate across the total number of HTTP requests (without retrying each individual HTTP request).

There's a minimum threshold of a 95% success rate across the total load balancer HTTP requests. If this is not met, the test fails.

This way, we get the granularity we need, since we record exactly how many requests failed or succeeded.

When you get the chance, please see the updated PR.
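The success-rate approach described above can be sketched in plain Go (stdlib only; `lbStats`, `minSuccessRate`, and the sample numbers are illustrative, not the actual e2e test code):

```go
package main

import "fmt"

// lbStats records every request's outcome during the rolling update;
// the assertion happens once at the end instead of retrying per request.
type lbStats struct {
	total, networkErrors, httpErrors int
}

func (s lbStats) successRate() float64 {
	return float64(s.total-s.networkErrors-s.httpErrors) / float64(s.total)
}

func main() {
	const minSuccessRate = 0.95
	s := lbStats{total: 3148, networkErrors: 0, httpErrors: 0}
	fmt.Printf("Success rate: %.2f%%\n", s.successRate()*100) // prints: Success rate: 100.00%
	if s.successRate() < minSuccessRate {
		fmt.Println("FAIL: success rate below threshold")
	}
}
```

Keeping separate counters for network errors and HTTP errors preserves the granularity requested above: a failing run shows which class of error dominated.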

@wojtek-t
Member

/hold cancel

Thanks!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 24, 2022
ns := f.Namespace.Name
name := "test-lb-rolling-update"
labels := map[string]string{"name": name}
gracePeriod := int64(10)
Member

I recommend using a much higher value here, like 30. Even higher like 60 might be better, although that might make the test significantly slower.

The reason is that this value has to be at least ("LB health check interval" x "unhealthy threshold") + some buffer space. The value ends up depending on the LB configuration for the health check, so something like 30s or 60s would likely work across most default health check intervals. An extreme case would be something like a 10s interval with an unhealthy threshold of 3-5.

Member

If we increase the grace period to something like 60s, we might want to tweak the RollingUpdate strategy to use a higher maxUnavailable and maxSurge value, like 10%. This way the test time is still reasonable when testing against really large clusters

Contributor Author

It makes sense. I updated the PR to use:

  • gracePeriod set to 60 seconds
  • RollingUpdateDaemonSet.MaxUnavailable set to 10%

Member

cc @wojtek-t re: large scale scalability test -- I think 60s termination grace period is fine with maxUnavailable: 10% but I could be missing something

Member

Didn't look into PR, so commenting only based on the comment thread.

Member

Sorry - clicked send too quickly...

A 60s grace period with 10% unavailable means the rolling update takes 10m+. It's somewhat large, but assuming we don't have multiple such updates, it seems acceptable.

@andrewsykim
Member

Dec  5 13:50:26.824: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 3
Dec  5 13:50:26.824: INFO: Node bootstrap-e2e-minion-group-cw8h is running 0 daemon pod, expected 1
Dec  5 13:50:28.824: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 4
Dec  5 13:50:28.824: INFO: Number of running nodes: 4, number of available pods: 4 in daemonset test-lb-rolling-update
Dec  5 13:50:28.824: INFO: Load Balancer total HTTP requests: 3148
Dec  5 13:50:28.824: INFO: Network errors: 0
Dec  5 13:50:28.824: INFO: HTTP errors: 0
Dec  5 13:50:28.824: INFO: Success rate: 100.00%
Dec  5 13:50:28.824: INFO: Update daemon pods environment: [{"name":"VERSION","value":"4"}]
Dec  5 13:50:28.872: INFO: Check that daemon pods are still running on every node of the cluster.

Awesome :)


@andrewsykim andrewsykim left a comment


/approve

@aojea can you do another pass?

// We start with a low but reasonable threshold to analyze the results.
// The goal is to reach a 99% minimum success rate.
// TODO: We should take incremental steps toward the goal.
minSuccessRate := 0.95
Member

Based on the success rate here, I'm still inclined to make this value higher, but I think we can increase it in a follow-up after we have some CI run data.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, ionutbalutoiu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 5, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 12, 2022
Add a test case with a DaemonSet behind a simple load balancer whose
address is being constantly hit via HTTP requests.

The test passes if there are no errors when doing HTTP requests to the
load balancer address, during DaemonSet `RollingUpdate` operations.

Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
@ionutbalutoiu
Contributor Author

Hello everyone,

I've noticed this in some of my testing:

Dec 12 02:30:29.502: INFO: Update daemon pods environment: [{"name":"VERSION","value":"1"}]
Dec 12 02:30:29.508: INFO: Check that daemon pods are still running on every node of the cluster.
Dec 12 02:30:29.512: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 2
Dec 12 02:30:29.512: INFO: Number of running nodes: 2, number of available pods: 2 in daemonset test-lb-rolling-update
Dec 12 02:30:29.512: INFO: Load Balancer total HTTP requests: 7
Dec 12 02:30:29.512: INFO: Network errors: 0
Dec 12 02:30:29.512: INFO: HTTP errors: 0
Dec 12 02:30:29.512: INFO: Success rate: 100.00%
Dec 12 02:30:29.512: INFO: Update daemon pods environment: [{"name":"VERSION","value":"2"}]
Dec 12 02:30:29.517: INFO: Check that daemon pods are still running on every node of the cluster.
Dec 12 02:30:29.521: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 2
Dec 12 02:30:29.521: INFO: Number of running nodes: 2, number of available pods: 2 in daemonset test-lb-rolling-update
Dec 12 02:30:29.521: INFO: Load Balancer total HTTP requests: 9
Dec 12 02:30:29.521: INFO: Network errors: 0
Dec 12 02:30:29.521: INFO: HTTP errors: 0
Dec 12 02:30:29.521: INFO: Success rate: 100.00%
Dec 12 02:30:29.521: INFO: Update daemon pods environment: [{"name":"VERSION","value":"3"}]
Dec 12 02:30:29.526: INFO: Check that daemon pods are still running on every node of the cluster.
Dec 12 02:30:29.529: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:29.529: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
Dec 12 02:30:31.535: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:31.535: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
Dec 12 02:30:33.534: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:33.534: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
Dec 12 02:30:35.534: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:35.534: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
Dec 12 02:30:37.534: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:37.534: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
Dec 12 02:30:39.535: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:39.535: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
....
....
....
Dec 12 02:32:53.536: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 2
Dec 12 02:32:53.536: INFO: Number of running nodes: 2, number of available pods: 2 in daemonset test-lb-rolling-update
Dec 12 02:32:53.536: INFO: Load Balancer total HTTP requests: 74741
Dec 12 02:32:53.536: INFO: Network errors: 0
Dec 12 02:32:53.536: INFO: HTTP errors: 0
Dec 12 02:32:53.536: INFO: Success rate: 100.00%
Dec 12 02:32:53.536: INFO: Update daemon pods environment: [{"name":"VERSION","value":"4"}]
Dec 12 02:32:53.542: INFO: Check that daemon pods are still running on every node of the cluster.
Dec 12 02:32:53.545: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 2
Dec 12 02:32:53.545: INFO: Number of running nodes: 2, number of available pods: 2 in daemonset test-lb-rolling-update
Dec 12 02:32:53.545: INFO: Load Balancer total HTTP requests: 0
Dec 12 02:32:53.545: INFO: Network errors: 0
Dec 12 02:32:53.545: INFO: HTTP errors: 0
Dec 12 02:32:53.545: INFO: Success rate: NaN%
Dec 12 02:32:53.545: INFO: Update daemon pods environment: [{"name":"VERSION","value":"5"}]
Dec 12 02:32:53.551: INFO: Check that daemon pods are still running on every node of the cluster.
Dec 12 02:32:53.554: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 2
Dec 12 02:32:53.554: INFO: Number of running nodes: 2, number of available pods: 2 in daemonset test-lb-rolling-update
Dec 12 02:32:53.554: INFO: Load Balancer total HTTP requests: 0
Dec 12 02:32:53.554: INFO: Network errors: 0
Dec 12 02:32:53.554: INFO: HTTP errors: 0
Dec 12 02:32:53.554: INFO: Success rate: NaN%
Dec 12 02:32:53.554: INFO: Poking "http://40.89.240.212:80/echo?msg=hello"
Dec 12 02:32:53.555: INFO: Poke("http://40.89.240.212:80/echo?msg=hello"): success

It seems that the polling logic after the pod environment update was sometimes buggy: as shown above, some rolling updates were considered complete immediately, with zero HTTP requests recorded in between.

I've updated the polling logic to also validate the updated pod containers' environment before considering a rolling update complete.
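As a side note, the "Success rate: NaN%" lines come from computing a percentage over zero recorded requests (0.0/0.0 is NaN for IEEE 754 floats). A stdlib-only sketch of a zero-total guard (names are hypothetical, not the actual test code):

```go
package main

import "fmt"

// successRatePercent guards against a zero request count, which would
// otherwise produce 0.0/0.0 == NaN.
func successRatePercent(successes, total int) (float64, bool) {
	if total == 0 {
		return 0, false // no requests recorded; rate undefined
	}
	return 100 * float64(successes) / float64(total), true
}

func main() {
	if rate, ok := successRatePercent(0, 0); !ok {
		fmt.Println("no requests recorded") // prints: no requests recorded
	} else {
		fmt.Printf("Success rate: %.2f%%\n", rate)
	}
	rate, _ := successRatePercent(74741, 74741)
	fmt.Printf("Success rate: %.2f%%\n", rate) // prints: Success rate: 100.00%
}
```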

Also, a rebase was needed to have this merged.

@andrewsykim @aojea please take a look at the PR again when you get the chance.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 12, 2022
@dims
Member

dims commented Dec 12, 2022

If you still need this PR then please rebase, if not, please close the PR

@ionutbalutoiu
Contributor Author

If you still need this PR then please rebase, if not, please close the PR

I already rebased. The branch behind the pull request is already in sync with the latest upstream master branch.

Also, a rebase was needed to have this merged.

This was an FYI that I had already done it together with the mentioned changes.

@ionutbalutoiu
Contributor Author

/retest-require

@aojea
Member

aojea commented Dec 13, 2022

/retest

@aojea
Member

aojea commented Dec 13, 2022

/lgtm

Thanks, great work, let's see how it evolves on CI

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 13, 2022
@k8s-ci-robot k8s-ci-robot merged commit 73ed9e7 into kubernetes:master Dec 13, 2022
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.27, v1.26 Dec 13, 2022
@ionutbalutoiu ionutbalutoiu deleted the tests/lb-rolling-update branch December 13, 2022 10:17
@liggitt liggitt modified the milestones: v1.26, v1.27 Dec 13, 2022