
tests: Add DaemonSet with LB rolling update test #114052

Merged

Conversation


@ionutbalutoiu ionutbalutoiu commented Nov 21, 2022

What type of PR is this?

/sig network

What this PR does / why we need it:

Add a test case with a DaemonSet behind a simple load balancer whose address is being constantly hit via HTTP requests.

The test passes if there are no errors when doing HTTP requests to the load balancer address, during DaemonSet RollingUpdate operations.

Which issue(s) this PR fixes:

N/A

Special notes for your reviewer:

Tested against an Azure (AKS) cluster:

./e2e.test --ginkgo.focus="should not have connectivity disruption during rolling update" --kubeconfig $HOME/.kube/config --provider azure --cloud-config-file /workspace/azure.json

...

Ran 1 of 7070 Specs in 137.425 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 7069 Skipped
PASS

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/network Categorizes an issue or PR as relevant to SIG Network. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 21, 2022
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.26 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.26.0 release.

Fast forwards are scheduled to happen every 6 hours; the most recent run was: Mon Nov 21 21:28:31 UTC 2022.

@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Nov 21, 2022
@k8s-ci-robot
Contributor

@ionutbalutoiu: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 21, 2022
@k8s-ci-robot
Contributor

Hi @ionutbalutoiu. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ionutbalutoiu
Contributor Author

/cc: @andrewsykim

@aojea
Member

aojea commented Nov 21, 2022

We had these tests in OpenShift and it is not possible to guarantee zero errors in tests (nor in reality); you have to operate with a margin of error. Most of the flakes or errors come from the fact that the connection over the internet cannot be guaranteed, i.e. from the client to the load balancer: the internet doesn't guarantee zero packet loss.

@aojea
Member

aojea commented Nov 21, 2022

/ok-to-test

I'd like to see the assertion as a percentage of successes that we can iterate on. Let's say 95% to start with? We can start increasing it later.

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Nov 21, 2022
@ionutbalutoiu
Contributor Author

We had these tests in OpenShift and it is not possible to guarantee zero errors in tests (nor in reality); you have to operate with a margin of error. Most of the flakes or errors come from the fact that the connection over the internet cannot be guaranteed, i.e. from the client to the load balancer: the internet doesn't guarantee zero packet loss.

Good point!

I've taken this into consideration, and that's why I'm using wait.PollImmediate when doing the HTTP requests.

Basically, within e2eservice.LoadBalancerLagTimeoutDefault (which is set to 2 minutes), failed requests are retried before being counted as failures. Therefore, we cover the cases where we have transient networking errors (unrelated to the Kubernetes code).

So, the test will fail only if the load balancer doesn't properly update the backend pods during rolling updates.

@ionutbalutoiu
Contributor Author

/ok-to-test

I'd like to see the assertion as a percentage of successes that we can iterate on. Let's say 95% to start with? We can start increasing it later.

I like this suggestion as well.

We could drop the retry logic for the HTTP requests, and assert the percentage of failed requests out of the total requests made during the rolling update.

I'm fine with either implementation for the test.
Which one do you consider more robust?

@aojea
Member

aojea commented Nov 22, 2022

I don't have a strong preference; we just need granularity here, or failing tests will be undebuggable. I.e. if you do multiple rolling updates, you have to keep stats for each one.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 22, 2022
@ionutbalutoiu
Contributor Author

I don't have a strong preference; we just need granularity here, or failing tests will be undebuggable. I.e. if you do multiple rolling updates, you have to keep stats for each one.

I ended up implementing the logic to assert a minimum success rate across the total number of HTTP requests (without retrying each individual HTTP request).

There's a minimum threshold of a 95% success rate across the total load balancer HTTP requests. If this is not met, the test fails.

This way, we get the granularity we need, since we record exactly how many requests failed or succeeded.

When you get the chance, please see the updated PR.
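The success-rate approach described above can be sketched in plain Go (stdlib only; `lbStats`, `minSuccessRate`, and the sample numbers are illustrative, not the actual e2e test code):

```go
package main

import "fmt"

// lbStats records every request's outcome during the rolling update;
// the assertion happens once at the end instead of retrying per request.
type lbStats struct {
	total, networkErrors, httpErrors int
}

func (s lbStats) successRate() float64 {
	return float64(s.total-s.networkErrors-s.httpErrors) / float64(s.total)
}

func main() {
	const minSuccessRate = 0.95
	s := lbStats{total: 3148, networkErrors: 0, httpErrors: 0}
	fmt.Printf("Success rate: %.2f%%\n", s.successRate()*100) // prints: Success rate: 100.00%
	if s.successRate() < minSuccessRate {
		fmt.Println("FAIL: success rate below threshold")
	}
}
```

Keeping separate counters for network errors and HTTP errors preserves the granularity requested above: a failing run shows which class of error dominated.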

@wojtek-t
Member

/hold cancel

Thanks!

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 24, 2022
ns := f.Namespace.Name
name := "test-lb-rolling-update"
labels := map[string]string{"name": name}
gracePeriod := int64(10)
Member

I recommend using a much higher value here, like 30. Even higher like 60 might be better, although that might make the test significantly slower.

The reason is that this value has to be at least ("LB health check interval" x "unhealthy threshold") + some buffer space. The value ends up depending on the LB configuration for the health check, so something like 30s or 60s would likely work across most default health check intervals. An extreme case would be something like a 10s interval with an unhealthy threshold of 3-5.

Member

If we increase the grace period to something like 60s, we might want to tweak the RollingUpdate strategy to use a higher maxUnavailable and maxSurge value, like 10%. This way the test time is still reasonable when testing against really large clusters

Contributor Author

It makes sense. I updated the PR to use:

  • gracePeriod set to 60 seconds
  • RollingUpdateDaemonSet.MaxUnavailable set to 10%

Member

cc @wojtek-t re: large scale scalability test -- I think 60s termination grace period is fine with maxUnavailable: 10% but I could be missing something

Member

Didn't look into PR, so commenting only based on the comment thread.

Member

Sorry - clicked send too quickly...

A 60s grace period with 10% unavailable means the rolling update takes 10m+. It's somewhat large, but assuming we don't have multiple such updates, it seems acceptable.

@andrewsykim
Member

Dec  5 13:50:26.824: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 3
Dec  5 13:50:26.824: INFO: Node bootstrap-e2e-minion-group-cw8h is running 0 daemon pod, expected 1
Dec  5 13:50:28.824: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 4
Dec  5 13:50:28.824: INFO: Number of running nodes: 4, number of available pods: 4 in daemonset test-lb-rolling-update
Dec  5 13:50:28.824: INFO: Load Balancer total HTTP requests: 3148
Dec  5 13:50:28.824: INFO: Network errors: 0
Dec  5 13:50:28.824: INFO: HTTP errors: 0
Dec  5 13:50:28.824: INFO: Success rate: 100.00%
Dec  5 13:50:28.824: INFO: Update daemon pods environment: [{"name":"VERSION","value":"4"}]
Dec  5 13:50:28.872: INFO: Check that daemon pods are still running on every node of the cluster.

Awesome :)


@andrewsykim andrewsykim left a comment


/approve

@aojea can you do another pass?

// We start with a low but reasonable threshold to analyze the results.
// The goal is to reach a 99% minimum success rate.
// TODO: We should take incremental steps toward the goal.
minSuccessRate := 0.95
Member

Based on the success rate here, I'm still inclined to make this value higher, but I think we can increase it in a follow-up after we have some CI run data.

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andrewsykim, ionutbalutoiu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 5, 2022
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 12, 2022
Add a test case with a DaemonSet behind a simple load balancer whose
address is being constantly hit via HTTP requests.

The test passes if there are no errors when doing HTTP requests to the
load balancer address, during DaemonSet `RollingUpdate` operations.

Signed-off-by: Ionut Balutoiu <ibalutoiu@cloudbasesolutions.com>
@ionutbalutoiu
Contributor Author

Hello everyone,

I've noticed this in some of my testing:

Dec 12 02:30:29.502: INFO: Update daemon pods environment: [{"name":"VERSION","value":"1"}]
Dec 12 02:30:29.508: INFO: Check that daemon pods are still running on every node of the cluster.
Dec 12 02:30:29.512: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 2
Dec 12 02:30:29.512: INFO: Number of running nodes: 2, number of available pods: 2 in daemonset test-lb-rolling-update
Dec 12 02:30:29.512: INFO: Load Balancer total HTTP requests: 7
Dec 12 02:30:29.512: INFO: Network errors: 0
Dec 12 02:30:29.512: INFO: HTTP errors: 0
Dec 12 02:30:29.512: INFO: Success rate: 100.00%
Dec 12 02:30:29.512: INFO: Update daemon pods environment: [{"name":"VERSION","value":"2"}]
Dec 12 02:30:29.517: INFO: Check that daemon pods are still running on every node of the cluster.
Dec 12 02:30:29.521: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 2
Dec 12 02:30:29.521: INFO: Number of running nodes: 2, number of available pods: 2 in daemonset test-lb-rolling-update
Dec 12 02:30:29.521: INFO: Load Balancer total HTTP requests: 9
Dec 12 02:30:29.521: INFO: Network errors: 0
Dec 12 02:30:29.521: INFO: HTTP errors: 0
Dec 12 02:30:29.521: INFO: Success rate: 100.00%
Dec 12 02:30:29.521: INFO: Update daemon pods environment: [{"name":"VERSION","value":"3"}]
Dec 12 02:30:29.526: INFO: Check that daemon pods are still running on every node of the cluster.
Dec 12 02:30:29.529: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:29.529: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
Dec 12 02:30:31.535: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:31.535: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
Dec 12 02:30:33.534: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:33.534: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
Dec 12 02:30:35.534: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:35.534: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
Dec 12 02:30:37.534: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:37.534: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
Dec 12 02:30:39.535: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 1
Dec 12 02:30:39.535: INFO: Node akswinagt000001 is running 0 daemon pod, expected 1
....
....
....
Dec 12 02:32:53.536: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 2
Dec 12 02:32:53.536: INFO: Number of running nodes: 2, number of available pods: 2 in daemonset test-lb-rolling-update
Dec 12 02:32:53.536: INFO: Load Balancer total HTTP requests: 74741
Dec 12 02:32:53.536: INFO: Network errors: 0
Dec 12 02:32:53.536: INFO: HTTP errors: 0
Dec 12 02:32:53.536: INFO: Success rate: 100.00%
Dec 12 02:32:53.536: INFO: Update daemon pods environment: [{"name":"VERSION","value":"4"}]
Dec 12 02:32:53.542: INFO: Check that daemon pods are still running on every node of the cluster.
Dec 12 02:32:53.545: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 2
Dec 12 02:32:53.545: INFO: Number of running nodes: 2, number of available pods: 2 in daemonset test-lb-rolling-update
Dec 12 02:32:53.545: INFO: Load Balancer total HTTP requests: 0
Dec 12 02:32:53.545: INFO: Network errors: 0
Dec 12 02:32:53.545: INFO: HTTP errors: 0
Dec 12 02:32:53.545: INFO: Success rate: NaN%
Dec 12 02:32:53.545: INFO: Update daemon pods environment: [{"name":"VERSION","value":"5"}]
Dec 12 02:32:53.551: INFO: Check that daemon pods are still running on every node of the cluster.
Dec 12 02:32:53.554: INFO: Number of nodes with available pods controlled by daemonset test-lb-rolling-update: 2
Dec 12 02:32:53.554: INFO: Number of running nodes: 2, number of available pods: 2 in daemonset test-lb-rolling-update
Dec 12 02:32:53.554: INFO: Load Balancer total HTTP requests: 0
Dec 12 02:32:53.554: INFO: Network errors: 0
Dec 12 02:32:53.554: INFO: HTTP errors: 0
Dec 12 02:32:53.554: INFO: Success rate: NaN%
Dec 12 02:32:53.554: INFO: Poking "http://40.89.240.212:80/echo?msg=hello"
Dec 12 02:32:53.555: INFO: Poke("http://40.89.240.212:80/echo?msg=hello"): success

It seems that the polling logic after the pod environment update was sometimes buggy: as shown above, some rolling updates were considered complete immediately, with zero HTTP requests recorded in between.

I've updated the polling logic to also validate the updated pod containers' environment before considering a rolling update complete.
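As a side note, the "Success rate: NaN%" lines come from computing a percentage over zero recorded requests (0.0/0.0 is NaN for IEEE 754 floats). A stdlib-only sketch of a zero-total guard (names are hypothetical, not the actual test code):

```go
package main

import "fmt"

// successRatePercent guards against a zero request count, which would
// otherwise produce 0.0/0.0 == NaN.
func successRatePercent(successes, total int) (float64, bool) {
	if total == 0 {
		return 0, false // no requests recorded; rate undefined
	}
	return 100 * float64(successes) / float64(total), true
}

func main() {
	if rate, ok := successRatePercent(0, 0); !ok {
		fmt.Println("no requests recorded") // prints: no requests recorded
	} else {
		fmt.Printf("Success rate: %.2f%%\n", rate)
	}
	rate, _ := successRatePercent(74741, 74741)
	fmt.Printf("Success rate: %.2f%%\n", rate) // prints: Success rate: 100.00%
}
```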

Also, a rebase was needed to have this merged.

@andrewsykim @aojea please take a look at the PR again when you get the chance.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 12, 2022
@dims
Member

dims commented Dec 12, 2022

If you still need this PR then please rebase, if not, please close the PR

@ionutbalutoiu
Contributor Author

If you still need this PR then please rebase, if not, please close the PR

I already rebased. The branch behind the pull request is already in sync with the latest upstream master branch.

Also, a rebase was needed to have this merged.

This was an FYI that I had already done it together with the mentioned changes.

@ionutbalutoiu
Contributor Author

/retest-require

@aojea
Member

aojea commented Dec 13, 2022

/retest

@aojea
Member

aojea commented Dec 13, 2022

/lgtm

Thanks, great work, let's see how it evolves on CI

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 13, 2022
@k8s-ci-robot k8s-ci-robot merged commit 73ed9e7 into kubernetes:master Dec 13, 2022
@k8s-ci-robot k8s-ci-robot modified the milestones: v1.27, v1.26 Dec 13, 2022
@ionutbalutoiu ionutbalutoiu deleted the tests/lb-rolling-update branch December 13, 2022 10:17
@liggitt liggitt modified the milestones: v1.26, v1.27 Dec 13, 2022