Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shorten waits in TestWatchStreamSeparation #123888

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

serathius
Copy link
Contributor

/kind cleanup

Ref #123850

NONE

Before

kubernetes $ go test ./staging/src/k8s.io/apiserver/pkg/storage/cacher --run TestWatchStreamSeparation --count 1
ok      k8s.io/apiserver/pkg/storage/cacher     13.804s

After

go test ./staging/src/k8s.io/apiserver/pkg/storage/cacher --run TestWatchStreamSeparation --count 1
ok      k8s.io/apiserver/pkg/storage/cacher     7.783s

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Mar 12, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: serathius
Once this PR has been reviewed and has the lgtm label, please assign wojtek-t for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Mar 12, 2024
err = cacher.storage.RequestWatchProgress(metadata.NewOutgoingContext(context.Background(), contextMetadata))
if err != nil {
t.Fatal(err)
}
// Give time for bookmark to arrive
time.Sleep(time.Second)
time.Sleep(100 * time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it make sense to stress the test to make sure it is not flaky ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Run the test couple of times, however I forgot to use stress let me test it again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(stress with a binary compiled with / without -race is most helpful)

Copy link
Contributor Author

@serathius serathius Mar 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done:

kubernetes $ rm cacher.test 
kubernetes $ go test ./staging/src/k8s.io/apiserver/pkg/storage/cacher/ --run TestWatchStreamSeparation -c -race
kubernetes $ stress ./cacher.test --test.run TestWatchStreamSeparation 
5s: 0 runs so far, 0 failures
10s: 12 runs so far, 0 failures
15s: 12 runs so far, 0 failures
20s: 24 runs so far, 0 failures
25s: 29 runs so far, 0 failures
30s: 36 runs so far, 0 failures
35s: 46 runs so far, 0 failures
40s: 48 runs so far, 0 failures
45s: 59 runs so far, 0 failures
50s: 63 runs so far, 0 failures
55s: 72 runs so far, 0 failures
1m0s: 78 runs so far, 0 failures
1m5s: 84 runs so far, 0 failures
1m10s: 94 runs so far, 0 failures
1m15s: 98 runs so far, 0 failures
1m20s: 108 runs so far, 0 failures
1m25s: 116 runs so far, 0 failures
1m30s: 122 runs so far, 0 failures
1m35s: 129 runs so far, 0 failures
1m40s: 135 runs so far, 0 failures
1m45s: 143 runs so far, 0 failures
^C

@@ -2396,7 +2396,7 @@ func TestWatchStreamSeparation(t *testing.T) {
defer cacher.watchCache.RUnlock()
return cacher.watchCache.resourceVersion
}
waitContext, cancel := context.WithTimeout(context.Background(), 2*time.Second)
waitContext, cancel := context.WithTimeout(context.Background(), time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this (ctx) only matters when something goes wrong and the fails, right ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed, this is the upper-bound we will wait for before failing the test by returning a blank RV. I don't think shortening this will actually make the test pass faster, just increase the possibility of failing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will because there are subtests where we expect to not get bookmark.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, I see. are we confident a second is long enough to not flake on the tests where we do expect a bookmark event?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will because there are subtests where we expect to not get bookmark.

Actually, if we choose our wait time differently for these two types of tests, we can leave the wait as-is (2 seconds) when we do expect to receive a bookmark, and shorten the wait more aggressively (500ms? 250ms?) when we don't expect to receive a bookmark.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't such asymmetry make the test less accurate? I understand that we can lower the timeout when we don't expect the bookmark as we don't need to wait, however on the other hand it makes it easy to miss if a regression is introduce later. In my opinion making a comparison test like this, we should keep the setup symmetrical to be sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we'd lower it so much it would be easy to miss a regression. We should pick a time large enough that consistently would receive an incorrect bookmark event and fail the test.

@p0lyn0mial
Copy link
Contributor

xref: #123685

err = cacher.storage.RequestWatchProgress(metadata.NewOutgoingContext(context.Background(), contextMetadata))
if err != nil {
t.Fatal(err)
}
// Give time for bookmark to arrive
time.Sleep(time.Second)
time.Sleep(100 * time.Millisecond)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to sleep at all if the next thing we do is call waitForEtcdBookmark() which blocks until the event is received?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the bookmark in watch cache.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but... waitForEtcdBookmark() will wait until we receive it, right? why do we need to sleep?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if this is related but 100ms wasn't enough for #123926

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, increased to 200ms

@jiahuif
Copy link
Member

jiahuif commented Mar 12, 2024

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 12, 2024
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

historically, millisecond-level timing has proven flaky in resource-constrained CI environments

I really don't think we should try to time these sleeps that tightly.

For positive tests, where we expect to receive an event, can we eliminate the sleep entirely and switch to event-based or poll-based checks (on a short interval, e.g. 10-50ms)?

For negative tests, where we expect not to receive an event, can we pick a sleep period ~10x the normal time it takes to receive the event that we will wait to ensure we didn't get an unexpected event?

@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/apiserver cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants