Fixed: 113377 Unit test failures that may be caused by performance ask attached #113378
Conversation
Hi @aimuz. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@aimuz: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance. The triage/accepted label can be added by org members by writing /triage accepted in a comment. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind failing-test
/ok-to-test
Would you try this and share 1000-run results?
What is this parameter? I don't see this command on my side.
It is a Go tool. You can install it like this, and it will then be in your bin directory.
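(Presumably the tool referred to here is golang.org/x/tools/cmd/stress; if so, it can be installed with go install golang.org/x/tools/cmd/stress@latest, and the resulting stress binary lands in $(go env GOPATH)/bin, typically ~/go/bin.)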
Force-pushed from 46c8d71 to 3ce10f8.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: aimuz
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Re-tuned to avoid deadlocks @pacoxu
/lgtm
// it can make the test more stable under a large number of stress tests
// the absence of this can lead to occasional lock gap problems under extensive stress testing
time.Sleep(time.Millisecond * 10)
This is a smell that the async aspects of the test aren't fully correct. Can we write the test in a way such that this isn't needed?
If the test is flaking badly, I'd rather skip it for now while this is worked on than add a 10 ms sleep
https://github.com/golang/sync/blob/master/singleflight/singleflight_test.go#L62
A similar test exists in the upstream golang/sync code.
The current test failure is still a relatively low-impact problem; generally speaking, when the test has problems, retesting resolves them.
Retesting is itself a problem; we want to keep a high bar and avoid flaky tests.
I'm verifying a new approach and testing it with stress; if there are no errors over a long run, I will commit it.
I removed the sleep. I verified locally for 10 minutes and no errors were found. @pacoxu @liggitt @aojea
lruItemObj, err, _ = l.group.Do(a.GetNamespace(), func() (interface{}, error) {
// Fixed: #22422
// use singleflight to alleviate simultaneous calls to
lruItemObj, err, _ := l.group.Do(a.GetNamespace(), func() (interface{}, error) {
why do we need to change the implementation to make cache accesses go through a singleflight to deflake the test?
The Do call happens after reading the LRU. Under normal circumstances that path is very fast: when nothing is found in the LRU, the call goes straight into singleflight and joins the in-flight lookup. But when the test is highly concurrent and resources are under heavier load, there is a delay between reading the LRU and calling Do, and during that delay the previous singleflight call may already have returned, so the new call cannot join it.
Modifying the implementation is a more feasible way to make the test pass reliably without adding a sleep.
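For illustration, here is a minimal standalone sketch of the pattern being described, not the actual limitranger admission code; the nsCache type and the liveLookup field are hypothetical stand-ins. The idea: read the LRU first, and only on a miss go through a singleflight.Group so that concurrent misses for the same namespace share one live lookup.

```go
package main

import (
	"fmt"
	"sync"

	"golang.org/x/sync/singleflight"
)

// nsCache is a hypothetical stand-in for the plugin's cache + singleflight pair.
type nsCache struct {
	mu         sync.Mutex
	items      map[string][]string
	group      singleflight.Group
	liveLookup func(ns string) ([]string, error) // stand-in for the real API call
}

func (c *nsCache) get(ns string) ([]string, error) {
	// Fast path: the namespace is already cached.
	c.mu.Lock()
	if v, ok := c.items[ns]; ok {
		c.mu.Unlock()
		return v, nil
	}
	c.mu.Unlock()

	// Slow path: all concurrent misses for ns share one liveLookup while it is in flight.
	v, err, _ := c.group.Do(ns, func() (interface{}, error) {
		items, err := c.liveLookup(ns)
		if err != nil {
			return nil, err
		}
		c.mu.Lock()
		c.items[ns] = items
		c.mu.Unlock()
		return items, nil
	})
	if err != nil {
		return nil, err
	}
	return v.([]string), nil
}

func main() {
	c := &nsCache{
		items: map[string][]string{},
		liveLookup: func(ns string) ([]string, error) {
			return []string{"limits-for-" + ns}, nil
		},
	}
	fmt.Println(c.get("default"))
}
```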
I don't really follow; can you explain the problem in terms of calls and timestamps?
t0 - call A1 arrives and checks cache
t1 - Do(A1) start
t2 - call A2 arrives and checks cache
t3 - A2 joins A1
t4 - Do(A1) finishes and answers both A1 and A2
t0 - call A1 arrives and checks cache
t1 - Do(A1) start
t2 - call A2 arrives and checks cache
t3 - A2 joins A1
t4 - call A3 arrives and checks cache
t5 - Do(A1) finishes and answers both A1 and A2
t6 - Do(A3) start // Do(A1) has already returned and cannot be joined, so it has to open a new call
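To make the timing concrete, here is a small standalone sketch using golang.org/x/sync/singleflight directly (not the plugin or test code): callers that arrive while a Do for the same key is in flight share its result, while a caller that arrives after that Do has returned starts a fresh execution.

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

func main() {
	var g singleflight.Group
	var wg sync.WaitGroup

	slowFetch := func() (interface{}, error) {
		time.Sleep(50 * time.Millisecond) // simulates the live lookup after an LRU miss
		return time.Now(), nil
	}

	// A1 and A2: launched together, so they almost always join the same in-flight call.
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _, shared := g.Do("ns", slowFetch)
			fmt.Println("concurrent caller, shared:", shared) // usually true
		}()
	}
	wg.Wait()

	// A3: arrives after the previous Do("ns") has returned, so it cannot join it
	// and triggers a brand new execution of slowFetch.
	_, _, shared := g.Do("ns", slowFetch)
	fmt.Println("late caller, shared:", shared) // false
}
```

In the flaky runs, A3 plays the role of the goroutine that arrives just after the shared call has completed and therefore triggers an extra execution.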
the problem is that the test deadlocks on
/usr/local/google/home/aojea/src/kubernetes/plugin/pkg/admission/limitranger/admission_test.go:934 +0x1854
kubernetes/plugin/pkg/admission/limitranger/admission_test.go, lines 929 to 936 in 5bc094d:
// unhold all the calls with the same namespace handler.GetLimitRanges(attributes) calls, that have to be aggregated
unhold <- struct{}{}
go func() {
	unhold <- struct{}{}
}()
// and here we wait for all the goroutines
wg.Wait()
You can see the stack by running stress ./limitranger.test -test.run ^TestLimitRanger_GetLimitRangesFixed22422$ -test.timeout 32s; it panics in less than a minute.
The problem is that before unholding, we need to be sure all the goroutines are "captured" in the singleflight group. Just adding a small sleep, the test doesn't flake:
diff --git a/plugin/pkg/admission/limitranger/admission_test.go b/plugin/pkg/admission/limitranger/admission_test.go
index 01a83816037..e1b5e002fac 100644
--- a/plugin/pkg/admission/limitranger/admission_test.go
+++ b/plugin/pkg/admission/limitranger/admission_test.go
@@ -926,11 +926,10 @@ func TestLimitRanger_GetLimitRangesFixed22422(t *testing.T) {
}
}()
}
+ time.Sleep(1 * time.Second)
// unhold all the calls with the same namespace handler.GetLimitRanges(attributes) calls, that have to be aggregated
unhold <- struct{}{}
- go func() {
- unhold <- struct{}{}
- }()
+ unhold <- struct{}{}
// and here we wait for all the goroutines
wg.Wait()
stress ./limitranger.test -test.run ^TestLimitRanger_GetLimitRangesFixed22422$ -test.timeout 32s
5s: 192 runs so far, 0 failures
10s: 386 runs so far, 0 failures
15s: 624 runs so far, 0 failures
20s: 828 runs so far, 0 failures
25s: 1056 runs so far, 0 failures
30s: 1274 runs so far, 0 failures
35s: 1495 runs so far, 0 failures
40s: 1713 runs so far, 0 failures
45s: 1935 runs so far, 0 failures
50s: 2162 runs so far, 0 failures
55s: 2381 runs so far, 0 failures
1m0s: 2603 runs so far, 0 failures
1m5s: 2819 runs so far, 0 failures
1m10s: 3045 runs so far, 0 failures
1m15s: 3267 runs so far, 0 failures
1m20s: 3486 runs so far, 0 failures
1m25s: 3711 runs so far, 0 failures
1m30s: 3927 runs so far, 0 failures
1m35s: 4153 runs so far, 0 failures
1m40s: 4374 runs so far, 0 failures
1m45s: 4594 runs so far, 0 failures
1m50s: 4819 runs so far, 0 failures
1m55s: 5034 runs so far, 0 failures
2m0s: 5262 runs so far, 0 failures
2m5s: 5482 runs so far, 0 failures
2m10s: 5700 runs so far, 0 failures
2m15s: 5927 runs so far, 0 failures
2m20s: 6143 runs so far, 0 failures
2m25s: 6369 runs so far, 0 failures
2m30s: 6590 runs so far, 0 failures
2m35s: 6810 runs so far, 0 failures
2m40s: 7034 runs so far, 0 failures
2m45s: 7253 runs so far, 0 failures
2m50s: 7476 runs so far, 0 failures
2m55s: 7698 runs so far, 0 failures
t0 - call A1 arrives and checks cache
t1 - Do(A1) start
t2 - call A2 arrives and checks cache
t3 - A2 joins A1
t4 - call A3 arrives and checks cache
t5 - Do(A1) finishes and answers both A1 and A2
t6 - Do(A3) start // Do(A1) has already returned and cannot be joined, so it has to open a new call
When Do(A3) executes, it blocks because there is no value left in the channel.
We should fix the test (that is the one with the race) and keep the implementation as it is. #113378 (comment)
Yes, it can be solved with a sleep, so should I roll back to the previous fix? Can I create another PR to change the implementation, as liggitt said?
…k attached Signed-off-by: aimuz <mr.imuz@gmail.com>
/retest
let's try to address this after code freeze and see if we can find an easier way to test it ...
Deflaking the test by sleeping long enough for goroutines to execute is not a reliable solution. The failure mode of the sleep not being long enough is still a deadlock. I opened #113736 as an alternative, which makes the handler slow enough for concurrent requests to accumulate in the singleflight. The failure mode of the sleep not being long enough there is that some requests don't hit the singleflight and hit the cache, and the test still passes.
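As a rough illustration of that alternative (a sketch of the general idea only, not the actual change in #113736): if the lookup function itself is slow, concurrent callers naturally accumulate behind one singleflight execution, and a delay that turns out to be too short only means some callers trigger extra lookups (or, in the real test, hit the cache) instead of deadlocking.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"

	"golang.org/x/sync/singleflight"
)

func main() {
	var g singleflight.Group
	var lookups int64
	var wg sync.WaitGroup

	slowLookup := func() (interface{}, error) {
		atomic.AddInt64(&lookups, 1)
		time.Sleep(100 * time.Millisecond) // slow enough for concurrent callers to accumulate
		return "limit-ranges-for-ns", nil
	}

	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v, _, _ := g.Do("ns", slowLookup)
			if v != "limit-ranges-for-ns" {
				panic("unexpected value")
			}
		}()
	}
	wg.Wait()

	// Usually 1. If the lookup were too fast, some callers would miss the
	// in-flight call and trigger extra lookups, but nothing blocks forever.
	fmt.Println("lookups:", atomic.LoadInt64(&lookups))
}
```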
Let's go with the other approach. Thank you very much @aimuz for working so hard on this, appreciate it, keep it up.
/close
@aojea: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Signed-off-by: aimuz mr.imuz@gmail.com
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #113377
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: