
Fixed: 113377 Unit test failures that may be caused by performance ask attached #113378

Closed
wants to merge 1 commit

Conversation

aimuz
Contributor

@aimuz aimuz commented Oct 27, 2022

Signed-off-by: aimuz <mr.imuz@gmail.com>

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #113377

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 27, 2022
@k8s-ci-robot
Contributor

Hi @aimuz. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot
Contributor

@aimuz: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Oct 27, 2022
@aimuz
Contributor Author

aimuz commented Oct 27, 2022

/kind failing-test
/sig scalability

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. and removed do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 27, 2022
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 27, 2022
@pacoxu
Member

pacoxu commented Oct 27, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 27, 2022
@pacoxu
Copy link
Member

pacoxu commented Oct 27, 2022

go test -c -race ./plugin/pkg/admission/limitranger  -run ^TestLimitRanger_GetLimitRangesFixed22422$
stress -p 4 ./limitranger.test

Would you try this and share the results of 1,000 runs?

@aimuz
Contributor Author

aimuz commented Oct 27, 2022

stress -p 4

What is this parameter? I don't see this flag on my side:

 stress --help
`stress' imposes certain types of compute stress on your system

Usage: stress [OPTION [ARG]] ...
 -?, --help         show this help statement
     --version      show version statement
 -v, --verbose      be verbose
 -q, --quiet        be quiet
 -n, --dry-run      show what would have been done
 -t, --timeout N    timeout after N seconds
     --backoff N    wait factor of N microseconds before work starts
 -c, --cpu N        spawn N workers spinning on sqrt()
 -i, --io N         spawn N workers spinning on sync()
 -m, --vm N         spawn N workers spinning on malloc()/free()
     --vm-bytes B   malloc B bytes per vm worker (default is 256MB)
     --vm-stride B  touch a byte every B bytes (default is 4096)
     --vm-hang N    sleep N secs before free (default none, 0 is inf)
     --vm-keep      redirty memory instead of freeing and reallocating
 -d, --hdd N        spawn N workers spinning on write()/unlink()
     --hdd-bytes B  write B bytes per hdd worker (default is 1GB)

Example: stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M --timeout 10s

Note: Numbers may be suffixed with s,m,h,d,y (time) or B,K,M,G (size).

@pacoxu
Member

pacoxu commented Oct 27, 2022

It is a Go tool. You can install it with

go install golang.org/x/tools/cmd/stress@latest

and it will end up in your Go bin directory, e.g. /Users/pacoxu/go/bin/stress.

@aimuz aimuz force-pushed the fix-113377 branch 2 times, most recently from 46c8d71 to 3ce10f8 on October 27, 2022 08:19
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed lgtm "Looks good to me", indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 27, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aimuz
Once this PR has been reviewed and has the lgtm label, please ask for approval from liggitt by writing /assign @liggitt in a comment. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@aimuz
Contributor Author

aimuz commented Oct 27, 2022

Re-tuned to avoid deadlocks @pacoxu


stress -o log -p 4 -timeout 1m  ./limitranger.test
5s: 84 runs so far, 0 failures
10s: 200 runs so far, 0 failures
15s: 325 runs so far, 0 failures
20s: 449 runs so far, 0 failures
25s: 572 runs so far, 0 failures
30s: 701 runs so far, 0 failures
35s: 831 runs so far, 0 failures
40s: 964 runs so far, 0 failures
45s: 1099 runs so far, 0 failures
50s: 1228 runs so far, 0 failures
55s: 1359 runs so far, 0 failures
1m0s: 1494 runs so far, 0 failures
1m5s: 1628 runs so far, 0 failures
1m10s: 1755 runs so far, 0 failures
1m15s: 1886 runs so far, 0 failures
1m20s: 2018 runs so far, 0 failures
1m25s: 2154 runs so far, 0 failures
1m30s: 2288 runs so far, 0 failures
1m35s: 2419 runs so far, 0 failures
1m40s: 2553 runs so far, 0 failures
1m45s: 2689 runs so far, 0 failures
1m50s: 2820 runs so far, 0 failures
1m55s: 2951 runs so far, 0 failures
2m0s: 3077 runs so far, 0 failures
2m5s: 3207 runs so far, 0 failures
2m10s: 3338 runs so far, 0 failures
2m15s: 3470 runs so far, 0 failures
2m20s: 3602 runs so far, 0 failures
2m25s: 3730 runs so far, 0 failures
2m30s: 3864 runs so far, 0 failures
2m35s: 3996 runs so far, 0 failures
2m40s: 4126 runs so far, 0 failures
2m45s: 4258 runs so far, 0 failures
2m50s: 4384 runs so far, 0 failures
2m55s: 4516 runs so far, 0 failures
3m0s: 4647 runs so far, 0 failures
3m5s: 4776 runs so far, 0 failures
3m10s: 4911 runs so far, 0 failures
3m15s: 5038 runs so far, 0 failures
3m20s: 5171 runs so far, 0 failures
3m25s: 5304 runs so far, 0 failures
3m30s: 5438 runs so far, 0 failures
3m35s: 5576 runs so far, 0 failures
3m40s: 5712 runs so far, 0 failures
3m45s: 5848 runs so far, 0 failures
3m50s: 5984 runs so far, 0 failures
3m55s: 6116 runs so far, 0 failures
4m0s: 6252 runs so far, 0 failures
4m5s: 6384 runs so far, 0 failures
4m10s: 6516 runs so far, 0 failures
4m15s: 6650 runs so far, 0 failures
4m20s: 6780 runs so far, 0 failures
4m25s: 6912 runs so far, 0 failures
4m30s: 7046 runs so far, 0 failures
4m35s: 7174 runs so far, 0 failures
4m40s: 7308 runs so far, 0 failures
4m45s: 7436 runs so far, 0 failures
4m50s: 7568 runs so far, 0 failures
4m55s: 7694 runs so far, 0 failures
5m0s: 7826 runs so far, 0 failures
5m5s: 7945 runs so far, 0 failures
5m10s: 8068 runs so far, 0 failures
5m15s: 8196 runs so far, 0 failures
5m20s: 8318 runs so far, 0 failures
5m25s: 8415 runs so far, 0 failures
5m30s: 8530 runs so far, 0 failures
5m35s: 8643 runs so far, 0 failures
5m40s: 8752 runs so far, 0 failures
5m45s: 8860 runs so far, 0 failures
5m50s: 8956 runs so far, 0 failures
5m55s: 9035 runs so far, 0 failures
6m0s: 9132 runs so far, 0 failures
6m5s: 9240 runs so far, 0 failures
6m10s: 9335 runs so far, 0 failures
6m15s: 9435 runs so far, 0 failures
6m20s: 9537 runs so far, 0 failures
6m25s: 9636 runs so far, 0 failures
6m30s: 9736 runs so far, 0 failures
6m35s: 9836 runs so far, 0 failures
6m40s: 9930 runs so far, 0 failures
6m45s: 9987 runs so far, 0 failures
6m50s: 10088 runs so far, 0 failures
6m55s: 10204 runs so far, 0 failures
7m0s: 10326 runs so far, 0 failures
7m5s: 10456 runs so far, 0 failures
7m10s: 10596 runs so far, 0 failures
7m15s: 10740 runs so far, 0 failures
7m20s: 10892 runs so far, 0 failures
7m25s: 11050 runs so far, 0 failures
7m30s: 11206 runs so far, 0 failures
7m35s: 11365 runs so far, 0 failures
7m40s: 11525 runs so far, 0 failures
7m45s: 11681 runs so far, 0 failures
7m50s: 11833 runs so far, 0 failures
7m55s: 11992 runs so far, 0 failures
8m0s: 12147 runs so far, 0 failures
8m5s: 12301 runs so far, 0 failures
8m10s: 12469 runs so far, 0 failures
8m15s: 12629 runs so far, 0 failures
8m20s: 12773 runs so far, 0 failures
8m25s: 12927 runs so far, 0 failures
8m30s: 13085 runs so far, 0 failures
8m35s: 13241 runs so far, 0 failures
8m40s: 13393 runs so far, 0 failures
8m45s: 13549 runs so far, 0 failures
8m50s: 13703 runs so far, 0 failures
8m55s: 13858 runs so far, 0 failures
9m0s: 14013 runs so far, 0 failures
9m5s: 14165 runs so far, 0 failures
9m10s: 14323 runs so far, 0 failures
9m15s: 14477 runs so far, 0 failures
9m20s: 14629 runs so far, 0 failures
9m25s: 14781 runs so far, 0 failures
9m30s: 14936 runs so far, 0 failures
9m35s: 15085 runs so far, 0 failures
9m40s: 15224 runs so far, 0 failures
9m45s: 15371 runs so far, 0 failures
9m50s: 15513 runs so far, 0 failures
9m55s: 15668 runs so far, 0 failures
10m0s: 15821 runs so far, 0 failures
10m5s: 15973 runs so far, 0 failures
10m10s: 16125 runs so far, 0 failures
10m15s: 16279 runs so far, 0 failures
10m20s: 16433 runs so far, 0 failures
10m25s: 16585 runs so far, 0 failures
10m30s: 16733 runs so far, 0 failures
10m35s: 16885 runs so far, 0 failures
10m40s: 17037 runs so far, 0 failures
10m45s: 17193 runs so far, 0 failures
10m50s: 17337 runs so far, 0 failures
10m55s: 17484 runs so far, 0 failures

@pacoxu
Member

pacoxu commented Oct 27, 2022

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 27, 2022

// this makes the test more stable under a large number of stress runs
// without it, occasional lock-gap problems appear under extensive stress testing
time.Sleep(time.Millisecond * 10)

Member

This is a smell that the async aspects of the test aren't fully correct. Can we write the test in a way such that this isn't needed?

Member

If the test is flaking badly, I'd rather skip it for now while this is worked on than add a 10 ms sleep

Contributor Author

The current flake rate is still relatively low; generally speaking, when the test has a problem, retesting resolves it.

Member

Retesting is a problem; we want to keep a high bar and avoid flaky tests.

Contributor Author

I'm verifying a new approach and testing it with stress; if there are no errors over a long run, I will commit it.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 28, 2022
@aimuz
Contributor Author

aimuz commented Oct 28, 2022

I removed the sleep and verified locally for 10 minutes; no errors were found. @pacoxu @liggitt @aojea


5s: 76 runs so far, 0 failures
10s: 156 runs so far, 0 failures
15s: 236 runs so far, 0 failures
20s: 324 runs so far, 0 failures
25s: 400 runs so far, 0 failures
30s: 492 runs so far, 0 failures
35s: 569 runs so far, 0 failures
40s: 639 runs so far, 0 failures
45s: 717 runs so far, 0 failures
50s: 795 runs so far, 0 failures
55s: 873 runs so far, 0 failures
1m0s: 949 runs so far, 0 failures
1m5s: 1019 runs so far, 0 failures
1m10s: 1076 runs so far, 0 failures
1m15s: 1142 runs so far, 0 failures
1m20s: 1194 runs so far, 0 failures
1m25s: 1250 runs so far, 0 failures
1m30s: 1302 runs so far, 0 failures
1m35s: 1353 runs so far, 0 failures
1m40s: 1405 runs so far, 0 failures
1m45s: 1453 runs so far, 0 failures
1m50s: 1498 runs so far, 0 failures
1m55s: 1555 runs so far, 0 failures
2m0s: 1610 runs so far, 0 failures
2m5s: 1672 runs so far, 0 failures
2m10s: 1718 runs so far, 0 failures
2m15s: 1770 runs so far, 0 failures
2m20s: 1818 runs so far, 0 failures
2m25s: 1870 runs so far, 0 failures
2m30s: 1928 runs so far, 0 failures
2m35s: 1975 runs so far, 0 failures
2m40s: 2036 runs so far, 0 failures
2m45s: 2107 runs so far, 0 failures
2m50s: 2179 runs so far, 0 failures
2m55s: 2251 runs so far, 0 failures
3m0s: 2323 runs so far, 0 failures
3m5s: 2395 runs so far, 0 failures
3m10s: 2473 runs so far, 0 failures
3m15s: 2546 runs so far, 0 failures
3m20s: 2617 runs so far, 0 failures
3m25s: 2691 runs so far, 0 failures
3m30s: 2762 runs so far, 0 failures
3m35s: 2835 runs so far, 0 failures
3m40s: 2908 runs so far, 0 failures
3m45s: 2976 runs so far, 0 failures
3m50s: 3048 runs so far, 0 failures
3m55s: 3114 runs so far, 0 failures
4m0s: 3198 runs so far, 0 failures
4m5s: 3273 runs so far, 0 failures
4m10s: 3360 runs so far, 0 failures
4m15s: 3440 runs so far, 0 failures
4m20s: 3512 runs so far, 0 failures
4m25s: 3594 runs so far, 0 failures
4m30s: 3668 runs so far, 0 failures
4m35s: 3754 runs so far, 0 failures
4m40s: 3849 runs so far, 0 failures
4m45s: 3949 runs so far, 0 failures
4m50s: 4066 runs so far, 0 failures
4m55s: 4191 runs so far, 0 failures
5m0s: 4324 runs so far, 0 failures
5m5s: 4460 runs so far, 0 failures
5m10s: 4614 runs so far, 0 failures
5m15s: 4760 runs so far, 0 failures
5m20s: 4907 runs so far, 0 failures
5m25s: 5056 runs so far, 0 failures
5m30s: 5209 runs so far, 0 failures
5m35s: 5370 runs so far, 0 failures
5m40s: 5516 runs so far, 0 failures
5m45s: 5679 runs so far, 0 failures
5m50s: 5841 runs so far, 0 failures
5m55s: 5986 runs so far, 0 failures
6m0s: 6136 runs so far, 0 failures
6m5s: 6278 runs so far, 0 failures
6m10s: 6412 runs so far, 0 failures
6m15s: 6531 runs so far, 0 failures
6m20s: 6652 runs so far, 0 failures
6m25s: 6757 runs so far, 0 failures
6m30s: 6854 runs so far, 0 failures
6m35s: 6954 runs so far, 0 failures
6m40s: 7044 runs so far, 0 failures
6m45s: 7121 runs so far, 0 failures
6m50s: 7204 runs so far, 0 failures
6m55s: 7295 runs so far, 0 failures
7m0s: 7371 runs so far, 0 failures
7m5s: 7449 runs so far, 0 failures
7m10s: 7535 runs so far, 0 failures
7m15s: 7607 runs so far, 0 failures
7m20s: 7687 runs so far, 0 failures
7m25s: 7743 runs so far, 0 failures
7m30s: 7791 runs so far, 0 failures
7m35s: 7855 runs so far, 0 failures
7m40s: 7918 runs so far, 0 failures
7m45s: 7979 runs so far, 0 failures
7m50s: 8034 runs so far, 0 failures
7m55s: 8073 runs so far, 0 failures
8m0s: 8116 runs so far, 0 failures
8m5s: 8153 runs so far, 0 failures
8m10s: 8209 runs so far, 0 failures
8m15s: 8257 runs so far, 0 failures
8m20s: 8305 runs so far, 0 failures
8m25s: 8361 runs so far, 0 failures
8m30s: 8429 runs so far, 0 failures
8m35s: 8497 runs so far, 0 failures
8m40s: 8560 runs so far, 0 failures
8m45s: 8629 runs so far, 0 failures
8m50s: 8696 runs so far, 0 failures
8m55s: 8748 runs so far, 0 failures
9m0s: 8810 runs so far, 0 failures
9m5s: 8877 runs so far, 0 failures
9m10s: 8943 runs so far, 0 failures
9m15s: 9010 runs so far, 0 failures
9m20s: 9075 runs so far, 0 failures
9m25s: 9126 runs so far, 0 failures
9m30s: 9154 runs so far, 0 failures
9m35s: 9184 runs so far, 0 failures
9m40s: 9219 runs so far, 0 failures
9m45s: 9281 runs so far, 0 failures
9m50s: 9345 runs so far, 0 failures
9m55s: 9419 runs so far, 0 failures
10m0s: 9499 runs so far, 0 failures
10m5s: 9575 runs so far, 0 failures
10m10s: 9664 runs so far, 0 failures
10m15s: 9758 runs so far, 0 failures
10m20s: 9850 runs so far, 0 failures
10m25s: 9946 runs so far, 0 failures
10m30s: 10039 runs so far, 0 failures
10m35s: 10133 runs so far, 0 failures
10m40s: 10223 runs so far, 0 failures
10m45s: 10319 runs so far, 0 failures
10m50s: 10415 runs so far, 0 failures

-	lruItemObj, err, _ = l.group.Do(a.GetNamespace(), func() (interface{}, error) {
+	// Fixed: #22422
+	// use singleflight to alleviate simultaneous calls to
+	lruItemObj, err, _ := l.group.Do(a.GetNamespace(), func() (interface{}, error) {

Member

why do we need to change the implementation to make cache accesses go through a singleflight to deflake the test?

Contributor Author

Do is called after reading the LRU cache. Under normal circumstances that path is very fast: whenever the LRU read misses, the call goes straight into singleflight and concurrent calls are merged there. But when the test concentrates many calls and resource usage is heavier, each Do carries a certain delay, and within that delay window the previous singleflight call has already returned.

Modifying the implementation is a more feasible way to make the test pass reliably without adding a sleep.
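
For readers following along, the pattern under discussion looks roughly like this minimal sketch: check an LRU cache first and, on a miss, collapse concurrent lookups for the same namespace through a singleflight.Group. The types and the fetch body here are hypothetical stand-ins, not the actual limitranger code.

package main

import (
	"fmt"
	"sync"
	"time"

	lru "github.com/hashicorp/golang-lru"
	"golang.org/x/sync/singleflight"
)

type cachedFetcher struct {
	cache *lru.Cache         // fast path
	group singleflight.Group // collapses concurrent misses per key
}

func (c *cachedFetcher) get(namespace string) (interface{}, error) {
	// Fast path: serve from the LRU cache.
	if v, ok := c.cache.Get(namespace); ok {
		return v, nil
	}
	// Slow path: concurrent misses for the same namespace share one lookup.
	v, err, _ := c.group.Do(namespace, func() (interface{}, error) {
		time.Sleep(10 * time.Millisecond) // stand-in for a slow backing call
		result := fmt.Sprintf("limit ranges for %s", namespace)
		c.cache.Add(namespace, result)
		return result, nil
	})
	return v, err
}

func main() {
	cache, _ := lru.New(128)
	f := &cachedFetcher{cache: cache}

	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			v, _ := f.get("default")
			fmt.Println(v)
		}()
	}
	wg.Wait()
}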

Member

I don't really follow; can you explain the problem in terms of calls and timestamps?

t0 - call A1 arrives and checks cache
t1 - Do(A1) starts
t2 - call A2 arrives and checks cache
t3 - A2 joins A1
t4 - Do(A1) finishes and answers both A1 and A2

Contributor Author

@aimuz aimuz Nov 1, 2022

t0 - call A1 arrives and checks cache
t1 - Do(A1) starts
t2 - call A2 arrives and checks cache
t3 - A2 joins A1
t4 - call A3 arrives and checks cache
t5 - Do(A1) finishes and answers both A1 and A2
t6 - Do(A3) starts // Do(A1) has already returned and cannot be joined, so a new call has to be opened
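
This timeline can be reproduced in isolation. A self-contained sketch using golang.org/x/sync/singleflight directly (the delays are arbitrary, and this is not the limitranger code): callers that arrive while a flight is in progress share its result, while a caller that arrives after the flight has returned opens a new one.

package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

func main() {
	var g singleflight.Group
	var wg sync.WaitGroup

	call := func(name string, delay time.Duration) {
		defer wg.Done()
		time.Sleep(delay)
		_, _, shared := g.Do("ns", func() (interface{}, error) {
			time.Sleep(50 * time.Millisecond) // simulate the body of Do
			return "result", nil
		})
		fmt.Printf("%s shared=%v\n", name, shared)
	}

	wg.Add(3)
	go call("A1", 0)                    // t1: starts the flight
	go call("A2", 10*time.Millisecond)  // t3: joins A1's flight, shared=true
	go call("A3", 100*time.Millisecond) // t6: flight already done, opens a new one, shared=false
	wg.Wait()
}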

Member

@aojea aojea Nov 3, 2022

the problem is that the test deadlocks on

   /usr/local/google/home/aojea/src/kubernetes/plugin/pkg/admission/limitranger/admission_test.go:934 +0x1854

// unhold all the calls with the same namespace handler.GetLimitRanges(attributes) calls, that have to be aggregated
unhold <- struct{}{}
go func() {
unhold <- struct{}{}
}()
// and here we wait for all the goroutines
wg.Wait()

you can see the stack by running stress ./limitranger.test -test.run ^TestLimitRanger_GetLimitRangesFixed22422$ -test.timeout 32s; it panics in less than a minute

The problem is that, before the unhold, we need to be sure all the goroutines are "captured" in the singleflight group; just by adding a small sleep, the test doesn't flake:

diff --git a/plugin/pkg/admission/limitranger/admission_test.go b/plugin/pkg/admission/limitranger/admission_test.go
index 01a83816037..e1b5e002fac 100644
--- a/plugin/pkg/admission/limitranger/admission_test.go
+++ b/plugin/pkg/admission/limitranger/admission_test.go
@@ -926,11 +926,10 @@ func TestLimitRanger_GetLimitRangesFixed22422(t *testing.T) {
                        }
                }()
        }
+       time.Sleep(1 * time.Second)
        // unhold all the calls with the same namespace handler.GetLimitRanges(attributes) calls, that have to be aggregated
        unhold <- struct{}{}
-       go func() {
-               unhold <- struct{}{}
-       }()
+       unhold <- struct{}{}

        // and here we wait for all the goroutines
        wg.Wait()
 stress ./limitranger.test -test.run ^TestLimitRanger_GetLimitRangesFixed22422$ -test.timeout 32s
5s: 192 runs so far, 0 failures
10s: 386 runs so far, 0 failures
15s: 624 runs so far, 0 failures
20s: 828 runs so far, 0 failures
25s: 1056 runs so far, 0 failures
30s: 1274 runs so far, 0 failures
35s: 1495 runs so far, 0 failures
40s: 1713 runs so far, 0 failures
45s: 1935 runs so far, 0 failures
50s: 2162 runs so far, 0 failures
55s: 2381 runs so far, 0 failures
1m0s: 2603 runs so far, 0 failures
1m5s: 2819 runs so far, 0 failures
1m10s: 3045 runs so far, 0 failures
1m15s: 3267 runs so far, 0 failures
1m20s: 3486 runs so far, 0 failures
1m25s: 3711 runs so far, 0 failures
1m30s: 3927 runs so far, 0 failures
1m35s: 4153 runs so far, 0 failures
1m40s: 4374 runs so far, 0 failures
1m45s: 4594 runs so far, 0 failures
1m50s: 4819 runs so far, 0 failures
1m55s: 5034 runs so far, 0 failures
2m0s: 5262 runs so far, 0 failures
2m5s: 5482 runs so far, 0 failures
2m10s: 5700 runs so far, 0 failures
2m15s: 5927 runs so far, 0 failures
2m20s: 6143 runs so far, 0 failures
2m25s: 6369 runs so far, 0 failures
2m30s: 6590 runs so far, 0 failures
2m35s: 6810 runs so far, 0 failures
2m40s: 7034 runs so far, 0 failures
2m45s: 7253 runs so far, 0 failures
2m50s: 7476 runs so far, 0 failures
2m55s: 7698 runs so far, 0 failures

Contributor Author

t0 - call A1 arrives and checks cache
t1 - Do(A1) starts
t2 - call A2 arrives and checks cache
t3 - A2 joins A1
t4 - call A3 arrives and checks cache
t5 - Do(A1) finishes and answers both A1 and A2
t6 - Do(A3) starts // Do(A1) has already returned and cannot be joined, so a new call has to be opened

After Do(A3) executes, it blocks because there is no value left in the channel.
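
In channel terms, this is the classic unbuffered-send deadlock: once no receiver remains, the extra send blocks forever and the runtime aborts the test. A stripped-down illustration (hypothetical, not the actual test code):

package main

func main() {
	unhold := make(chan struct{}) // unbuffered

	// Exactly one receiver drains one send.
	go func() { <-unhold }()

	unhold <- struct{}{} // delivered to the receiver above
	unhold <- struct{}{} // no receiver left: blocks forever, and the runtime
	// reports "fatal error: all goroutines are asleep - deadlock!",
	// the same failure mode the stressed test hits
}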

@aojea
Member

aojea commented Nov 3, 2022

we should fix the test, since that is the one with the race, and keep the implementation as it is: #113378 (comment)

@aimuz
Contributor Author

aimuz commented Nov 3, 2022

Yes, it can be solved with a sleep, so should I roll back to the previous fix?

Can I create another PR to change the implementation? As liggitt said:

This is a smell that the async aspects of the test aren't fully correct

…k attached

Signed-off-by: aimuz <mr.imuz@gmail.com>
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Nov 3, 2022
@aimuz
Contributor Author

aimuz commented Nov 3, 2022

/retest

@aimuz
Contributor Author

aimuz commented Nov 8, 2022

@aojea @liggitt Can you review it?

@pacoxu
Member

pacoxu commented Nov 8, 2022

7m50s: 1676 runs so far, 0 failures
/lgtm
the data race is fixed and no failures occurred.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 8, 2022
@aojea
Member

aojea commented Nov 8, 2022

let's try to address this after code freeze and see if we can find an easier way to test it ...

@liggitt
Member

liggitt commented Nov 8, 2022

deflaking the test by sleeping long enough for goroutines to execute is not a reliable solution. The failure mode of the sleep not being long enough is still a deadlock.

I opened #113736 as an alternative, which makes the handler slow enough for concurrent requests to accumulate in the singleflight. The failure mode of the sleep not being long enough there is that some requests don't hit the singleflight and hit the cache, and the test still passes.
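
The shape of that alternative, as described, is roughly the following sketch; the names and counts are invented here, and this is not the code from #113736. The flight blocks until every caller has launched, so concurrent requests accumulate in the singleflight; a caller that arrives late merely misses the window, and nothing deadlocks.

package sketch

import (
	"sync"
	"testing"

	"golang.org/x/sync/singleflight"
)

func TestConcurrentCallsAggregateInSingleflight(t *testing.T) {
	var g singleflight.Group
	const callers = 8

	started := make(chan struct{}, callers) // buffered: sends never block
	release := make(chan struct{})

	var wg sync.WaitGroup
	wg.Add(callers)
	for i := 0; i < callers; i++ {
		go func() {
			defer wg.Done()
			started <- struct{}{} // signal before entering Do
			_, _, _ = g.Do("ns", func() (interface{}, error) {
				<-release // keep the flight slow until everyone has started
				return "limit ranges", nil
			})
		}()
	}

	// Wait until every goroutine has signalled, then unblock the flight.
	for i := 0; i < callers; i++ {
		<-started
	}
	close(release)
	// Even a goroutine that enters Do after close(release) returns
	// immediately, so the worst case is a second, already-released flight.
	wg.Wait()
}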

@aojea
Member

aojea commented Nov 8, 2022

deflaking the test by sleeping long enough for goroutines to execute is not a reliable solution. The failure mode of the sleep not being long enough is still a deadlock.

I opened #113736 as an alternative, which makes the handler slow enough for concurrent requests to accumulate in the singleflight. The failure mode of the sleep not being long enough there is that some requests don't hit the singleflight and hit the cache, and the test still passes.

let's go with the other approach. Thank you very much @aimuz for working so hard on this, appreciate it, keep it up

/close

@k8s-ci-robot
Contributor

@aojea: Closed this PR.

In response to this:

deflaking the test by sleeping long enough for goroutines to execute is not a reliable solution. The failure mode of the sleep not being long enough is still a deadlock.

I opened #113736 as an alternative, which makes the handler slow enough for concurrent requests to accumulate in the singleflight. The failure mode of the sleep not being long enough there is that some requests don't hit the singleflight and hit the cache, and the test still passes.

let's go with the other approach. Thank you very much @aimuz for working so hard on this, appreciate it, keep it up

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

TestLimitRanger_GetLimitRangesFixed22422 flakes
5 participants