Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speculative workaround for #74890 #75305

Merged

Conversation

@justinsb
Copy link
Member

justinsb commented Mar 12, 2019

We try using an atomic with a CAS, as a potential workaround for
issue #74890.

Kudos to @neolit123 for the investigation & idea.

This is a speculative workaround - we are really seeing if this is
better; if not we should revert.

If it is better we should file a bug against go 1.12, and then revert!

Issue #74890

/kind failing-test

NONE
Speculative workaround for #74890
We try using an atomic with a CAS, as a potential workaround for
issue #74890.

Kudos to @neolit123 for the investigation & idea.

This is a speculative workaround - we are really seeing if this is
better; if not we should revert.

If it is better we should file a bug against go 1.12, and then revert!

Issue #74890
@mariantalla

This comment has been minimized.

Copy link
Contributor

mariantalla commented Mar 12, 2019

Thanks @justinsb & @neolit123 ! We'll keep an eye on this tomorrow.

If it does end up fixing #74890, I'm guessing this means we should probably see where this pattern is used elsewhere in the codebase, is that right?

@neolit123
Copy link
Member

neolit123 left a comment

thanks, @justinsb .
if we get a race at least we know we are doing "something"... :[
i did not confirm if go 1.12 caused this and also it's really hard to find when this started...

even if merged this should have at least a couple of runs before making conclusions.

/lgtm
/assign @fejta @spiffxp

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Mar 12, 2019

@mariantalla

Thanks @justinsb & @neolit123 ! We'll keep an eye on this tomorrow.

thanks.

If it does end up fixing #74890, I'm guessing this means we should probably see where this pattern is used elsewhere in the codebase, is that right?

long term, we need to investigate some sort of real workaround in the code that is flaking (the code that this PR touches).

@justinsb

This comment has been minimized.

Copy link
Member Author

justinsb commented Mar 12, 2019

Yes, the failures seemed to start happening around March 1st, which seems to coincide with moving from 1.11 to 1.12

My 2c:

Hopefully this does not work, and we keep looking and see a mistake somewhere else.

If it does work, I suggest we bring this to the go team, see if they understand the root cause and can either tell us we're doing something wrong, or can repro as a go bug. Speculation: If it's a go bug, maybe we find out that it's unique to local sync.Once blocks with a defer, in which case we can probably just fix one or two occurrences (I'd guess). If it's confirmed as a more general problem we'd have to wait for a fix (e.g. go 1.12.1) or revert to the go 1.11 branch.

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Mar 12, 2019

Yes, the failures seemed to start happening around March 1st, which seems to coincide with moving from 1.11 to 1.12

ok, so this is concerning and we should get more eyes on it.

Hopefully this does not work, and we keep looking and see a mistake somewhere else.

i agree.

If it does work, I suggest we bring this to the go team, see if they understand the root cause and can either tell us we're doing something wrong, or can repro as a go bug. Speculation: If it's a go bug, maybe we find out that it's unique to local sync.Once blocks with a defer, in which case we can probably just fix one or two occurrences (I'd guess). If it's confirmed as a more general problem we'd have to wait for a fix (e.g. go 1.12.1) or revert to the go 1.11 branch.

as far as i know RSC is one of the approvers in the mutex space.
revert to 1.11 does sound like a plan. 1.12 is quite desired for other reasons and if we don't want to be stuck to 1.11 in k8s 1.13 (unless bump it in a patch release).

@BenTheElder

This comment has been minimized.

Copy link
Member

BenTheElder commented Mar 12, 2019

/priority important-soon

@BenTheElder
Copy link
Member

BenTheElder left a comment

/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Mar 12, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, justinsb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@BenTheElder

This comment has been minimized.

Copy link
Member

BenTheElder commented Mar 12, 2019

/unassign fejta spiffxp

@BenTheElder

This comment has been minimized.

Copy link
Member

BenTheElder commented Mar 12, 2019

/retest

@mariantalla

This comment has been minimized.

Copy link
Contributor

mariantalla commented Mar 13, 2019

/milestone v1.14

@k8s-ci-robot k8s-ci-robot added this to the v1.14 milestone Mar 13, 2019

@k8s-ci-robot k8s-ci-robot merged commit 2748ac6 into kubernetes:master Mar 13, 2019

16 of 17 checks passed

pull-kubernetes-kubemark-e2e-gce-big Job triggered.
Details
cla/linuxfoundation justinsb authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details
@dims

This comment has been minimized.

Copy link
Member

dims commented Mar 13, 2019

@justinsb @neolit123 seems like a change similar to this was proposed once in

See why it was rejected here:

for the record it was

The current implementation guarantees that Do only returns after f() returned (either explicitly or by panic'king) but this change does not.

With this change, concurrent calls to Do could return before the call to f() returned (or even started!) which would break most uses of sync.Once.

Also, looks like nothing has changed in once.go in a while:
https://github.com/golang/go/blame/release-branch.go1.12/src/sync/once.go

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Mar 13, 2019

See why it was rejected here:

the reasoning is absolutely valid. our temporary fix is far from ideal.

Also, looks like nothing has changed in once.go in a while:

that was my observation in the go mutex history as well. the only recent change in our code is this linter fix:
justinsb@6322025#diff-306b2e3f3c8babf6ecd6e4a13ebb9b7e
and i don't see anything breaking here.

so this could simply be a case of resource exhaustion for other unknown reasons, somehow matching the 1.12 switch timing.

@dims

This comment has been minimized.

Copy link
Member

dims commented Mar 13, 2019

@neolit123 seen this? golang/go@91fd14b seems like this was backported for golang/go#30470 and will be available in 1.12.1 in the future (can't seem to find dates on when they are likely to ship 1.12.1)

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Mar 13, 2019

@dims good find.

a bug in deferred stack allocated closures sounds like a lot of fun,
so technically this can be a problem outside of Once too.

one more reason for me to not like passing function objects around in this language (i.e. no root object code symbols).

this link suggests a monthly cadence for PATCH releases:
https://golang.org/doc/devel/release.html#go1.11.minor

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Mar 13, 2019

posted comment related to the closure bug here:
#74890 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.