New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent data race in version negotiation #42379
base: master
Are you sure you want to change the base?
Conversation
if cli.version != "" { | ||
v := strings.TrimPrefix(cli.version, "v") | ||
if v := cli.ClientVersion(); v != "" { | ||
v = strings.TrimPrefix(v, "v") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If cli.version
is now only set through setVersion
, perhaps we should move the TrimPrefix
there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes sense, but I'm concerned about making this change, because:
- It causes test failures:
=== FAIL: client TestGetAPIPath (0.00s)
client_test.go:133: assertion failed: /vv1.22/containers/json (actual string) != /v1.22/containers/json (testcase.expected string)
client_test.go:133: assertion failed: /vv1.22/containers/json (actual string) != /v1.22/containers/json (testcase.expected string)
client_test.go:133: assertion failed: /vv1.22/containers/json?s=c (actual string) != /v1.22/containers/json?s=c (testcase.expected string)
client_test.go:133: assertion failed: /vv1.22/networks/kiwl$%25%5E (actual string) != /v1.22/networks/kiwl$%25%5E (testcase.expected string)
- It implies that every other version usage expects a version without
v
prefix, and I think this is a dangerous assumption: users can set their versions usingWithVersion("v1.xx")
and expectv1.xx
in return fromClientVersion
. Trimmingv
prefix insetVersion
will change the behavior.
b1a72c7
to
32f5b5a
Compare
Hi @thaJeztah, |
Concurrent operations on the same client lead to data race during version negotiation. This commit adds locks to `version` and `negotiated` fields, and replaces all usages by methods protected by locks. Co-authored-by: Sebastiaan van Stijn <github@gone.nl> Signed-off-by: Yury Fedorov <orlangure@gmail.com>
7463ef3
to
8f64a1e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long delay; this one dropped off my radar. I did a quick rebase, and applied the changes I commented in the last push (https://github.com/moby/moby/compare/7463ef32fbf132cc6d97df6dd418f3f79a50d8d1..8f64a1ec53121ef4a2750307613340b2c2eaafda)
LGTM, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM but I wonder: why introduce such mutex-based synchronized access to version, and not just run API version negotiation from NewClientWithOpts
as such an initializer is expected to?
I think we can't do this in the initializer because the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to also have some tests for go test -race
that catch the issue to avoid regressions in the future
sur, it can be called at any time later, but what would be the benefit? You can't use the client without this to happen once, so better just return a fully initialized client, vs doing some lazy-init with synchronized code |
The idea of doing this at a later stage ("lazily") is to only connect with the daemon if an API call is to be made, which is not always known up-front when the client is created. |
@thaJeztah as |
Sorry for chiming in: in the "worst" case, the daemon might not be online at the time the client gets created and a properly written software (I dunno what this is, haven't reached that stage yet myself) is fine to (re) connect during long-term operation anyway. And in consequence that might even raise the question whether to renegotiate after daemon connection-loss. Anyway, connection and reconnecting including API version negotiation looks to me like a "late binding" over the long lifetime of services using the Docker API. To me, such use cases match with the current design where establishing a daemon service connection and the API negotiation only takes place later, but not when creating a client. |
lazy-init might indeed make sense in some circumstances, and as noted being able to re-negociate connection after failure would be even nicer. Just for the current design doing lazy init sounds to me some additional complexity added for a minimal benefits (most client are just used a few ms after creation, aren't they?). But I'm not against this, already gave my LGTM vote, just wonder why we would want this to be more complicated that it needs to be. |
Yes, some things were moved to be connecting only if needed, as (for example, in the Also note that the client may be used in different contexts than a "cli" |
I'm perfectly fine with the current fix and wanted to just give a view onto further use cases, but didn't intend to throw in a spanner into the works. I'm actually using Docker clients almost always outside any CLI context but instead in service contexts to diagnose OT container workloads in (in both meanings) production systems. |
As I have a large set of unit tests for the diagnosis services that often need to talk to a real Docker daemon under test, the race condition currently blocks me from using |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Synchronizing reads and writes to the struct fields is enough to quiet Go's race detector, but there is still a logical race condition which could result in the client spamming the daemon with redundant pings if multiple API requests are made concurrently. I think that also needs to be resolved before we can claim that the client is safe for concurrent use in multiple goroutines.
In the meantime, it looks like both the data race and the logical race can be worked around by calling (*Client).NegotiateAPIVersion
synchronously and waiting for it to return before using the client concurrently from more than one goroutine. Is that enough to unblock you @thediveo?
if cli.negotiateVersion && !cli.isVersionNegotiated() { | ||
cli.NegotiateAPIVersion(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is nothing preventing the client from calling NegotiateAPIVersion
concurrently from multiple goroutines, so an unbounded number of duplicate ping requests could be in-flight at once. A client only needs one version negotiation to be in-flight at a time; any more only serves to increase the API request load on the daemon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@corhere I'll check to be sure, and expect this to be a sufficient fix unless I miss something; in my case of the whalewatcher
OpenSource module there is a "safe" location before kicking off the separate event-streaming and processing goroutines and as the race detector seems to triggered exactly when using this module I assume that running an explicit NegotiateAPIVersion
should do the trick and additionally avoiding spamming the Docker daemon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@crigor 👍🏼 fixes my issues for the moment. Thank you very much for the help!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@corhere we could also use a single flight for that I guess? thaJeztah@fac9fc2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually wondering now if a sync.Once
would do the job (will have to check the logic on what we currently do on a failed negotiation)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The error return from cli.Ping
is ignored...
Lines 223 to 228 in a347f79
func (cli *Client) NegotiateAPIVersion(ctx context.Context) { | |
if !cli.manualOverride { | |
ping, _ := cli.Ping(ctx) | |
cli.negotiateAPIVersionPing(ping) | |
} | |
} |
...so a failed request or canceled context will result in the version being "negotiated" to
1.24
for all subsequent requests, or until the consumer explicitly calls NegotiateAPIRequest
or NegotiateAPIVersionPing
.Lines 249 to 272 in a347f79
// negotiateAPIVersionPing queries the API and updates the version to match the | |
// API version from the ping response. | |
func (cli *Client) negotiateAPIVersionPing(pingResponse types.Ping) { | |
// default to the latest version before versioning headers existed | |
if pingResponse.APIVersion == "" { | |
pingResponse.APIVersion = "1.24" | |
} | |
// if the client is not initialized with a version, start with the latest supported version | |
if cli.version == "" { | |
cli.version = api.DefaultVersion | |
} | |
// if server version is lower than the client version, downgrade | |
if versions.LessThan(pingResponse.APIVersion, cli.version) { | |
cli.version = pingResponse.APIVersion | |
} | |
// Store the results, so that automatic API version negotiation (if enabled) | |
// won't be performed on the next request. | |
if cli.negotiateVersion { | |
cli.negotiated = true | |
} | |
} |
That doesn't sound like desirable behaviour; having a canceled context for one request negatively effect subsequent contexts. Anyway, I am pretty sure replacing the
negotiated
flag with a sync.Once
would achieve the same effective semantics in a concurrent world while suppressing redundant pings.
func (cli *Client) getAPIPath(ctx context.Context, p string, query url.Values) string {
if cli.negotiateVersion {
cli.negotiateOnce.Do(func() { cli.NegotiateAPIVersion(ctx) })
}
/* ... */
}
I'm now seeing a different read/write race between Should this be handled as part of this PR or shall I file a new, separate issue?
|
@thediveo looking at your whalewatcher |
@corhere ah! thank you very much for pointing out my botched fix attempt, as I obviously got the gist wrong the first time. As I don't want to run the NegotiateAPIVersion on the background context I had to implement it slightly differently, but my tests now all pass. However, this also means that there might be downstream false positives so to enable -race in downstream tests I will need for this PR to finally become merged and part of a new version tag. If I'm getting this correct now, then the reason is that I will always end up in cases where the currently race-triggering API calls will be on different goroutines/threads. Thank you very much again for your help! |
The race detector generally does not have false positives; if it reports that a race has been detected, the program violates the Go memory model in some way.
I still don't think you're getting it, as you're still calling |
my false positive was misleading, sorry. I forgot to put it into context of a correct positive but in an upstream module, so to a downstream unit test it's a false positive from the module unit test perspective, but not a Go race detector false positive. Sorry for confusing you. I think I understood this reasoning correctly, but it is good that you reinforce the point again. And this is what I'm referring to in the bottom part of my comment: I ensured your point (if I'm not mistaken) now both in the whalewatcher code under test as well in a particular test case by ensuring that the "synchronous" API version negation safely takes place in a single place and completes, before issuing further API calls from multiple goroutines. What I wanted to hint at while this fix on my side should be as you suggested, except put into a later place because of the required context, other (private) modules downstream to whalewatcher might not be refactor-able to also implement this fix. Instead these private modules need to wait for the PR fix in order to enable test -race in the future on them. |
Concurrent operations on the same client lead to data race during
version negotiation. This commit adds locks to
version
andnegotiated
fields, and replaces all usages by methods protected bylocks.
- What I did
Fixed a data race that occurs during concurrent version negotiation.
- How I did it
I created two locks and made sure all concurrent usages of
version
andnegotiated
fields are protected.- How to verify it
Create and use a client concurrently with race detector on.
- Description for the changelog
Prevent data race in version negotiation.
- A picture of a cute animal (not mandatory but encouraged)
😼