Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent data race in version negotiation #42379

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

orlangure
Copy link

@orlangure orlangure commented May 15, 2021

Concurrent operations on the same client lead to data race during
version negotiation. This commit adds locks to version and
negotiated fields, and replaces all usages by methods protected by
locks.

- What I did
Fixed a data race that occurs during concurrent version negotiation.
- How I did it
I created two locks and made sure all concurrent usages of version and negotiated fields are protected.
- How to verify it
Create and use a client concurrently with race detector on.
- Description for the changelog

Prevent data race in version negotiation.

- A picture of a cute animal (not mandatory but encouraged)
😼

if cli.version != "" {
v := strings.TrimPrefix(cli.version, "v")
if v := cli.ClientVersion(); v != "" {
v = strings.TrimPrefix(v, "v")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If cli.version is now only set through setVersion, perhaps we should move the TrimPrefix there

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense, but I'm concerned about making this change, because:

  1. It causes test failures:
=== FAIL: client TestGetAPIPath (0.00s)
    client_test.go:133: assertion failed: /vv1.22/containers/json (actual string) != /v1.22/containers/json (testcase.expected string)
    client_test.go:133: assertion failed: /vv1.22/containers/json (actual string) != /v1.22/containers/json (testcase.expected string)
    client_test.go:133: assertion failed: /vv1.22/containers/json?s=c (actual string) != /v1.22/containers/json?s=c (testcase.expected string)
    client_test.go:133: assertion failed: /vv1.22/networks/kiwl$%25%5E (actual string) != /v1.22/networks/kiwl$%25%5E (testcase.expected string)
  1. It implies that every other version usage expects a version without v prefix, and I think this is a dangerous assumption: users can set their versions using WithVersion("v1.xx") and expect v1.xx in return from ClientVersion. Trimming v prefix in setVersion will change the behavior.

client/client.go Outdated Show resolved Hide resolved
@orlangure
Copy link
Author

Hi @thaJeztah,
Thank you for taking a look at this change.
I addressed your comments and pushed an updated version.

Concurrent operations on the same client lead to data race during
version negotiation. This commit adds locks to `version` and
`negotiated` fields, and replaces all usages by methods protected by
locks.

Co-authored-by: Sebastiaan van Stijn <github@gone.nl>
Signed-off-by: Yury Fedorov <orlangure@gmail.com>
@thaJeztah thaJeztah force-pushed the fix-negotiation-race branch 2 times, most recently from 7463ef3 to 8f64a1e Compare June 21, 2022 09:39
Copy link
Member

@thaJeztah thaJeztah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the long delay; this one dropped off my radar. I did a quick rebase, and applied the changes I commented in the last push (https://github.com/moby/moby/compare/7463ef32fbf132cc6d97df6dd418f3f79a50d8d1..8f64a1ec53121ef4a2750307613340b2c2eaafda)

LGTM, thanks!

client/client.go Outdated Show resolved Hide resolved
client/options.go Outdated Show resolved Hide resolved
@thaJeztah
Copy link
Member

@tianon @corhere @rumpl @ndeloof PTAL

Copy link
Contributor

@ndeloof ndeloof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM but I wonder: why introduce such mutex-based synchronized access to version, and not just run API version negotiation from NewClientWithOpts as such an initializer is expected to?

@vvoland
Copy link
Contributor

vvoland commented Jun 21, 2022

LGTM but I wonder: why introduce such mutex-based synchronized access to version, and not just run API version negotiation from NewClientWithOpts as such an initializer is expected to?

I think we can't do this in the initializer because the NegotiateAPIVersion/NegotiateAPIVersionPing can be called at any time later.

Copy link
Contributor

@vvoland vvoland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice to also have some tests for go test -race that catch the issue to avoid regressions in the future

@ndeloof
Copy link
Contributor

ndeloof commented Jun 21, 2022

sur, it can be called at any time later, but what would be the benefit? You can't use the client without this to happen once, so better just return a fully initialized client, vs doing some lazy-init with synchronized code

@thaJeztah
Copy link
Member

The idea of doing this at a later stage ("lazily") is to only connect with the daemon if an API call is to be made, which is not always known up-front when the client is created.

@ndeloof
Copy link
Contributor

ndeloof commented Jun 21, 2022

@thaJeztah as client.host is set in the constructor initializer, how could you use such a client with a not known daemon?

@thediveo
Copy link

Sorry for chiming in: in the "worst" case, the daemon might not be online at the time the client gets created and a properly written software (I dunno what this is, haven't reached that stage yet myself) is fine to (re) connect during long-term operation anyway. And in consequence that might even raise the question whether to renegotiate after daemon connection-loss.

Anyway, connection and reconnecting including API version negotiation looks to me like a "late binding" over the long lifetime of services using the Docker API. To me, such use cases match with the current design where establishing a daemon service connection and the API negotiation only takes place later, but not when creating a client.

@ndeloof
Copy link
Contributor

ndeloof commented Jun 21, 2022

lazy-init might indeed make sense in some circumstances, and as noted being able to re-negociate connection after failure would be even nicer. Just for the current design doing lazy init sounds to me some additional complexity added for a minimal benefits (most client are just used a few ms after creation, aren't they?). But I'm not against this, already gave my LGTM vote, just wonder why we would want this to be more complicated that it needs to be.

@thaJeztah
Copy link
Member

Yes, some things were moved to be connecting only if needed, as (for example, in the docker cli), the client may be created during startup of the CLI, but may not need to make an actual connection when invoked (think of docker --help or docker <some command that doesn't make an API call>).

Also note that the client may be used in different contexts than a "cli"

@thediveo
Copy link

I'm perfectly fine with the current fix and wanted to just give a view onto further use cases, but didn't intend to throw in a spanner into the works. I'm actually using Docker clients almost always outside any CLI context but instead in service contexts to diagnose OT container workloads in (in both meanings) production systems.

@thediveo
Copy link

As I have a large set of unit tests for the diagnosis services that often need to talk to a real Docker daemon under test, the race condition currently blocks me from using go test -race: this stll floods my unit tests with positives that are upstream ... and this unfortunately means I cannot run race tests in the pipelines on a regular basis. I would be very thankful if this PR could be merged so it soon trickles down into not only all my tests.

Copy link
Contributor

@corhere corhere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Synchronizing reads and writes to the struct fields is enough to quiet Go's race detector, but there is still a logical race condition which could result in the client spamming the daemon with redundant pings if multiple API requests are made concurrently. I think that also needs to be resolved before we can claim that the client is safe for concurrent use in multiple goroutines.

In the meantime, it looks like both the data race and the logical race can be worked around by calling (*Client).NegotiateAPIVersion synchronously and waiting for it to return before using the client concurrently from more than one goroutine. Is that enough to unblock you @thediveo?

Comment on lines +194 to 195
if cli.negotiateVersion && !cli.isVersionNegotiated() {
cli.NegotiateAPIVersion(ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is nothing preventing the client from calling NegotiateAPIVersion concurrently from multiple goroutines, so an unbounded number of duplicate ping requests could be in-flight at once. A client only needs one version negotiation to be in-flight at a time; any more only serves to increase the API request load on the daemon.

Copy link

@thediveo thediveo Jun 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@corhere I'll check to be sure, and expect this to be a sufficient fix unless I miss something; in my case of the whalewatcher OpenSource module there is a "safe" location before kicking off the separate event-streaming and processing goroutines and as the race detector seems to triggered exactly when using this module I assume that running an explicit NegotiateAPIVersion should do the trick and additionally avoiding spamming the Docker daemon.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@crigor 👍🏼 fixes my issues for the moment. Thank you very much for the help!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@corhere we could also use a single flight for that I guess? thaJeztah@fac9fc2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually wondering now if a sync.Once would do the job (will have to check the logic on what we currently do on a failed negotiation)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error return from cli.Ping is ignored...

moby/client/client.go

Lines 223 to 228 in a347f79

func (cli *Client) NegotiateAPIVersion(ctx context.Context) {
if !cli.manualOverride {
ping, _ := cli.Ping(ctx)
cli.negotiateAPIVersionPing(ping)
}
}

...so a failed request or canceled context will result in the version being "negotiated" to 1.24 for all subsequent requests, or until the consumer explicitly calls NegotiateAPIRequest or NegotiateAPIVersionPing.

moby/client/client.go

Lines 249 to 272 in a347f79

// negotiateAPIVersionPing queries the API and updates the version to match the
// API version from the ping response.
func (cli *Client) negotiateAPIVersionPing(pingResponse types.Ping) {
// default to the latest version before versioning headers existed
if pingResponse.APIVersion == "" {
pingResponse.APIVersion = "1.24"
}
// if the client is not initialized with a version, start with the latest supported version
if cli.version == "" {
cli.version = api.DefaultVersion
}
// if server version is lower than the client version, downgrade
if versions.LessThan(pingResponse.APIVersion, cli.version) {
cli.version = pingResponse.APIVersion
}
// Store the results, so that automatic API version negotiation (if enabled)
// won't be performed on the next request.
if cli.negotiateVersion {
cli.negotiated = true
}
}

That doesn't sound like desirable behaviour; having a canceled context for one request negatively effect subsequent contexts. Anyway, I am pretty sure replacing the negotiated flag with a sync.Once would achieve the same effective semantics in a concurrent world while suppressing redundant pings.

func (cli *Client) getAPIPath(ctx context.Context, p string, query url.Values) string {
	if cli.negotiateVersion {
		cli.negotiateOnce.Do(func() { cli.NegotiateAPIVersion(ctx) })
	}
	/* ... */
}

thediveo added a commit to thediveo/whalewatcher that referenced this pull request Jun 22, 2022
@thediveo
Copy link

thediveo commented Jun 22, 2022

I'm now seeing a different read/write race between NegotiateAPIVersion and getAPIPath after having successfully stopped the other race. I actually have issued an initial NegotiateAPIVersion call on the client before spinning separate goroutines issuing API calls but am now somewhat surprised that NetworkList/getAPIPath triggers a race.

Should this be handled as part of this PR or shall I file a new, separate issue?

WARNING: DATA RACE
Read at 0x00c00040e472 by goroutine 20:
  github.com/docker/docker/client.(*Client).getAPIPath()
      /home/.../go/pkg/mod/github.com/docker/docker@v20.10.17+incompatible/client/client.go:183 +0x8f
  github.com/docker/docker/client.(*Client).sendRequest()
      /home/.../go/pkg/mod/github.com/docker/docker@v20.10.17+incompatible/client/request.go:109 +0xbb
  github.com/docker/docker/client.(*Client).get()
      /home/.../go/pkg/mod/github.com/docker/docker@v20.10.17+incompatible/client/request.go:37 +0x294
  github.com/docker/docker/client.(*Client).NetworkList()
      /home/.../go/pkg/mod/github.com/docker/docker@v20.10.17+incompatible/client/network_list.go:25 +0x24f
  github.com/thediveo/whalewatcher/watcher/moby.glob..func1.3()
      /home/.../cscom/ghostwire-ws/whalewatcher/watcher/moby/moby_test.go:68 +0x593
  github.com/onsi/ginkgo/v2/internal.(*Suite).runNode.func2()
      /home/.../go/pkg/mod/github.com/onsi/ginkgo/v2@v2.1.4/internal/suite.go:596 +0xe7

Previous write at 0x00c00040e472 by goroutine 26:
  github.com/docker/docker/client.(*Client).negotiateAPIVersionPing()
      /home/.../go/pkg/mod/github.com/docker/docker@v20.10.17+incompatible/client/client.go:244 +0x192
  github.com/docker/docker/client.(*Client).NegotiateAPIVersion()
      /home/.../go/pkg/mod/github.com/docker/docker@v20.10.17+incompatible/client/client.go:208 +0x164
  github.com/thediveo/whalewatcher/engineclient/moby.(*MobyWatcher).List()
      /home/.../cscom/ghostwire-ws/whalewatcher/engineclient/moby/moby.go:136 +0x91
  github.com/thediveo/whalewatcher/watcher.(*watcher).list()
      /home/.../cscom/ghostwire-ws/whalewatcher/watcher/watcher.go:424 +0xcf
  github.com/thediveo/whalewatcher/watcher.(*watcher).Watch.func1.1()
      /home/.../cscom/ghostwire-ws/whalewatcher/watcher/watcher.go:232 +0x59

@corhere
Copy link
Contributor

corhere commented Jun 22, 2022

I'm now seeing a different read/write race between NegotiateAPIVersion and getAPIPath after having successfully stopped the other race. I actually have issued an initial NegotiateAPIVersion call on the client before spinning separate goroutines issuing API calls but am now somewhat surprised that NetworkList/getAPIPath triggers a race.

@thediveo looking at your whalewatcher develop branch, it appears to be a bug in your code. Your end-to-end test concurrently calls func (*Client) NegotiateAPIVersion and func (*Client) NetworkList from separate goroutines. You aren't waiting for NegotiateAPIVersion to return before using the client concurrently. Try reverting thediveo/whalewatcher@8ff72e9 and inserting the line moby.NegotiateAPIVersion(context.Background()) here instead.

@thediveo
Copy link

thediveo commented Jun 22, 2022

@corhere ah! thank you very much for pointing out my botched fix attempt, as I obviously got the gist wrong the first time. As I don't want to run the NegotiateAPIVersion on the background context I had to implement it slightly differently, but my tests now all pass. However, this also means that there might be downstream false positives so to enable -race in downstream tests I will need for this PR to finally become merged and part of a new version tag. If I'm getting this correct now, then the reason is that I will always end up in cases where the currently race-triggering API calls will be on different goroutines/threads.

Thank you very much again for your help!

@corhere
Copy link
Contributor

corhere commented Jun 22, 2022

However, this also means that there might be downstream false positives so to enable -race in downstream tests I will need for this PR to finally become merged and part of a new version tag.

The race detector generally does not have false positives; if it reports that a race has been detected, the program violates the Go memory model in some way.

If I'm getting this correct now, then the reason is that I will always end up in cases where the currently race-triggering API calls will be on different goroutines/threads.

I still don't think you're getting it, as you're still calling NegotiateAPIVersion more than once on a client. The client has an internal flag, negotiated, which is set to true by the NegotiateAPIVersion method. Every API method on the client checks this flag and calls NegotiateAPIVersion itself if the flag is false. So while the flag is false, concurrent calls to any API methods on a client will cause data races: unsynchronized read and write of the negotiated flag. But once the flag is true, concurrent API method calls will not race with each other as they won't call NegotiateAPIVersion themselves, but they will race with any explicit calls to NegotiateAPIVersion. So to work around the race on the v20.10 client, you need to get that flag set to true without triggering a race, and never call NegotiateAPIVersion again on that client. I hope this helps @thediveo.

@thediveo
Copy link

thediveo commented Jun 22, 2022

my false positive was misleading, sorry. I forgot to put it into context of a correct positive but in an upstream module, so to a downstream unit test it's a false positive from the module unit test perspective, but not a Go race detector false positive. Sorry for confusing you.

I think I understood this reasoning correctly, but it is good that you reinforce the point again. And this is what I'm referring to in the bottom part of my comment: I ensured your point (if I'm not mistaken) now both in the whalewatcher code under test as well in a particular test case by ensuring that the "synchronous" API version negation safely takes place in a single place and completes, before issuing further API calls from multiple goroutines.

What I wanted to hint at while this fix on my side should be as you suggested, except put into a later place because of the required context, other (private) modules downstream to whalewatcher might not be refactor-able to also implement this fix. Instead these private modules need to wait for the PR fix in order to enable test -race in the future on them.

@thaJeztah thaJeztah modified the milestones: 22.06.0, v-next Jun 30, 2022
@rumpl rumpl modified the milestones: 24.0.0, v-future Apr 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[go client] negotiateAPIVersionPing triggers Go race detector
7 participants