Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

internal/singleflight: results returned by ForgetUnshared do not always seem to be accurate #55343

Closed
dchaofei opened this issue Sep 22, 2022 · 9 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@dchaofei
Copy link
Contributor

dchaofei commented Sep 22, 2022

What version of Go are you using (go version)?

$ go version
[1.16 1.19.1]

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="arm64"
GOHOSTARCH="arm64"
GOHOSTOS="darwin"

What did you do?

I find that the src/internal/singleflight/singleflight.go ForgetUnshared() method returns results that are not always expected

For this I wrote a test code, I copied the code in the src/internal/singleflight/singleflight.go file to the main package, and wrote a main function to test it, if ForgetUnshared() returns correctly, this code It should not panic, because I call cancel() only when there are no other coroutines to share, but the fact that it will panic every time it runs, is there something wrong with my understanding of ForgetUnshared()?

The test code cannot be run in goplay, so I posted a link:
https://gist.github.com/dchaofei/e07547bce17d94c3e05b1b2a7230f62f

In fact, this is a real online environment problem. My application uses http.Client.Do(), but it occasionally has errors: [lookup xxxxx on xxxxx: dial udp xxxxx: operation was canceled], after looking at the code, I found that it may be There is a problem with ForgetUnshared, lookupIPAddr uses ForgetUnshared:

go/src/net/lookup.go

Lines 319 to 337 in 4a4127b

ch, called := r.getLookupGroup().DoChan(lookupKey, func() (any, error) {
defer dnsWaitGroup.Done()
return testHookLookupIP(lookupGroupCtx, resolverFunc, network, host)
})
if !called {
dnsWaitGroup.Done()
}
select {
case <-ctx.Done():
// Our context was canceled. If we are the only
// goroutine looking up this key, then drop the key
// from the lookupGroup and cancel the lookup.
// If there are other goroutines looking up this key,
// let the lookup continue uncanceled, and let later
// lookups with the same key share the result.
// See issues 8602, 20703, 22724.
if r.getLookupGroup().ForgetUnshared(lookupKey) {
lookupGroupCancel()

(I have already asked on golang-nuts, but it seems not ideal, so I took the liberty to raise an issue on github😊)

What did you expect to see?

I hope no panic occurs

What did you see instead?

panic: callUUID=[9314284969] err=[context canceled] currentUUId=[6980556786]

@seankhliao
Copy link
Member

working as intended.

ForgetUnshared doesn't prevent writes, only that the next call will run.

Unlike many projects, the Go project does not use GitHub Issues for general discussion or asking questions. GitHub Issues are used for tracking bugs and proposals only.

For questions please refer to https://github.com/golang/go/wiki/Questions

@seankhliao seankhliao closed this as not planned Won't fix, can't repro, duplicate, stale Sep 22, 2022
@dchaofei
Copy link
Contributor Author

@seankhliao
Hello.
This seems to be a bug, Brian Candler found the problem in the golang-nuts forum, here is a quote from his reply:

OK, I think I have it. It's ugly.

Firstly, note that multiple instances of doCall can be running for the same key. This happens when:

  1. you invoke DoChan. This inserts a 'c' (call struct) into the map and starts doCall in a goroutine.
  2. at this point it's not shared: i.e. you don't call DoChan again with the same key (yet).
  3. you invoke ForgetUnshared on this key. This "detaches" it, but doCall carries on running. It has its own local copy of 'c' so it knows where to send the result, even though the map is now empty.
  4. you invoke DoChan again with the same key. This inserts a new 'c' into the map and starts a new doCall goroutine.

At this point, you have two instances of doCall running, and the map is pointing at the second one.

This is where it gets ugly.

  1. you invoke DoChan yet again with the same key. This turns it into a shared task, with c.dups > 0, len(c.chans) > 1.
  2. the first instance of doCall terminates. At this point it unconditionally removes the key from the map - even though it had previously been removed by ForgetUnshared!
func (g *Group) doCall(c *call, key string, fn func() (interface{}, error)) {
        c.val, c.err = fn()
        c.wg.Done()

        g.mu.Lock()
        delete(g.m, key)    // <<<< NOTE
        for _, ch := range c.chans {
                ch <- Result{c.val, c.err, c.dups > 0}
        }
        g.mu.Unlock()
}

So, even though it's the first instance of doCall which is terminating, it's removing the second instance of doCall from the map. This is now also a detached task.

  1. In one of the two goroutines, the timeout event occurs. It calls ForgetUnshared, which happily returns true because the key does not exist in the map - and therefore you proceed to cancel the context.

But actually a task with this key is running; and furthermore, it is a shared task, with 2 channel receivers.

  1. Once the sleep has completed in the task function, it notices that the context is cancelled and returns an error.

  2. doCall sends the resulting error down multiple channels (those you started in steps 4 and 5 above)

  3. The select { case res := <-ch } triggers in the other goroutine - the one which didn't have a timeout. Hence it receives the error, and that's where you panic().

And here's a proof-of-concept fix which seems to do the job:

--- main.go.orig    2022-09-21 13:14:10.000000000 +0100
+++ main.go    2022-09-22 23:19:54.000000000 +0100
@@ -27,6 +27,7 @@
     // not written after the WaitGroup is done.
     dups  int
     chans []chan<- Result
+    forgotten bool
 }

 // Group represents a class of work and forms a namespace in
@@ -101,7 +102,9 @@
     c.wg.Done()

     g.mu.Lock()
-    delete(g.m, key)
+    if !c.forgotten {
+        delete(g.m, key)
+    }

     for _, ch := range c.chans {
         ch <- Result{c.val, c.err, c.dups > 0}
     }
@@ -121,6 +124,7 @@
         return true
     }
     if c.dups == 0 {
+        c.forgotten = true
         delete(g.m, key)
         return true

@dchaofei
Copy link
Contributor Author

Is it possible to reopen this issue?

@cuonglm
Copy link
Member

cuonglm commented Sep 23, 2022

Duplicated of #31420?

The fix seems to not be ported to internal/singleflight.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/433315 mentions this issue: internal/singleflight: fix duplicate deleting key when ForgetUnshared called

@dchaofei
Copy link
Contributor Author

@cuonglm Hi, will this fix be released as a minor version of go? 1.19.x? How long is it expected?

@cuonglm
Copy link
Member

cuonglm commented Sep 23, 2022

@cuonglm Hi, will this fix be released as a minor version of go? 1.19.x? How long is it expected?

I dont think this is qualified for backporting, so likely no fix for 1.19.x

@dchaofei
Copy link
Contributor Author

@cuonglm Hi, will this fix be released as a minor version of go? 1.19.x? How long is it expected?

I dont think this is qualified for backporting, so likely no fix for 1.19.x

So we have to wait until the 1.20 version on February 1st next year? ^_^

@dchaofei
Copy link
Contributor Author

This problem will affect Resolver.lookupIPAddr(), when the concurrency is too large, the dns resolution will return err: lookup xxxxx on xxxxx: dial udp xxxxx: operation was canceled

Based on the above questions, is it possible to publish to 1.19.x?
Of course, to assess compliance with https://github.com/golang/go/wiki/Go-Release-Cycle

@cherrymui cherrymui changed the title src/internal/singleflight: results returned by ForgetUnshared do not always seem to be accurate internal/singleflight: results returned by ForgetUnshared do not always seem to be accurate Sep 23, 2022
@cherrymui cherrymui added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 23, 2022
@cherrymui cherrymui added this to the Go1.20 milestone Sep 23, 2022
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 30, 2022
@golang golang locked and limited conversation to collaborators Sep 30, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

7 participants