Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: deadlock in lookupIP upon error #24178

Closed
Kleissner opened this issue Feb 28, 2018 · 7 comments
Closed

net: deadlock in lookupIP upon error #24178

Kleissner opened this issue Feb 28, 2018 · 7 comments
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@Kleissner
Copy link

Kleissner commented Feb 28, 2018

Latest version go1.10 windows/amd64

tl;dr: The function LookupHost leaks Go routines/handles and eventually completely deadlocks when you query an invalid domain with context cancelled

There is an obvious deadlock in the function net.lookupIP (file lookup_windows.go), this code: https://github.com/golang/go/blob/master/src/net/lookup_windows.go#L79

Here is the problem (commented also inline below in function):

  1. It creates the channel "ch" with buffer size 1 on L86
  2. It makes potentially 3 writes to the channel - if GetAddrInfoW fails, and then after the for loop
  3. However, it may only make 0 reads, when the context is cancelled.

While the function will return, the created Go routine deadlocks forever and will never release the thread that it acquired with acquireThread. Eventually if you make enough (500) IP lookups that fail, the function will lock forever and not return.

Potential solutions:

  1. Make the writes to the "ch" channel non-blocking via select
  2. Alternatively, detect the context cancellation in the Go routine
  3. Alternatively, increase channel size of ch to 3 so that writes will never block
  4. Alternatively, add return statement after write to ch channel so that it won't result in potential block
  5. A proper fix as suggested by alexbrainman

Code to demonstrate this bug

First: place this in lookup_windows.go:117 (right before the "ch <- ret" statement) so you can see the ongoing leaks:

		fmt.Printf("Creating leak, length %d / %d\n", len(threadLimit), cap(threadLimit))

Then you'll see the leaks in this special case. NOTES: This is Windows specific! Also your DNS provider/cache may brick this demo code; with the below domain it is expected that GetAddrInfoW fails.

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

const testDomain = "ty.laiwu.gov.cn"

func main() {
	fmt.Println("Start DNS Test, making lookups with context cancelled to a domain that will fail, causing leaks in lookupIP and eventually deadlock")

	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	for n := 0; n < 1000; n++ {
		address := getIP(ctx, testDomain)
		fmt.Printf("%d: %s\n", n, address)

		time.Sleep(time.Second * 1)
	}

	fmt.Println("End DNS Test")
}

func getIP(ctx context.Context, Domain string) (addr string) {
	addrs, err := net.DefaultResolver.LookupHost(ctx, Domain)

	if err != nil {
		return err.Error()
	}

	if len(addrs) > 0 {
		return addrs[0]
	}

	return "None"
}

Output, you will see that net.threadLimit will eventually be full and then the function will lock forever:

Start DNS Test, making lookups with context cancelled to a domain that will fail, causing leaks in lookupIP and eventually deadlock
0: lookup ty.laiwu.gov.cn: context canceled
1: lookup ty.laiwu.gov.cn: context canceled
2: lookup ty.laiwu.gov.cn: context canceled
3: lookup ty.laiwu.gov.cn: context canceled
Creating leak, length 4 / 500
Creating leak, length 4 / 500
Creating leak, length 4 / 500
Creating leak, length 4 / 500
4: lookup ty.laiwu.gov.cn: context canceled
Creating leak, length 5 / 500
5: lookup ty.laiwu.gov.cn: context canceled
Creating leak, length 6 / 500
6: lookup ty.laiwu.gov.cn: context canceled
Creating leak, length 7 / 500
[...]
@odeke-em
Copy link
Member

odeke-em commented Mar 2, 2018

/cc @bradfitz @alexbrainman

@alexbrainman
Copy link
Member

It makes potentially 2 (!) writes to the channel - if GetAddrInfoW fails, and then after the for loop

I can actually see 3 writes to the ch variable - on lines 98, 113 and 116.

Eventually if you make enough (500) IP lookups that fail, the function will lock forever and not return.

Do you have a small example to demonstrate the problem? My hope is to use the code as part of the change to fix this issue.

Solution: Add return statements:

Yes, this will work. But that might break in the future, if someone adds more code that fails and forget to add write to ch or forget to add return. I would prefer we add new function that does not write into ch, but returns []IPAddr and error. Then we could use the function to write into ch once.

Would you like to send this fix yourself? This https://golang.org/doc/contribute.html is how to contribute. You could also send a Pull Request.

Thank you

Alex

@andybons andybons added the WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. label Mar 7, 2018
@andybons andybons added this to the Go1.11 milestone Mar 7, 2018
@andybons andybons added help wanted NeedsFix The path to resolution is known, but the work has not been done. and removed WaitingForInfo Issue is not actionable because of missing required information, which needs to be provided. labels Mar 7, 2018
@Kleissner
Copy link
Author

Kleissner commented Mar 12, 2018

@alexbrainman I clarified the problem and added demo code that shows this leak & eventual lock.

The problem occurs when calling this function with a cancelled context and invalid domain.
I will check if I can come up with a good fix and pull request tomorrow.

@alexbrainman
Copy link
Member

@Kleissner nice test, thank you. I adjusted your test to run faster and crash instead of hang:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

const testDomain = "ty.laiwu.gov.cn"

func main() {
	const google = "www.google.com"

	_, err := net.LookupHost(google)
	if err != nil {
		panic(err)
	}

	fmt.Println("Start DNS Test, making lookups with context cancelled to a domain that will fail, causing leaks in lookupIP and eventually deadlock")
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	for n := 0; n < 1000; n++ {
		address := getIP(ctx, testDomain)
		fmt.Printf("%d: %s\n", n, address)
		time.Sleep(time.Millisecond * 1)
	}

	_, err = net.LookupHost(google)
	if err != nil {
		panic(err)
	}

	fmt.Println("End DNS Test")
}

func getIP(ctx context.Context, Domain string) (addr string) {
	addrs, err := net.DefaultResolver.LookupHost(ctx, Domain)
	if err != nil {
		return err.Error()
	}
	if len(addrs) > 0 {
		return addrs[0]
	}
	return "None"
}

I will check if I can come up with a good fix and pull request tomorrow.

Sure, have a go, hopefully with a test. Whenever you can. If you cannot I will fix this myself when I have time. Just let me know.

Thank you.

Alex

@Kleissner
Copy link
Author

Can you @alexbrainman come up with a fix? Unfortunately I don't have enough time to write a proper fix :(

@alexbrainman
Copy link
Member

Can you @alexbrainman come up with a fix?

Sure thing. I will put it on my TODO list, which is very long at this moment.

Unfortunately I don't have enough time to write a proper fix :(

That is quite OK. Someone else will fix this - we have beginning of a test, which is the hardest part.

Alex

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/111718 mentions this issue: net: stop multiple sends into single capacity channel in lookupIP

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

5 participants