Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: deadlock in lookupIP upon error #24178

Closed
Kleissner opened this issue Feb 28, 2018 · 7 comments

Comments

Projects
None yet
5 participants
@Kleissner
Copy link

commented Feb 28, 2018

Latest version go1.10 windows/amd64

tl;dr: The function LookupHost leaks Go routines/handles and eventually completely deadlocks when you query an invalid domain with context cancelled

There is an obvious deadlock in the function net.lookupIP (file lookup_windows.go), this code: https://github.com/golang/go/blob/master/src/net/lookup_windows.go#L79

Here is the problem (commented also inline below in function):

  1. It creates the channel "ch" with buffer size 1 on L86
  2. It makes potentially 3 writes to the channel - if GetAddrInfoW fails, and then after the for loop
  3. However, it may only make 0 reads, when the context is cancelled.

While the function will return, the created Go routine deadlocks forever and will never release the thread that it acquired with acquireThread. Eventually if you make enough (500) IP lookups that fail, the function will lock forever and not return.

Potential solutions:

  1. Make the writes to the "ch" channel non-blocking via select
  2. Alternatively, detect the context cancellation in the Go routine
  3. Alternatively, increase channel size of ch to 3 so that writes will never block
  4. Alternatively, add return statement after write to ch channel so that it won't result in potential block
  5. A proper fix as suggested by alexbrainman

Code to demonstrate this bug

First: place this in lookup_windows.go:117 (right before the "ch <- ret" statement) so you can see the ongoing leaks:

		fmt.Printf("Creating leak, length %d / %d\n", len(threadLimit), cap(threadLimit))

Then you'll see the leaks in this special case. NOTES: This is Windows specific! Also your DNS provider/cache may brick this demo code; with the below domain it is expected that GetAddrInfoW fails.

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

const testDomain = "ty.laiwu.gov.cn"

func main() {
	fmt.Println("Start DNS Test, making lookups with context cancelled to a domain that will fail, causing leaks in lookupIP and eventually deadlock")

	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	for n := 0; n < 1000; n++ {
		address := getIP(ctx, testDomain)
		fmt.Printf("%d: %s\n", n, address)

		time.Sleep(time.Second * 1)
	}

	fmt.Println("End DNS Test")
}

func getIP(ctx context.Context, Domain string) (addr string) {
	addrs, err := net.DefaultResolver.LookupHost(ctx, Domain)

	if err != nil {
		return err.Error()
	}

	if len(addrs) > 0 {
		return addrs[0]
	}

	return "None"
}

Output, you will see that net.threadLimit will eventually be full and then the function will lock forever:

Start DNS Test, making lookups with context cancelled to a domain that will fail, causing leaks in lookupIP and eventually deadlock
0: lookup ty.laiwu.gov.cn: context canceled
1: lookup ty.laiwu.gov.cn: context canceled
2: lookup ty.laiwu.gov.cn: context canceled
3: lookup ty.laiwu.gov.cn: context canceled
Creating leak, length 4 / 500
Creating leak, length 4 / 500
Creating leak, length 4 / 500
Creating leak, length 4 / 500
4: lookup ty.laiwu.gov.cn: context canceled
Creating leak, length 5 / 500
5: lookup ty.laiwu.gov.cn: context canceled
Creating leak, length 6 / 500
6: lookup ty.laiwu.gov.cn: context canceled
Creating leak, length 7 / 500
[...]
@odeke-em

This comment has been minimized.

Copy link
Member

commented Mar 2, 2018

@alexbrainman

This comment has been minimized.

Copy link
Member

commented Mar 5, 2018

It makes potentially 2 (!) writes to the channel - if GetAddrInfoW fails, and then after the for loop

I can actually see 3 writes to the ch variable - on lines 98, 113 and 116.

Eventually if you make enough (500) IP lookups that fail, the function will lock forever and not return.

Do you have a small example to demonstrate the problem? My hope is to use the code as part of the change to fix this issue.

Solution: Add return statements:

Yes, this will work. But that might break in the future, if someone adds more code that fails and forget to add write to ch or forget to add return. I would prefer we add new function that does not write into ch, but returns []IPAddr and error. Then we could use the function to write into ch once.

Would you like to send this fix yourself? This https://golang.org/doc/contribute.html is how to contribute. You could also send a Pull Request.

Thank you

Alex

@andybons andybons added this to the Go1.11 milestone Mar 7, 2018

@Kleissner

This comment has been minimized.

Copy link
Author

commented Mar 12, 2018

@alexbrainman I clarified the problem and added demo code that shows this leak & eventual lock.

The problem occurs when calling this function with a cancelled context and invalid domain.
I will check if I can come up with a good fix and pull request tomorrow.

@alexbrainman

This comment has been minimized.

Copy link
Member

commented Mar 18, 2018

@Kleissner nice test, thank you. I adjusted your test to run faster and crash instead of hang:

package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

const testDomain = "ty.laiwu.gov.cn"

func main() {
	const google = "www.google.com"

	_, err := net.LookupHost(google)
	if err != nil {
		panic(err)
	}

	fmt.Println("Start DNS Test, making lookups with context cancelled to a domain that will fail, causing leaks in lookupIP and eventually deadlock")
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	for n := 0; n < 1000; n++ {
		address := getIP(ctx, testDomain)
		fmt.Printf("%d: %s\n", n, address)
		time.Sleep(time.Millisecond * 1)
	}

	_, err = net.LookupHost(google)
	if err != nil {
		panic(err)
	}

	fmt.Println("End DNS Test")
}

func getIP(ctx context.Context, Domain string) (addr string) {
	addrs, err := net.DefaultResolver.LookupHost(ctx, Domain)
	if err != nil {
		return err.Error()
	}
	if len(addrs) > 0 {
		return addrs[0]
	}
	return "None"
}

I will check if I can come up with a good fix and pull request tomorrow.

Sure, have a go, hopefully with a test. Whenever you can. If you cannot I will fix this myself when I have time. Just let me know.

Thank you.

Alex

@Kleissner

This comment has been minimized.

Copy link
Author

commented Apr 25, 2018

Can you @alexbrainman come up with a fix? Unfortunately I don't have enough time to write a proper fix :(

@alexbrainman

This comment has been minimized.

Copy link
Member

commented Apr 26, 2018

Can you @alexbrainman come up with a fix?

Sure thing. I will put it on my TODO list, which is very long at this moment.

Unfortunately I don't have enough time to write a proper fix :(

That is quite OK. Someone else will fix this - we have beginning of a test, which is the hardest part.

Alex

@gopherbot

This comment has been minimized.

Copy link

commented May 6, 2018

Change https://golang.org/cl/111718 mentions this issue: net: stop multiple sends into single capacity channel in lookupIP

@gopherbot gopherbot closed this in 33b2b17 May 12, 2018

boboanvip pushed a commit to boboanvip/goproxy that referenced this issue Jul 11, 2018

boboanvip pushed a commit to boboanvip/goproxy that referenced this issue Jul 12, 2018

@golang golang locked and limited conversation to collaborators May 12, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.