Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FailOnNonTempDialError doesn't fail #2266

Closed
virtuald opened this issue Aug 15, 2018 · 10 comments
Closed

FailOnNonTempDialError doesn't fail #2266

virtuald opened this issue Aug 15, 2018 · 10 comments

Comments

@virtuald
Copy link
Contributor

virtuald commented Aug 15, 2018

What version of gRPC are you using?

07ef407

What version of Go are you using (go version)?

go version go1.10.3 darwin/amd64

What operating system (Linux, Windows, …) and version?

What did you do?

If possible, provide a recipe for reproducing the error.

package main

import (
	"context"
	"fmt"
	"google.golang.org/grpc"
	"net"
	"time"
)

type fatalError struct{}

func (fatalError) Error() string   { return "fatal" }
func (fatalError) Temporary() bool { return false }

func main() {

	dialer := grpc.WithDialer(func(string, time.Duration) (net.Conn, error) {
		fmt.Println("Fail")
		return nil, fatalError{}
	})

	conn, err := grpc.DialContext(context.Background(), "nope.server:8080",
		grpc.WithInsecure(),
		grpc.WithBlock(),
		grpc.FailOnNonTempDialError(true),
		dialer,
	)

	fmt.Printf("got %v %v\n", conn, err)
}

What did you expect to see?

The documentation for FailOnNonTempDialError says:

FailOnNonTempDialError returns a DialOption that specifies if gRPC fails on non-temporary dial errors. If f is true, and dialer returns a non-temporary error, gRPC will fail the connection to the network address and won't try to reconnect.

I would expect that if a non-temporary error was returned from the dialer, I would see the dial immediately return.

Fail
got <something>fatal

What did you see instead?

Instead, it just prints Fail over and over again since it keeps trying to retry (the thing it's not supposed to do).

Fail
Fail
Fail
...

Perhaps I'm misunderstanding the documentation?

@virtuald
Copy link
Contributor Author

To workaround this, I used a modified version of the blocking code in DialContext and ended up with using something similar to this instead of WithBlock:

errChan := make(chan error, 1)

dialer := grpc.WithDialer( func(addr, timeout) {
    conn, err := dial()
    if err != nil {
        select { 
        case errChan <- err:
        default:
        }
    }
    return conn, err
)

conn, err := grpc.DialContext(ctx, addr, dialer)
if err != nil {
    return err
}

for {
	s := conn.GetState()
	if s == connectivity.Ready {
		return conn, nil
	} else if s == connectivity.TransientFailure {
		select {
		case err := <-errChan:
			if permerror.IsTemporary(err) != permerror.Temporary {
				conn.Close()
				return nil, err
			}
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
	if !conn.WaitForStateChange(ctx, s) {
		return nil, ctx.Err()
	}
}

@lyuxuan
Copy link
Contributor

lyuxuan commented Aug 15, 2018

Hi @virtuald, thanks for bringing up the issue. Actually, #1856 void the effect of FailOnNonTempDialError, and we need to update the comment on the API as it no longer has the effect that it supposes to. May I you your use case that requires the fail on non temp behavior? Thank you!

@virtuald
Copy link
Contributor Author

My use case is I was connecting to a GRPC server via a custom dialer. There are certain types of errors which I know I want to fail immediately so I was hoping this would do the trick. My client is effectively a proxy with expensive connection setup so I was caching my grpc connection objects, and if the underlying connection immediately fails I'd rather not cache it in that case.

I get that it's really difficult to determine if an error is permanent or not -- but presumably if someone sets FailOnNonTempDialError, they would expect failure to occur if a non-temporary failure occurs. You get what you asked for, right? And if FailOnNonTempDialError isn't set, then it should retry over and over again as it does now.

@menghanl menghanl added the P2 label Aug 16, 2018
@menghanl
Copy link
Contributor

It seems that what you need is to know the underlying connection error when a ClientConn is not working. Does that sound right?

There's #2055 to return the latest connection error in the ClientConn, but we haven't finalized on how to do that. Let us know if this solution will work for you.

@virtuald
Copy link
Contributor Author

Yes, that's about right. However, I also want it to fail fast -- basically, I want my workaround code above. I don't see how 2055 solves that.

@menghanl
Copy link
Contributor

With #2055 or something similar, you can wait on the ClientConn's connectivity state change. When it goes into transient failure, you can call the new function to look at the connection error, and choose to keep or close the ClientConn.

@dfawley
Copy link
Member

dfawley commented Aug 21, 2018

I think we can fix this so the old behavior keeps working -- i.e. FailOnNonTempDialError would be a fail-fast for blocking Dial calls. Blocking dials already poll connectivity status; we just need to check the last error when the state is transient failure and exit if it is of the non-temporary variety.

@dfawley
Copy link
Member

dfawley commented Aug 21, 2018

Increasing priority to keep it on our radar since this was an unintentional behavior change.

@dweomer
Copy link

dweomer commented Aug 24, 2018

I am seeing this in containerd/containerd#2576 when attempting to connect to unix sockets that either do not exist or the user doesn't have the proper permissions to read/write. My expectation is that the dial attempt would fail immediately (aka fail fast) in these two scenarios but instead I get "context deadline exceeded".

@virtuald
Copy link
Contributor Author

Pushed a potential fix for this in #2276

virtuald added a commit to virtuald/grpc-go that referenced this issue Aug 27, 2018
virtuald added a commit to virtuald/grpc-go that referenced this issue Aug 27, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Feb 23, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants