Transport is closing and transient failure errors since upgrading to 1.18 #2663

Closed
cristiangraz opened this issue Mar 1, 2019 · 9 comments

@cristiangraz

commented Mar 1, 2019

Please answer these questions before submitting your issue.

What version of gRPC are you using?

1.18.0

What version of Go are you using (go version)?

1.11.5

What operating system (Linux, Windows, …) and version?

Alpine Linux 3.9 running on Google Kubernetes Engine

What did you do?

Since upgrading, I've noticed a large increase in grpc errors leading to failed requests and service disruptions.

We don't use any connection settings on the clients, nor any special settings on the server (keepalives, fail-fast, etc.). The only thing we have on the connections is mutual TLS, roughly configured like this:

Client:

creds := grpc.WithTransportCredentials(credentials.NewTLS(&tls.Config{
		RootCAs:      certPool,
		Certificates: []tls.Certificate{cert},
		MinVersion:   tls.VersionTLS12,
}))
grpc.Dial("foo.foo.svc.cluster.local:80", creds)

Server:

creds := grpc.Creds(credentials.NewTLS(&tls.Config{
    Certificates:             []tls.Certificate{cert},
    ClientCAs:                certPool,
    ClientAuth:               tls.RequireAndVerifyClientCert,
    MinVersion:               tls.VersionTLS12,
    NextProtos:               []string{"h2"},
    PreferServerCipherSuites: true,
    CurvePreferences: []tls.CurveID{
        tls.CurveP256,
        tls.X25519,
    },
    CipherSuites: []uint16{
        tls.TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,
        tls.TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,
        tls.TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,
        tls.TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,
        tls.TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,
        tls.TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,
    },
}))
grpc.NewServer(creds)

This may be related to one of these other two issues
#2653
#2636

Based on the feedback in one of the above linked issues, I set these ENV vars on one service that had a lot of failures: GRPC_GO_LOG_VERBOSITY_LEVEL=99 GRPC_GO_LOG_SEVERITY_LEVEL=info

I appended (__GRPC CALL HERE__) to the log lines below at the points where the error surfaced as a failed gRPC call made by our application.

WARNING: 2019/03/01 04:41:42 grpc: addrConn.createTransport failed to connect to {foo.foo.svc.cluster.local:80 0  <nil>}. Err :connection error: desc = "transport: authentication handshake failed: context deadline exceeded". Reconnecting...
INFO: 2019/03/01 04:41:42 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:41:42 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, TRANSIENT_FAILURE
timestamp=2019-03-01T04:41:42.887562062Z level=error error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\"" (__GRPC CALL HERE__)
INFO: 2019/03/01 04:41:43 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:41:43 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, CONNECTING
INFO: 2019/03/01 04:41:44 Subchannel Connectivity change to READY
INFO: 2019/03/01 04:41:44 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, READY
INFO: 2019/03/01 04:45:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 04:45:22 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:45:22 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 04:45:22 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 04:45:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:45:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 04:45:28 Subchannel Connectivity change to READY
INFO: 2019/03/01 04:45:28 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:34:23 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:34:23 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:35:25 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:35:25 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:35:27 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:35:32 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:35:59 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:35:59 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:36:01 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:36:01 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
timestamp=2019-03-01T05:38:19.972966567Z level=error error="rpc error: code = Unavailable desc = transport is closing" (__GRPC CALL HERE__)
INFO: 2019/03/01 05:38:19 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 05:38:19 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:19 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, TRANSIENT_FAILURE
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 pickfirstBalancer: HandleSubConnStateChange: 0xc000806290, CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 05:38:21 Subchannel Connectivity change to READY
INFO: 2019/03/01 05:38:21 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:13:05 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:13:05 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, TRANSIENT_FAILURE
INFO: 2019/03/01 16:13:05 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
timestamp=2019-03-01T16:13:05.92908361Z level=error error="rpc error: code = Unavailable desc = transport is closing" (__GRPC CALL HERE__)
INFO: 2019/03/01 16:13:07 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, CONNECTING
INFO: 2019/03/01 16:13:07 Subchannel Connectivity change to READY
INFO: 2019/03/01 16:13:07 pickfirstBalancer: HandleSubConnStateChange: 0xc0008293d0, READY
INFO: 2019/03/01 16:20:17 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:20:17 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:20:18 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:20:23 Subchannel Connectivity change to READY
INFO: 2019/03/01 16:42:24 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 16:42:24 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 16:42:25 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 16:42:25 Subchannel Connectivity change to READY
INFO: 2019/03/01 17:20:47 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 17:20:47 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 17:20:49 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 17:20:54 Subchannel Connectivity change to READY
INFO: 2019/03/01 17:43:01 Subchannel Connectivity change to TRANSIENT_FAILURE
INFO: 2019/03/01 17:43:01 transport: loopyWriter.run returning. connection error: desc = "transport is closing"
INFO: 2019/03/01 17:43:02 Subchannel Connectivity change to CONNECTING
INFO: 2019/03/01 17:43:07 Subchannel Connectivity change to READY
@menghanl

Contributor

commented Mar 1, 2019

Can you also get server-side logs to see if they contain more information?

The handshake deadline-exceeded error looks very suspicious. It would help to know why it happened.

@menghanl

Contributor

commented Mar 1, 2019

One more thing to try: can you set GRPC_GO_REQUIRE_HANDSHAKE=off on the client side and see if that changes anything? (#2406)

@kazegusuri

Contributor

commented Mar 2, 2019

@menghanl I have experienced the same issue. Here is a reproduction procedure.

Example program

client

package main

import (
	"context"
	"log"
	"time"

	"github.com/golang/glog"
	"github.com/kazegusuri/grpc-test/pb"
	"google.golang.org/grpc"
	_ "google.golang.org/grpc/grpclog/glogger"
)

func main() {
	ctx := context.Background()
	conn1, err := grpc.DialContext(ctx, "127.0.0.1:9000", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("grpc.DialContext failed: %v", err)
	}
	defer func() { _ = conn1.Close() }()

	cli1 := pb.NewEchoClient(conn1)
	for {
		time.Sleep(1 * time.Millisecond)
		_, err := cli1.Echo(ctx, &pb.EchoMessage{Message: "hello"})
		if err != nil {
			glog.Infof("goroutine: echo failed: %v", err)
		} else {
			glog.Infof("goroutine: echo succeeded")
		}
	}
}

server

package main

import (
	"log"
	"net"
	"time"

	"github.com/kazegusuri/grpc-test/pb"
	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	server := newServer() // newServer returns the pb.EchoServer implementation (not shown)
	s := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge: 10 * time.Second, // this kills the connection 10s after it is established
	}))
	pb.RegisterEchoServer(s, server)

	lis, err := net.Listen("tcp", ":9000")
	if err != nil {
		log.Fatalf("failed to listen: %v", err)
	}

	if err := s.Serve(lis); err != nil { // Serve blocks until the server stops
		log.Fatalf("err %v\n", err)
	}
}

Results

With the programs above, I run the server on v1.19.0 and the client on v1.17, v1.18, and v1.19.
The server kills the connection from the client 10 seconds after it connects.
The client should then reconnect, but the behavior differs between client versions:
with v1.18 and v1.19 the client takes over 1 second to reconnect to the server.

v1.17.0 (default)

I0303 02:22:59.745529    6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, TRANSIENT_FAILURE
I0303 02:22:59.745566    6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, CONNECTING
I0303 02:22:59.745732    6773 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00001e030, READY

v1.18.0 (default)

I0303 02:20:58.820793    5169 clientconn.go:941] Subchannel Connectivity change to TRANSIENT_FAILURE
I0303 02:20:58.820892    5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, TRANSIENT_FAILURE
I0303 02:20:59.821028    5169 clientconn.go:941] Subchannel Connectivity change to CONNECTING
I0303 02:20:59.821189    5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, CONNECTING
I0303 02:20:59.822058    5169 clientconn.go:941] Subchannel Connectivity change to READY
I0303 02:20:59.822155    5169 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00016e010, READY

v1.18.0 (GRPC_GO_REQUIRE_HANDSHAKE=off)

I0303 02:25:59.393033    8568 clientconn.go:941] Subchannel Connectivity change to TRANSIENT_FAILURE
I0303 02:25:59.393096    8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, TRANSIENT_FAILURE
I0303 02:26:00.393385    8568 clientconn.go:941] Subchannel Connectivity change to CONNECTING
I0303 02:26:00.393510    8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, CONNECTING
I0303 02:26:00.393937    8568 clientconn.go:941] Subchannel Connectivity change to READY
I0303 02:26:00.393979    8568 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc00017c010, READY


v1.19.0 (default)

I0303 02:28:57.655932   11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, TRANSIENT_FAILURE
I0303 02:28:58.656125   11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, CONNECTING
I0303 02:28:58.656489   11292 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000160010, READY

v1.19.0 (GRPC_GO_REQUIRE_HANDSHAKE=off)

I0303 02:30:30.642447   11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, TRANSIENT_FAILURE
I0303 02:30:31.642680   11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, CONNECTING
I0303 02:30:31.643004   11438 pickfirst.go:73] pickfirstBalancer: HandleSubConnStateChange: 0xc000162010, READY
@kazegusuri

Contributor

commented Mar 2, 2019

This causes a performance regression against servers that are reached through a single subconn, such as Google Spanner.
Spanner appears to kill connections on the server side after an hour, at which point the connection enters transient failure. For the roughly one second it spends in transient failure before becoming ready again, the connection remains unavailable.

@kazegusuri

Contributor

commented Mar 3, 2019

If it's okay to create the transport immediately after a transient failure on the first retry, how about adding code like this here?

if ac.backoffIdx == 0 {
	backoffFor = 0
}
@vvakame


commented Mar 6, 2019

I have a question.
I found baseDelay = 1.0 * time.Second here.
I think this value is a bit large. Why do we use 1 second by default?
One second is too long from the perspective of a user-facing frontend.

@mpipet


commented Mar 6, 2019

I encountered the same issue using the Google Cloud Bigtable Go client (which is gRPC based). The Bigtable gRPC server seems to have a connection max age of one hour, which means that every hour I literally stop handling my traffic for one second.

I agree with @vvakame: this default value is too high for a lot of use cases and should be easy to configure in the other Google clients built on this repo. A sketch of tuning it on the client is shown below.
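
For anyone landing here later: newer grpc-go releases (this option did not exist in the 1.18/1.19 versions discussed in this issue) expose the connection backoff parameters through grpc.WithConnectParams and the google.golang.org/grpc/backoff package. A minimal sketch of shrinking BaseDelay, assuming a plaintext target (the address and delay values are placeholders):

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

// dialWithFastRetry dials with a smaller reconnect backoff so a broken
// subchannel is retried sooner than the 1s default BaseDelay.
func dialWithFastRetry(target string) (*grpc.ClientConn, error) {
	bc := backoff.DefaultConfig
	bc.BaseDelay = 100 * time.Millisecond // default is 1s

	return grpc.Dial(target,
		grpc.WithInsecure(), // plaintext for the sketch; use transport credentials in production
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff:           bc,
			MinConnectTimeout: 5 * time.Second,
		}),
	)
}

func main() {
	conn, err := dialWithFastRetry("127.0.0.1:9000")
	if err != nil {
		log.Fatalf("dial failed: %v", err)
	}
	defer conn.Close()
}

This only changes how quickly a broken subchannel is retried; it does not remove the window in which in-flight RPCs can fail with Unavailable.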

@menghanl

Contributor

commented Mar 6, 2019

There's a bug in the reconnect logic: the backoff time is not reset after a connection is successfully created.

The fix is in #2669. Can you guys try it and see if it fixes the problems you are seeing? Thanks!
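
To make the bug concrete for anyone reading later: the problem is the generic pattern of forgetting to reset a retry counter after success. A standalone, simplified sketch of the intended behavior (illustrative only, not grpc-go's actual code, with made-up delays):

package main

import (
	"errors"
	"fmt"
	"time"
)

// reconnectLoop is a simplified illustration of exponential backoff with
// reset-on-success. It is not grpc-go's implementation.
func reconnectLoop(connect func() error) {
	const baseDelay = 10 * time.Millisecond
	const maxDelay = 1 * time.Second
	failures := 0
	for {
		if err := connect(); err != nil {
			failures++
			delay := baseDelay << uint(failures-1) // 10ms, 20ms, 40ms, ...
			if delay > maxDelay {
				delay = maxDelay
			}
			time.Sleep(delay)
			continue
		}
		// The step that was missing in the affected releases: once a
		// connection is established, clear the failure history so the next
		// reconnect (after the connection eventually breaks) is not delayed
		// by stale backoff state.
		failures = 0
		return // stop here so this demo terminates; a real loop would keep serving
	}
}

func main() {
	attempts := 0
	reconnectLoop(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("dial failed") // fail twice, then succeed
		}
		return nil
	})
	fmt.Println("connected after", attempts, "attempts")
}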

menghanl added a commit that referenced this issue Mar 8, 2019

@dfawley

Contributor

commented Mar 14, 2019

Closing. Please comment if you are still experiencing problems.
