
spanner: session pool destruction with disconnected client #3685

Closed
bombsimon opened this issue Feb 9, 2021 · 3 comments

Comments

@bombsimon commented Feb 9, 2021

Client

Spanner

Environment

Any; in this example, the local Spanner emulator

Go Environment

$ go version
go version go1.15 darwin/amd64
$ go env
GO111MODULE="auto"
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/simon.sawert/Library/Caches/go-build"
GOENV="/Users/simon.sawert/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/simon.sawert/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/simon.sawert/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/simon.sawert/tmp/spanner-close/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/dy/tkxx13zx3n1d_m28k_3lh6w00000gr/T/go-build725805116=/tmp/go-build -gno-record-gcc-switches -fno-common"

Code

Minimal example to reproduce the issue:
package main

/*
By ensuring that we create the Spanner client after the DB has been created, a
session pool will be created with a default of 100 sessions.

* Start emulator: docker run --rm -it -p 127.0.0.1:9010:9010 gcr.io/cloud-spanner-emulator/emulator:1.1.1
* Start code: go run main.go
* Stop the emulator (^C)
* Stop the code (^C)

To exit the application, kill the PID or restart the emulator.
*/

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	"cloud.google.com/go/spanner"
	"google.golang.org/api/option"
	"google.golang.org/grpc"

	dbadmin "cloud.google.com/go/spanner/admin/database/apiv1"
	instance "cloud.google.com/go/spanner/admin/instance/apiv1"
	databasepb "google.golang.org/genproto/googleapis/spanner/admin/database/v1"
	instancepb "google.golang.org/genproto/googleapis/spanner/admin/instance/v1"
)

const (
	emulatorProjectName  = "emulator-project"
	emulatorInstanceName = "emulator-instance"
	emulatorDBName       = "emulator"
	emulatorProjectPath  = "projects/" + emulatorProjectName
	emulatorInstancePath = emulatorProjectPath + "/instances/" + emulatorInstanceName
	emulatorDBPath       = emulatorInstancePath + "/databases/" + emulatorDBName
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	opts := []option.ClientOption{
		option.WithEndpoint("localhost:9010"),
		option.WithGRPCDialOption(grpc.WithInsecure()),
		option.WithoutAuthentication(),
	}

	createInstanceAndDatabase(ctx, opts)

	log.Printf("Creating Spanner client - this will create a pool of 100 sessions")
	client, err := spanner.NewClient(ctx, emulatorDBPath, opts...)
	if err != nil {
		panic(err)
	}

	gracefulStop := make(chan os.Signal, 1)
	signal.Notify(gracefulStop, syscall.SIGINT)

	log.Printf("Waiting for SIGINT (press Ctrl + C)")
	<-gracefulStop

	log.Printf("Closing client")
	client.Close()

	log.Printf("All done, exiting!")
}

func createInstanceAndDatabase(ctx context.Context, opts []option.ClientOption) {
	log.Printf("Creating Spanner instance admin client")
	instanceAdminClient, err := instance.NewInstanceAdminClient(ctx, opts...)
	if err != nil {
		panic(err)
	}

	log.Printf("Creating Spanner database admin client")
	databaseAdminClient, err := dbadmin.NewDatabaseAdminClient(ctx, opts...)
	if err != nil {
		panic(err)
	}

	log.Printf("Deleting instance %s (ignoring if it does not exist)", emulatorInstanceName)
	_ = instanceAdminClient.DeleteInstance(ctx, &instancepb.DeleteInstanceRequest{
		Name: emulatorInstancePath,
	})

	log.Printf("Creating instance %s", emulatorInstanceName)
	if _, err := instanceAdminClient.CreateInstance(ctx, &instancepb.CreateInstanceRequest{
		Parent:     emulatorProjectPath,
		InstanceId: emulatorInstanceName,
		Instance: &instancepb.Instance{
			Config:      emulatorProjectName,
			DisplayName: emulatorInstanceName,
			Name:        emulatorInstancePath,
		},
	}); err != nil {
		panic(err)
	}

	log.Printf("Creating database %s", emulatorDBName)
	if _, err := databaseAdminClient.CreateDatabase(ctx, &databasepb.CreateDatabaseRequest{
		Parent:          emulatorInstancePath,
		CreateStatement: fmt.Sprintf("CREATE DATABASE %s", emulatorDBName),
	}); err != nil {
		panic(err)
	}
}

Expected behavior

Even with a lost connection to the Spanner instance, a call to Close() should not hang, at least not for 15 seconds per session in the pool (which defaults to 100 sessions).

Actual behavior

Since the connection is lost, calling Close() on a Spanner client hangs for 15 seconds per session in the pool before timing out; with the default pool size of 100, the whole operation takes up to 1,500 seconds.

Screenshots

N/A

Additional context

To reproduce this, use the code example above combined with the Spanner emulator. The code creates the instance and the database, so all you have to do is start the emulator, run the code, stop the emulator, and then try to interrupt the code (Ctrl + C). The easiest way to start the Spanner emulator is probably with Docker:

docker run --rm -it -p 127.0.0.1:9010:9010 gcr.io/cloud-spanner-emulator/emulator:1.1.1

I've noticed that it's important to ensure the database exists before creating the Spanner client. Initially I thought no connection was created when calling spanner.NewClient(), but it turns out the session pool is only populated when the instance and database exist. By creating the client in the order shown in the code above, it will have a pool of 100 sessions.

To summarise the flow when calling Close() on the client, we can see that it takes the following route:

  • Calling Close() calls close() on the idle session pool:

    func (c *Client) Close() {
    	if c.idleSessions != nil {
    		c.idleSessions.close()
    	}
    	c.sc.close()
    }

  • Which iterates over the pool one session at a time and calls destroy():

    allSessions := make([]*session, len(p.hc.queue.sessions))
    copy(allSessions, p.hc.queue.sessions)
    p.hc.mu.Unlock()
    for _, s := range allSessions {
    	s.destroy(false)
    }

  • Which creates a 15-second timeout for each session and calls delete():

    ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
    defer cancel()
    s.delete(ctx)

  • Which hangs for the full 15 seconds each time, since the client cannot reach the Spanner instance over the lost connection:

    err := s.client.DeleteSession(contextWithOutgoingMetadata(ctx, s.md), &sppb.DeleteSessionRequest{Name: s.getID()})
    if err != nil {
    	logf(s.logger, "Failed to delete session %v. Error: %v", s.getID(), err)
    }


This issue is mostly a question about the reasoning behind this timeout (15 seconds), the sequential calls, and what a proper way to handle this situation would be. I don't think a lost connection is that uncommon, and with a proper liveness check for a service, a call to Close() could very well be issued when the connection is detected to be lost (and not recoverable). When that happens, I would like my program to exit as quickly as possible.

I can see four possible solutions:

  1. Send a context (with an optional timeout) to Close(). This will probably not happen since it would change the signature and require a new major release of the package to conform to semver.
  2. Lower the timeout to something really low in the millisecond range, so the whole operation only takes a few seconds when this happens.
  3. Run all of the destroy() operations concurrently. I haven't checked whether this is actually reasonable or could cause problems.
  4. Abort the destruction loop altogether if the first attempt times out. Even then, I think 15 seconds is a really long time.

If any of these alternatives sounds suitable, I'll gladly work on a PR for it!

@skuruppu (Contributor) commented Feb 12, 2021

@olavloite if you could take a look when you have a moment, that would be great.

olavloite added a commit that referenced this issue Feb 12, 2021
Closing the session pool when closing a Spanner client should be quick and
should ignore any errors that are caused by an unresponsive backend or by
loss of connectivity. Go does not support fire-and-forget RPC invocations.
Instead this solution uses parallel invocations of the DeleteSession RPC
with short timeouts that are ignored if they occur.

Fixes #3685
@olavloite (Collaborator) commented Feb 12, 2021

@bombsimon Thanks for the detailed report. I agree with you that closing a client (and its session pool) should be quick, even if connectivity has been lost or the backend is non-responsive for any other reason. Fire-and-forget RPC invocations are not really supported, especially not if the gRPC channel is closed while those are still in flight, so I think the best solution is to parallelize the deletion of the sessions and use a relatively short deadline that should normally only be exceeded in extreme cases. This still means that your specific scenario will need 5 seconds to close down, but that should still be a lot more workable than the current 1500 seconds.

@bombsimon (Author) commented Feb 12, 2021

@olavloite Thank you so much, and thanks for creating the PR! 5 seconds sounds much more reasonable! :)

gcf-merge-on-green bot pushed a commit that referenced this issue Feb 16, 2021
Closing the session pool when closing a Spanner client should be quick and should ignore any errors that are caused by an unresponsive backend or by loss of connectivity. Go does not support fire-and-forget RPC invocations. Instead this solution uses parallel invocations of the `DeleteSession` RPC with short timeouts that are ignored if they occur.

Fixes #3685
gcf-merge-on-green bot pushed a commit that referenced this issue Feb 24, 2021