spanner: timeout / context canceled during getting session #7527

Closed
ericwenn opened this issue Mar 9, 2023 · 11 comments
Labels: api: spanner, priority: p1

Comments

@ericwenn

ericwenn commented Mar 9, 2023

Client

Spanner

Environment

Managed Cloud Run

Description
Since upgrading the Spanner library to v1.43.0, we have started seeing intermittent timeout/context canceled errors when getting sessions, with the error message:

timeout / context canceled during getting session.
Enable SessionPoolConfig.TrackSessionHandles if you suspect a session leak to get more information about the checked out sessions.

After this happens once, the Spanner client is unable to get any sessions, which means all requests to our service time out after 10s (our configured request deadline) until the instance is restarted.

[Screenshot: traces from when the issue happens until the instance is manually restarted (~4:20 - 4:30).]

Looking at the traces does not add more information for debugging. The only culprit is cloud.google.com/go/spanner.Query (9974.995 ms) which seems to block everything.

We have not been able to reproduce this issue consistently; it seems to happen randomly every 1-2 days.


When we first noticed this, we rolled back to v1.42.0 and have not seen the issue on that version (running in production for ~1 week).

Looking at recent issues in this repository we tried to bump to the unreleased version based on the fix for this issue, but still saw the issues on that commit.

Judging from the changes between v1.42.0 and v1.43.0, these changes seem to be the culprit, but I'm not sure of that.

@ericwenn ericwenn added the triage me I really want to be triaged. label Mar 9, 2023
@product-auto-label product-auto-label bot added the api: spanner Issues related to the Spanner API. label Mar 9, 2023
@rahul2393
Contributor

rahul2393 commented Mar 9, 2023

Hello @ericwenn
Can you share your sessionConfig values and what kind of load distribution you have (read:write ratio), to help us replicate the issue?

@ericwenn
Author

ericwenn commented Mar 9, 2023

We are using the default sessionConfig (ie spanner.NewClient(ctx, "[name]")).
Load distribution is mainly reads.
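
For readers following along, here is a minimal sketch of how the `SessionPoolConfig.TrackSessionHandles` option named in the error message could be switched on while keeping the other pool defaults. The database path is a placeholder, not something from this thread, and the snippet needs the `cloud.google.com/go/spanner` module and valid credentials to actually connect.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/spanner"
)

func main() {
	ctx := context.Background()

	// Placeholder database path (an assumption); substitute your own.
	const db = "projects/my-project/instances/my-instance/databases/my-db"

	// Start from the library defaults and only flip the tracking flag,
	// so the pool sizes stay the same as with spanner.NewClient.
	poolCfg := spanner.DefaultSessionPoolConfig
	poolCfg.TrackSessionHandles = true

	client, err := spanner.NewClientWithConfig(ctx, db, spanner.ClientConfig{
		SessionPoolConfig: poolCfg,
	})
	if err != nil {
		log.Fatalf("creating Spanner client: %v", err)
	}
	defer client.Close()
}
```

With tracking enabled, the session-timeout error is expected to carry more information about where the checked-out sessions were taken, at the cost of recording that information on every checkout.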

@rahul2393
Contributor

Thanks for the quick answer. Do you have QPS numbers with 1 client?

@ericwenn
Author

ericwenn commented Mar 9, 2023

Avg 1 QPS

@jzelinskie

We've got some reports (mentioned in GitHub above) from folks using v1.42.0

@ericwenn
Author

@rahul2393 any ideas on this?

@rahul2393
Contributor

Hello @ericwenn We are trying to replicate the issue at our end with session config having min/max=1 session, will update here if we find anything.

Feel free to share any code you have to help replicate quickly.

@ericwenn
Author

Thanks for the update. We have unfortunately not been able to replicate this issue consistently.

@rahul2393 rahul2393 added priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. and removed triage me I really want to be triaged. labels Mar 17, 2023
@ericwenn
Author

ericwenn commented Mar 21, 2023

Hello @rahul2393,
After enabling tracking of session handles and waiting for the issue to re-appear we managed to track this down, and likely fix it.

We were opening a ReadOnlyTransaction without closing it after we're done. We are deploying that fix right now, and will keep an eye out.
Everything I said above regarding a specific version being at fault is likely wrong then, since we introduced the broken transaction at around the same time we bumped spanner client.

Off-topic: Do you have any ideas on how to systematically prevent these types of issues? For example, when running tests, not allowing the Spanner client to have open sessions when it is shut down (or similar)?

@rahul2393
Contributor

Nice @ericwenn, I think that's why I was not able to replicate the issue before: I was closing them in my replication code.

Currently we don't have a way; we need to check whether other languages handle this scenario.
@olavloite any suggestions? One solution could be to maintain a client state and defer any operation if it's shutting down.

@rahul2393
Contributor

@ericwenn Closing this ticket since you already found the issue. I will create another ticket for preventing the client from opening sessions when it is shut down.
Feel free to reopen this if you feel this issue needs more investigation.
Thanks
