
spanner: Root cause of "Error: Unable to release unknown resource" #716

Closed
skuruppu opened this issue Oct 30, 2019 · 24 comments
Labels
api: spanner Issues related to the googleapis/nodejs-spanner API. needs more info This issue needs more information from the customer to proceed. priority: p2 Moderately-important priority. Fix may not be included in next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@skuruppu
Contributor

From Getting “Error: Unable to release unknown resource.\n at SessionPool.release”:

I am using GCP Spanner as my project's RDBMS. Lately I am getting a lot of errors like the one mentioned above in the logs. Can someone help me figure out what is going wrong here?

I tried going through the Google docs but found very little on the above error. Is it related to session pool configuration? How can this be resolved?

Note: I have only one Spanner database, running on one instance on GCP.

Any help is deeply appreciated :)

Thanks

Error: Got unhandled rejection {"stack":"Error: Unable to release unknown resource.\n    at SessionPool.release (/testProject/node_modules/@google-cloud/spanner/build/src/session-pool.js:241:19)\n    at <anonymous>\n    at process._tickCallback (internal/process/next_tick.js:189:7)","message":"Unable to release unknown resource.","resource":{}}
at process.on (/testProject/index.js:22:17)
@callmehiphop callmehiphop added the needs more info This issue needs more information from the customer to proceed. label Oct 30, 2019
@yoshi-automation yoshi-automation added the triage me I really want to be triaged. label Oct 30, 2019
@AVaksman
Contributor

AVaksman commented Oct 30, 2019

Error origin analysis
The error is thrown when the library tries to release a nonexistent (previously deleted) session from the session-pool._inventory.borrowed collection back into the pool after the transaction has finished or errored.
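For illustration only, a heavily simplified sketch of the kind of bookkeeping check that produces this message (this is not the library's actual implementation):

class SimplifiedSessionPool {
  constructor() {
    this.borrowed = new Set(); // sessions currently checked out
    this.available = [];       // sessions ready to be handed out
  }
  release(session) {
    // Releasing a session the pool no longer tracks (e.g. one already deleted
    // from the borrowed set) is what surfaces as this error.
    if (!this.borrowed.has(session)) {
      throw new Error('Unable to release unknown resource.');
    }
    this.borrowed.delete(session);
    this.available.push(session);
  }
}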

One possible scenario that could cause the error:

  1. SessionPoolOptions.idlesAfter or SessionPoolOptions.keepAlive is set to a value greater than 1 hour (a sketch of such a configuration follows below).
  2. The library will then neither _evictIdleSessions nor _ping such a session during the _startHouseKeeping call.
  3. However, such a session will be auto-deleted from the .borrowed collection during the
    getRead/WriteSession => _acquire => _isValidSession() validation call, since the validation criterion is hard-coded to 1 hour.
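For illustration, a minimal sketch of the configuration described in step 1, assuming hypothetical instance/database names (idlesAfter and keepAlive are expressed in minutes):

const {Spanner} = require('@google-cloud/spanner');

const spanner = new Spanner();
const database = spanner.instance('my-instance').database('my-database', {
  idlesAfter: 70, // > 60 minutes: housekeeping would not evict this session
  keepAlive: 70,  // > 60 minutes: housekeeping would not ping it either
});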

@mf2199

mf2199 commented Oct 31, 2019

@codemart786 You could post the requested information directly here.

@codemart786

codemart786 commented Oct 31, 2019

@mf2199 I am using nodejs-spanner version 4.0.2. Is there any other info you need?

@AVaksman this error actually occurs a lot in the prod environment (not exactly sure under what circumstances). I am not able to reproduce it in my test environment. I am not using any session pool configuration in the code; I have left everything on defaults.

  1. Could you please let me know how this error can be reproduced?
  2. What code changes (if any) do I need to make on my side to resolve/prevent this error?

Thanks :)

@AVaksman
Contributor

@codemart786

Would you kindly provide some more details on your environment?

Environment details

  • OS:
  • Node.js version:
  • npm version:
  • @google-cloud/spanner version:

More specific details (if code snippet is not available)

  • SessionPoolOptions values used
  • Type of operations performed
  • Type of transactions used: Snapshot (read-only), Transaction (read-write)

Thanks

@codemart786

@AVaksman

Environment Details

  • OS: ubuntu
  • Node.js version: 8.16.2
  • npm version: 6.4.1
  • @google-cloud/spanner version: 3.1.0

PS: By mistake I earlier reported the Spanner version as 4.0.2. We are not using spanner@4.0.2 anywhere in prod. My bad.

More specific details

  • SessionPoolOptions values used: As stated earlier, I have left all session-pool-related configuration at the defaults; nothing related to session pool config is specified anywhere in the code.

  • Type of operations performed: All types of operations are performed on the tables: read, write, update, delete, and query. Update operations use DML, Partitioned DML, as well as mutations.

  • Type of transactions used: Both read-write and read-only transactions are used in the code. These are distributed over multiple tables, but in the same database and the same Spanner instance.
    All transactions use the commit() function, not the end() function, to commit the data.

Thanks

@bcoe bcoe added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. type: question Request for information or clarification. Not an issue. priority: p2 Moderately-important priority. Fix may not be included in next release. and removed type: question Request for information or clarification. Not an issue. labels Oct 31, 2019
@yoshi-automation yoshi-automation removed the triage me I really want to be triaged. label Oct 31, 2019
@codemart786

@AVaksman @mf2199 any updates here?

There is one additional question I need to ask here: will updating @google-cloud/spanner to the latest version (4.2.0) help resolve this? Maybe it has already been fixed in the latest version.
I am asking because in my local and test environments I have the latest @google-cloud/spanner module and have not encountered any such errors there (though those environments have far less traffic than prod). This error is very frequent in prod.

Thanks

@AVaksman
Contributor

AVaksman commented Nov 4, 2019

@codemart786 Yes, please try updating to the latest version and verify if you are still experiencing the issue.

Thanks

@codemart786

codemart786 commented Nov 11, 2019

@AVaksman @mf2199

I am experiencing the same issue after moving to the latest version as well. We are now running @google-cloud/spanner version 4.2.0, with all other environment variables the same as before, but we are still seeing the same errors.

Thanks

@skuruppu
Contributor Author

I also had a quick look through the logic. As @AVaksman mentioned earlier, you run into this issue if the borrowed session was deleted before you attempted to release it, because:

  1. It expired because the keep-alive time is > 1 hour.
  2. The session.keepAlive() call failed during _ping.
  3. You released a session and then accidentally attempted to release it again.

Without knowing much about your code, it's hard to know which of these situations you're running into.

We can probably rule out (1) since you said that you're using default options. By default we call keepAlive every 50 mins.

We should be pinging idle sessions periodically as part of the housekeeping logic so keepAlive() would be called. It's possible that some of these calls fail for some reason. Can you check your logs or stats to see if this is happening?

For (3) I can't say without seeing your code.

@codemart786

codemart786 commented Dec 7, 2019

@AVaksman @mf2199 @callmehiphop @skuruppu

There are a few updates on this issue that I would like to note here:

There was a bug in our code related to read-write transactions: we were not calling transaction.end() in places where we decided to forgo commit/rollback.
This was mentioned here. We have fixed that, and now call transaction.end() everywhere
except after a successful commit, since on a successful commit end() is called to release the session back to the pool (a sketch of this pattern follows below).
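For reference, a sketch of the pattern we now follow (written inside an async function; names, query, and the shouldSkipWrite() helper are placeholders, not our actual production code):

const [transaction] = await database.getTransaction();
try {
  const [rows] = await transaction.run('SELECT 1');
  if (shouldSkipWrite(rows)) {  // hypothetical helper: we decided to forgo the commit
    transaction.end();          // end() releases the session back to the pool
    return;
  }
  await transaction.commit();   // on success, commit() takes care of releasing the session
} catch (err) {
  await transaction.rollback();
  transaction.end();            // assumption: calling end() after rollback is safe
  throw err;
}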

But after doing this too, we are still seeing the above-mentioned error logs.

Having said that, there is another issue I would like to mention: sometimes the number of Spanner database sessions just shoots up. This happens at times of peak traffic.
I think there is some issue with the default session pool configuration, due to which the Spanner Node.js client waits indefinitely to acquire a session. Because of this wait, our microservice APIs time out after 10s.

So we decided to tweak the session pool configuration a bit. Currently we are using { max: 150000, writes: 0.3, concurrency: 75000 } as the session pool options while creating the database client in our production environment (a sketch of how these are passed follows below).
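A sketch of how we pass these options when creating the database client (instance and database names are placeholders):

const {Spanner} = require('@google-cloud/spanner');

const spanner = new Spanner();
const database = spanner
  .instance('prod-instance')
  .database('prod-db', {max: 150000, writes: 0.3, concurrency: 75000});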

  • We also increased the spanner database nodes from 1 to 2.

This configuration stabilised our application for a certain period, but at very high load it also breaks and the APIs start timing out again.

One more observation: when the Docker container shuts down, during the grace period our process tries to close the database connection by calling
database.close(). There we also receive an error:

Error: 1 session leak(s) detected.

Can someone here help explain this behaviour? I am not sure whether all of this is due to something we are missing or doing wrong in our code. What is going wrong here?
Why are we seeing so many open DB sessions? How do we mitigate this issue?

PS:

The types of operations we perform with Spanner are already mentioned above; no new type of operation has been added.
For an idea of traffic, the scale at which we operate is not very high. Normally it is 5k-6k requests per minute; at peak times it reaches 5x-6x that, i.e. 30k to 40k requests per minute. This is the average traffic received by our service. Some APIs receive extra load, some very little.
But from the graph, DB sessions reach up to 400k at times. I have attached the chart for reference.

Any help in this regard is deeply appreciated . :)

[Screenshot 2019-12-07 at 10:45:08 PM: session count chart]

@callmehiphop
Contributor

@codemart786 the max limit only applies to a single Database instance; that configuration is not shared across Database instances even if they reference the same database. Since your max is 150k, if you have multiple services using different Database instances and getting a decent amount of traffic on each, you could definitely hit 400k sessions in a short amount of time. If you only have a single cached instance, that sounds like a pretty serious bug, so please let us know!
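As an illustration, a minimal sketch of caching a single Database client per process so that all callers share one session pool (module layout and names are hypothetical):

// db.js
const {Spanner} = require('@google-cloud/spanner');

let database;

function getDatabase() {
  if (!database) {
    database = new Spanner()
      .instance('my-instance')
      .database('my-database'); // one Database instance => one session pool per process
  }
  return database;
}

module.exports = {getDatabase};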

For the session leak error, you can actually inspect the stack trace for each leak found.

try {
  await database.close();
} catch (e) {
  if (e.name === 'SessionLeakError') {
    e.messages.forEach(stackTrace => console.log(stackTrace));
  }
}

@codemart786

codemart786 commented Dec 8, 2019

@callmehiphop Thanks for the update here, it definitely helps.

No, we are creating a different instance from each service. For each Docker container there is a singleton pattern in our code: one Database instance per container is created while the server comes up. Having said that, this creates roughly 60 instances at a time against the database.

I would also like to mention some details we noticed in our application. The CreateSession and DeleteSessions API latency shoots up whenever the session count increases. At the same time, our Spanner instance's CPU health looks perfectly fine (less than 20% CPU utilisation). We cannot find any reason for this behaviour.

I will try out the stack trace option on database close and let you know my findings.

One thing that is bugging me is that the session count is abnormally high for our application. Our application goes down because of this at peak times. Are we doing something wrong, or missing something crucial here?

Based on your past experience, are there some obvious scenarios that could lead to Spanner session leaks, a very high session count, or very high CreateSession/DeleteSession API latencies?

PS: I don't think we receive the kind of scale for which so many sessions are required. Sometimes the session count increases to 1.5 million. We currently have 2 Spanner nodes; we increased from 1, hoping this might help tackle the API latency issue, but it was of no help.

Any help is deeply appreciated :)

Thanks

Below is the metric chart for Spanner API latencies. In some cases the latencies reach 10 minutes; in the chart shown, the latency is around 4.5 minutes.

Light Blue Line: DeleteSession API request
Violet Line: CreateSession API request

[Screenshot 2019-12-08 at 1:47:30 PM: Spanner API latency chart]

@codemart786

codemart786 commented Dec 9, 2019

@AVaksman @skuruppu @callmehiphop @mf2199

Hey guys, a few things I want to ask here related to this issue.

We did load testing in our staging environment and saw the session count increase from 3k to 100k within 30 minutes.
We have an API which uses a Spanner locking read-write transaction. We are also calling the transaction.end() function. I believe this is a session-leak problem, but I can't understand why. We used an RPS of 4000 for 30 minutes on one of the service APIs.

Having said that, here are some of my doubts which I would like to clarify:

  1. Do I need to call transaction.end() even after a successful commit of the transaction? Right now I am not doing that; I call transaction.end() only if the commit operation throws an error or if we decide to forgo the commit for some reason. I believe that on a successful commit, transaction.end() is called by the Spanner module itself to release the session back to the pool. (Please correct me if I am wrong here.)

  2. We added the code snippet suggested by @callmehiphop on database.close(), before the pod shuts down. The error we see in the logs is given below:


jsonPayload: {
  message: " got error message trace"   
  metadata: {
   err: "Session leak detected!
    at _acquire (/testProject/node_modules/@google-cloud/spanner/build/src/session-pool.js:278:44)
    at null (null:null:null)"    
   service: "testProject-abc"
  }
 }

Please help me understand when the @google-cloud/spanner Node module produces this stack trace. If I knew in which direction to proceed, it would be very helpful.

PS: Our application has suffered several production outages, which I believe are mostly due to this issue. Please help me solve it. Any help is deeply appreciated :)

Thanks

@skuruppu
Contributor Author

We're looking into your issue based on our offline conversation. But to answer some of your questions:

@AVaksman @skuruppu @callmehiphop @mf2199

Hey guys, a few things I want to ask here related to this issue.

We did load testing in our staging environment and saw the session count increase from 3k to 100k within 30 minutes.
We have an API which uses a Spanner locking read-write transaction. We are also calling the transaction.end() function. I believe this is a session-leak problem, but I can't understand why. We used an RPS of 4000 for 30 minutes on one of the service APIs.

Having said that, here are some of my doubts which I would like to clarify:

  1. Do I need to call transaction.end() even after a successful commit of the transaction? Right now I am not doing that; I call transaction.end() only if the commit operation throws an error or if we decide to forgo the commit for some reason. I believe that on a successful commit, transaction.end() is called by the Spanner module itself to release the session back to the pool. (Please correct me if I am wrong here.)

No, you don't have to call transaction.end() if you've already called transaction.commit().

  2. We added the code snippet suggested by @callmehiphop on database.close(), before the pod shuts down. The error we see in the logs is given below:

jsonPayload: {
  message: " got error message trace"   
  metadata: {
   err: "Session leak detected!
    at _acquire (/testProject/node_modules/@google-cloud/spanner/build/src/session-pool.js:278:44)
    at null (null:null:null)"    
   service: "testProject-abc"
  }
 }

Please help me understand when the @google-cloud/spanner Node module produces this stack trace. If I knew in which direction to proceed, it would be very helpful.

Interesting, it seems that the stack trace is incomplete. _acquire() is what would've been called to get the session but I would've expected to see the stack frames that cause _acquire() to be called. I'll need to do a bit more digging to figure out how to get a better stack trace here.

PS: Our application has suffered several production outages, which I believe are mostly due to this issue. Please help me solve it. Any help is deeply appreciated :)

Thanks

@hengfengli
Contributor

hengfengli commented Dec 13, 2019

@codemart786

"We did load testing in our staging environment, and have seen session count increasing from 3k to 100k within 30 mins."

What are the pool settings in your staging environment? How many instances are you running for your load test?

The upper-bound number of sessions you will see in the GCP console = max in the pool setting * the number of instances.

For example, you would see the total session count reach 1.5 million if you set max to 150k for each instance and there are 10 running instances. With 1.5 million sessions on only 2 Spanner nodes, I would guess the latency becomes much higher when you keep trying to create or delete sessions.
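As a quick sanity check of that arithmetic (illustrative values only):

const maxPerPool = 150000;    // `max` session pool option per Database instance
const clientInstances = 10;   // e.g. 10 pods, each with its own client
console.log(maxPerPool * clientInstances); // 1500000 — the 1.5M sessions observed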

max should not be that large.

Can you try lowering the max number in your pool settings? Then run the load test to see if it still has high latencies on CreateSession or DeleteSessions.

@olavloite
Contributor

olavloite commented Dec 16, 2019

Interesting, it seems that the stack trace is incomplete. _acquire() is what would've been called to get the session but I would've expected to see the stack frames that cause _acquire() to be called. I'll need to do a bit more digging to figure out how to get a better stack trace here.

The reason for the above is the problem described in this document. The TL;DR version is that 'developers are currently facing the problem that the (non-standard) Error.stack property in V8 only provides a truncated stack trace up to the most recent await.' The _acquire method contains an await statement here:

const session = await this._acquires.add(getSession);

This means that the stack traces will currently not contain the call stack of the application. See also nodejs/node#11865.
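As a small, self-contained illustration of the truncation (assuming a Node version without --async-stack-traces enabled, which was the case on Node 8):

async function inner() {
  await Promise.resolve();                  // the stack is captured after this await...
  throw new Error('caller is missing from the stack');
}

async function outer() {
  await inner();
}

outer().catch(err => console.log(err.stack)); // ...so outer() typically does not appear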

@codemart786

Hi @hengfengli

We did load testing in the staging environment with max set to 150K. There were about 20 pods running, at a peak load of 4K concurrent requests.
We also tried setting max to 500 in our staging environment. I understand this stops more sessions from being created, since we lowered max. But beyond a point our application API latency shot up and all API requests were timing out.

Having said that, is there any possibility that the Node.js Spanner client is not reusing sessions?
This has become a high-priority task for us: because of the abnormal session count behaviour, the Spanner CreateSession and DeleteSessions API latencies reach close to 10 minutes, and all our APIs interacting with Spanner go down. Could any of you please share, from your past experience, some steps that could help in debugging this problem?

Thanks @olavloite for sharing the doc. Will give it a read.

Thanks

@hengfengli
Contributor

@codemart786

Having said that, is there any possibility that the Node.js Spanner client is not reusing sessions?

Yes. If transaction.end() or transaction.commit()/.rollback() is not called, it certainly causes the session to leak. Also, transaction.end() needs to be called for read-only transactions (a minimal sketch follows below).
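A minimal sketch of that read-only pattern (written inside an async function; the query and names are placeholders):

const [snapshot] = await database.getSnapshot();
try {
  const [rows] = await snapshot.run('SELECT 1');
  // ... use rows ...
} finally {
  // Read-only transactions have no commit(), so end() is what returns the session to the pool.
  snapshot.end();
}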

Originally, I thought it was an issue in the session pool (#750), but it turned out that my tests were not fair.

I still believe that sessions are leaked somewhere. Another guess is that some transactions have very high latencies, which hold sessions for a long time, so that other requests do not get a chance to grab an idle session.

the Spanner CreateSession and DeleteSessions API latencies reach close to 10 minutes

This is because you set a very high max. I can understand that you are trying to solve the problem by adding more sessions, but this certainly causes these two APIs to have high latencies.

What I can think of is to check and observe the status of the local session pool, like what I did in #750.

I will continue to do some load tests to try to find out what the root cause is.

@skuruppu
Contributor Author

skuruppu commented Dec 17, 2019

Thanks @olavloite for fixing #755. We published version v4.4.1 which should contain the stack trace printing fix.

@olavloite
Contributor

Hi @codemart786

If I understand you correctly, you are (sort of) able to reproduce this problem on your staging environment, right? If so, would it be possible for you to try the following:

  1. Make a build of your application that uses the current master branch of this client library. master now contains feat: include potential leaked session stacktraces in error #759 which will include the stacktraces of potentially leaked sessions in the error that is thrown when the session pool is exhausted.
  2. Deploy this build on your test/staging environment with the following session pool settings:
    1. max=100
    2. fail=true (This setting will cause the session pool to throw an error instead of waiting indefinitely when the maximum number of sessions in the pool has been reached. You should therefore be careful with enabling this setting on your production environment, as these errors could bubble up to the users.)
  3. This will hopefully still trigger the problem that you are running into, but instead of the application just hanging, it will start throwing errors that include valuable debugging information.

If you could then inspect the stack traces (messages) that are included in the SessionPoolExhaustedError, it should be possible to determine whether the problem is caused by a session leak or not, and if it is, which part of the code is causing it.
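For illustration, a sketch of that debugging setup (written inside an async function; instance/database names are placeholders, and the SessionPoolExhaustedError name and messages property are assumptions based on the description above):

const database = spanner.instance('staging-instance').database('staging-db', {
  max: 100,
  fail: true, // throw instead of waiting when the pool is exhausted
});

try {
  await database.run('SELECT 1');
} catch (e) {
  if (e.name === 'SessionPoolExhaustedError') {
    // Each entry should be a stack trace of a session that may have leaked.
    e.messages.forEach(stackTrace => console.log(stackTrace));
  }
  throw e;
}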

@codemart786

Hi @olavloite @hengfengli

Thanks for the suggestion. I will give this a try and test it in the staging environment. Will keep you guys updated.

Thanks

@skuruppu
Contributor Author

skuruppu commented Jan 9, 2020

@codemart786 let us know if the issues are fixed. We should close this bug if there are no further related topics to discuss.

For any new issues, please open a separate issue.

@codemart786

@skuruppu @olavloite @hengfengli

Hey guys,

Yes, we are good to close this issue. Thanks everyone for the suggestions. We made many changes in response to the abnormal session pool increase, and as of now the system is stable.
The strategy was to identify the bottlenecks in our system via load testing and optimise them. Thanks to Maor (from the GCP Spanner team), we optimised the table indexes wherever possible and reduced max to 500 in the session pool config (as per @hengfengli's and @olavloite's suggestions). We made a lot of other changes in our system which helped mitigate the issue, the details of which were shared with you by email.

@skuruppu
Contributor Author

skuruppu commented Jan 9, 2020

Perfect, thanks for the update @codemart786. Glad to hear that the system is stable now. Please feel free to open issues in this repo if you come across any further problems or questions.

@skuruppu skuruppu closed this as completed Jan 9, 2020
@google-cloud-label-sync google-cloud-label-sync bot added the api: spanner Issues related to the googleapis/nodejs-spanner API. label Jan 31, 2020