Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add metrics to capture acquired and released sessions data #67

Merged
merged 7 commits into from Feb 27, 2020

Conversation

mayurkale22
Copy link
Member

@mayurkale22 mayurkale22 commented Feb 10, 2020

Updates #53 and continuation of #54 and #65.

New Metrics:
cloud.google.com/java/spanner/num_acquired_sessions => The number of sessions acquired from the session pool.

cloud.google.com/java/spanner/num_released_sessions => The number of sessions released (or destroyed) by the user and pool maintainer.

Another option, instead of having two metrics what if we just include one metric cloud.google.com/java/spanner/diff_acquired_released_sessions to indicate diff/gaps between acquired and released sessions. I think stand alone they both (num_acquired_sessions and num_released_sessions) are adding very less value unless we combine them to see the session leaks or other anomalies.

@mayurkale22 mayurkale22 requested a review from skuruppu Feb 10, 2020
@googlebot googlebot added the cla: yes label Feb 10, 2020
@mayurkale22 mayurkale22 requested review from rghetia and olavloite Feb 10, 2020
@@ -1346,9 +1359,11 @@ PooledSession getReadSession() throws SpannerException {
readWaiters.add(waiter);
} else {
span.addAnnotation("Acquired read write session");
numSessionsAcquired++;
Copy link
Member Author

@mayurkale22 mayurkale22 Feb 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olavloite I added these counters based on my understanding, let me know if there is a better way to get these (acquired and released sessions) values.

Copy link
Contributor

@olavloite olavloite Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no variable currently keeping track of this, but there is one that comes very close: numSessionsInUse. If you really want to keep track of the number of sessions acquired and the number of sessions released back into the pool, then you should add the increase statements of these variables at the exact same places as where numSessionsInUse is increased and decreased.

The maintainer never checks out a session from the pool in the same way as that user code can do. Checking for session leaks from the maintainer is therefore not possible by using any of these values. Other parts of the client library never checks out a session from the pool independently from any user code. That means that there is no way to measure whether a session leak is caused by a bug in user code or a bug in the client library. To the session pool, they will always look the same.

@skuruppu
Copy link
Contributor

@skuruppu skuruppu commented Feb 13, 2020

Another option, instead of having two metrics what if we just include one metric cloud.google.com/java/spanner/diff_acquired_released_sessions to indicate diff/gaps between acquired and released sessions.

I agree that the difference is the most useful. So could we call it something like cloud.google.com/java/spanner/unreleased_sessions?

I would also be interested in distinguishing between unreleased sessions held by the pool maintainer vs the user. Just because I want to distinguish between a leak in the client library and a leak by the user.

How would a user distinguish between this metric and in_use_sessions though? Logically, I would think of unreleased sessions as being the sessions that I'm using right now.

@mayurkale22
Copy link
Member Author

@mayurkale22 mayurkale22 commented Feb 13, 2020

I would also be interested in distinguishing between unreleased sessions held by the pool maintainer vs the user. Just because I want to distinguish between a leak in the client library and a leak by the user.

SGTM. What is the best way to understand the owner of the session (maintainer/user). I imagine it would be little complicated than keeping variable to track the acquired/released count.

How would a user distinguish between this metric and in_use_sessions though? Logically, I would think of unreleased sessions as being the sessions that I'm using right now.

I think you're right, users might get confused with these two metrics. Seems like unreleased_sessions are the sum of in_use_sessions and idle/leak sessions and it will be almost equal to (or greater than) in_use_sessions all the time. If the difference between the two is steadily increasing, it is certainly an indication that there is a session leak.
I think unreleased_sessions - in_use_sessions == count of session leaks. Please correct me, this is based on my limited knowledge on session pool.

@skuruppu
Copy link
Contributor

@skuruppu skuruppu commented Feb 14, 2020

SGTM. What is the best way to understand the owner of the session (maintainer/user). I imagine it would be little complicated than keeping variable to track the acquired/released count.

Yeah it would be more complicated. You would have to go through each of the methods that check out a session an increment a maintainer_checked_out counter or a user_checked_out counter based on whether the pool or the user checks out the session. I'm not too familiar with the code to know how difficult this would be.

I think unreleased_sessions - in_use_sessions == count of session leaks. Please correct me, this is based on my limited knowledge on session pool.

Yup that's my understanding as well.

@mayurkale22
Copy link
Member Author

@mayurkale22 mayurkale22 commented Feb 17, 2020

@olavloite I would like to get your reviews on this.

@@ -1628,7 +1628,7 @@ public Void call() {
assertThat(record.getMetrics())
.containsEntry(MetricRegistryConstants.NUM_ACQUIRED_SESSIONS, 0L);
assertThat(record.getMetrics())
.containsEntry(MetricRegistryConstants.NUM_RELEASED_SESSIONS, 3L);
.containsEntry(MetricRegistryConstants.NUM_RELEASED_SESSIONS, 5L);
Copy link
Member Author

@mayurkale22 mayurkale22 Feb 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skuruppu thanks for the suggestions. I made the changes, but somehow test results look fishy to me. After change, NUM_RELEASED_SESSIONS == 5 and NUM_ACQUIRED_SESSIONS == 0.

Copy link
Contributor

@skuruppu skuruppu Feb 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was partially wrong. It turns out that releaseSession() is only used internally. So you can get calls to this function due to the session pool doing keep alive requests (e.g. check out session to send a "select 1" then release it back to the pool). I think this is why you're seeing the number of released sessions being higher than you expected.

I also just realized that getReadSession() docs say to call Session.close() when you're done with the sessions. It seems that sessions acquired by the user will get closed and sessions acquired by the client library gets released.

@olavloite did I understand this correctly?

Copy link
Contributor

@skuruppu skuruppu Feb 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why NUM_ACQUIRED_SESSIONS is zero though. I would've thought calling getReadSession() would increment this.

Copy link
Contributor

@olavloite olavloite Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also just realized that getReadSession() docs say to call Session.close() when you're done with the sessions. It seems that sessions acquired by the user will get closed and sessions acquired by the client library gets released.

That is correct(ish), but the naming of the Session.close() method is a little bit confusing in this case. This is roughly what is going on in the session pool regarding checking out, closing, releasing etc.:

  1. User code will never see a session. Instead, user code sees different types of transactions (read/write, read-only, single use) and ResultSets. A session is automatically checked out from the session pool by the client library when one is needed for a transaction requested by user code. Checking out a session is always one of the methods getReadSession or getReadWriteSession. A session leak can be caused by user code if it fails to end a transaction or to close a ResultSet.
  2. The getReadSession and getReadWriteSession methods return a PooledSession instance. The close method of this class will check whether the session is still considered valid, and if so, release it into the pool (releaseSession()) or delete it (invalidateSession()).
  3. So technically, only the client library checks sessions out and releases them back into the session pool. User code never does (directly).
  4. In addition to the above, there is also the PoolMaintainer: The PoolMaintainer never really checks a session out from the pool, although the keepAlive method does do something that comes very close: It manually removes a session from the inventory of the pool, executes a ping request and then calls releaseSession() to add it back to the inventory. This is not tracked by the session pool as a session that is checked out and back in again, and should also be treated separately from all the other measurements for session leaks.

Regarding the current test results with NUM_RELEASED_SESSIONS == 5 and NUM_ACQUIRED_SESSIONS == 0. This is because the NUM_ACQUIRED_SESSIONS is only increased if a session could be acquired without causing a wait, and the 2 sessions that are being checked out, are being checked out directly after the session pool has been created. The pool therefore has not had the time to create any sessions yet, which will place both requests in a (very short) wait before they can be served.

NUM_RELEASED_SESSIONS is 5 because it is counting all calls to releaseSession. That method is also called when a session is created and released into the pool. You should only count the number of calls to PooledSession#close() (see also my comment above).

Copy link
Member Author

@mayurkale22 mayurkale22 Feb 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@olavloite would like to get your eyes on this.

@mayurkale22 mayurkale22 force-pushed the metrics_3 branch 3 times, most recently from d6c99a3 to a268a3b Compare Feb 25, 2020
@mayurkale22
Copy link
Member Author

@mayurkale22 mayurkale22 commented Feb 25, 2020

@skuruppu I updated the code (341feee) based on #67 (comment). The released and acquired count looks good to me now, PTAL. Thanks

Copy link
Contributor

@skuruppu skuruppu left a comment

Thanks for fixing the implementation based on @olavloite's feedback.

Consider removing the paragraph in the commit message for this PR regarding the diff metric since we're not implementing that here.

@mayurkale22
Copy link
Member Author

@mayurkale22 mayurkale22 commented Feb 26, 2020

Consider removing the paragraph in the commit message for this PR regarding the diff metric since we're not implementing that here.

Done.

@mayurkale22
Copy link
Member Author

@mayurkale22 mayurkale22 commented Feb 26, 2020

@olavloite Please review and approve when you get a chance. Thanks

@mayurkale22
Copy link
Member Author

@mayurkale22 mayurkale22 commented Feb 27, 2020

These metrics have all been tested locally against a real spanner instance and this is what I see:
acquired-released

Copy link
Contributor

@olavloite olavloite left a comment

The code looks good to me. The build errors are not related to this PR.
I assume that the screenshot of the test run was made before the last minor fix was applied.

@mayurkale22
Copy link
Member Author

@mayurkale22 mayurkale22 commented Feb 27, 2020

I assume that the screenshot of the test run was made before the last minor fix was applied.

Yes, I have changed the screenshot based on current implementations. Thanks for noticing it.

@mayurkale22 mayurkale22 merged commit 94d0557 into googleapis:master Feb 27, 2020
12 checks passed
@mayurkale22 mayurkale22 deleted the metrics_3 branch Feb 27, 2020
gopherbot pushed a commit to googleapis/google-cloud-go that referenced this issue Mar 31, 2020
This change addes five metrics for session management:

* in_use_sessions: the current number of sessions that are checked out
from the session pool.
* max_in_use_sessions: the maximum number of in_use_sessions in last 10
minutes. This is based on tumbling windows, instead of sliding windows.
* max_allowed_sessions: the maximum number of sessions that is
configured by the user.
* get_sessions_timeout: the cumulative number of get sessions timeouts
when pool exhaustion happens.
* num_acquired_sessions: the cumulative number of sessions that are
checked out from the session pool.
* num_released_sessions: the cumulative number of sessions that are
released back to the session pool.

All metrics are tagged by:

* client_id: each database has its own increasing ID sequence. For
two different databases, their client IDs all start from "client-1".
* database: the database ID.
* instance_id: the instance ID.
* library_version: the library version from
google-cloud-go/internal/version which is a date in YYYYMMDD format.

Notes:

There are three ways to check out a session from the pool:

1) take(): get a read session called by a user
2) takeWriteSession(): get a read/write session by a user
3) getNextForTx() in healthchecker's worker (session.go:1336):
healthchecker's workers convert read sessions to r/w sessions. E.g.,
when initializing the pool, workers check out 20 read sessions from
the pool, turn them into r/w sessions, and put them back to the pool
, assume that MinOpended=100 and WriteSessions=0.2.

So some metrics are also emitted by case 3. This might confuse users if
they are not aware of this behaviour.

Related Java PRs:

* googleapis/java-spanner#54
* googleapis/java-spanner#65
* googleapis/java-spanner#67

Change-Id: Ie163b08ef18ac2a47e1669fefab92d61fe8f2a82
Reviewed-on: https://code-review.googlesource.com/c/gocloud/+/52953
Reviewed-by: kokoro <noreply+kokoro@google.com>
Reviewed-by: Knut Olav Løite <koloite@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants