chore(spanner): add LatencyTracker interface and default implementation#12729
chore(spanner): add LatencyTracker interface and default implementation#12729
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new latency tracking mechanism using Exponentially Weighted Moving Average (EWMA), including a LatencyTracker interface, its EwmaLatencyTracker implementation, and comprehensive unit tests. The primary feedback concerns the initial state of the EwmaLatencyTracker, specifically that an uninitialized tracker currently returns a score of 0.0. This is problematic as it implies a perfect score, potentially leading to incorrect load balancing decisions. It is suggested that an uninitialized tracker should instead return Double.POSITIVE_INFINITY to accurately reflect its unmeasured state, and a new test case should be added to verify this behavior.
...r/google-cloud-spanner/src/main/java/com/google/cloud/spanner/spi/v1/EwmaLatencyTracker.java
Show resolved
Hide resolved
...ogle-cloud-spanner/src/test/java/com/google/cloud/spanner/spi/v1/EwmaLatencyTrackerTest.java
Show resolved
Hide resolved
bc8842b to
274308d
Compare
Adds an internal LatencyTracker interface and a default implementation that allows the client to track the latency of requests. This can be used for automatic replica selection and load balancing.
274308d to
c50bb2e
Compare
rahul2393
left a comment
There was a problem hiding this comment.
Thanks, looks good to me overall with some questions open. I would like to see follow-up PRs soon for answers
Open Question: Eligibility filtering for stale read/query only, score ownership, score updates from successful/errorful routed calls, and the actual Po2 selection logic.
| * | ||
| * @param latencyMillis the observed latency in milliseconds. | ||
| */ | ||
| void update(long latencyMillis); |
There was a problem hiding this comment.
This flatten most “60us vs 500us vs 700us” differences into the same bucket. If we want this to drive bypass selection, the score needs to be at least micros, and ideally nanos or Duration.
There was a problem hiding this comment.
Good point, changed to Duration.
| @Override | ||
| public double getScore() { | ||
| synchronized (lock) { | ||
| return initialized ? score : Double.MAX_VALUE; |
There was a problem hiding this comment.
getScore() returns Double.MAX_VALUE until the tracker has seen traffic. That means a new endpoint, or one recreated after eviction, will always lose against any sampled endpoint that has historical data. In other words, it never gets traffic, so it never learns. We should be probing / low-rate exploration so a replica can “come back to the game”; this implementation bakes in starvation unless some separate mechanism guarantees exploration.
There was a problem hiding this comment.
Yeah, that is a good point. We will fix this in a follow-up PR in combination in the ReplicaSelector by allowing some of the traffic to just choose a random endpoint.
| import com.google.api.core.InternalApi; | ||
|
|
||
| /** | ||
| * Interface for tracking latency scores of Spanner servers. |
There was a problem hiding this comment.
nit: The abstraction is attached to the wrong identity unless you are very careful in the follow-up work. The doc wants a score “for a given spanner server”, but this branch introduces a generic tracker with no ownership model.
In the current routing code, CachedTablet instances are reused across cache updates and can change serverAddress in place. If someone later stores the EWMA on CachedTablet, the latency history from the old server will bleed into the new one after a cache update. The stable identity in this codebase is the per-address endpoint cached, not the tablet object.
Adds an internal LatencyTracker interface and a default implementation that allows the client to track the latency of requests. This can be used for automatic replica selection and load balancing.