chore(spanner): add LatencyTracker interface and default implementation#12729

Merged
olavloite merged 2 commits into main from
spanner-latency-tracker
Apr 10, 2026
Conversation

@olavloite

Adds an internal LatencyTracker interface and a default implementation that allows the client to track the latency of requests. This can be used for automatic replica selection and load balancing.

@olavloite olavloite requested review from a team as code owners April 9, 2026 13:46

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new latency tracking mechanism using Exponentially Weighted Moving Average (EWMA), including a LatencyTracker interface, its EwmaLatencyTracker implementation, and comprehensive unit tests. The primary feedback concerns the initial state of the EwmaLatencyTracker, specifically that an uninitialized tracker currently returns a score of 0.0. This is problematic as it implies a perfect score, potentially leading to incorrect load balancing decisions. It is suggested that an uninitialized tracker should instead return Double.POSITIVE_INFINITY to accurately reflect its unmeasured state, and a new test case should be added to verify this behavior.
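For readers unfamiliar with EWMA, the following is a minimal sketch of the idea under discussion, not the class added in this PR. The class name `EwmaSketch`, the `alpha` parameter, and the microsecond unit are assumptions made for illustration. The unsampled sentinel follows the suggestion in the review above; the snippet quoted later in this conversation uses `Double.MAX_VALUE` instead.

```java
import java.time.Duration;

/** Hypothetical illustration of an EWMA latency score; not the PR's implementation. */
final class EwmaSketch {
  private final double alpha; // smoothing factor in (0, 1]; an assumed parameter
  private final Object lock = new Object();
  private double scoreMicros;
  private boolean initialized;

  EwmaSketch(double alpha) {
    if (!(alpha > 0.0 && alpha <= 1.0)) {
      throw new IllegalArgumentException("alpha must be in (0, 1]");
    }
    this.alpha = alpha;
  }

  /** Folds one observed latency into the moving average. */
  void update(Duration latency) {
    double sampleMicros = latency.toNanos() / 1_000.0;
    synchronized (lock) {
      if (!initialized) {
        // The first sample seeds the average directly.
        scoreMicros = sampleMicros;
        initialized = true;
      } else {
        scoreMicros = alpha * sampleMicros + (1.0 - alpha) * scoreMicros;
      }
    }
  }

  /** Returns the current score, or positive infinity if nothing has been observed yet. */
  double getScore() {
    synchronized (lock) {
      return initialized ? scoreMicros : Double.POSITIVE_INFINITY;
    }
  }
}
```

With `alpha = 0.5`, a 100 µs sample followed by a 300 µs sample yields a score of 200 µs. A larger `alpha` makes the score react faster to recent samples at the cost of more noise.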

@olavloite olavloite force-pushed the spanner-latency-tracker branch from bc8842b to 274308d Compare April 9, 2026 14:33
@olavloite olavloite force-pushed the spanner-latency-tracker branch from 274308d to c50bb2e Compare April 9, 2026 14:36

@rahul2393 rahul2393 left a comment

Thanks, looks good to me overall, with some questions still open. I would like to see follow-up PRs that answer them soon.

Open questions: eligibility filtering for stale reads/queries only, score ownership, score updates from successful and failed routed calls, and the actual Po2 (power-of-two-choices) selection logic.

   *
   * @param latencyMillis the observed latency in milliseconds.
   */
  void update(long latencyMillis);

This flattens most “60us vs 500us vs 700us” differences into the same bucket. If we want this to drive bypass selection, the score needs to be in at least micros, and ideally nanos or a Duration.
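To make the resolution concern concrete, here is a small sketch showing how the three latencies from the comment above collapse when expressed as whole milliseconds. The class and method names are hypothetical and exist only for this illustration.

```java
import java.time.Duration;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo of precision lost when sub-millisecond latencies are stored as millis. */
final class LatencyResolutionDemo {
  /** Duration.toMillis() truncates, so anything below 1 ms becomes 0. */
  static long asWholeMillis(Duration latency) {
    return latency.toMillis();
  }

  /** Microsecond resolution keeps the three samples distinguishable. */
  static long asWholeMicros(Duration latency) {
    return TimeUnit.NANOSECONDS.toMicros(latency.toNanos());
  }

  public static void main(String[] args) {
    for (long micros : new long[] {60, 500, 700}) {
      Duration latency = Duration.ofNanos(micros * 1_000);
      System.out.println(
          micros + "us -> " + asWholeMillis(latency) + " ms, " + asWholeMicros(latency) + " us");
    }
  }
}
```

All three samples report 0 ms, while the microsecond view preserves their ordering, which is what a selection algorithm comparing nearby replicas would need.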

Contributor Author

Good point, changed to Duration.

  @Override
  public double getScore() {
    synchronized (lock) {
      return initialized ? score : Double.MAX_VALUE;

@rahul2393 rahul2393 Apr 10, 2026

getScore() returns Double.MAX_VALUE until the tracker has seen traffic. That means a new endpoint, or one recreated after eviction, will always lose against any sampled endpoint that has historical data. In other words, it never gets traffic, so it never learns. We should be doing probing or low-rate exploration so a replica can “come back to the game”; this implementation bakes in starvation unless some separate mechanism guarantees exploration.

Contributor Author

Yeah, that is a good point. We will fix this in a follow-up PR, in combination with the ReplicaSelector, by allowing some of the traffic to just choose a random endpoint.
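A rough sketch of what "some of the traffic chooses a random endpoint" could look like follows. This is not the planned ReplicaSelector change; `ExploringSelectorSketch`, `explorationRate`, and the score function are all hypothetical names chosen for illustration, and the greedy branch simply picks the lowest score rather than implementing Po2.

```java
import java.util.List;
import java.util.Random;
import java.util.function.ToDoubleFunction;

/** Hypothetical selector: mostly picks the lowest score, sometimes picks at random. */
final class ExploringSelectorSketch<E> {
  private final double explorationRate; // fraction of calls routed at random; assumed parameter
  private final Random random;
  private final ToDoubleFunction<E> scoreOf;

  ExploringSelectorSketch(double explorationRate, Random random, ToDoubleFunction<E> scoreOf) {
    if (explorationRate < 0.0 || explorationRate > 1.0) {
      throw new IllegalArgumentException("explorationRate must be in [0, 1]");
    }
    this.explorationRate = explorationRate;
    this.random = random;
    this.scoreOf = scoreOf;
  }

  E select(List<E> endpoints) {
    if (endpoints.isEmpty()) {
      throw new IllegalArgumentException("no endpoints");
    }
    // Exploration: unsampled endpoints (score MAX_VALUE) can still receive traffic here.
    if (random.nextDouble() < explorationRate) {
      return endpoints.get(random.nextInt(endpoints.size()));
    }
    // Exploitation: choose the endpoint with the lowest score.
    E best = endpoints.get(0);
    double bestScore = scoreOf.applyAsDouble(best);
    for (E candidate : endpoints) {
      double candidateScore = scoreOf.applyAsDouble(candidate);
      if (candidateScore < bestScore) {
        best = candidate;
        bestScore = candidateScore;
      }
    }
    return best;
  }
}
```

With a rate of 0 this degenerates to the starvation case described in the review; any positive rate gives a fresh endpoint a chance to record its first sample.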

import com.google.api.core.InternalApi;

/**
* Interface for tracking latency scores of Spanner servers.

nit: The abstraction is attached to the wrong identity unless you are very careful in the follow-up work. The doc wants a score “for a given spanner server”, but this branch introduces a generic tracker with no ownership model.

In the current routing code, CachedTablet instances are reused across cache updates and can change serverAddress in place. If someone later stores the EWMA on CachedTablet, the latency history from the old server will bleed into the new one after a cache update. The stable identity in this codebase is the cached per-address endpoint, not the tablet object.
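One way to picture the ownership model described above is a registry keyed by server address, so that latency history follows the address rather than a reusable tablet object. The sketch below is purely illustrative: `PerAddressRegistrySketch` and its methods are hypothetical and do not correspond to existing classes in the client.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

/** Hypothetical registry that owns one tracker per server address. */
final class PerAddressRegistrySketch<T> {
  private final ConcurrentMap<String, T> byAddress = new ConcurrentHashMap<>();
  private final Supplier<T> trackerFactory;

  PerAddressRegistrySketch(Supplier<T> trackerFactory) {
    this.trackerFactory = trackerFactory;
  }

  /** Returns the tracker for this address, creating it on first use. */
  T forAddress(String serverAddress) {
    return byAddress.computeIfAbsent(serverAddress, address -> trackerFactory.get());
  }

  /** Drops the history for an address, e.g. when its endpoint is evicted. */
  void evict(String serverAddress) {
    byAddress.remove(serverAddress);
  }
}
```

A tablet whose serverAddress changes would then look up a different tracker instead of carrying the old server's history along.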

Contributor Author

Ack

@olavloite olavloite enabled auto-merge (squash) April 10, 2026 11:08
@olavloite olavloite merged commit c29b99f into main Apr 10, 2026
102 of 103 checks passed
@olavloite olavloite deleted the spanner-latency-tracker branch April 10, 2026 11:41