doc: rfc for configurable kv timeout #45093

Merged Jul 10, 2023 · 7 commits (changes shown from 3 commits)

230 additions in `docs/design/2023-06-30-configurable-kv-timeout.md`:
# Proposal: Configurable KV Timeout

* Authors: [cfzjywxk](https://github.com/cfzjywxk)
* Tracking issue: TODO
> **Review comment (Contributor):** We can file a tracking issue now.

## Motivation

TiKV read requests are typically processed quickly, taking only a few milliseconds. However, if
there is disk IO or network latency jitter on a particular TiKV node, the duration of these requests may spike
to several seconds or more. This usually happens when there is an EBS issue, such as EBS repairing one of its
replicas.

Currently, the timeout limit for the TiKV requests is fixed and cannot be adjusted. For example:
- `ReadTimeoutShort`: [30 seconds](https://github.com/tikv/client-go/blob/ce9203ef66e99dcc8b14f68777c520830ba99099/internal/client/client.go#L82), which is used for the Get RPC request.
- `ReadTimeoutMedium`: [60 seconds](https://github.com/tikv/client-go/blob/ce9203ef66e99dcc8b14f68777c520830ba99099/internal/client/client.go#L83), which is used for the BatchGet, Cop RPC request.

The kv-client or TiDB may wait for responses from TiKV nodes until the timeout is reached; it cannot
retry or redirect the requests to other nodes quickly, as the timeout configuration is usually tens of
seconds. As a result, the application may experience a query latency spike when TiKV node jitter is
encountered.

A possible improvement suggested by [#44771](https://github.com/pingcap/tidb/issues/44771) is to make the
timeout values of specific KV requests configurable. For example:
- Adding a session variable `tidb_kv_read_timeout`, which is used to control the timeout for a single
TiKV read RPC request. When the user sets the value of this variable, all read RPC request timeouts will use this value.
The default value of this variable is 0, and the timeout of TiKV read RPC requests is still the original
value of `ReadTimeoutShort` and `ReadTimeoutMedium`.
- Adding a statement-level hint like `SELECT /*+ tidb_kv_read_timeout(500ms) */ * FROM t WHERE id = ?;` to
set the timeout of the KV requests for this single query to the specified value.

> **Review comment (Contributor):** We can briefly describe the benefit of configurable timeout at the end of the Motivation section.
>
> **Author:** Added in the Motivation & Background section.

## Example Usage

Consider the stale read usage case:

```SQL
set @@tidb_read_staleness=-5;
# The unit is milliseconds. The session variable usage.
set @@tidb_kv_read_timeout=500;
select * from t where id = 1;
# The unit is milliseconds. The query hint usage.
select /*+ tidb_kv_read_timeout(500ms) */ * from t where id = 1;
```

## Benefits

For latency-sensitive applications, providing predictable sub-second read latency through fast retries
is valuable. For queries that are usually very fast (such as point-select queries), setting
`tidb_kv_read_timeout` to a short value like `500ms` makes the TiDB cluster more tolerant of network or
IO latency jitter on a single storage node, because retries happen more quickly.

## Problems

- Setting the variable `tidb_kv_read_timeout` may not be easy if it affects the timeout for all
TiKV read requests, such as Get, BatchGet, and Cop, in this session. A timeout of 1 second may be sufficient for Get requests,
but may be too small for Cop requests. Some large Cop requests may keep timing out and could not be processed properly.
- If the value of the variable `tidb_kv_read_timeout` is set too small, more retries will occur,
increasing the load pressure on the TiDB cluster. In the worst case, the query would not return until the
max backoff timeout is reached, if `tidb_kv_read_timeout` is set to a value within which none of the
replicas could finish processing.

> **Review comment (@ekexium, Jul 4, 2023):** We can solve this by letting this timeout setting take effect only once, because it is less likely to have multiple nodes affected by jitter at the same time. Unlimited retry can also lead to congestion, increasing latency of all requests and making the situation worse.
>
> **Author:** Good idea. Or the max retry times could be set to the same value as the max-replica configuration, which is the number of available replicas that can handle stale or follower read requests.
>
> **Author:** Described in the Timeout Retry part; the current strategy is to let the timeout retry take effect at most `available-replicas - 1` times.


## Detailed Design

The query hint usage would be more flexible and safer as the impact is limited to a single query.

### Add Hint Support For `tidb_kv_read_timeout`

- Add a related field in the `StmtHints` struct (part of the statement context) like
```go
type StmtHints struct {
	...
	KVReadTimeout time.Duration
}
```
- Support `tidb_kv_read_timeout` processing in `ExtractTableHintsFromStmtNode` and `handleStmtHints`, convert
the user input hint value into `StmtContext` so it could be used later.

### Support Timeout Configuration For Get And Batch-Get

- Add a related field in the `KVSnapshot` struct like

```go
type KVSnapshot struct {
	...
	KVReadTimeout time.Duration
}
```

This change is to be done in the `client-go` repository.

- Add set interface `SetKVReadTimeout` and get interface `GetKVReadTimeout`
```go
func (s *KVSnapshot) SetKVReadTimeout(timeout time.Duration) {
	s.KVReadTimeout = timeout
}

func (s *KVSnapshot) GetKVReadTimeout() time.Duration {
	return s.KVReadTimeout
}
```

This change is to be done both in the `client-go` repository and `tidb` repository.

- Call the set interfaces to pass the timeout value from `StmtContext` to the `KVSnapshot` structures, when
the `PointGet` and `BatchPointGet` executors are built.
- Change the `kvproto` to pass the read timeout value for the [`GetRequest`](https://github.com/pingcap/kvproto/blob/master/proto/kvrpcpb.proto#L26)
```protobuf
message GetRequest {
Context context = 1;
bytes key = 2;
uint64 version = 3;
uint32 read_timeout = 4; // Add this new field.
}
```
- Change the `kvproto` to pass the read timeout value for the [`BatchGetRequest`](https://github.com/pingcap/kvproto/blob/master/proto/kvrpcpb.proto#L414)
```protobuf
message BatchGetRequest {
Context context = 1;
repeated bytes keys = 2;
uint64 version = 3;
uint32 read_timeout = 4; // Add this new field.
}
```
> **Review comment (Contributor):** Currently kvproto already has a field `max_execution_duration_ms` in `Context` (https://github.com/pingcap/kvproto/blob/master/proto/kvrpcpb.proto#L820), so no need to add a new timeout field?
>
> **Author:** Yes, we could reuse the field in the `Context` which is used by write requests now, thanks for the reminder.
These changes need to be done in the `kvproto` repository.
- Support timeout checks during request handling in TiKV. When new point-get and batch point-get
requests are created, the `kv_read_timeout` value should be read from the `GetRequest` and `BatchGetRequest`
and passed to the `pointGetter`. Since there is currently no cancellation mechanism once a task has been
scheduled to the read pool in TiKV, a simpler way is to check the deadline before each processing step and
return directly if the deadline is already exceeded, so that resources are saved.
```rust
pub fn get(
    &self,
    mut ctx: Context,
    key: Key,
    start_ts: TimeStamp,
) -> impl Future<Output = Result<(Option<Value>, KvGetStatistics)>> {
    ...
    let res = self.read_pool.spawn_handle(
        async move {
            // Add timeout check.
            let snapshot =
                Self::with_tls_engine(|engine| Self::snapshot(engine, snap_ctx)).await?;
            {
                ...
                let result = Self::with_perf_context(CMD, || {
                    let _guard = sample.observe_cpu();
                    let snap_store = SnapshotStore::new(
                        snapshot,
                        start_ts,
                        ctx.get_isolation_level(),
                        !ctx.get_not_fill_cache(),
                        bypass_locks,
                        access_locks,
                        false,
                    );
                    // Add timeout check.
                    snap_store
                        .get(&key, &mut statistics)
                });
                ...
            }
        },
        ...
    );
}
```
These changes need to be done in the `tikv` repository.

> **Review comment (Contributor):** Shall we distinguish different `DeadlineExceeded` errors? If one is caused by `kv_read_timeout`, it is supposed to be retried. What happens if it is caused by other deadline settings?
>
> **Review comment (Contributor):** There don't seem to be any other deadline settings in TiDB.
>
> **Review comment (Contributor):** TiKV coprocessor requests set their deadlines from the config item `end_point_request_max_handle_duration`.
>
> **Author:** It would be clearer to distinguish them, as `kv_read_timeout` is supposed to only affect read requests.

### Support Timeout Configuration For Coprocessor Tasks

- Add a related field in the `copTask` struct like
```go
type copTask struct {
	...
	KVReadTimeout time.Duration
}
```
- Add the `kv_read_timeout` field in the [`coprocessor.Request`](https://github.com/pingcap/kvproto/blob/master/proto/coprocessor.proto#L24)
This change needs to be done in the `kvproto` repository.
- Use the `kv_read_timeout` value passed in to calculate the `deadline` in `parse_request_and_check_memory_locks` (called from `parse_and_handle_unary_request`):
```rust
fn parse_request_and_check_memory_locks(
&self,
mut req: coppb::Request,
peer: Option<String>,
is_streaming: bool,
) -> Result<(RequestHandlerBuilder<E::Snap>, ReqContext)> {
...
req_ctx = ReqContext::new(
tag,
context,
ranges,
self.max_handle_duration, // Here use the specified timeout value.
peer,
Some(is_desc_scan),
start_ts.into(),
cache_match_version,
self.perf_level,
);
}
```
This change needs to be done in the `tikv` repository.

## Timeout Retry

- When a timeout happens, retry the next available peer for read requests such as stale read and `non-leader-only`
snapshot read requests. The strategy is:
1. If the first request times out, try the next peer immediately.
2. If the responses from all the peers are `timeout` errors, return a `timeout` error to the upper layer.
3. Back off on the `timeout` error and retry the requests with **the original default** `ReadTimeoutShort`
and `ReadTimeoutMedium` values.
4. Continue processing as before.
- The timeout retry details should be recorded and handled properly in the request tracker, and so should the metrics.


## Compatibility

New fields are added to structs in the `kvproto` repository, so requests with configurable timeout
values only take effect on newer versions. When downgrading the cluster, these fields are expected to
be ignored and thus have no effect.

## More Comprehensive Solution

> **Review comment (@ekexium, Jul 4, 2023):** When we support cancelling a request from outside commands, we can implement a real hedge policy. For example:
>
> * For requests taking more than a threshold of time, send a hedge request to another replica. Either response can be taken as the final result.
> * When either response is returned, cancel the other request.
> * TiKV prioritizes non-hedge requests so the hedge policy won't increase latencies.
>
> **Author:** This could be realized if real in-time cancellation is supported as described above.

There is already a query-level timeout configuration, `max_execution_time`, which can be configured using
variables and query hints.
If the timeouts of RPC or TiKV requests could be derived from `max_execution_time` in an
intelligent way, maybe there is no need to expose another variable like `tidb_kv_read_timeout`.

For example, consider the strategy:
- `max_execution_time` is configured to `1s` at the query level.
- The first RPC timeout is configured to some proportion of `max_execution_time`, like `500ms`.
- When the first RPC times out, the retry RPC timeout is set to the remaining budget of `max_execution_time`, like `400ms`.

From the application's perspective, TiDB tries to finish the requests as soon as possible within the
given `max_execution_time`, and when TiDB cannot finish in time, a timeout error is returned to the
application and the query fails fast.