
doc: rfc for configurable kv timeout #45093

Merged
merged 7 commits into pingcap:master on Jul 10, 2023

Conversation

@cfzjywxk (Contributor) commented Jun 30, 2023

What problem does this PR solve?

Issue Number: ref #44771

Problem Summary:
Add the detailed design document for the configurable kv timeout proposal.

What is changed and how it works?

Check List

Tests

Side effects

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Signed-off-by: cfzjywxk <lsswxrxr@163.com>
@ti-chi-bot added the do-not-merge/needs-linked-issue, release-note-none, and size/L (denotes a PR that changes 100-499 lines, ignoring generated files) labels on Jun 30, 2023
tiprow bot commented Jun 30, 2023

Hi @cfzjywxk. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Comment on lines 102 to 119
- Change the `kvproto` to pass the read timeout value for the [`GetRequest`](https://github.com/pingcap/kvproto/blob/master/proto/kvrpcpb.proto#L26)
```protobuf
message GetRequest {
    Context context = 1;
    bytes key = 2;
    uint64 version = 3;
    uint32 read_timeout = 4; // Add this new field.
}
```
- Change the `kvproto` to pass the read timeout value for the [`BatchGetRequest`](https://github.com/pingcap/kvproto/blob/master/proto/kvrpcpb.proto#L414)
```protobuf
message BatchGetRequest {
    Context context = 1;
    repeated bytes keys = 2;
    uint64 version = 3;
    uint32 read_timeout = 4; // Add this new field.
}
```
Contributor

Currently `kvproto` already has a field `max_execution_duration_ms` in `Context` (https://github.com/pingcap/kvproto/blob/master/proto/kvrpcpb.proto#L820), so there is no need to add a new timeout field?

Contributor Author

Yes, we could reuse the field in the `Context` that is currently used by write requests. Thanks for the reminder.
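
To make the reuse concrete, here is a minimal sketch of a client building a `GetRequest` that carries its read timeout in the existing `Context` field instead of a new `read_timeout` field. It assumes kvproto's generated rust-protobuf accessors (`set_key`, `mut_context`, `set_max_execution_duration_ms`); the helper itself is hypothetical, not code from this PR:

```rust
use kvproto::kvrpcpb::GetRequest;

// Sketch only: reuse Context.max_execution_duration_ms to carry the read
// timeout for a point get, rather than adding a new GetRequest field.
fn build_get_request(key: Vec<u8>, version: u64, read_timeout_ms: u64) -> GetRequest {
    let mut req = GetRequest::default();
    req.set_key(key);
    req.set_version(version);
    // The field was introduced for write requests; reads reuse it here.
    req.mut_context().set_max_execution_duration_ms(read_timeout_ms);
    req
}
```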

Comment on lines 173 to 197
- Add the `kv_read_timeout` field in the [`coprocessor.Request`](https://github.com/pingcap/kvproto/blob/master/proto/coprocessor.proto#L24).
This change needs to be done in the `kvproto` repository.
- Use the `kv_read_timeout` value passed in to calculate the `deadline` result in `parse_and_handle_unary_request`:
```rust
fn parse_request_and_check_memory_locks(
    &self,
    mut req: coppb::Request,
    peer: Option<String>,
    is_streaming: bool,
) -> Result<(RequestHandlerBuilder<E::Snap>, ReqContext)> {
    ...
    req_ctx = ReqContext::new(
        tag,
        context,
        ranges,
        self.max_handle_duration, // Here use the specified timeout value.
        peer,
        Some(is_desc_scan),
        start_ts.into(),
        cache_match_version,
        self.perf_level,
    );
}
```
This change needs to be done in the `tikv` repository.
Contributor

ditto?

```SQL
set @@tidb_read_staleness=-5;
# The unit is milliseconds. The session variable usage.
set @@tidb_tikv_tidb_timeout=500;
```
Contributor

Suggested change
set @@tidb_tikv_tidb_timeout=500;
set @@tidb_kv_read_timeout=500;

Signed-off-by: cfzjywxk <lsswxrxr@163.com>
- Setting the variable `tidb_kv_read_timeout` may not be easy if it affects the timeout for all
TiKV read requests, such as Get, BatchGet, and Cop, in this session. A timeout of 1 second may be sufficient for Get requests,
but too small for Cop requests; some large Cop requests may keep timing out and never be processed properly.
- If the value of the variable `tidb_kv_read_timeout` is set too small, more retries will occur,
@ekexium (Contributor) commented Jul 4, 2023

We can solve this by letting this timeout setting take effect only once, because it is unlikely that multiple nodes are affected by jitter at the same time. Unlimited retries can also lead to congestion, increasing the latency of all requests and making the situation worse.

Contributor Author

Good idea. Alternatively, the maximum retry count could be set to the same value as the max-replicas configuration, which is the number of available replicas that can handle stale read or follower read requests.

Contributor Author

As described in the Timeout Retry part, the current strategy is to let the timeout retry take effect at most `available replicas - 1` times.
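
To illustrate that retry budget, here is a standalone sketch; the helper and its signature are hypothetical, not client-go or TiDB code. The short, user-configured timeout applies to at most `available replicas - 1` attempts, and the final attempt falls back to the default timeout so cluster-wide jitter cannot make a request fail indefinitely:

```rust
use std::time::Duration;

// Hypothetical sketch of the bounded timeout retry discussed above.
fn read_with_timeout_retry<R, E>(
    available_replicas: usize,
    short_timeout: Duration,
    default_timeout: Duration,
    mut send: impl FnMut(Duration) -> Result<R, E>, // one attempt per replica
) -> Result<R, E> {
    // The short timeout takes effect on at most `available_replicas - 1`
    // attempts; each failure falls through to the next replica.
    for _ in 0..available_replicas.saturating_sub(1) {
        if let Ok(resp) = send(short_timeout) {
            return Ok(resp);
        }
    }
    // Final attempt: fall back to the default (longer) timeout so the
    // request is not starved when every replica is slow at once.
    send(default_timeout)
}
```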

value of `ReadTimeoutShort` and `ReadTimeoutMedium`.
- Adding a statement-level hint like `SELECT /*+ tidb_kv_read_timeout(500ms) */ * FROM t where id = ?;` to
set the timeout value of the KV requests of this single query to a certain value.

Contributor

We can briefly describe the benefit of a configurable timeout at the end of the Motivation section.

Contributor Author

Added in the Motivation & Background section.

- Support timeout checking during request handling in TiKV. When new point-get and batch point-get
requests are created, the `kv_read_timeout` value should be read from the `GetRequest` and `BatchGetRequest`
and passed to the `pointGetter`. But since there is currently no canceling mechanism once a task is scheduled to the
read pool in TiKV, a simpler way is to check the deadline duration before the next processing step and try to return
Contributor

Shall we distinguish between different DeadlineExceeded errors? If it's caused by `kv_read_timeout`, it's supposed to be retried. What happens if it is caused by other deadline settings?

Contributor

There don't seem to be any other deadline settings in TiDB.

Contributor

TiKV coprocessor requests set their deadlines from the config item `end_point_request_max_handle_duration`.

Contributor Author

It would be clearer to distinguish them, as `kv_read_timeout` is supposed to affect only read requests.
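
A minimal standalone sketch of what distinguishing the two deadlines could look like; the types are illustrative, not TiKV's actual `Deadline` implementation. The check runs cooperatively between processing steps, since a task already scheduled to the read pool cannot be canceled in flight, and the error records which setting produced it so only `KvReadTimeout` triggers a retry:

```rust
use std::time::{Duration, Instant};

// Which deadline fired: the user-configured kv_read_timeout is retryable
// on another replica; the server-side max-handle-duration is not.
#[derive(Debug)]
enum DeadlineExceeded {
    KvReadTimeout,
    MaxHandleDuration,
}

struct Deadline {
    deadline: Instant,
    from_kv_read_timeout: bool,
}

impl Deadline {
    fn new(timeout: Duration, from_kv_read_timeout: bool) -> Self {
        Deadline {
            deadline: Instant::now() + timeout,
            from_kv_read_timeout,
        }
    }

    // Called between processing steps; a task already running in the read
    // pool cannot be interrupted, so this is a cooperative check.
    fn check(&self) -> Result<(), DeadlineExceeded> {
        if Instant::now() < self.deadline {
            Ok(())
        } else if self.from_kv_read_timeout {
            Err(DeadlineExceeded::KvReadTimeout)
        } else {
            Err(DeadlineExceeded::MaxHandleDuration)
        }
    }
}
```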

several reasons:
- When the read task is polled and executed, there is currently no timeout-checking mechanism in the task
scheduler.
- The read operations are synchronous, so if the read task is blocked by slow I/O, the task could not be
@ekexium (Contributor) commented Jul 4, 2023

How about letting client-go send a second request when the timeout is exceeded on the client side? It could cover more cases, such as network delay.

Contributor Author

Yes, that may be the first step in the initial implementation. It would be better if requests could be canceled promptly, top-down, but that is not easy in TiKV for the reasons listed here.

Requests with configurable timeout values take effect only on newer versions. In theory, these fields are not expected to take effect when downgrading the
cluster.

## More comprehensive solution
@ekexium (Contributor) commented Jul 4, 2023

When we support canceling a request from outside commands, we can implement a real hedge policy. For example (a rough sketch follows the list):

  • For requests taking more than a threshold of time, send a hedge request to another replica. Either response can be taken as the final result.
  • When either response is returned, cancel the other request.
  • TiKV prioritizes non-hedge requests so the hedge policy won't increase latencies.
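
A minimal tokio-based sketch of that hedge policy; the `read_replica` function is hypothetical, not client-go's or TiKV's actual API. The hedge request is sent only after a threshold elapses, the first response wins, and the losing future is dropped on the client side; true server-side cancellation would still need the out-of-band cancel command discussed above:

```rust
use std::time::Duration;

// Sketch: hedge the read after `hedge_after` and take whichever response
// arrives first. Dropping the losing future cancels it client-side only.
async fn hedged_read(hedge_after: Duration) -> Vec<u8> {
    let primary = read_replica(0);
    tokio::pin!(primary);

    tokio::select! {
        // Fast path: the primary replica answers within the threshold.
        resp = &mut primary => resp,
        _ = tokio::time::sleep(hedge_after) => {
            // Primary is slow: race it against a hedge request to another
            // replica; the first response is the final result.
            tokio::select! {
                resp = &mut primary => resp,
                resp = read_replica(1) => resp,
            }
        }
    }
}

// Stand-in for a real replica read; illustrative only.
async fn read_replica(id: usize) -> Vec<u8> {
    format!("replica-{id}").into_bytes()
}
```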

Contributor Author

This could be realized once real in-time canceling is supported, as described above.

@crazycs520 (Contributor) left a comment

REST LGTM

Co-authored-by: crazycs <crazycs520@gmail.com>
ti-chi-bot commented Jul 6, 2023

@cfzjywxk: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-br-integration-test | e522baa | link | true | `/test pull-br-integration-test` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


@zhangjinpeng87 (Contributor) left a comment

Please organize the RFC as:

  1. Motivation/Background
  2. The usage of proposed solution, detailed design
  3. Other alternatives we have considered, list their pros and cons

Signed-off-by: cfzjywxk <lsswxrxr@163.com>
# Proposal: Configurable KV Timeout

* Authors: [cfzjywxk](https://github.com/cfzjywxk)
* Tracking issue: TODO
Contributor

We can file a tracking issue now.

@ti-chi-bot added the lgtm label on Jul 10, 2023
ti-chi-bot commented Jul 10, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crazycs520, ekexium

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot commented Jul 10, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-07-04 08:07:57.165512981 +0000 UTC m=+104909.099146404: ☑️ agreed by crazycs520.
  • 2023-07-10 02:26:48.12626193 +0000 UTC m=+298899.896600643: ☑️ agreed by ekexium.

@ti-chi-bot merged commit b040671 into pingcap:master on Jul 10, 2023
3 of 4 checks passed
Labels: approved, component/docs, lgtm, release-note-none, sig/transaction, size/L