*: Optimize the underlying SegmentReader concurrency for TableScan under disagg arch #10522

Merged
ti-chi-bot[bot] merged 6 commits into pingcap:master from
JaySon-Huang:opt_disagg_concurrency
Nov 4, 2025


@JaySon-Huang JaySon-Huang commented Nov 3, 2025

What problem does this PR solve?

Issue Number: ref #10356

Problem Summary: Query performance under the disagg arch is slow when the local cache on the compute node misses.

The main reason is that the default SegmentReaderPool size is vcore * dt_read_thread_count_scale (that is, vcore * 2), and StorageDisaggregated creates the SegmentReadTaskPool with max_active_segment = num_stream. On a cache miss, SegmentReader performs blocking IO by calling the S3 API. As a result, TableScan (which reads from the SegmentReaderPool) cannot produce data fast enough to keep the rest of the pipeline-model computation busy.

The ideal fix is to refine the storage-layer read logic so that a SegmentReaderTask that requires network IO against the remote storage service yields its slot in the SegmentReaderPool, giving another SegmentReaderTask the chance to read data from the local cache. However, that requires a lot of effort.
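To make the yielding idea concrete, here is a minimal sketch using Python's `asyncio` rather than TiFlash's actual C++ thread pools. Everything in it is hypothetical (the `read_segment` and `run_reads` names, the latency constant, the cached/uncached flag); it only illustrates how a task that waits on remote IO can give up the worker so that cache-hit tasks finish first.

```python
import asyncio

# Assumed remote-storage latency, for illustration only.
REMOTE_IO_SECONDS = 0.05


async def read_segment(segment_id: int, cached: bool, finished: list) -> None:
    """Hypothetical segment read: cheap on a cache hit, slow on a miss."""
    if not cached:
        # Awaiting here yields the worker, so other read tasks can run
        # while this one waits for the (simulated) remote storage service.
        await asyncio.sleep(REMOTE_IO_SECONDS)
    finished.append(segment_id)


async def run_reads(segments: dict) -> list:
    """Run all reads concurrently and return segment ids in completion order."""
    finished: list = []
    await asyncio.gather(
        *(read_segment(seg_id, cached, finished) for seg_id, cached in segments.items())
    )
    return finished


if __name__ == "__main__":
    # Segment 0 misses the local cache; segments 1 and 2 hit it.
    print(asyncio.run(run_reads({0: False, 1: True, 2: True})))
```

In this toy model the two cached segments complete before the segment that needs remote IO, even though it was submitted first. With blocking IO on a fixed-size pool, the uncached read would instead hold its worker for the whole wait.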

**For now, we increase the underlying SegmentReader concurrency to speed up TableScan on cache misses under the disagg arch.**

What is changed and how it works?

*: Optimize the underlying SegmentReader concurrency for TableScan under disagg arch
  - Adjust the concurrency under disagg
    * `SegmentReaderPoolManager` initializes the SegmentReaderPool with size = vcore * dt_read_thread_count_scale (2.0) * 10 on disagg compute nodes
    * `StorageDisaggregated` creates the SegmentReadTaskPool with max_active_segment = num_stream * 10 for disagg read tasks
    * `initThreadPool` creates thread pools with at most 6 * vcore threads for `BuildReadTaskForWNPool`/`BuildReadTaskForWNTablePool`/`BuildReadTaskPool`/`RNWritePageCachePool`
  - ScanDetails changes under disagg
    * Add rows_per_sec and bytes_per_sec for TableScan, summed across all concurrency
    * Fix num_columns and read_mode in scan_details
    * Fix `SegmentReadTaskPool` logging not showing mpp_task_id correctly
    * Add logging when building tasks from the write node response finishes
  - Add an HTTP API /tiflash/remote/cache/evict for evicting the local cache on a compute node, for testing
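The sizing rules in the list above can be written out as plain arithmetic. The sketch below is not the TiFlash implementation: the function names, the rounding with `math.ceil`, and the lower bound of 1 are assumptions made for illustration. Only the factors (scale 2.0, the extra `* 10`, and the `6 * vcore` cap) come from the description.

```python
import math

# The extra "* 10" applied under the disagg arch, as stated in the list above.
DISAGG_CONCURRENCY_FACTOR = 10


def segment_reader_pool_size(vcore: int,
                             read_thread_count_scale: float = 2.0,
                             disagg_compute: bool = False) -> int:
    """Hypothetical helper: SegmentReaderPool size for a node."""
    size = vcore * read_thread_count_scale
    if disagg_compute:
        size *= DISAGG_CONCURRENCY_FACTOR
    # Rounding up and clamping to at least 1 are assumptions of this sketch.
    return max(1, math.ceil(size))


def max_active_segments(num_stream: int, disagg_read: bool = False) -> int:
    """Hypothetical helper: max_active_segment for a SegmentReadTaskPool."""
    return num_stream * DISAGG_CONCURRENCY_FACTOR if disagg_read else num_stream


def io_pool_max_threads(vcore: int) -> int:
    """Upper bound for the build-read-task and page-cache-write pools."""
    return 6 * vcore


if __name__ == "__main__":
    vcore = 8
    print(segment_reader_pool_size(vcore))                       # before the change
    print(segment_reader_pool_size(vcore, disagg_compute=True))  # disagg compute node
    print(max_active_segments(vcore, disagg_read=True))
    print(io_pool_max_threads(vcore))
```

For an 8-vcore compute node this gives 16 reader threads before the change and 160 after it, 80 active segments when num_stream is 8, and at most 48 threads in each of the listed pools.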

Tested with chbenchmark 8000

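| Setup | First query after CN restarted: q1 latency (seconds) | First query after CN restarted: q1 TableScan (seconds) | First query after CN restarted: TableScan bytes_per_sec | First query after CN restarted: TableScan rows_per_sec | Following 4 query avg: q1 latency (seconds) | Following 4 query avg: q1 TableScan (seconds) | Following 4 query avg: TableScan bytes_per_sec | Following 4 query avg: TableScan rows_per_sec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Optimal baseline (fully hit local cache on CN) | 6.44 | 3.67 | | | 6.42 | 3.27 | 13178 MiB/s | 600,678,349 |
| vcore=8 max_active_seg=vcore | 95.70 | 92.60 | 614 MiB/s | 28,020,696 | 50.68 | 47.50 | 1182 MiB/s | 53,901,421 |
| vcore=8 max_active_seg=5*vcore | 19.15 | 15.80 | 3973 MiB/s | 181,027,573 | 8.66 | 5.36 | 12564 MiB/s | 572,858,663 |
| vcore=8 max_active_seg=10*vcore | 10.96 | 7.54 | 10516 MiB/s | 479,218,168 | 8.42 | 4.78 | 19440 MiB/s | 886,323,146 |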
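
As a reading aid, the q1 latencies reported above can be turned into speedup ratios. This is a small sketch over numbers copied from the results; the dictionary keys (`"1x"`, `"5x"`, `"10x"` for max_active_seg = vcore, 5*vcore, 10*vcore) and the helper names are made up for this example.

```python
# q1 end-to-end latency in seconds, copied from the results above.
FIRST_QUERY = {"baseline": 6.44, "1x": 95.70, "5x": 19.15, "10x": 10.96}
FOLLOWING_AVG = {"baseline": 6.42, "1x": 50.68, "5x": 8.66, "10x": 8.42}


def speedup(latencies: dict, before: str, after: str) -> float:
    """How many times faster `after` is compared with `before`."""
    return latencies[before] / latencies[after]


def slowdown_vs_baseline(latencies: dict, setup: str) -> float:
    """How many times slower `setup` is than the fully cached baseline."""
    return latencies[setup] / latencies["baseline"]


if __name__ == "__main__":
    for name, data in (("first query", FIRST_QUERY), ("following avg", FOLLOWING_AVG)):
        print(f"{name}: 1x -> 10x speedup = {speedup(data, '1x', '10x'):.2f}x, "
              f"10x vs baseline = {slowdown_vs_baseline(data, '10x'):.2f}x")
```

On these numbers, raising max_active_seg from vcore to 10*vcore makes the first query after a restart roughly 8.7 times faster, which leaves it about 1.7 times slower than the fully cached baseline.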
  • Deploy next-gen cluster
    • 2 compute nodes with vcore limited to 8, local cache disabled
    - host: 172.31.10.1
...
      config:
        flash.disaggregated_mode: tiflash_compute
        storage.main.dir:
            - /tidb-deploy/tiflash-9000/data
        storage.remote.cache.capacity: 200000000000
        storage.remote.cache.dir: /tidb-deploy/tiflash-9000/remote_cache
        storage.remote.cache.dtfile_level: 0
        tcp_port: 9000
      resource_control:
        cpu_quota: 800%
    - host: 172.31.10.2
...
      config:
        flash.disaggregated_mode: tiflash_compute
        storage.main.dir:
            - /tidb-deploy/tiflash-9000/data
        storage.remote.cache.capacity: 200000000000
        storage.remote.cache.dir: /tidb-deploy/tiflash-9000/remote_cache
        storage.remote.cache.dtfile_level: 0
        tcp_port: 9000
      resource_control:
        cpu_quota: 800%
  • BR restore chbenchmark 8000 to the cluster
    • chbenchmark8k.order_line involves about 2610 segments
  • Run chbenchmark AP query 1 on the static dataset
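
A rough way to see why max_active_segment matters for this dataset is to count how many rounds of segment reads are needed to cover about 2610 segments. The sketch below assumes segments are processed in equal lockstep rounds, which the real scheduler does not do, so treat the result as a back-of-the-envelope estimate. The `scheduling_rounds` helper is hypothetical.

```python
def scheduling_rounds(num_segments: int, max_active_segment: int) -> int:
    """Ceiling of num_segments / max_active_segment, using integer arithmetic."""
    if max_active_segment <= 0:
        raise ValueError("max_active_segment must be positive")
    return -(-num_segments // max_active_segment)


if __name__ == "__main__":
    segments = 2610  # approximate segment count of chbenchmark8k.order_line
    vcore = 8
    for factor in (1, 5, 10):
        active = factor * vcore
        print(f"max_active_seg={active}: {scheduling_rounds(segments, active)} rounds")
```

Under that simplification, 8 active segments need 327 rounds, 40 need 66, and 80 need 33. When each round is dominated by remote IO latency, fewer rounds means less time spent waiting.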

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Optimize the TableScan performance under disagg arch

Signed-off-by: JaySon-Huang <tshent@qq.com>
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 3, 2025
Signed-off-by: JaySon-Huang <tshent@qq.com>
Signed-off-by: JaySon-Huang <tshent@qq.com>
@JaySon-Huang JaySon-Huang force-pushed the opt_disagg_concurrency branch from 0586ebe to 59eae76 Compare November 3, 2025 11:19
Signed-off-by: JaySon-Huang <tshent@qq.com>
@JaySon-Huang JaySon-Huang changed the title from "[WIP] *:Opt disagg concurrency" to "*: Optimize the underlying SegmentReader concurrency for TableScan under disagg arch" Nov 3, 2025
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 3, 2025
Signed-off-by: JaySon-Huang <tshent@qq.com>
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Nov 3, 2025
@JaySon-Huang
Contributor Author

/test pull-unit-test

@JaySon-Huang
Contributor Author

@JinheLin @Lloyd-Pottiger @CalvinNeo PTAL

Signed-off-by: JaySon-Huang <tshent@qq.com>
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Nov 4, 2025
Member

@CalvinNeo CalvinNeo left a comment


lgtm

@ti-chi-bot
Contributor

ti-chi-bot bot commented Nov 4, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CalvinNeo, JinheLin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Nov 4, 2025
@ti-chi-bot
Contributor

ti-chi-bot bot commented Nov 4, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-11-04 01:43:32.737987773 +0000 UTC m=+148062.181017642: ☑️ agreed by JinheLin.
  • 2025-11-04 02:01:00.514681815 +0000 UTC m=+149109.957711694: ☑️ agreed by CalvinNeo.

@ti-chi-bot ti-chi-bot bot merged commit 09ca448 into pingcap:master Nov 4, 2025
7 checks passed
@JaySon-Huang JaySon-Huang deleted the opt_disagg_concurrency branch November 4, 2025 02:04
@JaySon-Huang
Contributor Author

/cherry-pick release-nextgen-20251011

@ti-chi-bot
Member

@JaySon-Huang: new pull request created to branch release-nextgen-20251011: #10523.

Details

In response to this:

/cherry-pick release-nextgen-20251011

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot bot pushed a commit that referenced this pull request Nov 4, 2025
…der disagg arch (#10522) (#10523)

ref #10356


Signed-off-by: JaySon-Huang <tshent@qq.com>

Co-authored-by: JaySon-Huang <tshent@qq.com>

Labels

approved lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
