Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: reduce gRPC window size to avoid OOM #2699

Merged
merged 8 commits into from
Sep 2, 2021

Conversation

overvenus
Copy link
Member

@overvenus overvenus commented Sep 1, 2021

What problem does this PR solve?

Note: TiCDC in the following tests includes #2214.

Memory usage (pause sorter)

Tested with following changes

diff --git a/cdc/puller/sorter/unified_sorter.go b/cdc/puller/sorter/unified_sorter.go
index 9cd9c898..6e033429 100644
--- a/cdc/puller/sorter/unified_sorter.go
+++ b/cdc/puller/sorter/unified_sorter.go
@@ -197,9 +197,9 @@ func (s *UnifiedSorter) Run(ctx context.Context) error {
                }
        })
 
-       errg.Go(func() error {
-               return printError(runMerger(subctx, numConcurrentHeaps, heapSorterCollectCh, s.outputCh, ioCancelFunc))
-       })
+       // errg.Go(func() error {
+       //      return printError(runMerger(subctx, numConcurrentHeaps, heapSorterCollectCh, s.outputCh, ioCancelFunc))
+       // })
 
        errg.Go(func() error {
                captureAddr := util.CaptureAddrFromCtx(ctx)
64MB vs 64KB
image
image
image

The last image shows TiKV snapshot is holded for a long time, because CDC scan is flow controlled. Through, I doubt if it will actually happen as TiCDC sorter can always make progress.

Performance

No significant difference.

64MB (OOM) vs 64KB
image
image
image

Close #2673

What is changed and how it works?

Check List

Tests

  • Manual test (add detailed scripts or steps below)

Related changes

  • Need to cherry-pick to the release branch

Release note

Fix OOM when TiCDC captures too many regions

Signed-off-by: Neil Shen <overvenus@gmail.com>
@overvenus overvenus added type/bugfix This PR fixes a bug. component/kv-client TiKV kv log client component. labels Sep 1, 2021
@ti-chi-bot
Copy link
Member

ti-chi-bot commented Sep 1, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • amyangfei
  • liuzix

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@overvenus overvenus added this to In progress in Capturing 60K Tables via automation Sep 1, 2021
@ti-chi-bot ti-chi-bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Sep 1, 2021
@overvenus overvenus added needs-cherry-pick-release-4.0 Should cherry pick this PR to release-4.0 branch. needs-cherry-pick-release-5.0 Should cherry pick this PR to release-5.0 branch. needs-cherry-pick-release-5.1 Should cherry pick this PR to release-5.1 branch. needs-cherry-pick-release-5.2 Should cherry pick this PR to release-5.2 branch. labels Sep 1, 2021
@ti-chi-bot ti-chi-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Sep 1, 2021
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Sep 1, 2021
@amyangfei
Copy link
Contributor

By the way, is there any metric exposed from gRPC API we can use to monitor the stream/connection status, such as window size, receive buffer, etc.

@overvenus
Copy link
Member Author

By the way, is there any metric exposed from gRPC API we can use to monitor the stream/connection status, such as window size, receive buffer, etc.

Maybe you are looking for channelz https://pkg.go.dev/google.golang.org/grpc/internal/channelz#ChannelInternalMetric

Signed-off-by: Neil Shen <overvenus@gmail.com>
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Sep 2, 2021
@amyangfei
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: e83449a

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Sep 2, 2021
cdc/kv/client.go Outdated
grpcInitialConnWindowSize = 1 << 27 // 128 MB The value for initial window size on a connection
// TiCDC may open numerous gRPC streams,
// with 64KB window size, 10K streams takes about 27GB memory.
grpcInitialWindowSize = 65535 // 64 KB The value for initial window size on a stream
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about 1 << 16 ?

Signed-off-by: Neil Shen <overvenus@gmail.com>
@ti-chi-bot ti-chi-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed status/can-merge Indicates a PR has been approved by a committer. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 2, 2021
@overvenus
Copy link
Member Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 2ff79b1

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Sep 2, 2021
@codecov-commenter
Copy link

Codecov Report

Merging #2699 (a5a48f8) into master (81c22b1) will increase coverage by 5.2636%.
The diff coverage is 68.8027%.

@@               Coverage Diff                @@
##             master      #2699        +/-   ##
================================================
+ Coverage   55.8716%   61.1353%   +5.2636%     
================================================
  Files           169        161         -8     
  Lines         20667      17916      -2751     
================================================
- Hits          11547      10953       -594     
+ Misses         8012       5970      -2042     
+ Partials       1108        993       -115     

@ti-chi-bot ti-chi-bot merged commit 1e96246 into pingcap:master Sep 2, 2021
Capturing 60K Tables automation moved this from In progress to Done Sep 2, 2021
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #2723.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #2724.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #2725.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created: #2726.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-pick-approved Cherry pick PR approved by release team. component/kv-client TiKV kv log client component. needs-cherry-pick-release-4.0 Should cherry pick this PR to release-4.0 branch. needs-cherry-pick-release-5.0 Should cherry pick this PR to release-5.0 branch. needs-cherry-pick-release-5.1 Should cherry pick this PR to release-5.1 branch. needs-cherry-pick-release-5.2 Should cherry pick this PR to release-5.2 branch. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. type/bugfix This PR fixes a bug.
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

gRPC initial window is too large that may cause OOM
7 participants