Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drainer/: Reduce memory usage (#735) #737

Merged
merged 23 commits into from Sep 25, 2019

Conversation

@lichunzhu
Copy link
Collaborator

commented Sep 2, 2019

What problem does this PR solve?

Cherry pick of #735.
TOOL-1480
Current drainer uses buffed channel to cache some instances which may cause OOM when facing binlogs which consumes too much memory.

What is changed and how it works?

  1. Avoid any buffer at upstream to avoid too much memory usage and not decreasing performance.
  2. Add txnManager in loader to manage cached txn memory usage. Without txnManager the performance of loader will be affected.

To check the effectiveness, we make two tests

In the following:
v1 means 097db42, which is the newest master branch.
v2 means 1e51514, which is the current branch.
In v1, maxBinlogItemCount=512
In v2, maxBinlogItemCount=0

TEST 1: Small binlogs, to validate the efficiency of drainer

I put the test of v1 and v2 in one picture. 15:30 ~ 15:50 is test of v1 while 15:53 ~ 16:13 is test of v2. As we can see the memory usage and drainer event of both branches are similar.
drainer event:
image
memory:
image
execution time:
image

TEST 2: the upstream data one binlog near 60M, to validate the momery usage of drainer

For v1 and situations we set loader.input,loader.success, dsyncer.success buffer to 1024 drainer will get an OOM problem. The memory usage will rise up to more than 100G.
For v2, the memory usage is around 2G. The drainer event for DDL is around 2k and is around 14k for DMLs.
drainer event:
image
memory:
image
execute time:
image

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Code changes

Side effects

  • Possible performance regression

Related changes

  • Need to update the documentation
  • Need to update the tidb-ansible repository
  • Need to be included in the release note
@lichunzhu lichunzhu changed the title Czli/drainer/reduce memory usage drainer/: Reduce memory usage Sep 3, 2019
@lichunzhu lichunzhu changed the title drainer/: Reduce memory usage drainer/: Reduce memory usage https://github.com/pingcap/tidb-binlog/pull/735 Sep 3, 2019
@lichunzhu lichunzhu changed the title drainer/: Reduce memory usage https://github.com/pingcap/tidb-binlog/pull/735 drainer/: Reduce memory usage Sep 3, 2019
@lichunzhu lichunzhu changed the title drainer/: Reduce memory usage drainer/: Reduce memory usage (#735) Sep 3, 2019
@lichunzhu lichunzhu requested review from WangXiangUSTC and leoppro Sep 4, 2019
@WangXiangUSTC

This comment has been minimized.

Copy link
Member

commented Sep 5, 2019

LGTM

Copy link
Member

left a comment

LGTM

@suzaku suzaku added status/LGT2 and removed status/LGT1 labels Sep 6, 2019
@lichunzhu

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 6, 2019

DNM now. This PR should better be merged after release-2.1 version.

@july2993

This comment has been minimized.

Copy link
Collaborator

commented Sep 8, 2019

do you run it at the case there lag ?
In v1, maxBinlogItemCount=512

for this, the mem will not take only less than 1G
60M * 512 = at least 30G

also you may need to change pkg/loader ' input buf size.
keep in mind what you need to do is prove there will not take too much memory when there's no lag.

lichunzhu added 2 commits Sep 9, 2019
…ip binlog by previous test case
@july2993 july2993 added the status/DNM label Sep 9, 2019
@GregoryIan

This comment has been minimized.

Copy link
Collaborator

commented Sep 24, 2019

@suzaku PTAL

@lichunzhu

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 25, 2019

DNM now, I will add some unit tests about txnManager.

@GregoryIan

This comment has been minimized.

Copy link
Collaborator

commented Sep 25, 2019

please complete it today @lichunzhu

@kennytm kennytm added the status/DNM label Sep 25, 2019
… when block at cond.Wait() or input is closed. add one unit test to test whether txnManager can handle txn whose size if bigger than maxCacheSize
@lichunzhu

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 25, 2019

/run-all-tests tidb=release-3.0 tikv=release-3.0 pd=release-3.0

@lichunzhu

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 25, 2019

@lichunzhu lichunzhu added status/PTAL and removed status/DNM labels Sep 25, 2019
select {
case input <- txn:
case <-time.After(50 * time.Microsecond):
c.Fatal("txnManager gets blocked while receiving txns")

This comment has been minimized.

Copy link
@suzaku

suzaku Sep 25, 2019

Member

If the 5th txn get blocked sending to input, why didn't this test fail?

This comment has been minimized.

Copy link
@lichunzhu

lichunzhu Sep 25, 2019

Author Collaborator

The 5th txn was picked from input but can't be added to the txnManager.cacheChan. Although it gets blocked at cond.Wait() but it's still picked out from input. By the way, we can't know a txn's size when it is not picked out from input.

This comment has been minimized.

Copy link
@lichunzhu

lichunzhu Sep 25, 2019

Author Collaborator

The check c.Assert(output, check.HasLen, 4) can make sure it's not added.

pkg/loader/load_test.go Outdated Show resolved Hide resolved
case t := <-output:
txnManager.pop(t)
c.Assert(t, check.DeepEquals, txn)
case <-time.After(50 * time.Microsecond):

This comment has been minimized.

Copy link
@suzaku

suzaku Sep 25, 2019

Member

No need to wait? Just use default, because if we can't get a txn at this point, something must be wrong.

This comment has been minimized.

Copy link
@lichunzhu

lichunzhu Sep 25, 2019

Author Collaborator

This place default can be used.

output := txnManager.run()
select {
case input <- txnSmall:
case <-time.After(50 * time.Microsecond):

This comment has been minimized.

Copy link
@suzaku

suzaku Sep 25, 2019

Member

Use default?

This comment has been minimized.

Copy link
@lichunzhu

lichunzhu Sep 25, 2019

Author Collaborator

In txnManager.run() it will start a new goroutine and start watching input channel which will take some time. Using default may get fault.

@lichunzhu

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 25, 2019

/run-unit-tests

@@ -413,13 +413,13 @@ func (s *txnManagerSuite) TestRunTxnManager(c *check.C) {
case t := <-output:
txnManager.pop(t)
c.Assert(t, check.DeepEquals, txn)
case <-time.After(50 * time.Microsecond):
default:
c.Fatal("Fail to pick txn from txnManager")
}

This comment has been minimized.

Copy link
@suzaku

suzaku Sep 25, 2019

Member
select {
case t := <-output:
case <-time.After...
}

This pattern appears a lot in this test, is it possible to extract this as a helper function to simplify the test?

@lichunzhu

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 25, 2019

@suzaku PTAL

@lichunzhu

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 25, 2019

/run-unit-tests

@GregoryIan

This comment has been minimized.

Copy link
Collaborator

commented Sep 25, 2019

Rest LGTM

Co-Authored-By: satoru <satorulogic@gmail.com>
@lichunzhu

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 25, 2019

/run-unit-tests

@july2993

This comment has been minimized.

Copy link
Collaborator

commented Sep 25, 2019

/run-all-tests tidb=release-3.0 tikv=release-3.0 pd=release-3.0

Copy link
Collaborator

left a comment

LGTM

@lichunzhu

This comment has been minimized.

Copy link
Collaborator Author

commented Sep 25, 2019

/run-integration-tests tidb=release-3.0 tikv=release-3.0 pd=release-3.0

@lichunzhu lichunzhu merged commit bbe5d6e into pingcap:master Sep 25, 2019
5 checks passed
5 checks passed
idc-jenkins-ci-binlog/build Jenkins job succeeded.
Details
idc-jenkins-ci-binlog/check Jenkins job succeeded.
Details
idc-jenkins-ci-tidb-binlog/integration-test Jenkins job succeeded.
Details
idc-jenkins-ci-tidb-binlog/unit-test Jenkins job succeeded.
Details
license/cla Contributor License Agreement is signed.
Details
@lichunzhu lichunzhu deleted the lichunzhu:czli/drainer/reduceMemoryUsage branch Sep 25, 2019
@july2993 july2993 referenced this pull request Oct 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.