Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store/gc_worker: Fix that gc_delete_range table is queried with wrong form of TS #7610

Merged

Conversation

MyonKeminta
Copy link
Contributor

@MyonKeminta MyonKeminta commented Sep 4, 2018

Signed-off-by: MyonKeminta MyonKeminta@users.noreply.github.com

What problem does this PR solve?

The GC worker checks table mysql.gc_delete_range with ts < safePoint and the ranges indicated by them. But unfortunately (and dramatically) the table was queried with incorrect form of TS previously. GCWorker calculated a safePoint in the form physical << 18 | logical, where the physical is a timestamp in milliseconds and logical is 0. However the ts field in gc_delete_range is in seconds. So these timestamps in the table is always much less than safePoint, and as a result the ranges will be immediately deleted at the next time GC runs.

This PR fixes it.

What is changed and how it works?

Check List

Tests

  • Manually.

Related changes

  • Need to cherry-pick to the release branch

@MyonKeminta
Copy link
Contributor Author

@zhangjinpeng1987 PTAL

ddl/util/util.go Outdated
sql := fmt.Sprintf(loadDeleteRangeSQL, safePoint)
// DeleteRange records in `gc_delete_range` table is saved with a timestamp in seconds, and the physical part of
// safePoint's form is in milliseconds. Divide it by 1000 to get seconds.
sql := fmt.Sprintf(loadDeleteRangeSQL, oracle.ExtractPhysical(safePoint)/1000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should add test to cover generating SQL.

zhangjinpeng87
zhangjinpeng87 previously approved these changes Sep 5, 2018
Copy link
Contributor

@zhangjinpeng87 zhangjinpeng87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhangjinpeng87 zhangjinpeng87 added the status/LGT1 Indicates that a PR has LGTM 1. label Sep 5, 2018
@coocood
Copy link
Member

coocood commented Sep 5, 2018

I think we better rename getNowTS to getNowSecs.

Signed-off-by: MyonKeminta <MyonKeminta@users.noreply.github.com>
@MyonKeminta
Copy link
Contributor Author

@zhangjinpeng1987 @coocood PTAL again. Manual test is in progress.

@coocood
Copy link
Member

coocood commented Sep 5, 2018

LGTM

Copy link
Contributor

@zhangjinpeng87 zhangjinpeng87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zhangjinpeng87 zhangjinpeng87 added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Sep 5, 2018
@MyonKeminta
Copy link
Contributor Author

Tested on a TiDB cluster, with gc_run_interval and gc_lifetime both set to 10min.

Producing the problem

Drop a table:

2018/09/05 04:11:13.308 session.go:1404: [info] [CRUCIAL OPERATION] con:9644 schema_ver:45 drop table sbtest4 (by root@10.147.38.64).

Check log of GC:

2018/09/05 04:07:59.378 gc_worker.go:244: [info] [gc worker] 59691ea2f100004 starts the whole job, safePoint: 402684609643937792
2018/09/05 04:07:59.378 gc_worker.go:502: [info] [gc worker] 59691ea2f100004 start resolve locks, safePoint: 402684609643937792.
2018/09/05 04:08:03.703 gc_worker.go:574: [info] [gc worker] 59691ea2f100004 finish resolve locks, safePoint: 402684609643937792, regions: 2431, total resolved: 0, cost time: 4.325325808s
2018/09/05 04:08:03.704 gc_worker.go:370: [info] [gc worker] 59691ea2f100004 start delete 0 ranges
2018/09/05 04:08:03.704 gc_worker.go:387: [info] [gc worker] 59691ea2f100004 finish delete 0 ranges, cost time: 181ns
2018/09/05 04:08:03.705 gc_worker.go:406: [info] [gc worker] 59691ea2f100004 start re-delete 0 ranges
2018/09/05 04:08:03.705 gc_worker.go:423: [info] [gc worker] 59691ea2f100004 finish re-delete 0 ranges, cost time: 180ns
2018/09/05 04:09:43.708 gc_worker.go:730: [info] [gc worker] 59691ea2f100004 start gc, concurrency 2, safePoint: 402684609643937792.
2018/09/05 04:10:20.592 gc_worker.go:751: [info] [gc worker] 59691ea2f100004 finish gc, safePoint: 402684609643937792, successful regions: 2431, failed regions: 0, total cost time: 36.883631631s
(Drop table happens here)
2018/09/05 04:17:59.384 gc_worker.go:244: [info] [gc worker] 59691ea2f100004 starts the whole job, safePoint: 402684766930337792
2018/09/05 04:17:59.384 gc_worker.go:502: [info] [gc worker] 59691ea2f100004 start resolve locks, safePoint: 402684766930337792.
2018/09/05 04:18:03.257 gc_worker.go:574: [info] [gc worker] 59691ea2f100004 finish resolve locks, safePoint: 402684766930337792, regions: 2431, total resolved: 0, cost time: 3.873065073s
2018/09/05 04:18:03.258 gc_worker.go:370: [info] [gc worker] 59691ea2f100004 start delete 1 ranges
2018/09/05 04:18:09.607 gc_worker.go:387: [info] [gc worker] 59691ea2f100004 finish delete 1 ranges, cost time: 6.3482219s
2018/09/05 04:18:09.608 gc_worker.go:406: [info] [gc worker] 59691ea2f100004 start re-delete 1 ranges
2018/09/05 04:18:11.908 gc_worker.go:423: [info] [gc worker] 59691ea2f100004 finish re-delete 1 ranges, cost time: 2.299524768s

It shows that Delete Range happens even that its life time (10min) haven't been expired.

(In this log, the "re-delete range" is an additional operation in my another PR. It doesn't affect other behavior before it. It can be just ignored.)

This PR

Drop a table:

2018/09/05 06:24:34.782 session.go:1378: [info] [CRUCIAL OPERATION] con:4756 schema_ver:48 drop table sbtest3 (by root@10.147.38.64).

Check log of GC:

2018/09/05 06:18:23.585 gc_worker.go:240: [info] [gc worker] 596a1576af80004 starts the whole job, safePoint: 402686660711022592
2018/09/05 06:18:23.585 gc_worker.go:429: [info] [gc worker] 596a1576af80004 start resolve locks, safePoint: 402686660711022592.
2018/09/05 06:18:27.225 gc_worker.go:501: [info] [gc worker] 596a1576af80004 finish resolve locks, safePoint: 402686660711022592, regions: 2448, total resolved: 0, cost time: 3.639854729s
2018/09/05 06:18:27.232 gc_worker.go:357: [info] [gc worker] 596a1576af80004 start delete 0 ranges
2018/09/05 06:18:27.232 gc_worker.go:381: [info] [gc worker] 596a1576af80004 finish delete 0 ranges, regions: 0, cost time: 347ns
2018/09/05 06:20:07.235 gc_worker.go:657: [info] [gc worker] 596a1576af80004 start gc, concurrency 2, safePoint: 402686660711022592.
2018/09/05 06:20:12.765 gc_worker.go:678: [info] [gc worker] 596a1576af80004 finish gc, safePoint: 402686660711022592, successful regions: 2448, failed regions: 0, total cost time: 5.529338238s
(Drop table happens here)
2018/09/05 06:28:23.577 gc_worker.go:240: [info] [gc worker] 596a1576af80004 starts the whole job, safePoint: 402686817997422592
2018/09/05 06:28:23.577 gc_worker.go:429: [info] [gc worker] 596a1576af80004 start resolve locks, safePoint: 402686817997422592.
2018/09/05 06:28:30.226 gc_worker.go:501: [info] [gc worker] 596a1576af80004 finish resolve locks, safePoint: 402686817997422592, regions: 2460, total resolved: 0, cost time: 6.648473486s
2018/09/05 06:28:30.243 gc_worker.go:357: [info] [gc worker] 596a1576af80004 start delete 0 ranges
2018/09/05 06:28:30.243 gc_worker.go:381: [info] [gc worker] 596a1576af80004 finish delete 0 ranges, regions: 0, cost time: 237ns
2018/09/05 06:30:10.247 gc_worker.go:657: [info] [gc worker] 596a1576af80004 start gc, concurrency 2, safePoint: 402686817997422592.
2018/09/05 06:30:37.216 gc_worker.go:678: [info] [gc worker] 596a1576af80004 finish gc, safePoint: 402686817997422592, successful regions: 2465, failed regions: 0, total cost time: 26.968647288s
2018/09/05 06:38:23.581 gc_worker.go:240: [info] [gc worker] 596a1576af80004 starts the whole job, safePoint: 402686975283822592
2018/09/05 06:38:23.581 gc_worker.go:429: [info] [gc worker] 596a1576af80004 start resolve locks, safePoint: 402686975283822592.
2018/09/05 06:38:31.902 gc_worker.go:501: [info] [gc worker] 596a1576af80004 finish resolve locks, safePoint: 402686975283822592, regions: 2483, total resolved: 0, cost time: 8.320904077s
2018/09/05 06:38:31.910 gc_worker.go:357: [info] [gc worker] 596a1576af80004 start delete 1 ranges
2018/09/05 06:38:32.799 gc_worker.go:381: [info] [gc worker] 596a1576af80004 finish delete 1 ranges, regions: 39, cost time: 888.632618ms
2018/09/05 06:40:12.814 gc_worker.go:657: [info] [gc worker] 596a1576af80004 start gc, concurrency 2, safePoint: 402686975283822592.
2018/09/05 06:41:01.189 gc_worker.go:678: [info] [gc worker] 596a1576af80004 finish gc, safePoint: 402686975283822592, successful regions: 2489, failed regions: 0, total cost time: 48.374212203s

We can see that the range is not immediately deleted at the next time GC runs, but deleted in the second GC after it.

@coocood @zhangjinpeng1987

@MyonKeminta
Copy link
Contributor Author

MyonKeminta commented Sep 5, 2018

@coocood I don't have write access to this repo. Please help me merge it. Thanks.

@MyonKeminta
Copy link
Contributor Author

/run-integration-tests

@shenli
Copy link
Member

shenli commented Sep 5, 2018

/run-all-tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/tikv status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants