
ANALYZE TABLE immediately after bulk update is unreliable #16570

Closed
wwar opened this issue Apr 18, 2020 · 14 comments
Assignees
Labels
component/statistics Priority/P1 Features that will be implemented in the latest or next major/minor version severity/moderate type/bug This issue is a bug. wontfix This issue will not be fixed.

Comments


wwar commented Apr 18, 2020

Bug Report

1. What did you do?

After I bulk updated rows, I ran ANALYZE TABLE. The ANALYZE did not take into account the recent update, and statistics remained out of date.

This appears to be a race condition. Consider the following:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
 id BIGINT NOT NULL PRIMARY KEY auto_increment,
 b BIGINT NOT NULL,
 pad VARBINARY(255),
 INDEX (b)
);

INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM dual;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
UPDATE t1 SET b=id;
SELECT SLEEP(5);
ANALYZE TABLE t1;
EXPLAIN ANALYZE SELECT * FROM t1 WHERE b < 5;

If you remove the SLEEP(5) you will get incorrect statistics.
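One way to confirm whether the statistics actually picked up the update, rather than inferring it from the plan, is to inspect the statistics directly (a sketch; `t1` and index `b` follow the schema above):

```sql
-- After a correct ANALYZE, Distinct_count for index b should be close to
-- the row count (~21010 here); a Distinct_count of 1 means the stale
-- pre-update data was scanned.
SHOW STATS_HISTOGRAMS WHERE Table_name = 't1';
```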

2. What did you expect to see?

With the sleep call:

mysql> EXPLAIN ANALYZE SELECT * FROM t1 WHERE b < 5;
+-------------------------------+---------+---------+-----------+----------------------+------------------------------------------------------------------------------+----------------------------------+---------------+------+
| id                            | estRows | actRows | task      | access object        | execution info                                                               | operator info                    | memory        | disk |
+-------------------------------+---------+---------+-----------+----------------------+------------------------------------------------------------------------------+----------------------------------+---------------+------+
| IndexLookUp_10                | 4.00    | 4       | root      |                      | time:11.883687ms, loops:2, rpc num: 1, rpc time:11.148234ms, proc keys:21013 |                                  | 10.0234375 KB | N/A  |
| ├─IndexRangeScan_8(Build)     | 4.00    | 4       | cop[tikv] | table:t1, index:b(b) | time:12ms, loops:1                                                           | range:[-inf,5), keep order:false | N/A           | N/A  |
| └─TableRowIDScan_9(Probe)     | 4.00    | 4       | cop[tikv] | table:t1             | time:0s, loops:1                                                             | keep order:false                 | N/A           | N/A  |
+-------------------------------+---------+---------+-----------+----------------------+------------------------------------------------------------------------------+----------------------------------+---------------+------+
3 rows in set (0.01 sec)

3. What did you see instead?

Without the sleep call:

mysql> EXPLAIN ANALYZE SELECT * FROM t1 WHERE b < 5;
+-------------------------+----------+---------+-----------+---------------+-----------------------------------------------------------------------------+------------------+-----------------+------+
| id                      | estRows  | actRows | task      | access object | execution info                                                              | operator info    | memory          | disk |
+-------------------------+----------+---------+-----------+---------------+-----------------------------------------------------------------------------+------------------+-----------------+------+
| TableReader_7           | 21010.00 | 4       | root      |               | time:26.682049ms, loops:2, rpc num: 1, rpc time:26.58694ms, proc keys:21010 | data:Selection_6 | 1.7880859375 KB | N/A  |
| └─Selection_6           | 21010.00 | 4       | cop[tikv] |               | time:24ms, loops:25                                                         | lt(test.t1.b, 5) | N/A             | N/A  |
|   └─TableFullScan_5     | 21010.00 | 21010   | cop[tikv] | table:t1      | time:24ms, loops:25                                                         | keep order:false | N/A             | N/A  |
+-------------------------+----------+---------+-----------+---------------+-----------------------------------------------------------------------------+------------------+-----------------+------+
3 rows in set (0.03 sec)

4. What version of TiDB are you using? (tidb-server -V or run select tidb_version(); on TiDB)

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-259-g2c8afe6e1
Git Commit Hash: 2c8afe6e1621d4bc2074d17f6c09c17f99fd4d89
Git Branch: master
UTC Build Time: 2020-04-15 03:14:55
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)
@zz-jason
Member

Sorry, I cannot reproduce this issue. Here are my test steps:

  1. Copy all of the following SQL statements into a file called 16570.sql. Note that there is no SELECT SLEEP(5):
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
 id BIGINT NOT NULL PRIMARY KEY auto_increment,
 b BIGINT NOT NULL,
 pad VARBINARY(255),
 INDEX (b)
);

INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM dual;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
UPDATE t1 SET b=id;
ANALYZE TABLE t1;
EXPLAIN ANALYZE SELECT * FROM t1 WHERE b < 5;
  2. Start a tidb-server with the default configuration; the tidb-server version is the same as described in this issue:
➜ ./bin/tidb-server -V
Release Version: v4.0.0-beta.2-259-g2c8afe6e1
Git Commit Hash: 2c8afe6e1621d4bc2074d17f6c09c17f99fd4d89
Git Branch: HEAD
UTC Build Time: 2020-04-21 09:12:50
GoVersion: go1.13.10
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
  3. Execute the SQL file; I got this output:
➜ mysql --default-character-set=utf8 -h 127.0.0.1 -P 4000 -u root -D test < 16570.sql
id      estRows actRows task    access object   execution info  operator info   memory  disk
IndexLookUp_10  5.37    4       root            time:19.802983ms, loops:2, rpc num: 1, rpc time:19.59566ms, proc keys:0         10.0234375 KB   N/A
├─IndexRangeScan_8(Build)       5.37    4       cop[tikv]       table:t1, index:b(b)    time:19.566802ms, loops:5       range:[-inf,5), keep order:false        N/A     N/A
└─TableRowIDScan_9(Probe)       5.37    4       cop[tikv]       table:t1        time:55.581µs, loops:5  keep order:false        N/A     N/A

The execution plan is not the same as the one described in the issue.

@wwar Could you provide detailed reproduction steps to help us locate the root cause?

@zz-jason zz-jason added Priority/P1 Features that will be implemented in the latest or next major/minor version component/statistics labels Apr 21, 2020
@wwar
Author

wwar commented Apr 21, 2020

I cannot reproduce with mocktikv as the store. Could that be the issue?

I am using the default configuration with a single local TiKV server (tikv and pd are from nightly; for tidb-server I have tried both master and nightly):

./bin/pd-server  --data-dir=/mnt/evo860/data/pd --log-file=/mnt/evo860/logs/pd.log &
./bin/tikv-server --pd="127.0.0.1:2379" --data-dir=/mnt/evo860/data/tikv -A 127.0.0.1:20165 --log-file=/mnt/evo860/logs/tikv.log &

cd /home/wwar/Documents/tidb
./bin/tidb-server --path="127.0.0.1:2379" --store=tikv

Version of tidb-server:

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-302-g8c455e86c
Git Commit Hash: 8c455e86c136ac020129998c94f890448367f62a
Git Branch: master
UTC Build Time: 2020-04-21 02:25:12
GoVersion: go1.13.8
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)

@zz-jason
Member

Thanks @wwar, I reproduced this issue with tikv as well:

  1. Start a playground cluster with tiup:
tiup playground nightly
  2. Execute the SQL file described in the former comment.

I got this output:

id      estRows actRows task    access object   execution info  operator info   memory  disk
TableReader_7   21010.00        4       root            time:28.845952ms, loops:2, rpc num: 1, rpc time:28.786899ms, proc keys:21010    data:Selection_6        1.787109375 KB  N/A
└─Selection_6   21010.00        4       cop[tikv]               time:29ms, loops:25     lt(test.t1.b, 5)        N/A     N/A
  └─TableFullScan_5     21010.00        21010   cop[tikv]       table:t1        time:29ms, loops:25     keep order:false        N/A     N/A

@eurekaka
Contributor

eurekaka commented Apr 23, 2020

It looks like a bug in the TiKV coprocessor. I turned off auto-analyze and statistics feedback, and increased stats-lease to 100s; the problem can still be reproduced.

I added some debugging logs to the ANALYZE execution and found that, without the sleep(5), the NDV of index b returned by the distsql.Analyze request was 1; with the sleep(5), it was 21010. It looks like there is a visibility problem with the distsql.Analyze request in the coprocessor, i.e., the Analyze request cannot see the latest data until several seconds later, even though I checked that the StartTS of the request is always MaxUint64.

@breeswish can you please check whether this is a bug in the Analyze interface of the coprocessor?

@breezewish
Member

Sorry, I'm currently not very familiar with the Analyze part of the coprocessor, since it was previously maintained by TiDB members 🤣 I can help check whether the coprocessor Analyze successfully scanned the recent data (but cannot help check what happens next). Does that help?

@breezewish
Member

I've confirmed that Analyze indeed did not scan the latest data, even though it received max_ts. I will ask a txn member to look at this issue.

@eurekaka
Contributor

@breeswish thanks, that helps.

@wwar
Author

wwar commented Apr 25, 2020

Confirming this is fixed:

..
mysql> EXPLAIN ANALYZE SELECT * FROM t1 WHERE b < 5;
+-------------------------------+---------+---------+-----------+----------------------+--------------------------------------------------------------------------+----------------------------------+---------------+------+
| id                            | estRows | actRows | task      | access object        | execution info                                                           | operator info                    | memory        | disk |
+-------------------------------+---------+---------+-----------+----------------------+--------------------------------------------------------------------------+----------------------------------+---------------+------+
| IndexLookUp_10                | 5.41    | 4       | root      |                      | time:24.366877ms, loops:2, rpc num: 1, rpc time:23.673301ms, proc keys:0 |                                  | 10.0234375 KB | N/A  |
| ├─IndexRangeScan_8(Build)     | 5.41    | 4       | cop[tikv] | table:t1, index:b(b) | time:23.588053ms, loops:5                                                | range:[-inf,5), keep order:false | N/A           | N/A  |
| └─TableRowIDScan_9(Probe)     | 5.41    | 4       | cop[tikv] | table:t1             | time:331.463µs, loops:5                                                  | keep order:false                 | N/A           | N/A  |
+-------------------------------+---------+---------+-----------+----------------------+--------------------------------------------------------------------------+----------------------------------+---------------+------+
3 rows in set (0.03 sec)

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-352-g5ede18f10
Git Commit Hash: 5ede18f10eedfe2e3690d7728dec3ffa4b0af2d5
Git Branch: master
UTC Build Time: 2020-04-24 03:45:17
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)

Note: This is a case where it would have really helped if the tikv version were included in the tidb_version() output (both for the initial verification and for later confirming the fix).

@wwar wwar closed this as completed Apr 25, 2020
@eurekaka
Contributor

The root cause of the problem is that in TiKV 4.0 the commit of secondary locks is asynchronous, and the ANALYZE request uses MaxUint64 as its StartTS with RC as the TiKV isolation level, so it ignores those un-released locks and thus sometimes cannot read the latest data. We could change the behavior of ANALYZE to wait for the locks to be released before reading, but in theory that would incur a potential performance regression. From another perspective, ANALYZE does not actually need to read the truly latest data. The time window of the secondary lock commit is usually at the millisecond level, so we prefer to keep the current ANALYZE behavior, at least for now.
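Given that analysis, a pragmatic workaround (a sketch, not an official recommendation; the one-second delay is an assumption based on the millisecond-level commit window described above) is to pause briefly between the bulk update and the ANALYZE:

```sql
UPDATE t1 SET b = id;
-- Give the asynchronous secondary-lock commit time to finish; the window
-- is usually milliseconds, so one second should be ample in practice.
SELECT SLEEP(1);
ANALYZE TABLE t1;
```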

@wwar wwar reopened this May 5, 2020
@wwar
Author

wwar commented May 5, 2020

I tested this again after your comment, and it is indeed not fixed (due to a race).

May I suggest documenting the maximum delay expected before ANALYZE reads up-to-date data? The common use case for running ANALYZE manually is after you know a batch update may have skewed the statistics. But it sounds like this is now an anti-pattern, since running it will update the time the statistics were last gathered and delay any automatic regeneration.

@zz-jason
Member

zz-jason commented May 6, 2020

From the investigation result:

The time window of the secondary lock commit is usually at ms level

@wwar What was the time period between your bulk update and the analyze operation?

@wwar
Author

wwar commented May 6, 2020

In my case, I know of this issue and will probably just use sleep(5) as in this example. If sleep(1) is safe enough, it is easy to switch.

My suggestion was to update the docs to make this clearer. For example, https://pingcap.com/docs/stable/reference/sql/statements/analyze-table/ under "MySQL Compatibility" says:

This statement is syntactically similar with MySQL. However, ANALYZE TABLE may take significantly longer to execute on TiDB, as internally it operates in a different manner.

This should be updated to mention both this issue and tidb_enable_fast_analyze. May I suggest:

TiDB differs from MySQL in both the statistics it collects and how it makes use of statistics during query execution. While this statement is syntactically similar to MySQL, the following differences apply:

  1. TiDB may not include very recently committed changes when running ANALYZE TABLE. After a batch-update of rows, you may need to sleep(1) before executing ANALYZE TABLE in order for the statistics update to reflect these changes.

  2. ANALYZE TABLE takes significantly longer to execute in TiDB than MySQL. This performance difference can be partially mitigated by enabling fast analyze with SET GLOBAL tidb_enable_fast_analyze=1. Fast analyze makes use of sampling, leading to less accurate statistics. Its usage is still considered experimental.
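For reference, enabling fast analyze as suggested in point 2 might look like this (a sketch; tidb_enable_fast_analyze is experimental, and sampling trades statistics accuracy for speed):

```sql
SET GLOBAL tidb_enable_fast_analyze = 1;
-- Subsequent ANALYZE statements use sampling instead of full scans.
ANALYZE TABLE t1;
```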

@eurekaka
Contributor

eurekaka commented Jan 4, 2021

pingcap/docs#3151 has updated the documentation, so I am going to close this issue.

@eurekaka eurekaka closed this as completed Jan 4, 2021
@ti-srebot
Contributor

Please edit this comment or add a new comment to provide the following information:

Not a bug

  1. Remove the 'type/bug' label
  2. Add notes to indicate why it is not a bug

Duplicate bug

  1. Add the 'type/duplicate' label
  2. Add the link to the original bug

Bug

Note: Make sure that the 'component' and 'severity' labels are added
Example for how to fill out the template: #20100

1. Root Cause Analysis (RCA) (optional)

2. Symptom (optional)

3. All Trigger Conditions (optional)

4. Workaround (optional)

5. Affected versions

6. Fixed versions
