
ANALYZE TABLE immediately after bulk update is unreliable #16570

Closed
wwar opened this issue Apr 18, 2020 · 14 comments
Assignees
Labels
component/statistics Priority/P1 Features that will be implemented in the latest or next major/minor version severity/moderate type/bug This issue is a bug. wontfix This issue will not be fixed.

Comments


wwar commented Apr 18, 2020

Bug Report

1. What did you do?

After I bulk updated rows, I ran ANALYZE TABLE. The ANALYZE did not take into account the recent update, and statistics remained out of date.

This appears to be a race condition. Consider the following:

DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
 id BIGINT NOT NULL PRIMARY KEY auto_increment,
 b BIGINT NOT NULL,
 pad VARBINARY(255),
 INDEX (b)
);

INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM dual;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
UPDATE t1 SET b=id;
SELECT SLEEP(5);
ANALYZE TABLE t1;
EXPLAIN ANALYZE SELECT * FROM t1 WHERE b < 5;

If you remove the SLEEP(5) you will get incorrect statistics.
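One way to confirm whether the statistics actually picked up the update, rather than inferring it from the plan, is to inspect the statistics directly (a sketch; `t1` and index `b` follow the schema above):

```sql
-- After a correct ANALYZE, Distinct_count for index b should be close to
-- the row count (~21010 here); a Distinct_count of 1 means the stale
-- pre-update data was scanned.
SHOW STATS_HISTOGRAMS WHERE Table_name = 't1';
```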

2. What did you expect to see?

With the sleep call:

mysql> EXPLAIN ANALYZE SELECT * FROM t1 WHERE b < 5;
+-------------------------------+---------+---------+-----------+----------------------+------------------------------------------------------------------------------+----------------------------------+---------------+------+
| id                            | estRows | actRows | task      | access object        | execution info                                                               | operator info                    | memory        | disk |
+-------------------------------+---------+---------+-----------+----------------------+------------------------------------------------------------------------------+----------------------------------+---------------+------+
| IndexLookUp_10                | 4.00    | 4       | root      |                      | time:11.883687ms, loops:2, rpc num: 1, rpc time:11.148234ms, proc keys:21013 |                                  | 10.0234375 KB | N/A  |
| ├─IndexRangeScan_8(Build)     | 4.00    | 4       | cop[tikv] | table:t1, index:b(b) | time:12ms, loops:1                                                           | range:[-inf,5), keep order:false | N/A           | N/A  |
| └─TableRowIDScan_9(Probe)     | 4.00    | 4       | cop[tikv] | table:t1             | time:0s, loops:1                                                             | keep order:false                 | N/A           | N/A  |
+-------------------------------+---------+---------+-----------+----------------------+------------------------------------------------------------------------------+----------------------------------+---------------+------+
3 rows in set (0.01 sec)

3. What did you see instead?

Without the sleep call:

mysql> EXPLAIN ANALYZE SELECT * FROM t1 WHERE b < 5;
+-------------------------+----------+---------+-----------+---------------+-----------------------------------------------------------------------------+------------------+-----------------+------+
| id                      | estRows  | actRows | task      | access object | execution info                                                              | operator info    | memory          | disk |
+-------------------------+----------+---------+-----------+---------------+-----------------------------------------------------------------------------+------------------+-----------------+------+
| TableReader_7           | 21010.00 | 4       | root      |               | time:26.682049ms, loops:2, rpc num: 1, rpc time:26.58694ms, proc keys:21010 | data:Selection_6 | 1.7880859375 KB | N/A  |
| └─Selection_6           | 21010.00 | 4       | cop[tikv] |               | time:24ms, loops:25                                                         | lt(test.t1.b, 5) | N/A             | N/A  |
|   └─TableFullScan_5     | 21010.00 | 21010   | cop[tikv] | table:t1      | time:24ms, loops:25                                                         | keep order:false | N/A             | N/A  |
+-------------------------+----------+---------+-----------+---------------+-----------------------------------------------------------------------------+------------------+-----------------+------+
3 rows in set (0.03 sec)

4. What version of TiDB are you using? (tidb-server -V or run select tidb_version(); on TiDB)

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-259-g2c8afe6e1
Git Commit Hash: 2c8afe6e1621d4bc2074d17f6c09c17f99fd4d89
Git Branch: master
UTC Build Time: 2020-04-15 03:14:55
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)
@zz-jason
Member

Sorry, I cannot reproduce this issue. Here are my test steps:

  1. Copy all of the following SQL statements into a file called 16570.sql. Note that there is no SELECT SLEEP(5):
DROP TABLE IF EXISTS t1;
CREATE TABLE t1 (
 id BIGINT NOT NULL PRIMARY KEY auto_increment,
 b BIGINT NOT NULL,
 pad VARBINARY(255),
 INDEX (b)
);

INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM dual;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
INSERT INTO t1 SELECT NULL, 1, RANDOM_BYTES(255) FROM t1 a INNER JOIN t1 b INNER JOIN t1 c LIMIT 10000;
UPDATE t1 SET b=id;
ANALYZE TABLE t1;
EXPLAIN ANALYZE SELECT * FROM t1 WHERE b < 5;
  2. Start a tidb-server with the default configuration; the tidb-server version is the same as described in this issue:
➜ ./bin/tidb-server -V
Release Version: v4.0.0-beta.2-259-g2c8afe6e1
Git Commit Hash: 2c8afe6e1621d4bc2074d17f6c09c17f99fd4d89
Git Branch: HEAD
UTC Build Time: 2020-04-21 09:12:50
GoVersion: go1.13.10
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
  3. Execute the SQL file; I got this output:
➜ mysql --default-character-set=utf8 -h 127.0.0.1 -P 4000 -u root -D test < 16570.sql
id      estRows actRows task    access object   execution info  operator info   memory  disk
IndexLookUp_10  5.37    4       root            time:19.802983ms, loops:2, rpc num: 1, rpc time:19.59566ms, proc keys:0         10.0234375 KB   N/A
├─IndexRangeScan_8(Build)       5.37    4       cop[tikv]       table:t1, index:b(b)    time:19.566802ms, loops:5       range:[-inf,5), keep order:false        N/A     N/A
└─TableRowIDScan_9(Probe)       5.37    4       cop[tikv]       table:t1        time:55.581µs, loops:5  keep order:false        N/A     N/A

The execution plan is not the same as the one described in the issue.

@wwar Could you provide detailed reproduction steps to help us locate the root cause?

@zz-jason zz-jason added Priority/P1 Features that will be implemented in the latest or next major/minor version component/statistics labels Apr 21, 2020
@wwar
Author

wwar commented Apr 21, 2020

I cannot reproduce with mocktikv as the store. Could that be the issue?

I am using the default configuration with a single local TiKV server (tikv and pd are from nightly; for tidb-server I have tried both master and nightly):

./bin/pd-server  --data-dir=/mnt/evo860/data/pd --log-file=/mnt/evo860/logs/pd.log &
./bin/tikv-server --pd="127.0.0.1:2379" --data-dir=/mnt/evo860/data/tikv -A 127.0.0.1:20165 --log-file=/mnt/evo860/logs/tikv.log &

cd /home/wwar/Documents/tidb
./bin/tidb-server --path="127.0.0.1:2379" --store=tikv

Version of tidb-server:

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-302-g8c455e86c
Git Commit Hash: 8c455e86c136ac020129998c94f890448367f62a
Git Branch: master
UTC Build Time: 2020-04-21 02:25:12
GoVersion: go1.13.8
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)

@zz-jason
Member

Thanks @wwar, I reproduced this issue with tikv as well:

  1. Start a playground cluster with tiup:
tiup playground nightly
  2. Execute the SQL file described in the former comment.

I got this output:

id      estRows actRows task    access object   execution info  operator info   memory  disk
TableReader_7   21010.00        4       root            time:28.845952ms, loops:2, rpc num: 1, rpc time:28.786899ms, proc keys:21010    data:Selection_6        1.787109375 KB  N/A
└─Selection_6   21010.00        4       cop[tikv]               time:29ms, loops:25     lt(test.t1.b, 5)        N/A     N/A
  └─TableFullScan_5     21010.00        21010   cop[tikv]       table:t1        time:29ms, loops:25     keep order:false        N/A     N/A

@eurekaka
Contributor

eurekaka commented Apr 23, 2020

It looks like a bug in the TiKV coprocessor. I turned off auto-analyze and statistics feedback, and increased stats-lease to 100s; the problem can still be reproduced.

I added some debugging logs to the ANALYZE execution and found that, without the sleep(5), the NDV of index b returned by the distsql.Analyze request was 1; with the sleep(5), it was 21010. It looks like there is a visibility problem with the distsql.Analyze request in the coprocessor, i.e., the Analyze request cannot see the latest data until several seconds later, even though I checked that the StartTS of the request is always MaxUint64.

@breeswish can you please check whether this is a bug in the Analyze interface of the coprocessor?

@breezewish
Member

Sorry, I'm currently not very familiar with the Analyze part of the coprocessor, since it was previously maintained by TiDB members 🤣 I can help check whether the coprocessor Analyze successfully scanned the recent data (but cannot help check what happens next). Does that help?

@breezewish
Member

I've confirmed that Analyze indeed did not scan the latest data, even though it received max_ts. I will ask a txn member to look at this issue.

@eurekaka
Contributor

@breeswish thanks, that helps.

@wwar
Author

wwar commented Apr 25, 2020

Confirming this is fixed:

..
mysql> EXPLAIN ANALYZE SELECT * FROM t1 WHERE b < 5;
+-------------------------------+---------+---------+-----------+----------------------+--------------------------------------------------------------------------+----------------------------------+---------------+------+
| id                            | estRows | actRows | task      | access object        | execution info                                                           | operator info                    | memory        | disk |
+-------------------------------+---------+---------+-----------+----------------------+--------------------------------------------------------------------------+----------------------------------+---------------+------+
| IndexLookUp_10                | 5.41    | 4       | root      |                      | time:24.366877ms, loops:2, rpc num: 1, rpc time:23.673301ms, proc keys:0 |                                  | 10.0234375 KB | N/A  |
| ├─IndexRangeScan_8(Build)     | 5.41    | 4       | cop[tikv] | table:t1, index:b(b) | time:23.588053ms, loops:5                                                | range:[-inf,5), keep order:false | N/A           | N/A  |
| └─TableRowIDScan_9(Probe)     | 5.41    | 4       | cop[tikv] | table:t1             | time:331.463µs, loops:5                                                  | keep order:false                 | N/A           | N/A  |
+-------------------------------+---------+---------+-----------+----------------------+--------------------------------------------------------------------------+----------------------------------+---------------+------+
3 rows in set (0.03 sec)

mysql> select tidb_version()\G
*************************** 1. row ***************************
tidb_version(): Release Version: v4.0.0-beta.2-352-g5ede18f10
Git Commit Hash: 5ede18f10eedfe2e3690d7728dec3ffa4b0af2d5
Git Branch: master
UTC Build Time: 2020-04-24 03:45:17
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false
1 row in set (0.00 sec)

Note: This is a case where it would have really helped if the tikv version were included in the tidb_version() output (both for the initial verification and for later confirming the fix).

@wwar wwar closed this as completed Apr 25, 2020
@eurekaka
Contributor

The root cause of the problem is that in TiKV 4.0 the commit of secondary locks is asynchronous, and the ANALYZE request uses MaxUint64 as its StartTS with RC as the TiKV isolation level, so it ignores those un-released locks and thus sometimes cannot read the latest data. We could change the behavior of ANALYZE to wait for the locks to be released before reading, but in theory that would incur a potential performance regression. From another perspective, ANALYZE does not actually need to read the truly latest data. The time window of the secondary lock commit is usually at the millisecond level, so we prefer to keep the current ANALYZE behavior, at least for now.
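Given that analysis, a pragmatic workaround (a sketch, not an official recommendation; the one-second delay is an assumption based on the millisecond-level commit window described above) is to pause briefly between the bulk update and the ANALYZE:

```sql
UPDATE t1 SET b = id;
-- Give the asynchronous secondary-lock commit time to finish; the window
-- is usually milliseconds, so one second should be ample in practice.
SELECT SLEEP(1);
ANALYZE TABLE t1;
```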

@wwar wwar reopened this May 5, 2020
@wwar
Author

wwar commented May 5, 2020

I tested this again after your comment, and it is indeed not fixed (due to a race).

May I suggest documenting the maximum delay expected before ANALYZE reads up-to-date data? The common use case for running ANALYZE manually is after you know a batch update may have skewed the statistics. But it sounds like this is now an anti-pattern, since running it will update the time the statistics were last gathered and delay any automatic regeneration.

@zz-jason
Member

zz-jason commented May 6, 2020

From the investigation result:

The time window of the secondary lock commit is usually at ms level

@wwar What was the time period between your bulk update and the analyze operation?

@wwar
Author

wwar commented May 6, 2020

In my case, I know of this issue and will probably just use sleep(5) as in this example. If sleep(1) is safe enough, it is easy to switch.

My suggestion was to update the docs to make this clearer. For example, https://pingcap.com/docs/stable/reference/sql/statements/analyze-table/ under "MySQL Compatibility" says:

This statement is syntactically similar with MySQL. However, ANALYZE TABLE may take significantly longer to execute on TiDB, as internally it operates in a different manner.

This should be updated to mention both this issue and tidb_enable_fast_analyze. May I suggest:

TiDB differs from MySQL in both the statistics it collects and how it makes use of statistics during query execution. While this statement is syntactically similar to MySQL, the following differences apply:

  1. TiDB may not include very recently committed changes when running ANALYZE TABLE. After a batch-update of rows, you may need to sleep(1) before executing ANALYZE TABLE in order for the statistics update to reflect these changes.

  2. ANALYZE TABLE takes significantly longer to execute in TiDB than MySQL. This performance difference can be partially mitigated by enabling fast analyze with SET GLOBAL tidb_enable_fast_analyze=1. Fast analyze makes use of sampling, leading to less accurate statistics. Its usage is still considered experimental.
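For reference, enabling fast analyze as suggested in point 2 might look like this (a sketch; tidb_enable_fast_analyze is experimental, and sampling trades statistics accuracy for speed):

```sql
SET GLOBAL tidb_enable_fast_analyze = 1;
-- Subsequent ANALYZE statements use sampling instead of full scans.
ANALYZE TABLE t1;
```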

@eurekaka
Contributor

eurekaka commented Jan 4, 2021

pingcap/docs#3151 has updated the documentation, so I am going to close this issue.

@eurekaka eurekaka closed this as completed Jan 4, 2021
@ti-srebot
Contributor

Please edit this comment or add a new comment to provide the following information:

Not a bug

  1. Remove the 'type/bug' label
  2. Add notes to indicate why it is not a bug

Duplicate bug

  1. Add the 'type/duplicate' label
  2. Add the link to the original bug

Bug

Note: Make sure that the 'component' and 'severity' labels are added
Example for how to fill out the template: #20100

1. Root Cause Analysis (RCA) (optional)

2. Symptom (optional)

3. All Trigger Conditions (optional)

4. Workaround (optional)

5. Affected versions

6. Fixed versions
