Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MySQL reports Deadlock when TiCDC replicating data #3589

Closed
overvenus opened this issue Nov 23, 2021 · 3 comments · Fixed by #4138 or #6278
Closed

MySQL reports Deadlock when TiCDC replicating data #3589

overvenus opened this issue Nov 23, 2021 · 3 comments · Fixed by #4138 or #6278
Assignees
Labels
area/ticdc Issues or PRs related to TiCDC. component/sink Sink component. severity/moderate This is a moderate bug. type/bug This is a bug.

Comments

@overvenus
Copy link
Member

overvenus commented Nov 23, 2021

What did you do?

In 1 TiDB, 1 TiKV, 1 TiCDC cluster, set up a changefeed that replicates all tables to downstream MySQL.

MySQL may reports deadlock error.

Reproduce:

  • Extends TiDB GC lifetime to 2400h.
  • Load data to TiDB, tiup bench tpcc prepare and tiup bench tpcc run
  • Create a changefeed cfa whose start-ts = TiDB gc safepoint, and replicates data to MySQL.
  • Wait checkpoint lag close to 10s.
  • Remove changefeed cfa and create a changefeed cfb use the same config as cfa.

We can find deadlock errors in TiCDC log.

MySQL deadlock log: deadlock.log

Deadlock table schema

CREATE TABLE `order_line` (
  `ol_o_id` int(11) NOT NULL,
  `ol_d_id` int(11) NOT NULL,
  `ol_w_id` int(11) NOT NULL,
  `ol_number` int(11) NOT NULL,
  `ol_i_id` int(11) NOT NULL,
  `ol_supply_w_id` int(11) DEFAULT NULL,
  `ol_delivery_d` datetime DEFAULT NULL,
  `ol_quantity` int(11) DEFAULT NULL,
  `ol_amount` decimal(6,2) DEFAULT NULL,
  `ol_dist_info` char(24) DEFAULT NULL,
  PRIMARY KEY (`ol_w_id`,`ol_d_id`,`ol_o_id`,`ol_number`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

What did you expect to see?

No deadlock error.

What did you see instead?

MySQL Deadlock error.

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

v5.2.1

TiCDC version (execute cdc version):

v5.2.1

MySQL version

5.7.36
@overvenus overvenus added type/bug This is a bug. component/sink Sink component. severity/moderate This is a moderate bug. area/ticdc Issues or PRs related to TiCDC. labels Nov 23, 2021
@github-actions github-actions bot added this to Need Triage in Question and Bug Reports Nov 23, 2021
@zhaoxinyu
Copy link
Contributor

zhaoxinyu commented Nov 25, 2021

  1. How to reproduce this problem in MySQL?
  • Prepare initial data as follows:
CREATE TABLE `order_line` (
  `ol_o_id` int(11) NOT NULL,
  `ol_d_id` int(11) NOT NULL,
  `ol_w_id` int(11) NOT NULL,
  `ol_number` int(11) NOT NULL,
  `ol_i_id` int(11) NOT NULL,
  `ol_supply_w_id` int(11) DEFAULT NULL,
  `ol_delivery_d` datetime DEFAULT NULL,
  `ol_quantity` int(11) DEFAULT NULL,
  `ol_amount` decimal(6,2) DEFAULT NULL,
  `ol_dist_info` char(24) DEFAULT NULL,
  PRIMARY KEY (`ol_w_id`,`ol_d_id`,`ol_o_id`,`ol_number`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
INSERT INTO `order_line` VALUES (7215,2,58,4,52661,58,'2021-11-13 02:38:33',5,185.65,'ZFQZYPNYPLZHNNBWPEIWTSPS'),(7222,5,78,8,52661,58,'2021-11-13 02:38:33',5,185.65,'ZFQZYPNYPLZHNNBWPEIWTSPS');
  • Run sql statements in two transactions:
    T1: begin;
    T2: begin;
    T1: delete from test.order_line where ol_o_id=7220 and ol_d_id=4 and ol_w_id=68 and ol_number = 7;
    T2: delete from test.order_line where ol_o_id=7221 and ol_d_id=2 and ol_w_id=58 and ol_number = 4;
    T1: REPLACE INTO test.order_line(ol_o_id,ol_d_id,ol_w_id,ol_number,ol_i_id,ol_supply_w_id,ol_delivery_d,ol_quantity,ol_amount,ol_dist_info) VALUES (7219,2,58,4,52661,58,'2021-11-13 02:38:33',5,'185.65',_binary'ZFQZYPNYPLZHNNBWPEIWTSPS');
    T2: REPLACE INTO test.order_line(ol_o_id,ol_d_id,ol_w_id,ol_number,ol_i_id,ol_supply_w_id,ol_delivery_d,ol_quantity,ol_amount,ol_dist_info) VALUES (7218,4,58,1,23862,58,'2021-11-13 02:38:33',7,'408.94',_binary'QTCLICHWFMNOODRLLIBOOTUQ');

  • At this time, we get error from T2:
    Deadlock found when trying to get lock; try restarting transaction

  1. Why does this happen?
  • In RR isolation level, a gap lock is taken when we delete a nonexistent record. Because two gap locks in two transactions on the same gap don't block each other, so the two delete statements can successfully executed. (Reference gap lock)
  • REPLACE INTO statement in T1 tries to insert a record and it tries to take an insert intention lock. And because a gap lock is taken by T2, this REPLACE INTO statement is blocked.
  • REPLACE INTO statement in T2 also tries to insert a record into the same gap as T1, and it also waits to obtain an insert intention lock.
  • Circular waiting happens and dead lock occurs.

@lance6716
Copy link
Contributor

https://github.com/mysql/mysql-server/blob/3290a66c89eb1625a7058e0ef732432b6952b435/sql/log_event.cc#L13136-L13141

maybe we can follow MySQL's implementation

@maxshuang maxshuang self-assigned this Nov 30, 2021
@maxshuang
Copy link
Contributor

maxshuang commented Dec 8, 2021

Mysql suffers from delicate but complicate lock rules, ref partial to :
https://dev.mysql.com/doc/refman/5.7/en/innodb-locking.html
https://dev.mysql.com/doc/refman/5.7/en/innodb-locks-set.html

I will discuss https://asktug.com/t/topic/272906/7 here, which is slight different from this issue. In that situation, we meet "pk(auto_increment) + uk + replace into + rr(/rc)". It will cause deadlock on specified situation, which may be normal in production. And deadlock will not disappear even we change the isolation level to rc, because rc needs gap lock to check for uniqueness on uk, ref to:

https://dev.mysql.com/doc/refman/5.7/en/innodb-transaction-isolation-levels.html
READ COMMITTED: Gap locking is only used for foreign-key constraint checking and duplicate-key checking.

This deadlock pattern is as follow:

txn1:  hold next-key lock  -----------------------> wait for insert intention lock(wait for txn2)
txn2:  -------------------> wait next-key lock(wait for txn1) --------->

What xinyu discuss above is not exactly precise in this situation, because our "delete" always hit some records and add record locks on uk or pk but not gap lock. So, this deadlock is raised purely by "replace into", which has very complicate lock rules. I will make a simple explanation below, ref to: http://blog.itpub.net/7728585/viewspace-2141409/ .

Scene: 
- mysql version: 5.7.x 5.6.x
- paras: innodb_autoinc_lock_mode = 1(or 2)  ----> cause autoinc column monotonically increasing but not orderly
- isolation: rr or rc
- index: pk(auto_increment) + uk 
- dml: replace into
- table:
CREATE TABLE `lock_test` (
  `id` int NOT NULL AUTO_INCREMENT,
  `c0` int DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_c0` (`c0`)
) ENGINE=InnoDB AUTO_INCREMENT=2241 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
Insert into lock_test values(10, 4);
- txn1: replace into lock_test values(9, 4);
- txn2:  replace into lock_test values(15, 4);

Lock wait sequence:

Ts Txn1 Txn2
t0 begin begin
t1 replace (9, 4) replace (15,4)
t2 get uk next-key lock(4,10), check uk 4 conflict
t3 wait uk next-key lock(4,10) to check uniqueness
t4 tag delete pre index of pk&uk, insert uk(4,9), find gap lock wait in (4,10), wait insert intension lock

Much more details in "replace into" lock process, including gap locks in pk, lock inherit/split in pk&uk. I will make more deep investigation if free.

  • Conclusion
  1. After tests, it will happen in 5.6.x and 5.7.x with isolation rr&rc, but mysql8 seems solve this "replace into" deadlock(Not quite sure). Detail principle need to be investigated in mysql8.
  2. We use "delete+replace into" instead of "update" to satisfy Idempotent in TiCDC,but it seems that using "replace into" may not be a good idea under high concurrency.
    We need to test using "insert xxx on duplicate key update" to replace "replace into" SQL.
  3. Remove uk will abbreviate the deadlock. Only with the pk, "replace into" will get record lock and gap lock in this situation, gap lock will only block insert intention lock but not another gap lock.
    @overvenus @lance6716 @amyangfei @zhaoxinyu @liuzix @ben1009 PLAT

Question and Bug Reports automation moved this from Need Triage to Done Jan 14, 2022
ti-chi-bot pushed a commit that referenced this issue Jul 22, 2022
overvenus added a commit to ti-chi-bot/tiflow that referenced this issue Aug 17, 2022
…safe mode (pingcap#6278)

close pingcap#3589, close pingcap#5611

Signed-off-by: Neil Shen <overvenus@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment