Potential race condition in CREATE TABLE #10410

Closed
aphyr opened this issue May 9, 2019 · 7 comments

@aphyr

commented May 9, 2019

With TiDB 3.0.0-beta.1, on a five-node EC2 cluster running Debian Stretch, TiDB can return successfully from a create table command, then return Table ... doesn't exist errors when the same client attempts to insert a record into the table it just created.

  1. What did you do?

Our Jepsen test runs the following pair of commands to verify the cluster is ready to serve requests:

          (j/execute! c ["create table if not exists jepsen_await (id int primary key, val int)"])
          (j/insert! c "jepsen_await" {:id  (swap! await-id inc)
                                       :val (rand-int 5)})
  2. What did you expect to see?

If the create table command completes, one would expect the subsequent insert to observe that table's existence.

  3. What did you see instead?

Instead, the insert! call throws:

java.sql.SQLSyntaxErrorException: (conn=1) Table 'test.jepsen_await' doesn't exist.

  4. What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)?

3.0.0-beta.1.

You can reproduce this problem with Jepsen dd9a7a53ddcbc8bf691134598f144e14213e241d by running something like

lein run test-all --concurrency 2n --time-limit 30

I haven't seen this in my LXC cluster, but on EC2, roughly 1 in 15 tests crash during cluster setup because of this behavior.
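To give a sense of how many runs it might take to hit this, here is a rough back-of-the-envelope sketch. It assumes each test run fails independently with probability 1/15, which is only an approximation of the "roughly 1 in 15" figure above; the function names are made up for illustration.

```python
import math

# "Roughly 1 in 15" from the report above; treated here as an exact,
# independent per-run failure probability (an assumption).
FAILURE_RATE = 1 / 15


def p_at_least_one_failure(runs, rate=FAILURE_RATE):
    """Probability that at least one of `runs` independent runs fails."""
    return 1 - (1 - rate) ** runs


def runs_needed(confidence, rate=FAILURE_RATE):
    """Smallest number of runs that gives at least `confidence` chance of one failure."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - rate))


if __name__ == "__main__":
    print(f"P(failure within 15 runs) = {p_at_least_one_failure(15):.2f}")
    print(f"runs for 95% chance of a failure: {runs_needed(0.95)}")
```

Under that independence assumption, a few dozen runs on EC2 should usually be enough to see at least one crash during setup.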

@aphyr

Author

commented May 9, 2019

@shenli

Member

commented May 9, 2019

@aphyr Thanks!

@aphyr

Author

commented May 20, 2019

This has been tricky to reproduce, but I just hit another instance of it. These two happened in long-fork table initialization, which occurs after we've already successfully created and inserted records into the await table. I hit 4 of these in rapid succession, after several hours of stable testing. I wonder if there's some sort of... noisy neighbor in EC2 that... maybe slows down a component just enough to trigger this behavior?

20190520T205752.000Z.zip

20190520T202233.000Z.zip

@zimulala

Member

commented Jun 6, 2019

@aphyr Thanks!
Background:
Currently, the DDL lease during bootstrap is 100ms. Once bootstrap completes, the DDL lease switches to the value from the normal configuration, which defaults to 45s.
When the owner changes the state, it waits at most 2 * lease to confirm that synchronization has completed.

The scenario in which the problem occurred:
All TiDB servers perform bootstrap operations. By the time the other TiDB servers have completed the DDL and DML operations in bootstrap and started receiving DDL operations from clients, the owner should also have finished bootstrap. However, because of a conflict when modifying the bootstrap version, the owner sleeps for 1s. During this 1s, the other TiDB servers receive DDL operations.

This PR #10029 fixed this issue.
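
The numbers in the background above can be laid out in a short sketch. This is only arithmetic on the values quoted in this comment (100ms, 45s, 2 * lease, 1s), not TiDB code, and the constant and function names are made up for illustration. It also assumes the owner is still on the bootstrap lease during the 1s sleep, which is one reading of the scenario and is not stated outright.

```python
from datetime import timedelta

# Values quoted in the comment above.
BOOTSTRAP_LEASE = timedelta(milliseconds=100)
DEFAULT_LEASE = timedelta(seconds=45)
OWNER_SLEEP = timedelta(seconds=1)


def max_sync_wait(lease):
    """Longest the owner waits for synchronization to be confirmed: 2 * lease."""
    return 2 * lease


bootstrap_wait = max_sync_wait(BOOTSTRAP_LEASE)  # 200ms
normal_wait = max_sync_wait(DEFAULT_LEASE)       # 90s

# Assumption: the owner is still using the bootstrap lease while it sleeps.
# The 1s sleep is then several times longer than the longest sync wait.
sleep_exceeds_bootstrap_wait = OWNER_SLEEP > bootstrap_wait

if __name__ == "__main__":
    print("max sync wait during bootstrap:", bootstrap_wait.total_seconds(), "s")
    print("max sync wait after bootstrap: ", normal_wait.total_seconds(), "s")
    print("1s sleep exceeds bootstrap wait:", sleep_exceeds_bootstrap_wait)
```

Under that reading, the window in which other servers accept DDL while the owner is still bootstrapping (1s) is much wider than the 200ms the owner would wait for synchronization.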

@zimulala zimulala added the type/bug label Jun 6, 2019

@zimulala zimulala self-assigned this Jun 6, 2019

@shenli

Member

commented Jun 10, 2019

hi @aphyr
The issue is fixed. Should we close it now?

@aphyr

Author

commented Jun 10, 2019

I haven't had time to confirm myself, but if you think it's fixed, you're welcome to close the issue!

@shenli

Member

commented Jun 10, 2019

OK! We have tested it.
If you have any more problems with 3.0.0-rc.2 or a later version, please feel free to reopen this issue.
Thank you!

@shenli shenli closed this Jun 10, 2019
