
improve error report after users do something wrong in bootstrap...... #287

Closed
siddontang opened this issue Aug 25, 2016 · 6 comments
Labels
status/discussion-wanted The issue needs to be discussed.

Comments

@siddontang
Contributor

siddontang commented Aug 25, 2016

We have seen some users delete PD data or TiKV data and then start the cluster again. Of course this fails, but the error message is unclear and confuses even us.

Two cases can occur here:

  1. PD has no data but TiKV does. When TiKV starts, it finds that PD is not bootstrapped even though TiKV itself already is, so it raises an error. Maybe we should panic here directly and tell the user that they may be connecting to the wrong PD, or that they should clear the TiKV data.
  2. PD has data but TiKV does not. When TiKV starts, it finds that PD is bootstrapped, so it does nothing and waits for PD to balance regions onto it. But PD now holds stale region info, so when TiDB starts, it connects to the wrong TiKV and gets a "key not in region" error.
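Case 1 could be caught by an explicit check at TiKV startup. A minimal sketch in Go (illustrative names only; the real TiKV code is Rust and its API differs):

```go
package main

import "fmt"

// checkBootstrapState returns an error for case 1: TiKV already has
// data, but the PD it connects to reports the cluster as not
// bootstrapped. TiKV should fail fast with a clear message instead of
// raising a confusing error later. All names here are illustrative.
func checkBootstrapState(pdBootstrapped, localDataPresent bool) error {
	if !pdBootstrapped && localDataPresent {
		return fmt.Errorf("TiKV has data but PD is not bootstrapped: " +
			"you may be connecting to the wrong PD, or you should clear the TiKV data")
	}
	// Note: case 2 (PD bootstrapped, TiKV empty) cannot be distinguished
	// locally from a legitimately added new empty store, so it needs
	// PD-side checks such as the address-conflict checks proposed below.
	return nil
}

func main() {
	fmt.Println(checkBootstrapState(false, true))
}
```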

Let's explain case 2 in detail. Assume we have 1 PD + 1 TiKV.

  1. We bootstrapped the cluster successfully before, so PD has the first region with a peer whose store info is "id => 1, addr => host1:20160".
  2. We delete the TiKV data and restart it. TiKV finds that PD is already bootstrapped, so it carries on: it allocates a new store ID (say 3), reports it to PD, and waits for PD to balance regions onto it. PD now has two stores, 1 and 3, with the same address.
  3. TiDB starts and gets the first region info from PD. Note that the region peer's store ID is 1 but its address is "host1:20160", so TiDB connects to this TiKV. Unfortunately, the connected TiKV has no region and returns a "key not in region" error.

So how do we report the correct error to the user, so that they delete all data and then bootstrap the cluster again?

Maybe we can:

  1. Check the store report: if PD finds that a store's address conflicts with a different existing store's address, it logs an error about it.
  2. The messages TiDB sends should contain the store ID, so TiKV can detect a store ID mismatch and return an error.
  3. When TiKV starts, it gets all stores from PD; if it finds a conflicting address, it logs an error too. (Is this necessary?)
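Proposal 1 amounts to a conflict scan over PD's store list when a store reports in. A minimal sketch, assuming simplified illustrative types (the real PD metadata and API differ):

```go
package main

import "fmt"

// Store is a minimal stand-in for PD's store metadata.
type Store struct {
	ID      uint64
	Address string
}

// findAddressConflict reports the existing store, if any, that shares
// the reporting store's address but has a different ID.
func findAddressConflict(existing []Store, reporting Store) (Store, bool) {
	for _, s := range existing {
		if s.Address == reporting.Address && s.ID != reporting.ID {
			return s, true
		}
	}
	return Store{}, false
}

func main() {
	stores := []Store{{ID: 1, Address: "host1:20160"}}
	// A wiped TiKV restarts and re-registers under a new ID.
	newStore := Store{ID: 3, Address: "host1:20160"}
	if old, ok := findAddressConflict(stores, newStore); ok {
		fmt.Printf("store %d conflicts with existing store %d at %s\n",
			newStore.ID, old.ID, old.Address)
	}
}
```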

/cc @ngaut @qiuyesuifeng @disksing @huachaohuang @overvenus

@qiuyesuifeng
Contributor

@siddontang For case one, do you mean PD has no data but TiKV does?

@siddontang
Contributor Author

Yep, fixed.

@qiuyesuifeng
Contributor

qiuyesuifeng commented Aug 25, 2016

I think we should check for a conflicting store address: if the address already exists in PD, we can return an error so that TiKV fails to start. If you really want to start a new TiKV, you should first remove the bad store through the PD API (not supported yet).
There is also another case: the node an existing store runs on changes its host, and then the store is restarted. The method above works well for that scenario too.
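A minimal sketch of this stricter behavior, rejecting the registration outright so TiKV fails to start instead of only logging (illustrative names; the real PD registration path differs). A store re-registering under its own ID, as in the host-change case, is still allowed:

```go
package main

import "fmt"

// registerStore refuses a registration whose address is already used
// by a different store, and otherwise records the store's address
// (accepting both new stores and an existing store moving hosts).
func registerStore(stores map[uint64]string, id uint64, addr string) error {
	for existingID, existingAddr := range stores {
		if existingAddr == addr && existingID != id {
			return fmt.Errorf("address %s already used by store %d; "+
				"remove that store via the PD API before adding store %d",
				addr, existingID, id)
		}
	}
	stores[id] = addr
	return nil
}

func main() {
	stores := map[uint64]string{1: "host1:20160"}

	// A wiped TiKV re-registering under a new ID is rejected.
	fmt.Println(registerStore(stores, 3, "host1:20160"))

	// The same store moving to a new host is accepted.
	fmt.Println(registerStore(stores, 1, "host2:20160"))
}
```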

@siddontang
Contributor Author

Yes, as I said before, we need to check for a conflicting address with a different store ID.

@huachaohuang
Contributor

This seems to have been solved; should we close this?

@siddontang
Contributor Author

yep
