Consul KV consistency checks #894

Merged
merged 25 commits into from May 26, 2019

shlomi-noach
Collaborator

Work in progress towards auditing and fixing KV pair inconsistencies, especially inconsistencies in external KV stores.

For example, if Consul is unavailable during a failover, orchestrator could later run a consistency check against its local KV store to detect and resolve the resulting inconsistency.

Initial commit: writing a nanosecond timestamp in all KV stores.

@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 09:47 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 10:13 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 10:21 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 10:26 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 10:29 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 10:33 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 12:03 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 12:23 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 12:48 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 13:20 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 13:31 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 13:34 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 14:07 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 14:10 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 14:18 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 19, 2019 14:56 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 20, 2019 08:37 Inactive
Shlomi Noach added 2 commits May 21, 2019 07:58
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 21, 2019 05:00 Inactive
@shlomi-noach shlomi-noach changed the title KV: auditing write time; WIP for consistency checks Consul KV consistency checks May 21, 2019
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 21, 2019 05:32 Inactive
@shlomi-noach
Collaborator Author

What this PR turned out to be doing:

Assuming ConsulCrossDataCenterDistribution is set, when the orchestrator leader distributes master KV values, we run a thorough check on all known Consul datacenters to verify all KVs exist on all these clusters and are correct. This only happens on the orchestrator leader node.

There's caching to avoid abusing Consul. On its first iteration as leader, orchestrator queries all KVs on all Consul clusters. From then on, it relies on a cache of the last-known values.

What this effectively does: if a Consul cluster was down for any reason, a failover took place during that time, and as a result orchestrator could not update that cluster with the new KV data, the change is not lost. Every minute, orchestrator retries updating the cluster with the most up-to-date KV data.
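The behaviour described above can be sketched roughly as follows. This is a simplified illustration, not orchestrator's actual implementation: `KVStore`, `Reconciler`, `memStore` and the key name are all hypothetical, and the in-memory store stands in for a per-datacenter Consul client. The point is the shape of the loop: query a datacenter only when nothing is cached for a key, write when the value is missing or wrong, and leave the cache untouched on failure so the next run retries.

```go
package main

import (
	"errors"
	"fmt"
)

// KVPair is a key/value pair to distribute, e.g. a cluster's master host.
type KVPair struct {
	Key   string
	Value string
}

// KVStore is a hypothetical stand-in for one datacenter's Consul client.
type KVStore interface {
	Get(key string) (value string, found bool, err error)
	Put(key, value string) error
}

// Reconciler remembers, per datacenter, the last value it confirmed there.
type Reconciler struct {
	stores map[string]KVStore
	cache  map[string]map[string]string // datacenter -> key -> confirmed value
}

func NewReconciler(stores map[string]KVStore) *Reconciler {
	return &Reconciler{stores: stores, cache: map[string]map[string]string{}}
}

// Reconcile makes every reachable datacenter hold the desired pairs.
// It returns the number of writes issued and the datacenters that failed.
func (r *Reconciler) Reconcile(desired []KVPair) (writes int, failed map[string]error) {
	failed = map[string]error{}
	for dc, store := range r.stores {
		if r.cache[dc] == nil {
			r.cache[dc] = map[string]string{}
		}
		for _, pair := range desired {
			if cached, ok := r.cache[dc][pair.Key]; ok && cached == pair.Value {
				continue // already confirmed; do not query Consul again
			}
			current, found, err := store.Get(pair.Key)
			if err != nil {
				failed[dc] = err // cache untouched, so the next run retries
				break
			}
			if !found || current != pair.Value {
				if err := store.Put(pair.Key, pair.Value); err != nil {
					failed[dc] = err
					break
				}
				writes++
			}
			r.cache[dc][pair.Key] = pair.Value
		}
	}
	return writes, failed
}

// memStore is an in-memory fake that can simulate an unavailable datacenter.
type memStore struct {
	data map[string]string
	down bool
	gets int
}

func (m *memStore) Get(key string) (string, bool, error) {
	if m.down {
		return "", false, errors.New("datacenter unavailable")
	}
	m.gets++
	v, ok := m.data[key]
	return v, ok, nil
}

func (m *memStore) Put(key, value string) error {
	if m.down {
		return errors.New("datacenter unavailable")
	}
	m.data[key] = value
	return nil
}

func main() {
	dc2 := &memStore{data: map[string]string{}, down: true}
	r := NewReconciler(map[string]KVStore{"dc1": &memStore{data: map[string]string{}}, "dc2": dc2})
	desired := []KVPair{{Key: "mysql/master/concertmaster", Value: "db2.example.com"}}

	writes, failed := r.Reconcile(desired) // dc2 is down during the "failover"
	fmt.Println("writes:", writes, "failed datacenters:", len(failed))

	dc2.down = false
	writes, failed = r.Reconcile(desired) // a later run catches dc2 up
	fmt.Println("writes:", writes, "failed datacenters:", len(failed))
}
```

In a real deployment, something like `Reconcile` would run periodically on the leader only (the comment above mentions a one-minute cadence). The cache would also need resetting when leadership changes, which this sketch leaves out.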

@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor May 21, 2019 06:36 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 21, 2019 09:59 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor May 21, 2019 09:59 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 22, 2019 04:21 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=concertmaster May 26, 2019 06:29 Inactive
@shlomi-noach shlomi-noach temporarily deployed to production/mysql_cluster=conductor May 26, 2019 06:39 Inactive
@shlomi-noach shlomi-noach merged commit f6fc25c into master May 26, 2019
@shlomi-noach shlomi-noach deleted the kv-pair-auditing branch May 26, 2019 06:51