Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUGFIX] Avoid returning empty data on startup of a non-leader server #4554

Merged
merged 5 commits into from
Aug 23, 2018

Conversation

@pierresouchay pierresouchay changed the title Stale reads on server init Avoid Stale reads to return nothing without error on server follower init Aug 21, 2018
@pierresouchay
Copy link
Contributor Author

Thanks to @vaLski to help with tests to solve this issue

@vaLski
Copy link
Contributor

vaLski commented Aug 21, 2018

I confirm that this PR fix all three issues.

Reproducer without this patch applied:

  • create consul server leader
  • create consul server follower
  • write some data into the kv store
  for i in $(seq 1 10); do
    consul kv put pub/prefix/${i}
  done
  • initiate stale reads on the follower
  while true; do
    consul kv get -recurse -detailed  -stale pub/prefix/
    echo $?
    date +%s
    sleep 0.1
  done
  • go to the consul leader and block all incoming/outgoing traffic from/to follower
    iptables -I INPUT -s follower.ip.ip.ip -j DROP
  • go to the follower
  • kill consul and start it again
  • shortly you will see the follower answering recurse queries with valid response code and empty payload
  • this behavior is unexpected. it should either return 5xx or return non-empty data.

As soon as the patch is applied, follower will start answering stale queries with 5xx error, unless it contacted the leader at least once, thus having some consistent raft db version. That's the expected behavior.

@pierresouchay
Copy link
Contributor Author

@mkeeler @banks @pearkes This fix is quite important as it lead us to outages (as it happens as well to @vaLski)

This is basically a race condition in Server Code that leads stale request to return empty instead of an error if a client (re)connects too fast before the server could contact its leader. Thus, Consul returns false data (for instance empty kv, but we had the same issues a long time ago that cause major outage because a restarting server did return [] for list of nodes.

It will fix 3 issues at the same time :-)

@pierresouchay pierresouchay changed the title Avoid Stale reads to return nothing without error on server follower init [BUGFIX] Avoid returning empty data on startup of a non-leader server Aug 21, 2018
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pierresouchay @vaLski thanks this looks like a great find and fix. The test seems good although I need to check it through more carefully to be sure it's not going to be potentially flaky in CI - seems OK but I'm not totally sure on a quick glance.

I'm approving this because the logic seems good but we might not merge until later in the release cycle when we have a little more time to test it ourselves thoroughly!

@freddygv can you take a look over the test code and check that it doesn't seem to rely on any timing assumptions that will cause us problems?

@pierresouchay
Copy link
Contributor Author

pierresouchay commented Aug 22, 2018

@banks Thank you for the quick review

@freddygv About the flakiness, it should be Ok since I used the exact same patterns as existing tests (that are not known to be flaky) and I tested the following way:

while go test -parallel 2 -timeout 30s github.com/hashicorp/consul/agent/consul -run '^TestCatalog_ListService(s_Stale|Nodes)$'; do go clean -testcache; done

=> 80 consecutive runs without a single failure (It usually take around 5-6 runs to get a failure for unstable tests)

t.Fatalf("bad: %#v", out.Services)
}

if out.Services["consul"] == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will ever be nil if the prior assertion for len(out.Services) passes. Also, Services maps a service to its tags, not its ID, according to this. So the stored value should be an empty slice.

It should be ok to remove this check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

os.RemoveAll(dir1)

args.AllowStale = false
// Run the query, do not wait for leader, never any contact with leader, should fail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please update this comment, we have had contact with a leader, it's just that now we don't have one anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE

args.AllowStale = false
// Run the query, do not wait for leader, never any contact with leader, should fail
if err := msgpackrpc.CallWithCodec(codec, "Catalog.ListServices", &args, &out); err == nil || err.Error() != structs.ErrNoLeader.Error() {
t.Fatalf("expected %v but got err: %v and %v", structs.ErrNoLeader, err, out)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spotted some flakiness here after re-running the agent/consul job in Travis ~5 times: https://travis-ci.org/hashicorp/consul/jobs/418763773

Here is the error:
catalog_endpoint_test.go:1532: expected No cluster leader but got err: <nil> and {map[consul:[]] {42 0s true }}

The last true in the slice above is the value of KnownLeader, so it seems that the result for the RPC is may be coming back before the heartbeat fails and the leader is removed.

Could this test be restructured so that it doesn't depend on the side-effects of Leave() and Shutdown()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DONE, I added testrpc.WaitUntilNoLeader() new test method in order to solve this kind of issues

@pierresouchay
Copy link
Contributor Author

@freddygv In the first check, some unit tests did fail, but not related to my change: https://travis-ci.org/hashicorp/consul/jobs/419520868

@vaLski
Copy link
Contributor

vaLski commented Aug 23, 2018

Salute and big thanks to everyone involved in tracking and fixing this. Great job guys. Really \o/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants