
403 Permission denied error / 404 not found when submitting concurrent kv GET, PUT or DELETE / kv GET recurse with the same token/different tokens with the same policy from a non primary datacenter #5219

Closed
viniciusartur opened this issue Jan 12, 2019 · 9 comments · Fixed by #5246
Labels
theme/acls ACL and token generation type/bug Feature does not function as expected

Comments

viniciusartur commented Jan 12, 2019

Overview of the Issue

Given we have 2 Consul clusters with ACLs enabled and ACL replication,
When we submit concurrent kv GET, PUT, or DELETE requests with the same token, or with different tokens sharing the same policy, from the non-primary datacenter,
Then we get an Unexpected response code: 403 (Permission denied)
Or
Then we get a 404 when we submit a kv GET with recurse

It seems that when Consul resolves the policies of the token, it gets lost somewhere along the way.
The function resolvePoliciesForIdentity in agent/consul/acl.go takes a different path when there are concurrent requests with the same token.

Reproduction Steps

  1. 2 Vagrant machines:

➜ reproduce_it cat Vagrantfile

# -*- mode: ruby -*-
# vi: set ft=ruby :

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  (1..2).each do |index|
    config.vm.define "consul-dc#{index}" do |c|
      c.vm.box =  "ubuntu/trusty64"
      c.vm.hostname = "consul-dc#{index}"
      c.vm.network "private_network", ip: "192.168.56.10#{index}"
      #c.vm.provision "shell", path: "setup.sh"
      #c.vm.provider :virtualbox do |vb|
      #  vb.customize ["modifyvm", :id, "--memory", "2048"]
      #end
      #c.vm.synced_folder "./", "/home/vagrant/go/src/github.com/hashicorp/consul-replicate"
    end
  end
end
  2. vagrant up
  3. install unzip and download Consul 1.4.0
  4. on dc1, run Consul with the following command and config.json:
    command: consul agent -server -config-file=config.json
    config.json:
{
  "server": true,
  "bind_addr": "0.0.0.0",
  "data_dir": "/tmp/consul",
  "datacenter": "dc1",
  "advertise_addr": "192.168.56.101",
  "log_level": "TRACE",
  "bootstrap_expect": 1,
  "primary_datacenter": "dc1",
  "acl" : {
    "enabled" : true,
    "default_policy" : "deny",
    "down_policy" : "extend-cache"
  }
}
  5. wait for the cluster to bootstrap
  6. bootstrap the ACL system, create the agent policy, create a token with that policy, set the agent token, and restart the Consul server
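The ACL bootstrap step above is dense; here is a minimal sketch using the Consul 1.4 ACL CLI. The policy name and rules are illustrative assumptions (a typical minimal agent policy), and the SecretID placeholders must be replaced with the values your cluster actually returns:

```shell
# Bootstrap the ACL system; note the SecretID of the returned bootstrap token.
consul acl bootstrap

# Authenticate all subsequent commands with the bootstrap token.
export CONSUL_HTTP_TOKEN='<bootstrap token SecretID>'

# Policy with the privileges an agent needs to register itself and its checks.
consul acl policy create -name agent-policy -rules '
node_prefix "" { policy = "write" }
service_prefix "" { policy = "read" }
'

# Token attached to that policy, to be used as the agent token.
consul acl token create -description "agent token" -policy-name agent-policy

# Hand the token to the agent, either at runtime...
consul acl set-agent-token agent '<agent token SecretID>'
# ...or via acl.tokens.agent in config.json followed by a restart,
# as the dc2 config below does.
```

These commands need a running server with ACLs enabled, so treat them as a configuration sketch rather than a copy-paste script.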
  7. on dc2, run Consul with the same command as dc1 and this config.json:
{
  "server": true,
  "bind_addr": "0.0.0.0",
  "data_dir": "/tmp/consul",
  "datacenter": "dc2",
  "advertise_addr": "192.168.56.102",
  "log_level": "TRACE",
  "bootstrap_expect": 1,
  "primary_datacenter": "dc1",
  "retry_join_wan": ["192.168.56.101"],
  "acl" : {
    "enabled" : true,
    "default_policy" : "deny",
    "down_policy" : "extend-cache",
    "tokens": {
      "agent": "617539a8-5bf0-3b4a-4bbc-208cfdfae481"
    }
  }
}
  8. Create a policy with key_prefix "" { policy = "write" }

  9. Create a token, or two different tokens using the same policy
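The policy and token creation above can be sketched as follows (the policy name is an illustrative assumption; the SecretID printed by the token command is what goes into CONSUL_HTTP_TOKEN below):

```shell
# Policy granting write on every key.
consul acl policy create -name kv-write -rules 'key_prefix "" { policy = "write" }'

# A token using that policy; run this command a second time to get a
# different token backed by the same policy.
consul acl token create -description "kv writer" -policy-name kv-write
```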

  10. Start 2 loops from dc2 writing keys using that token:

export CONSUL_HTTP_TOKEN=96f33a23-c54d-e2bd-651b-893846c43242
for i in $(seq 1 100000); do consul kv put test/1 1; exit_status=$?; if [ $exit_status -eq 1 ]; then break; fi ; done

Eventually the 403 Permission denied errors will appear.

or

for i in $(seq 1 100000); do http_code=$(curl -s -o /dev/null -w "%{http_code}" -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" localhost:8500/v1/kv/test?recurse=true); echo $http_code; if [ $http_code -eq 404 ]; then break; fi ; done

Eventually the 404 will appear.

Note that in this PoC I didn't test through an agent connected to dc2, but we hit the issue on a production cluster even when connecting through a local agent instead of communicating directly with the server.

You can reproduce the error with PUT, GET, or DELETE.

As I mentioned before, it seems that when Consul resolves the policies of the token, it gets lost somewhere along the way.
The function resolvePoliciesForIdentity in agent/consul/acl.go takes a different path when there are concurrent requests with the same token or with different tokens sharing the same policy.

@viniciusartur changed the title to "403 Permission denied error when submitting concurrent kv GET, PUT or DELETE with the same token/different tokens with the same policy from a non primary datacenter" Jan 14, 2019
@viniciusartur changed the title to "403 Permission denied error / 404 not found when submitting concurrent kv GET, PUT or DELETE / kv GET recurse with the same token/different tokens with the same policy from a non primary datacenter" Jan 15, 2019
viniciusartur (Author)

Please note that we have had impactful disasters due to this issue.
We use Consul UI and Consul Replicate, and the unexpected 404 responses have triggered recursive KV deletes.

Consul UI triggers a delete?recurse when it receives a 404.
Consul Replicate understands it should delete the entire destination prefix when it receives a 404.

We got unexpected 404 responses because we were submitting requests associated with the same policy, and intermittently those requests were processed concurrently.

@mkeeler added the type/bug and theme/acls labels Jan 15, 2019
mkeeler (Member) commented Jan 15, 2019

@viniciusartur Were the requests being made through a client agent or directly to a consul server?

viniciusartur (Author)

We've had unexpected 403s in both scenarios: through a client agent and directly to a Consul server.
I didn't try to reproduce with 2 local agents sending requests to a non-primary dc cluster.
While troubleshooting, I noticed the problem happens on the leader of the non-primary dc cluster, in the function mentioned above.

viniciusartur (Author)

We noticed that configuring the replication token on the non-primary datacenter stops the issue from reproducing.
But we hadn't noticed that the policies were not being replicated; we presumed replication was working.
Once the policy is replicated to the non-primary dc, Consul seems to resolve it locally instead of making remote calls to the primary dc.

mkeeler (Member) commented Jan 17, 2019

@viniciusartur This is good information. Somehow I managed to miss the "from non primary datacenter" bit in the bug title.

I assume in your non-primary DC you were seeing some error logs about not being able to replicate policies (prior to setting up the replication token).

In Consul 1.4.0+ a replication_token must be set in non-primary datacenters. This token needs at least acl = "read" privileges in order to replicate policies or acl = "write" privileges in order to replicate both tokens and policies. Without that your newly created policy would never be replicated.

I assume then that the real bug lies somewhere in the remote policy resolution happening on the servers. This also brings up a bigger issue: there is currently no guide for setting up ACLs with multiple datacenters, which would probably have helped.
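Given the requirements described above, setting up replication could be sketched like this (run against the primary datacenter with a management token; the policy name and placeholder values are illustrative assumptions):

```shell
# acl = "read" is enough to replicate policies; use acl = "write" to
# replicate tokens as well.
consul acl policy create -name replication -rules 'acl = "read"'
consul acl token create -description "replication token" -policy-name replication

# On each server in the secondary datacenter, set the token at runtime...
consul acl set-agent-token replication '<replication token SecretID>'
# ...or persist it in config.json and restart:
#   "acl": { "tokens": { "replication": "<replication token SecretID>" } }
```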

mkeeler (Member) commented Jan 17, 2019

@viniciusartur Without the replication token set, do those queries ever work? They shouldn't, as you have a default deny policy and the down policy is extend-cache. Since the cache of remote policies could never have been populated, access should always be denied.

viniciusartur (Author)

@mkeeler Couldn't you reproduce the issue the way I posted? Then you could see it for yourself; it's very straightforward, and you could simulate whatever scenarios you want.
A guide to setting up ACLs in multiple datacenters is indeed missing. But what is weird is that even without the replication token configured in non-primary datacenters, tokens and policies are still resolved there. This leads to the wrong conclusion that replication is working on the non-primary datacenter. I don't remember seeing an error log message about failing to replicate policies.
Isn't the point of setting the replication token to provide a fallback for resolving tokens and policies in case of an outage of the primary datacenter?
Why does resolution work across datacenters even when you don't set the replication token?
Why do concurrent requests misbehave when resolving policies while the replication token is not set?
These are the questions that intrigue me :)

mkeeler (Member) commented Jan 18, 2019

@viniciusartur I was able to reproduce yesterday. Not with vagrant but with a little terraform + docker + a python script to bootstrap acls and do the kv gets/puts.

I will have to write up some internals docs on this, but there are a few things to note:

  1. In the happy path with replication happening, the only RPC that needs to be made is for the replication routines. After that the consul servers in the secondary DCs will always resolve the tokens (if token replication is enabled) and policies from their local raft store.

  2. Due to no. 1, it is sometimes necessary to wait for a token/policy to get replicated before it can be used in secondary datacenters.

  3. When replication was never successful, e.g. when a replication token has not been set, servers in the secondary datacenters will use RPCs back to the primary datacenter to resolve tokens and policies, which is why it's not a permission denied 100% of the time. This is particularly useful for allowing the token used during bootstrap to be used in API requests to set the replication tokens within the secondary DC. Without this remote RPC fallback, you would be forced to configure an acl.tokens.agent token and use it for setting up all the other tokens.

  4. The "fallback" in no. 3 is the only way things get resolved on non-server consul instances.

I filed a feature request as a reminder that at some point those secondary servers need to determine how stale their replication is and enable the fallback procedure in that case as well: #4842

One other note: replication is done by the leader in each secondary datacenter. When I spun up my test cluster I wasn't seeing it at first either, but then I remembered I needed to find the leader and view its logs.
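As an aside, one way to check whether replication has ever succeeded on a secondary server is the ACL replication status endpoint; the exact fields returned are my understanding of the 1.4 API:

```shell
# Query ACL replication status on a server in the secondary dc.
# Fields such as Running, ReplicationType, and LastSuccess show whether
# policy/token replication has ever completed there.
curl -s http://localhost:8500/v1/acl/replication | python -m json.tool
```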

Why concurrent requests cause bad behavior resulting in permission denials, I have yet to determine. Now that I have reproduced it, I am going to figure that out.

viniciusartur (Author)

Thanks for clarifying it!
