
403 Permission denied error / 404 not found when submitting concurrent kv GET, PUT or DELETE / kv GET recurse with the same token/different tokens with the same policy from a non primary datacenter #5219

Closed
viniciusartur opened this issue Jan 12, 2019 · 9 comments · Fixed by #5246
Labels
theme/acls ACL and token generation type/bug Feature does not function as expected

Comments

viniciusartur commented Jan 12, 2019

Overview of the Issue

Given we have 2 Consul clusters with ACLs enabled and ACL replication,
When we submit concurrent kv GET, PUT, or DELETE requests with the same token, or with different tokens sharing the same policy, from the non-primary datacenter,
Then we get an Unexpected response code: 403 (Permission denied)
Or
Then we get a 404 when we submit a kv GET with recurse

It seems that when Consul resolves the policies of the token, it gets lost somewhere along the way.
The function resolvePoliciesForIdentity in agent/consul/acl.go takes a different path when there are concurrent requests with the same token.

Reproduction Steps

  1. 2 Vagrant machines:

➜ reproduce_it cat Vagrantfile

# -*- mode: ruby -*-
# vi: set ft=ruby :

VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

  (1..2).each do |index|
    config.vm.define "consul-dc#{index}" do |c|
      c.vm.box =  "ubuntu/trusty64"
      c.vm.hostname = "consul-dc#{index}"
      c.vm.network "private_network", ip: "192.168.56.10#{index}"
      #c.vm.provision "shell", path: "setup.sh"
      #c.vm.provider :virtualbox do |vb|
      #  vb.customize ["modifyvm", :id, "--memory", "2048"]
      #end
      #c.vm.synced_folder "./", "/home/vagrant/go/src/github.com/hashicorp/consul-replicate"
    end
  end
end
  2. vagrant up
  3. install unzip and download Consul 1.4.0
  4. on dc1, run Consul with the following command and config.json:
    command: consul agent -server -config-file=config.json
    config.json:
{
  "server": true,
  "bind_addr": "0.0.0.0",
  "data_dir": "/tmp/consul",
  "datacenter": "dc1",
  "advertise_addr": "192.168.56.101",
  "log_level": "TRACE",
  "bootstrap_expect": 1,
  "primary_datacenter": "dc1",
  "acl" : {
    "enabled" : true,
    "default_policy" : "deny",
    "down_policy" : "extend-cache"
  }
}
  5. wait for the cluster to bootstrap
  6. bootstrap the ACL system, create the agent policy, create a token with that policy, set the agent token, and restart the Consul server
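The ACL bootstrap step above is dense; here is a minimal sketch using the Consul 1.4 ACL CLI. The policy name and rules are illustrative assumptions (a typical minimal agent policy), and the SecretID placeholders must be replaced with the values your cluster actually returns:

```shell
# Bootstrap the ACL system; note the SecretID of the returned bootstrap token.
consul acl bootstrap

# Authenticate all subsequent commands with the bootstrap token.
export CONSUL_HTTP_TOKEN='<bootstrap token SecretID>'

# Policy with the privileges an agent needs to register itself and its checks.
consul acl policy create -name agent-policy -rules '
node_prefix "" { policy = "write" }
service_prefix "" { policy = "read" }
'

# Token attached to that policy, to be used as the agent token.
consul acl token create -description "agent token" -policy-name agent-policy

# Hand the token to the agent, either at runtime...
consul acl set-agent-token agent '<agent token SecretID>'
# ...or via acl.tokens.agent in config.json followed by a restart,
# as the dc2 config below does.
```

These commands need a running server with ACLs enabled, so treat them as a configuration sketch rather than a copy-paste script.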
  7. on dc2, run Consul with the same command as dc1 and this config.json:
{
  "server": true,
  "bind_addr": "0.0.0.0",
  "data_dir": "/tmp/consul",
  "datacenter": "dc2",
  "advertise_addr": "192.168.56.102",
  "log_level": "TRACE",
  "bootstrap_expect": 1,
  "primary_datacenter": "dc1",
  "retry_join_wan": ["192.168.56.101"],
  "acl" : {
    "enabled" : true,
    "default_policy" : "deny",
    "down_policy" : "extend-cache",
    "tokens": {
      "agent": "617539a8-5bf0-3b4a-4bbc-208cfdfae481"
    }
  }
}
  8. Create a policy with key_prefix "" { policy = "write" }

  9. Create a token, or two different tokens using the same policy
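The policy and token creation above can be sketched as follows (the policy name is an illustrative assumption; the SecretID printed by the token command is what goes into CONSUL_HTTP_TOKEN below):

```shell
# Policy granting write on every key.
consul acl policy create -name kv-write -rules 'key_prefix "" { policy = "write" }'

# A token using that policy; run this command a second time to get a
# different token backed by the same policy.
consul acl token create -description "kv writer" -policy-name kv-write
```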

  10. Start 2 loops from dc2 writing keys using that token:

export CONSUL_HTTP_TOKEN=96f33a23-c54d-e2bd-651b-893846c43242
for i in $(seq 1 100000); do consul kv put test/1 1; exit_status=$?; if [ $exit_status -eq 1 ]; then break; fi ; done

Eventually the 403 Permission denied errors will appear.

or

for i in $(seq 1 100000); do http_code=$(curl -s -o /dev/null -w "%{http_code}" -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" localhost:8500/v1/kv/test?recurse=true); echo $http_code; if [ $http_code -eq 404 ]; then break; fi ; done

Eventually the 404 will appear.

Note that in this PoC I didn't test through an agent connected to dc2, but we hit the issue on a production cluster even when connecting through a local agent instead of communicating directly with the server.

You can reproduce the error with PUT, GET, or DELETE.

As I mentioned before, it seems that when Consul resolves the policies of the token, it gets lost somewhere along the way.
The function resolvePoliciesForIdentity in agent/consul/acl.go takes a different path when there are concurrent requests with the same token or with different tokens sharing the same policy.

@viniciusartur changed the title to "403 Permission denied error when submitting concurrent kv GET, PUT or DELETE with the same token/different tokens with the same policy from a non primary datacenter" Jan 14, 2019
@viniciusartur changed the title to "403 Permission denied error / 404 not found when submitting concurrent kv GET, PUT or DELETE / kv GET recurse with the same token/different tokens with the same policy from a non primary datacenter" Jan 15, 2019
viniciusartur (Author)

Please note that we have had impactful disasters due to this issue.
We use Consul UI and Consul Replicate, and the unexpected 404 responses have triggered recursive KV deletes.

Consul UI triggers a delete?recurse when it receives a 404.
Consul Replicate understands it should delete the entire destination prefix when it receives a 404.

We got unexpected 404 responses because we were submitting requests associated with the same policy, and intermittently those requests were processed concurrently.

@mkeeler added the type/bug and theme/acls labels Jan 15, 2019
mkeeler (Member) commented Jan 15, 2019

@viniciusartur Were the requests being made through a client agent or directly to a consul server?

viniciusartur (Author)

We've had unexpected 403s in both scenarios: through a client agent and directly to a Consul server.
I didn't try to reproduce with 2 local agents sending requests to a non-primary dc cluster.
While troubleshooting, I noticed the problem happens on the leader of the non-primary dc cluster, in the function mentioned above.

viniciusartur (Author)

We noticed that configuring the replication token on the non-primary datacenter stops the issue from reproducing.
But we hadn't noticed that the policies were not being replicated; we presumed replication was working.
Once the policy is replicated to the non-primary dc, Consul seems to resolve it locally instead of making remote calls to the primary dc.

mkeeler (Member) commented Jan 17, 2019

@viniciusartur This is good information. Somehow I managed to miss the "from non primary datacenter" bit in the bug title.

I assume in your non-primary DC you were seeing some error logs about not being able to replicate policies (prior to setting up the replication token).

In Consul 1.4.0+ a replication_token must be set in non-primary datacenters. This token needs at least acl = "read" privileges in order to replicate policies or acl = "write" privileges in order to replicate both tokens and policies. Without that your newly created policy would never be replicated.

I assume then that the real bug lies somewhere in the remote policy resolution happening on the servers. This also brings up a bigger issue: there is currently no guide for setting up ACLs with multiple datacenters, which would probably have helped.
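Given the requirements described above, setting up replication could be sketched like this (run against the primary datacenter with a management token; the policy name and placeholder values are illustrative assumptions):

```shell
# acl = "read" is enough to replicate policies; use acl = "write" to
# replicate tokens as well.
consul acl policy create -name replication -rules 'acl = "read"'
consul acl token create -description "replication token" -policy-name replication

# On each server in the secondary datacenter, set the token at runtime...
consul acl set-agent-token replication '<replication token SecretID>'
# ...or persist it in config.json and restart:
#   "acl": { "tokens": { "replication": "<replication token SecretID>" } }
```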

mkeeler (Member) commented Jan 17, 2019

@viniciusartur Without the replication token set, do those queries ever work? They shouldn't, as you have a default deny policy and the down policy is extend-cache. Since the cache of remote policies could never have been populated, access should always be denied.

viniciusartur (Author)

@mkeeler Couldn't you reproduce the issue the way I posted? Then you could see it for yourself; it's very straightforward, and you could simulate whatever scenarios you want.
A guide to setting up ACLs in multiple datacenters is indeed missing. But what is weird is that even without the replication token configured in non-primary datacenters, tokens and policies are still resolved there. This leads to the wrong conclusion that replication is working on the non-primary datacenter. I don't remember seeing an error log message about failing to replicate policies.
Isn't the point of setting the replication token to provide a fallback for resolving tokens and policies in case of an outage of the primary datacenter?
Why does resolution work across datacenters even when you don't set the replication token?
Why do concurrent requests misbehave when resolving policies while the replication token is not set?
These are the questions that intrigue me :)

mkeeler (Member) commented Jan 18, 2019

@viniciusartur I was able to reproduce yesterday. Not with vagrant but with a little terraform + docker + a python script to bootstrap acls and do the kv gets/puts.

I will have to write up some internals docs on this, but there are a few things to note:

  1. In the happy path with replication happening, the only RPC that needs to be made is for the replication routines. After that the consul servers in the secondary DCs will always resolve the tokens (if token replication is enabled) and policies from their local raft store.

  2. Due to no. 1, it is sometimes necessary to wait for a token/policy to get replicated before it can be used in secondary datacenters.

  3. When replication was never successful, e.g. when a replication token has not been set, servers in the secondary datacenters will use RPCs back to the primary datacenter to resolve tokens and policies, which is why it's not a permission denied 100% of the time. This is particularly useful for allowing the token used during bootstrap to be used in API requests to set the replication tokens within the secondary DC. Without this remote RPC fallback, you would be forced to configure an acl.tokens.agent token and use it for setting up all the other tokens.

  4. The "fallback" in no. 3 is the only way things get resolved on non-server consul instances.

I filed a feature request as a reminder that at some point those secondary servers need to determine how stale their replication is and enable the fallback procedure in that case as well: #4842

One other note: replication is done by the leader in each secondary datacenter. When I spun up my test cluster I wasn't seeing it at first either, but then I remembered I needed to find the leader and view its logs.
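As an aside, one way to check whether replication has ever succeeded on a secondary server is the ACL replication status endpoint; the exact fields returned are my understanding of the 1.4 API:

```shell
# Query ACL replication status on a server in the secondary dc.
# Fields such as Running, ReplicationType, and LastSuccess show whether
# policy/token replication has ever completed there.
curl -s http://localhost:8500/v1/acl/replication | python -m json.tool
```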

Why concurrent requests cause bad behavior resulting in permission denials, I have yet to determine. Now that I have reproduced it, I am going to figure that out.

viniciusartur (Author)

Thanks for clarifying it!
