
Config Entry replication of ingress-gateway entries fails validation in secondary datacenter #9319

Closed
crhino opened this issue Dec 3, 2020 · 11 comments · Fixed by #12307
Labels
type/bug Feature does not function as expected

Comments

@crhino
Contributor

crhino commented Dec 3, 2020

Overview of the Issue

Config Entry replication fails to apply properly in the secondary datacenter, blocking replication from finishing. This is caused by ingress-gateway config validation being dependent on an existing proxy-defaults entry.

I could imagine that this is an issue with any config entry that is dependent on another entry for setting properties like the protocol of the service.

Reproduction Steps

Steps to reproduce this issue:

  1. Create 2 datacenters
  2. Create an ingress gateway config entry with an http listener for a service defined, and a proxy-defaults entry that sets everything to http protocol
  3. Watch the secondary DC to see the replication of config entries

This does not reproduce every time; I think this is because the secondary DC sometimes replicates the proxy-defaults entry before the ingress-gateway entry is added in my setup.

My Config Entries:

Created via consul config write CLI command:

{
  "kind": "ingress-gateway",
  "name": "ingress1",
  "listeners": [
    {
      "protocol": "http",
      "port": 443,
      "services": [
        {
          "name": "*"
        }
      ]
    },
    {
      "protocol": "http",
      "port": 444,
      "services": [
        {
          "name": "virtual"
        }
      ]
    }
  ]
}

Set in the config of the primary servers:

  "config_entries": {
    "bootstrap": [
      {
        "kind": "proxy-defaults",
        "name": "global",
        "config": {
          "protocol": "http"
        }
      },
      {
        "kind": "service-router",
        "name": "counting",
        "routes": [
          {
            "destination": {
              "NumRetries": 3,
              "RetryOnConnectFailure": true
            }
          }
        ]
      }
    ]
  },

Log Fragments

    2020-12-03T16:33:52.006Z [DEBUG] agent.server.replication.config_entry: finished fetching config entries: amount=3
    2020-12-03T16:33:52.007Z [DEBUG] agent.server.replication.config_entry: Config Entry replication: local=0 remote=3
    2020-12-03T16:33:52.007Z [DEBUG] agent.server.replication.config_entry: Config Entry replication: deletions=0 updates=3
    2020-12-03T16:33:52.007Z [DEBUG] agent.server.replication.config_entry: Updating local config entries: updates=3
    2020-12-03T16:33:52.008Z [WARN]  agent.server.replication.config_entry: replication error (will retry if still leader): error="failed to update local config entries: Failed to apply config upsert: service "virtual" has protocol "tcp", which does not match defined listener protocol "http""
@crhino crhino added the type/bug Feature does not function as expected label Dec 3, 2020
@crhino
Contributor Author

crhino commented Dec 3, 2020

Note that I needed to patch 1.9.0 with #9320 in order to actually see the error.

@mikemorris
Contributor

mikemorris commented Dec 3, 2020

This sounds like a suspiciously similar replication logic issue to #9271 (comment)

@woz5999
Contributor

woz5999 commented Dec 31, 2020

i'm experiencing this same issue in 1.9.1 despite #9271 being closed.

@crhino
Contributor Author

crhino commented Jan 4, 2021

Unfortunately #9271 does not address this specific issue, although they are similar.

@woz5999
Contributor

woz5999 commented Jan 6, 2021

this state also occurs if setting the protocol via service-defaults. same race condition and failure.

@woz5999
Contributor

woz5999 commented Jan 6, 2021

fwiw a workaround is to delete the affected ingress-gateway configs from the primary datacenter, allow the other required configs to replicate, and then recreate the deleted ingress-gateway config.

it's lame, disruptive, and fragile, but it'll at least unblock replication. otherwise, using non-tcp protocol ingress-gateway listeners with federated clusters is a gamble at best and definitely not suitable for production use until this is fixed.

@woz5999
Contributor

woz5999 commented Jan 21, 2021

seems like this might be the same issue as #9196

@dsolsona

We are also suffering from this issue in our Consul federated clusters, and I can confirm @woz5999's workaround works, but this is definitely something you don't want to do in production.

@rrijkse

rrijkse commented Oct 13, 2021

Just wanted to drop a note here, the work around specified above is quite hard to implement when it affects a lot of other services. An alternative workaround is to temporarily create a config entry of kind service-defaults for the virtual service with the protocol set to whichever it is expecting. This caused replication to resume for me and the proxy-defaults to take effect.
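For reference, the temporary entry described in this workaround might look something like the following sketch. The service name `virtual` is taken from the ingress-gateway entry earlier in this issue; the protocol would need to match whatever the listener expects:

```json
{
  "kind": "service-defaults",
  "name": "virtual",
  "protocol": "http"
}
```

Once replication resumes and the proxy-defaults entry takes effect, this temporary entry can presumably be deleted again.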

@chrisboulton
Contributor

If you're still experiencing this like we are, it's due to the sort algorithm used when applying config entries during replication. The current implementation pretty much does an alpha sort to determine the order, and because proxy-defaults > ingress-gateway, the sort order is out: we want proxy-defaults before ingress-gateways (and probably any other type of config entry too). This works for service-defaults and service-router/service-resolver because well.. the alphabet.

A quick patch which sorts proxy-defaults first is here: bigcommerce@85b4fce. This works for us - once it's installed on a leader in a secondary DC you should be good to go.

A better fix would be to order configuration entries properly based on their dependencies, or maybe to relax the validation when replicated entries are being applied.
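The ordering problem described above can be sketched as follows. This is a minimal illustration of the idea behind the patch, not Consul's actual replication code; `configEntry` and `kindRank` are hypothetical names:

```go
package main

import (
	"fmt"
	"sort"
)

// configEntry is a minimal stand-in for a Consul config entry,
// reduced to the two fields the replication sort compares.
type configEntry struct {
	Kind string
	Name string
}

// kindRank forces proxy-defaults ahead of every other kind; all
// remaining kinds keep their plain alphabetical order. A bare
// alphabetical sort puts "ingress-gateway" before "proxy-defaults",
// which is the ordering that breaks validation.
func kindRank(kind string) int {
	if kind == "proxy-defaults" {
		return 0
	}
	return 1
}

// sortForReplication orders entries by rank, then kind, then name.
func sortForReplication(entries []configEntry) {
	sort.SliceStable(entries, func(i, j int) bool {
		ri, rj := kindRank(entries[i].Kind), kindRank(entries[j].Kind)
		if ri != rj {
			return ri < rj
		}
		if entries[i].Kind != entries[j].Kind {
			return entries[i].Kind < entries[j].Kind
		}
		return entries[i].Name < entries[j].Name
	})
}

func main() {
	entries := []configEntry{
		{"service-router", "counting"},
		{"ingress-gateway", "ingress1"},
		{"proxy-defaults", "global"},
	}
	sortForReplication(entries)
	for _, e := range entries {
		fmt.Printf("%s/%s\n", e.Kind, e.Name)
	}
}
```

With this ranking, proxy-defaults lands first regardless of the alphabet, so dependents like ingress-gateway see the expected protocol when they are validated.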

rboyer added a commit that referenced this issue Feb 10, 2022
…cation in some circumstances

There are some cross-config-entry relationships that are enforced during
"graph validation" at persistence time that are required to be
maintained. This means that config entries may form a digraph at times.

Config entry replication proceeds in a particular sorted order by kind
and name.

Occasionally there are some fixups to these digraphs that end up
replicating in the wrong order and replicating the leaves
(ingress-gateway) before the roots (service-defaults) leading to
replication halting due to a graph validation error related to things
like mismatched service protocol requirements.

This PR changes replication to give each computed change (upsert/delete)
a fair shot at being applied before deciding to terminate that round of
replication in error. In the case where we've simply tried to do the
operations in the wrong order at least ONE of the outstanding requests
will complete in the right order, leading the subsequent round to have
fewer operations to do, with a smaller likelihood of graph validation
errors.

This does not address all scenarios, but for scenarios where the edits
are being applied in the wrong order this should avoid replication
halting.

Fixes #9319

The scenario that is NOT ADDRESSED by this PR is as follows:

1. create: service-defaults: name=new-web, protocol=http
2. create: service-defaults: name=old-web, protocol=http
3. create: service-resolver: name=old-web, redirect-to=new-web
4. delete: service-resolver: name=old-web
5. update: service-defaults: name=old-web, protocol=grpc
6. update: service-defaults: name=new-web, protocol=grpc
7. create: service-resolver: name=old-web, redirect-to=new-web

If you shutdown dc2 just before (4) and turn it back on after (7)
replication is impossible as there is no single edit you can make to
make forward progress.
rboyer added a commit that referenced this issue Feb 23, 2022
rboyer added a commit that referenced this issue Feb 23, 2022
…cation in some circumstances (#12307)

hc-github-team-consul-core pushed a commit that referenced this issue Feb 23, 2022
hc-github-team-consul-core pushed a commit that referenced this issue Feb 23, 2022
@rboyer
Member

rboyer commented Feb 28, 2022

A partial fix for most scenarios should go out in the next patch releases of Consul 1.11, 1.10, and 1.9 via #12307.
