Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: partly fix config entry replication issue that prevents replication in some circumstances #12307

Merged
merged 4 commits into from
Feb 23, 2022

Conversation

rboyer
Copy link
Member

@rboyer rboyer commented Feb 10, 2022

There are some cross-config-entry relationships that are enforced during
"graph validation" at persistence time that are required to be
maintained. This means that config entries may form a digraph at times.

Config entry replication proceeds in a particular sorted order by kind
and name.

Occasionally there are some fixups to these digraphs that end up
replicating in the wrong order and replicating the leaves
(ingress-gateway) before the roots (service-defaults) leading to
replication halting due to a graph validation error related to things
like mismatched service protocol requirements.

This PR changes replication to give each computed change (upsert/delete)
a fair shot at being applied before deciding to terminate that round of
replication in error. In the case where we've simply tried to do the
operations in the wrong order at least ONE of the outstanding requests
will complete in the right order, leading the subsequent round to have
fewer operations to do, with a smaller likelihood of graph validation
errors.

This does not address all scenarios, but for scenarios where the edits
are being applied in the wrong order this should avoid replication
halting.

Fixes #9319

The scenario that is NOT ADDRESSED by this PR is as follows:

  1. create: service-defaults: name=new-web, protocol=http
  2. create: service-defaults: name=old-web, protocol=http
  3. create: service-resolver: name=old-web, redirect-to=new-web
  4. delete: service-resolver: name=old-web
  5. update: service-defaults: name=old-web, protocol=grpc
  6. update: service-defaults: name=new-web, protocol=grpc
  7. create: service-resolver: name=old-web, redirect-to=new-web

If you shutdown dc2 just before (4) and turn it back on after (7)
replication is impossible as there is no single edit you can make to
make forward progress.

@rboyer rboyer requested a review from a team February 10, 2022 20:44
@rboyer rboyer self-assigned this Feb 10, 2022
@vercel vercel bot temporarily deployed to Preview – consul February 10, 2022 20:46 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging February 10, 2022 20:46 Inactive
@@ -214,9 +217,10 @@ func (s *Server) replicateConfig(ctx context.Context, lastRemoteIndex uint64, lo
return 0, true, nil
}
if err != nil {
return 0, false, fmt.Errorf("failed to delete local config entries: %v", err)
merr = multierror.Append(merr, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to preserve the error message here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the inner errors will be of the form Failed to apply config entry <delete|upsert>: <error with details> so since the delete/update portion was already present I didn't see the value.

@@ -228,9 +232,14 @@ func (s *Server) replicateConfig(ctx context.Context, lastRemoteIndex uint64, lo
return 0, true, nil
}
if err != nil {
return 0, false, fmt.Errorf("failed to update local config entries: %v", err)
merr = multierror.Append(merr, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to preserve the error message here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

agent/consul/config_replication_test.go Outdated Show resolved Hide resolved
agent/consul/config_replication_test.go Outdated Show resolved Hide resolved
agent/consul/config_replication_test.go Outdated Show resolved Hide resolved
@vercel vercel bot temporarily deployed to Preview – consul February 23, 2022 17:39 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging February 23, 2022 17:39 Inactive
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging February 23, 2022 17:41 Inactive
@vercel vercel bot temporarily deployed to Preview – consul February 23, 2022 17:41 Inactive
@rboyer rboyer requested a review from kisunji February 23, 2022 17:42
…cation in some circumstances

There are some cross-config-entry relationships that are enforced during
"graph validation" at persistence time that are required to be
maintained. This means that config entries may form a digraph at times.

Config entry replication procedes in a particular sorted order by kind
and name.

Occasionally there are some fixups to these digraphs that end up
replicating in the wrong order and replicating the leaves
(ingress-gateway) before the roots (service-defaults) leading to
replication halting due to a graph validation error related to things
like mismatched service protocol requirements.

This PR changes replication to give each computed change (upsert/delete)
a fair shot at being applied before deciding to terminate that round of
replication in error. In the case where we've simply tried to do the
operations in the wrong order at least ONE of the outstanding requests
will complete in the right order, leading the subsequent round to have
fewer operations to do, with a smaller likelihood of graph validation
errors.

This does not address all scenarios, but for scenarios where the edits
are being applied in the wrong order this should avoid replication
halting.

Fixes #9319

The scenario that is NOT ADDRESSED by this PR is as follows:

1. create: service-defaults: name=new-web, protocol=http
2. create: service-defaults: name=old-web, protocol=http
3. create: service-resolver: name=old-web, redirect-to=new-web
4. delete: service-resolver: name=old-web
5. update: service-defaults: name=old-web, protocol=grpc
6. update: service-defaults: name=new-web, protocol=grpc
7. create: service-resolver: name=old-web, redirect-to=new-web

If you shutdown dc2 just before (4) and turn it back on after (7)
replication is impossible as there is no single edit you can make to
make forward progress.
@rboyer rboyer force-pushed the fix-some-config-entry-replication branch from 7adca29 to 9f0eccc Compare February 23, 2022 18:36
@vercel vercel bot temporarily deployed to Preview – consul-ui-staging February 23, 2022 18:36 Inactive
@vercel vercel bot temporarily deployed to Preview – consul February 23, 2022 18:36 Inactive
@rboyer rboyer merged commit 6427128 into main Feb 23, 2022
@rboyer rboyer deleted the fix-some-config-entry-replication branch February 23, 2022 23:27
@hc-github-team-consul-core
Copy link
Collaborator

🍒 If backport labels were added before merging, cherry-picking will start automatically.

To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/594115.

@hc-github-team-consul-core
Copy link
Collaborator

🍒✅ Cherry pick of commit 6427128 onto release/1.11.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Feb 23, 2022
…cation in some circumstances (#12307)

There are some cross-config-entry relationships that are enforced during
"graph validation" at persistence time that are required to be
maintained. This means that config entries may form a digraph at times.

Config entry replication procedes in a particular sorted order by kind
and name.

Occasionally there are some fixups to these digraphs that end up
replicating in the wrong order and replicating the leaves
(ingress-gateway) before the roots (service-defaults) leading to
replication halting due to a graph validation error related to things
like mismatched service protocol requirements.

This PR changes replication to give each computed change (upsert/delete)
a fair shot at being applied before deciding to terminate that round of
replication in error. In the case where we've simply tried to do the
operations in the wrong order at least ONE of the outstanding requests
will complete in the right order, leading the subsequent round to have
fewer operations to do, with a smaller likelihood of graph validation
errors.

This does not address all scenarios, but for scenarios where the edits
are being applied in the wrong order this should avoid replication
halting.

Fixes #9319

The scenario that is NOT ADDRESSED by this PR is as follows:

1. create: service-defaults: name=new-web, protocol=http
2. create: service-defaults: name=old-web, protocol=http
3. create: service-resolver: name=old-web, redirect-to=new-web
4. delete: service-resolver: name=old-web
5. update: service-defaults: name=old-web, protocol=grpc
6. update: service-defaults: name=new-web, protocol=grpc
7. create: service-resolver: name=old-web, redirect-to=new-web

If you shutdown dc2 just before (4) and turn it back on after (7)
replication is impossible as there is no single edit you can make to
make forward progress.
@hc-github-team-consul-core
Copy link
Collaborator

🍒✅ Cherry pick of commit 6427128 onto release/1.10.x succeeded!

hc-github-team-consul-core pushed a commit that referenced this pull request Feb 23, 2022
…cation in some circumstances (#12307)

There are some cross-config-entry relationships that are enforced during
"graph validation" at persistence time that are required to be
maintained. This means that config entries may form a digraph at times.

Config entry replication procedes in a particular sorted order by kind
and name.

Occasionally there are some fixups to these digraphs that end up
replicating in the wrong order and replicating the leaves
(ingress-gateway) before the roots (service-defaults) leading to
replication halting due to a graph validation error related to things
like mismatched service protocol requirements.

This PR changes replication to give each computed change (upsert/delete)
a fair shot at being applied before deciding to terminate that round of
replication in error. In the case where we've simply tried to do the
operations in the wrong order at least ONE of the outstanding requests
will complete in the right order, leading the subsequent round to have
fewer operations to do, with a smaller likelihood of graph validation
errors.

This does not address all scenarios, but for scenarios where the edits
are being applied in the wrong order this should avoid replication
halting.

Fixes #9319

The scenario that is NOT ADDRESSED by this PR is as follows:

1. create: service-defaults: name=new-web, protocol=http
2. create: service-defaults: name=old-web, protocol=http
3. create: service-resolver: name=old-web, redirect-to=new-web
4. delete: service-resolver: name=old-web
5. update: service-defaults: name=old-web, protocol=grpc
6. update: service-defaults: name=new-web, protocol=grpc
7. create: service-resolver: name=old-web, redirect-to=new-web

If you shutdown dc2 just before (4) and turn it back on after (7)
replication is impossible as there is no single edit you can make to
make forward progress.
@hc-github-team-consul-core
Copy link
Collaborator

🍒❌ Cherry pick of commit 6427128 onto release/1.9.x failed! Build Log

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Config Entry replication of ingress-gateway entries fails validation in secondary datacenter
4 participants