Service registered using /v1/catalog/register are removed during anti-entropy sync #7513

Open
remilapeyre opened this issue Mar 26, 2020 · 2 comments
Labels
theme/operator-usability Replaces UX. Anything related to making things easier for the practitioner

Comments

@remilapeyre
Contributor

remilapeyre commented Mar 26, 2020

Services registered using /v1/catalog/register get removed during the anti-entropy sync. As far as I can tell, this does not happen when using the agent endpoint.

I noticed this while looking into hashicorp/terraform-provider-consul#187.
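For comparison with the catalog call in the reproduction steps, the same service can be registered through the local agent instead. This makes it part of the agent's own state, which per the behavior reported above is not removed by anti-entropy. This is a minimal sketch, assuming an agent on the default HTTP address. `CONSUL_ADDR` and `agent_register_service` are hypothetical names introduced here. The `/v1/agent/service/register` endpoint and its `Name` and `Port` fields are the documented agent API.

```shell
# Address of the local agent's HTTP API (CONSUL_ADDR is a name chosen for this sketch).
CONSUL_ADDR=${CONSUL_ADDR:-http://127.0.0.1:8500}

# agent_register_service NAME PORT
# Hypothetical helper: registers a service with the local *agent*, so the agent
# owns it and its anti-entropy sync pushes it to the catalog rather than deleting it.
agent_register_service() {
  svc_name=$1
  svc_port=$2
  curl -sS -X PUT \
    -d "{\"Name\":\"$svc_name\",\"Port\":$svc_port}" \
    "$CONSUL_ADDR/v1/agent/service/register"
}
```

Running `agent_register_service redis 6379` against the dev agent from the reproduction, then re-querying `/v1/catalog/services` after the next sync, is a quick way to compare the two registration paths.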

Reproduction Steps

$ consul agent -dev >consul.log &
[1] 41646
$ curl -X PUT -d '{"Node":"Remis-MacBook-Pro.local", "Address":"127.0.0.1", "Service":{ "Service":"redis", "Port":6379}}' http://localhost:8500/v1/catalog/register

true
$ curl http://localhost:8500/v1/catalog/services
{
    "consul": [],
    "redis": []
}
$ curl http://localhost:8500/v1/catalog/services  # one minute later
{
    "consul": []
}

Consul log

==> Starting Consul agent...
           Version: 'v1.7.2'
           Node ID: '38fac0ab-c53d-29cd-367d-164559a3c5d1'
         Node name: 'Remis-MacBook-Pro.local'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: 8502, DNS: 8600)
      Cluster Addr: 127.0.0.1 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false, Auto-Encrypt-TLS: false

==> Log data will now stream in as it occurs:

    2020-03-26T15:13:08.268+0100 [DEBUG] agent: Using random ID as node ID: id=38fac0ab-c53d-29cd-367d-164559a3c5d1
    2020-03-26T15:13:08.269+0100 [WARN]  agent: Node name will not be discoverable via DNS due to invalid characters. Valid characters include all alpha-numerics and dashes.: node_name=Remis-MacBook-Pro.local
    2020-03-26T15:13:08.269+0100 [DEBUG] agent.tlsutil: Update: version=1
    2020-03-26T15:13:08.270+0100 [DEBUG] agent.tlsutil: OutgoingRPCWrapper: version=1
    2020-03-26T15:13:08.270+0100 [INFO]  agent.server.raft: initial configuration: index=1 servers="[{Suffrage:Voter ID:38fac0ab-c53d-29cd-367d-164559a3c5d1 Address:127.0.0.1:8300}]"
    2020-03-26T15:13:08.270+0100 [INFO]  agent.server.raft: entering follower state: follower="Node at 127.0.0.1:8300 [Follower]" leader=
    2020-03-26T15:13:08.271+0100 [INFO]  agent.server.serf.wan: serf: EventMemberJoin: Remis-MacBook-Pro.local.dc1 127.0.0.1
    2020-03-26T15:13:08.271+0100 [INFO]  agent.server.serf.lan: serf: EventMemberJoin: Remis-MacBook-Pro.local 127.0.0.1
    2020-03-26T15:13:08.271+0100 [INFO]  agent.server: Adding LAN server: server="Remis-MacBook-Pro.local (Addr: tcp/127.0.0.1:8300) (DC: dc1)"
    2020-03-26T15:13:08.271+0100 [INFO]  agent.server: Handled event for server in area: event=member-join server=Remis-MacBook-Pro.local.dc1 area=wan
    2020-03-26T15:13:08.271+0100 [INFO]  agent: Started DNS server: address=127.0.0.1:8600 network=tcp
    2020-03-26T15:13:08.271+0100 [INFO]  agent: Started DNS server: address=127.0.0.1:8600 network=udp
    2020-03-26T15:13:08.272+0100 [INFO]  agent: Started HTTP server: address=127.0.0.1:8500 network=tcp
    2020-03-26T15:13:08.272+0100 [INFO]  agent: Started gRPC server: address=127.0.0.1:8502 network=tcp
    2020-03-26T15:13:08.272+0100 [INFO]  agent: started state syncer
==> Consul agent running!
    2020-03-26T15:13:08.338+0100 [WARN]  agent.server.raft: heartbeat timeout reached, starting election: last-leader=
    2020-03-26T15:13:08.339+0100 [INFO]  agent.server.raft: entering candidate state: node="Node at 127.0.0.1:8300 [Candidate]" term=2
    2020-03-26T15:13:08.339+0100 [DEBUG] agent.server.raft: votes: needed=1
    2020-03-26T15:13:08.339+0100 [DEBUG] agent.server.raft: vote granted: from=38fac0ab-c53d-29cd-367d-164559a3c5d1 term=2 tally=1
    2020-03-26T15:13:08.339+0100 [INFO]  agent.server.raft: election won: tally=1
    2020-03-26T15:13:08.339+0100 [INFO]  agent.server.raft: entering leader state: leader="Node at 127.0.0.1:8300 [Leader]"
    2020-03-26T15:13:08.339+0100 [INFO]  agent.server: cluster leadership acquired
    2020-03-26T15:13:08.339+0100 [INFO]  agent.server: New leader elected: payload=Remis-MacBook-Pro.local
Processing server acl mode for: Remis-MacBook-Pro.local - 0
    2020-03-26T15:13:08.339+0100 [INFO]  agent.server: Cannot upgrade to new ACLs: leaderMode=0 mode=0 found=true leader=127.0.0.1:8300
    2020-03-26T15:13:08.340+0100 [DEBUG] connect.ca.consul: consul CA provider configured: id=07:80:c8:de:f6:41:86:29:8f:9c:b8:17:d6:48:c2:d5:c5:5c:7f:0c:03:f7:cf:97:5a:a7:c1:68:aa:23:ae:81 is_primary=true
    2020-03-26T15:13:08.351+0100 [INFO]  agent.server.connect: initialized primary datacenter CA with provider: provider=consul
    2020-03-26T15:13:08.351+0100 [INFO]  agent.leader: started routine: routine="CA root pruning"
    2020-03-26T15:13:08.351+0100 [DEBUG] agent.server: Skipping self join check for node since the cluster is too small: node=Remis-MacBook-Pro.local
    2020-03-26T15:13:08.351+0100 [INFO]  agent.server: member joined, marking health alive: member=Remis-MacBook-Pro.local
    2020-03-26T15:13:08.525+0100 [DEBUG] agent: Skipping remote check since it is managed automatically: check=serfHealth
    2020-03-26T15:13:08.525+0100 [INFO]  agent: Synced node info
    2020-03-26T15:13:08.525+0100 [DEBUG] agent: Node info in sync
    2020-03-26T15:13:08.724+0100 [DEBUG] agent: Skipping remote check since it is managed automatically: check=serfHealth
    2020-03-26T15:13:08.724+0100 [DEBUG] agent: Node info in sync
    2020-03-26T15:13:10.341+0100 [DEBUG] agent.tlsutil: OutgoingRPCWrapper: version=1
    2020-03-26T15:13:39.142+0100 [DEBUG] agent.http: Request finished: method=PUT url=/v1/catalog/register from=127.0.0.1:49649 latency=629.466µs
    2020-03-26T15:14:08.347+0100 [DEBUG] agent.server: Skipping self join check for node since the cluster is too small: node=Remis-MacBook-Pro.local
    2020-03-26T15:14:34.208+0100 [DEBUG] agent.http: Request finished: method=GET url=/v1/catalog/services from=127.0.0.1:49676 latency=85.537µs
    2020-03-26T15:14:54.679+0100 [DEBUG] agent: Skipping remote check since it is managed automatically: check=serfHealth
    2020-03-26T15:14:54.680+0100 [INFO]  agent: Synced node info
    2020-03-26T15:14:54.680+0100 [INFO]  agent: Deregistered service: service=redis
    2020-03-26T15:15:08.285+0100 [DEBUG] agent.server.router.manager: Rebalanced servers, new active server: number_of_servers=1 active_server="Remis-MacBook-Pro.local.dc1 (Addr: tcp/127.0.0.1:8300) (DC: dc1)"
    2020-03-26T15:15:08.352+0100 [DEBUG] agent.server: Skipping self join check for node since the cluster is too small: node=Remis-MacBook-Pro.local
    2020-03-26T15:15:27.856+0100 [DEBUG] agent.http: Request finished: method=GET url=/v1/catalog/services from=127.0.0.1:49694 latency=63.706µs
    2020-03-26T15:15:42.514+0100 [INFO]  agent: Caught: signal=interrupt
    2020-03-26T15:15:42.514+0100 [INFO]  agent: Graceful shutdown disabled. Exiting
    2020-03-26T15:15:42.514+0100 [INFO]  agent: Requesting shutdown
    2020-03-26T15:15:42.514+0100 [INFO]  agent.server: shutting down server
    2020-03-26T15:15:42.514+0100 [DEBUG] agent.leader: stopping routine: routine="CA root pruning"
    2020-03-26T15:15:42.514+0100 [WARN]  agent.server.serf.lan: serf: Shutdown without a Leave
    2020-03-26T15:15:42.514+0100 [DEBUG] agent.leader: stopped routine: routine="CA root pruning"
    2020-03-26T15:15:42.514+0100 [WARN]  agent.server.serf.wan: serf: Shutdown without a Leave
    2020-03-26T15:15:42.514+0100 [INFO]  agent.server.router.manager: shutting down
    2020-03-26T15:15:42.514+0100 [INFO]  agent: consul server down
    2020-03-26T15:15:42.514+0100 [INFO]  agent: shutdown complete
    2020-03-26T15:15:42.514+0100 [INFO]  agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=tcp
    2020-03-26T15:15:42.514+0100 [INFO]  agent: Stopping server: protocol=DNS address=127.0.0.1:8600 network=udp
    2020-03-26T15:15:42.514+0100 [INFO]  agent: Stopping server: protocol=HTTP address=127.0.0.1:8500 network=tcp
    2020-03-26T15:15:42.515+0100 [INFO]  agent: Waiting for endpoints to shut down
    2020-03-26T15:15:42.515+0100 [INFO]  agent: Endpoints down
    2020-03-26T15:15:42.515+0100 [INFO]  agent: Exit code: code=1
@remilapeyre changed the title from "Service registered using /v1/catalog/register are removed during ani-entropy sync" to "Service registered using /v1/catalog/register are removed during anti-entropy sync" Mar 26, 2020
@mkeeler
Member

mkeeler commented Mar 26, 2020

The core problem here is that a Consul agent assumes it is the source of truth for all services registered with its node. When it sees the Terraform-registered service that it knows nothing about, it treats it just like a service that was registered locally and then deregistered, and it removes it.
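The agent's decision can be modeled as a set difference: anything the catalog lists for the agent's own node that the agent has no local definition for gets deregistered. The function below is a simplified, hypothetical model of one sync pass, not Consul's code. It ignores health checks and the push of local changes. It also exempts the built-in `consul` service, which Consul's local state sync appears to leave alone because that service is managed automatically.

```shell
# services_to_deregister LOCAL_IDS CATALOG_IDS
# Hypothetical model of one anti-entropy pass for the agent's own node.
# Both arguments are newline-separated service IDs:
#   LOCAL_IDS   - services the agent itself knows about (its local state)
#   CATALOG_IDS - services the catalog currently lists for this node
# Prints every catalog service the agent would remove.
services_to_deregister() {
  local_ids=$1
  catalog_ids=$2
  printf '%s\n' "$catalog_ids" | while IFS= read -r svc_id; do
    # Skip blank lines produced by empty input.
    if [ -z "$svc_id" ]; then continue; fi
    # The built-in "consul" service is managed automatically and left alone.
    if [ "$svc_id" = "consul" ]; then continue; fi
    # Unknown to the agent, so treated as locally deleted and deregistered.
    if ! printf '%s\n' "$local_ids" | grep -qxF -- "$svc_id"; then
      printf '%s\n' "$svc_id"
    fi
  done
}
```

With the state from the reproduction, `services_to_deregister "" "$(printf 'consul\nredis')"` prints `redis`. This matches the `agent: Deregistered service: service=redis` log line.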

"External" nodes can be registered and services for them managed in this manner.

The question this raises is whether it would be a good idea for Consul to support some way to register a service via the Catalog API, mark it as "managed externally", and still have the agent run its health checks.

There are many UX-related things to consider here, such as what happens if someone then goes directly to the Agent APIs to register or update that same service. I am sure there are others too.

@jsosulska added the ux label Mar 30, 2020
@remilapeyre
Contributor Author

The documentation at https://www.consul.io/api/catalog.html#register-entity could be improved. It currently reads:

It is usually preferable to instead use the agent endpoints for registration as they are simpler and perform anti-entropy.

But it is never appropriate to register internal services (services on a node that runs its own Consul agent) that way.

When the node already exists, would it be possible to forward the registration call to its agent so that the agent registers the new service?
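Until something like this exists on the server side, a client such as the Terraform provider could make the same choice itself. It would use the agent endpoint when the target node is the agent it is talking to, and the catalog endpoint otherwise. The following is a hypothetical sketch of that client-side routing. How the caller learns the agent's node name is left out; one possible source is the `Config.NodeName` field returned by `GET /v1/agent/self`. The sketch only covers the single agent at `CONSUL_ADDR`.

```shell
# Address of the agent this client talks to (CONSUL_ADDR is a name chosen for this sketch).
CONSUL_ADDR=${CONSUL_ADDR:-http://127.0.0.1:8500}

# register_service TARGET_NODE AGENT_NODE ADDRESS SERVICE PORT
# Hypothetical client-side "forwarding": if the service belongs to the node of
# the agent at CONSUL_ADDR, register it through that agent so anti-entropy keeps
# it; otherwise treat the node as external and write to the catalog directly.
register_service() {
  target_node=$1
  agent_node=$2
  if [ "$target_node" = "$agent_node" ]; then
    # Agent path: the agent owns the node, so it must own the service too.
    curl -sS -X PUT \
      -d "{\"Name\":\"$4\",\"Port\":$5}" \
      "$CONSUL_ADDR/v1/agent/service/register"
  else
    # Catalog path: ADDRESS is the external node's address.
    curl -sS -X PUT \
      -d "{\"Node\":\"$target_node\",\"Address\":\"$3\",\"Service\":{\"Service\":\"$4\",\"Port\":$5}}" \
      "$CONSUL_ADDR/v1/catalog/register"
  fi
}
```

With the node from the reproduction, `register_service Remis-MacBook-Pro.local Remis-MacBook-Pro.local 127.0.0.1 redis 6379` would take the agent path instead of the catalog call that was later removed.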

@jsosulska added the theme/operator-usability label and removed the ux label Apr 15, 2020
3 participants