Unable to deregister a service #1188

Open
drsnyder opened this Issue Aug 20, 2015 · 40 comments

@drsnyder

I brought this to the attention of the mailing list here. @slackpad asked me to go ahead and file a bug. Below is a summary of the issue from the discussion thread.

We have services that are being orphaned and we cannot deregister them. The orphans show up under one or more of the master nodes. In our configuration the master nodes are dev-consul, dev-consul-s1, and dev-broker.

The health check of the orphaned node looks something like the following:

{
    "Node": "dev-consul",
    "CheckID": "service:discussion_8080",
    "Name": "Service 'discussion' check",
    "ServiceName": "discussion",
    "Notes": "",
    "Status": "critical",
    "ServiceID": "discussion_8080",
    "Output": ""
}

I attempted to deregister via:

user@dev-consul $ curl -X PUT -d '{"CheckID": "service:discussion_8080", "ServiceID": "discussion_8080", "Node": "dev-consul", "Datacenter": "dev"}' http://localhost:8500/v1/catalog/deregister

The node was removed but then reappeared within 30-60s. As @slackpad recommended, I tried deregistering with:

user@dev-consul $ curl -v http://localhost:8500/v1/agent/service/deregister/discussion_8080
user@dev-consul $ curl -v -X PUT -d'{"CheckID": "service:discussion_8080", "ServiceID": "discussion_8080", "Node": "dev-consul", "Datacenter": "dev"}' http://localhost:8500/v1/catalog/deregister

Both commands returned status 200 OK. But that also failed. You can see the output in this gist as well as the debug logs from consul.

From the debug logs in consul we see:

Aug 20 16:57:45 dev-broker consul[2221]: agent: Deregistered service 'discussion_8080'
Aug 20 16:57:45 dev-broker consul[2221]: agent: Check 'service:discussion_8080' in sync
Aug 20 16:57:45 dev-broker consul[2221]: agent: Deregistered check 'service:discussion_8080'
Aug 20 16:57:45 dev-broker consul[2221]: http: Request /v1/agent/service/deregister/discussion_8080 (19.73968ms)
Aug 20 16:57:46 dev-broker consul[2221]: http: Request /v1/agent/check/pass/service:discussion_8080, error: CheckID does not have associated TTL
Aug 20 16:57:46 dev-broker consul[2221]: http: Request /v1/agent/check/pass/service:discussion_8080 (246.298µs)
Aug 20 16:57:47 dev-broker consul[2221]: agent: Synced service 'discussion_8080' <--- SHADY!

The annotation is from @slackpad.

It's also noteworthy that the orphans are always associated with one of the master nodes (e.g. dev-consul) and not the node (dev-mesos) that's running the service that was registered. I should also mention (it could be a coincidence) that the service (discussion) is also flapping though from what I can tell from the debug logs for consul on dev-mesos everything is fine.

Our consul version:

$ consul version
Consul v0.5.2
Consul Protocol: 2 (Understands back to: 1)

Thanks!

@slackpad slackpad self-assigned this Aug 25, 2015
@drsnyder
drsnyder commented Sep 8, 2015

I'm not sure if this helps with the solution to the problem, but the services can be deregistered if you deregister them on all of the servers in the cluster using the local agent at more or less the same time. See this tool for what we used to force the de-registration.

So in our case, I ran the linked tool above on the three servers in the cluster. It removed about 250 orphaned services that couldn't otherwise be deregistered.

@milosgajdos83

We are seeing something equally obscure in Consul. I'm at a complete loss to understand what is going on, but it seems somewhat similar to the issue described above.

Consul version:

# consul version
Consul v0.5.2
Consul Protocol: 2 (Understands back to: 1)

One of the services registered with Consul dies, but Consul never removes the registered key, although it seems like it does. We figured we would use Consul's HTTP API to deregister the service. Pointless exercise, as we learnt later. Even though Consul seems to think, for a short time, that the record has been removed, the removed data then reappears out of the blue, and we are totally clueless as to why.

Here's the actual description:

We can curl the registered service at the beginning as expected

$ curl node1:8500/v1/catalog/service/my_service | python -mjson.tool
[
    {
        "Address": "1.2.3.4",
        "Node": "my_service",
        "ServiceAddress": "0.0.0.0",
        "ServiceID": "my_service:9042",
        "ServiceName": "my_service",
        "ServicePort": 9042,
        "ServiceTags": null
    }
]

We can query Consul and get the expected reply (ignore the actual IP):

$ dig -p 8500 @consul_node1 my_service.service.dc1.consul +short
1.2.3.4
$

$ dig -p 8500 @consul_node2 my_service.service.dc1.consul +short
1.2.3.4
$

$ dig -p 8500 @node3 my_service.service.dc1.consul +short
1.2.3.4
$

Now we try to deregister the service. This is the JSON payload:

$ cat my_service.json
{
  "Datacenter": "dc1",
  "Node": "my-service",
  "ServiceID": "my-service:9042"
}

We PUT it to the leader in the cluster (which is node1) - this goes fine as expected on every node in the cluster:

$ curl -X PUT -d @my_service.json node1:8500/v1/catalog/deregister
true
$
$ curl node1:8500/v1/catalog/service/my_service
[]
$
$ curl node2:8500/v1/catalog/service/my_service
[]
$
$ curl node3:8500/v1/catalog/service/my_service
[]
$

Then in about a minute or so, this happens:

$ tail logs (on node1)
…
…
2015/09/30 19:21:52 [INFO] agent: Synced service 'my_service:9042'
$

Curling service catalog indeed returns the entry:

$ curl node1:8500/v1/catalog/service/my_service | python -mjson.tool
[
    {
        "Address": "1.2.3.4",
        "Node": "my_service",
        "ServiceAddress": "0.0.0.0",
        "ServiceID": "my_service:9042",
        "ServiceName": "my_service",
        "ServicePort": 9042,
        "ServiceTags": null
    }
]

Now, can someone tell me what is going on here ?

@drsnyder
drsnyder commented Oct 1, 2015

I don't know the specifics of what's going on but what we have learned is that when this happens you have to deregister the service from all the consul servers. So, if you have three then you need to deregister the service from all three.

We have been using this tool to clean them up. We have plans to productize it as an orphaned service reaper but we aren't there yet.

@milosgajdos83

Thanks, I'll check it out. Nevertheless this is something I'd love to understand as random data re-appearance does not fill me with confidence if I'm entirely honest.

Bugs happen in every piece of software, but I'd love to understand the actual underlying problem so it does not surprise me at 3 AM, as Murphy's law would have it.

@volkantufekci

Hi,
I'm running a single-node Consul (v0.5.2) and have a similar issue. I deregister a service via diplomat (a Ruby client) and Consul prints to its stdout:

2015/11/03 11:33:08 [INFO] agent: Deregistered service 'vcs4'

But "vcs4" can still be observed in http api and web ui.

@volkantufekci

My issue is solved.
The problem was the message in Consul's output. It says a service is "deregistered" even if it doesn't exist. For example, I don't have a service registered with serviceID "THIS_DOES_NOT_EXIST", but when I call

curl  http://CONSUL_AGENT_URL:8500/v1/agent/service/deregister/THIS_DOES_NOT_EXIST

Consul logs as:

2015/11/03 15:11:56 [INFO] agent: Deregistered service 'THIS_DOES_NOT_EXIST'

So, in my case I was trying to deregister with the wrong serviceID, and Consul's output misled me: it says the service is deregistered instead of warning that there is no service with that ID.
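One way to avoid being misled by that log line is to check the agent's own service list before deregistering. A minimal sketch, assuming the body returned by GET /v1/agent/services is a JSON object keyed by service ID; the helper name `is_registered` and the sample payload are made up for illustration:

```python
import json


def is_registered(agent_services_json: str, service_id: str) -> bool:
    """Return True if service_id is a key in an /v1/agent/services response body."""
    services = json.loads(agent_services_json)
    return service_id in services


# Invented sample body; a real one would come from GET /v1/agent/services.
sample = '{"web_8080": {"ID": "web_8080", "Service": "web", "Tags": null, "Port": 8080}}'

for candidate in ("web_8080", "THIS_DOES_NOT_EXIST"):
    if is_registered(sample, candidate):
        print(f"{candidate}: registered on this agent")
    else:
        print(f"{candidate}: not registered here, a deregister call would change nothing")
```

With a check like this in front of the deregister call, a typo in the service ID shows up immediately instead of hiding behind a success message.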

@thpham
thpham commented Dec 4, 2015

Hello,

I had a similar problem trying to deregister services created with registrator for Docker containers. It took me half a day to notice that the ServiceID was generated with special characters and that I had to call the API endpoint with a URL-encoded string! @milosgajdos83, taking your previous example, you should call the API like this:

curl -v -X PUT http://CONSUL_AGENT_URL:8500/v1/agent/service/deregister/my_service%3A9042

CONSUL_AGENT_URL should be the node hostname/ip where the agent registered the service.
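If you build the URL in code rather than by hand, the encoding can be done for you. A small sketch using Python's standard library; `deregister_url` is a hypothetical helper name, and CONSUL_AGENT_URL is the same placeholder as above:

```python
from urllib.parse import quote


def deregister_url(agent_base_url: str, service_id: str) -> str:
    """Build the agent deregister URL with the service ID percent-encoded."""
    # safe="" makes quote() escape every reserved character, including ":" and "/".
    encoded_id = quote(service_id, safe="")
    return f"{agent_base_url.rstrip('/')}/v1/agent/service/deregister/{encoded_id}"


print(deregister_url("http://CONSUL_AGENT_URL:8500", "my_service:9042"))
# prints http://CONSUL_AGENT_URL:8500/v1/agent/service/deregister/my_service%3A9042
```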

Hope it helps some people :-)

@codelotus

It would appear that the error checking in the Consul HTTP API is not complete (I have not looked at the code to verify this), hence the successful response from a failed deregistration. @milosgajdos83 I was able to successfully register and deregister your service by changing the format of the JSON and by using the /v1/catalog/ endpoint.

To register a service

curl -XPUT -d @consulServiceRegister.json http://localhost:8500/v1/catalog/register

where consulServiceRegister.json is:

{
  "Address": "1.2.3.4",
  "Node": "test-node",
  "Service": {
    "ID": "my_service:9042",
    "Service": "my_service",
    "Address": "0.0.0.0",
    "Port": 9042
  }
}

To deregister a service (note the Address is required):

curl -XPUT -d @consulServiceDeRegister.json http://localhost:8500/v1/catalog/deregister

where consulServiceDeRegister.json is:

{ 
  "Datacenter": "test-dc",
  "Node": "test-node",
  "ServiceID": "my-service:9042",
  "Address": "1.2.3.4"
}  

At this point the registered service has been successfully deregistered and after 15 minutes the service has not returned:

curl http://localhost:8500/v1/catalog/services                                                   
{"consul":[]}

@cjhkramer

The example from @codelotus works as long as you register with the catalog and not with the agent.
If you do the following call:

curl -XPUT -d @consulServiceRegisterAgent.json http://10.98.204.21:8500/v1/agent/service/register

Where consulServiceRegisterAgent.json is:

{
  "ID": "my_service:9042",
  "Name": "my_service",
  "Address": "1.2.3.4",
  "Port": 9042
}

And then do a deregister:

curl -XPUT -d @consulServiceDeRegister.json http://localhost:8500/v1/catalog/deregister

where consulServiceDeRegister.json is:

{ 
  "Datacenter": "test-dc",
  "Node": "test-node",
  "ServiceID": "my-service:9042",
  "Address": "1.2.3.4"
}  

The service will respawn in a minute or so :(

@slackpad slackpad added the bug label Jan 13, 2016
@cabrinoob

Same problem here. I have zombie services which come back to life whatever deregistration technique I use.

@peterklipfel

I'm load balancing with consul-template, and this is causing me some major headaches. Round robin load balancing to services that may or may not exist creates ridiculous, cascading networking bugs.

What I found was that the master said that one of my members had failed, but that member thought that it was still alive. I made the member leave, and then rejoin. This fixed the issue.

@slackpad
Contributor

Wanted to clarify - I think there are a few things going on here in this issue:

  1. The original problem posted by @drsnyder looks to be an issue with services registered on the Consul servers - that is an outstanding thing we need to track down.
  2. The error checking problem pointed out by @volkantufekci needs to be fixed because that adds to confusion by returning bogus success responses.
  3. The problems encountered by @milosgajdos83 and @cjhkramer look like a common source of confusion around using Consul. We need to beef up the docs on this - an explanation of this follows.

In Consul it's extremely rare to use the Catalog API directly. The Agent API (https://www.consul.io/docs/agent/http/agent.html) should almost always be used. For services running on Consul agents, the agent is the source of truth, not the catalog maintained by the servers. Periodically, the agents perform an anti-entropy sync and use the Catalog API internally to update the servers to have the correct state. This means that if you use the catalog API to deregister a service, it will disappear for a little while then the agent will put that back on the next sync. If you use the Agent API it will take care of removing the service from the catalog for you.

The call to https://www.consul.io/docs/agent/http/agent.html#agent_service_deregister should be made on the agent where the service is registered.
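To make the agent-versus-catalog behavior concrete, here is a toy model of it. This is only an illustration of the idea described above, not Consul's actual implementation; `ToyCatalog`, `ToyAgent` and their methods are invented names, and the real sync runs periodically rather than on demand:

```python
class ToyCatalog:
    """Stands in for the catalog kept by the servers."""

    def __init__(self):
        self.services = {}  # (node, service_id) -> service definition


class ToyAgent:
    """Stands in for an agent, which is the source of truth for its own services."""

    def __init__(self, node, catalog):
        self.node = node
        self.catalog = catalog
        self.local = {}  # service_id -> service definition

    def register(self, service_id, definition):
        self.local[service_id] = definition
        self.sync()

    def deregister(self, service_id):
        self.local.pop(service_id, None)
        self.sync()

    def sync(self):
        """Anti-entropy: make this node's catalog entries match local state."""
        stale = [key for key in self.catalog.services
                 if key[0] == self.node and key[1] not in self.local]
        for key in stale:
            del self.catalog.services[key]
        for service_id, definition in self.local.items():
            self.catalog.services[(self.node, service_id)] = definition


catalog = ToyCatalog()
agent = ToyAgent("dev-mesos", catalog)
agent.register("discussion_8080", {"Port": 8080})
key = ("dev-mesos", "discussion_8080")

# Deleting straight from the catalog only lasts until the next sync.
del catalog.services[key]
agent.sync()
print("after catalog delete + sync:", key in catalog.services)

# Deregistering through the agent removes it from both places.
agent.deregister("discussion_8080")
print("after agent deregister:", key in catalog.services)
```

In this model the first check reports the service as present again and the second reports it gone, which mirrors the "comes back after a minute" reports in this thread.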

@ch3lo
ch3lo commented Feb 19, 2016

I had zombie services in /v1/catalog/service/... but not in /v1/agent/services. I did a "consul rejoin" on the agent related to the zombies and they disappeared. I think something odd is going on with the entropy sync from slaves to master.
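For anyone who wants to spot these before rejoining, a rough sketch of the comparison. It assumes the response of /v1/catalog/node/<node> has a "Services" object keyed by service ID and that /v1/agent/services is keyed the same way; `zombie_service_ids` and the sample data are hypothetical:

```python
def zombie_service_ids(catalog_node: dict, agent_services: dict) -> list:
    """Return service IDs the catalog lists for a node but the node's agent does not."""
    catalog_ids = set((catalog_node.get("Services") or {}).keys())
    return sorted(catalog_ids - set(agent_services.keys()))


# Invented sample data in the assumed response shapes.
catalog_node = {
    "Node": {"Node": "dev-mesos", "Address": "10.0.0.5"},
    "Services": {
        "discussion_8080": {"ID": "discussion_8080", "Service": "discussion", "Port": 8080},
        "api_9000": {"ID": "api_9000", "Service": "api", "Port": 9000},
    },
}
agent_services = {
    "api_9000": {"ID": "api_9000", "Service": "api", "Port": 9000},
}

print(zombie_service_ids(catalog_node, agent_services))
```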

@doublerebel

Am being bitten by this today. Attempting to set maintenance mode on a nonexistent service correctly returns a 404. But, I can send any ID to a deregister endpoint and get a 200 OK, whether the service exists or not. I would expect any endpoint that takes an ID to return a 404 if that ID does not exist.

(I also can't seem to deregister a service with a . in the ID, despite that being a legal URL and not needing URL encoding EDIT: this might not be the case). Issues #1333, #1138, #1096 are related in case anyone there needs this thread.

I did notice that a successful service deregister also prints Deregistered check... to the logs. A nonexistent service has no checks. (I did make sure to do all this with the agent and not the catalog.)

Now I'm also wishing for a "deregister service" button in the UI, to solve this for me. Thanks all for your suggestions and helper examples.

@josegonzalez

Seems like the file for that service actually still exists on a box, even when issuing a deregistration to that box (testing with a single consul instance).

Removing the service file on the box and deregistering didn't appear to fix it. Neither did removing the local.snapshot on it. Removing both the local and remote snapshot did have an effect though.

@webertlima

Hello.
Is there any progress on this? I am having the issues described here and a lot of trouble.

@kbroughton

same. Pretty major flaw. Consul-template picks up the old service.

@lowzj
lowzj commented May 13, 2016

Hello.
Same problem here. I am trying to deregister some critical services from the Consul server; they were stopped but not deregistered correctly. Is there any progress?

@babbottscott
Contributor

For configuring a client consuming a service, would service health rather than service catalog be a more appropriate option? I may be underestimating the bug here, but ISTM service catalog is prone to extraneous data (either from new services not yet ready for consumption, or decommissioned services) by design.
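A sketch of what that could look like on the consuming side, assuming each entry from /v1/health/service/<name> carries "Node", "Service" and "Checks" fields; the function name and sample entries are invented. The endpoint also accepts a ?passing query parameter that filters on the server side, so this only shows the idea:

```python
def passing_instances(health_entries: list) -> list:
    """Return (address, port) for entries whose checks are all passing."""
    result = []
    for entry in health_entries:
        checks = entry.get("Checks") or []
        # An entry with no checks at all is treated as healthy in this sketch.
        if all(check.get("Status") == "passing" for check in checks):
            service = entry["Service"]
            # Fall back to the node address when the service has none set.
            address = service.get("Address") or entry["Node"]["Address"]
            result.append((address, service["Port"]))
    return result


# Invented sample entries: one healthy instance and one critical one.
entries = [
    {"Node": {"Node": "dev-mesos", "Address": "10.0.0.5"},
     "Service": {"ID": "discussion_8080", "Service": "discussion", "Address": "", "Port": 8080},
     "Checks": [{"CheckID": "service:discussion_8080", "Status": "passing"}]},
    {"Node": {"Node": "dev-consul", "Address": "10.0.0.9"},
     "Service": {"ID": "discussion_8080", "Service": "discussion", "Address": "", "Port": 8080},
     "Checks": [{"CheckID": "service:discussion_8080", "Status": "critical"}]},
]

print(passing_instances(entries))
```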

@alexykot

I can confirm that on version 0.6.4 I cannot reproduce this issue any more on a test setup.

I've built a small test setup with three consul agents sitting in containers on the same node talking to each other.

Then I've created a test service through endpoint PUT /v1/agent/service/register on one node, and confirmed it has propagated in seconds to other two agents and is available through GET /v1/catalog/services on each agent.

And when I deregistered service on the same agent it was created on with DELETE /v1/agent/service/deregister/test-service1 - it has gone away instantly from catalogs on all three nodes.

@n8gard
n8gard commented Jun 18, 2016

I just stood up the Consul UI in our environment then killed some EC2 instances which means they didn't gracefully leave the system. I see them as failed nodes in the UI--so far so good. But, when I click the Deregister button, they do go away. However, upon reloading the UI, they are there again. Have done this many times. It very well could be something wrong on my side as this is a very new environment and I'm doing this for the first time, but, it sounds exactly like this issue.

I'm on v0.6.4 on Ubuntu 14.04 LTS.

@skyrocknroll

@alexykot Actually the checks are registered by nomad in my case

@webertlima
webertlima commented Jun 21, 2016 edited

@alexykot

And when I deregistered service on the same agent it was created on with DELETE /v1/agent/service/deregister/test-service1 - it has gone away instantly from catalogs on all three nodes.

have you tried to deregister from other nodes different than the one you have used to register? That's when they come back. I'm not sure if it is supposed to work this way though.

@alexykot

@webertlima
You cannot deregister a service from the agent on a different node, service only exists on the agent you have registered with. It also exists in the catalog on all nodes, but that is not related to the agent itself. And to be honest I don't understand why there is a catalog/deregister endpoint at all, in my opinion catalog should be a read-only service list.

@webertlima

@alexykot thanks for clearing that up.

@flypenguin
flypenguin commented Jun 21, 2016 edited

I'm going crazy right now. I use consul as a service registry (which it apparently is), but I am completely unable to deregister services. I am trying to use the deregister endpoint, and I am seeing the exact same behavior - and it seriously f*cks with my network setup.

I use consul-template to configure haproxy for services which appear and vanish. Because of my system setup I use only one central agent to register services with, and it seems I will never be able to deregister them.

This is a superbly bad situation, and I really do not understand the point of the /deregister endpoint if it can't be used, and even with a read-only catalog I would assume I could remove services at some point. (What's the point of a distributed system if you have some weird logic about which nodes to use for some operations anyway?)

Update: I also tried de-registering on the node where the service runs, and still it's coming back.

I just. Don't. Get. It.

@flypenguin

I have now managed to get rid of those services, by stopping all consul versions, killing the data directory, and re-starting them. this is not the way to go IMHO. for a single test case the de-registration now seems to work fine, for whatever reason. I am thinking of moving away from consul as fast as possible now, because this kind of nondeterministic behavior makes it impossible to rely on it as a central infrastructure part, and consul currently is the backbone of my service management.

I really like consul though and would be super happy if there could be a solution for this.

@slackpad
Contributor

Hi @flypenguin sorry you are having trouble. There are some issues called out in #1188 (comment), but Consul's behavior is definitely deterministic. I think you are running into problems because of this:

I use only one central agent to register services with

Consul's really not designed to run all registrations through a central set of agents. In Consul, the agent holds the information about which services are registered, and then takes responsibility for syncing that information up to the catalog maintained by the servers. If you delete a service from the catalog, the agent will put it back (which I agree is confusing and we need to document that more clearly). To remove a service you always need to remove it using the Agent API and it will remove it from the catalog for you.

If you run an agent on each node and always register/deregister using that agent for the services on that node then things should work properly (and if that node dies all of its services will eventually be reaped automatically). If you are running a small number of agents and registering everything through those, setting the addresses manually, it is easy to lose track of where a service was registered, making it hard to remove it. I'd strongly recommend against running Consul like this - it also prevents reaping as described above, and things like sessions from working properly. Consul is designed to have the agent running on each node in the cluster.

The other issue on here where you get a 200 when deleting even if a service doesn't exist adds to the confusion; we will also fix that. Sorry for the trouble - hopefully you can get things working well in your setup!

@flypenguin
flypenguin commented Jun 23, 2016 edited

Okay, I understand. Three things come to my mind on this:

First, I actually did use the same instance for registering and de-registering. I also tried de-registration on ANY instance I have. It didn't work, and I assume it was because of the Agent-vs-nonAgent-API situation. This is - if at all - very poorly documented, and it seems to not work at all. So the very existence of this endpoint seems completely pointless (if not harmful) to me.

Second, I find the behavior very non-intuitive. I would expect a Consul cluster to propagate those events. IMO the nodes could still manage "their own" services even if not registered via them - the registered service should just "travel" to the node in question and stay there. Same, actually, with the deregistration. (or why bother including the service address if it really should only ever be "the" localhost?)

Third. Since "second" does not seem to behave like I thought it would I would expect a very clear, unmistakeable, prominent documentation about how this works, which I think the current one is not at the moment.

Some background maybe on my setup, so you understand why using a central instance seemed like a natural choice for me:

  • I am using Rancher to manage a containerized environment.
  • The services Rancher manages pop up on random hosts. I use a self-written service to enter those services in consul, to create a dynamic load balancing with consul template on it
  • The service registering the services in consul is running in a container as well, so it can't connect to "localhost" to get to Consul, nor does it know the actual host IP. That's why I give it a central consul instance to use. (Since consul is meant to be used differently as I realize now, I will change this behavior now)

Anyway, thanks a lot for your feedback!

@webertlima

@flypenguin
Hi, I see you are (or were) confused by some of the same things I was when using the Consul Agent.
Picture this:
If you have ONE Consul Server ONLY (the Master), you use this instance to register and deregister your services through the HTTP API, no matter how many services and nodes you have. This MUST work, unless you are doing something wrong with the request.

Now if you have MORE THAN ONE Consul server (like 1, 2 or 3 Consul Masters and other Consul Clients), if you use instance 2 to REGISTER, you HAVE to use the same instance 2 to DEREGISTER. It doesn't matter which Consul instance you use to register a service, as long as you use the same instance to deregister that same service (same ID).

Good practice: run the Consul Client on every machine that runs services and use 127.0.0.1 to register/deregister. It'll work. If the Service is in a Docker container, use its default route.
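For the Docker case, a rough sketch of finding the default route from inside a container by parsing the text of /proc/net/route. It assumes a Linux container on a little-endian host (the gateway column is hex in host byte order) and that the host's agent listens on an address reachable through that gateway; `default_gateway` and the sample table are made up for illustration:

```python
import socket
import struct

RTF_GATEWAY = 0x2  # flag bit marking a route that goes through a gateway


def default_gateway(route_table: str):
    """Return the default gateway IP from /proc/net/route text, or None."""
    for line in route_table.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        destination, gateway, flags = fields[1], fields[2], int(fields[3], 16)
        if destination == "00000000" and flags & RTF_GATEWAY:
            # Little-endian hex, e.g. 010011AC -> 172.17.0.1
            return socket.inet_ntoa(struct.pack("<I", int(gateway, 16)))
    return None


# Invented sample; inside a container you would read open("/proc/net/route").read().
sample_routes = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
    "eth0\t00000000\t010011AC\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"
    "eth0\t000011AC\t00000000\t0001\t0\t0\t0\t0000FFFF\t0\t0\t0\n"
)

print(default_gateway(sample_routes))
```

The service would then register against port 8500 on that address instead of 127.0.0.1.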

@Keets2016

I still can't remove zombie service instances via the Agent API. Those instances still appear in the UI.

@flypenguin

okay, I have now tried to work with this for a while, and I still very much think the approach has "room for improvement".

I really propose that you should be able to use any agent to register and de-register services, and this information should just be forwarded to the agent in question, which is then still responsible for doing this. In the not-so-unlikely use case when there is no consul on the node with the service in question, a hashing algorithm could be used to explicitly determine the "responsible" node. in any case it should not be the node where the API request comes in, because that node is arbitrary, and arbitrary is evil.
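to make the hashing idea concrete, here is one possible sketch using rendezvous (highest-random-weight) hashing. this is not something Consul does, just an illustration of the proposal; `responsible_agent` and the agent names are hypothetical:

```python
import hashlib


def responsible_agent(service_id: str, agents: list) -> str:
    """Pick the agent that would own service_id, independent of request entry point."""
    if not agents:
        raise ValueError("at least one agent is required")

    def score(agent: str) -> str:
        # Every (agent, service) pair gets a stable pseudo-random score.
        return hashlib.sha256(f"{agent}|{service_id}".encode("utf-8")).hexdigest()

    # The highest score wins; removing a losing agent never changes the winner.
    return max(agents, key=score)


agents = ["consul-a", "consul-b", "consul-c"]
owner = responsible_agent("my_service:9042", agents)
print(owner)

# The answer does not depend on which agent received the request or on list order.
print(owner == responsible_agent("my_service:9042", list(reversed(agents))))
```

every agent would compute the same owner for a given service ID, so a register and a later de-register arriving at different agents could both be forwarded to the same place.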

simple example: I use a single URL to register/de-register services, and behind this URL is a load balancer which balances between several consul instances for HA. very simple approach, easy to do, and ... not working with the current setup.

my main three lines of thinking are:

  • first, if I use only one consul behind this URL, I now have a "single node of failure" - if that node dies all my services die with it, right? that is confusing, unnecessary, and counterproductive IMHO.
  • second, if I do load balance, then I have arbitrary failure nodes, so that if one of them drops I get random service outages.
  • third, I can't register services on nodes without a consul on them if I have to use the consul on the service node in question, which I also don't like.

I really think there are good reasons to switch to a more sophisticated service registration management.

@lowzj
lowzj commented Sep 29, 2016

@Keets2016 you should use the Agent Deregister API to deregister the service instance on the same Consul agent that your service instance registered with.

@flypenguin

@lowzj yes and that's exactly what should change. Consul is a distributed system but can't be used in a distributed way.

@mdirkse
mdirkse commented Sep 29, 2016

Agreed, unless I'm missing something it's kinda crazy that it works like this.

@lowzj
lowzj commented Sep 29, 2016

@flypenguin yeah, at first I was totally confused by the mechanism Consul uses, and spent a lot of time on this. IMO, if a service can be registered using the 'catalog api', it should be possible to deregister it the same way.

But Consul is quite different from other systems I have used before, for example ZooKeeper. It's more like SmartStack, but simpler. The Consul server plays the role of ZooKeeper in SmartStack: a distributed key/value store. The Consul client plays the role of Nerve in SmartStack: it manages local services, performs health checks and reports to the Consul server/ZooKeeper.
So why can't I deregister a service from Consul using the 'catalog api'? It is just like deleting a service (represented by an ephemeral znode) directly from zk: Nerve will recreate the ephemeral znode because the service's health check still succeeds.

Why does Consul work in this different way? From my experience, I think the biggest reason may be that it makes Consul much easier and more convenient for services to use. From a service's point of view, the local Consul client is God: everything can be done through it, and the service doesn't need to care about anything else. But it's a little difficult to maintain the whole Consul cluster (one agent per node). We must write tools to deploy/monitor/auto-restart Consul agents. Yes, I also wrote shell scripts to deregister critical zombie services. -_-!!

Sorry for my poor English, please don't mind.

@flypenguin

ah you misunderstood me - I don't actually care about which API to use. that is ok for me.

I care about the fact that you have to use the SAME HOST for registering and de-registering, otherwise it won't work as expected, it seems.

This should be changed.

@webertlima

I agree with @flypenguin 's point of view about infrastructure. I had 2 system outages because of kernel panics, because the services were not deregistered on the machine that died, and I couldn't deregister them until I brought that machine back online.

@slackpad slackpad added this to the 0.7.1 milestone Oct 25, 2016
@slackpad slackpad modified the milestone: 0.7.2, 0.7.1 Nov 10, 2016
@vtahlani

If I try to register an external service (https://www.consul.io/docs/guides/external.html) with the Agent API, I get an error: invalid tag DataCenter. Can someone please provide an example of registering external services with the Agent API?

@pocesar
pocesar commented Nov 23, 2016

just happened to me: after deregistering on the local agent (on 127.0.0.1) the service still shows as up and running, and health checks even return critical 1s after the deregister call. that's really counterintuitive, all the more so for a local agent that should reflect changes immediately

@slackpad slackpad modified the milestone: 0.7.3, 0.7.2 Dec 15, 2016
@slackpad slackpad modified the milestone: 0.7.4, 0.7.3 Jan 17, 2017