
Some services are being registered all the time #18538

Open
jorgemarey opened this issue Sep 20, 2023 · 6 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/consul/connect Consul Connect integration theme/consul type/bug

Comments

@jorgemarey
Contributor

Nomad version

Nomad v1.5.6

Issue

We had a problem where we saw some Consul services flapping to critical for a moment (less than a second) and then becoming healthy again. While investigating this, we saw that Nomad is constantly re-registering the services.

In the client logs we can see:

2023-09-20T10:04:07.331Z [TRACE] consul.sync: execute sync: reason=periodic
2023-09-20T10:04:07.406Z [TRACE] consul.sync: registrations different: id=_nomad-task-6de92b6e-9099-654a-00f2-93ac2b255142-xxxx-8080 field=connect_sidecar wanted="\"xxxx\"" existing="\"xxxx\""
2023-09-20T10:04:07.406Z [TRACE] consul.sync: must register service: id=_nomad-task-6de92b6e-9099-654a-00f2-93ac2b255142-group-xxxx-8080 exists=true reason=periodic
2023-09-20T10:04:07.717Z [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=0 deregistered_checks=1

We found that this happens with all services that have upstreams configured. The problem occurs here: when the mesh gateway mode is left empty in Nomad, Consul sets it to local, so when Nomad compares the value in the job with the value in Consul it finds a mismatch and registers the service again.

Related to this, we modified the interval of the check on that service and saw that, while the service is being re-registered, the old and new checks are both registered for a moment, before the old check is removed again by the sync process.
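
For reference, the change was roughly the following (a sketch only; the interval and timeout values are taken from the two checks shown below, everything else is a placeholder):

  check {
    name     = "xxxx-http"
    type     = "http"
    path     = "/_service/v0/health"
    interval = "10s" # previously "30s"
    timeout  = "2s"  # previously "5s"
  }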

  "_nomad-check-01300b3da45ff5b95964e40301deb8058883665d": {
    "Node": "wrkr-ec26d795-8a28",
    "CheckID": "_nomad-check-01300b3da45ff5b95964e40301deb8058883665d",
    "Name": "xxxx-http",
    "Status": "passing",
    "Notes": "",
    "Output": "HTTP GET http://10.15.36.66:21795/_service/v0/health: 200  Output: {\n  \"status\" : \"UP\"\n}",
    "ServiceID": "_nomad-task-6de92b6e-9099-654a-00f2-93ac2b255142-group-xxxx-8080",
    "ServiceName": "xxxx",
    "ServiceTags": [],
    "Type": "http",
    "Interval": "30s",
    "Timeout": "5s",
    "ExposedPort": 0,
    "Definition": {},
    "Partition": "default",
    "Namespace": "default",
    "CreateIndex": 0,
    "ModifyIndex": 0
  },
  "_nomad-check-036f17f519b332a2439eb8b5c240f3e873273cdd": {
    "Node": "wrkr-ec26d795-8a28",
    "CheckID": "_nomad-check-036f17f519b332a2439eb8b5c240f3e873273cdd",
    "Name": "xxxx-http",
    "Status": "critical",
    "Notes": "",
    "Output": "",
    "ServiceID": "_nomad-task-6de92b6e-9099-654a-00f2-93ac2b255142-group-xxxx-8080",
    "ServiceName": "xxxx",
    "ServiceTags": [],
    "Type": "http",
    "Interval": "10s",
    "Timeout": "2s",
    "ExposedPort": 0,
    "Definition": {},
    "Partition": "default",
    "Namespace": "default",
    "CreateIndex": 0,
    "ModifyIndex": 0
  },

Reproduction steps

  • Register a job with an upstream (a minimal example job is sketched after these steps).

For the second problem

  • Register a job with an upstream and a check
  • Modify the check
  • Now both checks should appear in Consul for a moment on every sync.
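
A minimal job along these lines should show the first problem in the client's trace logs. This is only a sketch: the job, group, service, and upstream names, the ports, and the image are placeholders, not taken from the report.

job "upstream-example" {
  datacenters = ["dc1"]

  group "web" {
    network {
      mode = "bridge"
    }

    service {
      name = "web"
      port = "8080"

      connect {
        sidecar_service {
          proxy {
            upstreams {
              # Leaving destination_type and the mesh gateway mode unset is
              # what triggers the periodic re-registration described above.
              destination_name = "api"
              local_bind_port  = 9090
            }
          }
        }
      }

      # Changing interval/timeout here and re-running the job is how the
      # second problem (both checks briefly visible on each sync) was observed.
      check {
        type     = "http"
        path     = "/"
        interval = "30s"
        timeout  = "5s"
      }
    }

    task "web" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-listen", ":8080", "-text", "ok"]
      }
    }
  }
}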
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Sep 20, 2023
@apollo13
Contributor

I am seeing the same thing at the trace level, happening every 30 seconds:

2023-09-29T22:09:01.139+0200 [TRACE] consul.sync: registrations different: id=_nomad-task-03b36744-c471-b2db-7eed-9e56ddbd6c08-group-vector-vector-8686 field=connect_sidecar wanted="\"vector\"" existing="\"vector\""
2023-09-29T22:09:01.139+0200 [TRACE] consul.sync: must register service: id=_nomad-task-03b36744-c471-b2db-7eed-9e56ddbd6c08-group-vector-vector-8686 exists=true reason=periodic
2023-09-29T22:09:01.144+0200 [TRACE] consul.sync: registrations different: id=_nomad-task-6e9302b4-ffbd-1a72-fa15-99b0fe556c2a-group-kibana-kibana-5601 field=connect_sidecar wanted="\"kibana\"" existing="\"kibana\""
2023-09-29T22:09:01.144+0200 [TRACE] consul.sync: must register service: id=_nomad-task-6e9302b4-ffbd-1a72-fa15-99b0fe556c2a-group-kibana-kibana-5601 exists=true reason=periodic
2023-09-29T22:09:01.153+0200 [DEBUG] consul.sync: sync complete: registered_services=2 deregistered_services=0 registered_checks=0 deregistered_checks=0

Interestingly enough, this only seems to happen for 2 out of 4 services with sidecars on the host. What information can I provide to aid debugging?

@lgfa29
Contributor

lgfa29 commented Oct 7, 2023

Thanks for the report @jorgemarey.

I was able to reproduce the first issue (but not the second so far). Curiously, in my case, the diff was caused by DestinationType. Looking through Consul's docs I think these are two fields that could cause problems because they have a Consul default that Nomad is not aware of.

In the case of MeshGateway.Mode it can be even trickier because an empty string in the job means that Consul will look for the actual value in several places, including service defaults, which Nomad is not aware of. Do you have any service default that could be affecting this value?

#18692 helps the situation a bit by allowing these two values to be empty, meaning Nomad will just accept whatever is set in Consul as correct.

@apollo13 unfortunately we don't have very descriptive logs for this diff logic. I'm assuming @jorgemarey patched the Nomad code to drill down into exactly which field was the problem 😅

One thing you could try is to compare these services' definitions in your job with what's in the Consul agent as reported by the /v1/agent/services endpoint.

In my case I noticed the DestinationType was set to service, so I updated my service definition like this and the sync stopped:

service {
  name = "nginx"
  port = "http"
                                       
  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "whoami"
+         destination_type = "service"
          local_bind_port  = 8080
        }
      }
    }
  }
}

@lgfa29 lgfa29 added theme/consul theme/consul/connect Consul Connect integration stage/accepted Confirmed, and intend to work on. No timeline commitment though. labels Oct 7, 2023
@lgfa29 lgfa29 moved this from Needs Triage to Needs Roadmapping in Nomad - Community Issues Triage Oct 7, 2023
@apollo13
Contributor

apollo13 commented Oct 7, 2023

@lgfa29 Thank you, adding destination_type = "service" fixed it here as well. I will go through all hosts in the cluster and see if that is all that is needed or if there are other fields missing.

EDIT: Funny story though: I looked through the nodes of a test cluster and only one job has this issue (and not the same job as last time). So I wonder whether it might have worked depending on how/when and against which Consul version the job got registered? Or maybe it depends on the moon…

@jorgemarey
Contributor Author

> Thanks for the report @jorgemarey.
>
> I was able to reproduce the first issue (but not the second so far). Curiously, in my case, the diff was caused by DestinationType. Looking through Consul's docs I think these are two fields that could cause problems because they have a Consul default that Nomad is not aware of.
>
> In the case of MeshGateway.Mode it can be even trickier because an empty string in the job means that Consul will look for the actual value in several places, including service defaults, which Nomad is not aware of. Do you have any service default that could be affecting this value?

Yeah, we have a consul proxy-defaults config with

    MeshGateway = {
      Mode = "local"
    }

I guess that's what caused this.

> #18692 helps the situation a bit by allowing these two values to be empty, meaning Nomad will just accept whatever is set in Consul as correct.

Looking at that PR, there's a situation where I'm not sure what will happen:
If someone has a value in the Nomad configuration and then deletes it, will this change the service in Consul only the first time the job is launched, but avoid changing it constantly afterwards? If the service isn't registered with that value empty, the value won't change in Consul, and the user may expect it to change.

> @apollo13 unfortunately we don't have very descriptive logs for this diff logic. I'm assuming @jorgemarey patched the Nomad code to drill down into exactly which field was the problem 😅

Yep, I did patch the code to output where Nomad and Consul had the diff.

Regarding the second issue: the problem is that the check is only registered in Consul for a few milliseconds before it's removed again by the synchronization process. Once the synchronization starts, Nomad registers the service with both checks, because it sees a difference between the service it knows about and the one in Consul; later in that same process Nomad sees a check that shouldn't be there and removes it, so the extra check only exists in Consul for a few milliseconds. I managed to notice it by making blocking queries to the Consul agent API, and also in the Nomad logs, where I could constantly see:

2023-09-20T10:04:07.717Z [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=0 deregistered_checks=1

I'll try to reproduce it myself and provide some information about how to do it if I manage to.

@lgfa29
Contributor

lgfa29 commented Oct 11, 2023

> If someone has a value in the Nomad configuration and then deletes it, will this change the service in Consul only the first time the job is launched, but avoid changing it constantly afterwards? If the service isn't registered with that value empty, the value won't change in Consul, and the user may expect it to change.

Hum...yeah, I think you're right. To make things worse, I think the service will not be updated at all, because that difference method is also used to detect syncs for job updates.

I think we're lacking expressiveness here: an empty string here means "whatever Consul says it should be", meaning we shouldn't sync, but what users usually intend with an empty value is for it to be set to Consul's default. So I think I will need to go back to the drawing board on this one 🤔

As a workaround for now, to avoid the unnecessary syncs you can set all the expected values directly in the job. More specifically, destination_type and the mesh gateway mode.
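
Building on the nginx example above, a fully pinned upstream could look roughly like this. This is a sketch: the mesh_gateway block with mode = "local" is an assumption and should be matched to whatever your Consul proxy-defaults / service-defaults actually resolve to (destination_type = "service" is Consul's default made explicit):

service {
  name = "nginx"
  port = "http"

  connect {
    sidecar_service {
      proxy {
        upstreams {
          destination_name = "whoami"
          destination_type = "service" # Consul's default, made explicit
          local_bind_port  = 8080

          # Assumed to mirror a proxy-defaults entry of MeshGateway { Mode = "local" };
          # use whatever value your Consul configuration resolves to.
          mesh_gateway {
            mode = "local"
          }
        }
      }
    }
  }
}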

@dani

dani commented Feb 12, 2024

Not only destination_type and the mesh gateway mode are affected. When an upstream is declared as an HTTP service (Protocol = "http" in its service-defaults), we must also add this to the job description (in my example, immich-ml is the upstream HTTP service, and immich is the consumer of this service):

          proxy {
            upstreams {
              destination_name = "immich-ml"
              destination_type = "service"
              local_bind_port  = 3003
              config {
                protocol = "http"
              }
            }
          }

Without this, the Nomad agent also re-syncs the service in Consul every 30 seconds:

févr. 12 13:15:32 ct-w-1 nomad[2691136]:     2024-02-12T13:15:32.910+0100 [TRACE] consul.sync: must register service: id=_nomad-task-c5309ab5-1650-63be-9801-2f9caec34ab8-group-immich-immich-3001 exists=true reason=periodic
févr. 12 13:15:33 ct-w-1 nomad[2691136]:     2024-02-12T13:15:33.092+0100 [DEBUG] consul.sync: sync complete: registered_services=1 deregistered_services=0 registered_checks=0 deregistered_checks=0
févr. 12 13:16:03 ct-w-1 nomad[2691136]:     2024-02-12T13:16:03.092+0100 [TRACE] consul.sync: execute sync: reason=periodic
févr. 12 13:16:03 ct-w-1 nomad[2691136]:     2024-02-12T13:16:03.107+0100 [TRACE] consul.sync: registrations different: id=_nomad-task-c5309ab5-1650-63be-9801-2f9caec34ab8-group-immich-immich-3001 field=connect_sidecar wanted="\"immich\"" existing="\"immich\""

This is a pain to maintain :-(
