Healthchecks resynced every ~2mins, causes consul index to grow #3259

Closed
shays10 opened this issue Jul 11, 2017 · 24 comments
Labels
theme/health-checks (Health Check functionality), type/bug (Feature does not function as expected)

shays10 commented Jul 11, 2017

Hey,

This is very much related to #1244.
I'm using blocking HTTP queries and I ran into strange behavior: the consul index is incremented even when health checks haven't changed their status (or their output).
So I looked at the value of X-Consul-Index, searched for an offending health check that had that exact same number (via /health/checks/any), and expected to find that the check's state had changed.
But all I could find in the log is
https://gist.github.com/shays10/d0bdb43e0ae3664af1686aa139f18965

All checks are synced again, and this happens every 90-120 seconds. I'm guessing it happens when anti-entropy kicks in and finds that the service is in sync but the checks are not.
I managed to find a correlation between the timestamp of the sync (according to the log) and the time the ModifyIndex was bumped on all the checks.
Any idea how to find the root cause? I have lots of checks on a lot of nodes, and this causes the index to go wild.

I tried to make sure, as much as I could, that there isn't a flaky health check that fails randomly or returns different output.
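For illustration, here is a minimal sketch of the kind of blocking query involved, using the official Go API client (github.com/hashicorp/consul/api) against a local agent. This is only an illustrative loop, not the exact query in use; with the behavior described above, the printed index keeps climbing every 90-120 seconds even though no check changes status or output.

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	// Talk to the local agent (http://127.0.0.1:8500 by default).
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	var lastIndex uint64
	for {
		// Blocking query against /v1/health/state/any: returns once
		// X-Consul-Index moves past lastIndex (or the wait time expires).
		checks, meta, err := client.Health().State("any", &api.QueryOptions{
			WaitIndex: lastIndex,
		})
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("X-Consul-Index is now %d (%d checks returned)\n",
			meta.LastIndex, len(checks))
		lastIndex = meta.LastIndex
	}
}
```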

Client: 0.8.3
Server: 0.8.3

Client:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 11
	services = 11
build:
	prerelease = 
	revision = ea2a82b
	version = 0.8.3
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 8
	goroutines = 60
	max_procs = 2
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = false
	event_queue = 0
	event_time = 243
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1496
	members = 32
	query_queue = 0
	query_time = 1

Server:

agent:
	check_monitors = 3
	check_ttls = 0
	checks = 3
	services = 1
build:
	prerelease = 
	revision = ea2a82b
	version = 0.8.3
consul:
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.31.9.104:8300
	server = true
raft:
	applied_index = 2553095
	commit_index = 2553095
	fsm_pending = 0
	last_contact = 0
	last_log_index = 2553095
	last_log_term = 435
	last_snapshot_index = 2551334
	last_snapshot_term = 435
	latest_configuration = [{Suffrage:Voter ID:172.31.13.71:8300 Address:172.31.13.71:8300} {Suffrage:Voter ID:172.31.11.117:8300 Address:172.31.11.117:8300} {Suffrage:Voter ID:172.31.9.104:8300 Address:172.31.9.104:8300}]
	latest_configuration_index = 1180420
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 435
runtime:
	arch = amd64
	cpu_count = 2
	goroutines = 247
	max_procs = 2
	os = linux
	version = go1.8.1
serf_lan:
	encrypted = false
	event_queue = 0
	event_time = 243
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1496
	members = 32
	query_queue = 0
	query_time = 1
serf_wan:
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 92
	members = 3
	query_queue = 0
	query_time = 1

shays10 commented Jul 17, 2017

A quick update: I don't know what the root cause was, but the problem was resolved after we downgraded to 0.7.5.

slackpad (Contributor) commented

@shays10 thanks for the update. This kind of thing is usually related to some changing output, but the downgrade fixing it doesn't really fit in with that explanation. Do you have ACLs enabled?


shays10 commented Jul 18, 2017

@slackpad Thanks for your reply! No, ACLs are currently disabled in our cluster

slackpad added the type/bug and theme/health-checks labels Sep 1, 2017

slackpad commented Sep 1, 2017

Hi @shays10

A quick update: I don't know what the root cause was, but the problem was resolved after we downgraded to 0.7.5.

Did you downgrade the clients, servers, or both?


shays10 commented Sep 1, 2017 via email


ponimas commented Sep 4, 2017

I've just faced the same issue with the latest Consul.
It seems that the index is incremented on every HTTP check.
Are HTTP response headers considered part of the output?

Client:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 1
        services = 2
build:
        prerelease =
        revision = 75ca2ca
        version = 0.9.2
consul:
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 1
        goroutines = 44
        max_procs = 1
        os = linux
        version = go1.8.3
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 315
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 638
        members = 33
        query_queue = 0
        query_time = 35

Server:

agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease =
        revision = 75ca2ca
        version = 0.9.2
consul:
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 172.31.24.147:8300
        server = true
raft:
        applied_index = 15877761
        commit_index = 15877761
        fsm_pending = 0
        last_contact = 22.483µs
        last_log_index = 15877761
        last_log_term = 784
        last_snapshot_index = 15873877
        last_snapshot_term = 784
        latest_configuration = [{Suffrage:Voter ID:11325065-d2e6-788a-6a23-7d7c0af1a7d9 Address:172.31.22.9:8300} {Suffrage:Voter ID:78ca8bae-0f85-70de-329e-bf947b058a54 Address:172.31.24.147:8300} {Suffrage:Voter ID:f1193568-725f-7048-05a4-12bbd1a1d1e8 Address:172.31.25.174:8300}]
        latest_configuration_index = 15617842
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 784
runtime:
        arch = amd64
        cpu_count = 1
        goroutines = 225
        max_procs = 1
        os = linux
        version = go1.8.3
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 315
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 638
        members = 33
        query_queue = 0
        query_time = 35
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 1
        members = 1
        query_queue = 0
        query_time = 1

magiconair (Contributor) commented

No. https://github.com/hashicorp/consul/blob/master/agent/check.go#L400-L415

Have you got a timestamp in the body?
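For context, here is a rough, paraphrased sketch of what that linked code does: only the status line and the response body go into the check's Output, and headers are ignored. It is an illustration, not the actual agent source; the 4 KiB limit and function names are made up for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

// checkOutput is a paraphrased illustration of how an HTTP check's Output
// string is assembled: status line plus (a bounded amount of) the body.
// Response headers never appear in it.
func checkOutput(url string) (status, output string) {
	resp, err := http.Get(url)
	if err != nil {
		return "critical", err.Error()
	}
	defer resp.Body.Close()

	// Read only the body, truncated here to 4 KiB for the sketch.
	body, _ := io.ReadAll(io.LimitReader(resp.Body, 4096))
	output = fmt.Sprintf("HTTP GET %s: %s Output: %s", url, resp.Status, body)

	if resp.StatusCode >= 200 && resp.StatusCode <= 299 {
		return "passing", output
	}
	return "critical", output
}

func main() {
	status, output := checkOutput("http://127.0.0.1:3000/")
	fmt.Println(status, output)
}
```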


ponimas commented Sep 4, 2017

Have you got a timestamp in the body?

Nope.
Sometimes it's an empty body and sometimes it's simple JSON (without any arrays).
It's the same response every time from the same service.

magiconair (Contributor) commented

@ponimas Can you provide the check definition and a full check request/response for testing?


ponimas commented Sep 5, 2017

@magiconair

{
          "http": "http://172.31.17.136:3000/",
          "interval": "60s",
          "timeout": "1s"
}

As far as I understand, the request is a simple GET request,
and the response is:

curl -i http://172.31.17.136:3000/

HTTP/1.1 200 OK
Date: Tue, 05 Sep 2017 05:08:32 GMT
Content-Type: application/json; charset=utf-8
x-request-id: 15045881129851FD6B543
Content-Length: 60
Server: Jetty(9.2.z-SNAPSHOT)

{"status":"ok","message":"V2. Follow me, I know a shortcut"}
curl -i /v1/health/checks/susanin

HTTP/1.1 200 OK
Content-Type: application/json
Date: Tue, 05 Sep 2017 05:17:39 GMT
Transfer-Encoding: chunked
X-Consul-Index: 15897185
X-Consul-Knownleader: true
X-Consul-Lastcontact: 0
[
...
{
        "CheckID": "service:registrator.susanin-prod-cc8b.vpc:susanin:3000",
        "CreateIndex": 15616870,
        "ModifyIndex": 15897062,
        "Name": "Service 'susanin' check",
        "Node": "susanin-prod-cc8b",
        "Notes": "",
        "Output": "HTTP GET http://172.31.17.136:3000/: 200 OK Output: {\"status\":\"ok\",\"message\":\"V2. Follow me, I know a shortcut\"}",
        "ServiceID": "registrator.susanin-prod-cc8b.vpc:susanin:3000",
        "ServiceName": "susanin",
        "ServiceTags": [
            "urlprefix-/internal/susanin strip=/internal/susanin"
        ],
        "Status": "passing"
    },
...
]


ponimas commented Sep 11, 2017

@magiconair is there any news on this issue?


123BLiN commented Sep 14, 2017

Consul 0.9.2 - same behaviour. Is it possible that the Date header may trigger the index to grow?

PS C:\tools\curl> .\curl.exe -k --resolve ztest.service.consul:443:127.0.0.1 https://ztest.service.consul
Test consul site
PS C:\tools\curl> .\curl.exe -k -I --resolve ztest.service.consul:443:127.0.0.1 https://ztest.service.consul
HTTP/1.1 200 OK
Content-Length: 16
Content-Type: text/html
Last-Modified: Thu, 22 Jun 2017 14:10:40 GMT
Accept-Ranges: bytes
ETag: "b64af75361ebd21:0"
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
Date: Thu, 14 Sep 2017 11:50:27 GMT

PS C:\tools\curl> .\curl.exe -k -I --resolve ztest.service.consul:443:127.0.0.1 https://ztest.service.consul
HTTP/1.1 200 OK
Content-Length: 16
Content-Type: text/html
Last-Modified: Thu, 22 Jun 2017 14:10:40 GMT
Accept-Ranges: bytes
ETag: "b64af75361ebd21:0"
Server: Microsoft-IIS/8.5
X-Powered-By: ASP.NET
Date: Thu, 14 Sep 2017 11:50:37 GMT

magiconair (Contributor) commented

I'm currently a bit sick and also preparing for HashiConf next week. Should be able to catch my breath after that.

magiconair self-assigned this Sep 14, 2017

ponimas commented Oct 10, 2017

@magiconair is that gonna be fixed in 1.0?

magiconair (Contributor) commented

@ponimas If I can find the root cause then probably. However, usually this is triggered by the Output field being different. I'll try to have a look this week but it is a short week for me.

slackpad (Contributor) commented

@ponimas does that x-request-id: 15045881129851FD6B543 field change on every GET? That would cause some churn if so.


ponimas commented Oct 11, 2017

@slackpad yup, it changes. But it's a header.
AFAIU, per #3259 (comment), it shouldn't affect anything.

slackpad (Contributor) commented

@ponimas ah ok - I thought you were actually running that curl command as the check.

pierresouchay (Contributor) commented

@slackpad I think you should have a look at @yfouquet's patch.
In our cluster, the X-Consul-Index is increasing continuously while the state is not varying that much. We have lots of checks on services with tags, and we see a very large amount of syncing at the cluster level, probably for this reason.


123BLiN commented Nov 22, 2017

BTW, our indexes are still growing insanely fast without health check output. However, consul watchers and consul-template do not trigger on them; in general they trigger only on actual changes. There must be some undocumented magic here, as I always thought the trigger event for such blocking queries was an x-consul-index change.

pierresouchay (Contributor) commented

Might be related as well: #3712

kamaradclimber pushed a commit to criteo-forks/consul that referenced this issue Dec 5, 2017
Since commit 9685bdc, service tags are added to the health checks.
However, when adding a service, tags were not added to its check.

In updateSyncState, we compare the checks of the local agent with the checks of the catalog.
It appears that the service tags are different (missing in one case), and so the check is synchronized.
That increases the ModifyIndex periodically when nothing changes.

Fixed it by adding serviceTags to the check.

Note that the issue appeared in version 0.8.2.
Looks related to hashicorp#3259.
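For illustration, a simplified, hypothetical sketch of the comparison this commit message describes: if the locally registered check carries no ServiceTags while the catalog copy does, an updateSyncState-style equality test fails on every anti-entropy pass, so the check is re-synced and its ModifyIndex bumped even though nothing observable changed. The struct and field names below are stand-ins, not the real agent types.

```go
package main

import (
	"fmt"
	"reflect"
)

// Simplified stand-in for a health check record; not the real structs
// from the consul agent package.
type check struct {
	CheckID     string
	Status      string
	Output      string
	ServiceTags []string
}

func main() {
	// What the local agent holds for the check (tags missing before the fix).
	local := check{CheckID: "service:susanin", Status: "passing", Output: "OK"}

	// What the catalog holds for the same check (tags present).
	catalog := check{
		CheckID:     "service:susanin",
		Status:      "passing",
		Output:      "OK",
		ServiceTags: []string{"urlprefix-/internal/susanin"},
	}

	// Any field mismatch marks the check as out of sync, so anti-entropy
	// re-registers it and the catalog bumps ModifyIndex.
	if !reflect.DeepEqual(local, catalog) {
		fmt.Println("check considered out of sync -> re-synced, ModifyIndex bumped")
	}
}
```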
yfouquet added a commit to yfouquet/consul that referenced this issue Dec 12, 2017
slackpad added this to the 1.0.2 milestone Dec 14, 2017
slackpad (Contributor) commented

Closing this as fixed by #3642, which looks to have been the root cause of this.


42wim commented Jan 28, 2018

We've upgraded to Consul 1.0.2, but we're still seeing this issue with the Nomad health check.

Almost every 2 minutes it syncs the Nomad Client HTTP Check again.
Running Nomad 0.7.1.

Jan 29 00:28:45 box consul[110606]: 2018/01/29 00:28:45 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:30:17 box consul[110606]: 2018/01/29 00:30:17 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:32:28 box consul[110606]: 2018/01/29 00:32:28 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:34:40 box consul[110606]: 2018/01/29 00:34:40 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:37:02 box consul[110606]: 2018/01/29 00:37:02 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"
Jan 29 00:39:13 box consul[110606]: 2018/01/29 00:39:13 [INFO] agent: Synced check "413645baec3b17ac7b399423d71b16c9fe5b206e"


42wim commented Jan 29, 2018

After a lot of debugging I found the cause (of my issue) and submitted a PR.
cc: @slackpad
