
Zookeeper Serverset SD broken on 1.4.1 #2254

Closed
jjneely opened this Issue Dec 5, 2016 · 13 comments

@jjneely
Contributor

jjneely commented Dec 5, 2016

What did you do?

Upgraded from 1.2.1 to 1.4.1.

What did you expect to see?

Known working serverset configuration should continue to function.

What did you see instead? Under which circumstances?

The /targets page refuses to load, and no targets are scraped.

Environment

  • System information:

    Linux 3.13.0-85-generic x86_64
    Ubuntu Trusty

  • Prometheus version:

      # prometheus -version
      prometheus, version 1.4.1 (branch: XXX, revision: 1.4.1-1+JENKINS~trusty.20161202194014)
        build user:       jjneely@42lines.net
        build date:       2016-12-02T19:41:12Z
        go version:       go1.7.4
    
  • Prometheus configuration file:

Complete configuration can be made available privately. The following is what my server set looks like. (Yes, JSON.)

    {
      "job_name": "pushgateway",
      "serverset_sd_configs": [
        {
          "servers": [
            "mesos-zk-discovery-001-g6.XXX.com:2181",
            "mesos-zk-discovery-002-g6.XXX.com:2181",
            "mesos-zk-discovery-003-g6.XXX.com:2181"
          ],
          "paths": [
            "/aurora/site-visibility/prod/pushgateway"
          ]
        }
      ],
      "honor_labels": true,
      "relabel_configs": [
        ...
      ]
    }

Removing the serverset_sd_configs section from the configuration makes the Prometheus 1.4.1 instance function normally. Issues return when the serverset section is re-added.

  • Logs:

No additional logs are produced, even with -log.level=debug present. I believe the target manager is deadlocked.

@cristifalcas


cristifalcas commented Dec 6, 2016

We have the same issue. Since 1.4.0, the Zookeeper-based scrape_config doesn't work anymore.

@prfalken

Contributor

prfalken commented Dec 8, 2016

This obviously affects the Nerve Service Discovery as well, as it's based on zookeeper.
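For reference, a nerve_sd_configs job goes through the same Zookeeper tree-cache code path. A minimal example in the same JSON style as the config above, with placeholder hostnames and paths:

    {
      "job_name": "nerve-example",
      "nerve_sd_configs": [
        {
          "servers": ["zk-001.example.com:2181"],
          "paths": ["/nerve/services/example/services"]
        }
      ]
    }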

@JustinVenus


JustinVenus commented Jan 4, 2017

Ran into this issue as well. Prometheus 1.3.1 is the last release that isn't affected by this bug.

@jacobrichard


jacobrichard commented Jan 4, 2017

It looks like there were a lot of ZK service discovery changes in 1.4.0. I'd have to pick through what was done there, but it appears the Nerve and serverset discovery got merged into one file, and a lot of the locking logic was rewritten or removed, which could contribute to a deadlock.

See the changes in: a1eec44
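A minimal sketch of the suspected failure mode, using hypothetical toy types rather than the actual treecache code: a constructor that synchronously pushes initial events onto an unbuffered channel before any consumer is reading will block forever.

    package main

    // Hypothetical toy types, not the Prometheus treecache code.
    type event struct{ path string }

    type cache struct{ events chan event }

    func newCache(paths []string) *cache {
        c := &cache{events: make(chan event)} // unbuffered channel
        for _, p := range paths {
            // Nothing receives from c.events yet, so this send never
            // completes and newCache never returns: a deadlock.
            c.events <- event{path: p}
        }
        return c
    }

    func main() {
        // The consumer below only starts after newCache returns,
        // which it never does.
        c := newCache([]string{"/aurora/site-visibility/prod/pushgateway"})
        for ev := range c.events {
            _ = ev
        }
    }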

@Jackey2015


Jackey2015 commented Jan 11, 2017

In util/treecache/treecache.go, change

    tc.events <- ZookeeperTreeCacheEvent{Path: path, Data: node.data}

to

    go func() { tc.events <- ZookeeperTreeCacheEvent{Path: path, Data: node.data} }()

After rebuilding Prometheus with this change, it works OK.

@cristifalcas


cristifalcas commented Mar 3, 2017

Any updates on this?

@juliusv

Member

juliusv commented Mar 3, 2017

I am not familiar with the Zookeeper SD at all and don't have a Zookeeper around to try anything out myself, but from a quick look at the code I suspect the problem is a deadlock during the initial synchronous read.

Perhaps it is sufficient to remove the initial synchronous sync by just removing these lines:

    err := tc.recursiveNodeUpdate(path, tc.head)
    if err != nil {
        log.Errorf("Error during initial read of Zookeeper: %s", err)
    }

If someone could try that out, that would be great.
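For illustration only, reusing the toy types from the sketch above (not the real treecache code): one way to avoid the blocking send while keeping an initial read is to perform it from the background goroutine that serves the events channel, so the constructor returns before anything is sent.

    func newCacheAsync(paths []string) *cache {
        c := &cache{events: make(chan event)}
        go func() {
            // The initial read now happens after the constructor has
            // returned, so a consumer can already be draining c.events
            // and these sends no longer block forever.
            for _, p := range paths {
                c.events <- event{path: p}
            }
            // ... keep watching Zookeeper and sending further updates ...
        }()
        return c
    }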

@StephanErb

Contributor

StephanErb commented Mar 3, 2017

@juliusv thanks for doing the code analysis. I will test it out and report back here.

@StephanErb

Contributor

StephanErb commented Mar 3, 2017

Following Julius' excellent lead, I was able to assemble a fix. See #2470 for details.

@fabxc fabxc closed this in 3038d0e Mar 6, 2017

@StephanErb

Contributor

StephanErb commented Mar 21, 2017

Any chance to get this into a bugfix release?

(I can understand if this is too much effort, but maybe you have good automation in place to make this feasible :-). Thanks!)

@jjneely

Contributor Author

jjneely commented Mar 21, 2017

I filed #2481 regarding this patch: there seem to be issues after a SIGHUP reload where the Zookeeper code starts consuming a large amount of CPU time.

@beorn7

Member

beorn7 commented Mar 21, 2017

The 1.6 release is not too far away. (If only I weren't being drawn into production incidents all the time, or preparing talks for CloudNativeCon ;))

@lock


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 23, 2019
