Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace cgo dependency with getconf invocation #2

Closed
wants to merge 1 commit into from

Conversation

grobie
Copy link
Member

@grobie grobie commented Feb 11, 2015

This change removes the cgo dependency from prometheus/procfs, which is
used in prometheus/client_golang. As this is the only cgo dependency in
all prometheus components** (prometheus server, *_exporter, etc.),
removing this dependency has been considered more important than
preventing an external system call during program initialization.

In the rare case that either the getconf call or the value parsing
fails, the library falls back to the de-facto standard CLK_TCK value of
100. For the even rarer case that a different value is used on a system,
a warning is printed to stderr.

@juliusv @discordianfish

** I'll replace the proc/stat parser in node_exporter with a call to this library.

func ticks() float64 {
return float64(C.sysconf(C._SC_CLK_TCK)) // most likely 100
func clkTckWarning(err error) {
f := "[warning] github.com/prometheus/procfs could not read CLK_TCK, falling back to default value %d: %s\n"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The line hasn't been wrapped after 80 characters on purpose to make it easier to grep for the error message.

This change removes the cgo dependency from prometheus/procfs, which is
used in prometheus/client_golang. As this is the only cgo dependency in
all prometheus componentes (prometheus server, *_exporter, etc.),
removing this dependency has been considered more important than
preventing an external system call during program initialization.

In the rare case that either the getconf call or the value parsing
fails, the library falls back to the de-facto standard CLK_TCK value of
100. For the even rarer case that a different value is used on a system,
a warning is printed to stderr.
@juliusv
Copy link
Member

juliusv commented Feb 11, 2015

A bit nervous about someone missing the stderr output if they use only glog files (or similar) for their logs, but I guess it's better than panicking, and should only happen in the super-rare case where getconf is not available AND 100 is not the actual HZ value.

👍 unless someone has a better idea :)

@juliusv
Copy link
Member

juliusv commented Feb 11, 2015

One additional concern: shelling out when loading this library prevents any application using the Go Prometheus client library from running with security frameworks/settings which prevent forking. Those users would either have to relax their security (increasing risk of being exploited by someone forking a shell through a vulnerability) or manually change the library code.

Then again, I think this would mainly lead to the forking to fail in a way that will be returned as a Go error, in which case the default will simply be used (like when using rlimits to prevent forking). But it's possible that some frameworks will hard-kill the process if it tries to do any syscalls which it isn't allowed to do.

If that ever does become an issue, we could still make this all configurable via a flag or environment variable (using flags in libs doesn't seem to be a no-no, given that glog does it).

var clkTck = 100.

func init() {
output, err := exec.Command("getconf", "CLK_TCK").Output()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's security implications here, how do we know that 'getconf' that we're calling with all of our privileges is what we think it is?

This will also cause any go code that merely links to client_golang to do a fork+exec at startup, which may not be appropriate in all scenarios (think thousands of very short lived tasks talking to the pushgateway) and lead to questions from an strace.

We may be better off hardcoding, where have we seen that the value isn't 100?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the glibc and linux source, it's 100 - except on ia64 and alpha. ia64 is configurable at kernel compile time (CONFIG_HZ) and alpha is 1024.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the nastiness is: looking at the Linux source at one point in time is not enough. We have to think this through in these dimensions:

  • how this changed over time
  • in different OSes (Linux, Mac, *BSD, ...)
  • on different architectures

Meh.

But I see you're already on it, which is great :)

I agree that it makes me feel queasy to call an external binay at startup for every application that ever includes this library, so if there's a better way, that'd be much appreciated.

@discordianfish
Copy link
Member

So I did some research and first all all, the situation is super confusing.

I finally found out how the kernel communicates the rate to the userland: http://article.gmane.org/gmane.linux.kernel/38936 (assuming this didn't change again).

ps also isn't using sysconf at all. The code comments there sound like it's a compile-time macro which means if we build our prometheus services on a system with different clock rate than on the systems the services will run on, our calculation breaks: https://github.com/mmalecki/procps/blob/master/proc/sysinfo.c#L93

But I just compile a little c programm and verified that it in fact calls a function at runtime.

What's contributing to the confusion is that there are multiple CLK_TCK timers, some seem deprecated other not so..

Anyways, it seems on kernels >= 2.4.0, the right way to use get the clock rate is by looking at the ELF note section, whether via the sysconf stuff or directly. Using sysconf seems to be more compatible since it's a posix function and the ELF stuff is linux specific, so I would propose we:

  • with CGO=1, we use sysconf as we did before
  • with CGO=0 on linux we use debug/elf(?) to read the clock from prometheus' own ELF
  • with CGO=0 on other systems, we just use a hardcoded value of 100 and print a warning

@discordianfish
Copy link
Member

Okay another idea to throw in here:
https://github.com/davecheney/junk/blob/master/clock/examples/processtime/main.go

That uses the CLOCK_GETTIME syscall. But that again would only work on linux.

@grobie
Copy link
Member Author

grobie commented Feb 12, 2015

We can probably just ignore non-linux systems for procfs for now. AFAIK the BSDs have all removed it by now and osx never had it. And who uses system9 actually? Solaris users might come though.

@grobie
Copy link
Member Author

grobie commented Mar 31, 2015

Superseded by #4.

@grobie grobie closed this Mar 31, 2015
@grobie grobie deleted the remove-cgo-dependency branch March 31, 2015 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants