Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: patched intel rdt to allow sudo #9527

Merged
merged 2 commits into from
Oct 12, 2021

Conversation

xavpaice
Copy link
Contributor

Required for all PRs:

  • Updated associated README.md.
  • Wrote appropriate unit tests.

resolves #9380

When running telegraf as a non-root user, this change allows the use
of sudo to run the necessary pqos command for the intel_rdt plugin.

@xavpaice
Copy link
Contributor Author

Note: supersedes #9501, includes the same changes plus addresses the shutdown issues discussed there.

@srebhan
Copy link
Contributor

srebhan commented Jul 22, 2021

@xavpaice can you please coordinate with the guys from #9501 to not have two PRs for the same feature and either fold this change to #9501 or close it.

@xavpaice
Copy link
Contributor Author

xavpaice commented Jul 22, 2021 via email

Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @xavpaice, thanks for taking over this topic.

I have some concerns (see comments in the code) especially around escaping the pqos command and, even more severe, killing the process. With the current code, you kill all pqos processes even though they might not be started by telegraf. This should definitively be fixed.

plugins/inputs/intel_rdt/README.md Outdated Show resolved Hide resolved
plugins/inputs/intel_rdt/intel_rdt.go Show resolved Hide resolved
plugins/inputs/intel_rdt/intel_rdt.go Outdated Show resolved Hide resolved
plugins/inputs/intel_rdt/intel_rdt.go Outdated Show resolved Hide resolved
plugins/inputs/intel_rdt/intel_rdt.go Outdated Show resolved Hide resolved
@srebhan
Copy link
Contributor

srebhan commented Jul 22, 2021

Guoqiao is on leave at the moment, like I was last week - though I'm assigned to the work now and will be around for a few weeks to tidy it up. Perhaps we can close off #9501 since the same change is included here?

Sure can you do it or do you want me to close it?

Copy link
Contributor

@KubaTrojan KubaTrojan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @xavpaice,

I already posted two comments in previous PRs so I will mention them: comment1 and comment2.

I think that shutDownPqos method doesn't have to be divided into two sections and I would like to refactor this method a little. As interrupting isn't working on privileged pqos, and I believe we don't want to grand telegraf access to perform sudo killall command, I suggest something which looks like this:

func shutDownPqos(pqos *exec.Cmd) error {
	timeout := time.Second

	if pqos.Process != nil {
		// try to send interrupt signal
		pqos.Process.Signal(os.Interrupt)

		// wait and constantly check if pqos is still running
		ctx, cancel := context.WithTimeout(context.Background(), timeout)
		defer cancel()
		for {
			if err := pqos.Process.Signal(syscall.Signal(0)); err == os.ErrProcessDone {
				return nil
			} else if ctx.Err() != nil {
				break
			}
		}

		// if pqos is still running after some period, try to kill it
		err := pqos.Process.Kill()
		if err != nil {
			return fmt.Errorf("failed to shut down pqos: %v", err)
		}
	}
	return nil
}

Privileged pqos have a strong will to survive, so with sudo_true option, there always have been some troubles with interrupting it. However, it seems that pqos.Process.Kill() is doing the job (but leaves garbages which we are leaving for now).

What do you think about the proposed changes?

if err == nil {
err = cmd.Wait()
}
// TODO find a way to make sure it's dead
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be checked like this, but be aware that from sending a signal, to pqos being closed, may pass some time.

if err := pqos.Process.Signal(syscall.Signal(0)); err == os.ErrProcessDone {
	return nil
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.ErrProcessDone was introduced in 1.16, so I'd have to use an alternative.

Do you think it's worth adding some kind of wait condition with a timeout here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.ErrProcessDone looks like this:

// ErrProcessDone indicates a Process has finished.
var ErrProcessDone = errors.New("os: process already finished")

So we can introduce our own, local error variable equals: errors.New("os: process already finished").
I would suggest adding wait condition with break like:

...
errProcessDone := errors.New("os: process already finished")
// wait and constantly check if pqos is still running
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
for {
	if err := pqos.Process.Signal(syscall.Signal(0)); err == errProcessDone {
		return nil
	} else if ctx.Err() != nil {
		break
	}
}
...

if err != nil {
err = pqos.Process.Kill()
if pqos.Path == "/bin/sh" { // the process is running under sudo
cmd := exec.Command("sudo", "killall", "-SIGINT", "pqos")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As telegraf is running in unprivileged mode, this operation: sudo killall -SIGINT is not permitted. I assume that in this case, a proper adjustment has to be written in the sudoers file, but pqos PID will be different each time so it wouldn't be unambiguous and I think setting telegraf to has possibility to use sudo killal is not a good idea.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the issue is that sending SIGINT to sh does not guarantee to be passed on to the child process. Therefore, my best bet would be to use gopsutil to get the child processes of this sh (we know the PID) and send SIGINT to those processes first.

if err != nil {
return fmt.Errorf("failed to shut down pqos: %v", err)
}
} else {
err := pqos.Process.Signal(os.Interrupt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases error isn't returned and pqos is still running. If you want to have a shutting down section split (sudo and not sudo), put checking for pqos presence also here, and try to terminate it when checking failed.

@reimda reimda added the waiting for response waiting for response from contributor label Aug 5, 2021
@exsdev0
Copy link

exsdev0 commented Sep 21, 2021

I'm unsure if the privileged pqos binary can be shutdown given the current constraints (telegraf running as an unprivileged user, pqos running as root, and no sudo kill/killall). The pqos process is running as root (via sudo). Even if we were to use gopsutils to determine the pid of pqos, an unprivileged user (e.g. telegraf) is not permitted (in this instance) to send SIGINT to a root-owned process.

Results from my limited testing confirm this:

$ kill -SIGINT <pqos-pid>
-bash: kill: (<pqos-pid>) - Operation not permitted

Please let me know if this is not correct.


An alternative solution that we've been discussing (for which we'd like feedback from upstream): We can have pqos started (as root) via a systemd service. The telegraf user can then be given sudo access to start/stop the pqos service. The telegraf plugin can monitor stdout of journalctl (e.g. journalctl --follow -u pqos). Does upstream have any comments for this solution?

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Sep 21, 2021
@xavpaice
Copy link
Contributor Author

Thanks for the feedback on this change. The bulk of the comments are easily integrated into an updated commit.

There's one remaining challenge with the change itself, which is stopping the process when we want to. This has been the blocker for completing the change and we haven't found a solution as yet.

If we use gopsutil to find the child pid of the sudo, for the pqos process, we cannot issue a signal to it without using sudo to do so as it runs as root. Unfortunately, we can't add that command to sudoers without giving blanket 'kill' access to any pid, so Telegraf would have the ability to stop any process it likes. We can't provide a process name using killall, because of other users possibly running pqos. I know there's a small amount of cruft left behind if we stop the parent shell for pqos, but is that significant enough to warrant the difficulty and security issues with killing it cleanly?

If pqos is actually not responding to the parent process being killed, I wonder if we should actually be raising a bug against pqos to clean up after itself and accept a signal gracefully.

@guoqiao
Copy link
Contributor

guoqiao commented Oct 1, 2021

Hi Everyone:

I have been digging into this issue this week.
And I found that, with intel_rdt enabled, even telegraf run as root, it still has trouble to stop.
A new bug raised in #9844

@guoqiao
Copy link
Contributor

guoqiao commented Oct 1, 2021

Also, re: when kill $PQOS_PID by default with -SIGTERM, it leaves garbage in /sys/fs/resctrl/mon_groups.

A bug was also raised at: intel/intel-cmt-cat#197

The developer already respond with: "We have pushed a fix on master branch".

@guoqiao
Copy link
Contributor

guoqiao commented Oct 2, 2021

Hi All:

  1. I have confirmed Issue telegraf can not stop gracefully when intel_rdt plugin enabled, even running as root #9844 can be fixed by PR fix: logging in intel_rdt.go caused service stop timeout even as root (#9844) #9850
  2. on top of 1, plus a sudoer rule to allow telegraf run sudo pqos, Issue When running as non-root user, intel_rdt plugin has no output #9380 can be fixed by a simplified version of this PR, both telegraf and pqos can stop quickly.
  3. re: garbage from pqos on SIGTERM, a fix is already merged into intel/intel-cmt-cat master.

TODO:

@guoqiao
Copy link
Contributor

guoqiao commented Oct 4, 2021

Hi All:
Issue 9844 and PR 9850 sorted.
I have rebased and updated this PR, the new patch is simple:

  • add use_sudo
  • prepend sudo to pqos cmd when use_sudo = true

review and feedback appreciated !

Copy link
Contributor

@powersj powersj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, however, it would be good to confirm that this actually fixes the reported issues from someone. Thanks!

@guoqiao
Copy link
Contributor

guoqiao commented Oct 4, 2021

hi @powersj, it does fix our issue #9380.
our env:

  • ubuntu 18.04
  • telegraf running as non-root telegraf user
  • a sudoer file added to allow telegraf user to run sudo /usr/sbin/pqos ...
  • intel_rdt plugin enabled

when telegraf starts, it can start sudo /usr/sbin/pqos ... process.
when sudo service telegraf stop fired, both telegraf and pqos process are closed quickly.

Two tiny issues:

  1. the msg in log is a little bit unpleasant:
2021-10-04T20:52:23Z E! [inputs.intel_rdt] failed to shut down pqos: operation not permitted
2021-10-04T20:52:23Z E! [inputs.intel_rdt] pqos: signal: terminated
2021-10-04T20:52:23Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-10-04T20:52:23Z I! [agent] Stopping running output

It seems intel_rdt plugin first tried to kill pqos and has no permission (we didn't grant it kill /usr/sbin/pqos ... permission in sudoer and that's out of the scope of this plugin). However, when systemd kills telegraf, pqos is also killed, with a SIGTERM.

  1. when pqos killed with SIGTERM, it will leave garbage. which is resolved upstream. We need to wait for the new deb package though.

@guoqiao
Copy link
Contributor

guoqiao commented Oct 5, 2021

Hey @xavpaice, thanks for taking over this topic.

I have some concerns (see comments in the code) especially around escaping the pqos command and, even more severe, killing the process. With the current code, you kill all pqos processes even though they might not be started by telegraf. This should definitively be fixed.

Hi @srebhan : I see this PR is pending on above change request, but that issue doesn't exist any more. I didn't see option to close/resolve it, could you help, please ?

Copy link
Contributor

@srebhan srebhan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks @xavpaice and @guoqiao for sorting out the issues!

@srebhan srebhan self-assigned this Oct 6, 2021
@srebhan srebhan added area/system plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. labels Oct 6, 2021
@KubaTrojan
Copy link
Contributor

hi @powersj, it does fix our issue #9380. our env:

  • ubuntu 18.04
  • telegraf running as non-root telegraf user
  • a sudoer file added to allow telegraf user to run sudo /usr/sbin/pqos ...
  • intel_rdt plugin enabled

when telegraf starts, it can start sudo /usr/sbin/pqos ... process. when sudo service telegraf stop fired, both telegraf and pqos process are closed quickly.

Two tiny issues:

  1. the msg in log is a little bit unpleasant:
2021-10-04T20:52:23Z E! [inputs.intel_rdt] failed to shut down pqos: operation not permitted
2021-10-04T20:52:23Z E! [inputs.intel_rdt] pqos: signal: terminated
2021-10-04T20:52:23Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-10-04T20:52:23Z I! [agent] Stopping running output

It seems intel_rdt plugin first tried to kill pqos and has no permission (we didn't grant it kill /usr/sbin/pqos ... permission in sudoer and that's out of the scope of this plugin). However, when systemd kills telegraf, pqos is also killed, with a SIGTERM.

  1. when pqos killed with SIGTERM, it will leave garbage. which is resolved upstream. We need to wait for the new deb package though.

Hi @guoqiao, thanks for your engagement in a itnel_rdt plugin!

I just want to pay attention to the number 1 issue. I have fetched your changes, and it seems that the unpleasant log message is not the only side effect of proposed solution. It seems that Telegraf cannot close itself when run with '--test-wait' flag. So I suppose that it can be closed only when an external impulse appears (like interruption).

However, for me, this changes (not for being merged) KubaTrojan@4116c35 resolve mentioned issue. Could you check this out?

@guoqiao
Copy link
Contributor

guoqiao commented Oct 6, 2021

@KubaTrojan thanks for the review.
I have apply your changes in shutDownPqos on top of this PR and test in our env.
When I run service telegraf stop, the new log is now like:

2021-10-06T20:50:17Z I! [inputs.socket_listener] Listening on tcp://127.0.0.1:8094
2021-10-06T20:50:29Z E! [inputs.intel_rdt] pqos: signal: terminated
2021-10-06T20:50:29Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-10-06T20:50:29Z I! [agent] Stopping running outputs

So the first operation not permitted log is gone.
Does this meet your expectation, please ?

@guoqiao guoqiao force-pushed the patched_intel_rdt branch 2 times, most recently from 9f9c81d to 10d2e3f Compare October 6, 2021 21:46
@sspaink sspaink changed the title Patched intel rdt fix: patched intel rdt Oct 6, 2021
@sspaink sspaink changed the title fix: patched intel rdt fix: patched intel rdt to allow sudo Oct 6, 2021
@guoqiao
Copy link
Contributor

guoqiao commented Oct 6, 2021

@KubaTrojan , PR updated, new review appreciated !

Copy link
Contributor

@KubaTrojan KubaTrojan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guoqiao, I see that you applied a new shutDownPqos method. But it does not resolve the problem of Telegraf being not able to kill pqos (for example with --test-wait flag).
That can be fixed by modifying the exec command by prefixing it with "bin/sh". It could looks similar to that: KubaTrojan@4116c35#diff-623e31993589bb59405f9757412c6a68e1518ed7ffbbc6a23b9e26d922f932a0R263-R267

Please let me know if that works for your env.

}
}

// if pqos is still running after some period, try to kill it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add here information about pqos not cleaning garbage in /sys/fs/resctrl/mon_groups when being terminated. Include issue link: intel/intel-cmt-cat#197 and information that the new pqos version is going to fix that issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is added in comment. Thanks.

Comment on lines 263 to 268
if r.UseSudo {
// prepend sudo to pqos cmd when use_sudo = true
cmd = exec.Command("sudo", append([]string{r.PqosPath}, args...)...)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want a plugin to be able to terminate pqos, please prefix the command with /bin/sh

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is added.

  1. It doesn't make any diffenerce to telegraf.log when stopping:
2021-10-07T22:29:22Z E! [inputs.intel_rdt] pqos: signal: terminated
2021-10-07T22:29:22Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-10-07T22:29:22Z I! [agent] Stopping running outputs
  1. when run telegrafa with --test-wait 2, pqos process does start and stop without issue.

@sspaink sspaink added waiting for response waiting for response from contributor and removed ready for final review This pull request has been reviewed and/or tested by multiple users and is ready for a final review. labels Oct 7, 2021
Xav Paice and others added 2 commits October 8, 2021 11:42
1. introduce bool option `use_sudo` and defaults to `false`
2. when set to `true`, prepend `sudo` to pqos cmd.
3. add doc in plugin README

fixes: influxdata#9380
intel_rdt plugin will print this log when it tries to stop pqos:

    E! [inputs.intel_rdt] failed to shut down pqos: operation not permitted

which causes telegraf fail to close itself without external interruption.

This patch allows intel_rdt plugin to shutdown pqos process properly.
@guoqiao
Copy link
Contributor

guoqiao commented Oct 7, 2021

@KubaTrojan code updated, new review appreciated !

@telegraf-tiger telegraf-tiger bot removed the waiting for response waiting for response from contributor label Oct 7, 2021
Copy link
Contributor

@KubaTrojan KubaTrojan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks, @guoqiao.

@guoqiao
Copy link
Contributor

guoqiao commented Oct 11, 2021

Hi everyone, is there anything else we need to do to get this merged, please ?

@sspaink sspaink merged commit 4321f8a into influxdata:master Oct 12, 2021
@guoqiao
Copy link
Contributor

guoqiao commented Oct 12, 2021

Thanks for all the help!

powersj pushed a commit that referenced this pull request Oct 27, 2021
Co-authored-by: Joe Guo <joe.guo@canonical.com>
(cherry picked from commit 4321f8a)
VladislavSenkevich pushed a commit to gwos/telegraf that referenced this pull request Nov 23, 2021
Co-authored-by: Joe Guo <joe.guo@canonical.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/system plugin/input 1. Request for new input plugins 2. Issues/PRs that are related to input plugins
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When running as non-root user, intel_rdt plugin has no output
9 participants