-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: patched intel rdt to allow sudo #9527
Conversation
Note: supersedes #9501, includes the same changes plus addresses the shutdown issues discussed there. |
e1cab12
to
c1a857f
Compare
Guoqiao is on leave at the moment, like I was last week - though I'm
assigned to the work now and will be around for a few weeks to tidy it up.
Perhaps we can close off #9501 since the same change is included here?
…On Thu, 22 Jul 2021 at 19:28, Sven Rebhan ***@***.***> wrote:
@xavpaice <https://github.com/xavpaice> can you please coordinate with
the guys from #9501 <#9501> to
not have two PRs for the same feature and either fold this change to #9501
<#9501> or close it.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#9527 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABGNMWOK2NPWJLOYOCYLLK3TY7CBTANCNFSM5AY6PYTA>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @xavpaice, thanks for taking over this topic.
I have some concerns (see comments in the code) especially around escaping the pqos
command and, even more severe, killing the process. With the current code, you kill all pqos
processes even though they might not be started by telegraf. This should definitively be fixed.
Sure can you do it or do you want me to close it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @xavpaice,
I already posted two comments in previous PRs so I will mention them: comment1 and comment2.
I think that shutDownPqos
method doesn't have to be divided into two sections and I would like to refactor this method a little. As interrupting isn't working on privileged pqos
, and I believe we don't want to grand telegraf access to perform sudo killall
command, I suggest something which looks like this:
func shutDownPqos(pqos *exec.Cmd) error {
timeout := time.Second
if pqos.Process != nil {
// try to send interrupt signal
pqos.Process.Signal(os.Interrupt)
// wait and constantly check if pqos is still running
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
for {
if err := pqos.Process.Signal(syscall.Signal(0)); err == os.ErrProcessDone {
return nil
} else if ctx.Err() != nil {
break
}
}
// if pqos is still running after some period, try to kill it
err := pqos.Process.Kill()
if err != nil {
return fmt.Errorf("failed to shut down pqos: %v", err)
}
}
return nil
}
Privileged pqos have a strong will to survive, so with sudo_true
option, there always have been some troubles with interrupting it. However, it seems that pqos.Process.Kill()
is doing the job (but leaves garbages which we are leaving for now).
What do you think about the proposed changes?
if err == nil { | ||
err = cmd.Wait() | ||
} | ||
// TODO find a way to make sure it's dead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be checked like this, but be aware that from sending a signal, to pqos being closed, may pass some time.
if err := pqos.Process.Signal(syscall.Signal(0)); err == os.ErrProcessDone {
return nil
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
os.ErrProcessDone was introduced in 1.16, so I'd have to use an alternative.
Do you think it's worth adding some kind of wait condition with a timeout here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
os.ErrProcessDone looks like this:
// ErrProcessDone indicates a Process has finished.
var ErrProcessDone = errors.New("os: process already finished")
So we can introduce our own, local error variable equals: errors.New("os: process already finished")
.
I would suggest adding wait condition with break like:
...
errProcessDone := errors.New("os: process already finished")
// wait and constantly check if pqos is still running
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
for {
if err := pqos.Process.Signal(syscall.Signal(0)); err == errProcessDone {
return nil
} else if ctx.Err() != nil {
break
}
}
...
if err != nil { | ||
err = pqos.Process.Kill() | ||
if pqos.Path == "/bin/sh" { // the process is running under sudo | ||
cmd := exec.Command("sudo", "killall", "-SIGINT", "pqos") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As telegraf is running in unprivileged mode, this operation: sudo killall -SIGINT
is not permitted. I assume that in this case, a proper adjustment has to be written in the sudoers file, but pqos PID will be different each time so it wouldn't be unambiguous and I think setting telegraf to has possibility to use sudo killal
is not a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the issue is that sending SIGINT
to sh
does not guarantee to be passed on to the child process. Therefore, my best bet would be to use gopsutil
to get the child processes of this sh
(we know the PID) and send SIGINT
to those processes first.
if err != nil { | ||
return fmt.Errorf("failed to shut down pqos: %v", err) | ||
} | ||
} else { | ||
err := pqos.Process.Signal(os.Interrupt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some cases error isn't returned and pqos
is still running. If you want to have a shutting down section split (sudo and not sudo), put checking for pqos presence also here, and try to terminate it when checking failed.
I'm unsure if the privileged pqos binary can be shutdown given the current constraints (telegraf running as an unprivileged user, pqos running as root, and no sudo kill/killall). The pqos process is running as root (via sudo). Even if we were to use gopsutils to determine the pid of pqos, an unprivileged user (e.g. telegraf) is not permitted (in this instance) to send SIGINT to a root-owned process. Results from my limited testing confirm this:
Please let me know if this is not correct. An alternative solution that we've been discussing (for which we'd like feedback from upstream): We can have pqos started (as root) via a systemd service. The telegraf user can then be given sudo access to start/stop the pqos service. The telegraf plugin can monitor stdout of journalctl (e.g. |
Thanks for the feedback on this change. The bulk of the comments are easily integrated into an updated commit. There's one remaining challenge with the change itself, which is stopping the process when we want to. This has been the blocker for completing the change and we haven't found a solution as yet. If we use gopsutil to find the child pid of the sudo, for the pqos process, we cannot issue a signal to it without using sudo to do so as it runs as root. Unfortunately, we can't add that command to sudoers without giving blanket 'kill' access to any pid, so Telegraf would have the ability to stop any process it likes. We can't provide a process name using killall, because of other users possibly running pqos. I know there's a small amount of cruft left behind if we stop the parent shell for pqos, but is that significant enough to warrant the difficulty and security issues with killing it cleanly? If pqos is actually not responding to the parent process being killed, I wonder if we should actually be raising a bug against pqos to clean up after itself and accept a signal gracefully. |
Hi Everyone: I have been digging into this issue this week. |
Also, re: when A bug was also raised at: intel/intel-cmt-cat#197 The developer already respond with: "We have pushed a fix on master branch". |
Hi All:
TODO:
|
c1a857f
to
02288b6
Compare
Hi All:
review and feedback appreciated ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, however, it would be good to confirm that this actually fixes the reported issues from someone. Thanks!
Looks like new artifacts were built from this PR. Get them here!Artifact URLs |
hi @powersj, it does fix our issue #9380.
when telegraf starts, it can start Two tiny issues:
It seems intel_rdt plugin first tried to kill pqos and has no permission (we didn't grant it
|
Hi @srebhan : I see this PR is pending on above change request, but that issue doesn't exist any more. I didn't see option to close/resolve it, could you help, please ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @guoqiao, thanks for your engagement in a itnel_rdt plugin! I just want to pay attention to the number 1 issue. I have fetched your changes, and it seems that the unpleasant log message is not the only side effect of proposed solution. It seems that Telegraf cannot close itself when run with '--test-wait' flag. So I suppose that it can be closed only when an external impulse appears (like interruption). However, for me, this changes (not for being merged) KubaTrojan@4116c35 resolve mentioned issue. Could you check this out? |
@KubaTrojan thanks for the review.
So the first |
9f9c81d
to
10d2e3f
Compare
@KubaTrojan , PR updated, new review appreciated ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@guoqiao, I see that you applied a new shutDownPqos method. But it does not resolve the problem of Telegraf being not able to kill pqos (for example with --test-wait flag).
That can be fixed by modifying the exec command by prefixing it with "bin/sh". It could looks similar to that: KubaTrojan@4116c35#diff-623e31993589bb59405f9757412c6a68e1518ed7ffbbc6a23b9e26d922f932a0R263-R267
Please let me know if that works for your env.
} | ||
} | ||
|
||
// if pqos is still running after some period, try to kill it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add here information about pqos not cleaning garbage in /sys/fs/resctrl/mon_groups when being terminated. Include issue link: intel/intel-cmt-cat#197 and information that the new pqos version is going to fix that issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is added in comment. Thanks.
if r.UseSudo { | ||
// prepend sudo to pqos cmd when use_sudo = true | ||
cmd = exec.Command("sudo", append([]string{r.PqosPath}, args...)...) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want a plugin to be able to terminate pqos, please prefix the command with /bin/sh
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is added.
- It doesn't make any diffenerce to telegraf.log when stopping:
2021-10-07T22:29:22Z E! [inputs.intel_rdt] pqos: signal: terminated
2021-10-07T22:29:22Z I! [agent] Hang on, flushing any cached metrics before shutdown
2021-10-07T22:29:22Z I! [agent] Stopping running outputs
- when run telegrafa with
--test-wait 2
, pqos process does start and stop without issue.
1. introduce bool option `use_sudo` and defaults to `false` 2. when set to `true`, prepend `sudo` to pqos cmd. 3. add doc in plugin README fixes: influxdata#9380
intel_rdt plugin will print this log when it tries to stop pqos: E! [inputs.intel_rdt] failed to shut down pqos: operation not permitted which causes telegraf fail to close itself without external interruption. This patch allows intel_rdt plugin to shutdown pqos process properly.
10d2e3f
to
09cdf4d
Compare
@KubaTrojan code updated, new review appreciated ! |
📦 Looks like new artifacts were built from this PR. Expand this list to get them here! 🐯Artifact URLs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thanks, @guoqiao.
Hi everyone, is there anything else we need to do to get this merged, please ? |
Thanks for all the help! |
Co-authored-by: Joe Guo <joe.guo@canonical.com> (cherry picked from commit 4321f8a)
Co-authored-by: Joe Guo <joe.guo@canonical.com>
Required for all PRs:
resolves #9380
When running telegraf as a non-root user, this change allows the use
of sudo to run the necessary pqos command for the intel_rdt plugin.