job-control: avoid kill race #8273

Merged
merged 4 commits into from
Apr 16, 2018
Conversation

justinmk
Member

@justinmk justinmk commented Apr 14, 2018

see #8269

@oni-link @bfredl @jamessan @blueyed sanity-check appreciated.

children_kill_cb() is racy. One obvious problem is that process_close_handles() is queued by on_process_exit(), so when children_kill_cb() is invoked, the dead process might still be in the loop->children list. If the OS already reclaimed the dead PID, Nvim may try to SIGKILL it.

Avoid that by checking proc->status.

Vim doesn't have this problem because it doesn't attempt to kill processes that ignored SIGTERM after a timeout. (Future note: maybe we should not attempt this either ...)
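A minimal sketch of the guard described above, for illustration only. `SketchProc` and `sketch_should_kill` are hypothetical names, not Nvim's actual types, and the convention that `status` stays at -1 while the process is running is an assumption inferred from the `proc->status >= 0` check in this PR.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical, trimmed-down stand-in for Nvim's Process struct.
typedef struct {
  int pid;
  int status;             // assumed: -1 while running, >= 0 once exited
  uint64_t stopped_time;  // 0 until a stop was requested
} SketchProc;

// Returns true only if the kill timer should still signal this process.
// An exited process is skipped even if it is still in the children list,
// because its PID may already have been reused by the OS.
static bool sketch_should_kill(const SketchProc *proc)
{
  bool exited = (proc->status >= 0);
  if (exited || !proc->stopped_time) {
    return false;
  }
  return true;
}
```

The point of the check is that list membership alone does not prove the PID still belongs to the child, so the exit status is consulted before any signal is sent.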

@@ -228,29 +228,26 @@ void process_stop(Process *proc) FUNC_ATTR_NONNULL_ALL
}

Loop *loop = proc->loop;
if (!loop->children_stop_requests++) {
Member Author

Small cleanup: I removed loop->children_stop_requests because it seems useless. It will always be 0 here, because process_stop() is guarded by proc->stopped_time.

Member Author

@justinmk justinmk Apr 14, 2018

Hmm, I think I get it now: loop->children_stop_requests is "global" for all processes, and we only want one children_kill_timer to reap all processes.

But the libuv doc for uv_timer_start says:

If the timer is already active, it is simply updated.

so it's harmless to just call uv_timer_start every time a process-stop is attempted.

static void children_kill_cb(uv_timer_t *handle)
{
Loop *loop = handle->loop->data;
uint64_t now = os_hrtime();

kl_iter(WatcherPtr, loop->children, current) {
Process *proc = (*current)->data;
- if (!proc->stopped_time) {
+ bool exited = (proc->status >= 0);
+ if (exited || !proc->stopped_time) {
Member Author

@justinmk justinmk Apr 14, 2018

Attempt to avoid the race in #8269

nvim process. The process will not get killed
when nvim exits. If the process dies before
nvim exits, "on_exit" will still be invoked.
detach : (non-pty only) Detach the job process, so it will
Member

I would focus on "The process will not get killed when nvim exits." because that is the main effect this option now has regardless of platform.

@bfredl
Member

bfredl commented Apr 14, 2018

Looks sane to me on a quick glance

&& !--loop->children_stop_requests) {
// Stop the timer if no more stop requests are pending
DLOG("Stopping process kill timer");
ILOG("exited: pid=%d status=%d stoptime=%" PRId64, proc->pid,
Contributor

proc->stopped_time is a uint64_t, so PRIu64 instead of PRId64?
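The format-specifier point can be checked in isolation. The sketch below uses a hypothetical helper, `sketch_format_exit`, that prints the same three fields with `snprintf` instead of Nvim's `ILOG` macro; only the `PRIu64` macro from `<inttypes.h>` is the part under discussion.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper: formats the same fields as the log call under review.
// PRIu64 matches uint64_t; PRId64 would reinterpret large values as negative.
static int sketch_format_exit(char *buf, size_t len, int pid, int status,
                              uint64_t stopped_time)
{
  return snprintf(buf, len, "exited: pid=%d status=%d stoptime=%" PRIu64,
                  pid, status, stopped_time);
}
```

With `PRIu64`, a value above `INT64_MAX` is printed as the unsigned number it actually is.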

DLOG("Stopping process kill timer");
ILOG("exited: pid=%d status=%d stoptime=%" PRId64, proc->pid,
proc->status, proc->stopped_time);
if (proc->stopped_time) {
Contributor

What if we already have a process that is waiting to be killed, but the next ending process stops the kill-timer now? Then the first process would be waiting until another process starts the kill timer.

@justinmk justinmk force-pushed the job-kill-race branch 2 times, most recently from 284dfc1 to b123f98 Compare April 14, 2018 15:27
@justinmk
Member Author

What if we already have a process that is waiting to be killed, but the next ending process stops the kill-timer now?

@oni-link The last commit changes the timer to be non-repeating, and uv_timer_stop is never called.

- uv_timer_start(&loop->children_kill_timer, children_kill_cb,
-                KILL_TIMEOUT_MS, KILL_TIMEOUT_MS);
+ uv_timer_start(&proc->loop->children_kill_timer, children_kill_cb,
+                KILL_TIMEOUT_MS, 0);
Contributor

Is the process loop here also the main loop? If so all processes use the same timer and a sequence of stopped processes could reset the timer without ever being killed.

Member Author

@justinmk justinmk Apr 14, 2018

Is the process loop here also the main loop? If so all processes use the same timer

Yes, that's how it has always been.

Based on libuv docs and my testing, each uv_timer_start call resets the timer. (So the minimum window is 2s, with unlimited "maximum".)

And the timer callback always iterates the processes.

So I think this is fine?
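To make the "minimum 2s, unlimited maximum" window concrete, here is a toy model of a shared one-shot timer. It is not libuv: `SketchTimer`, `sketch_timer_start` and `sketch_timer_poll` are hypothetical, and the 2000 ms constant is an assumption taken from the "2s" mentioned above. The model only mirrors the documented behavior that restarting an active timer moves its deadline.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_KILL_TIMEOUT_MS 2000  // assumed value, from the "2s" above

// Toy model of a one-shot timer shared by all stopped processes.
typedef struct {
  bool active;
  uint64_t deadline_ms;
} SketchTimer;

// Starting an already-active timer simply moves its deadline.
static void sketch_timer_start(SketchTimer *t, uint64_t now_ms)
{
  t->active = true;
  t->deadline_ms = now_ms + SKETCH_KILL_TIMEOUT_MS;
}

// Returns true (and deactivates the timer) once the deadline has passed.
static bool sketch_timer_poll(SketchTimer *t, uint64_t now_ms)
{
  if (t->active && now_ms >= t->deadline_ms) {
    t->active = false;
    return true;
  }
  return false;
}
```

In this model, a stop request at 0 ms followed by another at 1500 ms leaves the timer silent at 2000 ms and firing at 3500 ms, so the first process waits longer than the nominal timeout. Because the callback iterates all processes when it does fire, both are handled in the same pass.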

continue;
}
uint64_t elapsed = (now - proc->stopped_time) / 1000000 + 1;

if (elapsed >= KILL_TIMEOUT_MS) {
int sig = proc->type == kProcessTypePty && elapsed < KILL_TIMEOUT_MS * 2
Contributor

Should this code be changed so that SIGKILL is always sent?

  • Calling process_stop() for a pty that ignores SIGTERM would probably never, on its own, reach an elapsed time of 2*KILL_TIMEOUT_MS. The pty would have to wait until another process is stopped so that the kill timer is restarted.
  • If the timer fires too early, no signal would be sent and the process would again have to wait for another stopped process.

Member Author

@oni-link Good catch again. @blueyed was there a specific reason we need to try SIGTERM before SIGKILL, if a PTY process didn't terminate after SIGHUP?

Member

It is not uncommon for there to be special handling of SIGHUP, e.g. to reload config or show progress. SIGTERM, on the other hand, is unlikely to have an overloaded meaning. Therefore issuing SIGTERM still gives an opportunity for graceful shutdown before the sledgehammer of SIGKILL.
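The escalation order argued for here can be sketched as a small lookup. `sketch_next_signal` is a hypothetical helper, not code from this PR; it only encodes the SIGHUP, then SIGTERM, then SIGKILL ordering discussed in this thread, on a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>

// Hypothetical helper: given the last signal a process ignored, returns the
// next, stronger one to try, or 0 if there is nothing left to escalate to.
static int sketch_next_signal(int last_sig)
{
  switch (last_sig) {
    case SIGHUP:
      return SIGTERM;  // SIGHUP may have been handled as "reload config"
    case SIGTERM:
      return SIGKILL;  // last chance for graceful shutdown has passed
    default:
      return 0;
  }
}
```

A process that was first sent SIGHUP therefore gets two timer periods before SIGKILL, while one that was first sent SIGTERM gets one.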

@justinmk
Member Author

justinmk commented Apr 15, 2018

was there a specific reason we need to try SIGTERM before SIGKILL, if a PTY process didn't terminate after SIGHUP?

SIGTERM still gives an opportunity for graceful shutdown before the sledgehammer of SIGKILL.

@jamessan Ok, 6cb2be8 c86775a fixes that mostly. Though it still has an (unlikely) edge case where it's possible that SIGKILL might follow SIGTERM too quickly (for PTY processes only).

@oni-link
Contributor

@justinmk, in what situation can a SIGKILL follow a SIGTERM too quickly for ptys?

@justinmk
Member Author

@oni-link good call :) I was thinking of the situation where "elapsed time" was used as the guard. But now that's removed.

It serves no purpose because process_stop() is already guarded by
`proc->stopped_time`.

children_kill_cb() is racy. One obvious problem is that
process_close_handles() is *queued* by on_process_exit(), so when
children_kill_cb() is invoked, the dead process might still be in the
`loop->children` list.  If the OS already reclaimed the dead PID, Nvim
may try to SIGKILL it.

Avoid that by checking `proc->status`.

Vim doesn't have this problem because it doesn't attempt to kill
processes that ignored SIGTERM after a timeout.

closes neovim#8269

Before f31c26f the timer was used to try SIGTERM *and* SIGKILL, so
a repeating timer was needed.  After f31c26f process_stop() sends
SIGTERM immediately, and the timer only sends SIGKILL.

So we don't need a repeating timer.
- Simplifies the logic: don't need to call uv_timer_stop() explicitly.
- Avoids a problem: if process_stop() is called more than once in the
  2-second window, the first on_process_exit() would call
  uv_timer_stop() which stops the timer for all stopped processes.

1. Don't check elapsed time in children_kill_cb(), it's already implied
   by the start-time of the timer itself.
2. Restart timer from children_kill_cb() for PTY jobs, to send SIGKILL
   after SIGTERM. There is an edge case where SIGKILL might follow
   SIGTERM too quickly, if jobstop() is called near the 2-second timer
   window.  But this edge case is not worth code complication.

@justinmk justinmk merged commit ed6a113 into neovim:master Apr 16, 2018
@justinmk justinmk deleted the job-kill-race branch April 16, 2018 06:44
justinmk added a commit that referenced this pull request Jun 11, 2018
FEATURES:
3cc7ebf #7234 built-in VimL expression parser
6a7c904 #4419 implement <Cmd> key to invoke command in any mode
b836328 #7679 'startup: treat stdin as text instead of commands'
58b210e :digraphs : highlight with hl-SpecialKey #2690
7a13611 #8276 'startup: Let `-s -` read from stdin'
1e71978 events: VimSuspend, VimResume #8280
1e7d5e8 #6272 'stdpath()'
f96d99a #8247 server: introduce --listen
e8c39f7 #8226 insert-mode: interpret unmapped META as ESC
98e7112 msg: do not scroll entire screen (#8088)
f72630b #8055 let negative 'writedelay' show all redraws
5d2dd2e win: has("wsl") on Windows Subsystem for Linux #7330
a4f6cec cmdline: CmdlineEnter and CmdlineLeave autocommands (#7422)
207b7ca #6844 channels: support buffered output and bytes sockets/stdio

API:
f85cbea #7917 API: buffer updates
418abfc #6743 API: list information about all channels/jobs.
36b2e3f #8375 API: nvim_get_commands
273d2cd #8329 API: Make nvim_set_option() update `:verbose set …`
8d40b36 #8371 API: more reliable/descriptive VimL errors
ebb1acb #8353 API: nvim_call_dict_function
9f994bb #8004 API: nvim_list_uis
3405704 #7520 API/UI: forward option updates to UIs
911b1e4 #7821 API: improve nvim_command_output

WINDOWS OS:
9cefd83 #8084, #8516 build/win: support MSVC
ee4e1fd win: Fix reading content from stdin (#8267)

TUI:
ffb8904 #8309 TUI: add support for mouse release events in urxvt
8d5a46e #8081 TUI: implement "standout" attribute
6071637 TUI: support TERM=konsole-256color
67848c0 #7653 TUI: report TUI info with -V3 ('verbose' >= 3)
3d0ee17 TUI/rxvt: enable focus-reporting
d109f56 #7640 TUI: 'term' option: reflect effective terminal behavior

FIXES:
ed6a113 #8273 'job-control: avoid kill-timer race'
4e02f1a #8107 'jobs: separate process-group'
451c48a terminal: flush vterm output buffer on pty output #8486
5d6732f :checkhealth fixes #8335
53f11dc #8218 'Fix errors reported by PVS'
d05712f inccommand: pause :terminal redraws (#8307)
51af911 inccommand: do not execute trailing commands #8256
84359a4 terminal: resize to the max dimensions (#8249)
d49c1dd #8228 Make vim_fgets() return the same values as in Vim
60e96a4 screen: winhl=Normal:Background should not override syntax (#8093)
0c59ac1 #5908 'shada: Also save numbered marks'
ba87a2c cscope: ignore EINTR while reading the prompt (#8079)
b1412dc #7971 ':terminal Enter/Leave should not increment jumplist'
3a5721e TUI: libtermkey: force CSI driver for mouse input #7948
6ff13d7 #7720 TUI: faster startup
1c6e956 #7862 TUI: fix resize-related segfaults
a58c909 #7676 TUI: always hide cursor when flushing, never flush buffers during unibilium output
303e1df #7624 TUI: disable BCE almost always
249bdb0 #7761 mark: Make sure that jumplist item will not have zero lnum
6f41ce0 #7704 macOS: Set $LANG based on the system locale
a043899 #7633 'Retry fgets on EINTR'

CHANGES:
ad60927 #8304 default to 'nofsync'
f3f1970 #8035 defaults: 'fillchars'
a6052c7 #7984 defaults: sidescroll=1
b69fa86 #7888 defaults: enable cscopeverbose
7c4bb23 defaults: do :filetype stuff unless explicitly "off"
2aa308c #5658 'Apply :lmap in macros'
8ce6393 terminal: Leave 'relativenumber' alone (#8360)
e46534b #4486 refactor: Remove maxmem, maxmemtot options
131aad9 win: defaults: 'shellcmdflag', 'shellxquote' #7343
c57d315 #8031 jobwait(): return -2 on interrupt also with timeout
6452831 clipboard: macOS: fallback to tmux if pbcopy is broken #7940
300d365 #7919 Make 'langnoremap' apply directly after a map
ada1956 #7880 'lua/executor: Remove lightuserdata'

INTERNAL:
de0a954 #7806 internal statistics for list impl
dee78a4 #7708 rewrite internal list impl