Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make PID file handling a little more robust. #6738

Merged
merged 2 commits into from Jul 17, 2019

Conversation

@raskchanky
Copy link
Member

commented Jul 17, 2019

First, we now use AtomicWriter to write out PID files, which should make
it more resilient to sudden failures. In addition, instead of blindly
unwrapping the result of read_pid, which results a Result, we now check
for the presence of an error and log it.

Crucially, if read_pid returns an error, our branching logic returns None,
which causes the corrupt PID file to get deleted and rewritten by the
supervisor. Now, not only are the chances of having a 0 byte PID file
reduced, but even if it does manage to occur, the supervisor will
proceed along as though no PID file ever existed to begin with,
rather than crashing.

Closes #6270
Signed-off-by: Josh Black raskchanky@gmail.com

First, we now use AtomicWriter to write out PID files, which should make
it more resilient to sudden failures. In addition, instead of blindly
unwrapping the result of read_pid, which results a Result, we now check
for the presence of an error and log it.

Crucially, if read_pid returns an error, our branching logic returns None,
which causes the corrupt PID file to get deleted and rewritten by the
supervisor. Now, not only are the chances of having a 0 byte PID file
reduced, but even if it does manage to occur, the supervisor will
proceed along as though no PID file ever existed to begin with,
rather than crashing.

Signed-off-by: Josh Black <raskchanky@gmail.com>
Copy link
Contributor

left a comment

Looks good to me. I had some ideas for simplification, but nothing essential

w.with_writer(|f| {
write!(f, "{}", pid)?;
Ok(())
})

This comment has been minimized.

Copy link
@baumanj

baumanj Jul 17, 2019

Contributor

I think you can save a couple lines by replacing L249–253 with

            fs::atomic_write(&self.pid_file, pid.to_string())

This comment has been minimized.

Copy link
@raskchanky

raskchanky Jul 17, 2019

Author Member

Oh, nice. That's much better.

error!("Error reading PID file: {}", e);
None
}
}

This comment has been minimized.

Copy link
@baumanj

baumanj Jul 17, 2019

Contributor

This is the only caller for read_pid and it doesn't propagate the error, so if we move the error printing into read_pid, and have it return an Option we reduce match statement to:

                    read_pid(&self.pid_file)

Further, since L59 is redundant, we can reduce the whole let pid block starting on L58 to:

        let pid = self.pid.or_else(|| {
                              if self.pid_file.exists() {
                                  read_pid(&self.pid_file)
                              } else {
                                  None
                              }
                          });

And if we move the self.pid_file.exists() check inside read_pid (we could even avoid the second syscall with a match arm like:

        // Don't log an error if the file doesn't exist
        Err(ref err) if err.kind() == std::io::ErrorKind::NotFound => return None,

), we further simplify to:

        let pid = self.pid.or_else(|| read_pid(&self.pid_file));

But since we do an if let Some(pid) = pid directly afterward on L74, why not combine with and_then? That gets a little deep with the process::is_alive condition and need to return, but if we pull that out and recognize that we always assign self.pid, we get something quite nice:

        self.pid = self.pid
                       .or_else(|| read_pid(&self.pid_file))
                       .and_then(|pid| {
                           if process::is_alive(pid) {
                               Some(pid)
                           } else {
                               debug!("Could not find a live process with pid {:?}", pid);
                               None
                           }
                       });

        if self.pid.is_some() {
            self.change_state(ProcessState::Up);
        } else {
            self.change_state(ProcessState::Down);
            self.cleanup_pidfile();
        }
        self.pid.is_some()

I believe that's equivalent with the exception that you'll never see Could not find a live process with pid None in debug logs. I realize that's a pretty major refactoring, so take or leave what you like.

This comment has been minimized.

Copy link
@christophermaier

christophermaier Jul 17, 2019

Contributor

This is very similar to what I was kicking around the other day on another PR, as a clarification of the check_process function, completely separate from this unwrap issue.

Funnily enough, a big thing that spurred me to start poking around here was noticing that the error message about not finding a live process with the PID None was nonsense 😆 So I'm totally fine with the fact that we don't emit that message.

I think this is equivalent in behavior, otherwise.

This comment has been minimized.

Copy link
@raskchanky

raskchanky Jul 17, 2019

Author Member

Nice, that's a great improvement.

Copy link
Contributor

left a comment

One behavioral thing to note is that if the Supervisor restarts while leaving services up, and one of those services' PID files is corrupted or otherwise unreadable, the Supervisor will now try to repeatedly start the service again, even though a service is still running, because there isn't a way for it to truly synchronize its internal state with what's actually happening on the system. I don't think there's really anything that can be done in that case, though (and it's certainly preferable to crashing the Supervisor). Killing the old process will fix the situation.

A future refactoring where the Supervisor can get the PID information directly from the Launcher will definitely solve that issue, though.

* Get rid of the error types that are only ever used in a single
function.
* Consolidate and simplify the PID handling logic

Signed-off-by: Josh Black <raskchanky@gmail.com>
@raskchanky raskchanky self-assigned this Jul 17, 2019
@raskchanky raskchanky merged commit 5734cce into master Jul 17, 2019
5 checks passed
5 checks passed
DCO This commit has a DCO Signed-off-by
Details
buildkite/habitat-sh-habitat-master-verify Build #2697 passed (31 minutes, 33 seconds)
Details
buildkite/habitat-sh-habitat-master-website Build #3362 passed (1 minute, 30 seconds)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
expeditor/config-validation Validated your Expeditor config file
Details
@raskchanky raskchanky deleted the jb/atomic-pids branch Jul 17, 2019
chef-ci added a commit that referenced this pull request Jul 17, 2019
Obvious fix; these changes are the result of automation not creative thinking.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.