New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pantsd client logs exceptions from server processes #6539

Merged
merged 7 commits into from Sep 22, 2018

Conversation

Projects
None yet
3 participants
@stuhood
Copy link
Member

stuhood commented Sep 21, 2018

Problem

#6530 describes cases where when pantsd (or a pantsd-runner instance) is killed, there can be gaps in our exception logging (and rendering) that result in exceptions not reaching either the client or a log.

Solution

Build atop #6533 to consume written logs and render them for a pantsd client. When either pantsd or a pantsd-runner instance die, if the ExceptionSink (or Exiter) has installed an exception hook that renders the exception to the appropriate pid, it will be logged, and rendered to the user.

Result

Killed pantsd-runner or pantsd processes have increased visibility and debuggability.

@stuhood stuhood force-pushed the twitter:stuhood/client-logs-exceptions-from-runner branch from 436fc26 to 51f7cdb Sep 21, 2018

@stuhood stuhood requested review from cosmicexplorer , kwlzn and Eric-Arellano Sep 21, 2018

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented Sep 21, 2018

There is one test added here that we're expecting to fail until @cosmicexplorer's second component of #6530 lands and the killed pantsd-runner actually renders something.

@stuhood stuhood changed the title WIP: pantsd client logs exceptions from server processes pantsd client logs exceptions from server processes Sep 21, 2018

@stuhood

This comment has been minimized.

Copy link
Member

stuhood commented Sep 21, 2018

This is now reviewable.

# We care more about this log than the shared log, so completely write to it first. This
# avoids any errors with concurrent modification of the shared log affecting the per-pid log.
with safe_open(cls._exceptions_log_path(for_pid=pid), 'a') as pid_error_log:
pid_error_log.write(fatal_error_log_entry)
safe_file_dump(cls.exceptions_log_path(for_pid=pid), fatal_error_log_entry, mode='w')

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Sep 21, 2018

Contributor

I don’t think this will work on Py3. You’re encoding it to bytes, but then writing in Unicode mode. The two must be consistent

This comment has been minimized.

@stuhood

stuhood Sep 21, 2018

Member

safe_file_dump defaults to binary_mode=True.

:returns: The pailgun port number of the running pantsd instance.
:rtype: int
:returns: The pailgun pid and port numbers of the running pantsd instance.
:rtype: tuple

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Sep 21, 2018

Contributor

Possible to say a tuple of what?

Type hints annotation is Tuple[foo, bar, baz]. Not sure for comments.

This comment has been minimized.

@stuhood

stuhood Sep 21, 2018

Member

I'll add the datatype that Danny suggested.

@@ -100,18 +100,34 @@ def safe_mkdir_for_all(paths):
created_dirs.add(dir_to_make)


def safe_file_dump(filename, payload, binary_mode=True):
def safe_file_dump(filename, payload, binary_mode=True, mode='w'):

This comment has been minimized.

@Eric-Arellano

Eric-Arellano Sep 21, 2018

Contributor

Hm I see now why you were just using mode=w, because binary mode was true.

I think it would be better to change this functions signature to one of these two:

  • just mode
  • binary_mode and append, both booleans. Append defaults to false

This avoids the possibility of trying to pass mode=wb and having binary_mode=true

This comment has been minimized.

@stuhood

stuhood Sep 21, 2018

Member

I agree, but removing an existing parameter is out of scope for this PR I think. I can open a new ticket for that?

This comment has been minimized.

@stuhood

stuhood Sep 21, 2018

Member

Opened #6543 about it, and begin stitching these together in a more reasonable way here.

@cosmicexplorer
Copy link
Contributor

cosmicexplorer left a comment

I'm only iffy on the safe_file_dump change because "safe" is a good word to be sure about, and the deleted tests, the rest seems great, especially the added pantsd test.

@@ -81,8 +81,8 @@ def maybe_launch(cls, bootstrap_options=None):
"""Creates and launches a daemon instance if one does not already exist.
:param Options bootstrap_options: The bootstrap options, if available.
:returns: The pailgun port number of the running pantsd instance.
:rtype: int
:returns: The pailgun pid and port numbers of the running pantsd instance.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Sep 21, 2018

Contributor

Explicitly returning the same tuple in two methods => I would make a datatype to give a name to it, especially since it's transitively returned in tail calls in many methods in this file. It also means fewer docstrings to write.

@@ -402,11 +405,11 @@ def launch(self):
self._logger.debug('launching pantsd')
self.daemon_spawn()
# Wait up to 60 seconds for pantsd to write its pidfile.
self.await_pid(60)
pantsd_pid = self.await_pid(60)

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Sep 21, 2018

Contributor

A non-configurable 60 seconds sounds like a pretty long time, and probably something that was done with CI in mind, but if the daemon hangs starting up then the user gets to wait a whole minute before failure (if I'm reading this right). Is that something we could turn into a bootstrap option?

I wouldn't expect this to be related to the issue the logging changes are for however so I can break out a new ticket for it.

This comment has been minimized.

@stuhood

stuhood Sep 21, 2018

Member

await logs every 5 seconds while it is waiting, so it does not look like a silent hang.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Sep 21, 2018

Contributor

Ok, great, that sounds fine and not like a problem.

@@ -26,17 +26,6 @@ class AnonymousSink(ExceptionSink): pass
def test_unset_destination(self):
self.assertEqual(os.getcwd(), self._gen_sink_subclass().get_destination())

def test_set_invalid_destination(self):

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Sep 21, 2018

Contributor

Confused as to why these are deleted?

This comment has been minimized.

@stuhood

stuhood Sep 21, 2018

Member

There is less code, and thus less surface area now... I could add a test of safe_file_open, I suppose... but that is used widely enough that it is plenty covered already.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Sep 21, 2018

Contributor

In the end, I would like to make sure we are testing the directory before we try to write anything to it in a signal handler, but also, I don't think these tests need to be in this PR.

@@ -100,18 +100,34 @@ def safe_mkdir_for_all(paths):
created_dirs.add(dir_to_make)


def safe_file_dump(filename, payload, binary_mode=True):
def safe_file_dump(filename, payload, binary_mode=True, mode='w'):

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Sep 21, 2018

Contributor

Thinking along the same lines as this comment from my last PR that you've already read about whether append is atomic, I think calling this "safe" makes sense if we're always truncating the file with 'w' or 'wb', but I'm not sure it makes sense adding append into the picture. If there's a link that shows this is definitely atomic in the case of multiple processes I'd like it to be commented here, or if that can't be done, I don't know if I'd want to extend this method with a mode argument, and if the method of assuring atomicity for appends requires extra work, I'd probably want it in a separate method anyway.

This comment has been minimized.

@stuhood

stuhood Sep 21, 2018

Member

I did not remove the TODO that references atomicity: using safe_file_open was not intended to address atomicity.

The extent to which safe_file_open is "safe" is explained in the safe_open method. I can reference it from the docstring.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Sep 21, 2018

Contributor

Yep, ok, looked at safe_open and I think no change is needed here. I would like to not make it easy to append to a file in a non-atomic way, which isn't a concern shared with 'r' or 'w' modes -- but I think the better way is probably to lift that out into a separate method or something.

)

# Wait for the python run to be running.
time.sleep(15)

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Sep 21, 2018

Contributor

Could consider a retry strategy for this "chaos monkey" testing approach should this test ever become flaky.

delta_processes = after_processes.difference(before_processes)
context = TrackedProcessesContext(name, set(_safe_iter_matching_processes(name)))
yield context
delta_processes = context.current_processes

This comment has been minimized.

@stuhood

stuhood Sep 21, 2018

Member

Hah. Taking the value of a method is unlikely to pass CI.

stuhood added some commits Sep 21, 2018

"""A handle to a "probably running" pantsd instance.
We attempt to verify that the pantsd instance is still running when we create a Handle, but
after it has been created it is entirely process that the pantsd instance perishes.

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Sep 21, 2018

Contributor

Probably meant "...entirely possible that" here? I think this is a very useful note to make.

self.assertEqual(read_file(test_filename), test_content)
safe_file_dump(test_filename, test_content, **dump_kwargs)
self.assertEqual(read_file(test_filename, **read_kwargs), test_content)

This comment has been minimized.

@cosmicexplorer

cosmicexplorer Sep 21, 2018

Contributor

Nice!

@Eric-Arellano
Copy link
Contributor

Eric-Arellano left a comment

Refer to discussion in #6543

Thanks Stu!

@stuhood stuhood merged commit 5d850fc into pantsbuild:master Sep 22, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@stuhood stuhood deleted the twitter:stuhood/client-logs-exceptions-from-runner branch Sep 22, 2018

cosmicexplorer added a commit that referenced this pull request Oct 3, 2018

pantsd client logs exceptions from server processes (#6539)
### Problem

#6530 describes cases where when `pantsd` (or a `pantsd-runner` instance) is killed, there can be gaps in our exception logging (and rendering) that result in exceptions not reaching either the client or a log.

### Solution

Build atop #6533 to consume written logs and render them for a `pantsd` client. When either `pantsd` or a `pantsd-runner` instance die, if the `ExceptionSink` (or `Exiter`) has installed an exception hook that renders the exception to the appropriate `pid`, it will be logged, and rendered to the user.

### Result

Killed `pantsd-runner` or `pantsd` processes have increased visibility and debuggability.

cosmicexplorer added a commit that referenced this pull request Oct 9, 2018

first attempt at centralizing more global error logging state in Exce…
…ptionSink (#6552)

### Problem

See #6530. This is bullet point 2 of that issue.

### Solution

- rename `ExceptionSink#set_destination()` to `reset_log_location()`.
- (re)?acquire open file handles and persist them as `ExceptionSink` private fields so we can write to them in signal handlers. 
  - use `faulthandler.enable()` to log fatal signals (`SIGABRT`, `SIGSEGV`, ...) and stacktraces to the process-specific error stream from #6539.
- add `reset_exiter()` and delegate to the exiter to exit on an unhandled exception.
- add `reset_interactive_output_stream()` to change which stream the Exiter is provided as the `out` argument, and to enable stacktraces to be sent to that stream when the process receives SIGUSR2 (on par with current functionality)
- move everything besides exiting out of the Exiter class

### Result

We can un-xfail [the integration test](https://github.com/pantsbuild/pants/blob/f70e7c0b7d55a389d31b9b3b706d0901bc9f655e/tests/python/pants_test/pantsd/test_pantsd_integration.py#L416) introduced in #6539. This makes it significantly easier to diagnose pantsd communication errors and their causes, both locally and in CI.

cosmicexplorer added a commit that referenced this pull request Oct 10, 2018

pantsd client logs exceptions from server processes (#6539)
### Problem

#6530 describes cases where when `pantsd` (or a `pantsd-runner` instance) is killed, there can be gaps in our exception logging (and rendering) that result in exceptions not reaching either the client or a log.

### Solution

Build atop #6533 to consume written logs and render them for a `pantsd` client. When either `pantsd` or a `pantsd-runner` instance die, if the `ExceptionSink` (or `Exiter`) has installed an exception hook that renders the exception to the appropriate `pid`, it will be logged, and rendered to the user.

### Result

Killed `pantsd-runner` or `pantsd` processes have increased visibility and debuggability.

cosmicexplorer added a commit that referenced this pull request Oct 10, 2018

first attempt at centralizing more global error logging state in Exce…
…ptionSink (#6552)

See #6530. This is bullet point 2 of that issue.

- rename `ExceptionSink#set_destination()` to `reset_log_location()`.
- (re)?acquire open file handles and persist them as `ExceptionSink` private fields so we can write to them in signal handlers.
  - use `faulthandler.enable()` to log fatal signals (`SIGABRT`, `SIGSEGV`, ...) and stacktraces to the process-specific error stream from #6539.
- add `reset_exiter()` and delegate to the exiter to exit on an unhandled exception.
- add `reset_interactive_output_stream()` to change which stream the Exiter is provided as the `out` argument, and to enable stacktraces to be sent to that stream when the process receives SIGUSR2 (on par with current functionality)
- move everything besides exiting out of the Exiter class

We can un-xfail [the integration test](https://github.com/pantsbuild/pants/blob/f70e7c0b7d55a389d31b9b3b706d0901bc9f655e/tests/python/pants_test/pantsd/test_pantsd_integration.py#L416) introduced in #6539. This makes it significantly easier to diagnose pantsd communication errors and their causes, both locally and in CI.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment