New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
first attempt at centralizing more global error logging state in ExceptionSink #6552
Changes from 68 commits
0d07c66
3eeac82
8ad5689
c809c9d
353a977
3cd1b0a
4233faa
bc291d3
7fe1787
ce0a71f
fb68b02
08d6e94
b1a95c6
c2708e5
08ccce9
0d0fc2a
35bd7fb
3255137
769d254
ddce65b
525185e
12c13d9
577f3b6
8e8fb5b
68e1eb5
d479305
3db3117
11c21a9
f9d9abf
813967e
058801d
b197ae5
c5122cd
64068f7
0afc14a
f8e9b47
5767e2a
b2494d0
39b8220
8bcb877
e460584
921cda8
3f1dd8a
ee83b31
b12c33f
e06c48b
74f3031
c71b671
d1f1b3a
1dcb660
bd3ac85
bd07573
a909cbf
c89d27d
9bef93a
859136f
df6e546
c665342
adee768
a8ed3d9
f70e7c0
9111c5f
d3934fa
3252202
7abcf10
865efe3
700b2ea
17f6787
0dcf448
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,12 +5,18 @@ | |
from __future__ import absolute_import, division, print_function, unicode_literals | ||
|
||
import datetime | ||
import faulthandler | ||
import logging | ||
import os | ||
import signal | ||
import sys | ||
from builtins import object | ||
import traceback | ||
from builtins import object, str | ||
|
||
from pants.util.dirutil import safe_file_dump | ||
from pants.base.exiter import Exiter | ||
from pants.util.dirutil import safe_mkdir, safe_open | ||
from pants.util.meta import classproperty | ||
from pants.util.osutil import IntegerForPid | ||
|
||
|
||
logger = logging.getLogger(__name__) | ||
|
@@ -19,28 +25,211 @@ | |
class ExceptionSink(object): | ||
"""A mutable singleton object representing where exceptions should be logged to.""" | ||
|
||
_destination = os.getcwd() | ||
# NB: see the bottom of this file where we call reset_log_location() and other mutators in order | ||
# to properly setup global state. | ||
_log_dir = None | ||
# We need an exiter in order to know what to do after we log a fatal exception or handle a | ||
# catchable signal. | ||
_exiter = None | ||
# Where to log stacktraces to in a SIGUSR2 handler. | ||
_interactive_output_stream = None | ||
|
||
# These persistent open file descriptors are kept so the signal handler can do almost no work | ||
# (and lets faulthandler figure out signal safety). | ||
_pid_specific_error_fileobj = None | ||
_shared_error_fileobj = None | ||
|
||
# Integer code to exit with on an unhandled exception. | ||
UNHANDLED_EXCEPTION_EXIT_CODE = 1 | ||
|
||
# faulthandler.enable() installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL, but we | ||
# also want to dump a traceback when receiving these other usually-fatal signals. | ||
GRACEFULLY_HANDLED_SIGNALS = [ | ||
signal.SIGTERM, | ||
] | ||
|
||
# These signals are converted into a KeyboardInterrupt exception. | ||
KEYBOARD_INTERRUPT_SIGNALS = [ | ||
signal.SIGINT, | ||
signal.SIGQUIT, | ||
] | ||
|
||
@classproperty | ||
def all_gracefully_handled_signals(cls): | ||
"""Return all the signals which are handled when calling reset_log_location(). | ||
|
||
These signals are handled by the method handle_signal_gracefully(). | ||
|
||
NB: A very basic SIGUSR2 handler is installed by reset_interactive_output_stream(). | ||
""" | ||
return cls.GRACEFULLY_HANDLED_SIGNALS + cls.KEYBOARD_INTERRUPT_SIGNALS | ||
|
||
def __new__(cls, *args, **kwargs): | ||
raise TypeError('Instances of {} are not allowed to be constructed!' | ||
.format(cls.__name__)) | ||
|
||
class ExceptionSinkError(Exception): pass | ||
|
||
# All reset_* methods are ~idempotent! | ||
@classmethod | ||
def reset_log_location(cls, new_log_location): | ||
"""Re-acquire file handles to error logs based in the new location. | ||
|
||
Class state: | ||
- Overwrites `cls._log_dir`, `cls._pid_specific_error_fileobj`, and | ||
`cls._shared_error_fileobj`. | ||
OS state: | ||
- May create a new directory. | ||
- Overwrites signal handlers for many fatal and non-fatal signals. | ||
|
||
:raises: :class:`ExceptionSink.ExceptionSinkError` if the directory does not exist or is not | ||
writable. | ||
""" | ||
# We could no-op here if the log locations are the same, but there's no reason not to have the | ||
# additional safety of re-acquiring file descriptors each time (and erroring out early if the | ||
# location is no longer writable). | ||
|
||
# Create the directory if possible, or raise if not writable. | ||
cls._check_or_create_new_destination(new_log_location) | ||
|
||
pid_specific_error_stream, shared_error_stream = cls._recapture_fatal_error_log_streams( | ||
new_log_location) | ||
|
||
# NB: mutate process-global state! | ||
if faulthandler.is_enabled(): | ||
logger.debug('re-enabling faulthandler') | ||
# Call Py_CLEAR() on the previous error stream: | ||
# https://github.com/vstinner/faulthandler/blob/master/faulthandler.c | ||
faulthandler.disable() | ||
# Send a stacktrace to this file if interrupted by a fatal error. | ||
faulthandler.enable(file=pid_specific_error_stream, all_threads=True) | ||
# Log a timestamped exception and exit gracefully on non-fatal signals. | ||
for signum in cls.all_gracefully_handled_signals: | ||
signal.signal(signum, cls.handle_signal_gracefully) | ||
signal.siginterrupt(signum, False) | ||
|
||
# NB: mutate the class variables! | ||
cls._log_dir = new_log_location | ||
cls._pid_specific_error_fileobj = pid_specific_error_stream | ||
cls._shared_error_fileobj = shared_error_stream | ||
|
||
@classmethod | ||
def set_destination(cls, dir_path): | ||
cls._destination = dir_path | ||
def reset_exiter(cls, exiter): | ||
""" | ||
Class state: | ||
- Overwrites `cls._exiter`. | ||
Python state: | ||
- Overwrites sys.excepthook. | ||
""" | ||
assert(isinstance(exiter, Exiter)) | ||
# NB: mutate the class variables! This is done before mutating the exception hook, because the | ||
# uncaught exception handler uses cls._exiter to exit. | ||
cls._exiter = exiter | ||
# NB: mutate process-global state! | ||
sys.excepthook = cls._log_unhandled_exception_and_exit | ||
|
||
@classmethod | ||
def get_destination(cls): | ||
return cls._destination | ||
def reset_interactive_output_stream(cls, interactive_output_stream): | ||
""" | ||
Class state: | ||
- Overwrites `cls._interactive_output_stream`. | ||
OS state: | ||
- Overwrites the SIGUSR2 handler. | ||
|
||
This is where the the error message on exit will be printed to as well. | ||
""" | ||
# NB: mutate process-global state! | ||
if faulthandler.unregister(signal.SIGUSR2): | ||
logger.debug('re-registering a SIGUSR2 handler') | ||
# This permits a non-fatal `kill -31 <pants pid>` for stacktrace retrieval. | ||
faulthandler.register(signal.SIGUSR2, interactive_output_stream, | ||
all_threads=True, chain=False) | ||
|
||
# NB: mutate the class variables! | ||
# We don't *necessarily* need to keep a reference to this, but we do here for clarity. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This field is used in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That sounds like a fantastic idea. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Broke out #6596! |
||
cls._interactive_output_stream = interactive_output_stream | ||
|
||
@classmethod | ||
def exceptions_log_path(cls, for_pid=None): | ||
intermediate_filename_component = '.{}'.format(for_pid) if for_pid else '' | ||
def exceptions_log_path(cls, for_pid=None, in_dir=None): | ||
"""Get the path to either the shared or pid-specific fatal errors log file.""" | ||
if for_pid is None: | ||
intermediate_filename_component = '' | ||
else: | ||
assert(isinstance(for_pid, IntegerForPid)) | ||
intermediate_filename_component = '.{}'.format(for_pid) | ||
in_dir = in_dir or cls._log_dir | ||
return os.path.join( | ||
cls.get_destination(), | ||
in_dir, | ||
'logs', | ||
'exceptions{}.log'.format(intermediate_filename_component)) | ||
|
||
@classmethod | ||
def log_exception(cls, msg): | ||
"""Try to log an error message to this process's error log and the shared error log. | ||
|
||
NB: Doesn't raise (logs an error instead). | ||
""" | ||
pid = os.getpid() | ||
fatal_error_log_entry = cls._format_exception_message(msg, pid) | ||
|
||
# We care more about this log than the shared log, so write to it first. | ||
try: | ||
cls._try_write_with_flush(cls._pid_specific_error_fileobj, fatal_error_log_entry) | ||
except Exception as e: | ||
logger.error( | ||
"Error logging the message '{}' to the pid-specific file handle for {} at pid {}:\n{}" | ||
.format(msg, cls._log_dir, pid, e)) | ||
|
||
# Write to the shared log. | ||
try: | ||
# TODO: we should probably guard this against concurrent modification by other pants | ||
# subprocesses somehow. | ||
cls._try_write_with_flush(cls._shared_error_fileobj, fatal_error_log_entry) | ||
except Exception as e: | ||
logger.error( | ||
"Error logging the message '{}' to the shared file handle for {} at pid {}:\n{}" | ||
.format(msg, cls._log_dir, pid, e)) | ||
|
||
@classmethod | ||
def _try_write_with_flush(cls, fileobj, payload): | ||
"""This method is here so that it can be patched to simulate write errors. | ||
|
||
This is because mock can't patch primitive objects like file objects. | ||
""" | ||
fileobj.write(payload) | ||
fileobj.flush() | ||
|
||
@classmethod | ||
def _check_or_create_new_destination(cls, destination): | ||
try: | ||
safe_mkdir(destination) | ||
except Exception as e: | ||
raise cls.ExceptionSinkError( | ||
"The provided exception sink path at '{}' is not writable or could not be created: {}." | ||
.format(destination, str(e)), | ||
e) | ||
|
||
@classmethod | ||
def _recapture_fatal_error_log_streams(cls, new_log_location): | ||
# NB: We do not close old file descriptors under the assumption their lifetimes are managed | ||
# elsewhere. | ||
# We recapture both log streams each time. | ||
pid = os.getpid() | ||
pid_specific_log_path = cls.exceptions_log_path(for_pid=pid, in_dir=new_log_location) | ||
shared_log_path = cls.exceptions_log_path(in_dir=new_log_location) | ||
assert(pid_specific_log_path != shared_log_path) | ||
try: | ||
# Truncate the pid-specific error log file. | ||
pid_specific_error_stream = safe_open(pid_specific_log_path, mode='w') | ||
# Append to the shared error file. | ||
shared_error_stream = safe_open(shared_log_path, mode='a') | ||
except Exception as e: | ||
raise cls.ExceptionSinkError( | ||
"Error opening fatal error log streams for log location '{}': {}" | ||
.format(new_log_location, str(e))) | ||
|
||
return (pid_specific_error_stream, shared_error_stream) | ||
|
||
@classmethod | ||
def _iso_timestamp_for_now(cls): | ||
return datetime.datetime.now().isoformat() | ||
|
@@ -59,20 +248,110 @@ def _format_exception_message(cls, msg, pid): | |
timestamp=cls._iso_timestamp_for_now(), | ||
args=sys.argv, | ||
pid=pid, | ||
message=msg, | ||
) | ||
message=msg) | ||
|
||
_traceback_omitted_default_text = '(backtrace omitted)' | ||
|
||
@classmethod | ||
def log_exception(cls, msg): | ||
def _format_traceback(cls, tb, should_print_backtrace): | ||
if should_print_backtrace: | ||
traceback_string = ''.join(traceback.format_tb(tb)) | ||
else: | ||
traceback_string = cls._traceback_omitted_default_text | ||
return traceback_string | ||
|
||
_UNHANDLED_EXCEPTION_LOG_FORMAT = """\ | ||
Exception caught: ({exception_type}) | ||
{backtrace} | ||
Exception message: {exception_message}{maybe_newline} | ||
""" | ||
|
||
@classmethod | ||
def _format_unhandled_exception_log(cls, exc, tb, add_newline, should_print_backtrace): | ||
exception_message = str(exc) if exc else '(no message)' | ||
maybe_newline = '\n' if add_newline else '' | ||
return cls._UNHANDLED_EXCEPTION_LOG_FORMAT.format( | ||
exception_type=type(exc), | ||
backtrace=cls._format_traceback(tb, should_print_backtrace=should_print_backtrace), | ||
exception_message=exception_message, | ||
maybe_newline=maybe_newline) | ||
|
||
@classmethod | ||
def _exit_with_failure(cls, terminal_msg): | ||
# Exit with failure, printing a message to the terminal (or whatever the interactive stream is). | ||
cls._exiter.exit(result=cls.UNHANDLED_EXCEPTION_EXIT_CODE, | ||
msg=terminal_msg, | ||
out=cls._interactive_output_stream) | ||
|
||
@classmethod | ||
def _log_unhandled_exception_and_exit(cls, exc_class=None, exc=None, tb=None, add_newline=False): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This method is called from outside of the class, and should maybe be public. It's also worth considering whether this should (mostly) stay in the Exiter, and just call back to the ExceptionSink for the purposes of writing to the exception logs...? I'm not very clear on that though, given the fact that they both call one another at various times. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I like that idea a lot. I think it specifically should be done, along with the comment on the Exiter holding onto its output stream, in a followup. I will continue addressing your comments and seeing what can be done here without lots more code changes. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Broke out #6596! |
||
"""A sys.excepthook implementation which logs the error and exits with failure.""" | ||
exc_class = exc_class or sys.exc_info()[0] | ||
exc = exc or sys.exc_info()[1] | ||
tb = tb or sys.exc_info()[2] | ||
|
||
extra_err_msg = None | ||
try: | ||
pid = os.getpid() | ||
fatal_error_log_entry = cls._format_exception_message(msg, pid) | ||
# We care more about this log than the shared log, so completely write to it first. This | ||
# avoids any errors with concurrent modification of the shared log affecting the per-pid log. | ||
safe_file_dump(cls.exceptions_log_path(for_pid=pid), fatal_error_log_entry, mode='w') | ||
# TODO: we should probably guard this against concurrent modification somehow. | ||
safe_file_dump(cls.exceptions_log_path(), fatal_error_log_entry, mode='a') | ||
# Always output the unhandled exception details into a log file, including the traceback. | ||
exception_log_entry = cls._format_unhandled_exception_log(exc, tb, add_newline, | ||
should_print_backtrace=True) | ||
cls.log_exception(exception_log_entry) | ||
except Exception as e: | ||
# TODO: If there is an error in writing to the exceptions log, we may want to consider trying | ||
# to write to another location (e.g. the cwd, if that is not already the destination). | ||
logger.error('Problem logging original exception: {}'.format(e)) | ||
extra_err_msg = 'Additional error logging unhandled exception {}: {}'.format(exc, e) | ||
logger.error(extra_err_msg) | ||
|
||
# Generate an unhandled exception report fit to be printed to the terminal (respecting the | ||
# Exiter's should_print_backtrace field). | ||
stderr_printed_error = cls._format_unhandled_exception_log( | ||
exc, tb, add_newline, | ||
should_print_backtrace=cls._exiter.should_print_backtrace) | ||
if extra_err_msg: | ||
stderr_printed_error = '{}\n{}'.format(stderr_printed_error, extra_err_msg) | ||
cls._exit_with_failure(stderr_printed_error) | ||
|
||
_CATCHABLE_SIGNAL_ERROR_LOG_FORMAT = """\ | ||
Signal {signum} was raised. Exiting with failure. | ||
{formatted_traceback} | ||
""" | ||
|
||
@classmethod | ||
def handle_signal_gracefully(cls, signum, frame): | ||
"""Signal handler for non-fatal signals which raises or logs an error and exits with failure. | ||
|
||
NB: This method is registered for exactly the signals in | ||
ExceptionSink.all_gracefully_handled_signals! | ||
|
||
:raises: :class:`KeyboardInterrupt` if `signum` is in ExceptionSink.KEYBOARD_INTERRUPT_SIGNALS. | ||
""" | ||
if signum in cls.KEYBOARD_INTERRUPT_SIGNALS: | ||
raise KeyboardInterrupt('User interrupted execution with control-c!') | ||
tb = frame.f_exc_traceback or sys.exc_info()[2] | ||
|
||
# Format an entry to be written to the exception log. | ||
formatted_traceback = cls._format_traceback(tb, should_print_backtrace=bool(tb)) | ||
signal_error_log_entry = cls._CATCHABLE_SIGNAL_ERROR_LOG_FORMAT.format( | ||
signum=signum, | ||
formatted_traceback=formatted_traceback) | ||
# TODO: determine the appropriate signal-safe behavior here (to avoid writing to our file | ||
# descriptors re-entrantly, which raises an IOError). | ||
# This method catches any exceptions raised within it. | ||
cls.log_exception(signal_error_log_entry) | ||
|
||
# Format a message to be printed to the terminal or other interactive stream, if applicable. | ||
formatted_traceback_for_terminal = cls._format_traceback( | ||
tb, should_print_backtrace=cls._exiter.should_print_backtrace and bool(tb)) | ||
terminal_log_entry = cls._CATCHABLE_SIGNAL_ERROR_LOG_FORMAT.format( | ||
signum=signum, | ||
formatted_traceback=formatted_traceback_for_terminal) | ||
cls._exit_with_failure(terminal_log_entry) | ||
|
||
|
||
|
||
# Setup global state such as signal handlers and sys.excepthook with probably-safe values at module | ||
# import time. | ||
# Sets fatal signal handlers with reasonable defaults to catch errors early in startup. | ||
ExceptionSink.reset_log_location(os.getcwd()) | ||
# Sets except hook. | ||
ExceptionSink.reset_exiter(Exiter(print_backtraces=True)) | ||
# Sets a SIGUSR2 handler. | ||
ExceptionSink.reset_interactive_output_stream(sys.stderr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should likely be written to both the current
trace_stream
and the current exception log(s).Doing the latter thing would require keeping an open file handle for the current exception log(s), which is a bit harder... but I think that getting the test I added to pass will require this, because killing the process with a signal will only result in this firing, rather than also logging an exception.
Additionally, (re)-opening the exception log for our current
pid
(on a best effort basis...) would likely require an explicit step, likeset_pid
: we'd call it as soon as we know we've forked.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I didn't consider this such a huge change when we talked was I think because the biggest change I see is just "having an actual signal handler for SIGUSR2 instead of using
faulthandler
", and the code I threw away was already doing that in order to handle SIGINT and SIGQUIT. I'm not sure now whether the handling of those signals should be done in this PR. We currently set handlers for those signals in RemotePantsRunner to send a signal to the pantsd-runner process on control-c.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the biggest change is figuring out where to write in a signal handler as described in your comment, but after that's figured out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you haven't already, take a look at the restrictions on signal handlers: https://docs.python.org/3/library/faulthandler.html ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's extremely helpful! As an aside, I have used the glibc docs on signal handlers before and found the bit on re-entrancy and signal handlers to be super illuminating to understand what's allowed in a signal handler and why and have found it applicable to other languages. I will peruse the python docs you linked because I was looking for that exactly and literally have a tab open to that page but just was only looking for method names and didn't read it. Thanks!