DM-31384: Add extra information to log record #134

Merged: 3 commits, Aug 12, 2021

2 changes: 2 additions & 0 deletions doc/changes/DM-31884.feature.rst
@@ -0,0 +1,2 @@
* Add the output run to the log record.
* Add ``--log-label`` option to ``pipetask`` command to allow extra information to be injected into the log record.
11 changes: 9 additions & 2 deletions python/lsst/ctrl/mpexec/cli/pipetask.py
@@ -22,7 +22,13 @@
import click

from lsst.daf.butler.cli.butler import LoaderCLI
-from lsst.daf.butler.cli.opt import log_level_option, long_log_option, log_file_option, log_tty_option
+from lsst.daf.butler.cli.opt import (
+    log_level_option,
+    long_log_option,
+    log_file_option,
+    log_tty_option,
+    log_label_option,
+)


class PipetaskCLI(LoaderCLI):
@@ -35,7 +41,8 @@ class PipetaskCLI(LoaderCLI):
@long_log_option()
@log_file_option()
@log_tty_option()
-def cli(log_level, long_log, log_file, log_tty):
+@log_label_option()
+def cli(log_level, long_log, log_file, log_tty, log_label):
# log_level is handled by get_command and list_commands, and is called in
# one of those functions before this is called.
pass
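For context, the change above stacks another reusable Click option decorator onto the top-level command so that the new parameter is passed into `cli`. Below is a minimal, self-contained sketch of that pattern; the locally defined `log_label_option` is a stand-in to illustrate the idea and is not the real decorator from `lsst.daf.butler.cli.opt` (its exact signature, help text, and whether it accepts multiple values are assumptions here).

```python
import click


# Stand-in for lsst.daf.butler.cli.opt.log_label_option; the real decorator
# lives in daf_butler and may differ in behaviour and help text.
def log_label_option():
    return click.option(
        "--log-label",
        "log_label",
        multiple=True,
        help="Extra key=value pairs to attach to every log record.",
    )


@click.command()
@click.option("--log-level", default="INFO")
@log_label_option()
def cli(log_level, log_label):
    """Toy command showing how stacked option decorators feed parameters in."""
    click.echo(f"log_level={log_level} log_label={log_label}")


if __name__ == "__main__":
    cli()
```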
2 changes: 1 addition & 1 deletion python/lsst/ctrl/mpexec/singleQuantumExecutor.py
@@ -236,7 +236,7 @@ def captureLogging(self, taskDef, quantum, butler):

ctx = _LogCaptureFlag()
try:
with ButlerMDC.set_mdc({"LABEL": label}):
with ButlerMDC.set_mdc({"LABEL": label, "RUN": butler.run}):

Collaborator:
I hope this does not cause surprises for people who may be parsing MDC from our current logs 🙂

Maybe worth adding a news fragment; this feels like a user-visible new feature?

Collaborator:
Also, do we really want that info in every record? I guess this will be repeated for all records in the same log file?

Member Author:
Yes, I was about to write the news fragment. The RUN value only appears in the JSON record as a separate entry and is automatically handled by ButlerLogRecord. I don't think anyone is parsing our JSON log records at the moment, and if they are I don't see a problem.

The motivation for adding this to every record is that in Logstash, or wherever the records end up, it might make it easier for people to find their own log records in the sea of records arriving from all the workflows being executed. The output run is the one piece of information provided by the person submitting the workflow, so it is the one thing they know should be there and is easy to search for. The job_id is allocated by the workflow system itself and I would expect it to turn up via the --log-label option, but it's surely easier for scientists to know their RUN than to know the workflow submission ID. This RUN information does not turn up in long-log output.

Do you think we need to add a command-line option to pipetask to control whether RUN is included? Clearly bps could add --log-label run={outputRun},job_id={jobId} to its configuration, but I was assuming we'd always like to have RUN in the searchable logs, and that's something we already know.

Maybe @ktlim has an opinion on this.
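
As an aside, here is a minimal sketch of how a comma-separated --log-label value such as run={outputRun},job_id={jobId} could be split into key/value pairs before being merged into the MDC. The function name and parsing rules are hypothetical illustrations, not the actual pipetask or daf_butler code.

```python
# Hypothetical helper, not the actual pipetask/daf_butler implementation:
# split a "--log-label" value of the form "key=value,key2=value2" into a
# dict that could be merged into the MDC.
def parse_log_labels(value: str) -> dict[str, str]:
    labels: dict[str, str] = {}
    for item in value.split(","):
        if not item:
            continue
        key, sep, val = item.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {item!r}")
        labels[key.strip()] = val.strip()
    return labels


print(parse_log_labels("run=u/someone/outputRun,job_id=12345"))
# {'run': 'u/someone/outputRun', 'job_id': '12345'}
```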

Collaborator:
I'm OK with RUN and other info being there if it's useful for people. For log harvesting I can imagine we will want a bit more metadata associated with records, but that does not have to come from the records themselves; we'll probably have some other way to specify that extra per-log-file or per-dataset metadata.

Member Author:
Any given log is associated with a run automatically (in the butler, or in some artefact relating to that pipetask run), but I'm worried about the log records that get injected directly into Logstash. Maybe we end up doing that injection outside of pipetask, in the fluentd client or something, but at the moment it's a bit undefined. I can always remove RUN later; it shouldn't hurt anything to do that.

Contributor:
I suspect that most log analysis systems will flatten out any per-file metadata into per-record columns anyway. (Most of these things compress out very well, particularly if columnar.)

The output run may not be provided by the submitter: it could be auto-generated, no? Still, I agree that in the absence of a quantum identifier that users can easily derive, adding RUN automatically will be beneficial for filtering logs.

yield ctx
finally:
# Ensure that the logs are stored in butler.
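
For readers unfamiliar with the MDC mechanism discussed in this thread, here is a minimal standard-library sketch of the same idea: a context manager injects key/value pairs (such as LABEL and RUN) into every log record emitted within its scope. The names and example values below are illustrative assumptions; ButlerMDC and ButlerLogRecord in lsst.daf.butler implement this differently.

```python
import logging
from contextlib import contextmanager

# Minimal analogue of the MDC pattern used by ButlerMDC.set_mdc, built only on
# the standard library; it is NOT the lsst.daf.butler implementation.
_mdc: dict[str, str] = {}


class MDCFilter(logging.Filter):
    """Attach a snapshot of the current MDC dict to every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.MDC = dict(_mdc)
        return True


@contextmanager
def set_mdc(mdc: dict[str, str]):
    """Temporarily add key/value pairs to the MDC for records emitted in scope."""
    _mdc.update(mdc)
    try:
        yield
    finally:
        # Simplified cleanup: pre-existing values for these keys are not restored.
        for key in mdc:
            _mdc.pop(key, None)


handler = logging.StreamHandler()
handler.addFilter(MDCFilter())  # handler filters also see propagated records
handler.setFormatter(logging.Formatter("%(levelname)s %(MDC)s %(message)s"))
root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.INFO)

with set_mdc({"LABEL": "isr:detector=42", "RUN": "u/someone/outputRun"}):
    logging.getLogger("example").info("quantum executed")
# INFO {'LABEL': 'isr:detector=42', 'RUN': 'u/someone/outputRun'} quantum executed
```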