New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-31384: Add extra information to log record #134
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, one minor comment.
@@ -236,7 +236,7 @@ def captureLogging(self, taskDef, quantum, butler): | |||
|
|||
ctx = _LogCaptureFlag() | |||
try: | |||
with ButlerMDC.set_mdc({"LABEL": label}): | |||
with ButlerMDC.set_mdc({"LABEL": label, "RUN": butler.run}): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope this does not cause surprises for people who may be parsing MDC from our current logs 🙂
Maybe worth adding news fragment, this feels like user-visible new feature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do we really want that info in every record? I guess this will be repeating for all records in the same log file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I was about to write the news fragment. It will only appear in the JSON record as a separate entry and is automatically handled by ButlerLogRecord. I don't think anyone is parsing our JSON log records at the moment and if they are I don't see a problem. The motivation for adding this to every record is that in LogStash or wherever it might make it easier for people to find their own log records in the sea of log records arriving from all the workflows being executed. The output run is the one piece of information provided by the person submitting the workflow and so is the one thing that they know should be there and is easy to find. The job_id is something allocated by the workflow system itself and I would expect to turn up via the --log-label
option but it's surely easier for scientists to know their RUN than to know the workflow submission ID. This RUN information is not turning up in long-log output. Do you think we need to add a command line option to pipetask to control whether RUN is included? Clearly bps
could add --log-label run={outputRun},job_id={jobId}
to its configuration but I was assuming we'd always like to have RUN in the searchable logs and that's something we already knew.
Maybe @ktlim has an opinion on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with RUN and other info being there if it's useful for people. For log harvesting I can imagine we will want a bit more metadata associated with records, but that does not have to come from the records themselves, we'll probably have some other way to specify that extra per-log file or per-dataset metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any given log is associated with a run automatically (in the butler or in some artefact relating to that pipetask run) but I'm worrying about the log records that get injected directly into logstash. Maybe we end up with doing that injection outside of pipetask in the fluentd client or something but at the moment it's a bit undefined. I can always remove RUN later -- shouldn't hurt anything to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suspect that most log analysis systems will flatten out any per-file metadata into per-record columns anyway. (Most of these things compress out very well, particularly if columnar.)
The output run may not be provided by the submitter: it could be auto-generated, no? Still, I agree that in the absence of an easily-derived-by-user quantum identifier, adding RUN automatically will be beneficial for filtering logs.
Requires lsst/daf_butler#557
Adds the output RUN to every record and also now supports the
--log-label
command.Checklist
doc/changes