Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-31384: Add extra information to log record #134

Merged
merged 3 commits into from Aug 12, 2021
Merged

Conversation

timj
Copy link
Member

@timj timj commented Aug 12, 2021

Requires lsst/daf_butler#557

Adds the output RUN to every record and also now supports the --log-label command.

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

Copy link
Collaborator

@andy-slac andy-slac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, one minor comment.

@@ -236,7 +236,7 @@ def captureLogging(self, taskDef, quantum, butler):

ctx = _LogCaptureFlag()
try:
with ButlerMDC.set_mdc({"LABEL": label}):
with ButlerMDC.set_mdc({"LABEL": label, "RUN": butler.run}):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope this does not cause surprises for people who may be parsing MDC from our current logs 🙂

Maybe worth adding news fragment, this feels like user-visible new feature?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we really want that info in every record? I guess this will be repeating for all records in the same log file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I was about to write the news fragment. It will only appear in the JSON record as a separate entry and is automatically handled by ButlerLogRecord. I don't think anyone is parsing our JSON log records at the moment and if they are I don't see a problem. The motivation for adding this to every record is that in LogStash or wherever it might make it easier for people to find their own log records in the sea of log records arriving from all the workflows being executed. The output run is the one piece of information provided by the person submitting the workflow and so is the one thing that they know should be there and is easy to find. The job_id is something allocated by the workflow system itself and I would expect to turn up via the --log-label option but it's surely easier for scientists to know their RUN than to know the workflow submission ID. This RUN information is not turning up in long-log output. Do you think we need to add a command line option to pipetask to control whether RUN is included? Clearly bps could add --log-label run={outputRun},job_id={jobId} to its configuration but I was assuming we'd always like to have RUN in the searchable logs and that's something we already knew.

Maybe @ktlim has an opinion on this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with RUN and other info being there if it's useful for people. For log harvesting I can imagine we will want a bit more metadata associated with records, but that does not have to come from the records themselves, we'll probably have some other way to specify that extra per-log file or per-dataset metadata.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any given log is associated with a run automatically (in the butler or in some artefact relating to that pipetask run) but I'm worrying about the log records that get injected directly into logstash. Maybe we end up with doing that injection outside of pipetask in the fluentd client or something but at the moment it's a bit undefined. I can always remove RUN later -- shouldn't hurt anything to do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suspect that most log analysis systems will flatten out any per-file metadata into per-record columns anyway. (Most of these things compress out very well, particularly if columnar.)

The output run may not be provided by the submitter: it could be auto-generated, no? Still, I agree that in the absence of an easily-derived-by-user quantum identifier, adding RUN automatically will be beneficial for filtering logs.

@timj timj merged commit ea1ed07 into master Aug 12, 2021
@timj timj deleted the tickets/DM-31384 branch August 12, 2021 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants