
DM-29071: report success/failure via callbacks in RawIngestTask #360

Merged
merged 3 commits into master from tickets/DM-29071 on Mar 6, 2021

Conversation

TallJimbo (Member)

No description provided.

@@ -554,6 +587,7 @@ def run(self, files, *, pool: Optional[Pool] = None, processes: int = 1, run: Op
try:
self.butler.registry.syncDimensionData("exposure", exposure.record)
except Exception as e:
self._on_ingest_failure(exposure, e)

Will exposure here allow me to get the path to the file that failed to ingest? My use case is that on ingest failure, I move the file to a special directory so it can be looked at later.

Member

As you can see from the warning issued below, this is a per-exposure failure and not a per-file failure. I report the obsid and instrument name so that the set of 189 files can be identified, but it's no longer a per-file error.


I took a look at the whole file. I can use exposure.files to get at that information though, right?

Member

Ah yes, it's a RawExposureData. Serves me right for not reading higher up. Per-file metadata failures are explicitly reported as file failures higher up.

Member

I'm going to have to change all this to be ButlerURI based imminently.

Member Author

If you've already got a branch with ButlerURI conversions, I'm happy to rebase on that and make the adjustments. If not, I assume you'd just like me to turn this around quickly?

And yes, you can get the files from the object the callback is given here, but it'll be all of the files you gave it for each exposure at once, because we commit them all in one transaction.
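As a sketch of the use case described earlier in this thread (moving the files of a failed exposure to a special directory for later inspection), assuming the failure callback receives an object whose `files` attribute lists the exposure's file paths. `RawExposureData` below is a minimal stand-in for the real class, and `make_on_ingest_failure` is a hypothetical helper, not part of the actual API:

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class RawExposureData:
    """Minimal stand-in for the object the callback receives; only .files is used here."""
    files: List[str]


def make_on_ingest_failure(quarantine: Path) -> Callable[[RawExposureData, Exception], None]:
    """Build a callback that moves every file of a failed exposure into a quarantine directory."""
    def on_ingest_failure(exposure: RawExposureData, exc: Exception) -> None:
        quarantine.mkdir(parents=True, exist_ok=True)
        for name in exposure.files:
            src = Path(name)
            if src.exists():
                # Move the file aside so it can be looked at later.
                shutil.move(str(src), str(quarantine / src.name))
    return on_ingest_failure
```

Note that, per the discussion above, this fires per exposure, so the callback will move all files of the failed exposure at once, not just one.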

Member

I haven't got that branch. I have got #357, which is quite big, but it's possible that it is orthogonal to your changes. I won't be able to merge #357 for a while because the review is still pending, so you can go ahead. I'll take a look in the next half hour, once I get immutable ButlerURI finished up.

@timj (Member) left a comment

Looks good.

@@ -249,6 +280,7 @@ def extractMetadata(self, filename: str) -> RawFileData:
datasets = []
FormatterClass = Formatter
instrument = None
self._on_metadata_failure(filename, e)
Member

We should probably mention in the description above that this triggers before the failFast logic does. Maybe this is telling us that we should remove the failFast logic completely some time in the future. Something for @mxk62 to ponder.


I do like the failFast option, as it allows me to get really nice and informative messages practically effortlessly. Here's an example from the recent Gen3 ingestion of NTS Comcam data:

status  | error                                                                            | count
--------+----------------------------------------------------------------------------------+------
SUCCESS |                                                                                  |  8769
FAILURE | KeyError: "Could not find ['RAFTBAY'] in header"                                 |   135
FAILURE | ValueError: None of the registered translation classes ['DECam', 'SuprimeCam', ' |    27

(Note that I "cut" the error message column to fit the limited space here; the table keeps error messages in their entirety.)

However, if the new error-handling mechanism allows me to access these pieces of information without much hassle, I won't have any fundamental reasons to be against removing the failFast option.
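A sketch of how a status/error/count summary like the one above could be collected through callbacks instead of failFast. `IngestTally` and its method names are illustrative, assuming the callbacks receive `(filename, exception)` on metadata failure and a list of ingested datasets on success, matching the call sites shown in the diffs:

```python
from collections import Counter
from typing import List


class IngestTally:
    """Illustrative accumulator of per-file/per-exposure ingest outcomes."""

    def __init__(self) -> None:
        self.errors: Counter = Counter()
        self.successes = 0

    def on_metadata_failure(self, filename: str, exc: Exception) -> None:
        # Key the tally on exception type + message, like the error column above.
        self.errors[f"{type(exc).__name__}: {exc}"] += 1

    def on_success(self, datasets: List) -> None:
        self.successes += len(datasets)

    def report(self) -> str:
        lines = [f"SUCCESS | {self.successes}"]
        lines += [f"FAILURE | {err} | {n}" for err, n in self.errors.most_common()]
        return "\n".join(lines)
```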

Member Author

I think the big question is whether you ever want to ingest more than one file at a time; if so, I think the new approach would let you do that (which might be more efficient than single-file ingests) while getting exactly the same per-file error information back. It will also let you get that information with single-file ingests, but failFast does that already.


The automated ingesters were written to ingest a single file at a time merely because that was the only way to get a useful error message back from the code without parsing logs after the fact. The ingester could easily be changed to call ingest with up to some max of X files per RawIngestTask call, if those calls return enough information to determine success, or why it failed, for each file. Note that at this time there is no science knowledge built into the automatic ingesters, so they would not know to group by exposure or to wait until they have all of the files for an exposure. Current grouping would just be a limit on files in some arbitrary order.
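The batching described above could be sketched generically. `chunked` is a hypothetical helper, and the usage assumes each `task.run(batch)` call reports per-file/per-exposure outcomes via the new callbacks:

```python
from typing import Iterator, List, Sequence


def chunked(files: Sequence[str], max_per_call: int) -> Iterator[List[str]]:
    """Yield successive batches of at most max_per_call files, in the order given.

    No science knowledge here: files are not grouped by exposure, just sliced.
    """
    for i in range(0, len(files), max_per_call):
        yield list(files[i:i + max_per_call])


# Hypothetical usage with a RawIngestTask-like object:
#     for batch in chunked(all_files, 50):
#         task.run(batch)   # callbacks report success/failure per batch
```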

@@ -570,8 +606,12 @@ def run(self, files, *, pool: Optional[Pool] = None, processes: int = 1, run: Op
runs.add(this_run)
try:
with self.butler.transaction():
refs.extend(self.ingestExposureDatasets(exposure, run=this_run))
datasets_for_exposure = self.ingestExposureDatasets(exposure, run=this_run)
self._on_success(datasets_for_exposure)
Member

I think we need a try block around this. Otherwise if it fails it will immediately trigger an ingest failure even though it ingested just fine.

Member Author

Should I just put it in an else clause, so the callback could raise, but we wouldn't incorrectly classify it as an ingest failure? (And maybe document that if it does raise, it'll bring down the whole ingest run.)

Member

Putting it in the else seems like the right thing to do (and documenting that it will stop ingest).
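The control flow agreed on here can be sketched as follows. `ingest_one` and its parameters are illustrative, not the actual RawIngestTask code: the failure callback fires only for errors raised by the ingest itself, while an exception from the success callback (running in the `else` clause) propagates and stops the run rather than being misclassified as an ingest failure:

```python
def ingest_one(exposure, ingest, on_success, on_ingest_failure):
    """Run one ingest attempt, routing outcomes to the appropriate callback."""
    try:
        datasets = ingest(exposure)      # may raise: a genuine ingest failure
    except Exception as e:
        on_ingest_failure(exposure, e)   # classified correctly as an ingest failure
        return None
    else:
        on_success(datasets)             # if this raises, it propagates and stops ingest
        return datasets
```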

@TallJimbo TallJimbo merged commit af5bd1b into master Mar 6, 2021
@TallJimbo TallJimbo deleted the tickets/DM-29071 branch March 6, 2021 04:17