
DM-29071: report success/failure via callbacks in RawIngestTask #360

Merged
merged 3 commits into master from tickets/DM-29071 on Mar 6, 2021

Conversation

TallJimbo (Member)

No description provided.

@@ -554,6 +587,7 @@ def run(self, files, *, pool: Optional[Pool] = None, processes: int = 1, run: Op
try:
self.butler.registry.syncDimensionData("exposure", exposure.record)
except Exception as e:
self._on_ingest_failure(exposure, e)

Will exposure here allow me to get the path to the file that failed to ingest? My use case is that on ingest failure, I move the file to a special directory so it can be looked at later.

Member

As you can see from the warning issued below, this is a per-exposure failure and not a per-file failure. I report the obsid and instrument name so that the set of 189 files can be identified, but it's no longer a per-file error.


I took a look at the whole file. I can use exposure.files to get at that information though, right?

Member

Ah yes, it's a RawExposureData. Serves me right for not reading higher up. Per-file metadata failures are explicitly reported as file failures higher up.

Member

I'm going to have to change all this to be ButlerURI based imminently.

Member Author

If you've already got a branch with ButlerURI conversions, I'm happy to rebase on that and make the adjustments. If not, I assume you'd just like me to turn this around quickly?

And yes, you can get the files from the object the callback is given here, but it'll be all of the files you gave it for each exposure at once, because we commit them all in one transaction.
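As a sketch of the use case described earlier in this thread (moving the files of a failed exposure to a special directory for later inspection), assuming the failure callback receives an object whose `files` attribute lists the exposure's file paths. `RawExposureData` below is a minimal stand-in for the real class, and `make_on_ingest_failure` is a hypothetical helper, not part of the actual API:

```python
import shutil
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class RawExposureData:
    """Minimal stand-in for the object the callback receives; only .files is used here."""
    files: List[str]


def make_on_ingest_failure(quarantine: Path) -> Callable[[RawExposureData, Exception], None]:
    """Build a callback that moves every file of a failed exposure into a quarantine directory."""
    def on_ingest_failure(exposure: RawExposureData, exc: Exception) -> None:
        quarantine.mkdir(parents=True, exist_ok=True)
        for name in exposure.files:
            src = Path(name)
            if src.exists():
                # Move the file aside so it can be looked at later.
                shutil.move(str(src), str(quarantine / src.name))
    return on_ingest_failure
```

Note that, per the discussion above, this fires per exposure, so the callback will move all files of the failed exposure at once, not just one.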

Member

I haven't got that branch. I have got #357, which is quite big, but it's possible that it is orthogonal to your changes. I won't be able to merge #357 for a while because the review is still pending, so you can go ahead. I'll take a look in the next half hour, once I get immutable ButlerURI finished up.

@timj (Member) left a comment

Looks good.

@@ -249,6 +280,7 @@ def extractMetadata(self, filename: str) -> RawFileData:
datasets = []
FormatterClass = Formatter
instrument = None
self._on_metadata_failure(filename, e)
Member

We should probably mention in the description above that this triggers before the failFast logic does. Maybe this is telling us that we should remove the failFast logic completely some time in the future. Something for @mxk62 to ponder.


I do like the failFast option, as it allows me to get really nice and informative messages practically effortlessly. Here's an example from the recent Gen3 ingestion of NTS Comcam data:

status  | error                                                                            | count
--------+----------------------------------------------------------------------------------+------
SUCCESS |                                                                                  |  8769
FAILURE | KeyError: "Could not find ['RAFTBAY'] in header"                                 |   135
FAILURE | ValueError: None of the registered translation classes ['DECam', 'SuprimeCam', ' |    27

(Note that I "cut" the error message column to fit the limited space here; the table keeps error messages in their entirety.)

However, if the new error-handling mechanism allows me to access these pieces of information without much hassle, I won't have any fundamental reasons to be against removing the failFast option.
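A sketch of how a status/error/count summary like the one above could be collected through callbacks instead of failFast. `IngestTally` and its method names are illustrative, assuming the callbacks receive `(filename, exception)` on metadata failure and a list of ingested datasets on success, matching the call sites shown in the diffs:

```python
from collections import Counter
from typing import List


class IngestTally:
    """Illustrative accumulator of per-file/per-exposure ingest outcomes."""

    def __init__(self) -> None:
        self.errors: Counter = Counter()
        self.successes = 0

    def on_metadata_failure(self, filename: str, exc: Exception) -> None:
        # Key the tally on exception type + message, like the error column above.
        self.errors[f"{type(exc).__name__}: {exc}"] += 1

    def on_success(self, datasets: List) -> None:
        self.successes += len(datasets)

    def report(self) -> str:
        lines = [f"SUCCESS | {self.successes}"]
        lines += [f"FAILURE | {err} | {n}" for err, n in self.errors.most_common()]
        return "\n".join(lines)
```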

Member Author

I think the big question is whether you ever want to ingest more than one file at a time; if so, I think the new approach would let you do that (which might be more efficient than single-file ingests) while getting exactly the same per-file error information back. It will also let you get that information with single-file ingests, but failFast does that already.


The automated ingesters were written to ingest a single file at a time merely because that was the only way to get a useful error message back from the code without parsing logs after the fact. The ingester could easily be changed to call ingest with up to some max of X files per RawIngestTask call, if those calls return enough information to determine success, or why it failed, for each file. Note that at this time there is no science knowledge built into the automatic ingesters, so they would not know to group by exposure or to wait until they have all of the files for an exposure. Current grouping would just be a limit on files in some arbitrary order.
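The batching described above could be sketched generically. `chunked` is a hypothetical helper, and the usage assumes each `task.run(batch)` call reports per-file/per-exposure outcomes via the new callbacks:

```python
from typing import Iterator, List, Sequence


def chunked(files: Sequence[str], max_per_call: int) -> Iterator[List[str]]:
    """Yield successive batches of at most max_per_call files, in the order given.

    No science knowledge here: files are not grouped by exposure, just sliced.
    """
    for i in range(0, len(files), max_per_call):
        yield list(files[i:i + max_per_call])


# Hypothetical usage with a RawIngestTask-like object:
#     for batch in chunked(all_files, 50):
#         task.run(batch)   # callbacks report success/failure per batch
```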

@@ -570,8 +606,12 @@ def run(self, files, *, pool: Optional[Pool] = None, processes: int = 1, run: Op
runs.add(this_run)
try:
with self.butler.transaction():
refs.extend(self.ingestExposureDatasets(exposure, run=this_run))
datasets_for_exposure = self.ingestExposureDatasets(exposure, run=this_run)
self._on_success(datasets_for_exposure)
Member

I think we need a try block around this. Otherwise if it fails it will immediately trigger an ingest failure even though it ingested just fine.

Member Author

Should I just put it in an else clause, so the callback could raise, but we wouldn't incorrectly classify it as an ingest failure? (And maybe document that if it does raise, it'll bring down the whole ingest run.)

Member

Putting it in the else seems like the right thing to do (and documenting that it will stop ingest).
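The control flow agreed on here can be sketched as follows. `ingest_one` and its parameters are illustrative, not the actual RawIngestTask code: the failure callback fires only for errors raised by the ingest itself, while an exception from the success callback (running in the `else` clause) propagates and stops the run rather than being misclassified as an ingest failure:

```python
def ingest_one(exposure, ingest, on_success, on_ingest_failure):
    """Run one ingest attempt, routing outcomes to the appropriate callback."""
    try:
        datasets = ingest(exposure)      # may raise: a genuine ingest failure
    except Exception as e:
        on_ingest_failure(exposure, e)   # classified correctly as an ingest failure
        return None
    else:
        on_success(datasets)             # if this raises, it propagates and stops ingest
        return datasets
```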

@TallJimbo TallJimbo merged commit af5bd1b into master Mar 6, 2021
@TallJimbo TallJimbo deleted the tickets/DM-29071 branch March 6, 2021 04:17