New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-30935: Add butler ingest-files subcommand #545
Conversation
255d910
to
c17576f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Functionally, the code looks good, but I'm not happy with the organization into single very long functions. This makes the code hard to follow (for which I noticed you try to compensate with extra comments) and hard to modify in the future. I'd like to see both the utility code and the test case broken up into more manageable pieces; I'll be happy to approve once that's done.
standardized = DataCoordinate.standardize(dataId, graph=datasetType.dimensions) | ||
|
||
ref = DatasetRef(datasetType, standardized, conform=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why have two separate steps instead of just calling DatasetRef(..., conform=True)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny you should ask and maybe @TallJimbo can explain. The reason I did it this way is because of mypy which insists that DatasetRef can only take a DataCoordinate even though it can also take a dict if conform=True. Maybe I should have asked @TallJimbo if I can fix the annotation for the DatasetRef constructor itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, sounds like the annotation should probably be changed to the DataId
type alias.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed the annotation and everything broke since the overwhelming usage for conform=False
is for plain dict
in test and validation code and to appease mypy I had made that illegal. There doesn't seem to be a way to turn a non-conforming dict
into a non-conforming DataCoordinate
so I think I'm going to have to tell mypy
that the self.dataId
itself can also be DataId
and is not required to be a DataCoordinate
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
...and to confirm, mypy is really upset if I declare that DatasetRef.dataId
might not be a DataCoordinate
. This all explains why we weren't telling mypy about the plain dict option. I'll go back to how it was I think and add an # type: ignore
(which is what we do in _butler.py
.
tests/test_cliCmdIngestFiles.py
Outdated
configFile=self.configFile) | ||
|
||
def tearDown(self): | ||
removeTestTempDir(self.root) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about self.root2
?
I suggest using addCleanup
instead, it will be easier to keep in sync.
tests/test_cliCmdIngestFiles.py
Outdated
(Table([files, [423, 424], ["DummyCamComp", "DummyCamComp"]], | ||
names=["Files", "visit", "instrument"]), | ||
("--prefix", self.root2)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code would be easier to understand if you factored the data ID components into variables (e.g., visits
, instruments
). Otherwise it's hard to discern that 423 and 424 are visit numbers, and can be arbitrary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
423 and 424 aren't arbitrary -- they are the visit numbers pre-registered in the test repo.
tests/test_cliCmdIngestFiles.py
Outdated
table_file = os.path.join(self.root2, f"table{i}.csv") | ||
table.write(table_file) | ||
|
||
run = f"u/user/test{i}" | ||
result = runner.invoke(cli, ["ingest-files", *options, | ||
self.root, "test_metric_comp", run, table_file]) | ||
self.assertEqual(result.exit_code, 0, clickResultMsg(result)) | ||
|
||
butler = Butler(self.root) | ||
refs = list(butler.registry.queryDatasets("test_metric_comp", collections=run)) | ||
self.assertEqual(len(refs), 2) | ||
|
||
for i, data in enumerate(datasets): | ||
butler_data = butler.get("test_metric_comp", visit=423+i, instrument="DummyCamComp", | ||
collections=run) | ||
self.assertEqual(butler_data, data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest moving this code into a separate method -- it looks like it would depend only on table
, options
, and the data IDs? It would make the code clearer, and certainly easier to improve in the future, if the test assertions were insulated from the exact code used to create the test data.
tests/test_cliCmdIngestFiles.py
Outdated
(Table([[os.path.join(self.root2, f) for f in files], [423, 424]], | ||
names=["Files", "visit"]), | ||
("--data-id", "instrument=DummyCamComp")), | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having one method for all tests makes it hard to isolate or diagnose problems. I suggest making each of these entries its own test method, moving the initialization of test*.json
into setUp
, and factoring the loop as suggested above.
71fa33d
to
813c571
Compare
@kfindeisen I've rearranged things as requested so can you please take a second look? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making these changes!
" The latter is usually used for 'raw'-type data that will be ingested in multiple." | ||
" repositories.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the warning not to use butler ingest-files
for raws, is this comment still relevant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The photodiode camera data probably should use DATAID_TYPE_RUN since the exposure
it's associated with will have been defined when ingesting the raw data itself. That's what I mean by "raw-type" as opposed to raw. We may well decide that we should always use DATAID_TYPE_RUN for any external files being ingested but I'm not ready to make that the default without more experience.
# Convert the k=v strings into a dataId dict. | ||
universe = butler.registry.dimensions | ||
common_data_id = parse_data_id_tuple(data_id, universe) | ||
|
||
# Read the table assuming that Astropy can work out the format. | ||
table = Table.read(table_file) | ||
|
||
datasets = extract_datasets_from_table(table, common_data_id, datasetType, formatter, prefix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this looks much better, though the argument list for extract_datasets_from_table
is a bit longer than I would have expected. 🤔
Returns | ||
------- | ||
datasets : `list` of `FileDataset` | ||
The `FileDataset` object corresponding to the rows in the table. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The `FileDataset` object corresponding to the rows in the table. | |
The `FileDataset` objects corresponding to the rows in the table. |
tests/test_cliCmdIngestFiles.py
Outdated
# Associate the dataId with these datasets | ||
self.files = files | ||
self.datasets = datasets |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like you could have just defined self.files
and self.datasets
right away, at the cost of only a little more verbosity.
Takes a table of dataIds and files and ingests them.
Checklist
doc/changes