
DM-30935: Add butler ingest-files subcommand #545

Merged: 6 commits merged into master from tickets/DM-30935 on Jul 6, 2021

Conversation

@timj (Member) commented Jul 1, 2021

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@timj force-pushed the tickets/DM-30935 branch 2 times, most recently from 255d910 to c17576f, on July 1, 2021 19:32
@kfindeisen (Member) left a comment

Functionally, the code looks good, but I'm not happy with the organization into single very long functions. This makes the code hard to follow (for which I noticed you try to compensate with extra comments) and hard to modify in the future. I'd like to see both the utility code and the test case broken up into more manageable pieces; I'll be happy to approve once that's done.

Resolved review threads:
  • python/lsst/daf/butler/_butler.py
  • python/lsst/daf/butler/cli/cmd/commands.py (outdated)
  • python/lsst/daf/butler/cli/cmd/commands.py (outdated)
  • python/lsst/daf/butler/cli/cmd/commands.py
  • python/lsst/daf/butler/script/ingest_files.py (outdated)
Comment on lines 122 to 124
standardized = DataCoordinate.standardize(dataId, graph=datasetType.dimensions)

ref = DatasetRef(datasetType, standardized, conform=False)
Member:

Why have two separate steps instead of just calling DatasetRef(..., conform=True)?

Member Author:

Funny you should ask and maybe @TallJimbo can explain. The reason I did it this way is because of mypy which insists that DatasetRef can only take a DataCoordinate even though it can also take a dict if conform=True. Maybe I should have asked @TallJimbo if I can fix the annotation for the DatasetRef constructor itself.

Member:

Yeah, sounds like the annotation should probably be changed to the DataId type alias.

Member Author:

I changed the annotation and everything broke since the overwhelming usage for conform=False is for plain dict in test and validation code and to appease mypy I had made that illegal. There doesn't seem to be a way to turn a non-conforming dict into a non-conforming DataCoordinate so I think I'm going to have to tell mypy that the self.dataId itself can also be DataId and is not required to be a DataCoordinate.

Member Author:

...and to confirm, mypy is really upset if I declare that DatasetRef.dataId might not be a DataCoordinate. This all explains why we weren't telling mypy about the plain dict option. I'll go back to how it was, I think, and add a # type: ignore (which is what we do in _butler.py).
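The constraint described above can be sketched outside of daf_butler. This is a minimal, illustrative model, not the real DataCoordinate/DatasetRef API: Coordinate and Ref are stand-ins showing a constructor whose annotation is narrower than what it accepts at runtime, and the per-call-site # type: ignore workaround the thread settles on.

```python
from __future__ import annotations

from typing import Any, Mapping


class Coordinate:
    """Stand-in for a validated DataCoordinate-like type."""

    def __init__(self, values: Mapping[str, Any]):
        self.values = dict(values)


class Ref:
    """Stand-in for a DatasetRef-like class.

    The annotation says only ``Coordinate`` is accepted, but at
    runtime a plain dict also works when ``conform=True``.
    """

    def __init__(self, data_id: Coordinate, conform: bool = True):
        if conform and not isinstance(data_id, Coordinate):
            # Conform the plain dict into the validated type.
            data_id = Coordinate(data_id)
        self.data_id = data_id


# mypy flags this call because the annotation only allows Coordinate;
# silencing it at the call site is the workaround discussed above.
ref = Ref({"visit": 423}, conform=True)  # type: ignore[arg-type]
assert isinstance(ref.data_id, Coordinate)
```

Widening the annotation instead (to a DataId-style alias) would make every conform=False caller type-check against the broader type, which is the breakage described above.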

configFile=self.configFile)

def tearDown(self):
removeTestTempDir(self.root)
Member:

What about self.root2?

I suggest using addCleanup instead; it will be easier to keep in sync.
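The suggestion can be sketched as follows. This is an illustrative skeleton, not the PR's test case: each temporary root is registered for cleanup the moment it is created, so adding a second root (the self.root2 the reviewer points out) cannot fall out of sync with a tearDown override.

```python
import os
import shutil
import tempfile
import unittest


class IngestFilesTestCase(unittest.TestCase):
    """Sketch: register each temp directory with addCleanup on creation."""

    def setUp(self):
        self.root = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, self.root, ignore_errors=True)

        # A second temporary area is cleaned up automatically too;
        # no tearDown override is needed, so nothing can be forgotten.
        self.root2 = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, self.root2, ignore_errors=True)

    def test_roots_exist(self):
        self.assertTrue(os.path.isdir(self.root))
        self.assertTrue(os.path.isdir(self.root2))
```

Cleanups registered with addCleanup also run when setUp itself fails partway through, which tearDown does not guarantee.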

Comment on lines 69 to 71
(Table([files, [423, 424], ["DummyCamComp", "DummyCamComp"]],
       names=["Files", "visit", "instrument"]),
 ("--prefix", self.root2)),
Member:

I think this code would be easier to understand if you factored the data ID components into variables (e.g., visits, instruments). Otherwise it's hard to discern that 423 and 424 are visit numbers, and can be arbitrary.

Member Author:

423 and 424 aren't arbitrary -- they are the visit numbers pre-registered in the test repo.
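The reviewer's naming suggestion can still be applied even though the values are fixed. A sketch, with a plain dict standing in for the astropy Table used in the test, and file names invented for illustration:

```python
# Factor the data ID components into named variables so the table
# columns are self-describing. The visit numbers are not arbitrary:
# they match the visits pre-registered in the test repo.
files = ["test0.json", "test1.json"]          # illustrative file names
visits = [423, 424]                           # pre-registered visits
instruments = ["DummyCamComp"] * len(visits)

# Stand-in for: Table([files, visits, instruments],
#                     names=["Files", "visit", "instrument"])
table = {"Files": files, "visit": visits, "instrument": instruments}
```

A comment next to the variable definitions then documents the pre-registration constraint in exactly one place.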

Comment on lines 87 to 102
table_file = os.path.join(self.root2, f"table{i}.csv")
table.write(table_file)

run = f"u/user/test{i}"
result = runner.invoke(cli, ["ingest-files", *options,
                             self.root, "test_metric_comp", run, table_file])
self.assertEqual(result.exit_code, 0, clickResultMsg(result))

butler = Butler(self.root)
refs = list(butler.registry.queryDatasets("test_metric_comp", collections=run))
self.assertEqual(len(refs), 2)

for i, data in enumerate(datasets):
    butler_data = butler.get("test_metric_comp", visit=423+i, instrument="DummyCamComp",
                             collections=run)
    self.assertEqual(butler_data, data)
Member:

I suggest moving this code into a separate method -- it looks like it would depend only on table, options, and the data IDs? It would make the code clearer, and certainly easier to improve in the future, if the test assertions were insulated from the exact code used to create the test data.

(Table([[os.path.join(self.root2, f) for f in files], [423, 424]],
       names=["Files", "visit"]),
 ("--data-id", "instrument=DummyCamComp")),
)
Member:

Having one method for all tests makes it hard to isolate or diagnose problems. I suggest making each of these entries its own test method, moving the initialization of test*.json into setUp, and factoring the loop as suggested above.
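The restructuring being asked for looks roughly like this skeleton. Everything here is illustrative (the helper body and fixtures are elided); the point is one named test method per table entry, each delegating to a shared helper, so a failure immediately identifies which case broke:

```python
import unittest


class IngestFilesTestCase(unittest.TestCase):
    """Sketch: one test method per former tuple in the parameter list."""

    def setUp(self):
        # Shared fixtures (test repo, test*.json files) are created
        # once here instead of inside the single looping test.
        self.visits = [423, 424]

    def _ingest_and_check(self, table, options):
        # Would write the table, invoke `butler ingest-files` with
        # ``options``, and assert on the resulting datasets.
        self.assertEqual(len(self.visits), 2)

    def test_ingest_with_prefix(self):
        self._ingest_and_check(table=..., options=("--prefix", "root2"))

    def test_ingest_with_data_id(self):
        self._ingest_and_check(
            table=..., options=("--data-id", "instrument=DummyCamComp"))
```

Each method can then fail, be skipped, or be rerun independently, which a single loop over tuples cannot offer.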

@timj force-pushed the tickets/DM-30935 branch 2 times, most recently from 71fa33d to 813c571, on July 3, 2021 00:09
@timj (Member Author) commented Jul 3, 2021

@kfindeisen I've rearranged things as requested so can you please take a second look?

@kfindeisen (Member) left a comment

Thanks for making these changes!

Comment on lines +559 to +560
" The latter is usually used for 'raw'-type data that will be ingested in multiple."
" repositories.",
Member:

Given the warning not to use butler ingest-files for raws, is this comment still relevant?

Member Author:

The photodiode camera data probably should use DATAID_TYPE_RUN since the exposure it's associated with will have been defined when ingesting the raw data itself. That's what I mean by "raw-type" as opposed to raw. We may well decide that we should always use DATAID_TYPE_RUN for any external files being ingested but I'm not ready to make that the default without more experience.

Comment on lines +99 to +107
# Convert the k=v strings into a dataId dict.
universe = butler.registry.dimensions
common_data_id = parse_data_id_tuple(data_id, universe)

# Read the table assuming that Astropy can work out the format.
table = Table.read(table_file)

datasets = extract_datasets_from_table(table, common_data_id, datasetType, formatter, prefix)
Member:

Thanks, this looks much better, though the argument list for extract_datasets_from_table is a bit longer than I would have expected. 🤔
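One common remedy for a long argument list, offered here as a suggestion rather than what the PR does, is to bundle the parameters that travel together into a small dataclass. The names below are illustrative, mirroring the call quoted above:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class IngestContext:
    """Illustrative bundle for arguments shared by the ingest helpers."""

    common_data_id: Dict[str, Any]
    dataset_type: str
    formatter: Optional[str]
    prefix: Optional[str]


def extract_datasets_from_table(table, ctx: IngestContext):
    """Toy helper: pair each row with the common data ID.

    With the context object, adding a new shared parameter no longer
    changes every helper's signature.
    """
    return [(row, ctx.common_data_id) for row in table]


ctx = IngestContext(common_data_id={"instrument": "DummyCamComp"},
                    dataset_type="test_metric_comp",
                    formatter=None, prefix=None)
datasets = extract_datasets_from_table([1, 2], ctx)
```

Whether that is worth it for four or five parameters is a judgment call; the dataclass pays off once the same group is threaded through several functions.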

Returns
-------
datasets : `list` of `FileDataset`
The `FileDataset` object corresponding to the rows in the table.
Member:

Suggested change:
- The `FileDataset` object corresponding to the rows in the table.
+ The `FileDataset` objects corresponding to the rows in the table.

Comment on lines 65 to 67
# Associate the dataId with these datasets
self.files = files
self.datasets = datasets
Member:

Seems like you could have just defined self.files and self.datasets right away, at the cost of only a little more verbosity.

Takes a table of dataIds and files and ingests them.
@timj merged commit 6e62486 into master on Jul 6, 2021
@timj deleted the tickets/DM-30935 branch on July 6, 2021 22:51