
DM-30935: Add butler ingest-files subcommand #545

Merged: 6 commits merged into master from tickets/DM-30935 on Jul 6, 2021

Conversation

@timj (Member) commented Jul 1, 2021

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

@timj force-pushed the tickets/DM-30935 branch 2 times, most recently from 255d910 to c17576f, on July 1, 2021 19:32
@kfindeisen (Member) left a comment

Functionally, the code looks good, but I'm not happy with the organization into single very long functions. This makes the code hard to follow (for which I noticed you try to compensate with extra comments) and hard to modify in the future. I'd like to see both the utility code and the test case broken up into more manageable pieces; I'll be happy to approve once that's done.

Resolved review threads:
  • python/lsst/daf/butler/_butler.py
  • python/lsst/daf/butler/cli/cmd/commands.py (outdated)
  • python/lsst/daf/butler/cli/cmd/commands.py (outdated)
  • python/lsst/daf/butler/cli/cmd/commands.py
  • python/lsst/daf/butler/script/ingest_files.py (outdated)
Comment on lines 122 to 124
standardized = DataCoordinate.standardize(dataId, graph=datasetType.dimensions)

ref = DatasetRef(datasetType, standardized, conform=False)
Member:

Why have two separate steps instead of just calling DatasetRef(..., conform=True)?

Member Author:

Funny you should ask and maybe @TallJimbo can explain. The reason I did it this way is because of mypy which insists that DatasetRef can only take a DataCoordinate even though it can also take a dict if conform=True. Maybe I should have asked @TallJimbo if I can fix the annotation for the DatasetRef constructor itself.

Member:

Yeah, sounds like the annotation should probably be changed to the DataId type alias.

Member Author:

I changed the annotation and everything broke since the overwhelming usage for conform=False is for plain dict in test and validation code and to appease mypy I had made that illegal. There doesn't seem to be a way to turn a non-conforming dict into a non-conforming DataCoordinate so I think I'm going to have to tell mypy that the self.dataId itself can also be DataId and is not required to be a DataCoordinate.

Member Author:

...and to confirm, mypy is really upset if I declare that DatasetRef.dataId might not be a DataCoordinate. This all explains why we weren't telling mypy about the plain dict option. I'll go back to how it was, I think, and add a # type: ignore (which is what we do in _butler.py).
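The constraint described above can be sketched outside of daf_butler. This is a minimal, illustrative model, not the real DataCoordinate/DatasetRef API: Coordinate and Ref are stand-ins showing a constructor whose annotation is narrower than what it accepts at runtime, and the per-call-site # type: ignore workaround the thread settles on.

```python
from __future__ import annotations

from typing import Any, Mapping


class Coordinate:
    """Stand-in for a validated DataCoordinate-like type."""

    def __init__(self, values: Mapping[str, Any]):
        self.values = dict(values)


class Ref:
    """Stand-in for a DatasetRef-like class.

    The annotation says only ``Coordinate`` is accepted, but at
    runtime a plain dict also works when ``conform=True``.
    """

    def __init__(self, data_id: Coordinate, conform: bool = True):
        if conform and not isinstance(data_id, Coordinate):
            # Conform the plain dict into the validated type.
            data_id = Coordinate(data_id)
        self.data_id = data_id


# mypy flags this call because the annotation only allows Coordinate;
# silencing it at the call site is the workaround discussed above.
ref = Ref({"visit": 423}, conform=True)  # type: ignore[arg-type]
assert isinstance(ref.data_id, Coordinate)
```

Widening the annotation instead (to a DataId-style alias) would make every conform=False caller type-check against the broader type, which is the breakage described above.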

configFile=self.configFile)

def tearDown(self):
removeTestTempDir(self.root)
Member:

What about self.root2?

I suggest using addCleanup instead; it will be easier to keep in sync.
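The suggestion can be sketched as follows. This is an illustrative skeleton, not the PR's test case: each temporary root is registered for cleanup the moment it is created, so adding a second root (the self.root2 the reviewer points out) cannot fall out of sync with a tearDown override.

```python
import os
import shutil
import tempfile
import unittest


class IngestFilesTestCase(unittest.TestCase):
    """Sketch: register each temp directory with addCleanup on creation."""

    def setUp(self):
        self.root = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, self.root, ignore_errors=True)

        # A second temporary area is cleaned up automatically too;
        # no tearDown override is needed, so nothing can be forgotten.
        self.root2 = tempfile.mkdtemp()
        self.addCleanup(shutil.rmtree, self.root2, ignore_errors=True)

    def test_roots_exist(self):
        self.assertTrue(os.path.isdir(self.root))
        self.assertTrue(os.path.isdir(self.root2))
```

Cleanups registered with addCleanup also run when setUp itself fails partway through, which tearDown does not guarantee.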

Comment on lines 69 to 71
(Table([files, [423, 424], ["DummyCamComp", "DummyCamComp"]],
       names=["Files", "visit", "instrument"]),
 ("--prefix", self.root2)),
Member:

I think this code would be easier to understand if you factored the data ID components into variables (e.g., visits, instruments). Otherwise it's hard to discern that 423 and 424 are visit numbers, and can be arbitrary.

Member Author:

423 and 424 aren't arbitrary -- they are the visit numbers pre-registered in the test repo.
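The reviewer's naming suggestion can still be applied even though the values are fixed. A sketch, with a plain dict standing in for the astropy Table used in the test, and file names invented for illustration:

```python
# Factor the data ID components into named variables so the table
# columns are self-describing. The visit numbers are not arbitrary:
# they match the visits pre-registered in the test repo.
files = ["test0.json", "test1.json"]          # illustrative file names
visits = [423, 424]                           # pre-registered visits
instruments = ["DummyCamComp"] * len(visits)

# Stand-in for: Table([files, visits, instruments],
#                     names=["Files", "visit", "instrument"])
table = {"Files": files, "visit": visits, "instrument": instruments}
```

A comment next to the variable definitions then documents the pre-registration constraint in exactly one place.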

Comment on lines 87 to 102
table_file = os.path.join(self.root2, f"table{i}.csv")
table.write(table_file)

run = f"u/user/test{i}"
result = runner.invoke(cli, ["ingest-files", *options,
                             self.root, "test_metric_comp", run, table_file])
self.assertEqual(result.exit_code, 0, clickResultMsg(result))

butler = Butler(self.root)
refs = list(butler.registry.queryDatasets("test_metric_comp", collections=run))
self.assertEqual(len(refs), 2)

for i, data in enumerate(datasets):
    butler_data = butler.get("test_metric_comp", visit=423+i, instrument="DummyCamComp",
                             collections=run)
    self.assertEqual(butler_data, data)
Member:

I suggest moving this code into a separate method -- it looks like it would depend only on table, options, and the data IDs? It would make the code clearer, and certainly easier to improve in the future, if the test assertions were insulated from the exact code used to create the test data.

(Table([[os.path.join(self.root2, f) for f in files], [423, 424]],
       names=["Files", "visit"]),
 ("--data-id", "instrument=DummyCamComp")),
)
Member:

Having one method for all tests makes it hard to isolate or diagnose problems. I suggest making each of these entries its own test method, moving the initialization of test*.json into setUp, and factoring the loop as suggested above.
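The restructuring being asked for looks roughly like this skeleton. Everything here is illustrative (the helper body and fixtures are elided); the point is one named test method per table entry, each delegating to a shared helper, so a failure immediately identifies which case broke:

```python
import unittest


class IngestFilesTestCase(unittest.TestCase):
    """Sketch: one test method per former tuple in the parameter list."""

    def setUp(self):
        # Shared fixtures (test repo, test*.json files) are created
        # once here instead of inside the single looping test.
        self.visits = [423, 424]

    def _ingest_and_check(self, table, options):
        # Would write the table, invoke `butler ingest-files` with
        # ``options``, and assert on the resulting datasets.
        self.assertEqual(len(self.visits), 2)

    def test_ingest_with_prefix(self):
        self._ingest_and_check(table=..., options=("--prefix", "root2"))

    def test_ingest_with_data_id(self):
        self._ingest_and_check(
            table=..., options=("--data-id", "instrument=DummyCamComp"))
```

Each method can then fail, be skipped, or be rerun independently, which a single loop over tuples cannot offer.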

@timj force-pushed the tickets/DM-30935 branch 2 times, most recently from 71fa33d to 813c571, on July 3, 2021 00:09
@timj (Member Author) commented Jul 3, 2021

@kfindeisen I've rearranged things as requested so can you please take a second look?

@kfindeisen (Member) left a comment

Thanks for making these changes!

Comment on lines +559 to +560
" The latter is usually used for 'raw'-type data that will be ingested in multiple."
" repositories.",
Member:

Given the warning not to use butler ingest-files for raws, is this comment still relevant?

Member Author:

The photodiode camera data probably should use DATAID_TYPE_RUN since the exposure it's associated with will have been defined when ingesting the raw data itself. That's what I mean by "raw-type" as opposed to raw. We may well decide that we should always use DATAID_TYPE_RUN for any external files being ingested but I'm not ready to make that the default without more experience.

Comment on lines +99 to +107
# Convert the k=v strings into a dataId dict.
universe = butler.registry.dimensions
common_data_id = parse_data_id_tuple(data_id, universe)

# Read the table assuming that Astropy can work out the format.
table = Table.read(table_file)

datasets = extract_datasets_from_table(table, common_data_id, datasetType, formatter, prefix)
Member:

Thanks, this looks much better, though the argument list for extract_datasets_from_table is a bit longer than I would have expected. 🤔
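One common remedy for a long argument list, offered here as a suggestion rather than what the PR does, is to bundle the parameters that travel together into a small dataclass. The names below are illustrative, mirroring the call quoted above:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class IngestContext:
    """Illustrative bundle for arguments shared by the ingest helpers."""

    common_data_id: Dict[str, Any]
    dataset_type: str
    formatter: Optional[str]
    prefix: Optional[str]


def extract_datasets_from_table(table, ctx: IngestContext):
    """Toy helper: pair each row with the common data ID.

    With the context object, adding a new shared parameter no longer
    changes every helper's signature.
    """
    return [(row, ctx.common_data_id) for row in table]


ctx = IngestContext(common_data_id={"instrument": "DummyCamComp"},
                    dataset_type="test_metric_comp",
                    formatter=None, prefix=None)
datasets = extract_datasets_from_table([1, 2], ctx)
```

Whether that is worth it for four or five parameters is a judgment call; the dataclass pays off once the same group is threaded through several functions.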

Returns
-------
datasets : `list` of `FileDataset`
The `FileDataset` object corresponding to the rows in the table.
Member:

Suggested change:
- The `FileDataset` object corresponding to the rows in the table.
+ The `FileDataset` objects corresponding to the rows in the table.

Comment on lines 65 to 67
# Associate the dataId with these datasets
self.files = files
self.datasets = datasets
Member:

Seems like you could have just defined self.files and self.datasets right away, at the cost of only a little more verbosity.

Takes a table of dataIds and files and ingests them.
@timj merged commit 6e62486 into master on Jul 6, 2021
@timj deleted the tickets/DM-30935 branch on July 6, 2021 22:51