
DM-30316: Add migrate rewrite-sqlite-registry command #3

Merged
merged 2 commits into master on Jul 1, 2021

Conversation

timj
Member

@timj timj commented Jun 24, 2021

This updates a sqlite registry by constructing an entirely new one.

Member

@TallJimbo TallJimbo left a comment


Looks good! I don't think you've missed any registry content; my comments are just typical minor review things.


# Check that we are really working with a SQLite database.
if not isinstance(source_butler.registry._db, SqliteDatabase):
raise RuntimeError("This command can only be used on SQLite registries.")
Member


Without this check, would this command give us a way to convert a full PostgreSQL repo to SQLite? In fact, now I wonder if with a bit more work it could even go the other way.

Member Author


The core of it is "transfer everything from butler1 to butler2", but the check is here because I am creating my own registry to copy into and assuming I'm copying it back again afterwards. I'll at least move the generic transfer code to a separate function.
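The refactor described above could look something like the following sketch. `SupportsTransfer` and `transfer_everything` are hypothetical names made up for illustration, not part of daf_butler:

```python
from typing import Iterable, Protocol


class SupportsTransfer(Protocol):
    """Minimal stand-in for the part of the Butler interface used here."""

    def transfer_from(self, source: "SupportsTransfer", refs: Iterable[object]) -> list:
        ...


def transfer_everything(source: SupportsTransfer, dest: SupportsTransfer,
                        refs: Iterable[object]) -> list:
    """Copy the given dataset refs from one butler to another.

    Factoring the generic copy out like this keeps the SQLite-only check
    in the command-line layer, where the "copy it back again" assumption
    actually lives, while the transfer itself stays backend-agnostic.
    """
    return dest.transfer_from(source, refs)
```

Any object with a compatible `transfer_from` method satisfies the protocol, so the command-line wrapper and future PostgreSQL-to-SQLite (or reverse) tools could share this one function.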

exported = export_non_datasets(source_butler)

# Read all the datasets we are going to transfer, removing duplicates
# but using a list to have a fixed ordering.
Member


I don't think converting to list is going to help with ordering. Any given set will have the same iteration order as long as its elements don't change.

Member Author


Okay. The order is important because the list returned from Butler.transfer_from has to be in the same order as the refs given to that method. Nothing should be added to the set in the meantime, so it seems like it would be fine. (It is converted to a list inside FileDatastore.transfer_from at the moment, which is probably a mistake for the "what if there are 10 million datasets" type of repo.)
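If a reproducible, insertion-ordered de-duplication is actually wanted here (rather than relying on a set's stable-but-arbitrary iteration order), `dict.fromkeys` is the idiomatic one-liner; a minimal sketch with made-up ref names:

```python
# Hypothetical dataset refs with duplicates; order of first appearance matters.
refs = ["ref_a", "ref_b", "ref_a", "ref_c", "ref_b"]

# dicts preserve insertion order (Python 3.7+), so this de-duplicates while
# keeping a well-defined ordering that later transfer results can mirror.
unique_refs = list(dict.fromkeys(refs))
# unique_refs == ["ref_a", "ref_b", "ref_c"]
```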

for dimension in butler.registry.dimensions.getStaticElements():
# Skip dimensions that are entirely derivable from other
# dimensions. "band" is always knowable from a "physical_filter".
if str(dimension).startswith("htm") or str(dimension) == "band":
Member


A more general version of this check would be

if isinstance(dimension, SkyPixDimension) or dimension.viewOf is not None:

def export_non_datasets(butler: Butler) -> io.StringIO:
"""For the given butler, create an export YAML buffer."""
# Use a string buffer to avoid file I/O
yamlBuffer = io.StringIO()
Member


Might be better to actually write to and read from a file so we aren't as limited by the amount of memory available. I imagine the dataset side of things would also require work that's probably not worthwhile now, but this would be a start.
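One possible middle ground, as a sketch only: the standard library's `tempfile.SpooledTemporaryFile` behaves like an in-memory buffer until it exceeds a size threshold, then transparently spills to disk. This is not what the PR does, just one way the `StringIO` could be swapped out later:

```python
import tempfile

# Hypothetical export payload standing in for the YAML the real code writes.
export_text = "description: example export\nruns:\n  - run/a\n"

# Stays in memory below max_size; larger exports are spooled to a real
# temporary file, so memory use no longer scales with repo size.
buf = tempfile.SpooledTemporaryFile(mode="w+", max_size=1024 * 1024)
buf.write(export_text)
buf.seek(0)
round_tripped = buf.read()
buf.close()
```

Small repos keep `StringIO`-like speed; only the huge ones pay the file I/O cost.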

Member Author


A lot of this code is going to break down if there are 10 million datasets. Butler.transfer_from() might have to start working on N datasets at a time, and I might have to abandon returning the new list with the rewritten refs in the destination in the same order as the refs in the source.
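Working on N datasets at a time could be built on a simple batching helper; a sketch (the `chunked` helper is hypothetical, not a daf_butler API):

```python
from itertools import islice


def chunked(iterable, n):
    """Yield successive lists of at most n items from iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk


# A transfer loop could then process N refs per call instead of
# materializing all 10 million at once; concatenating the per-chunk
# results still matches the input ordering, chunk by chunk.
refs = range(10)
batches = list(chunked(refs, 4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```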

Member Author


It seems pretty clear that we should have a JSON form of this export/import. Pondering buffer vs. file: is a YAML file really going to help much over a string buffer, given that the loader is going to read the entire YAML file into memory to parse it? I suppose the issue is whether io.StringIO returns a view of the buffer or a copy of it.

Member


Good points. It's sad that the state of JSON/YAML streaming is so bad (at least in Python), but we are pretty thoroughly boxed in by that in terms of being able to scale up import/export.

@timj timj merged commit a7e4e67 into master Jul 1, 2021