
DM-30316: Add migrate rewrite-sqlite-registry command #3

Merged
merged 2 commits into master on Jul 1, 2021

Conversation

timj
Member

@timj timj commented Jun 24, 2021

This updates a sqlite registry by constructing an entirely new one.

Member

@TallJimbo TallJimbo left a comment


Looks good! I don't think you've missed any registry content; my comments are just typical minor review things.


# Check that we are really working with a SQLite database.
if not isinstance(source_butler.registry._db, SqliteDatabase):
raise RuntimeError("This command can only be used on SQLite registries.")
Member


Without this check, would this command give us a way to convert a full PostgreSQL repo to SQLite? In fact, now I wonder if with a bit more work it could even go the other way.

Member Author


The core of it is "transfer everything from butler1 to butler2", but the check is here because I am creating my own registry to copy into and assuming I'm copying it back again afterwards. I'll at least move the generic transfer code to a separate function.
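The refactor described above could look something like the following sketch. `SupportsTransfer` and `transfer_everything` are hypothetical names made up for illustration, not part of daf_butler:

```python
from typing import Iterable, Protocol


class SupportsTransfer(Protocol):
    """Minimal stand-in for the part of the Butler interface used here."""

    def transfer_from(self, source: "SupportsTransfer", refs: Iterable[object]) -> list:
        ...


def transfer_everything(source: SupportsTransfer, dest: SupportsTransfer,
                        refs: Iterable[object]) -> list:
    """Copy the given dataset refs from one butler to another.

    Factoring the generic copy out like this keeps the SQLite-only check
    in the command-line layer, where the "copy it back again" assumption
    actually lives, while the transfer itself stays backend-agnostic.
    """
    return dest.transfer_from(source, refs)
```

Any object with a compatible `transfer_from` method satisfies the protocol, so the command-line wrapper and future PostgreSQL-to-SQLite (or reverse) tools could share this one function.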

exported = export_non_datasets(source_butler)

# Read all the datasets we are going to transfer, removing duplicates
# but using a list to have a fixed ordering.
Member


I don't think converting to list is going to help with ordering. Any given set will have the same iteration order as long as its elements don't change.

Member Author


Okay. The order is important because the list returned from Butler.transfer_from has to be in the same order as the refs given to that method. Nothing should be added to the set in the meantime, so it seems like it would be fine. (It is converted to a list inside FileDatastore.transfer_from at the moment, which is probably a mistake for the "what if there are 10 million datasets" type of repo.)
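If a reproducible, insertion-ordered de-duplication is actually wanted here (rather than relying on a set's stable-but-arbitrary iteration order), `dict.fromkeys` is the idiomatic one-liner; a minimal sketch with made-up ref names:

```python
# Hypothetical dataset refs with duplicates; order of first appearance matters.
refs = ["ref_a", "ref_b", "ref_a", "ref_c", "ref_b"]

# dicts preserve insertion order (Python 3.7+), so this de-duplicates while
# keeping a well-defined ordering that later transfer results can mirror.
unique_refs = list(dict.fromkeys(refs))
# unique_refs == ["ref_a", "ref_b", "ref_c"]
```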

for dimension in butler.registry.dimensions.getStaticElements():
# Skip dimensions that are entirely derivable from other
# dimensions. "band" is always knowable from a "physical_filter".
if str(dimension).startswith("htm") or str(dimension) == "band":
Member


A more general version of this check would be

if isinstance(dimension, SkyPixDimension) or dimension.viewOf is not None:

def export_non_datasets(butler: Butler) -> io.StringIO:
"""For the given butler, create an export YAML buffer."""
# Use a string buffer to avoid file I/O
yamlBuffer = io.StringIO()
Member


Might be better to actually write to and read from a file so we aren't as limited by the amount of memory available. I imagine the dataset side of things would also require work that's probably not worthwhile now, but this would be a start.
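One possible middle ground, as a sketch only: the standard library's `tempfile.SpooledTemporaryFile` behaves like an in-memory buffer until it exceeds a size threshold, then transparently spills to disk. This is not what the PR does, just one way the `StringIO` could be swapped out later:

```python
import tempfile

# Hypothetical export payload standing in for the YAML the real code writes.
export_text = "description: example export\nruns:\n  - run/a\n"

# Stays in memory below max_size; larger exports are spooled to a real
# temporary file, so memory use no longer scales with repo size.
buf = tempfile.SpooledTemporaryFile(mode="w+", max_size=1024 * 1024)
buf.write(export_text)
buf.seek(0)
round_tripped = buf.read()
buf.close()
```

Small repos keep `StringIO`-like speed; only the huge ones pay the file I/O cost.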

Member Author


A lot of this code is going to break down if there are 10 million datasets. Butler.transfer_from() might have to start working on N datasets at a time, and I might have to abandon returning the new list with the rewritten refs in the destination in the same order as the refs in the source.
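Working on N datasets at a time could be built on a simple batching helper; a sketch (the `chunked` helper is hypothetical, not a daf_butler API):

```python
from itertools import islice


def chunked(iterable, n):
    """Yield successive lists of at most n items from iterable."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk


# A transfer loop could then process N refs per call instead of
# materializing all 10 million at once; concatenating the per-chunk
# results still matches the input ordering, chunk by chunk.
refs = range(10)
batches = list(chunked(refs, 4))
# batches == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```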

Member Author


It seems pretty clear that we should have a JSON form of this export/import. Pondering buffer vs. file: is a YAML file really going to help much over a string buffer, given that the loader is going to read the entire YAML file into memory to parse it? I suppose the issue is whether io.StringIO returns a view of the buffer or a copy of it.

Member


Good points. It's sad that the state of JSON/YAML streaming is so bad (at least in Python), but we are pretty thoroughly boxed in by that in terms of being able to scale up import/export.

@timj timj merged commit a7e4e67 into master Jul 1, 2021