DM-25355: Add support for specifying schema versions #316

Merged

merged 2 commits into master from tickets/DM-25355 on Jun 25, 2020

Conversation

andy-slac (Contributor):

Schema versions are now specified in the registry config file and are stored in the attributes table on schema initialization. When opening an existing database, these versions are compared with the configured ones, and an exception is raised if they are not compatible. There is also groundwork for detecting schema changes by checking schema digests; it is not used by anything yet.

A couple of unit tests needed updates to avoid repeated initialization of an existing schema.
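As a rough sketch of the version check described here (this is not the actual daf_butler API; the VersionTuple name comes from this ticket, while the fromString/checkCompatibility methods and the major/minor compatibility rule are illustrative assumptions):

```python
from typing import NamedTuple


class VersionTuple(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def fromString(cls, versionStr: str) -> "VersionTuple":
        # Parse a "MAJOR.MINOR.PATCH" string such as "0.0.1".
        major, minor, patch = (int(p) for p in versionStr.split("."))
        return cls(major, minor, patch)

    def checkCompatibility(self, stored: "VersionTuple") -> bool:
        # Assume versions are compatible when major and minor agree.
        return (self.major, self.minor) == (stored.major, stored.minor)


# Configured version comes from the registry config file; the stored one
# is read back from the attributes table of an existing repository.
configured = VersionTuple.fromString("0.0.1")
stored = VersionTuple.fromString("0.0.2")  # patch-level difference only
if not configured.checkCompatibility(stored):
    raise RuntimeError(
        f"Configured schema version {configured} is incompatible "
        f"with stored version {stored}"
    )
```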

version: 0.0.1
digest: ""
dimensions:
  version: 0.0.1
Member:

Does this have to match the version number in dimensions.yaml?

andy-slac (Author):

This is the question that I forgot to ask among all the other complications. I am not sure how these versions relate to each other, but we should try to un-confuse things here. I know that the dimensions.yaml version is used internally and we can probably have multiple versions (and I think it's also a single integer number).

@TallJimbo, what are your thoughts on how that version is related to the schema version? Do we actually need that one?

TallJimbo (Member):

I think it'd probably be ideal if the version in dimensions.yaml could be the only one, and also serve as a schema version. We could also probably find a way to pass a version defined in this part of the config into the dimensions system, but I'm guessing it would require adding it as an argument to a bunch of constructors. I definitely think we should have only one version number for dimensions.

I would be quite happy to use the VersionTuple class introduced on this ticket for the version information stored in DimensionUniverse, if we can organize the package structure to make that dependency okay.

As a side thought, I know the dimensions system needs some work to make backwards compatibility easier; very small changes (like adding a new HTM level) right now will make a totally new universe that knows nothing about the old one, and since dimensions are used to define tables (e.g. dataset_collection_XXXX) for some managers, that's a real problem. The (vague) ideas I have for fixing this include:

  • Start including the dimension version number into DimensionGraph.encode output so we can at least tell when a dataset_collection_XXXX table was defined with a different dimension version than we have.
  • Have dimensions.yaml include multiple versions (under different version: headers) of the complete dimensions definition. Until a repo is fully migrated away from an old version, any older dimensions definitions used in that repo must be present in configuration as well. If necessary we could add some kind of inheritance/override stuff to configs to reduce the duplication this implies.
  • Provide a way to translate a DimensionGraph from one version to another, by looking up the DimensionUniverse for both versions. This would need to assume that the translated dimensions are present in both and have the same relationships, but could otherwise differ.

That's probably mostly not relevant for this ticket (it's more about turning more changes into patch or minor changes instead of forcing all changes to be major ones), but this discussion reminded me that I wanted to discuss these ideas with you, and it gives you an idea of what the version in dimensions.yaml might do.
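A hypothetical illustration of the first bullet: folding the dimensions version into the hash used for per-graph table names, so a table defined under a different dimensions version gets a distinguishable name. The function and table-name pattern are invented for this sketch, not daf_butler code:

```python
import hashlib


def encodeGraphTableName(dimensions: frozenset[str], version: str) -> str:
    # Combine the sorted dimension names with the dimensions version,
    # then hash, so a version change yields a different table name.
    payload = ",".join(sorted(dimensions)) + "@" + version
    digest = hashlib.blake2b(payload.encode(), digest_size=8).hexdigest()
    return f"dataset_collection_{digest}"


# Same dimensions, different dimensions.yaml versions -> different tables.
print(encodeGraphTableName(frozenset({"instrument", "visit"}), "0.0.1"))
print(encodeGraphTableName(frozenset({"instrument", "visit"}), "0.0.2"))
```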

andy-slac (Author):

I'm OK with using the version from dimensions.yaml, but I worry that we may be overloading that version number with different purposes. The schema version number should reflect the database side of things, and that is an implementation detail; it could change if you use a different implementation of a manager class, for example. This could mean that we have to switch to an incompatible version number even if dimensions.yaml does not change. I'm not entirely sure, but I think the existing version number has a different purpose, one which does not depend on implementation details like the database schema.

Member:

It's a good point that the version in dimensions.yaml is not quite a schema version. But I also think it's all we need (pending the question of versioning managers that's being discussed on Jira) - I think we can say that the combination of the dimensions.yaml version with versions that would be embedded in the daf_butler codebase (I'm imagining an overall daf_butler schema version + a dimension manager version, but that's not the only possibility) totally controls the dimensions part of the schema.

Or more precisely, there isn't a well-defined "dimensions part of the schema", but if we think about the full schema "version" being an unorderable hash of many different human-meaningful identifiers (see my last Jira comment), then the dimensions.yaml version number is still a sufficient number to capture any changes in that file/yaml-section, even if it has to be hashed together with other in-code versions and config labels to map to the full schema.
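For illustration, a full-schema "version" built as an unorderable hash of several human-meaningful identifiers could look roughly like this (the particular component labels are assumptions for the sketch):

```python
import hashlib

# Human-meaningful version labels that together determine the schema.
components = {
    "daf_butler_schema": "1.0.0",
    "dimensions.yaml": "0.0.1",
    "dimension_manager": "0.0.1",
}

# Hash them in a canonical order; any component change changes the digest,
# but digests cannot be ordered, only compared for equality.
canonical = ";".join(f"{k}={v}" for k, v in sorted(components.items()))
fullSchemaDigest = hashlib.sha1(canonical.encode()).hexdigest()
print(fullSchemaDigest)
```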

andy-slac (Author):

One more thought on dimensions - encoding the dimension graph into dataset_collection_XXXX table names makes it very fragile with respect to adding or removing dimensions to/from the universe. There may be a (complicated) workaround for that: adding a separate registry/table for mapping a set of dimensions to a table name (or just a serial table number). It would be append-only and could be cached efficiently, but it does add some extra logic.
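A sketch of that workaround: an append-only mapping from a dimension set to a serial number used in the table name, so the name no longer depends on the encoding. The class and naming pattern are hypothetical:

```python
class GraphTableRegistry:
    def __init__(self) -> None:
        # In a real registry this would be a database table; rows are
        # only ever appended, so a cache of it never goes stale.
        self._ids: dict[frozenset, int] = {}

    def tableName(self, dimensions: frozenset) -> str:
        # Assign the next serial number on first sight of a dimension set.
        key = frozenset(dimensions)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return f"dataset_collection_{self._ids[key]:04d}"


reg = GraphTableRegistry()
print(reg.tableName(frozenset({"instrument", "visit"})))  # dataset_collection_0001
print(reg.tableName(frozenset({"skymap", "tract"})))      # dataset_collection_0002
```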

TallJimbo (Member), Jun 24, 2020:

I was already planning to follow up (on DM-24660, or maybe DM-24613 to do it right the first time for quantum tables) on your suggestion on a previous ticket review to start saving the full dataset_collection_XXXX table names into the dataset_type table, instead of relying on the encoding producing the same thing each time. I think if we encode the dimension version number, too, that may be enough to avoid that kind of logic, but we'll see.

dimensions:
  version: 0.0.1
  digest: ""
datastores:
Member:

Another quick point, but in theory the schema for datastores can depend on the datastore. s3datastore and posixDatastore do use the same schema for their internal table but that's not required (inMemoryDatastore doesn't use a table at all).

TallJimbo (Member):

If the way this would work in practice is that the single datastore schema version would need to be updated whenever any datastore implementation changes its schema, that might still be okay (at least if all datastores continue to be implemented in daf_butler, not as extensions in other packages - that worries me a bit, but maybe that's just a hypothetical concern). It might cause us to declare some repos as incompatible when they actually aren't, which is better than not recognizing an incompatibility, but it could still be a problem.

Of course, we have the exact same problem (I think) with the core schema version and all of the registry managers. If we change one of those, I assume that means we change the core schema version - but that implies that repos that didn't use the changed manager now report as incompatible with the codebase, but really aren't.

andy-slac (Author):

Hmm, it gets complicated. I think the schema is not the whole story; there is also data compatibility. Could s3datastore and posixDatastore actually share the same database? If they cannot, then we should not use the same version number for their schemas.

That question also applies to other potential cases: I can imagine that sometimes we may need to do a data migration without changing the schema (e.g. changing the format/meaning of string/BLOB data). That may also be accompanied by a change of the manager class, but not necessarily. That sort of change also needs to be reflected in the schema version. And I think the configuration that we are going to store in the new attributes table is another example where the data in a table has to be consistent with the schema version number, because that configuration actually determines the schema.
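For context, the attributes table being discussed is a simple key/value store; a sketch of how it could hold both version numbers and the schema-defining configuration (the table layout and key names here are assumptions, not the actual daf_butler definition):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE butler_attributes (name TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO butler_attributes VALUES (?, ?)",
    [
        ("version:core", "0.0.1"),
        ("version:dimensions", "0.0.1"),
        # The stored configuration must stay consistent with the version
        # keys above, since it determines the schema itself.
        ("config:dimensions.yaml", "... full YAML text ..."),
    ],
)
row = conn.execute(
    "SELECT value FROM butler_attributes WHERE name = 'version:core'"
).fetchone()
print(row[0])  # compare against the configured version on startup
```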

Member:

S3 and Posix datastores do currently share the same database schema but nothing requires that they continue to do so. It is defined in fileLikeDatastore at the moment.

andy-slac (Author):

My question was more about data: suppose the Posix datastore had written something to that table, could the S3 datastore read and interpret those records, or at least understand that they are from a different datastore and have to be ignored?

timj (Member), Jun 23, 2020:

Each datastore has its own table. They don't share content. You could point an S3 datastore at the posix table and it would work so long as somebody behind the scenes had uploaded the data to S3 without telling butler, but that wasn't something we had considered a standard scenario.

Member:

And each butler can have multiple datastores in play, each with their own tables, and you can only work out how many you have by asking the datastore itself.

andy-slac (Author):

If there can be multiple datastores, and/or datastores can be added/removed dynamically, then I think we cannot manage that kind of setup via a single version number. Each datastore probably needs its own separate schema version number, and each one needs to be migrated separately. I think in the datastores case we likely don't need to migrate the database schema between different datastores, so they can be managed completely independently; each one would have a linear history. It would still be beneficial to have the same tool (alembic) manage datastore versions, so those versions need to be identifiable as for the other managers (via configuration or in the source code).
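A sketch of that per-datastore scheme: each datastore class carries its own version, checked independently against what is stored in the repository. The names and the compatibility rule are assumptions for illustration:

```python
# Versions recorded in the repository vs. versions the running code expects.
STORED_VERSIONS = {"PosixDatastore": "1.2.0", "S3Datastore": "1.0.0"}
CONFIGURED_VERSIONS = {"PosixDatastore": "1.2.1", "S3Datastore": "1.0.0"}


def compatible(a: str, b: str) -> bool:
    # Treat versions as compatible when major and minor parts agree,
    # so the Posix patch-level difference above still passes the check.
    return a.split(".")[:2] == b.split(".")[:2]


# Each datastore is checked on its own; one incompatibility does not
# say anything about the others, and each can migrate independently.
for name, configured in CONFIGURED_VERSIONS.items():
    stored = STORED_VERSIONS.get(name)
    if stored is not None and not compatible(configured, stored):
        raise RuntimeError(f"{name}: stored {stored} vs configured {configured}")
```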

Member:

I think each datastore class (s3, posix etc) needs its own version number.

TallJimbo (Member) left a comment:

Comments are either little typos or big-picture questions about how this might be used in later tickets. I'll summarize the latter in Jira rather than here, but the code looks fine and is okay to merge unless you think the big-picture stuff should hold it up.

Files with review comments (all threads resolved):

  • python/lsst/daf/butler/registry/interfaces/_database.py
  • python/lsst/daf/butler/registry/versions.py (several threads)
  • python/lsst/daf/butler/registry/_registry.py
andy-slac merged commit 319242f into master on Jun 25, 2020
timj deleted the tickets/DM-25355 branch on February 16, 2024