DM-25355: Add support for specifying schema versions #316

Merged

merged 2 commits into master from tickets/DM-25355 on Jun 25, 2020

Conversation

andy-slac (Contributor):

Schema versions are now specified in the registry config file and are stored in the attributes table on schema initialization. When opening an existing database, these versions are compared with the configured ones, and an exception is raised if they are not compatible. There is also groundwork for detecting schema changes by checking schema digests; it is not used by anything yet.

A couple of unit tests needed updates to avoid repeated initialization of an existing schema.
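As a rough sketch of the version check described here (this is not the actual daf_butler API; the VersionTuple name comes from this ticket, while the fromString/checkCompatibility methods and the major/minor compatibility rule are illustrative assumptions):

```python
from typing import NamedTuple


class VersionTuple(NamedTuple):
    major: int
    minor: int
    patch: int

    @classmethod
    def fromString(cls, versionStr: str) -> "VersionTuple":
        # Parse a "MAJOR.MINOR.PATCH" string such as "0.0.1".
        major, minor, patch = (int(p) for p in versionStr.split("."))
        return cls(major, minor, patch)

    def checkCompatibility(self, stored: "VersionTuple") -> bool:
        # Assume versions are compatible when major and minor agree.
        return (self.major, self.minor) == (stored.major, stored.minor)


# Configured version comes from the registry config file; the stored one
# is read back from the attributes table of an existing repository.
configured = VersionTuple.fromString("0.0.1")
stored = VersionTuple.fromString("0.0.2")  # patch-level difference only
if not configured.checkCompatibility(stored):
    raise RuntimeError(
        f"Configured schema version {configured} is incompatible "
        f"with stored version {stored}"
    )
```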

version: 0.0.1
digest: ""
dimensions:
  version: 0.0.1
Member:

Does this have to match the version number in dimensions.yaml?

andy-slac (Author):

This is the question that I forgot to ask among all the other complications. I am not sure how these versions relate to each other, but we should try to un-confuse things here. I know that the dimensions.yaml version is used internally and we can probably have multiple versions (and I think it's also a single integer number).

@TallJimbo, what are your thoughts on how that version is related to the schema version? Do we actually need that one?

TallJimbo (Member):

I think it'd probably be ideal if the version in dimensions.yaml could be the only one, and also serve as a schema version. We could also probably find a way to pass a version defined in this part of the config into the dimensions system, but I'm guessing it would require adding it as an argument to a bunch of constructors. I definitely think we should have only one version number for dimensions.

I would be quite happy to use the VersionTuple class introduced on this ticket for the version information stored in DimensionUniverse, if we can organize the package structure to make that dependency okay.

As a side thought, I know the dimensions system needs some work to make backwards compatibility easier; very small changes (like adding a new HTM level) right now will make a totally new universe that knows nothing about the old one, and since dimensions are used to define tables (e.g. dataset_collection_XXXX) for some managers, that's a real problem. The (vague) ideas I have for fixing this include:

  • Start including the dimension version number into DimensionGraph.encode output so we can at least tell when a dataset_collection_XXXX table was defined with a different dimension version than we have.
  • Have dimensions.yaml include multiple versions (under different version: headers) of the complete dimensions definition. Until a repo is fully migrated away from an old version, any older dimensions definitions used in that repo must be present in configuration as well. If necessary we could add some kind of inheritance/override stuff to configs to reduce the duplication this implies.
  • Provide a way to translate a DimensionGraph from one version to another, by looking up the DimensionUniverse for both versions. This would need to assume that the translated dimensions are present in both and have the same relationships, but could otherwise differ.

That's probably mostly not relevant for this ticket (it's more about turning more changes into patch or minor changes instead of forcing all changes to be major ones), but this discussion reminded me that I wanted to discuss these ideas with you, and it gives you an idea of what the version in dimensions.yaml might do.
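A hypothetical illustration of the first bullet: folding the dimensions version into the hash used for per-graph table names, so a table defined under a different dimensions version gets a distinguishable name. The function and table-name pattern are invented for this sketch, not daf_butler code:

```python
import hashlib


def encodeGraphTableName(dimensions: frozenset[str], version: str) -> str:
    # Combine the sorted dimension names with the dimensions version,
    # then hash, so a version change yields a different table name.
    payload = ",".join(sorted(dimensions)) + "@" + version
    digest = hashlib.blake2b(payload.encode(), digest_size=8).hexdigest()
    return f"dataset_collection_{digest}"


# Same dimensions, different dimensions.yaml versions -> different tables.
print(encodeGraphTableName(frozenset({"instrument", "visit"}), "0.0.1"))
print(encodeGraphTableName(frozenset({"instrument", "visit"}), "0.0.2"))
```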

andy-slac (Author):

I'm OK with using the version from dimensions.yaml, but I worry that we may be overloading that version number with different purposes. The schema version number should reflect the database side of things, and that is an implementation detail; it could change if you use a different implementation of a manager class, for example. This could mean that we have to switch to an incompatible version number even if dimensions.yaml does not change. I'm not entirely sure, but I think the existing version number has a different purpose, one which does not depend on implementation details like the database schema.

Member:

It's a good point that the version in dimensions.yaml is not quite a schema version. But I also think it's all we need (pending the question of versioning managers that's being discussed on Jira) - I think we can say that the combination of the dimensions.yaml version with versions that would be embedded in the daf_butler codebase (I'm imagining an overall daf_butler schema version + a dimension manager version, but that's not the only possibility) totally controls the dimensions part of the schema.

Or more precisely, there isn't a well-defined "dimensions part of the schema", but if we think about the full schema "version" being an unorderable hash of many different human-meaningful identifiers (see my last Jira comment), then the dimensions.yaml version number is still a sufficient number to capture any changes in that file/yaml-section, even if it has to be hashed together with other in-code versions and config labels to map to the full schema.
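For illustration, a full-schema "version" built as an unorderable hash of several human-meaningful identifiers could look roughly like this (the particular component labels are assumptions for the sketch):

```python
import hashlib

# Human-meaningful version labels that together determine the schema.
components = {
    "daf_butler_schema": "1.0.0",
    "dimensions.yaml": "0.0.1",
    "dimension_manager": "0.0.1",
}

# Hash them in a canonical order; any component change changes the digest,
# but digests cannot be ordered, only compared for equality.
canonical = ";".join(f"{k}={v}" for k, v in sorted(components.items()))
fullSchemaDigest = hashlib.sha1(canonical.encode()).hexdigest()
print(fullSchemaDigest)
```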

andy-slac (Author):

One more thought on dimensions - encoding the dimension graph into dataset_collection_XXXX table names makes it very fragile with respect to adding or removing dimensions to/from the universe. There may be a (complicated) workaround for that: adding a separate registry/table for mapping a set of dimensions to a table name (or just a serial table number). It would be append-only and could be cached efficiently, but it does add some extra logic.
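A sketch of that workaround: an append-only mapping from a dimension set to a serial number used in the table name, so the name no longer depends on the encoding. The class and naming pattern are hypothetical:

```python
class GraphTableRegistry:
    def __init__(self) -> None:
        # In a real registry this would be a database table; rows are
        # only ever appended, so a cache of it never goes stale.
        self._ids: dict[frozenset, int] = {}

    def tableName(self, dimensions: frozenset) -> str:
        # Assign the next serial number on first sight of a dimension set.
        key = frozenset(dimensions)
        if key not in self._ids:
            self._ids[key] = len(self._ids) + 1
        return f"dataset_collection_{self._ids[key]:04d}"


reg = GraphTableRegistry()
print(reg.tableName(frozenset({"instrument", "visit"})))  # dataset_collection_0001
print(reg.tableName(frozenset({"skymap", "tract"})))      # dataset_collection_0002
```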

TallJimbo (Member), Jun 24, 2020:

I was already planning to follow up (on DM-24660, or maybe DM-24613 to do it right the first time for quantum tables) on your suggestion on a previous ticket review to start saving the full dataset_collection_XXXX table names into the dataset_type table, instead of relying on the encoding producing the same thing each time. I think if we encode the dimension version number, too, that may be enough to avoid that kind of logic, but we'll see.

dimensions:
  version: 0.0.1
  digest: ""
datastores:
Member:

Another quick point, but in theory the schema for datastores can depend on the datastore. s3datastore and posixDatastore do use the same schema for their internal table but that's not required (inMemoryDatastore doesn't use a table at all).

TallJimbo (Member):

If the way this would work in practice is that the single datastore schema version would need to be updated whenever any datastore implementation changes its schema, that might still be okay (at least if all datastores continue to be implemented in daf_butler, not as extensions in other packages - that worries me a bit, but maybe that's just a hypothetical concern). It might cause us to declare some repos as incompatible when they actually aren't, which is better than not recognizing an incompatibility, but it could still be a problem.

Of course, we have the exact same problem (I think) with the core schema version and all of the registry managers. If we change one of those, I assume that means we change the core schema version - but that implies that repos that didn't use the changed manager now report as incompatible with the codebase, but really aren't.

andy-slac (Author):

Hmm, it gets complicated. I think the schema is not the whole story; there is also data compatibility. Could s3datastore and posixDatastore actually share the same database? If they cannot, then we should not use the same version number for their schemas.

That question also applies to other potential cases: I can imagine that sometimes we may need to do a data migration without changing the schema (e.g. changing the format/meaning of string/BLOB data). That may also be accompanied by a change of the manager class, but not necessarily. That sort of change also needs to be reflected in the schema version. And I think the configuration that we are going to store in the new attributes table is another example where the data in a table has to be consistent with the schema version number, because that configuration actually determines the schema.
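For context, the attributes table being discussed is a simple key/value store; a sketch of how it could hold both version numbers and the schema-defining configuration (the table layout and key names here are assumptions, not the actual daf_butler definition):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE butler_attributes (name TEXT PRIMARY KEY, value TEXT)")
conn.executemany(
    "INSERT INTO butler_attributes VALUES (?, ?)",
    [
        ("version:core", "0.0.1"),
        ("version:dimensions", "0.0.1"),
        # The stored configuration must stay consistent with the version
        # keys above, since it determines the schema itself.
        ("config:dimensions.yaml", "... full YAML text ..."),
    ],
)
row = conn.execute(
    "SELECT value FROM butler_attributes WHERE name = 'version:core'"
).fetchone()
print(row[0])  # compare against the configured version on startup
```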

Member:

S3 and Posix datastores do currently share the same database schema but nothing requires that they continue to do so. It is defined in fileLikeDatastore at the moment.

andy-slac (Author):

My question was more about data: suppose the Posix datastore had written something to that table, could the S3 datastore read and interpret those records, or at least understand that they are from a different datastore and have to be ignored?

timj (Member), Jun 23, 2020:

Each datastore has its own table. They don't share content. You could point an S3 datastore at the posix table and it would work so long as somebody behind the scenes had uploaded the data to S3 without telling butler, but that wasn't something we had considered a standard scenario.

Member:

And each butler can have multiple datastores in play, each with their own tables, and you can only work out how many you have by asking the datastore itself.

andy-slac (Author):

If there can be multiple datastores, and/or datastores can be added/removed dynamically, then I think we cannot manage that kind of setup via a single version number. Each datastore probably needs its own separate schema version number, and each one needs to be migrated separately. I think in the datastores case we likely don't need to migrate the database schema between different datastores, so they can be managed completely independently; each one would have a linear history. It would still be beneficial to have the same tool (alembic) manage datastore versions, so those versions need to be identifiable as for the other managers (via configuration or in the source code).
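A sketch of that per-datastore scheme: each datastore class carries its own version, checked independently against what is stored in the repository. The names and the compatibility rule are assumptions for illustration:

```python
# Versions recorded in the repository vs. versions the running code expects.
STORED_VERSIONS = {"PosixDatastore": "1.2.0", "S3Datastore": "1.0.0"}
CONFIGURED_VERSIONS = {"PosixDatastore": "1.2.1", "S3Datastore": "1.0.0"}


def compatible(a: str, b: str) -> bool:
    # Treat versions as compatible when major and minor parts agree,
    # so the Posix patch-level difference above still passes the check.
    return a.split(".")[:2] == b.split(".")[:2]


# Each datastore is checked on its own; one incompatibility does not
# say anything about the others, and each can migrate independently.
for name, configured in CONFIGURED_VERSIONS.items():
    stored = STORED_VERSIONS.get(name)
    if stored is not None and not compatible(configured, stored):
        raise RuntimeError(f"{name}: stored {stored} vs configured {configured}")
```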

Member:

I think each datastore class (s3, posix etc) needs its own version number.

TallJimbo (Member) left a comment:

Comments are either little typos or big-picture questions about how this might be used in later tickets. I'll summarize the latter in Jira rather than here, but the code looks fine and is okay to merge unless you think the big-picture stuff should hold it up.

Files with review comments (all threads resolved):

  • python/lsst/daf/butler/registry/interfaces/_database.py
  • python/lsst/daf/butler/registry/versions.py (several threads)
  • python/lsst/daf/butler/registry/_registry.py
andy-slac merged commit 319242f into master on Jun 25, 2020
timj deleted the tickets/DM-25355 branch on February 16, 2024