Function source normalization (pg_dump) #139
This change adds two options to the `codd verify-schema` command:

* `--ignore-col-ord` ignores column order. This is useful when making checkpoint migrations that perform column alignment.
* `--ignore-fun-def` ignores function definitions. This is a *temporary* measure for dealing with `pg_dump` output. (mzabani#139)

Note that this functionality is *not* added to schema verification done by other commands.

A primary goal of this commit is to minimize change to existing code. A much more elegant solution could be made if the existing code were refactored.

Notes:

* The `toFiles` function uses `DbDiskObj` instances to flatten a `DbRep` to a list of leaf values annotated by path. The object types are *not* included, but the new options require them. To minimize change to existing code, I wrote a separate function `toFiles'` that does the same thing but includes object types.
* The `DiffType` sum type is specialized to what is displayed: it does not include expected values. I think this type should be refactored as follows, so that the full context is available. The same subset of the full context can still be displayed, and, if desired, more detailed/specific diff information (a diff of the `Value`s) could be shown. It also makes the type useful internally. To minimize change to existing code, I simply use `(Maybe Value, Maybe Value)`, but this is not optimal because the value `(Nothing, Nothing)` should never occur. (I therefore have to use `mapMaybe` instead of `map`.)

  ```haskell
  data DiffType
    = ExpectedButNotFound Value -- ^ expected value
    | NotExpectedButFound Value -- ^ database value
    | BothButDifferent
        Value -- ^ database value
        Value -- ^ expected value
    deriving stock (Eq, Show)
  ```

* The core of this commit is the new module `Codd.Representations.Diff`.
  * Function `diffDbRep` calculates differences using `toFiles'`, using normal equality.
  * Function `filterDiff` filters those differences using an equality predicate. The predicate can be created according to user options.
  * Function `combinePredicates` combines multiple equality predicates.
  * Functions `ignoreColumnOrderP` and `ignoreRoutineDefinitionMd5P` implement the equality predicates for the added options. They are called on non-equal values, so they default to returning `False`.
  * Function `eqIgnoring` implements equality of `Value`s, ignoring the specified keys when they are of type `Object`.
  * Function `diffToDiffType` converts from the internal representation to the map type using the existing `DiffType` type.
* When the representations are not equal, the existing implementation of `verifySchema` checks for equality twice as well as finding differences, which is three traversals of the data structure. This is not a big issue, but there is some overlap in functionality between `verifySchema` and `logSchemasComparison` that should probably be refactored.
* The new CLI flags suffer from boolean blindness. This can easily be fixed by adding some new types.
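To illustrate the predicate machinery described in these notes, here is a rough, hypothetical sketch. The names `eqIgnoring` and `combinePredicates` mirror the commit notes, but the `Value` type below is a minimal stand-in (the real code uses aeson's `Value`), so this is only an approximation of the actual implementation:

```haskell
import qualified Data.Map.Strict as M

-- Minimal stand-in for aeson's Value, just enough for the sketch.
data Value
  = Object (M.Map String Value)
  | String String
  deriving (Eq, Show)

-- | An equality predicate over a pair of (expected, actual) values.
type EqPredicate = Value -> Value -> Bool

-- | Equality of Values, ignoring the given keys when both sides
-- are Objects (rough counterpart of the eqIgnoring described above).
eqIgnoring :: [String] -> Value -> Value -> Bool
eqIgnoring ks (Object a) (Object b) = strip a == strip b
  where strip m = foldr M.delete m ks
eqIgnoring _ a b = a == b

-- | Combine predicates: values are "equal enough" if any predicate
-- accepts them. Predicates are called on non-equal values, so an
-- empty list defaults to False, matching the notes.
combinePredicates :: [EqPredicate] -> EqPredicate
combinePredicates ps a b = any (\p -> p a b) ps
```

For instance, `eqIgnoring ["definition_md5"]` would treat two routine representations as equal even when their body hashes differ, which is the essence of `--ignore-fun-def`.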
FYI: I am currently working on a checkpoint migration, and I hacked some options into the `verify-schema` command. As detailed in the commit message, a primary goal of this commit was to minimize change to existing code. It is not intended to be merged, since a much more elegant solution could be made if the existing code were refactored.
I am not able to reproduce the issue with whitespace in functions. I created this function:

```sql
CREATE FUNCTION test_function_with_whitespace() RETURNS INT AS $$
BEGIN
SELECT 1;
-- White space
   -- More white space
END
$$ LANGUAGE plpgsql;
```

Then I added this as a migration and ran it. I'm using PostgreSQL 15.2 and an up-to-date psql.
From the name of the option, am I right to think this is the same as setting the env var …?
Codd has a SQL parser that understands quotes and dollar quotes (it is necessary mostly due to …). So maybe we could fetch entire SQL bodies and hash them after stripping them of whitespace with codd's SQL parser. But it'd be nice for me to reproduce the issue first.
Interesting! I confirmed this using PostgreSQL 15.2, and I am happy to see that this was changed! I am currently using PostgreSQL 11, which has the problematic behavior that I described.
That environment variable setting changes the representations, while my options only change how representations are compared.

BTW, the linked documentation says "in some cases that is just too hard to do right for everyone." This is so true! 😄

I wrote:
Quick correction! I have since implemented this and found that nested dollar-quoted strings are not checked; PostgreSQL just searches for the closing tag that matches the opening tag without parsing the tags of any nested dollar-quoted strings. A stack is therefore not required.
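This "no stack needed" behavior can be sketched concisely: given the text following an opening tag such as `$body$`, we simply scan for the next occurrence of that same tag, without interpreting anything in between. The function name and types below are my own for illustration, not part of codd or PostgreSQL:

```haskell
import Data.List (isPrefixOf)

-- | Given a dollar-quote tag (including the dollar signs, e.g. "$f$")
-- and the text following the opening tag, return the quoted body and
-- the remainder after the closing tag. Nested dollar quotes with
-- *different* tags are treated as opaque text, so no stack is needed.
dollarQuoteBody :: String -> String -> Maybe (String, String)
dollarQuoteBody tag = go ""
  where
    go _ [] = Nothing  -- no closing tag found
    go acc s@(c : cs)
      | tag `isPrefixOf` s = Just (reverse acc, drop (length tag) s)
      | otherwise          = go (c : acc) cs
```

Note how a nested `$g$ … $g$` quote inside a `$f$ … $f$` body is skipped over without any bookkeeping, exactly because only the matching tag terminates the scan.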
IMHO, it is really unfortunate that we have to parse SQL. Covering the simple cases is easy, but there are many non-simple cases. I understand that …
PostgreSQL 11 should be able to reproduce the issue. Would storing the full source be problematic? Is a hash used just to keep the size down?
I see. I'll try to find the most recent version of postgres where this still happens.
Ah, interesting! I wonder if we can think of some sensible code-like structure that specifies how things should be compared? I fear a proliferation of options. Although... having some file in some format to specify this also feels icky; maybe having a ton of options that customize the comparison algorithm is better?
It really is unfortunate, and I actually made a gross simplification saying it's only due to …. There's a note in the code about it from a while ago:

> codd/src/Codd/Internal/MultiQueryStatement.hs, lines 35 to 44 in 0b31d04
Yeah, my thinking is the hash is just used to keep the size down. It wouldn't really be problematic to store the full contents (I think, in some places where I forgot to sprinkle …).
Hmmm, I do not foresee too many options, but perhaps I am being naïve. Assuming that no comparison options will need to be parameterized, one UI technique is to represent them using strings and not expose them as separate flags.
In my hack, I implemented the functionality by first comparing representations using equality and then filtering the result using an equality predicate that is created based on options, when specified.
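The compare-then-filter approach described here can be sketched as follows. This is a simplified, hypothetical model (`diffLeaves` and the `Leaf` stand-in are mine, not codd's actual types), but it shows the two-phase pipeline: first compute raw differences keyed by path, then drop entries the option-derived predicate considers equal anyway:

```haskell
import qualified Data.Map.Strict as M

type Path = String
type Leaf = String  -- stand-in for a representation value

-- Phase 1: raw differences as (expected, actual) pairs;
-- Nothing means the path is missing on that side.
diffLeaves :: M.Map Path Leaf -> M.Map Path Leaf
           -> M.Map Path (Maybe Leaf, Maybe Leaf)
diffLeaves expected actual =
  M.filter (uncurry (/=)) $
    M.unionWith merge (M.map (\v -> (Just v, Nothing)) expected)
                      (M.map (\v -> (Nothing, Just v)) actual)
  where merge (e, _) (_, a) = (e, a)

-- Phase 2: keep only differences the predicate does not excuse.
-- Missing-on-one-side entries are always real differences.
filterDiff :: (Leaf -> Leaf -> Bool)
           -> M.Map Path (Maybe Leaf, Maybe Leaf)
           -> M.Map Path (Maybe Leaf, Maybe Leaf)
filterDiff eq = M.filter keep
  where
    keep (Just e, Just a) = not (eq e a)
    keep _                = True
```

With no options given, the predicate is `\_ _ -> False` and phase 2 is a no-op, so the default behavior is plain equality.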
That is more complicated than I had thought... Thank you for the details!

This transaction handling is also unfortunate because of rollback. When a migration is split into multiple transactions and a non-first transaction fails, the resulting state represents partial application of the migration. I imagine that recovering from such an issue could require writing special migrations. It would be very dangerous if this happened in production and the resulting state were incompatible with the clients! I do not think that this is even viable in critical environments; in that case, one should split such migrations up into separate steps so that each step runs in a single transaction. The necessary inverses can then be developed and tested for all failure cases.

This also complicates the schema consistency check. If that check is done as part of the last transaction of a series of migrations and consistency fails, then rollback does not necessarily roll back the whole series of migrations, as described above.

Thinking aloud about my use case (not necessarily making suggestions for Codd):
Thank you for pointing that out! I did not know that some PostgreSQL statements (such as …) behave that way. I have not worked through the Codd parser, but here are a few things that I noticed are not supported when briefly looking it over:
The benefit of storing the full source is that the software would then be able to show exactly how two functions differ, not just that they differ. It is just a convenience, to save users the need to do so manually.
Yes, the … My current implementation is in Collapse SQL Whitespace: Part 2, and additional notes (and an approximate state machine diagram) can be found in the previous version, Collapse SQL Whitespace. Note that I use …
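To give a flavor of the whitespace-collapsing idea referenced above, here is a heavily simplified sketch: runs of whitespace outside single-quoted strings become one space, while quoted content is preserved verbatim. This is my own illustration, not the linked implementation; the real problem also involves comments and dollar quoting, which this deliberately ignores:

```haskell
import Data.Char (isSpace)

-- | Collapse whitespace runs outside single-quoted strings to a
-- single space; preserve quoted content (including '' escapes).
collapseWs :: String -> String
collapseWs [] = []
collapseWs ('\'' : cs) = '\'' : inString cs
  where
    inString ('\'' : '\'' : rest) = '\'' : '\'' : inString rest  -- escaped quote
    inString ('\'' : rest)        = '\'' : collapseWs rest       -- closing quote
    inString (c : rest)           = c : inString rest
    inString []                   = []                           -- unterminated; tolerate
collapseWs (c : cs)
  | isSpace c = ' ' : collapseWs (dropWhile isSpace cs)
  | otherwise = c   : collapseWs cs
```

For example, `collapseWs "SELECT  1,\n   'a  b'"` yields `"SELECT 1, 'a  b'"`: the double space inside the string literal survives while the surrounding runs collapse.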
In this case, upgrading PostgreSQL would make it look like the function source differs even when it does not. This is not a significant issue, though.
I ran into a curious case this week at work, where I could conceivably want different index … But I agree with you that these things sound hard to come by. Still, I think there's a way to be flexible enough to accommodate this kind of thing, which is what I address next:
I've thought of something that might be sufficiently generic but still relatively simple. The downside is that codd must keep its disk layout and representations stable. To ignore column order, for instance, it'd be something like …
I'm probably confused, but I think I don't understand how transaction handling or the parser relates to that? For what it's worth, somewhere in codd's docs it's written to be very careful with no-txn migrations. I think the implicit ordering of migrations that codd facilitates (i.e. since each developer runs …) …

Your idea of a DAG of schema states forces users to think about that, but there is still the question of "which migrations are pending for environment Staging/Prod/X?", which I think can only be answered by a good CI pipeline that knows this and triggers warnings in the presence of no-txn migrations, or something along those lines.
Yeah, and ouch, and thanks for reminding me of those things. I should probably make them my top priority.
Oh my, neither did I! Once again, thanks! Longer term, it'd probably be better to take that note I wrote seriously and use the upstream parser rules somehow.
Indeed, very good point. This reminds me codd should really be showing a JSON diff when schemas mismatch, for which I'll create an issue.
I just ran into an interesting case where postgresql replaces …
I forgot to mention: codd has a provision for when it errors out parsing SQL: you can add …

This is of course different from codd's parser parsing statement boundaries incorrectly, which I hope to address in a newly created issue, #153.
I have pushed a draft to ensure that pg_dump -> restore is an identity wrt codd schemas, and the test function with empty spaces is there and passes, even with postgres 11 (it won't pass in CI because I haven't populated the Nix cache, but it does pass locally). Here's the tested migration: …

I wonder if the problem is with some specific minor version of postgresql and/or pg_dump? The one used in codd's Nix environment is postgresql-11.20.
I merged the PR with the pg_dump + restore = identity test given the utility of such a test, but I'm not marking this as fixed until we get to the bottom of the issue and find a way to make things work nicely.
I'm taking the liberty of closing this issue for a few reasons:
But please reopen it (or open a new one, up to you!) if you can reproduce it in pg >= v12.
The Codd representation of a PostgreSQL function/procedure includes the MD5 hash of the source code of the function body when the function is implemented in SQL or PL/pgSQL (reference). This function body source code is a string that includes the exact whitespace used during creation. This whitespace is collapsed when the database is dumped using pg_dump, however, causing a difference in representation.
This issue can be avoided by not using pg_dump schema dumps. For example, dumps are often used when upgrading PostgreSQL. An alternative is to initialize the schema on the new version from the migrations (using Codd) and to only use pg_dump to migrate the data. Such schema dumps are the de facto standard way to save a snapshot of a database (schema), however, so being able to compare them would be nice.
This issue could be resolved by normalizing the function body source code. This requires parsing the source code. (Doing this correctly requires parsing of nested dollar-quoted strings with arbitrary identifiers, using an FSM augmented with an identifier stack.) I have not tried implementing this, but my current idea of how to do so is to change the representation to include the full source code. An option could then be used to determine how that source code is compared (equality by default, normalized when requested). Would including the full source code in the representation be problematic?