Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make row ordering deterministic #30

Closed
olivierdalang opened this issue Oct 12, 2021 · 9 comments
Closed

Make row ordering deterministic #30

olivierdalang opened this issue Oct 12, 2021 · 9 comments

Comments

@olivierdalang
Copy link
Contributor

Hi !

It seems the rows in the resulting SQL statement are not deterministic (two identical calls on the same database can result in different SQLs, containing the same rows but in different orders). This is a bit of a deal breaker when dumps must be versionned.

Postgres doesn't guarantee rows orders unless there's a ORDER BY clause.

So maybe instead of COPY $sample_table TO STDOUT we should have COPY (SELECT * FROM $sample_table ORDER BY id) TO STDOUT (with some logic to retrieve the primary key).

Thanks !!

@mla
Copy link
Owner

mla commented Oct 17, 2021

Seems doable. Just to understand the use case, why are you versioning them?

There's no actual guarantee that the selected rows will be the same anyway, is there? I mean, you can explicitly request "random" results, but otherwise, there's no selection criteria, so we're just relying on whichever rows Pg decides to give us. Usually that's deterministic, but idk if you were to rebuild or change indexes or some such if that would affect which rows are returned. IOW, I'm not sure you can rely on the row selection being deterministic, if that's what you need...

@olivierdalang
Copy link
Contributor Author

I'm versioning them because I use pg_sample to produce a seed data dump for development extracted from production data. It's an iterative process as I include more tables every now and then, and want to keep these changes as small as possible to it's easier to see what actually changed.

Of course if the rows are not the same (due to random selection or changes in the data) I don't expect it to be the same :-) But as you said, currently it outputs the row in order postgres returns them, and as you hinted it's indeed not guaranteed to be deterministic (vacuum, indices rebuilds, etc). The only way to get guaranteed deterministic results is to specify the ordering explicitly.

@mla
Copy link
Owner

mla commented Nov 7, 2021

Thank you for the commits @olivierdalang. I merged it, but see my changes. There were some issues when a table doesn't have a primary key defined (I expanded it to look for any candidate key, but try to use primary key first if avail).

And there were some issues with quoting if the index columns have reserved words and such (see my quote_identifier calls).

It's passing the tests with --ordered applied now.

I think we should rename the option though. We have a --random option, which randomizes the sample selection. I would have assumed that --ordered is the opposite... but it's not. --ordered orders just the output of the sample set, and the sample itself may still be random or may change if it's rebuilt, etc.

I'm still not clear on the use-case, tbh, but I'm happy to keep the option if it's helpful, but let's think of a different name.... --order-sample-output or --order-sample-result or some such?

@olivierdalang
Copy link
Contributor Author

olivierdalang commented Nov 8, 2021

Thanks for the merge !

From your comment, I realize my pull request only does half of the job (as actually --ordered is supposed be the opposite of --random, and should sort on the source table). Just opened a followup PR for that : #33

Sorry, I went a bit too quickly over that 😳 I'd also be happy to include a test case for this new option, but a bit lost about how tests cases are organized (not familiar with perl at all)

@mla
Copy link
Owner

mla commented Nov 9, 2021

The tests are under the t/ subdir.

Perl uses TAP (Test Anything Protocol):
https://en.wikipedia.org/wiki/Test_Anything_Protocol

If you have Perl installed locally then you probably have "prove", which runs tests. If you run prove from the project root, it should try to run everything.

e..g, I get

$ prove
t/pg_sample.t .. ok     
All tests successful.
Files=1, Tests=12,  1 wallclock secs ( 0.01 usr  0.00 sys +  0.20 cusr  0.03 csys =  0.24 CPU)
Result: PASS

Run prove --verbose for verbosity.

If you can't find prove you should still be able to run the tests directly: perl t/pg_sample.t

If you take a look a pg_sample.t, you'll see all we're doing is creating a bunch of tables with FKs and such, populating them, then running pg_sample on it. We then load the resulting sample and do some basic checks on it.

Note that some of the naming is intentionally weird, to test using removed words and such (to make sure our quoting/escaping works).

@mla
Copy link
Owner

mla commented Nov 9, 2021

On PR #33 can you compare to the changes I made on dev? I added a find_candidate_key() method or some such which I think is more general.

I'm not sure about the '$table'::regclass thing. Will that work correctly if there's multiple tables with the same name in different schemas?

And my code will work on tables that do not have a primary key.

@olivierdalang
Copy link
Contributor Author

I merged your changes in #33, so I there's no '$table'::regclass anymore (look at the changes in https://github.com/mla/pg_sample/pull/33/files -> I'll remove unrelated whitespace changes).

Ok for the tests, will give it a shot to try to make a testcase for this feature.

Thanks for willing to accept my patches and the primer about Perl testing :-)

@olivierdalang
Copy link
Contributor Author

(continuing the discussion about the PR on the PR itself)

@mla
Copy link
Owner

mla commented Nov 29, 2021

Sorry for the delay on this @olivierdalang. Just merged into dev. Looks great to me, thank you. And thanks for adding the tests.

Will make a new release soon.

@mla mla closed this as completed Nov 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants