Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DISTINCT does not work with ORDER BY in some queries #1287

Merged
merged 5 commits into from Jul 13, 2018

Conversation

Projects
None yet
3 participants
@katzyn
Copy link
Contributor

katzyn commented Jul 13, 2018

A fix for issue #408.

Only plain selects are fixed, UNION that was also discussed somewhere in comments is not touched here.

I don't know what modern SQL standards say about queries like SELECT DISTINCT A ORDER BY B, but at least MySQL / MariaDB accept them. Also we need to support queries like SELECT DISTINCT A ORDER BY function(A), but our implementation in not really suitable to distinguish between valid expressions based on selected values from other unrelated expressions. It's actually simpler to allow anything, like MySQL.

Also such support for unrelated columns was requested too in the comments.

@grandinj

This comment has been minimized.

Copy link
Contributor

grandinj commented Jul 13, 2018

@katzyn

This comment has been minimized.

Copy link
Contributor Author

katzyn commented Jul 13, 2018

PostgreSQL 10, for example, rejects all such queries. Even the SELECT DISTINCT A FROM TEST ORDER BY some_expression(A).

@lukaseder

This comment has been minimized.

Copy link
Contributor

lukaseder commented Jul 13, 2018

I don't know what modern SQL standards say about queries like SELECT DISTINCT A ORDER BY B, but at least MySQL / MariaDB accept them

MySQL is not good guidance for reasonable SQL functionality :) I wouldn't be surprised if MySQL just ordered rows randomly if the ORDER BY clause doesn't really make sense. Perhaps, they do something like PostgreSQL's DISTINCT ON behind the scenes, though.

The SQL standard is quite wordy about this part. I'm looking at ISO/IEC 9075-2:2016(E), 7.17 <query expression>, Syntax Rules 28) d) i) 9) B) II), which says

QS shall not specify the <set quantifier> DISTINCT or directly contain one or
more s.

Essentially, DISTINCT is forbidden if a column is referenced from ORDER BY that is not part of the projection. However, 28) d) i) 9) B) III) allows for ordering by expressions that are not part of the projection:

The columns Cj are said to be extended sort key columns.

Which makes total sense when you think about it. Assuming this data set:

A   B
-----
2   1
1   2
1   3
2   4

Now consider:

-- It is easy to see how the ordering is defined on this query. It will be 2, 1, 1, 2
SELECT A ORDER BY B

-- It is impossible to see how the ordering is defined on this query. Is it 1, 2 or 2, 1?
SELECT DISTINCT A ORDER BY B

I would definitely go with the SQL standard / PostgreSQL here. Optionally, support PostgreSQL's DISTINCT ON

@katzyn

This comment has been minimized.

Copy link
Contributor Author

katzyn commented Jul 13, 2018

@lukaseder
Is this query valid or not?

SELECT DISTINCT A FROM TEST ORDER BY LOWER(A)
@lukaseder

This comment has been minimized.

Copy link
Contributor

lukaseder commented Jul 13, 2018

Yes, to my understanding, LOWER(A) is also covered by 28) d) i) 9) B) III), which allows for extended sort key columns. Intuitively, it does make sense, right? While the ordering is not necessarily deterministic (it does not have to be), it is clearly defined.

While the SQL standard is extremely hard to read, I think this article I wrote some time ago might explain it better:
https://blog.jooq.org/2016/12/09/a-beginners-guide-to-the-true-order-of-sql-operations/

logically, the order of operations in SELECT DISTINCT A FROM TEST ORDER BY LOWER(A) is:

  • FROM TEST
  • SELECT A
  • DISTINCT
  • ORDER BY LOWER(A)

When looking at SELECT A FROM TEST ORDER BY B it is:

  • FROM TEST
  • SELECT A, B -- Add B to the extended sort key columns projection
  • ORDER BY B
  • SELECT A -- Remove B again from the projection

This shows that SELECT DISTINCT A FROM TEST ORDER BY B is illegal, because adding B to that extended sort key columns projection would alter the semantics of the DISTINCT operation.

Make sense?

@katzyn

This comment has been minimized.

Copy link
Contributor Author

katzyn commented Jul 13, 2018

But without this change such valid queries were rejected too.

It's not easy to check whether expression use only valid columns due to internal implementation of such selects in H2.

H2 builds complete list of expressions. Some of them are expressions for select (they may be treated as distinct or not), others are used only for sorting (but not in distinct operations). Expressions are not linked to each other and intermediate result contains evaluated values in its rows.

I guess that

SELECT UPPER(A) FROM TEST ORDER BY LOWER(A)

should be accepted too?

@grandinj

This comment has been minimized.

Copy link
Contributor

grandinj commented Jul 13, 2018

I'm ok with us accepting "not really legal" stuff if it means we accept more legal stuff

If someone executes a "not really legal" query and the results change in the next version, that is on them

@lukaseder

This comment has been minimized.

Copy link
Contributor

lukaseder commented Jul 13, 2018

SELECT UPPER(A) FROM TEST ORDER BY LOWER(A)

Should be accepted, because without DISTINCT, LOWER(A) can easily be added to the extended sort key columns

I'm ok with us accepting "not really legal" stuff if it means we accept more legal stuff

If someone executes a "not really legal" query and the results change in the next version, that is on them

Heh. That's what the MySQL folks probably thought as well. Now everyone criticises them for that, and yet, they cannot get rid of it easily because no one can turn on strict mode on a legacy database :-)

@katzyn

This comment has been minimized.

Copy link
Contributor Author

katzyn commented Jul 13, 2018

And what about such weird query? Is it valid or not?

SELECT DISTINCT (A + B), C FROM TEST ORDER BY (B + C)

I'm trying to understand how exactly queries should be checked for correctness.

@katzyn

This comment has been minimized.

Copy link
Contributor Author

katzyn commented Jul 13, 2018

Should be accepted, because without DISTINCT, LOWER(A) can easily be added to the extended sort key columns

Sorry, I mean

SELECT DISTINCT UPPER(A) FROM TEST ORDER BY LOWER(A)
@katzyn

This comment has been minimized.

Copy link
Contributor Author

katzyn commented Jul 13, 2018

Actually we can enable such unrestricted queries only in MySQL mode. But I want to fix problem with valid queries that are not currently accepted in regular mode too. (They also are not accepted by PostrgreSQL, but accepted by Oracle, MySQL and may be by some others).

@lukaseder

This comment has been minimized.

Copy link
Contributor

lukaseder commented Jul 13, 2018

SELECT DISTINCT (A + B), C FROM TEST ORDER BY (B + C)

No, this shouldn't be valid. The rule is always the same. The expression (B + C) from the ORDER BY clause is not in the select list. Hence it is an extended sort key column. Hence, DISTINCT is not allowed.

SELECT DISTINCT UPPER(A) FROM TEST ORDER BY LOWER(A)

Not allowed

@katzyn

This comment has been minimized.

Copy link
Contributor Author

katzyn commented Jul 13, 2018

Did you mean that in

SELECT DISTINCT A FROM TEST ORDER BY LOWER(A)

LOWER(A) is allowed if A is in select list, but (A + B) is not allowed even both A and B are used somehow in select list?

Or query like

SELECT DISTINCT A, B FROM TEST ORDER BY (A + B)

should be valid and query

SELECT DISTINCT (A - B) FROM TEST ORDER BY (A + B)

should not be valid?

@lukaseder

This comment has been minimized.

Copy link
Contributor

lukaseder commented Jul 13, 2018

Let's look again at:

SELECT DISTINCT (A + B), C FROM TEST ORDER BY (B + C)

Try it with example data if formalism doesn't help :)

A  B  C
-------
1  4  1
2  5  1
5  2  1
4  1  1

What would be the expected result of this query? Clearly, DISTINCT yields 2 rows:

(A + B)  C
----------
5        1
7        1

Now, at this point, what does ORDER BY (B + C) even mean? Which row should be first? It's impossible to know.

SELECT DISTINCT A, B FROM TEST ORDER BY (A + B)

Yes that works, because the expression in ORDER BY depends only on expressions in the select list, so no extended sort key columns are needed, and thus DISTINCT is allowed. You can try this with any type of data, it's always very clear that it will work

SELECT DISTINCT (A - B) FROM TEST ORDER BY (A + B)

No. Try again with data:

A  B
-----
3  1
-1 -3
3  0

DISTINCT yields 2 rows:

(A - B)
-------
2
3

How would you want to order these 2 rows by (A + B)?

@katzyn

This comment has been minimized.

Copy link
Contributor Author

katzyn commented Jul 13, 2018

Thank you for detailed explanation!

To perform correct check for ORDER BY expression that was not matched with some select expression or its alias (that is already checked with a large piece of code) a additional complicated check should be performed. Expression may use deterministic functions (like mentioned LOWER(), arithmetic or logic operators and their arguments should all be specified in select list or be constants. This is not too hard, I'll try to write such check for regular mode.

For MySQL mode such additional check should be skipped.

@grandinj

This comment has been minimized.

Copy link
Contributor

grandinj commented on h2/src/main/org/h2/expression/Operation.java in e647b73 Jul 13, 2018

rather use getLHS and getRHS, otherwise very hard to read the caller code

This comment has been minimized.

Copy link
Contributor Author

katzyn replied Jul 13, 2018

This method was copied from ConditionAndOr as is. So it should be changed too.

I did not understand what you suggest. Add a couple of methods instead of shared one or what?

This comment has been minimized.

Copy link
Contributor

grandinj replied Jul 13, 2018

yes, use two methods rather than one is what I meant

@grandinj

This comment has been minimized.

Copy link
Contributor

grandinj commented on h2/src/main/org/h2/expression/ConditionNot.java in e647b73 Jul 13, 2018

rather call this getSubCondition or getSubExpression

@grandinj

This comment has been minimized.

Copy link
Contributor

grandinj commented on h2/src/main/org/h2/command/dml/Query.java in e647b73 Jul 13, 2018

a comment as to the purpose of this method would be nice

@katzyn katzyn merged commit 1e46200 into h2database:master Jul 13, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@katzyn katzyn deleted the katzyn:distinct branch Jul 13, 2018

@vetemi vetemi referenced this pull request Dec 28, 2018

Closed

Duplicates when sorting #98

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.