Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for select distinct on(a, b, c) ... #12653

Open
JivanRoquet opened this issue Apr 12, 2019 · 10 comments
Open

Add support for select distinct on(a, b, c) ... #12653

JivanRoquet opened this issue Apr 12, 2019 · 10 comments

Comments

@JivanRoquet
Copy link

JivanRoquet commented Apr 12, 2019

The issue in Presto is that on one side, one can't use

select distinct on (a, b)
  c
from d

but one also cannot use:

select
  c
from d
group by a, b

Combining these two limitations together, makes deduplicating rows a relatively cumbersome process, needing resorting either to subqueries with window functions and retrieving the row number, or to array aggregations. Again, a lot of context to be carried over, a complexity which adds up exponentially as more elements get in, and much more error-prone than either of the cleaner solutions above.

Postgres implementation of select distinct on is very straightforward and even allows for custom sorting, e.g:

select distinct on (a, b)
  c
from d
order by
  e desc,
  f asc
@rongrong
Copy link
Contributor

Do you have references to SQL spec or examples on how other query engines handle this? Thanks!

@rschlussel
Copy link
Contributor

Can you explain how this is different from using arbitrary or max or max_by?
SELECT max_by(e, c) from d group by a, b

@allenexc
Copy link

Can you explain how this is different from using arbitrary or max or max_by?
SELECT max_by(e, c) from d group by a, b

SELECT DISTINCT ON in Postgres lets you select entire rows, whereas max_by/min_by returns one result. I couldn't find a way to do this in Presto without creating a CTE.

See:
https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564
https://www.postgresql.org/docs/9.5/sql-select.html#SQL-DISTINCT

@NicolasGuary
Copy link

Found a solution from https://redshift-support.matillion.com/s/article/2822021

ROW_NUMBER() OVER ( PARTITION BY <<unique columns>> ORDER BY <<sort columns>>) as counts

And then select where counts=1 only.

Hope this can help

@allenexc
Copy link

allenexc commented Apr 21, 2020 via email

@JivanRoquet
Copy link
Author

@NicolasGuary if you read my original post

[need to resort to] subqueries with window functions and retrieving the row number

Plus, I'm not sure why quote a post about Redshit to demonstrate how Presto is working.

@Israel-Kli
Copy link

Israel-Kli commented Apr 22, 2020

@JivanRoquet The solution from here work for me on Athena:

SELECT Name, MAX(Address), MAX(other field)...
FROM MyTable
GROUP BY Name

Will give you one row per Name.

https://stackoverflow.com/a/6792357/9225626

@JivanRoquet
Copy link
Author

@Kligerr that wasn't probably clear enough in my original message, but the issue with this is that you need the Name field to be included in your column selection as well. For instance, the following wouldn't work in Presto:

SELECT
  Age,
  Address,
  LastPurchaseDate
FROM MyTable
GROUP BY Name

To achieve that you would need to encapsulate your query into a wrapper like:

SELECT
  Age,
  Address,
  LastPurchaseDate
FROM (
  SELECT
    Name,
    Age,
    Address,
    LastPurchaseDate
  FROM MyTable
  GROUP BY Name
) b

Which is, again, much more cumbersome and complex than the Postgres way:

SELECT DISTINCT ON (NAME)
  Age,
  Address,
  LastPurchaseDate
FROM MyTable

It's already a problem when you write each query manually, but above all it makes writing automated queries a much more complex process.

@JivanRoquet
Copy link
Author

@rongrong

Do you have references to SQL spec or examples on how other query engines handle this? Thanks!

one example that I just stumbled upon illustrating the benefit of the distinct on syntax. if you take the following:

select
  vessel_id,
  timestamp,
  longitude,
  latitude
from (
  select
    rank() over (partition by vessel_id, ts_, lon_, lat_ order by timestamp asc) as r,
    *
  from (
    select
      vessel_id,
      timestamp,
      cast(timestamp / 7200 as integer) as ts_,
      cast(longitude * 100 as integer) as lon_,
      cast(latitude * 100 as integer) as lat_,
      longitude,
      latitude
    from position
  ) v
) q
where r = 1

instead of this, distinct on can allow rewriting the above query in a much simpler manner, avoiding one level of subquery

select
  distinct on (vessel_id, ts_, lon_, lat_)
  vessel_id,
  timestamp,
  longitude,
  latitude
from (
    select
      vessel_id,
      timestamp,
      cast(timestamp / 7200 as integer) as ts_,
      cast(longitude * 100 as integer) as lon_,
      cast(latitude * 100 as integer) as lat_,
      longitude,
      latitude
    from position
) v
order by vessel_id, ts_, lon_, lat_, timestamp desc

@ricklamps
Copy link

bump^ would love to see this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants