Add support for select distinct on(a, b, c) ... #12653

JivanRoquet · 2019-04-12T08:30:44Z

The issue in Presto is that on one side, one can't use

select distinct on (a, b)
  c
from d

but one also cannot use:

select
  c
from d
group by a, b

Combining these two limitations together, makes deduplicating rows a relatively cumbersome process, needing resorting either to subqueries with window functions and retrieving the row number, or to array aggregations. Again, a lot of context to be carried over, a complexity which adds up exponentially as more elements get in, and much more error-prone than either of the cleaner solutions above.

Postgres implementation of select distinct on is very straightforward and even allows for custom sorting, e.g:

select distinct on (a, b)
  c
from d
order by
  e desc,
  f asc

The text was updated successfully, but these errors were encountered:

rongrong · 2019-04-12T21:38:42Z

Do you have references to SQL spec or examples on how other query engines handle this? Thanks!

rschlussel · 2019-04-15T17:00:04Z

Can you explain how this is different from using arbitrary or max or max_by?
SELECT max_by(e, c) from d group by a, b

allenexc · 2019-11-22T00:03:20Z

Can you explain how this is different from using arbitrary or max or max_by?
SELECT max_by(e, c) from d group by a, b

SELECT DISTINCT ON in Postgres lets you select entire rows, whereas max_by/min_by returns one result. I couldn't find a way to do this in Presto without creating a CTE.

See:
https://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group/7630564#7630564
https://www.postgresql.org/docs/9.5/sql-select.html#SQL-DISTINCT

NicolasGuary · 2020-04-21T14:28:38Z

Found a solution from https://redshift-support.matillion.com/s/article/2822021

ROW_NUMBER() OVER ( PARTITION BY <<unique columns>> ORDER BY <<sort columns>>) as counts

And then select where counts=1 only.

Hope this can help

allenexc · 2020-04-21T19:17:17Z

That's similar to what I ended up leaving it as, but I think it's still rather lengthy and inefficient compared to postgres' select distinct on for something that's used quite frequently. On Apr 21, 2020, at 7:29 AM, Nicolas Guary <notifications@github.com> wrote: Found a solution from https://redshift-support.matillion.com/s/article/2822021<https://urldefense.proofpoint.com/v2/url?u=https-3A__redshift-2Dsupport.matillion.com_s_article_2822021&d=DwMCaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=sequIz2q--a9nk09NOXnvw&m=FaDjpZ1AYA6BtkwFTmXFQnLlZJT4YQYzhTR5Az-su_Q&s=CWrZqCbcRt3tPL4vbIhr-qGdd8e_52maXXqmIWYT8Fc&e=> ROW_NUMBER() OVER ( PARTITION BY <<unique columns>> ORDER BY <<sort columns>>) as counts And then select where counts=1 only. Hope this can help — You are receiving this because you commented. Reply to this email directly, view it on GitHub<#12653 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AN25Q3Z2KUOAQ7FONQRTUTLRNWUSPANCNFSM4HFONFLQ>.

JivanRoquet · 2020-04-22T08:28:22Z

@NicolasGuary if you read my original post

[need to resort to] subqueries with window functions and retrieving the row number

Plus, I'm not sure why quote a post about Redshit to demonstrate how Presto is working.

Israel-Kli · 2020-04-22T13:02:01Z

@JivanRoquet The solution from here work for me on Athena:

SELECT Name, MAX(Address), MAX(other field)...
FROM MyTable
GROUP BY Name

Will give you one row per Name.

https://stackoverflow.com/a/6792357/9225626

JivanRoquet · 2020-04-22T14:04:02Z

@Kligerr that wasn't probably clear enough in my original message, but the issue with this is that you need the Name field to be included in your column selection as well. For instance, the following wouldn't work in Presto:

SELECT
  Age,
  Address,
  LastPurchaseDate
FROM MyTable
GROUP BY Name

To achieve that you would need to encapsulate your query into a wrapper like:

SELECT
  Age,
  Address,
  LastPurchaseDate
FROM (
  SELECT
    Name,
    Age,
    Address,
    LastPurchaseDate
  FROM MyTable
  GROUP BY Name
) b

Which is, again, much more cumbersome and complex than the Postgres way:

SELECT DISTINCT ON (NAME)
  Age,
  Address,
  LastPurchaseDate
FROM MyTable

It's already a problem when you write each query manually, but above all it makes writing automated queries a much more complex process.

JivanRoquet · 2020-11-21T11:41:07Z

@rongrong

Do you have references to SQL spec or examples on how other query engines handle this? Thanks!

one example that I just stumbled upon illustrating the benefit of the distinct on syntax. if you take the following:

select
  vessel_id,
  timestamp,
  longitude,
  latitude
from (
  select
    rank() over (partition by vessel_id, ts_, lon_, lat_ order by timestamp asc) as r,
    *
  from (
    select
      vessel_id,
      timestamp,
      cast(timestamp / 7200 as integer) as ts_,
      cast(longitude * 100 as integer) as lon_,
      cast(latitude * 100 as integer) as lat_,
      longitude,
      latitude
    from position
  ) v
) q
where r = 1

instead of this, distinct on can allow rewriting the above query in a much simpler manner, avoiding one level of subquery

select
  distinct on (vessel_id, ts_, lon_, lat_)
  vessel_id,
  timestamp,
  longitude,
  latitude
from (
    select
      vessel_id,
      timestamp,
      cast(timestamp / 7200 as integer) as ts_,
      cast(longitude * 100 as integer) as lon_,
      cast(latitude * 100 as integer) as lat_,
      longitude,
      latitude
    from position
) v
order by vessel_id, ts_, lon_, lat_, timestamp desc

ricklamps · 2023-10-03T13:04:32Z

bump^ would love to see this

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for select distinct on(a, b, c) ... #12653

Add support for select distinct on(a, b, c) ... #12653

JivanRoquet commented Apr 12, 2019 •

edited

rongrong commented Apr 12, 2019

rschlussel commented Apr 15, 2019

allenexc commented Nov 22, 2019

NicolasGuary commented Apr 21, 2020

allenexc commented Apr 21, 2020 via email

JivanRoquet commented Apr 22, 2020

Israel-Kli commented Apr 22, 2020 •

edited

JivanRoquet commented Apr 22, 2020

JivanRoquet commented Nov 21, 2020

ricklamps commented Oct 3, 2023

Add support for select distinct on(a, b, c) ... #12653

Add support for select distinct on(a, b, c) ... #12653

Comments

JivanRoquet commented Apr 12, 2019 • edited

rongrong commented Apr 12, 2019

rschlussel commented Apr 15, 2019

allenexc commented Nov 22, 2019

NicolasGuary commented Apr 21, 2020

allenexc commented Apr 21, 2020 via email

JivanRoquet commented Apr 22, 2020

Israel-Kli commented Apr 22, 2020 • edited

JivanRoquet commented Apr 22, 2020

JivanRoquet commented Nov 21, 2020

ricklamps commented Oct 3, 2023

JivanRoquet commented Apr 12, 2019 •

edited

Israel-Kli commented Apr 22, 2020 •

edited