New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for select distinct on(a, b, c) ... #12653
Comments
Do you have references to SQL spec or examples on how other query engines handle this? Thanks! |
Can you explain how this is different from using arbitrary or max or max_by? |
SELECT DISTINCT ON in Postgres lets you select entire rows, whereas max_by/min_by returns one result. I couldn't find a way to do this in Presto without creating a CTE. See: |
Found a solution from https://redshift-support.matillion.com/s/article/2822021
And then select where counts=1 only. Hope this can help |
That's similar to what I ended up leaving it as, but I think it's still rather lengthy and inefficient compared to postgres' select distinct on for something that's used quite frequently.
On Apr 21, 2020, at 7:29 AM, Nicolas Guary <notifications@github.com> wrote:
Found a solution from https://redshift-support.matillion.com/s/article/2822021<https://urldefense.proofpoint.com/v2/url?u=https-3A__redshift-2Dsupport.matillion.com_s_article_2822021&d=DwMCaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=sequIz2q--a9nk09NOXnvw&m=FaDjpZ1AYA6BtkwFTmXFQnLlZJT4YQYzhTR5Az-su_Q&s=CWrZqCbcRt3tPL4vbIhr-qGdd8e_52maXXqmIWYT8Fc&e=>
ROW_NUMBER() OVER ( PARTITION BY <<unique columns>> ORDER BY <<sort columns>>) as counts
And then select where counts=1 only.
Hope this can help
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub<#12653 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AN25Q3Z2KUOAQ7FONQRTUTLRNWUSPANCNFSM4HFONFLQ>.
|
@NicolasGuary if you read my original post
Plus, I'm not sure why quote a post about Redshit to demonstrate how Presto is working. |
@JivanRoquet The solution from here work for me on Athena:
Will give you one row per Name. |
@Kligerr that wasn't probably clear enough in my original message, but the issue with this is that you need the
To achieve that you would need to encapsulate your query into a wrapper like:
Which is, again, much more cumbersome and complex than the Postgres way:
It's already a problem when you write each query manually, but above all it makes writing automated queries a much more complex process. |
one example that I just stumbled upon illustrating the benefit of the select
vessel_id,
timestamp,
longitude,
latitude
from (
select
rank() over (partition by vessel_id, ts_, lon_, lat_ order by timestamp asc) as r,
*
from (
select
vessel_id,
timestamp,
cast(timestamp / 7200 as integer) as ts_,
cast(longitude * 100 as integer) as lon_,
cast(latitude * 100 as integer) as lat_,
longitude,
latitude
from position
) v
) q
where r = 1 instead of this, select
distinct on (vessel_id, ts_, lon_, lat_)
vessel_id,
timestamp,
longitude,
latitude
from (
select
vessel_id,
timestamp,
cast(timestamp / 7200 as integer) as ts_,
cast(longitude * 100 as integer) as lon_,
cast(latitude * 100 as integer) as lat_,
longitude,
latitude
from position
) v
order by vessel_id, ts_, lon_, lat_, timestamp desc |
bump^ would love to see this |
The issue in Presto is that on one side, one can't use
but one also cannot use:
Combining these two limitations together, makes deduplicating rows a relatively cumbersome process, needing resorting either to subqueries with window functions and retrieving the row number, or to array aggregations. Again, a lot of context to be carried over, a complexity which adds up exponentially as more elements get in, and much more error-prone than either of the cleaner solutions above.
Postgres implementation of
select distinct on
is very straightforward and even allows for custom sorting, e.g:The text was updated successfully, but these errors were encountered: