Preserve "arrangement" in collapse and compute #2281

Closed
JohnMount opened this issue Dec 1, 2016 · 5 comments
Labels
feature a feature request or enhancement

Comments

@JohnMount

JohnMount commented Dec 1, 2016

Landing results with dplyr::collapse or dplyr::compute appears to lose dplyr::arrange ordering on PostgreSQL (and probably on all other DB backends, including sparklyr). At the very least, the annotation that an ordering is present is lost (hence the warning messages in the example below), and it seems likely the row order itself is lost too (though that isn't obvious in this example).

Details here and pasted below.

Check durability of dplyr::arrange through dplyr::compute.

library('dplyr')
 #  
 #  Attaching package: 'dplyr'
 #  The following objects are masked from 'package:stats':
 #  
 #      filter, lag
 #  The following objects are masked from 'package:base':
 #  
 #      intersect, setdiff, setequal, union
library('RPostgreSQL')
 #  Loading required package: DBI
packageVersion('dplyr')
 #  [1] '0.5.0'
packageVersion('RPostgreSQL')
 #  [1] '0.4.1'
my_db <- dplyr::src_postgres(host = 'localhost',port = 5432,user = 'postgres',password = 'pg')
class(my_db)
 #  [1] "src_postgres" "src_sql"      "src"
set.seed(32525)
dz <- dplyr::copy_to(my_db,data.frame(x=runif(1000)),'dz99',overwrite=TRUE)

Notice below: no warnings in frame or runtime.

dz %>% arrange(x) %>% mutate(ccol=1) %>% mutate(rank=cumsum(ccol))  -> dz1
print(dz1)
 #  Source:   query [?? x 3]
 #  Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
 #  
 #               x  ccol  rank
 #           <dbl> <dbl> <dbl>
 #  1  0.002176207     1     1
 #  2  0.003543465     1     2
 #  3  0.004778773     1     3
 #  4  0.005225066     1     4
 #  5  0.005311800     1     5
 #  6  0.005833068     1     6
 #  7  0.006158232     1     7
 #  8  0.006178999     1     8
 #  9  0.006268262     1     9
 #  10 0.007748033     1    10
 #  # ... with more rows
warnings()
 #  NULL

Notice below: warning "Warning: Windowed expression 'sum("ccol")' does not have explicit order.". Result may appear the same, but we do not seem to be able to depend on that.

dz %>% arrange(x) %>% compute() %>% mutate(ccol=1) %>% mutate(rank=cumsum(ccol))  -> dz2
print(dz2)
 #  Source:   query [?? x 3]
 #  Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
 #  Warning: Windowed expression 'sum("ccol")' does not have explicit order.
 #  Please use arrange() to make determinstic.
 #               x  ccol  rank
 #           <dbl> <dbl> <dbl>
 #  1  0.002176207     1     1
 #  2  0.003543465     1     2
 #  3  0.004778773     1     3
 #  4  0.005225066     1     4
 #  5  0.005311800     1     5
 #  6  0.005833068     1     6
 #  7  0.006158232     1     7
 #  8  0.006178999     1     8
 #  9  0.006268262     1     9
 #  10 0.007748033     1    10
 #  # ... with more rows
warnings()
 #  NULL

Notice below: warning "Warning: Windowed expression 'sum("ccol")' does not have explicit order.". Result may appear the same, but we do not seem to be able to depend on that.

dz %>% arrange(x) %>% collapse() %>% mutate(ccol=1) %>% mutate(rank=cumsum(ccol))  -> dz3
print(dz3)
 #  Source:   query [?? x 3]
 #  Database: postgres 9.6.1 [postgres@localhost:5432/postgres]
 #  Warning: Windowed expression 'sum("ccol")' does not have explicit order.
 #  Please use arrange() to make determinstic.
 #               x  ccol  rank
 #           <dbl> <dbl> <dbl>
 #  1  0.002176207     1     1
 #  2  0.003543465     1     2
 #  3  0.004778773     1     3
 #  4  0.005225066     1     4
 #  5  0.005311800     1     5
 #  6  0.005833068     1     6
 #  7  0.006158232     1     7
 #  8  0.006178999     1     8
 #  9  0.006268262     1     9
 #  10 0.007748033     1    10
 #  # ... with more rows
warnings()
 #  NULL
version
 #                 _                           
 #  platform       x86_64-apple-darwin13.4.0   
 #  arch           x86_64                      
 #  os             darwin13.4.0                
 #  system         x86_64, darwin13.4.0        
 #  status                                     
 #  major          3                           
 #  minor          3.2                         
 #  year           2016                        
 #  month          10                          
 #  day            31                          
 #  svn rev        71607                       
 #  language       R                           
 #  version.string R version 3.3.2 (2016-10-31)
 #  nickname       Sincere Pumpkin Patch
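
The warning above asks for an explicit order. One way to supply it, sketched here on a local data frame rather than the PostgreSQL table from the report, is to state the order inside the window function itself, so the ranking no longer depends on an earlier arrange() being remembered. The names `d` and `ranked` are illustrative only. As far as I know, SQL backends translate `row_number(x)` to a `ROW_NUMBER() OVER (ORDER BY x)` expression, but that is an assumption this local example does not verify.

```r
library(dplyr)

# Same seed and column as the report, but kept in memory (no database needed).
set.seed(32525)
d <- data.frame(x = runif(1000))

# The ordering column is passed to the ranking function directly,
# instead of relying on a preceding arrange(x).
ranked <- d %>% mutate(rank = row_number(x))

# Show the lowest-ranked rows first.
ranked %>% arrange(rank) %>% head()
```

Because the order travels with the ranking expression, inserting compute() or collapse() before the mutate() should not change what is being asked for.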
@JohnMount JohnMount changed the title dplyr::collect and dplyr::comptue appear to lose dplyr::arrange ordering dplyr::collect and dplyr::compute appear to lose dplyr::arrange ordering Dec 1, 2016
@JohnMount JohnMount changed the title dplyr::collect and dplyr::compute appear to lose dplyr::arrange ordering dplyr:: collapse and dplyr::compute appear to lose dplyr::arrange ordering Dec 1, 2016
@JohnMount JohnMount changed the title dplyr:: collapse and dplyr::compute appear to lose dplyr::arrange ordering dplyr::collapse and dplyr::compute appear to lose dplyr::arrange ordering Dec 1, 2016
@krlmlr
Member

krlmlr commented Dec 2, 2016

@hadley: I wonder if we should preserve the order after collapse() and compute().

@krlmlr krlmlr added the database label Dec 2, 2016
@hadley
Member

hadley commented Dec 2, 2016

Hmmmm, I think that's reasonable as row order is not fixed in database tables.

@JohnMount
Author

I would not expect compute() to preserve order (as databases don't have that concept).

However, the documentation of collapse() doesn't seem to claim the result is landed as a table: "collapse doesn't force computation, but collapses a complex tbl into a form that additional restrictions can be placed on." I was reading "complex tbl" as shorthand for "complex tbl calculation."

Wouldn't you want an invariant that adding compute() or collapse() doesn't change the semantics of a calculation? The workflow I am thinking of is grouped ranking, as in:

# define our windowed operation, in this case ranking
rank_in_group <- . %>% mutate(constcol=1) %>%
          mutate(rank=cumsum(constcol)) %>% select(-constcol)

# calculate
iris %>% group_by(Species) %>% arrange(desc(Sepal.Length)) %>%
  rank_in_group 
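
The invariant can be written down as a check. The sketch below runs it on the local `iris` data frame, where compute() and collapse() simply return their input, so it only illustrates the shape of the test and says nothing about how a database backend behaves. The names `base`, `plain`, `with_compute` and `with_collapse` are made up for this example.

```r
library(dplyr)

# Windowed ranking, as defined above.
rank_in_group <- . %>% mutate(constcol = 1) %>%
  mutate(rank = cumsum(constcol)) %>% select(-constcol)

# Grouped, arranged input shared by all three pipelines.
base <- iris %>% group_by(Species) %>% arrange(desc(Sepal.Length))

plain         <- base %>% rank_in_group
with_compute  <- base %>% compute()  %>% rank_in_group
with_collapse <- base %>% collapse() %>% rank_in_group

# The desired invariant: inserting compute() or collapse() changes nothing.
identical(plain, with_compute)
identical(plain, with_collapse)
```

Pointing the same three pipelines at a remote tbl and comparing the collected results would be the corresponding check for a database backend.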

@hadley
Member

hadley commented Dec 2, 2016

I meant it's reasonable for dplyr to preserve the row order, since the database doesn't. It's just a matter of copying the ordering attribute.

@JohnMount
Author

Ah, sorry I misread that as "it is reasonable to not have row-order." Actually another thing I am clamoring for is to make the ordering attribute user visible like "Groups" is. Seeing that would really help users going forward.

@hadley hadley added the feature a feature request or enhancement label Jan 31, 2017
@hadley hadley changed the title dplyr::collapse and dplyr::compute appear to lose dplyr::arrange ordering Preserve "arrangement" in collapse and compute Feb 20, 2017
@hadley hadley closed this as completed in 19cff25 Feb 21, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018