Big performance regression in subsetting in v0.11 #251

Closed
TomAndrews opened this Issue Jul 18, 2018 · 7 comments

Contributor

TomAndrews commented Jul 18, 2018

Description

There seems to be a big performance regression when subsetting an xts object using a vector of dates. In the example below, the time to perform the subsetting goes from about 41 milliseconds to about 17 seconds.

Using v0.10-2:

library(xts)
library(microbenchmark)
ts <- .xts(1:1e5, 1:1e5)
dates <- .POSIXct(1:1e5)
microbenchmark(ts[dates])

 Unit: milliseconds
       expr      min       lq     mean   median       uq      max neval
  ts[dates] 40.54739 40.67655 41.37819 40.96016 41.41218 45.32007   100

Session Info

R version 3.5.1 (2018-07-02)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Debian GNU/Linux 9 (stretch)

 Matrix products: default
 BLAS: /usr/lib/libblas/libblas.so.3.7.0
 LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

 locale:
  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 other attached packages:
 [1] xts_0.10-2           zoo_1.8-3            microbenchmark_1.4-4

 loaded via a namespace (and not attached):
 [1] compiler_3.5.1  tools_3.5.1     grid_3.5.1      lattice_0.20-35

Using v0.11:

library(xts)
library(microbenchmark)
ts <- .xts(1:1e5, 1:1e5)
dates <- .POSIXct(1:1e5)
microbenchmark(ts[dates], times=1)  # fewer iterations, otherwise it takes too long

Unit: seconds
       expr      min       lq     mean   median       uq      max neval
  ts[dates] 17.16381 17.16381 17.16381 17.16381 17.16381 17.16381     1

Session Info

R version 3.5.1 (2018-07-02)
 Platform: x86_64-pc-linux-gnu (64-bit)
 Running under: Debian GNU/Linux 9 (stretch)

 Matrix products: default
 BLAS: /usr/lib/libblas/libblas.so.3.7.0
 LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

 locale:
  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8
  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8
  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base

 other attached packages:
 [1] microbenchmark_1.4-4 xts_0.11-0           zoo_1.8-3

 loaded via a namespace (and not attached):
 [1] compiler_3.5.1  tools_3.5.1     grid_3.5.1      lattice_0.20-35

Obviously this is a bit of a stupid example since I'm just fetching all the dates in ts. In my actual problem, dates is arbitrary but large. Should I be re-writing the subsetting in a different way?

Contributor

TomAndrews commented Jul 18, 2018

Thinking about it some more, it looks like I should be using merge.xts:

library(xts)
library(microbenchmark)
ts <- .xts(1:1e5, 1:1e5)
dates <- .POSIXct(1:1e5)
microbenchmark(ts[dates], merge.xts(ts, xts(, dates), join="right", retside=c(TRUE,FALSE)), times=1)  # fewer iterations, otherwise it takes too long

 Unit: milliseconds
                                                                        expr
                                                                   ts[dates]
  merge.xts(ts, xts(, dates), join = "right", retside = c(TRUE,      FALSE))
           min           lq         mean       median           uq          max
  18403.239120 18403.239120 18403.239120 18403.239120 18403.239120 18403.239120
      2.835482     2.835482     2.835482     2.835482     2.835482     2.835482
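A minimal sketch (my own, not from the thread) checking that the merge-based workaround returns the same data as direct subsetting — smaller sizes than the benchmark above so it runs quickly even on the slow version, and `ts1` rather than `ts` simply to avoid masking `stats::ts`:

```r
library(xts)

# Smaller inputs than the benchmark above, so this also finishes
# quickly under the slow v0.11 code path.
ts1   <- .xts(1:1000, 1:1000)
dates <- .POSIXct(1:1000)

# Subset via a right join against an index-only (zero-column) xts,
# keeping only the columns of 'ts1'.
sub_via_merge <- merge.xts(ts1, xts(, dates), join = "right",
                           retside = c(TRUE, FALSE))

# Same data as direct subsetting (no duplicate times in this example).
identical(as.vector(coredata(sub_via_merge)),
          as.vector(coredata(ts1[dates])))
```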
Owner

joshuaulrich commented Jul 19, 2018

A quick profiling session indicates that the bottleneck is in window_idx(). Specifically, the call to c() in the for and repeat loop inside the block below. @corwinjoy, do you remember if you wrote test cases for the situation handled in this code block? I quickly looked at the tests in runit.xts.methods.R, but didn't see anything that mentioned testing duplicate index values.

  if(usr_idx && !is.null(firstlast)) {
    # Translate from user .index to xts index
    # We get back upper bound of index as per findInterval
    tmp <- base_idx[firstlast]
    
    # Iterate in reverse to grab all matches
    # We have to do this to handle duplicate dates in the xts index.
    tmp <- rev(tmp)
    res <- NULL
    for(i in tmp) {
      dt <- idx[i]
      j <- i
      repeat {
        res <- c(res, j)
        j <- j - 1
        if(j < 1 || idx[j] != dt) break
      }
    }
    firstlast <- rev(res)
  }

I'd like a solution that is more performant with duplicate values, but a quick fix for the non-duplicate case would be to add an anyDuplicated() check.
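The quick fix could look roughly like this — a hedged sketch of the idea, not the actual xts patch; `expand_matches` is a hypothetical stand-in for the code block above:

```r
# Sketch of the proposed anyDuplicated() fast path. 'idx' is the
# (sorted) xts index, 'base_idx' maps positions back to the series,
# and 'firstlast' holds the findInterval() upper bounds.
expand_matches <- function(idx, base_idx, firstlast) {
  tmp <- base_idx[firstlast]
  if (anyDuplicated(idx) == 0L) {
    return(tmp)  # fast path: no duplicates, so no loop needed
  }
  # Slow path: walk backwards from each upper bound to collect all
  # entries sharing the same timestamp (duplicates in 'idx').
  res <- NULL
  for (i in rev(tmp)) {
    dt <- idx[i]
    j <- i
    repeat {
      res <- c(res, j)
      j <- j - 1L
      if (j < 1L || idx[j] != dt) break
    }
  }
  rev(res)
}
```

When the index has no duplicates, each `findInterval()` upper bound is already the unique match, so the quadratic `c()` growth is skipped entirely.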

Contributor

corwinjoy commented Jul 20, 2018

Let's see:
I wrote a test for duplicated dates in the xts series. This is on line 190 of runit.xts.methods.R:

# Test index parameter with repeated dates in xts series
  idx <- sort(rep(1:5, 5))
  x <- xts(1:length(idx), as.Date("1999-12-31")+idx)
  bin <- window(x, index = as.Date("1999-12-31")+c(1,3,5))

So here bin gives the duplicated dates in the xts series as expected.
Looking further I don't see a test for what happens when the index is duplicated.
So, the test might be something like:
bin <- window(x, index = as.Date("1999-12-31")+c(1,1,2))
We should add an explicit test for validation.

Now we did have a discussion about this because there are two aspects:

  1. Are duplicate values allowed? We said yes, in keeping with what R does.
  2. But, the returned values are sorted by date, unlike normal subsetting.
    (See #100 (comment))

Now, back to the performance of this routine. I agree it is slow. It is trying to do a fast subset of the dates, but making it work with duplicated dates in the xts index slows everything down.
Options:

  1. If we don't have duplicated dates, the above routine can be fast. It just needs to return base_idx[firstlast]. No iteration needed. So maybe a flag indicating whether the xts data has duplicated dates?
  2. I tried to solve this using just R functions and findInterval(). At the time I thought it was cool, but really it is probably too complicated. Maybe there is a better way to do this in R?
  3. A better approach is probably just to do this in C. It is not quite a standard set intersection algorithm, because we can have duplicated dates in the index data. But it is quite close: just choose all the data from the xts series where we match. Here is the standard linear-time algorithm: https://www.geeksforgeeks.org/union-and-intersection-of-two-sorted-arrays-2/
    I don't think we need the O(sqrt(n)) time algorithm like Baeza-Yates unless folks are doing just huge date sets.

Looking further, you could say that 3) is already implemented by merge.xts(). However, despite the design discussion, it does not correctly handle joins with duplicates. In a normal inner join you can have duplicates and this is OK. (https://stackoverflow.com/questions/40218138/will-inner-join-allow-duplicates). But when I look at merge.xts I see:

x <- xts(4:10, Sys.Date()+c(4,4,4, 7:10))
y <- xts(1:6, Sys.Date()+1:6)
merge(x,y, join='inner')
#            x y
# 2018-07-23 4 4

Instead, the inner join should return 3 records matching the values in x. (Here is a more correct version of a sort-merge-join that allows for duplicate values: http://www.dcs.ed.ac.uk/home/tz/phd/thesis/node20.htm).
So, probably the best solution here is to fix the logic in merge.xts() and then just call it to do an inner join once we have found the range bounded by start and end date.
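The linear-time merge step described in option 3 can be sketched in R (illustrative only — the real fix would be in C, and `sorted_match_positions` is a name I've made up): walk both sorted indexes with two cursors, and on a match advance only the x cursor so that duplicate x timestamps are all kept.

```r
# Return the positions in 'x_idx' whose timestamp appears in 'y_idx'.
# Both inputs are assumed sorted; duplicates in 'x_idx' are all kept,
# each matched row of x appearing exactly once in the result.
sorted_match_positions <- function(x_idx, y_idx) {
  i <- 1L; j <- 1L
  nx <- length(x_idx); ny <- length(y_idx)
  out <- integer(0)
  while (i <= nx && j <= ny) {
    if (x_idx[i] < y_idx[j]) {
      i <- i + 1L
    } else if (x_idx[i] > y_idx[j]) {
      j <- j + 1L
    } else {
      out <- c(out, i)  # match: keep this row of x
      i <- i + 1L       # advance x only, so duplicate x timestamps
                        # matching the same y value are all collected
    }
  }
  out
}
```

On the `merge(x, y, join='inner')` example above (x offsets c(4,4,4,7:10), y offsets 1:6), this yields positions 1:3 — the three duplicated rows of x that the inner join should return.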

Contributor

TomAndrews commented Jul 20, 2018

I'm not at all familiar with the workings of xts subsetting, so I may have misunderstood what you are discussing, in which case I apologise.

I just wanted to point out that in the example above there are no duplicates in the index of ts nor in dates. It seems to be the sheer number of entries in dates that is causing the slowdown.

Owner

joshuaulrich commented Jul 21, 2018

Thanks for your detailed comments, @corwinjoy!

It looks like window.xts() currently behaves differently from window.zoo() when the index. argument contains duplicate values. window.zoo() seems to ignore the duplicates, while window.xts() returns all the matching .index() values multiple times. For example:

xidx <- as.Date("1999-12-31") + rep(1:3, each = 2)
# duplicate values in xts index, and duplicate values in window index.
x <- xts(seq_along(xidx), xidx)
widx <- as.Date("1999-12-31") + c(1, 1)
window(x, widx)
#            [,1]
# 2000-01-01    1
# 2000-01-01    2
# 2000-01-01    1
# 2000-01-01    2
zoo:::window.zoo(x, widx)
#            [,1]
# 2000-01-01    1
# 2000-01-01    2

I lean more toward consistency with zoo, but I do realize that this is a bit of an edge case.

I agree with your assessment of the result of inner joins performed by merge.xts(). I created issue #106 to document it, but have not attempted a fix. I plan to refactor the merge C code and address the issue afterward.

@TomAndrews the slowdown is caused by re-allocating the res vector inside the for and repeat loops. You don't need to worry about the details of xts subsetting. The benchmark tests you provided are helpful by themselves.
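For what it's worth, a user-level way to get the zoo-consistent result today (a sketch, not a proposed fix) is simply to de-duplicate the query index before calling window():

```r
library(xts)

xidx <- as.Date("1999-12-31") + rep(1:3, each = 2)
x <- xts(seq_along(xidx), xidx)
widx <- as.Date("1999-12-31") + c(1, 1)

# Dropping duplicate query times up front reproduces the zoo result:
# each matching row of 'x' is returned exactly once.
window(x, index = unique(widx))
```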

Contributor

corwinjoy commented Jul 23, 2018

joshuaulrich added a commit that referenced this issue Jul 30, 2018

Handle duplicate index values in C for window_idx
This basically moves the R code into C, with minor changes. The C code
pre-(and over-)allocates the result and loops over the locations in
reverse order. Corwin suggested better long-term solutions (e.g. fix
merge.xts() to handle duplicate index values correctly), but this
commit addresses the immediate performance regression.

Also add more benchmarks.

See #251.
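In R terms, the commit's strategy amounts to something like the following sketch (the real code is C; `fill_window_dups_rev_sketch` and the `rle()`-based size bound are illustrative, not the actual implementation):

```r
# Sketch of the commit's approach: allocate the result once (using a
# safe upper bound on its size), fill it in reverse, then trim --
# avoiding the repeated c() re-allocation that caused the slowdown.
# 'idx' is the sorted index; 'locs' holds the matched upper bounds.
fill_window_dups_rev_sketch <- function(idx, locs) {
  # Upper bound: each location expands to at most the longest run of
  # duplicates in the (sorted) index.
  max_run <- if (length(idx)) max(rle(idx)$lengths) else 0L
  res <- integer(length(locs) * max_run)  # over-allocate once
  k <- 0L
  for (i in rev(locs)) {
    dt <- idx[i]
    j <- i
    repeat {
      k <- k + 1L
      res[k] <- j
      j <- j - 1L
      if (j < 1L || idx[j] != dt) break
    }
  }
  rev(res[seq_len(k)])  # trim the over-allocation and restore order
}
```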

@joshuaulrich joshuaulrich self-assigned this Jul 30, 2018

@joshuaulrich joshuaulrich added the bug label Jul 30, 2018

Owner

joshuaulrich commented Jul 30, 2018

@TomAndrews, thanks again for the report! My timings suggest this solution is 2-3x faster than 0.10-2. Hopefully you see similar results.

@joshuaulrich joshuaulrich added this to the Release 0.11-1 milestone Jul 30, 2018

joshuaulrich added a commit that referenced this issue Aug 5, 2018

Restore support for both index types in window()
Moving the duplicate index value handling to C created an error when
the index type was integer. This was not caught in the unit tests.
Add a loop around the window/subset unit tests that run them on both
index types (double and integer). Add support for both index types to
the fill_window_dups_rev() C function.

Change the scalar '1' to '1L' in the pmax() call after findInterval()
to ensure that 'base_index' is not coerced to double from integer.

See #251.