Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

na.fill long call times #259

Open
ckatsulis opened this issue Jul 25, 2018 · 8 comments
Open

na.fill long call times #259

ckatsulis opened this issue Jul 25, 2018 · 8 comments

Comments

@ckatsulis
Copy link

@ckatsulis ckatsulis commented Jul 25, 2018

na.fill() takes much longer than other methods to fill NA values. I expected near constant time to find and fill NA values.

Reproducible timings are below. I will send testData via email independently as it is too large for this forum.

testData[is.na(testData)] = 0
Unit: milliseconds
                     expr      min       lq     mean   median       uq      max neval
testData[is.na(testData)] 26.52425 35.31587 37.19453 37.23217 38.80555 46.61541    10

> microbenchmark::microbenchmark(na.fill(testData,0),times = 1)
Unit: seconds
                 expr      min       lq     mean   median       uq      max neval
 na.fill(testData, 0) 150.9054 150.9054 150.9054 150.9054 150.9054 150.9054     1

Unit: seconds
                           expr      min       lq     mean   median       uq      max neval
 na.fill(coredata(testData), 0) 52.90975 52.90975 52.90975 52.90975 52.90975 52.90975     1

Session Info

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8         
 [8] LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] roll_1.0.7                 PerformanceAnalytics_1.5.2 xts_0.10-2.2               zoo_1.8-2                  TTR_0.23-3                 dygraphs_1.1.1.5          

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17         magrittr_1.5         xtable_1.8-2         lattice_0.20-35      R6_2.2.2             quadprog_1.5-5       xlsx_0.6.1           tools_3.5.1          grid_3.5.1           data.table_1.11.4    qmao_1.6.11         
[12] R.oo_1.22.0          htmltools_0.3.6      yaml_2.1.19          RcppParallel_4.4.0   digest_0.6.15        RJSONIO_1.3-0        shiny_1.0.5          rJava_0.9-10         later_0.7.3          microbenchmark_1.4-4 promises_1.0.1      
[23] bitops_1.0-6         htmlwidgets_0.9      R.utils_2.6.0        xlsxjars_0.6.1       RCurl_1.95-4.10      curl_3.2             mime_0.5             pander_0.6.2         compiler_3.5.1       R.methodsS3_1.7.1    jsonlite_1.5        
[34] httpuv_1.4.4.2      
@joshuaulrich
Copy link
Owner

@joshuaulrich joshuaulrich commented Jul 25, 2018

Note that there is no na.fill.xts() method, so zoo:::na.fill.zoo() is dispatched. We can probably add an xts method, and only dispatch to the zoo method if fill is not a scalar.

@ckatsulis
Copy link
Author

@ckatsulis ckatsulis commented Jul 25, 2018

@joshuaulrich
Copy link
Owner

@joshuaulrich joshuaulrich commented Aug 11, 2018

@ckatsulis it looks like a lot of the time is spent in creating symbol names from the ... arguments passed to merge.xts(). This is related to #248.

joshuaulrich added a commit that referenced this issue Nov 3, 2018
The column names of a merge() result can be monstrous for raw, unnamed
numeric objects. Something similar to dput() for the entire column is
used, but zoo deparses the dput() output and limits the deparse length
to ~60 characters.

Use a modified version of makeNames() (defined in merge.zoo()) to make
the column names. Major differences are:
  - Use substitute(alist(...)) to avoid evaluating '...'
  - Use deparse(x, nlines = 1L) to deparse only the first expression

NOTE: this may break existing code that relies on behavior for integer
objects.

For example, current behavior is:

  x <- .xts(1:5, 1:5)
  lx <- list(x, x)
  do.call(merge, lx)
                      X1.5 X1.5.1
  1969-12-31 18:00:01    1      1
  1969-12-31 18:00:02    2      2
  1969-12-31 18:00:03    3      3
  1969-12-31 18:00:04    4      4
  1969-12-31 18:00:05    5      5

New behavior has column names:

  structure.1.5...Dim...c.5L..1L...index...structure.1.5..tclass...c..POSIXct...
  structure.1.5...Dim...c.5L..1L...index...structure.1.5..tclass...c..          POSIXct....1

Fixes #248. See #259.
@joshuaulrich
Copy link
Owner

@joshuaulrich joshuaulrich commented Nov 3, 2018

Hi @ckatsulis, the fix for #248 improves performance for this case fairly dramatically. Though it's still not nearly as fast as x[is.na(x)] <- 0. Is the new performance sufficient to close this issue?

x <- .xts(1:1e7, 1:1e7)
is.na(x) <- sample(1e7, 1e6) 
microbenchmark::microbenchmark(na.fill(x, 0), times = 1) 
# Unit: seconds
#           expr      min       lq     mean   median       uq      max neval
#  na.fill(x, 0) 1.944266 1.944266 1.944266 1.944266 1.944266 1.944266     1
y <- coredata(x)
microbenchmark::microbenchmark(na.fill(y, 0), times = 1) 
# Unit: seconds
#           expr      min       lq     mean   median       uq      max neval
#  na.fill(y, 0) 6.566733 6.566733 6.566733 6.566733 6.566733 6.566733     1
microbenchmark::microbenchmark(X[is.na(X)] <- 0, times = 1, setup = X <- x)
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval
#  X[is.na(X)] <- 0 176.9659 176.9659 176.9659 176.9659 176.9659 176.9659     1
@harvey131
Copy link

@harvey131 harvey131 commented Dec 12, 2018

Performance using Rcpp ifelse()

library(xts)
Rcpp::cppFunction("
NumericMatrix na_fill(const NumericMatrix x, const double fill) {
  NumericMatrix r = clone(x);
  for (int j=0; j < r.ncol(); j++) {
    r(_,j) = ifelse( is_na(r(_,j)), fill, r(_,j) );
  }
  return r;
}")
f1 = function(x, fill) {
  x[is.na(x)] <- fill
  x
}
x <- .xts(1:1e7, 1:1e7)
is.na(x) <- sample(1e7, 1e6) 
microbenchmark::microbenchmark(na.fill(x, 0), 
                               na_fill(x, 0), 
                               f1(x, 0),
                               times=100)
Unit: milliseconds
          expr        min         lq       mean     median        uq       max neval
 na.fill(x, 0) 1082.35488 1146.39916 1178.56886 1175.79914 1203.5146 1309.6379   100
 na_fill(x, 0)   71.08533   76.49017   97.88448   89.81974  110.8743  166.9555   100
      f1(x, 0)   69.06708   78.01604   98.97025   93.17469  105.8702  177.6743   100
@ckatsulis
Copy link
Author

@ckatsulis ckatsulis commented Dec 12, 2018

@joshuaulrich
Copy link
Owner

@joshuaulrich joshuaulrich commented Dec 13, 2018

@ckatsulis the code @harvey131 posted isn't in xts, let alone exported. I'm not likely to add a dependency on compiled code unless there's significant performance improvement over a native R solution. In this case, the Rcpp code is roughly the same as x[is.na(x)] <- 0.

I was also hoping to make the change upstream in zoo, but time is a constraint, as always.

@ckatsulis
Copy link
Author

@ckatsulis ckatsulis commented Dec 13, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants