New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

na.fill long call times #259

Open
ckatsulis opened this Issue Jul 25, 2018 · 4 comments

Comments

Projects
None yet
2 participants
@ckatsulis

ckatsulis commented Jul 25, 2018

na.fill() takes much longer than other methods to fill NA values. I expected near constant time to find and fill NA values.

Reproducible timings are below. I will send testData via email independently as it is too large for this forum.

testData[is.na(testData)] = 0
Unit: milliseconds
                     expr      min       lq     mean   median       uq      max neval
testData[is.na(testData)] 26.52425 35.31587 37.19453 37.23217 38.80555 46.61541    10

> microbenchmark::microbenchmark(na.fill(testData,0),times = 1)
Unit: seconds
                 expr      min       lq     mean   median       uq      max neval
 na.fill(testData, 0) 150.9054 150.9054 150.9054 150.9054 150.9054 150.9054     1

Unit: seconds
                           expr      min       lq     mean   median       uq      max neval
 na.fill(coredata(testData), 0) 52.90975 52.90975 52.90975 52.90975 52.90975 52.90975     1

Session Info

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8       LC_PAPER=en_US.UTF-8         
 [8] LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] roll_1.0.7                 PerformanceAnalytics_1.5.2 xts_0.10-2.2               zoo_1.8-2                  TTR_0.23-3                 dygraphs_1.1.1.5          

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17         magrittr_1.5         xtable_1.8-2         lattice_0.20-35      R6_2.2.2             quadprog_1.5-5       xlsx_0.6.1           tools_3.5.1          grid_3.5.1           data.table_1.11.4    qmao_1.6.11         
[12] R.oo_1.22.0          htmltools_0.3.6      yaml_2.1.19          RcppParallel_4.4.0   digest_0.6.15        RJSONIO_1.3-0        shiny_1.0.5          rJava_0.9-10         later_0.7.3          microbenchmark_1.4-4 promises_1.0.1      
[23] bitops_1.0-6         htmlwidgets_0.9      R.utils_2.6.0        xlsxjars_0.6.1       RCurl_1.95-4.10      curl_3.2             mime_0.5             pander_0.6.2         compiler_3.5.1       R.methodsS3_1.7.1    jsonlite_1.5        
[34] httpuv_1.4.4.2      
@joshuaulrich

This comment has been minimized.

Owner

joshuaulrich commented Jul 25, 2018

Note that there is no na.fill.xts() method, so zoo:::na.fill.zoo() is dispatched. We can probably add an xts method, and only dispatch to the zoo method if fill is not a scalar.

@ckatsulis

This comment has been minimized.

ckatsulis commented Jul 25, 2018

@joshuaulrich

This comment has been minimized.

Owner

joshuaulrich commented Aug 11, 2018

@ckatsulis it looks like a lot of the time is spent in creating symbol names from the ... arguments passed to merge.xts(). This is related to #248.

joshuaulrich added a commit that referenced this issue Nov 3, 2018

Shorten column names for merge on unnamed objects
The column names of a merge() result can be monstrous for raw, unnamed
numeric objects. Something similar to dput() for the entire column is
used, but zoo deparses the dput() output and limits the deparse length
to ~60 characters.

Use a modified version of makeNames() (defined in merge.zoo()) to make
the column names. Major differences are:
  - Use substitute(alist(...)) to avoid evaluating '...'
  - Use deparse(x, nlines = 1L) to deparse only the first expression

NOTE: this may break existing code that relies on behavior for integer
objects.

For example, current behavior is:

  x <- .xts(1:5, 1:5)
  lx <- list(x, x)
  do.call(merge, lx)
                      X1.5 X1.5.1
  1969-12-31 18:00:01    1      1
  1969-12-31 18:00:02    2      2
  1969-12-31 18:00:03    3      3
  1969-12-31 18:00:04    4      4
  1969-12-31 18:00:05    5      5

New behavior has column names:

  structure.1.5...Dim...c.5L..1L...index...structure.1.5..tclass...c..POSIXct...
  structure.1.5...Dim...c.5L..1L...index...structure.1.5..tclass...c..          POSIXct....1

Fixes #248. See #259.
@joshuaulrich

This comment has been minimized.

Owner

joshuaulrich commented Nov 3, 2018

Hi @ckatsulis, the fix for #248 improves performance for this case fairly dramatically. Though it's still not nearly as fast as x[is.na(x)] <- 0. Is the new performance sufficient to close this issue?

x <- .xts(1:1e7, 1:1e7)
is.na(x) <- sample(1e7, 1e6) 
microbenchmark::microbenchmark(na.fill(x, 0), times = 1) 
# Unit: seconds
#           expr      min       lq     mean   median       uq      max neval
#  na.fill(x, 0) 1.944266 1.944266 1.944266 1.944266 1.944266 1.944266     1
y <- coredata(x)
microbenchmark::microbenchmark(na.fill(y, 0), times = 1) 
# Unit: seconds
#           expr      min       lq     mean   median       uq      max neval
#  na.fill(y, 0) 6.566733 6.566733 6.566733 6.566733 6.566733 6.566733     1
microbenchmark::microbenchmark(X[is.na(X)] <- 0, times = 1, setup = X <- x)
# Unit: milliseconds
#              expr      min       lq     mean   median       uq      max neval
#  X[is.na(X)] <- 0 176.9659 176.9659 176.9659 176.9659 176.9659 176.9659     1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment