Use rowSums / rowMeans to compute row-wise summaries #112

jimhester · 2017-10-09T13:02:36Z

This simplifies the code and is 4-5 orders of magnitude faster

Description

An alternative to #111, in case you do not want the added complexity. This uses the base functions rowSums() and rowMeans() and the fact is.na() is vectorized to do the calculations. It is 4-5 orders of magnitude faster than the original implementation, and seems to be ~ 2-10x faster than the Rcpp version in that PR (although I did not directly compare them).

microbenchmark::microbenchmark(add_n_miss(airquality), current_add_n_miss(airquality), unit = "ms")

#> Unit: milliseconds
#>                            expr       min         lq         mean     median          uq        max neval cld
#>          add_n_miss(airquality)  0.048766   0.060757   0.07926442   0.083558   0.0895555   0.160938   100  a
#>  current_add_n_miss(airquality) 92.369034 100.457525 106.24287966 104.383850 108.0468730 294.761924   100   b

This simplifies the code and is 4-5 orders of magnitude faster

romainfrancois · 2017-10-10T07:05:15Z

I'm surprised this is faster than #111 as this was one of the option I tried. Perhaps @jimhester have simpler R-side logic than my ‼️ use.

microbenchmarking the actual code that counts gives me this:

> count_na_cpp <- naniar:::count_na_cpp
> bench <- function(data){
+   microbenchmark(
+     rowSums = rowSums(is.na(data)),
+     cpp = count_na_cpp(data, seq_along(data))
+   )
+ }
> bench(airquality)
Unit: microseconds
    expr    min     lq     mean  median      uq    max neval cld
 rowSums 26.362 26.927 29.47272 27.3545 28.4215 95.279   100   b
     cpp  4.569  4.861  6.68874  5.8335  6.1050 46.694   100  a 
> d <- map_df( 1:1000, ~airquality)
> bench(d)
Unit: microseconds
    expr      min       lq      mean   median       uq        max neval cld
 rowSums 4066.523 5828.593 11928.727 7162.083 8258.343 180031.286   100   b
     cpp  394.613  473.148   654.007  542.173  645.600   4572.943   100  a 

> all.equal( rowSums(is.na(airquality)) , count_na_cpp(airquality, seq_along(airquality)) )
[1] TRUE
> all.equal( rowSums(is.na(d)) , count_na_cpp(d, seq_along(d)) )
[1] TRUE
>

Maybe this is bc I have CXX11FLAGS=-O3 in my ~/.R/Makevars ?

btw, another attempt that I kind of liked:

add_n_miss__tidyeval <- function(data){
  vars <- names(data)
  expr <- map( syms(vars), ~expr(is.na(!!.x)) ) %>%
    reduce( ~ expr(UQ(.x) + UQ(.y)) )

  mutate( data, n_miss = !!expr )
}

…rney#112

romainfrancois · 2017-10-10T07:42:00Z

I've simplified the R logic to use less ‼️ in #111, essentially using @jimhester version. Also temporarily added this version as add_n_miss_rowSums so that everything else is comparable (namespace resolution etc ...).

I get:

> library(naniar)
> bench <- function(data){
+   microbenchmark( 
+     rowSums = add_n_miss_rowSums(data), 
+     cpp = add_n_miss(data)
+   )
+ }
> bench(airquality)
Unit: microseconds
    expr    min      lq     mean  median       uq      max neval cld
 rowSums 41.074 42.7420 108.5897 50.3280 106.9125 3028.254   100   b
     cpp 16.864 18.3295  40.2343 21.1365  43.9495  666.223   100  a

Also, on a bigger data:

> d <- map_df(1:1000, ~airquality)
> bench(d)
Unit: microseconds
    expr      min        lq      mean   median        uq        max neval cld
 rowSums 4113.109 6048.8690 9926.6086 6920.907 8346.1555 148647.065   100   b
     cpp  397.492  499.4435  617.1485  556.735  663.3275   2772.838   100  a

I went from the rowSums(is.na()) combo to the C++ version to avoid the materialization of the n x d logical matrix, sparing some time and memory footprint ...

njtierney · 2017-10-12T04:54:22Z

Thank you both again for your work!

I really appreciate your awesome contributions, @jimhester and @romainfrancois. I am incredibly lucky to have had two great pull requests from authors whose work I greatly admire.

I am currently reviewing the cpp code. I need to weigh up whether it is worth the cost of the complexity for the speed. At this stage I am really excited to have something so fast available.

I will update you all soon about this - but I just wanted to say a sincere thank you, you are both awesome!

romainfrancois · 2017-10-12T07:53:51Z

Well the fact your code adheres to some tidy principles (simple function that only do one thing) makes it easy to contribute to it.

njtierney · 2017-12-14T02:47:59Z

Going to merge this in at the moment as a place-holder until the c++ code is ready, thanks again @jimhester ! :)

romainfrancois · 2017-12-14T07:58:43Z

cool, I've done some work on the c++ branch, but yeah atm it does not travis ✅ I'll get to it.

njtierney · 2017-12-14T23:50:21Z

Ah, awesome, thanks Romain! :)

Rolling out the changes through the rowMeans branch has been helpful for me to work out how to best implement the c++ changes as well. At the moment there is a really solid speedup, I'm so excited for where this is going! :)

Use rowSums / rowMeans to compute row-wise summaries

67153e1

This simplifies the code and is 4-5 orders of magnitude faster

romainfrancois pushed a commit to romainfrancois/naniar that referenced this pull request Oct 10, 2017

simpler R logic around the c++ code, to make this comparable to njtie…

81d7b80

…rney#112

romainfrancois mentioned this pull request Oct 18, 2017

c++ back end for add_n_miss #111

Merged

njtierney changed the base branch from master to rowMeans-isna December 14, 2017 02:47

njtierney merged commit 9e40b7a into njtierney:rowMeans-isna Dec 14, 2017

romainfrancois pushed a commit that referenced this pull request Mar 22, 2018

simpler R logic around the c++ code, to make this comparable to #112

9c287dd

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use rowSums / rowMeans to compute row-wise summaries #112

Use rowSums / rowMeans to compute row-wise summaries #112

jimhester commented Oct 9, 2017

romainfrancois commented Oct 10, 2017 •

edited

romainfrancois commented Oct 10, 2017

njtierney commented Oct 12, 2017

romainfrancois commented Oct 12, 2017

njtierney commented Dec 14, 2017

romainfrancois commented Dec 14, 2017

njtierney commented Dec 14, 2017

Use rowSums / rowMeans to compute row-wise summaries #112

Use rowSums / rowMeans to compute row-wise summaries #112

Conversation

jimhester commented Oct 9, 2017

Description

romainfrancois commented Oct 10, 2017 • edited

romainfrancois commented Oct 10, 2017

njtierney commented Oct 12, 2017

romainfrancois commented Oct 12, 2017

njtierney commented Dec 14, 2017

romainfrancois commented Dec 14, 2017

njtierney commented Dec 14, 2017

romainfrancois commented Oct 10, 2017 •

edited