Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use rowSums / rowMeans to compute row-wise summaries #112

Merged
merged 1 commit into from Dec 14, 2017

Conversation

jimhester
Copy link
Contributor

This simplifies the code and is 4-5 orders of magnitude faster

Description

An alternative to #111, in case you do not want the added complexity. This uses the base functions rowSums() and rowMeans() and the fact is.na() is vectorized to do the calculations. It is 4-5 orders of magnitude faster than the original implementation, and seems to be ~ 2-10x faster than the Rcpp version in that PR (although I did not directly compare them).

microbenchmark::microbenchmark(add_n_miss(airquality), current_add_n_miss(airquality), unit = "ms")

#> Unit: milliseconds
#>                            expr       min         lq         mean     median          uq        max neval cld
#>          add_n_miss(airquality)  0.048766   0.060757   0.07926442   0.083558   0.0895555   0.160938   100  a
#>  current_add_n_miss(airquality) 92.369034 100.457525 106.24287966 104.383850 108.0468730 294.761924   100   b

This simplifies the code and is 4-5 orders of magnitude faster
@romainfrancois
Copy link
Contributor

romainfrancois commented Oct 10, 2017

I'm surprised this is faster than #111 as this was one of the option I tried. Perhaps @jimhester have simpler R-side logic than my ‼️ use.

microbenchmarking the actual code that counts gives me this:

> count_na_cpp <- naniar:::count_na_cpp
> bench <- function(data){
+   microbenchmark(
+     rowSums = rowSums(is.na(data)),
+     cpp = count_na_cpp(data, seq_along(data))
+   )
+ }
> bench(airquality)
Unit: microseconds
    expr    min     lq     mean  median      uq    max neval cld
 rowSums 26.362 26.927 29.47272 27.3545 28.4215 95.279   100   b
     cpp  4.569  4.861  6.68874  5.8335  6.1050 46.694   100  a 
> d <- map_df( 1:1000, ~airquality)
> bench(d)
Unit: microseconds
    expr      min       lq      mean   median       uq        max neval cld
 rowSums 4066.523 5828.593 11928.727 7162.083 8258.343 180031.286   100   b
     cpp  394.613  473.148   654.007  542.173  645.600   4572.943   100  a 

> all.equal( rowSums(is.na(airquality)) , count_na_cpp(airquality, seq_along(airquality)) )
[1] TRUE
> all.equal( rowSums(is.na(d)) , count_na_cpp(d, seq_along(d)) )
[1] TRUE
> 

Maybe this is bc I have CXX11FLAGS=-O3 in my ~/.R/Makevars ?

btw, another attempt that I kind of liked:

add_n_miss__tidyeval <- function(data){
  vars <- names(data)
  expr <- map( syms(vars), ~expr(is.na(!!.x)) ) %>%
    reduce( ~ expr(UQ(.x) + UQ(.y)) )

  mutate( data, n_miss = !!expr )
}

romainfrancois pushed a commit to romainfrancois/naniar that referenced this pull request Oct 10, 2017
@romainfrancois
Copy link
Contributor

I've simplified the R logic to use less ‼️ in #111, essentially using @jimhester version. Also temporarily added this version as add_n_miss_rowSums so that everything else is comparable (namespace resolution etc ...).

I get:

> library(naniar)
> bench <- function(data){
+   microbenchmark( 
+     rowSums = add_n_miss_rowSums(data), 
+     cpp = add_n_miss(data)
+   )
+ }
> bench(airquality)
Unit: microseconds
    expr    min      lq     mean  median       uq      max neval cld
 rowSums 41.074 42.7420 108.5897 50.3280 106.9125 3028.254   100   b
     cpp 16.864 18.3295  40.2343 21.1365  43.9495  666.223   100  a 

Also, on a bigger data:

> d <- map_df(1:1000, ~airquality)
> bench(d)
Unit: microseconds
    expr      min        lq      mean   median        uq        max neval cld
 rowSums 4113.109 6048.8690 9926.6086 6920.907 8346.1555 148647.065   100   b
     cpp  397.492  499.4435  617.1485  556.735  663.3275   2772.838   100  a 

I went from the rowSums(is.na()) combo to the C++ version to avoid the materialization of the n x d logical matrix, sparing some time and memory footprint ...

@njtierney
Copy link
Owner

Thank you both again for your work!

I really appreciate your awesome contributions, @jimhester and @romainfrancois. I am incredibly lucky to have had two great pull requests from authors whose work I greatly admire.

I am currently reviewing the cpp code. I need to weigh up whether it is worth the cost of the complexity for the speed. At this stage I am really excited to have something so fast available.

I will update you all soon about this - but I just wanted to say a sincere thank you, you are both awesome!

@romainfrancois
Copy link
Contributor

Well the fact your code adheres to some tidy principles (simple function that only do one thing) makes it easy to contribute to it.

@njtierney njtierney changed the base branch from master to rowMeans-isna December 14, 2017 02:47
@njtierney njtierney merged commit 9e40b7a into njtierney:rowMeans-isna Dec 14, 2017
@njtierney
Copy link
Owner

Going to merge this in at the moment as a place-holder until the c++ code is ready, thanks again @jimhester ! :)

@romainfrancois
Copy link
Contributor

cool, I've done some work on the c++ branch, but yeah atm it does not travis ✅ I'll get to it.

@njtierney
Copy link
Owner

Ah, awesome, thanks Romain! :)

Rolling out the changes through the rowMeans branch has been helpful for me to work out how to best implement the c++ changes as well. At the moment there is a really solid speedup, I'm so excited for where this is going! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants