New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use rowSums / rowMeans to compute row-wise summaries #112
Conversation
This simplifies the code and is 4-5 orders of magnitude faster
I'm surprised this is faster than #111 as this was one of the option I tried. Perhaps @jimhester have simpler R-side logic than my microbenchmarking the actual code that counts gives me this: > count_na_cpp <- naniar:::count_na_cpp
> bench <- function(data){
+ microbenchmark(
+ rowSums = rowSums(is.na(data)),
+ cpp = count_na_cpp(data, seq_along(data))
+ )
+ }
> bench(airquality)
Unit: microseconds
expr min lq mean median uq max neval cld
rowSums 26.362 26.927 29.47272 27.3545 28.4215 95.279 100 b
cpp 4.569 4.861 6.68874 5.8335 6.1050 46.694 100 a
> d <- map_df( 1:1000, ~airquality)
> bench(d)
Unit: microseconds
expr min lq mean median uq max neval cld
rowSums 4066.523 5828.593 11928.727 7162.083 8258.343 180031.286 100 b
cpp 394.613 473.148 654.007 542.173 645.600 4572.943 100 a
> all.equal( rowSums(is.na(airquality)) , count_na_cpp(airquality, seq_along(airquality)) )
[1] TRUE
> all.equal( rowSums(is.na(d)) , count_na_cpp(d, seq_along(d)) )
[1] TRUE
> Maybe this is bc I have btw, another attempt that I kind of liked: add_n_miss__tidyeval <- function(data){
vars <- names(data)
expr <- map( syms(vars), ~expr(is.na(!!.x)) ) %>%
reduce( ~ expr(UQ(.x) + UQ(.y)) )
mutate( data, n_miss = !!expr )
} |
I've simplified the R logic to use less I get:
Also, on a bigger data:
I went from the |
Thank you both again for your work! I really appreciate your awesome contributions, @jimhester and @romainfrancois. I am incredibly lucky to have had two great pull requests from authors whose work I greatly admire. I am currently reviewing the cpp code. I need to weigh up whether it is worth the cost of the complexity for the speed. At this stage I am really excited to have something so fast available. I will update you all soon about this - but I just wanted to say a sincere thank you, you are both awesome! |
Well the fact your code adheres to some tidy principles (simple function that only do one thing) makes it easy to contribute to it. |
Going to merge this in at the moment as a place-holder until the c++ code is ready, thanks again @jimhester ! :) |
cool, I've done some work on the c++ branch, but yeah atm it does not travis ✅ I'll get to it. |
Ah, awesome, thanks Romain! :) Rolling out the changes through the rowMeans branch has been helpful for me to work out how to best implement the c++ changes as well. At the moment there is a really solid speedup, I'm so excited for where this is going! :) |
This simplifies the code and is 4-5 orders of magnitude faster
Description
An alternative to #111, in case you do not want the added complexity. This uses the base functions
rowSums()
androwMeans()
and the factis.na()
is vectorized to do the calculations. It is 4-5 orders of magnitude faster than the original implementation, and seems to be ~ 2-10x faster than the Rcpp version in that PR (although I did not directly compare them).