New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
c++ back end for add_n_miss #111
Conversation
@ColinFay this is a revised version of what we discussed yesterday |
The failure on travis seems related to some roxygen problems ... 🤷♀️ |
here is some benchmark code: library(naniar)
library(microbenchmark)
library(purrr)
current_add_n_miss <- function(data, ..., label = "n_miss"){
if (missing(...)) {
purrrlyr::by_row(.d = data,
..f = function(x) n_miss(x),
.collate = "row",
.to = paste0(label,"_all"))
} else {
quo_vars <- rlang::quos(...)
selected_data <- dplyr::select(data, !!!quo_vars)
prop_selected_data <- purrrlyr::by_row(.d = selected_data,
..f = function(x) n_miss(x),
.collate = "row",
.to = paste0(label,"_vars"))
# add only the variables prop_miss function, not the whole data.frame...
prop_selected_data_cut <- prop_selected_data %>%
dplyr::select(!!as.name(paste0(label,"_vars")))
dplyr::bind_cols(data, prop_selected_data_cut) %>% dplyr::as_tibble()
} # close else loop
}
bench <- function(data){
microbenchmark(
current_add_n_miss(data),
add_n_miss(data)
)
}
# with a small data set
bench(airquality)
# with a bigger dataset
d <- map_df(1:100, ~airquality)
bench(d)
|
Just going to have a play around on cpp-test branch and let you know how I go :) |
Cool. We need to remember to remove the extra functions from #112 that I just added for making comparison easy. |
Carrying this work through to the tabular summaries, we get library(naniar)
miss_case_summary_cpp <- function(data, order = FALSE, ...){
res <- data
res$pct_miss <- naniar:::prop_na_cpp(data)
res$n_miss <- naniar:::count_na_cpp(data)
res$case <- 1:nrow(data)
res$n_miss_cumsum <- cumsum(res$n_miss)
res <- dplyr::select(res,
case,
n_miss,
pct_miss,
n_miss_cumsum) %>%
dplyr::as_tibble()
if (order) {
return(dplyr::arrange(res, -n_miss))
} else {
return(res)
}
}
miss_case_summary_rowSums <- function(data, order = FALSE, ...){
res <- data
res$pct_miss <- rowMeans(is.na(data))
res$n_miss <- rowSums(is.na(data))
res$case <- 1:nrow(data)
res$n_miss_cumsum <- cumsum(res$n_miss)
res <- dplyr::select(res,
case,
n_miss,
pct_miss,
n_miss_cumsum) %>%
dplyr::as_tibble()
if (order) {
return(dplyr::arrange(res, -n_miss))
} else {
return(res)
}
}
bench <- function(data){
microbenchmark::microbenchmark(
existing = miss_case_summary(data),
cpp = miss_case_summary_cpp(data),
base = miss_case_summary_rowSums(data),
times = 5
)
}
|
You can probably cut that, as these are very related:
|
Description
I've added a (parallel) C++ back end for counting the number of
NA
in each row, and simplified the associated tidyeval logic.Example
No interface change. But the function is now split into smaller bits.
Tests
No additional tests