Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

c++ back end for add_n_miss #111

Merged
merged 10 commits into from Oct 18, 2017
Merged

Conversation

romainfrancois
Copy link
Contributor

Description

I've added a (parallel) C++ back end for counting the number of NA in each row, and simplified the associated tidyeval logic.

Example

No interface change. But the function is now split into smaller bits.

> test_df <- data.frame(x = c(NA,2,3),
+                       y = c(1,NA,3),
+                       z = c(1,2,3))
> 
> add_n_miss(test_df)
   x  y z n_miss_all
1 NA  1 1          1
2  2 NA 2          1
3  3  3 3          0
> add_n_miss(test_df, x)
   x  y z n_miss_vars
1 NA  1 1           1
2  2 NA 2           0
3  3  3 3           0
> ( naniar:::count_na(test_df, 1:2 ) )
[1] 1 1 0

Tests

No additional tests

@romainfrancois
Copy link
Contributor Author

@ColinFay this is a revised version of what we discussed yesterday

@romainfrancois
Copy link
Contributor Author

The failure on travis seems related to some roxygen problems ... 🤷‍♀️

@romainfrancois
Copy link
Contributor Author

romainfrancois commented Oct 7, 2017

here is some benchmark code:

library(naniar)
library(microbenchmark)
library(purrr)

current_add_n_miss <- function(data, ..., label = "n_miss"){

  if (missing(...)) {
    purrrlyr::by_row(.d = data,
      ..f = function(x) n_miss(x),
      .collate = "row",
      .to = paste0(label,"_all"))
  } else {

    quo_vars <- rlang::quos(...)

    selected_data <- dplyr::select(data, !!!quo_vars)

    prop_selected_data <- purrrlyr::by_row(.d = selected_data,
      ..f = function(x) n_miss(x),
      .collate = "row",
      .to =  paste0(label,"_vars"))

    # add only the variables prop_miss function, not the whole data.frame...
    prop_selected_data_cut <- prop_selected_data %>%
      dplyr::select(!!as.name(paste0(label,"_vars")))

    dplyr::bind_cols(data, prop_selected_data_cut) %>% dplyr::as_tibble()

  } # close else loop

}


bench <- function(data){
  microbenchmark(
    current_add_n_miss(data),
    add_n_miss(data)
  )
}


# with a small data set
bench(airquality)

# with a bigger dataset
d <- map_df(1:100, ~airquality)
bench(d)
> # with a small data set
> bench(airquality)
Unit: milliseconds
                     expr       min        lq      mean    median        uq        max neval cld
 current_add_n_miss(data) 72.930573 75.756150 79.774946 78.279936 82.782694 130.276075   100   b
         add_n_miss(data)  1.835696  1.980453  2.160201  2.116305  2.219488   4.793865   100  a 
> identical(pull(current_add_n_miss(airquality)), pull(add_n_miss(airquality)) )
[1] TRUE

> # with a bigger dataset
> d <- map_df(1:100, ~airquality)
> bench(d)
Unit: milliseconds
                     expr        min          lq        mean      median          uq         max neval cld
 current_add_n_miss(data) 8201.02784 8536.906254 9318.309036 8886.351413 9554.075054 21176.25978   100   b
         add_n_miss(data)    2.30777    2.608623    3.060244    2.794713    2.966273    14.76994   100  a 

> identical(pull(current_add_n_miss(d)), pull(add_n_miss(d)) )
[1] TRUE

@njtierney njtierney changed the base branch from master to cpp-test October 18, 2017 12:02
@njtierney njtierney merged commit 6af8b15 into njtierney:cpp-test Oct 18, 2017
@njtierney
Copy link
Owner

Just going to have a play around on cpp-test branch and let you know how I go :)

@romainfrancois
Copy link
Contributor Author

Cool. We need to remember to remove the extra functions from #112 that I just added for making comparison easy.

@njtierney
Copy link
Owner

Carrying this work through to the tabular summaries, we get

library(naniar)
miss_case_summary_cpp <- function(data, order = FALSE, ...){

  res <- data
  res$pct_miss <- naniar:::prop_na_cpp(data)
  res$n_miss <- naniar:::count_na_cpp(data)
  res$case <- 1:nrow(data)
  res$n_miss_cumsum <- cumsum(res$n_miss)
  res <- dplyr::select(res, 
                       case,
                       n_miss,
                       pct_miss,
                       n_miss_cumsum) %>%
    dplyr::as_tibble()
  
  if (order) {
    return(dplyr::arrange(res, -n_miss))
  } else {
    return(res)
  }
}

miss_case_summary_rowSums <- function(data, order = FALSE, ...){

  res <- data
  res$pct_miss <- rowMeans(is.na(data))
  res$n_miss <- rowSums(is.na(data))
  res$case <- 1:nrow(data)
  res$n_miss_cumsum <- cumsum(res$n_miss)
  res <- dplyr::select(res, 
                       case,
                       n_miss,
                       pct_miss,
                       n_miss_cumsum) %>%
    dplyr::as_tibble()
  
  if (order) {
    return(dplyr::arrange(res, -n_miss))
  } else {
    return(res)
  }
}

bench <- function(data){
  microbenchmark::microbenchmark(
    existing = miss_case_summary(data),
    cpp = miss_case_summary_cpp(data),
    base = miss_case_summary_rowSums(data),
    times = 5
  )
}
> # with a bigger dataset
> d <- purrr::map_df(1:100, ~airquality)
> dim(d)
[1] 15300     6
> 
> bd <- bench(d)
> bd
Unit: milliseconds
     expr          min           lq        mean      median          uq
 existing 14987.402396 15014.907935 15406.16553 15351.41423 15718.28937
      cpp     9.151344     9.293812    10.94039    10.29448    12.01509
     base    10.105784    10.343790    10.92548    10.48311    11.51229
         max neval cld
 15958.81370     5   b
    13.94723     5  a 
    12.18241     5  a 

@romainfrancois
Copy link
Contributor Author

You can probably cut that, as these are very related:

  res$pct_miss <- naniar:::prop_na_cpp(data)
  res$n_miss <- naniar:::count_na_cpp(data)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants