c++ back end for add_n_miss #111

romainfrancois · 2017-10-07T18:17:28Z

Description

I've added a (parallel) C++ back end for counting the number of NA in each row, and simplified the associated tidyeval logic.
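
The C++ source itself is not quoted in this thread, so the following is only a minimal sketch of the row-wise counting idea, not the PR's code. It assumes plain standard C++ rather than Rcpp, a hypothetical `Columns` type standing in for a data frame of numeric columns, and NaN standing in for NA. The name `count_na_rows` is made up for illustration, and the sketch is sequential, whereas the PR describes its back end as parallel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a data frame: a list of equal-length numeric columns.
using Columns = std::vector<std::vector<double>>;

// Count missing values per row, treating NaN as the missing marker.
// The outer loop walks columns so that each column is read contiguously,
// which mirrors how a data frame stores its data.
std::vector<int> count_na_rows(const Columns& cols) {
    if (cols.empty()) return {};
    const std::size_t n_rows = cols.front().size();
    std::vector<int> counts(n_rows, 0);
    for (const auto& col : cols) {
        assert(col.size() == n_rows);  // all columns must have the same length
        for (std::size_t i = 0; i < n_rows; ++i) {
            if (std::isnan(col[i])) ++counts[i];
        }
    }
    return counts;
}
```

Called on columns equivalent to `test_df` from the example below, with NaN in place of NA, this sketch returns one count per row.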

Example

No interface change, but the function is now split into smaller pieces.

> test_df <- data.frame(x = c(NA,2,3),
+                       y = c(1,NA,3),
+                       z = c(1,2,3))
> 
> add_n_miss(test_df)
   x  y z n_miss_all
1 NA  1 1          1
2  2 NA 2          1
3  3  3 3          0
> add_n_miss(test_df, x)
   x  y z n_miss_vars
1 NA  1 1           1
2  2 NA 2           0
3  3  3 3           0
> ( naniar:::count_na(test_df, 1:2 ) )
[1] 1 1 0

Tests

No additional tests

romainfrancois · 2017-10-07T18:19:40Z

@ColinFay this is a revised version of what we discussed yesterday

romainfrancois · 2017-10-07T18:44:52Z

The failure on Travis seems related to some roxygen problems ... 🤷‍♀️

romainfrancois · 2017-10-07T18:46:09Z

Here is some benchmark code:

library(naniar)
library(microbenchmark)
library(purrr)

current_add_n_miss <- function(data, ..., label = "n_miss"){

  if (missing(...)) {
    purrrlyr::by_row(.d = data,
      ..f = function(x) n_miss(x),
      .collate = "row",
      .to = paste0(label,"_all"))
  } else {

    quo_vars <- rlang::quos(...)

    selected_data <- dplyr::select(data, !!!quo_vars)

    prop_selected_data <- purrrlyr::by_row(.d = selected_data,
      ..f = function(x) n_miss(x),
      .collate = "row",
      .to =  paste0(label,"_vars"))

    # keep only the new n_miss column, not the whole data.frame
    prop_selected_data_cut <- prop_selected_data %>%
      dplyr::select(!!as.name(paste0(label,"_vars")))

    dplyr::bind_cols(data, prop_selected_data_cut) %>% dplyr::as_tibble()

  } # close else branch

}


bench <- function(data){
  microbenchmark(
    current_add_n_miss(data),
    add_n_miss(data)
  )
}


# with a small data set
bench(airquality)

# with a bigger dataset
d <- map_df(1:100, ~airquality)
bench(d)

> # with a small data set
> bench(airquality)
Unit: milliseconds
                     expr       min        lq      mean    median        uq        max neval cld
 current_add_n_miss(data) 72.930573 75.756150 79.774946 78.279936 82.782694 130.276075   100   b
         add_n_miss(data)  1.835696  1.980453  2.160201  2.116305  2.219488   4.793865   100  a 
> identical(pull(current_add_n_miss(airquality)), pull(add_n_miss(airquality)) )
[1] TRUE

> # with a bigger dataset
> d <- map_df(1:100, ~airquality)
> bench(d)
Unit: milliseconds
                     expr        min          lq        mean      median          uq         max neval cld
 current_add_n_miss(data) 8201.02784 8536.906254 9318.309036 8886.351413 9554.075054 21176.25978   100   b
         add_n_miss(data)    2.30777    2.608623    3.060244    2.794713    2.966273    14.76994   100  a 

> identical(pull(current_add_n_miss(d)), pull(add_n_miss(d)) )
[1] TRUE


njtierney · 2017-10-18T12:03:26Z

Just going to have a play around on cpp-test branch and let you know how I go :)

romainfrancois · 2017-10-18T12:05:01Z

Cool. We need to remember to remove the extra functions from #112 that I just added for making comparison easy.

njtierney · 2017-10-20T07:05:16Z

Carrying this work through to the tabular summaries, we get:

library(naniar)
miss_case_summary_cpp <- function(data, order = FALSE, ...){

  res <- data
  res$pct_miss <- naniar:::prop_na_cpp(data)
  res$n_miss <- naniar:::count_na_cpp(data)
  res$case <- seq_len(nrow(data))  # safe for zero-row data, unlike 1:nrow(data)
  res$n_miss_cumsum <- cumsum(res$n_miss)
  res <- dplyr::select(res, 
                       case,
                       n_miss,
                       pct_miss,
                       n_miss_cumsum) %>%
    dplyr::as_tibble()
  
  if (order) {
    return(dplyr::arrange(res, -n_miss))
  } else {
    return(res)
  }
}

miss_case_summary_rowSums <- function(data, order = FALSE, ...){

  res <- data
  res$pct_miss <- rowMeans(is.na(data))
  res$n_miss <- rowSums(is.na(data))
  res$case <- seq_len(nrow(data))  # safe for zero-row data, unlike 1:nrow(data)
  res$n_miss_cumsum <- cumsum(res$n_miss)
  res <- dplyr::select(res, 
                       case,
                       n_miss,
                       pct_miss,
                       n_miss_cumsum) %>%
    dplyr::as_tibble()
  
  if (order) {
    return(dplyr::arrange(res, -n_miss))
  } else {
    return(res)
  }
}

bench <- function(data){
  microbenchmark::microbenchmark(
    existing = miss_case_summary(data),
    cpp = miss_case_summary_cpp(data),
    base = miss_case_summary_rowSums(data),
    times = 5
  )
}

> # with a bigger dataset
> d <- purrr::map_df(1:100, ~airquality)
> dim(d)
[1] 15300     6
> 
> bd <- bench(d)
> bd
Unit: milliseconds
     expr          min           lq        mean      median          uq         max neval cld
 existing 14987.402396 15014.907935 15406.16553 15351.41423 15718.28937 15958.81370     5   b
      cpp     9.151344     9.293812    10.94039    10.29448    12.01509    13.94723     5  a 
     base    10.105784    10.343790    10.92548    10.48311    11.51229    12.18241     5  a 

romainfrancois · 2017-10-20T07:10:20Z

You can probably cut that, as these are very related:

  res$pct_miss <- naniar:::prop_na_cpp(data)
  res$n_miss <- naniar:::count_na_cpp(data)
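
One reading of this remark is that the proportion of missing values in a row is just its count divided by the number of columns, in the same way that `rowMeans(is.na(data))` equals `rowSums(is.na(data)) / ncol(data)`, so a single pass could produce both. The sketch below illustrates that relationship in standalone C++. The name `prop_from_counts` is hypothetical, and nothing here is taken from the naniar sources.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Derive per-row proportions of missing values from per-row counts.
// n_cols is the number of columns the counts were taken over.
std::vector<double> prop_from_counts(const std::vector<int>& counts,
                                     std::size_t n_cols) {
    assert(n_cols > 0);  // a proportion over zero columns is undefined
    std::vector<double> props;
    props.reserve(counts.size());
    for (int c : counts) {
        props.push_back(static_cast<double>(c) / static_cast<double>(n_cols));
    }
    return props;
}
```

With this shape, only the counting step needs to touch the data; the proportion is a cheap follow-up over a vector with one entry per row.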

Romain Francois added 7 commits October 7, 2017 11:52

redoc

26e739c

using Rcpp

495c144

C++ backend for add_n_miss

655f9d9

using the c++ back end in add_n_miss

a0be4da

import dplyr::mutate

52bd9b3

+comments in the C++ code

7e8a3bc

adding me as an aut

d4692b2

Romain Francois added 2 commits October 8, 2017 11:48

c++ back end for add_prop_miss as well (almost the same code anyway).

ed53e3f

trimmin some uselessness

0012f23

jimhester mentioned this pull request Oct 9, 2017

Use rowSums / rowMeans to compute row-wise summaries #112

Merged

simpler R logic around the c++ code, to make this comparable to njtie…

81d7b80

…rney#112

njtierney changed the base branch from master to cpp-test October 18, 2017 12:02

njtierney merged commit 6af8b15 into njtierney:cpp-test Oct 18, 2017
