create other flavours of missing values #50

Closed
njtierney opened this Issue Jun 4, 2017 · 15 comments

@njtierney
Owner

njtierney commented Jun 4, 2017

Building on issues #25 and #31, and discussions with @rgayler, there needs to be a way to create different flavours of missing values to indicate different mechanisms.

An example is a weather station that records -99 for a missing value, where the value is missing specifically because the weather was so cold that the instruments stopped working.

Currently in R there is only one kind of NA value (ignoring NA_integer_ ... and friends).
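
As a quick illustration of that point, here is a minimal base R sketch showing that the typed NA constants differ only in storage type, not in meaning. The object names `nas` and `na_types` are made up for this example.

```r
# The typed NA constants differ in storage type only
nas <- list(NA, NA_integer_, NA_real_, NA_character_)
na_types <- vapply(nas, typeof, character(1))
na_types
#> [1] "logical"   "integer"   "double"    "character"

# ...but is.na() treats all of them as the same single kind of missing
all(vapply(nas, is.na, logical(1)))
#> [1] TRUE
```

None of these carries any information about why the value is missing.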

So there needs to be a way to specify your own missing value NA_this (or something).

This might be a function like tidyr::replace_na, perhaps instead called replace_na_why or something.

This might look like

data %>%
  replace_na_why(.condition = var == -99,
                 .why = "weather station too cold",
                 .suffix = "TC")

This would then create a value NA_TC, which then has a mechanism recorded.

Since R does not treat these as missing, we would incorporate this into the shadow matrix values:

!NA, NA, and NA_.why
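
To make the idea concrete, here is a rough base R sketch of what a three-level shadow column could look like. It is not the proposed `replace_na_why()` function, which does not exist yet. The `weather` data frame and the `temp` / `temp_NA` column names are assumptions made up for illustration.

```r
# Hypothetical data: -99 marks a reading lost because the station was too cold
weather <- data.frame(temp = c(12.1, -99, NA, 8.4))

# Start from the usual two-level shadow: "!NA" for observed, "NA" for missing
shadow <- ifelse(is.na(weather$temp), "NA", "!NA")

# Flag the sentinel value as a special missing value with the suffix "TC"
shadow[!is.na(weather$temp) & weather$temp == -99] <- "NA_TC"

# Store the shadow next to the data, leaving the data column untouched
weather$temp_NA <- factor(shadow, levels = c("!NA", "NA", "NA_TC"))
weather
```

The data column still holds -99, so the mechanism lives only in the shadow column.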

@njtierney

Owner

njtierney commented Jun 7, 2017

Another name for the function is replace_na_type

This could have the arguments:

replace_na_type(.vars/.cols, # similar to `mutate`
                .predicate,
                .funs)

It might also need to follow the format of purrr::pmap, where you provide a named list, which would contain the variables/columns, and the rules for each of those.

Alternatively, since this isn't really doing any modification in place, but is instead adding things to the shadow dataframe, it might be more sensible to have different verbs for that process:

  • add_shadow or
  • mutate_na / mutate_na_type or
  • replace_shadow / replace_shadow_type / replace_shadow_with
@rgayler

rgayler commented Jun 7, 2017

@njtierney

Owner

njtierney commented Jun 7, 2017

Great suggestion, thanks @rgayler ! :)

@njtierney njtierney added the V0.1.0 label Jun 21, 2017

@njtierney njtierney added this to To Do in CRAN Version 0.1.0 Jun 21, 2017

@njtierney njtierney moved this from To Do to Priority in CRAN Version 0.1.0 Jun 21, 2017

@njtierney

Owner

njtierney commented Jun 22, 2017

OK, so haven's tagged_na function is pretty slick

library(haven)
x <- c(1:5, tagged_na("a"), tagged_na("z"), NA)

# Tagged NA's work identically to regular NAs
x
#> [1]  1  2  3  4  5 NA NA NA
is.na(x)
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

# To see that they're special, you need to use na_tag(),
# is_tagged_na(), or print_tagged_na():
is_tagged_na(x)
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
na_tag(x)
#> [1] NA  NA  NA  NA  NA  "a" "z" NA
print_tagged_na(x)
#> [1]     1     2     3     4     5 NA(a) NA(z)    NA

# You can test for specific tagged NAs with the second argument
is_tagged_na(x, "a")
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

# Because the support for tagged's NAs is somewhat tagged on to R,
# the left-most NA will tend to be preserved in arithmetic operations.
na_tag(tagged_na("a") + tagged_na("z"))
#> [1] "a"

I need to more carefully think about the implementation of this system, and how it fits into shadow values, so I'm going to change this to be on the next release for Narnia.

One of the main goals with Narnia is to clearly expose these sorts of values to the user. I am unsure if hiding attributes of an NA value is ideal here, although I do see the benefits with other features of R.

@njtierney njtierney added V0.2.0 and removed V0.1.0 labels Jun 22, 2017

@njtierney njtierney removed this from Priority in CRAN Version 0.1.0 Jun 22, 2017

@njtierney njtierney added this to To Do in CRAN Version 0.2.0 Jul 25, 2017

@njtierney njtierney moved this from To Do to Priority in CRAN Version 0.2.0 Jul 25, 2017

@mpadge

mpadge commented Aug 8, 2017

Another related thought that I've been wondering about is handling sparse Matrix::dsparseMatrix objects. They're the definitive object packed full of missing values which are simply ... missing. Just for you to ponder down the track ...

@njtierney

Owner

njtierney commented Aug 8, 2017

Just a note to look into memisc::missing.values() (as suggested by @leeper)

@njtierney

Owner

njtierney commented Jan 12, 2018

Just adding some of the relevant components of #76 into here:

Extending the ideas from #76 to shadow values allows us to directly specify the different flavours of missings, providing the verbs:

  • replace_shadow
  • replace_shadow_where
  • replace_shadow_if
  • replace_shadow_at

This code would alter only the shadow matrix and leave the data intact, allowing us to leverage other features of the shadow matrix, and possibly to add factor levels to it that describe the missingness mechanism (!NA, NA, NA_, NA_).

There needs to be a way to store the "codebook / data dictionary" of missingness mechanisms, so that the user has a way to look up / describe what a value like NA_rainfall really means.
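
One possible shape for such a codebook, sketched in base R, is a small lookup table keyed by shadow level. Everything here is an assumption for illustration: the `na_codebook` object, the `describe_na()` helper, and the description given for `NA_rainfall` are all made up.

```r
# Hypothetical codebook: one row per special missing level
na_codebook <- data.frame(
  level = c("NA_TC", "NA_rainfall"),
  why = c("weather station too cold",
          "rain gauge reading unavailable"),  # made-up description
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the recorded mechanism for a shadow level
describe_na <- function(level) {
  na_codebook$why[match(level, na_codebook$level)]
}

describe_na("NA_TC")
#> [1] "weather station too cold"
```

Levels without an entry return NA, which flags a mechanism that was never documented.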

Although haven already offers a way to store the "different" values of missingness with tagged_na(), I don't really like hiding important features from the user, and I think the shadow matrix can be more practical here.

One approach I like so far could be something like this:

data %>%
  replace_shadow_where(.funs = ~.x == -99,
                       .why = "weather station too cold",
                       .suffix = "TC")

An additional idea is to make the miss_ functions have special behaviour for the _NA columns, so that gg_miss_var(data) could summarise the number of pure NA values as well as the number of values coded as NA_<reason_1>. Likewise, miss_var_summary might also have some specific shadow summaries.
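
As a rough sketch of the kind of summary this could produce, the counts per missingness level can already be tabulated from a shadow column in base R. This is not how the miss_ functions work. The `temperature_NA` vector and its `NA_TC` level are made-up example data.

```r
# Hypothetical shadow column with one special missing level
temperature_NA <- factor(
  c("NA_TC", "NA", "!NA", "NA_TC"),
  levels = c("!NA", "NA", "NA_TC")
)

# Count every level, then keep only the missing ones
counts <- table(temperature_NA)
missing_counts <- counts[names(counts) != "!NA"]
missing_counts
```

This keeps pure NA values and each NA_<reason> level as separate counts.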

@njtierney njtierney moved this from Priority to In Progress in CRAN Version 0.2.0 Jan 18, 2018

@njtierney njtierney added V0.3.0 and removed V0.2.0 labels Jan 25, 2018

@njtierney njtierney removed this from In Progress in CRAN Version 0.2.0 Jan 25, 2018

@njtierney njtierney added this to Other in CRAN Version 0.3.0 Jan 26, 2018

@njtierney njtierney moved this from Other to Priority / In Progress in CRAN Version 0.3.0 Jan 26, 2018

@njtierney

Owner

njtierney commented Mar 16, 2018

Current progress:

library(naniar)
library(tidyverse)

df <- tribble(
  ~wind, ~temperature,
  -99,    45,
   68,    NA,
   72,    25
)

dfs <- bind_shadow(df)

map(dfs, levels)
#> $wind
#> NULL
#> 
#> $temperature
#> NULL
#> 
#> $wind_NA
#> [1] "!NA" "NA" 
#> 
#> $temperature_NA
#> [1] "!NA" "NA"

dfs_special <- recode_shadow(dfs,
                             temperature = .where(wind == -99 ~ "bananas"))

dfs_special
#> # A tibble: 3 x 4
#>    wind temperature wind_NA temperature_NA
#>   <dbl>       <dbl> <fct>   <fct>         
#> 1  -99.         45. !NA     NA_bananas    
#> 2   68.         NA  !NA     NA            
#> 3   72.         25. !NA     !NA

map(dfs_special, levels)
#> $wind
#> NULL
#> 
#> $temperature
#> NULL
#> 
#> $wind_NA
#> [1] "!NA"        "NA"         "NA_bananas"
#> 
#> $temperature_NA
#> [1] "!NA"        "NA"         "NA_bananas"

Created on 2018-03-16 by the reprex package (v0.2.0).

@njtierney

Owner

njtierney commented Mar 16, 2018

Next steps:

  • make it work with >1 statements
  • test how this works inside a grouped_df
@mpadge

mpadge commented Mar 16, 2018

@njtierney just a thought for ya, but haven has a really nice labelled class that is very useful beyond the standard remit of the package itself. Any one NA could then contain concurrent info about all possible types of NA.

@njtierney

Owner

njtierney commented Mar 25, 2018

Thanks @mpadge ! That is something I have been thinking about, but at the moment I am sticking with the idea of expanding out the dataframe into the data and shadows. In the future I am interested in looking at collapsing things back down in the dataframe.

On the note of labelled features, here are some packages that work with them (for future reference to myself)

https://github.com/larmarange/labelled

https://github.com/rubenarslan/codebook

@njtierney

Owner

njtierney commented Jun 4, 2018

Hola @caitlinhudon - tagging you in here from discussion on twitter from your awesome talk

@njtierney njtierney modified the milestones: V0.3.0, V0.4.0 Jun 5, 2018

@njtierney njtierney removed the V0.3.0 label Jun 5, 2018

@njtierney

Owner

njtierney commented Jun 14, 2018

Here is the section on labelled missing data:

https://larmarange.github.io/labelled/articles/intro_labelled.html#user-defined-missing-values-spsss-style

This looks like a nicely scoped out idea, which is great! But I think I want to take my idea for recode_shadow to some maturity before I compare the two.

@njtierney

Owner

njtierney commented Jun 14, 2018

OK just dusted off the "special-missing" branch, I get the same output as before:

library(naniar)
library(tidyverse)

df <- tribble(
  ~wind, ~temperature,
  -99,    45,
  68,    NA,
  72,    25
)

dfs <- bind_shadow(df)

map(dfs, levels)
#> $wind
#> NULL
#> 
#> $temperature
#> NULL
#> 
#> $wind_NA
#> [1] "!NA" "NA" 
#> 
#> $temperature_NA
#> [1] "!NA" "NA"
map(dfs, class)
#> $wind
#> [1] "numeric"
#> 
#> $temperature
#> [1] "numeric"
#> 
#> $wind_NA
#> [1] "shadow" "factor"
#> 
#> $temperature_NA
#> [1] "shadow" "factor"
is_shadow(dfs)
#> [1] TRUE
are_shadow(dfs)
#>           wind    temperature        wind_NA temperature_NA 
#>          FALSE          FALSE           TRUE           TRUE
any_shadow(dfs)
#> Error in any_shadow(dfs): could not find function "any_shadow"
map(dfs, class)
#> $wind
#> [1] "numeric"
#> 
#> $temperature
#> [1] "numeric"
#> 
#> $wind_NA
#> [1] "shadow" "factor"
#> 
#> $temperature_NA
#> [1] "shadow" "factor"
class(dfs)
#> [1] "shadow"     "tbl_df"     "tbl"        "data.frame"

dfs_special <- dfs %>%
  recode_shadow(temperature = .where(wind == -99 ~ "bananas"))

dfs_special
#> # A tibble: 3 x 4
#>    wind temperature wind_NA temperature_NA
#>   <dbl>       <dbl> <fct>   <fct>         
#> 1   -99          45 !NA     NA_bananas    
#> 2    68          NA !NA     NA            
#> 3    72          25 !NA     !NA

map(dfs_special, levels)
#> $wind
#> NULL
#> 
#> $temperature
#> NULL
#> 
#> $wind_NA
#> [1] "!NA"        "NA"         "NA_bananas"
#> 
#> $temperature_NA
#> [1] "!NA"        "NA"         "NA_bananas"

Created on 2018-06-14 by the reprex package (v0.2.0).

Current tasks:

  • Explore how this works with grouped data
  • Options for performing this many times - scoped variants or other options.
  • Demonstrate usage in a larger dataset
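
For the "many times" task, one option worth sketching is a named vector of rules applied in a loop. This is plain base R, not recode_shadow() or a scoped variant. The `rules` vector, the -98 sentinel, and both suffixes are assumptions invented for illustration.

```r
df <- data.frame(wind = c(-99, 68, 72, -98))

# Hypothetical rules: sentinel value (as a name) -> suffix for the shadow level
rules <- c("-99" = "broken_sensor", "-98" = "not_recorded")

# Usual two-level shadow first
shadow_wind <- ifelse(is.na(df$wind), "NA", "!NA")

# Apply each rule in turn, overwriting the shadow where the sentinel matches
for (sentinel in names(rules)) {
  hit <- !is.na(df$wind) & df$wind == as.numeric(sentinel)
  shadow_wind[hit] <- paste0("NA_", rules[[sentinel]])
}

df$wind_NA <- factor(
  shadow_wind,
  levels = c("!NA", "NA", paste0("NA_", unname(rules)))
)
df
```

Keeping the rules in one object would also give a natural place to hang the description of each mechanism.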
@njtierney

Owner

njtierney commented Jun 24, 2018

See this SO question for an idea on one practical implementation / statement of need
