create other flavours of missing values #50

njtierney · 2017-06-04T14:03:09Z

Building on issues #25 and #31, and discussions with @rgayler, there needs to be a way to create different flavours of missing values to indicate different mechanisms.

An example of this could be a weather station that records -99 for a missing value, where the value is missing specifically because the weather was so cold that the instruments stopped working.

Currently in R there is only one kind of NA value (ignoring NA_integer_ ... and friends).

So there needs to be a way to specify your own missing value NA_this (or something).

This might be a function like tidyr::replace_na, perhaps instead called replace_na_why or something.

This might look like

data %>%
  replace_na_why(.condition = var == -99,
                 .why = "weather station too cold",
                 .suffix = "TC")

This would then create a value NA_TC, which then has a mechanism recorded.

Since R does not treat these as missing, we would incorporate this into the shadow matrix values

!NA, NA, and NA_.why
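A minimal base R sketch of what such a shadow column could contain. It does not use any existing package function; the `temperature` vector and the `NA_TC` level are assumptions made up for illustration, following the -99 example above.

```r
# Hypothetical data: -99 is the station's code for "too cold to record"
temperature <- c(12.5, -99, NA, 8.1)

# Start with the usual two-state shadow: "!NA" (present) or "NA" (missing)
shadow <- ifelse(is.na(temperature), "NA", "!NA")

# Flag the special code as its own flavour of missing
shadow[!is.na(temperature) & temperature == -99] <- "NA_TC"

# Store the shadow as a factor so the set of flavours is explicit
temperature_NA <- factor(shadow, levels = c("!NA", "NA", "NA_TC"))

temperature_NA
```

The data vector itself is left untouched here; only the shadow records the mechanism.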


njtierney · 2017-06-07T02:39:58Z

Another name for the function is replace_na_type

This could have the arguments:

replace_na_type(.vars/.cols, # similar to `mutate`
                .predicate,
                .funs)

It might also need to follow the format of purrr::pmap, where you provide a named list, which would contain the variables/columns, and the rules for each of those.
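As a rough illustration of the named-list idea, here is a base R sketch in which `Map()` stands in for `purrr::pmap`. The `rules` list, the -999 code, and the example data frame are all hypothetical; this is not a proposed signature.

```r
# Hypothetical data with a different special code in each column
df <- data.frame(wind = c(-99, 68, 72), temperature = c(45, -999, 25))

# Hypothetical rule list: names are columns, values are predicates
# that flag the special missing codes for that column
rules <- list(
  wind = function(x) x == -99,
  temperature = function(x) x == -999
)

# Apply each rule to its own column; the result is one logical vector per column
flagged <- Map(function(rule, column) rule(column), rules, df[names(rules)])

flagged
```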

Alternatively, since this isn't really doing any modification in place, but is instead adding things to the shadow dataframe, it might be more sensible to have different verbs for that process:

add_shadow or
mutate_na / mutate_na_type or
replace_shadow / replace_shadow_type / replace_shadow_with

rgayler · 2017-06-07T02:50:49Z

If you haven't already, you should probably have a look at how haven deals with flavours of missing in order to stay compatible with the tidyverse. Ross


njtierney · 2017-06-07T02:53:10Z

Great suggestion, thanks @rgayler ! :)

njtierney · 2017-06-22T00:16:59Z

OK, so haven's tagged_na function is pretty slick

library(haven)
x <- c(1:5, tagged_na("a"), tagged_na("z"), NA)

# Tagged NA's work identically to regular NAs
x
#> [1]  1  2  3  4  5 NA NA NA
is.na(x)
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

# To see that they're special, you need to use na_tag(),
# is_tagged_na(), or print_tagged_na():
is_tagged_na(x)
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
na_tag(x)
#> [1] NA  NA  NA  NA  NA  "a" "z" NA
print_tagged_na(x)
#> [1]     1     2     3     4     5 NA(a) NA(z)    NA

# You can test for specific tagged NAs with the second argument
is_tagged_na(x, "a")
#> [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE

# Because the support for tagged NAs is somewhat tagged on to R,
# the left-most NA will tend to be preserved in arithmetic operations.
na_tag(tagged_na("a") + tagged_na("z"))
#> [1] "a"

I need to more carefully think about the implementation of this system, and how it fits into shadow values, so I'm going to change this to be on the next release for Narnia.

One of the main goals with Narnia is to clearly expose these sorts of values to the user, so I am unsure whether hiding attributes of an NA value is ideal here, although I do see the benefits with other features of R.

mpadge · 2017-08-08T13:57:38Z

Another related thought that I've been wondering about is handling sparse Matrix::dsparseMatrix objects. They're the definitive object packed full of missing values which are simply ... missing. Just for you to ponder down the track ...

njtierney · 2017-08-08T14:02:45Z

Just a note to look into memisc::missing.values() (as suggested by @leeper)

njtierney · 2018-01-12T05:56:47Z

Just adding some of the relevant components of #76 into here:

Extending the ideas from #76 to shadow values allows us to directly specify the different flavours of missings, providing the verbs:

replace_shadow
replace_shadow_where
replace_shadow_if
replace_shadow_at

This code would then alter only the shadow matrix and leave the data as is, allowing us to leverage other features of the shadow matrix, and possibly to add an additional factor level to it that describes the missingness mechanism (!NA, NA, NA_, NA_)

There needs to be a way to store the "codebook / data dictionary" of missingness mechanisms, so that the user has a way to look up / describe what a value like NA_rainfall really means.
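One possible shape for such a codebook, sketched in base R as a plain lookup table. The `na_codebook` data frame, the `describe_na()` helper, and the "rain gauge overflowed" description are hypothetical and only meant to show the lookup.

```r
# Hypothetical codebook: one row per special missing value
na_codebook <- data.frame(
  level = c("NA_TC", "NA_rainfall"),
  why = c("weather station too cold", "rain gauge overflowed"),
  stringsAsFactors = FALSE
)

# Look up the description for shadow levels;
# plain "NA" and "!NA" have no codebook entry, so they return NA
describe_na <- function(level, codebook = na_codebook) {
  codebook$why[match(level, codebook$level)]
}

described <- describe_na(c("NA_TC", "NA", "NA_rainfall"))
described
```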

There are currently ways in haven to store the "different" values of missingness using tagged_na(), but I don't really like hiding important features from the user, and here I think that the shadow matrix can be more practical.

One approach I like so far could be something like this:

data %>%
  replace_shadow_where(.funs = ~.x == -99,
                       .why = "weather station too cold",
                       .suffix = "TC")

An additional idea then is to make the miss_ functions have special behaviour for the _NA columns, so that gg_miss_var(data) could provide the summary of the number of pure NA values, and then also the number of values that are coded as NA_<reason_1>. Likewise, miss_var_summary might also have some specific shadow summaries.
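A base R sketch of the kind of count such a summary could report for one shadow column. The `temperature_NA` factor and its `NA_TC` level are made up for illustration; this is not the behaviour of any existing miss_ function.

```r
# Hypothetical shadow column with one special missing flavour
temperature_NA <- factor(
  c("!NA", "NA", "NA_TC", "NA_TC", "!NA"),
  levels = c("!NA", "NA", "NA_TC")
)

# summary() on a factor gives a named count per level
shadow_counts <- summary(temperature_NA)

# Drop the non-missing level to leave only the flavours of missing
missing_counts <- shadow_counts[names(shadow_counts) != "!NA"]

missing_counts
```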

njtierney · 2018-03-16T06:39:10Z

Current progress:

library(naniar)
library(tidyverse)

df <- tribble(
  ~wind, ~temperature,
  -99,    45,
   68,    NA,
   72,    25
)

dfs <- bind_shadow(df)

map(dfs, levels)
#> $wind
#> NULL
#> 
#> $temperature
#> NULL
#> 
#> $wind_NA
#> [1] "!NA" "NA" 
#> 
#> $temperature_NA
#> [1] "!NA" "NA"

dfs_special <- recode_shadow(dfs,
                             temperature = .where(wind == -99 ~ "bananas"))

dfs_special
#> # A tibble: 3 x 4
#>    wind temperature wind_NA temperature_NA
#>   <dbl>       <dbl> <fct>   <fct>         
#> 1  -99.         45. !NA     NA_bananas    
#> 2   68.         NA  !NA     NA            
#> 3   72.         25. !NA     !NA

map(dfs_special, levels)
#> $wind
#> NULL
#> 
#> $temperature
#> NULL
#> 
#> $wind_NA
#> [1] "!NA"        "NA"         "NA_bananas"
#> 
#> $temperature_NA
#> [1] "!NA"        "NA"         "NA_bananas"

Created on 2018-03-16 by the reprex package (v0.2.0).

njtierney · 2018-03-16T06:39:55Z

Next steps:

make it work with >1 statements
test how this works inside a grouped_df

mpadge · 2018-03-16T09:14:23Z

@njtierney just a thought for ya, but haven has a really nice labelled class that is very useful beyond the standard remit of the package itself. Any one NA could then contain concurrent info about all possible types of NA.

njtierney · 2018-03-25T23:13:38Z

Thanks @mpadge ! That is something I have been thinking about, but at the moment I am sticking with the idea of expanding out the dataframe into the data and shadows. In the future I am interested in looking at collapsing things back down in the dataframe.

On the note of labelled features, here are some packages that work with them (for future reference to myself)

https://github.com/larmarange/labelled

https://github.com/rubenarslan/codebook

njtierney · 2018-06-04T01:52:43Z

Hola @caitlinhudon - tagging you in here from discussion on twitter from your awesome talk

njtierney · 2018-06-14T05:10:47Z

Here is the section on labelled missing data:

https://larmarange.github.io/labelled/articles/intro_labelled.html#user-defined-missing-values-spsss-style

This looks like a nicely scoped out idea, which is great! But I think I want to take my idea for recode_shadow to some maturity before I compare the two.

njtierney · 2018-06-14T06:07:07Z

OK just dusted off the "special-missing" branch, I get the same output as before:

library(naniar)
library(tidyverse)

df <- tribble(
  ~wind, ~temperature,
  -99,    45,
  68,    NA,
  72,    25
)

dfs <- bind_shadow(df)

map(dfs, levels)
#> $wind
#> NULL
#> 
#> $temperature
#> NULL
#> 
#> $wind_NA
#> [1] "!NA" "NA" 
#> 
#> $temperature_NA
#> [1] "!NA" "NA"
map(dfs, class)
#> $wind
#> [1] "numeric"
#> 
#> $temperature
#> [1] "numeric"
#> 
#> $wind_NA
#> [1] "shadow" "factor"
#> 
#> $temperature_NA
#> [1] "shadow" "factor"
is_shadow(dfs)
#> [1] TRUE
are_shadow(dfs)
#>           wind    temperature        wind_NA temperature_NA 
#>          FALSE          FALSE           TRUE           TRUE
any_shadow(dfs)
#> Error in any_shadow(dfs): could not find function "any_shadow"
map(dfs, class)
#> $wind
#> [1] "numeric"
#> 
#> $temperature
#> [1] "numeric"
#> 
#> $wind_NA
#> [1] "shadow" "factor"
#> 
#> $temperature_NA
#> [1] "shadow" "factor"
class(dfs)
#> [1] "shadow"     "tbl_df"     "tbl"        "data.frame"

dfs_special <- dfs %>%
  recode_shadow(temperature = .where(wind == -99 ~ "bananas"))

dfs_special
#> # A tibble: 3 x 4
#>    wind temperature wind_NA temperature_NA
#>   <dbl>       <dbl> <fct>   <fct>         
#> 1   -99          45 !NA     NA_bananas    
#> 2    68          NA !NA     NA            
#> 3    72          25 !NA     !NA

map(dfs_special, levels)
#> $wind
#> NULL
#> 
#> $temperature
#> NULL
#> 
#> $wind_NA
#> [1] "!NA"        "NA"         "NA_bananas"
#> 
#> $temperature_NA
#> [1] "!NA"        "NA"         "NA_bananas"

Created on 2018-06-14 by the reprex package (v0.2.0).

Current tasks:

Explore how this works with grouped data
Options for performing this many times - scoped variants or other options.
Demonstrate usage in a larger dataset

njtierney · 2018-06-24T14:06:42Z

See this SO question for an idea on one practical implementation / statement of need

njtierney added the V0.2.0 label Jun 21, 2017

njtierney added V0.2.0 and removed V0.1.0 labels Jun 22, 2017

njtierney mentioned this issue Jul 24, 2017

Replacing selected values as NA #76

Closed

njtierney mentioned this issue Oct 12, 2017

extend naniar package for multiple types of missing value ropensci/ozunconf17#8

Closed

njtierney mentioned this issue Jan 9, 2018

revealers, helpers to clear up common / other representations of missing values #25

Closed

njtierney mentioned this issue Jan 19, 2018

bind_shadow should only add columns for variables that have missing values #106

Closed

njtierney added V0.3.0 and removed V0.2.0 labels Jan 25, 2018

njtierney mentioned this issue Mar 22, 2018

add unbind_shadow and unbind_data #142

Closed

njtierney modified the milestones: V0.3.0, V0.4.0 Jun 5, 2018

njtierney removed the V0.3.0 label Jun 5, 2018

njtierney mentioned this issue Jun 14, 2018

Any thoughts on handling "censored" values #67

Closed

njtierney mentioned this issue Aug 12, 2018

Implement new classes to assist with special missing values / shadow matrix / nabular data #189

Closed

njtierney mentioned this issue Aug 20, 2018

Implement recode_shadow, which allows for special missing values #202

Merged

njtierney closed this as completed in #202 Aug 20, 2018
