Useful missing data data structure and visualisation #165

njtierney · 2018-05-22T22:05:17Z

library(tidyverse)
library(naniar)

which_are_shadow <- function(data) which(are_shadow(data))

aq_shadow_gather <- airquality %>%
  bind_shadow() %>%
  gather(key = "key",
         value = "value",
         -which_are_shadow(.)) %>%
  select(key, value, everything()) %>%
  gather(key = "key_NA",
         value = "value_NA",
         which_are_shadow(.))

aq_shadow_gather %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_density(alpha = 0.5) + 
  facet_grid(key~key_NA,
             scales = "free",
             switch = "y")
#> Warning: Removed 264 rows containing non-finite values (stat_density).

# and now only showing the variables that contain missings

aq_shadow_gather <- airquality %>%
  bind_shadow(only_miss = TRUE) %>%
  gather(key = "key",
         value = "value",
         -which_are_shadow(.)) %>%
  select(key, value, everything()) %>%
  gather(key = "key_NA",
         value = "value_NA",
         which_are_shadow(.))

aq_shadow_gather %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_density(alpha = 0.5) + 
  facet_grid(key~key_NA,
             scales = "free",
             switch = "y")
#> Warning: Removed 88 rows containing non-finite values (stat_density).

Created on 2018-05-23 by the reprex package (v0.2.0).

njtierney · 2018-08-13T02:20:01Z

library(tidyverse)
library(naniar)
shadow_gather <- function(shadow_data){
  
  shadow_data %>%
    tidyr::gather(key = "variable",
                  value = "value",
                  -which_are_shadow(.)) %>%
    tidyr::gather(key = "variable_NA",
                  value = "value_NA",
                  which_are_shadow(.))
}

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_gather(ocean_imp_mean)

gathered_ocean_imp_mean
#> # A tibble: 17,664 x 4
#>    variable value variable_NA   value_NA
#>    <chr>    <dbl> <chr>         <chr>   
#>  1 year      1997 sea_temp_c_NA !NA     
#>  2 year      1997 sea_temp_c_NA !NA     
#>  3 year      1997 sea_temp_c_NA !NA     
#>  4 year      1997 sea_temp_c_NA !NA     
#>  5 year      1997 sea_temp_c_NA !NA     
#>  6 year      1997 sea_temp_c_NA !NA     
#>  7 year      1997 sea_temp_c_NA !NA     
#>  8 year      1997 sea_temp_c_NA !NA     
#>  9 year      1997 sea_temp_c_NA !NA     
#> 10 year      1997 sea_temp_c_NA !NA     
#> # ... with 17,654 more rows

ggplot(gathered_ocean_imp_mean,
       aes(x = value,
           fill = value_NA)) + 
  geom_histogram() +
  facet_grid(variable ~ variable_NA,
             scales = "free_x",
             switch = "y")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).

Some notes on implementation

Naming

The function should be named gather_shadow. A function with that name already exists, but it is rarely used. The new class system defined in #189 would help work around this, which brings us to the next point.

Methods

gather_shadow should have nabular, data.frame, and shadow methods.

Options for extra variables

There should be an option to leave certain variables in the data frame untouched, for example the any_missing column created by add_label_shadow. This would involve accepting ..., quoting that input, and adding it to the end of the gather statements.
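One way the "leave some columns untouched" idea could look is sketched below. This is only an illustration, not naniar's implementation: gather_shadow_keep is a hypothetical name, and the sketch assumes shadow columns can be identified by their _NA suffix instead of by class.

```r
library(dplyr)
library(tidyr)
library(naniar)

# Hypothetical helper, not part of naniar: gather a nabular data frame into
# long form while leaving the columns selected in `...` untouched.
gather_shadow_keep <- function(shadow_data, ...) {
  # Resolve the tidyselect input in `...` to plain column names
  keep <- names(dplyr::select(shadow_data, ...))
  # Assumption: shadow columns are the ones whose names end in "_NA"
  shadow_cols <- setdiff(grep("_NA$", names(shadow_data), value = TRUE), keep)
  data_cols <- setdiff(names(shadow_data), c(shadow_cols, keep))

  shadow_data %>%
    tidyr::gather(key = "variable",
                  value = "value",
                  all_of(data_cols)) %>%
    tidyr::gather(key = "variable_NA",
                  value = "value_NA",
                  all_of(shadow_cols))
}

# any_missing is carried along unchanged on every row
aq_long <- airquality %>%
  bind_shadow(only_miss = TRUE) %>%
  add_label_shadow() %>%
  gather_shadow_keep(any_missing)
```

Resolving ... to names up front keeps the two gather calls simple, at the cost of relying on the naming convention to find the shadow columns.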

Notes on the visualisation method

I spent a while trying to NOT use facet_grid - but you need to, otherwise you combine the different datasets.

This smells like a bit of a leaky abstraction.

There should be a nice way to get only the variables and their imputed values into shape for this kind of visualisation. This means getting the visualisations on the diagonal, by filtering to rows where variable_NA is the shadow of variable (that is, variable_NA equals variable with an _NA suffix).

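Some work so far on this:

gathered_ocean_imp_mean %>%
  filter(variable %in% c("air_temp_c",
                         "humidity",
                         "sea_temp_c")) %>%
  # build the name of each variable's own shadow column
  mutate(temp = paste0(variable, "_NA")) %>%
  # keep only rows where the variable is paired with its own shadow
  filter(variable_NA == temp)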


njtierney · 2018-08-13T05:12:29Z

OK so here is the progress on this:

library(tidyverse)
library(naniar)

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_long(ocean_imp_mean)

gathered_ocean_imp_mean %>%
  filter(variable %in% c("air_temp_c",
                         "humidity",
                         "sea_temp_c")) %>%
  filter(variable_NA == paste0(variable,"_NA")) %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_histogram() +
  facet_wrap(~variable_NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).

I think the abstraction here would be to specify the variables you want to focus on, and the long data would then be filtered down to those variables and their own shadows.
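A minimal sketch of that abstraction, under some assumptions: focus_shadow_long is a hypothetical helper (not a naniar function and not necessarily how shadow_long does it), and the long data is assumed to have the variable, value, variable_NA and value_NA columns used above. The example uses airquality so it stands on its own.

```r
library(dplyr)
library(tidyr)
library(naniar)

# Hypothetical helper: keep only the "diagonal" rows, where a variable is
# paired with its own shadow, optionally restricted to the names in `vars`.
focus_shadow_long <- function(long_data, vars = NULL) {
  out <- dplyr::filter(long_data, variable_NA == paste0(variable, "_NA"))
  if (!is.null(vars)) {
    out <- dplyr::filter(out, variable %in% vars)
  }
  out
}

# Build the long form by hand; assumes shadow columns end in "_NA"
aq_long <- airquality %>%
  bind_shadow(only_miss = TRUE) %>%
  gather(key = "variable",
         value = "value",
         -ends_with("_NA")) %>%
  gather(key = "variable_NA",
         value = "value_NA",
         ends_with("_NA"))

aq_ozone <- focus_shadow_long(aq_long, vars = "Ozone")
```

With the diagonal filter inside the helper, the plotting code only needs facet_wrap(~variable_NA) and no longer has to know about the off-diagonal pairings.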

njtierney · 2018-08-13T05:15:24Z

Actually, I just added that filtering step to the shadow_long function. This abstracts away a nice chunk of the code:

library(tidyverse)
library(naniar)

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_long(ocean_imp_mean)

gathered_ocean_imp_mean
#> # A tibble: 17,664 x 4
#>    variable value variable_NA   value_NA
#>    <chr>    <dbl> <chr>         <chr>   
#>  1 year      1997 sea_temp_c_NA !NA     
#>  2 year      1997 sea_temp_c_NA !NA     
#>  3 year      1997 sea_temp_c_NA !NA     
#>  4 year      1997 sea_temp_c_NA !NA     
#>  5 year      1997 sea_temp_c_NA !NA     
#>  6 year      1997 sea_temp_c_NA !NA     
#>  7 year      1997 sea_temp_c_NA !NA     
#>  8 year      1997 sea_temp_c_NA !NA     
#>  9 year      1997 sea_temp_c_NA !NA     
#> 10 year      1997 sea_temp_c_NA !NA     
#> # ... with 17,654 more rows

gathered_ocean_imp_mean %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_histogram() +
  facet_wrap(~variable_NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).

njtierney added this to the V0.4.0 milestone Jun 5, 2018

njtierney added a commit that referenced this issue Aug 13, 2018

add which_are_shadow, to help implement long form exploration of impu…

a6164ae

…tations - see #165

njtierney added the visualisation label Aug 21, 2018

njtierney modified the milestones: V0.4.0, V0.5.0 Sep 3, 2018

njtierney modified the milestones: V0.5.0, V0.6.0 Oct 30, 2019

njtierney removed this from the V0.6.0 milestone Oct 14, 2022

njtierney added this to the V0.8.0 milestone Apr 10, 2023

njtierney added the documentation label Apr 23, 2023

njtierney modified the milestones: V1.2.0, V1.3.0 Apr 23, 2023
