Useful missing data data structure and visualisation #165

Open
njtierney opened this Issue May 22, 2018 · 3 comments

njtierney commented May 22, 2018

library(tidyverse)
library(naniar)

which_are_shadow <- function(data) which(are_shadow(data))

aq_shadow_gather <- airquality %>%
  bind_shadow() %>%
  gather(key = "key",
         value = "value",
         -which_are_shadow(.)) %>%
  select(key, value, everything()) %>%
  gather(key = "key_NA",
         value = "value_NA",
         which_are_shadow(.))

aq_shadow_gather %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_density(alpha = 0.5) + 
  facet_grid(key~key_NA,
             scales = "free",
             switch = "y")
#> Warning: Removed 264 rows containing non-finite values (stat_density).

# and now only showing the variables that contain missings

aq_shadow_gather <- airquality %>%
  bind_shadow(only_miss = TRUE) %>%
  gather(key = "key",
         value = "value",
         -which_are_shadow(.)) %>%
  select(key, value, everything()) %>%
  gather(key = "key_NA",
         value = "value_NA",
         which_are_shadow(.))

aq_shadow_gather %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_density(alpha = 0.5) + 
  facet_grid(key~key_NA,
             scales = "free",
             switch = "y")
#> Warning: Removed 88 rows containing non-finite values (stat_density).

Created on 2018-05-23 by the reprex package (v0.2.0).
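
One possible reading of the two warning counts above, offered as a sketch rather than something confirmed in this thread: every missing value that survives the first gather() is repeated once per shadow column by the second gather(). This can be checked with base R and the built-in airquality data, without naniar:

```r
# Count missing values in the built-in airquality data (base R only).
n_missing <- sum(is.na(airquality))                       # NAs across all columns
n_cols <- ncol(airquality)                                # one shadow column per variable
n_cols_with_miss <- sum(colSums(is.na(airquality)) > 0)   # shadow columns kept by only_miss = TRUE

# The first gather() keeps each NA as one row; the second gather() repeats
# each of those rows once per shadow column.
rows_removed_all <- n_missing * n_cols
rows_removed_only_miss <- n_missing * n_cols_with_miss

rows_removed_all        # 264
rows_removed_only_miss  # 88
```

These match the 264 and 88 rows reported by stat_density, which suggests the long format repeats the same values across every shadow column.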

@njtierney njtierney added this to the V0.4.0 milestone Jun 5, 2018

njtierney commented Aug 13, 2018

library(tidyverse)
library(naniar)
shadow_gather <- function(shadow_data){
  
  shadow_data %>%
    tidyr::gather(key = "variable",
                  value = "value",
                  -which_are_shadow(.)) %>%
    tidyr::gather(key = "variable_NA",
                  value = "value_NA",
                  which_are_shadow(.))
}

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_gather(ocean_imp_mean)

gathered_ocean_imp_mean
#> # A tibble: 17,664 x 4
#>    variable value variable_NA   value_NA
#>    <chr>    <dbl> <chr>         <chr>   
#>  1 year      1997 sea_temp_c_NA !NA     
#>  2 year      1997 sea_temp_c_NA !NA     
#>  3 year      1997 sea_temp_c_NA !NA     
#>  4 year      1997 sea_temp_c_NA !NA     
#>  5 year      1997 sea_temp_c_NA !NA     
#>  6 year      1997 sea_temp_c_NA !NA     
#>  7 year      1997 sea_temp_c_NA !NA     
#>  8 year      1997 sea_temp_c_NA !NA     
#>  9 year      1997 sea_temp_c_NA !NA     
#> 10 year      1997 sea_temp_c_NA !NA     
#> # ... with 17,654 more rows

ggplot(gathered_ocean_imp_mean,
       aes(x = value,
           fill = value_NA)) + 
  geom_histogram() +
  facet_grid(variable ~ variable_NA,
             scales = "free_x",
             switch = "y")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).

Some notes on implementation

Naming

The function name should be gather_shadow. That function already exists, but is rarely used. The new class system defined in #189 would be very helpful in getting around this, which brings us to the next point.

Methods

gather_shadow should have nabular, data.frame, and shadow methods.
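
A minimal sketch of what that dispatch could look like with base R S3. The generic name gather_shadow_sketch, the class vectors, and the return values are all assumptions for illustration; none of this is naniar's actual code or the class system from #189:

```r
# Hypothetical generic; the name is illustrative and not part of naniar.
gather_shadow_sketch <- function(data, ...) {
  UseMethod("gather_shadow_sketch")
}

# Fallback for plain data frames, where nothing is known about shadow columns.
gather_shadow_sketch.data.frame <- function(data, ...) {
  "data.frame method"
}

# Data made up only of shadow columns.
gather_shadow_sketch.shadow <- function(data, ...) {
  "shadow method"
}

# Data with values and shadow columns side by side.
gather_shadow_sketch.nabular <- function(data, ...) {
  "nabular method"
}

# Hand-built objects; the class vectors are assumed, not taken from naniar.
plain <- data.frame(x = c(1, NA))
shadow_only <- structure(data.frame(x_NA = c("!NA", "NA")),
                         class = c("shadow", "data.frame"))
nab <- structure(data.frame(x = c(1, NA), x_NA = c("!NA", "NA")),
                 class = c("nabular", "data.frame"))

gather_shadow_sketch(plain)
gather_shadow_sketch(shadow_only)
gather_shadow_sketch(nab)
```

Each method only returns a label here, so the sketch shows which method is picked and nothing about how the gathering itself would differ.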

Options for extra variables

There should be options to leave certain variables in the data frame untouched, for example the any_missing column that is created by add_label_shadow. This would involve accepting ..., quoting that input, and adding it to the end of the gather statements.
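
A rough base R sketch of that idea, under stated assumptions: the helper name shadow_long_keep is made up, shadow columns are detected by an "_NA" suffix, and the toy data is hand-built rather than produced by bind_shadow() or add_label_shadow(). It is not how naniar implements this:

```r
# Hypothetical helper, not naniar's implementation: gathers value columns and
# their "_NA" shadow columns into long form, leaving columns named in `...`
# untouched.
shadow_long_keep <- function(data, ...) {
  # Quote the bare column names passed through `...`.
  keep <- vapply(as.list(substitute(list(...)))[-1], deparse, character(1))

  is_shadow <- grepl("_NA$", names(data))
  shadow_cols <- setdiff(names(data)[is_shadow], keep)
  value_cols <- setdiff(names(data)[!is_shadow], keep)

  # One block of rows per (value column, shadow column) pair, mirroring the
  # two chained gather() calls.
  pairs <- expand.grid(variable = value_cols, variable_NA = shadow_cols,
                       stringsAsFactors = FALSE)
  blocks <- lapply(seq_len(nrow(pairs)), function(i) {
    out <- data.frame(
      variable = pairs$variable[i],
      value = data[[pairs$variable[i]]],
      variable_NA = pairs$variable_NA[i],
      value_NA = as.character(data[[pairs$variable_NA[i]]]),
      stringsAsFactors = FALSE
    )
    # Carry the untouched columns along at the end of each block.
    for (k in keep) out[[k]] <- data[[k]]
    out
  })
  do.call(rbind, blocks)
}

# Hand-built stand-in for shadow-bound data with a label column.
toy <- data.frame(
  x = c(1, NA, 3),
  y = c(NA, 5, 6),
  x_NA = c("!NA", "NA", "!NA"),
  y_NA = c("NA", "!NA", "!NA"),
  any_missing = c("Missing", "Missing", "Not Missing"),
  stringsAsFactors = FALSE
)

long <- shadow_long_keep(toy, any_missing)
long
```

Without any_missing in `...`, that column would be treated as just another value column and gathered with x and y.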

Notes on the visualisation method

I spent a while trying NOT to use facet_grid, but you need it: otherwise the different datasets get combined.

This smells like a bit of a leaky abstraction.

There should be a nice way to get only the variables and their imputed values into shape for this kind of visualisation. This means getting the visualisations on the diagonal, by filtering to rows where variable_NA is the shadow of variable (variable_NA == paste0(variable, "_NA")).

Some work so far on this:

gathered_ocean_imp_mean %>%
  filter(variable %in% c("air_temp_c",
                         "humidity",
                         "sea_temp_c")) %>%
  # build the name of each variable's own shadow column
  mutate(temp = paste0(variable, "_NA")) %>%
  # keep only the diagonal: a variable paired with its own shadow
  filter(variable_NA == temp)

njtierney added a commit that referenced this issue Aug 13, 2018

njtierney commented Aug 13, 2018

OK, so here is the progress on this:

library(tidyverse)
library(naniar)

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_long(ocean_imp_mean)

gathered_ocean_imp_mean %>%
  filter(variable %in% c("air_temp_c",
                         "humidity",
                         "sea_temp_c")) %>%
  filter(variable_NA == paste0(variable,"_NA")) %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_histogram() +
  facet_wrap(~variable_NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).

I think that the abstraction here would be to specify the variables that you want to focus on, and have the data filtered down to those.
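
A minimal base R sketch of that abstraction, assuming long-form data with variable and variable_NA columns as above. The helper name filter_own_shadow and its focus argument are hypothetical and are not part of shadow_long():

```r
# Hypothetical helper, not naniar's shadow_long(): keeps only the rows where a
# variable is paired with its own shadow, optionally restricted to `focus`.
filter_own_shadow <- function(long_data, focus = NULL) {
  on_diagonal <- long_data$variable_NA == paste0(long_data$variable, "_NA")
  if (!is.null(focus)) {
    on_diagonal <- on_diagonal & long_data$variable %in% focus
  }
  long_data[on_diagonal, , drop = FALSE]
}

# Toy long-form data: every variable crossed with every shadow column.
long_toy <- expand.grid(
  variable = c("air_temp_c", "humidity", "year"),
  variable_NA = c("air_temp_c_NA", "humidity_NA"),
  stringsAsFactors = FALSE
)
long_toy$value <- seq_len(nrow(long_toy))
long_toy$value_NA <- "!NA"

diag_all <- filter_own_shadow(long_toy)
diag_humidity <- filter_own_shadow(long_toy, focus = "humidity")

diag_all
diag_humidity
```

In this sketch the diagonal filter alone already drops variables that have no shadow column (year here), so focus only narrows the result further.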


njtierney commented Aug 13, 2018

Actually, I just added that filtering step to the shadow_long function. This abstracts away a nice chunk of the code:

library(tidyverse)
library(naniar)

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_long(ocean_imp_mean)

gathered_ocean_imp_mean
#> # A tibble: 17,664 x 4
#>    variable value variable_NA   value_NA
#>    <chr>    <dbl> <chr>         <chr>   
#>  1 year      1997 sea_temp_c_NA !NA     
#>  2 year      1997 sea_temp_c_NA !NA     
#>  3 year      1997 sea_temp_c_NA !NA     
#>  4 year      1997 sea_temp_c_NA !NA     
#>  5 year      1997 sea_temp_c_NA !NA     
#>  6 year      1997 sea_temp_c_NA !NA     
#>  7 year      1997 sea_temp_c_NA !NA     
#>  8 year      1997 sea_temp_c_NA !NA     
#>  9 year      1997 sea_temp_c_NA !NA     
#> 10 year      1997 sea_temp_c_NA !NA     
#> # ... with 17,654 more rows

gathered_ocean_imp_mean %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_histogram() +
  facet_wrap(~variable_NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).


@njtierney njtierney modified the milestones: V0.4.0, V0.5.0 Sep 3, 2018
