Useful missing data data structure and visualisation #165

Open
njtierney opened this issue May 22, 2018 · 3 comments

Comments

@njtierney

library(tidyverse)
library(naniar)

which_are_shadow <- function(data) which(are_shadow(data))

aq_shadow_gather <- airquality %>%
  bind_shadow() %>%
  gather(key = "key",
         value = "value",
         -which_are_shadow(.)) %>%
  select(key, value, everything()) %>%
  gather(key = "key_NA",
         value = "value_NA",
         which_are_shadow(.))

aq_shadow_gather %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_density(alpha = 0.5) + 
  facet_grid(key~key_NA,
             scales = "free",
             switch = "y")
#> Warning: Removed 264 rows containing non-finite values (stat_density).

# and now only showing the variables that contain missings

aq_shadow_gather <- airquality %>%
  bind_shadow(only_miss = TRUE) %>%
  gather(key = "key",
         value = "value",
         -which_are_shadow(.)) %>%
  select(key, value, everything()) %>%
  gather(key = "key_NA",
         value = "value_NA",
         which_are_shadow(.))

aq_shadow_gather %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_density(alpha = 0.5) + 
  facet_grid(key~key_NA,
             scales = "free",
             switch = "y")
#> Warning: Removed 88 rows containing non-finite values (stat_density).

Created on 2018-05-23 by the reprex package (v0.2.0).

@njtierney njtierney added this to the V0.4.0 milestone Jun 5, 2018
@njtierney

library(tidyverse)
library(naniar)
shadow_gather <- function(shadow_data){
  
  shadow_data %>%
    tidyr::gather(key = "variable",
                  value = "value",
                  -which_are_shadow(.)) %>%
    tidyr::gather(key = "variable_NA",
                  value = "value_NA",
                  which_are_shadow(.))
}

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_gather(ocean_imp_mean)

gathered_ocean_imp_mean
#> # A tibble: 17,664 x 4
#>    variable value variable_NA   value_NA
#>    <chr>    <dbl> <chr>         <chr>   
#>  1 year      1997 sea_temp_c_NA !NA     
#>  2 year      1997 sea_temp_c_NA !NA     
#>  3 year      1997 sea_temp_c_NA !NA     
#>  4 year      1997 sea_temp_c_NA !NA     
#>  5 year      1997 sea_temp_c_NA !NA     
#>  6 year      1997 sea_temp_c_NA !NA     
#>  7 year      1997 sea_temp_c_NA !NA     
#>  8 year      1997 sea_temp_c_NA !NA     
#>  9 year      1997 sea_temp_c_NA !NA     
#> 10 year      1997 sea_temp_c_NA !NA     
#> # ... with 17,654 more rows

ggplot(gathered_ocean_imp_mean,
       aes(x = value,
           fill = value_NA)) + 
  geom_histogram() +
  facet_grid(variable ~ variable_NA,
             scales = "free_x",
             switch = "y")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).

Some notes on implementation

Naming

The function name should be gather_shadow. A function with that name already exists, but it is rarely used. The new class system defined in #189 would help here, because the new behaviour could be added as methods on the existing name, which brings us to the next point.

Methods

gather_shadow should have nabular, data.frame, and shadow methods.
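A rough sketch of what that dispatch could look like, using base R S3 only. The generic name gather_shadow_sketch and the helper is_shadow_name are hypothetical, and treating any column ending in "_NA" as a shadow column is an assumption for illustration, not how naniar necessarily detects them:

```r
# Hypothetical generic; the name is illustrative and not part of naniar.
gather_shadow_sketch <- function(data, ...) {
  UseMethod("gather_shadow_sketch")
}

# Assumption: shadow columns are the ones whose names end in "_NA".
is_shadow_name <- function(nms) grepl("_NA$", nms)

# data.frame method: stack the shadow columns into long form,
# one row per (case, shadow variable).
gather_shadow_sketch.data.frame <- function(data, ...) {
  shadow_cols <- names(data)[is_shadow_name(names(data))]
  if (length(shadow_cols) == 0) {
    stop("no shadow columns found; bind a shadow first", call. = FALSE)
  }
  data.frame(
    case = rep(seq_len(nrow(data)), times = length(shadow_cols)),
    variable = rep(shadow_cols, each = nrow(data)),
    missing = unlist(lapply(data[shadow_cols], as.character),
                     use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# nabular and shadow methods: both already carry shadow columns,
# so in this sketch they simply fall through to the data.frame method.
gather_shadow_sketch.nabular <- function(data, ...) NextMethod()
gather_shadow_sketch.shadow <- function(data, ...) NextMethod()
```

Separate methods would leave room for each class to do its own preparation later (for example, a plain data.frame method could bind the shadow itself) without changing the function name users call.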

Options for extra variables

There should be options to leave certain variables in the dataframe untouched, for example the any_missing column that is created by add_label_shadow. This would involve accepting those variables through ..., quoting that input, and appending it to the column selection in each gather statement.
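A minimal sketch of the idea, written in base R rather than with gather so that it stands alone. The function name shadow_long_keep is hypothetical, the "_NA" suffix rule for spotting shadow columns is an assumption, and the quoting is done with substitute() instead of tidy evaluation:

```r
# Hypothetical helper; not naniar's API. Columns named in ... are quoted,
# excluded from both reshaping steps, and carried along unchanged.
shadow_long_keep <- function(data, ...) {
  # quote the bare names passed through ...
  keep <- as.character(substitute(list(...)))[-1L]
  nms <- names(data)
  stopifnot(all(keep %in% nms))

  # Assumption: shadow columns are those ending in "_NA".
  shadow_cols <- setdiff(nms[grepl("_NA$", nms)], keep)
  value_cols <- setdiff(nms, c(shadow_cols, keep))

  pieces <- list()
  # same row order as two successive gathers: shadows outermost
  for (s in shadow_cols) {
    for (v in value_cols) {
      out <- data.frame(
        variable = v,
        value = data[[v]],
        variable_NA = s,
        value_NA = as.character(data[[s]]),
        stringsAsFactors = FALSE
      )
      # append the untouched columns at the end
      for (k in keep) out[[k]] <- data[[k]]
      pieces[[length(pieces) + 1L]] <- out
    }
  }
  do.call(rbind, pieces)
}
```

Called as shadow_long_keep(data, any_missing), the any_missing column would be repeated alongside every (variable, variable_NA) pair instead of being gathered into value.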

Notes on the visualisation method

I spent a while trying to NOT use facet_grid - but you need to, otherwise you combine the different datasets.

This smells like a bit of a leaky abstraction.

There should be a nice way to get only the variables and their imputed values into shape for this kind of visualisation. This means keeping only the panels on the diagonal, that is, filtering to the rows where variable_NA is the shadow of variable (variable_NA == paste0(variable, "_NA")).

Some work so far on this:

gathered_ocean_imp_mean %>%
  filter(variable %in% c("air_temp_c",
                         "humidity",
                         "sea_temp_c")) %>%
  mutate(temp = paste0(variable,"_NA")) %>%
  filter(variable == temp)

@njtierney

OK so here is the progress on this:

library(tidyverse)
library(naniar)

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_long(ocean_imp_mean)

gathered_ocean_imp_mean %>%
  filter(variable %in% c("air_temp_c",
                         "humidity",
                         "sea_temp_c")) %>%
  filter(variable_NA == paste0(variable,"_NA")) %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_histogram() +
  facet_wrap(~variable_NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).

I think that the abstraction here would be to let the user specify the variables they want to focus on, and have the data filtered down to those variables and their matching shadow columns.
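As a rough sketch of that abstraction, the two filter steps above could be wrapped in a small helper. The name focus_on_vars is hypothetical; it assumes the long data has the variable and variable_NA columns shown in the output earlier in this thread, and it uses base R subsetting rather than dplyr:

```r
# Hypothetical helper; not part of naniar. Keeps only the requested
# variables, and only the rows where the shadow column matches the variable.
focus_on_vars <- function(long_data, vars) {
  on_diagonal <- long_data$variable_NA == paste0(long_data$variable, "_NA")
  long_data[long_data$variable %in% vars & on_diagonal, , drop = FALSE]
}
```

With this, the plotting code would start from focus_on_vars(gathered_ocean_imp_mean, c("air_temp_c", "humidity", "sea_temp_c")) instead of the two filter() calls.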

@njtierney

Actually, I just added that filtering step to the shadow_long function. This abstracts away a nice chunk of the code:

library(tidyverse)
library(naniar)

ocean_imp_mean <- oceanbuoys %>% 
  bind_shadow(only_miss = TRUE) %>%
  impute_mean_all()

gathered_ocean_imp_mean <- shadow_long(ocean_imp_mean)

gathered_ocean_imp_mean
#> # A tibble: 17,664 x 4
#>    variable value variable_NA   value_NA
#>    <chr>    <dbl> <chr>         <chr>   
#>  1 year      1997 sea_temp_c_NA !NA     
#>  2 year      1997 sea_temp_c_NA !NA     
#>  3 year      1997 sea_temp_c_NA !NA     
#>  4 year      1997 sea_temp_c_NA !NA     
#>  5 year      1997 sea_temp_c_NA !NA     
#>  6 year      1997 sea_temp_c_NA !NA     
#>  7 year      1997 sea_temp_c_NA !NA     
#>  8 year      1997 sea_temp_c_NA !NA     
#>  9 year      1997 sea_temp_c_NA !NA     
#> 10 year      1997 sea_temp_c_NA !NA     
#> # ... with 17,654 more rows

gathered_ocean_imp_mean %>%
  ggplot(aes(x = value,
             fill = value_NA)) + 
  geom_histogram() +
  facet_wrap(~variable_NA)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Created on 2018-08-13 by the reprex package (v0.2.0).

@njtierney njtierney modified the milestones: V0.4.0, V0.5.0 Sep 3, 2018
@njtierney njtierney modified the milestones: V0.5.0, V0.6.0 Oct 30, 2019
@njtierney njtierney removed this from the V0.6.0 milestone Oct 14, 2022
@njtierney njtierney added this to the V0.8.0 milestone Apr 10, 2023
@njtierney njtierney modified the milestones: V1.2.0, V1.3.0 Apr 23, 2023