
first commit of ggmissing!

njtierney committed Dec 10, 2015
0 parents commit a161b7c5707e0c5041467d8f7228c11afb5e2b29
Showing with 317 additions and 0 deletions.
  1. +2 −0 .Rbuildignore
  2. +4 −0 .gitignore
  3. +16 −0 DESCRIPTION
  4. +5 −0 NAMESPACE
  5. +16 −0 R/shadow_df.R
  6. +21 −0 R/shadow_shift.R
  7. +30 −0 README.md
  8. +20 −0 ggmissing.Rproj
  9. +15 −0 man/shadow_df.Rd
  10. +15 −0 man/shadow_shift.Rd
  11. +173 −0 vignettes/ggmissing-vignette.Rmd
@@ -0,0 +1,2 @@
^.*\.Rproj$
^\.Rproj\.user$
@@ -0,0 +1,4 @@
inst/doc
.Rproj.user
.Rhistory
.RData
@@ -0,0 +1,16 @@
Package: ggmissing
Type: Package
Title: Enables ggplot to Plot Missing Data
Version: 0.1.0
Author: Nicholas Tierney, Di Cook
Maintainer: Nicholas Tierney <nicholas.tierney@gmail.com>
Description: Provides functions that shift missing values so that ggplot2 plots
    can display missingness clearly.
License: Unsure
LazyData: TRUE
Suggests:
    knitr,
    rmarkdown
VignetteBuilder: knitr
Imports: dplyr
RoxygenNote: 5.0.0
@@ -0,0 +1,5 @@
# Generated by roxygen2: do not edit by hand

export(shadow_df)
export(shadow_shift)
import(dplyr)
@@ -0,0 +1,16 @@
#' shadow_df
#'
#' @description \code{shadow_df} creates a shadow matrix/data frame of class 'tbl_df' that denotes whether a given cell is missing or not - if a value is missing, it is denoted as TRUE
#'
#' @param x a dataframe
#'
#' @import dplyr
#'
#' @export

shadow_df <- function(x){
  x %>%
    is.na.data.frame %>%
    as.data.frame %>%
    as_data_frame
}
@@ -0,0 +1,21 @@
#' shadow_shift
#'
#' \code{shadow_shift} replaces the missing values of a given variable with a value 10\% below that variable's minimum
#'
#' @param x is a variable, must be continuous
#'
#' @import dplyr
#'
#' @export

# Make a window function that transforms missing values to be 10% below the minimum value for that variable
shadow_shift <- function(x){
  ifelse(is.na(x),
         yes = min(x, na.rm = TRUE)*0.9,
         no = x)
  # note: multiplying by 0.9 only lands below the minimum when the minimum is positive
  # min() might change to something related to the data range
  # possibly use range() to determine the shadow shift
  # Need to also add some jitter/noise to these points to separate out repeats of the same value
  # for factors, need to add another level (smaller than smallest)
  # need to think about how time is handled as well.
}
@@ -0,0 +1,30 @@
# ggmissing

Currently, ggplot2 does not display missing data: it omits missing values from plots and only issues a warning message.

This repository contains the beginnings of some R code to enable ggplot2 to display missingness.

GGobi and MANET provide methods of incorporating missingness into plots. One approach is to replace `NA` values with values 10% lower than the minimum of that variable.

This is done with the `shadow_shift` function. This can be directly incorporated into ggplot:

```
ggplot(data = df,
       aes(x = shadow_shift(Height),
           y = shadow_shift(Age))) +
  geom_point()
```
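
To make the shift concrete, here is a minimal sketch that applies the same rule to a few small vectors. `shadow_shift` is redefined so the example stands alone, and the numbers are made up. `shadow_shift_range` is a hypothetical variant, not part of this package, that offsets by 10% of the data range instead, because multiplying by 0.9 only lands below the minimum when the minimum is positive.

```r
# Same rule as the package's shadow_shift(), repeated so this runs on its own
shadow_shift <- function(x) {
  ifelse(is.na(x), min(x, na.rm = TRUE) * 0.9, x)
}

heights <- c(150, NA, 180, 165, NA)
shifted <- shadow_shift(heights)
shifted
# the missing entries become 150 * 0.9 = 135

# Caveat: with a negative minimum, 0.9 * min lies above the minimum
shadow_shift(c(-10, NA, 5))
# the NA becomes -9, which sits inside the observed range

# Hypothetical alternative (an assumption, not the package's behaviour):
# place missing values 10% of the range below the minimum
shadow_shift_range <- function(x) {
  rng <- range(x, na.rm = TRUE)
  ifelse(is.na(x), rng[1] - 0.1 * diff(rng), x)
}
shifted_range <- shadow_shift_range(c(-10, NA, 5))
shifted_range
# the NA becomes -10 - 0.1 * 15 = -11.5
```

A range-based offset is only one possible choice; the amount of the shift would still need tuning for each kind of plot.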

This allows missingness to be visualised. Ideally, however, the missing values would be shown in a different colour, so that missingness becomes preattentive.

We currently have a messy approach to colouring these points differently, although a more elegant solution is needed.

This repository includes a "vignette" of sorts that describes the process of adding missingness to a plot. It also provides two utility functions: `shadow_shift`, which shifts missing values, and `shadow_df`, which creates a shadow matrix.
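
As a rough illustration of what a shadow matrix looks like, the sketch below builds one with base R only. It is an assumed equivalent of `shadow_df` without the conversion to `tbl_df`, and the `toy` data frame is made up for the example.

```r
# Toy data with a few missing values (made-up numbers for illustration)
toy <- data.frame(Height = c(150, NA, 180),
                  Age    = c(NA, 32, 41))

# Base R equivalent of the shadow matrix: TRUE marks a missing cell
shadow <- as.data.frame(is.na(toy))
shadow

# Number of missing values per variable
colSums(is.na(toy))
```

The shadow matrix has the same shape and column names as the original data, so it can be lined up with it cell by cell.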

Future work will involve creating an elegant and meaningful way of coding and representing missingness in the data. One approach is to use `interaction` to create the levels of missingness and plot these as different colours.
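
One way the `interaction` idea might look is sketched below. This is an assumption about a possible design rather than code from this package: the `toy` data and the `missingness` column are made up for the example, and `shadow_shift` is redefined so the block runs on its own.

```r
library(ggplot2)

# Toy data (made-up numbers for illustration)
toy <- data.frame(Height = c(150, NA, 180, 165, NA),
                  Age    = c(NA, 32, 41, 28, NA))

# Same rule as the package's shadow_shift(), repeated so this runs on its own
shadow_shift <- function(x) {
  ifelse(is.na(x), min(x, na.rm = TRUE) * 0.9, x)
}

# One factor level per combination of missing / not missing in Height and Age
toy$missingness <- interaction(is.na(toy$Height), is.na(toy$Age))
levels(toy$missingness)

# Shift the missing values and colour each point by its missingness level
p <- ggplot(toy,
            aes(x = shadow_shift(Height),
                y = shadow_shift(Age),
                colour = missingness)) +
  geom_point()
p
```

Each level is named after the pair of flags, Height first and then Age, so `TRUE.FALSE` marks points where only Height was missing.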

The sorts of plots that this approach could handle also need further thought:

- 1D, univariate distribution plots
- Categorical variables
- Bivariate plots: scatterplots, density overlays
@@ -0,0 +1,20 @@
Version: 1.0

RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: knitr
LaTeX: pdfLaTeX

AutoAppendNewline: Yes
StripTrailingWhitespace: Yes

BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source

@@ -0,0 +1,173 @@
---
title: "Vignette Title"
author: "Nicholas Tierney"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Vignette Title}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup}
knitr::opts_chunk$set(message = FALSE)
```


This is a draft document/vignette that gets ggplot to display missingness in a plot.


```{r}
library(dplyr)
library(wakefield)
df <-
  r_data_frame(
    n = 30,
    id,
    race,
    age,
    sex,
    hour,
    iq,
    height,
    died,
    Scoring = rnorm,
    Smoker = valid
  ) %>%
  r_na(prob=.4)
```


```{r}
library(ggplot2)
ggplot(data = df,
       aes(x = Height,
           y = Age)) +
  geom_point()
# the idea was to plot the missing data as 10% below the minimum value for that variable.
df %>%
  # make missing values 10% below the minimum value for that variable
  mutate(Height = ifelse(is.na(Height),
                         yes = min(Height, na.rm = TRUE)*0.9,
                         no = Height),
         Age = ifelse(is.na(Age),
                      yes = min(Age, na.rm = TRUE)*0.9,
                      no = Age)) %>%
  ggplot(data = .,
         aes(x = Height,
             y = Age)) +
  geom_point()
```


```{r}
is.na.data.frame(df)
df.shadow <- as.data.frame(is.na.data.frame(df))
# make a function for creating a true/false shadow matrix
shadow_df <- function(x){
  x %>%
    is.na.data.frame %>%
    as.data.frame %>%
    as_data_frame
}
# remember that TRUE = missing
shadow_df(df)
# Make a window function that transforms missing values to be 10% below the minimum value for that variable
shadow_shift <- function(x){
  ifelse(is.na(x),
         yes = min(x, na.rm = TRUE)*0.9,
         no = x)
  # note: multiplying by 0.9 only lands below the minimum when the minimum is positive
  # min() might change to something related to the data range
  # possibly use range() to determine the shadow shift
  # Need to also add some jitter/noise to these points to separate out repeats of the same value
  # for factors, need to add another level (smaller than smallest)
  # need to think about how time is handled as well.
}
df %>%
  # make missing values 10% below the minimum value for that variable
  mutate(Height = shadow_shift(Height),
         Age = shadow_shift(Age)) %>%
  ggplot(data = .,
         aes(x = Height,
             y = Age)) +
  geom_point()
# OK, so it turns out that the data can be shadow shifted directly INSIDE the ggplot call.
ggplot(data = df,
       aes(x = shadow_shift(Height),
           y = shadow_shift(Age))) +
  geom_point()
# now we just need to add some colour to these points, so that the missing data becomes "preattentive".
# let's make new datasets that contain only the shifted (previously missing) data
df.test <-
  df %>%
  mutate(Height = shadow_shift(Height)) %>%
  # drop observations well above the new minimum, keeping only the shifted values
  filter(Height < (min(Height, na.rm = TRUE)*1.1))
df.test.2 <-
  df %>%
  mutate(Age = shadow_shift(Age)) %>%
  # drop observations well above the new minimum, keeping only the shifted values
  filter(Age < (min(Age, na.rm = TRUE)*1.1))
# to manage the different quantities of the variables, I could put them into a list, or something that allows me to have a "ragged" dataset
ggplot(data = df,
       aes(x = shadow_shift(Height),
           y = shadow_shift(Age))) +
  geom_point() +
  geom_point(data = df.test,
             aes(x = Height),
             colour = "Red") +
  geom_point(data = df.test.2,
             aes(y = Age),
             colour = "Red")
```

Possibly colour by `interaction`, which creates all the different combinations of levels of factors.

```{r, eval = F, echo = T}
# toy factors to show what interaction() produces
a <- gl(2, 4, 8)
b <- gl(2, 2, 8, labels = c("ctrl", "treat"))
s <- gl(2, 1, 8, labels = c("M", "F"))
a
b
s
interaction(a, b, s)
# a factor describing which of Height and Age are missing in each row
df %>% shadow_df %>% select(Height, Age) %>% interaction
# colour must be mapped inside aes() to vary by row
ggplot(data = df,
       aes(x = shadow_shift(Height),
           y = shadow_shift(Age),
           colour = interaction(is.na(Height), is.na(Age)))) +
  geom_point()
```
