first commit
this is the first commit of the new package, visdat, which supersedes
`footprintr`.
njtierney committed Jan 28, 2016
0 parents commit a350e50
Showing 13 changed files with 245 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .Rbuildignore
@@ -0,0 +1,2 @@
^.*\.Rproj$
^\.Rproj\.user$
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.Rproj.user
.Rhistory
.RData
13 changes: 13 additions & 0 deletions DESCRIPTION
@@ -0,0 +1,13 @@
Package: visdat
Title: Preliminary Visualisation of Data
Version: 0.0.0.9000
Authors@R: person("Nicholas", "Tierney", email = "nicholas.tierney@gmail.com", role = c("aut", "cre"))
Description: visdat makes it easy to visualise your whole dataset so that you can quickly identify problems visually.
Depends:
R (>= 3.2.2)
License: MIT
LazyData: true
RoxygenNote: 5.0.1
Imports: ggplot2,
tidyr,
dplyr
8 changes: 8 additions & 0 deletions NAMESPACE
@@ -0,0 +1,8 @@
# Generated by roxygen2: do not edit by hand

export(fingerprint)
export(vis_dat)
export(vis_miss)
import(dplyr)
import(ggplot2)
importFrom(tidyr,gather)
19 changes: 19 additions & 0 deletions R/fingerprint.R
@@ -0,0 +1,19 @@
#' fingerprint
#'
#' \code{fingerprint} is a utility function for vis_dat
#'
#' @description fingerprint takes the fingerprint of a dataframe: it (currently) replaces each value of x with the class of x, unless the value is missing (coded as NA), in which case it is left as NA. The name fingerprint is taken from csv-fingerprint, on which this package is based.
#'
#' @param x a vector
#'
#' @export
fingerprint <- function(x){

  # is the data missing?
  ifelse(is.na(x),
         # yes? leave it as NA
         yes = NA,
         # no? replace the value with the (first) class of the vector
         no = class(x)[1])

} # end function
38 changes: 38 additions & 0 deletions R/viz_dat.R
@@ -0,0 +1,38 @@
#' vis_dat
#'
#' \code{vis_dat} visualises a data.frame to tell you what it contains.
#'
#' @description \code{vis_dat} gives you an at-a-glance ggplot of what is inside a dataframe, colouring cells according to what class they are and whether the values are missing. As it returns a ggplot object, it is very easy to customize and change labels, etc.
#'
#' @param x a data.frame object
#'
#' @importFrom tidyr gather
#' @import dplyr
#' @import ggplot2
#'
#' @export
vis_dat <- function(x){

  # apply the fingerprint to every column in the dataframe
  lapply(x, fingerprint) %>%
    # coerce it to a dataframe...there's probably a better way
    as_data_frame %>%
    # create a new column that is numbered from 1 to the number of rows
    # this assists in the gathering of rows together
    mutate(rows = 1:n()) %>%
    # gather the variables together for plotting
    # we now have a column for the row number (rows), the variable name (variables), and the contents of that variable (value)
    gather(key = variables,
           value = value,
           -rows) %>%
    # then we plot it
    ggplot(data = .,
           aes(x = variables,
               y = rows)) +
    geom_raster(aes(fill = value)) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
    labs(x = "Variables in Dataset",
         y = "Rows / observations")

}
40 changes: 40 additions & 0 deletions R/viz_miss.R
@@ -0,0 +1,40 @@
#' vis_miss
#'
#' \code{vis_miss} visualises a data.frame to display missingness.
#'
#' @description \code{vis_miss} gives you an at-a-glance ggplot of the missingness inside a dataframe, colouring cells according to missingness. As it returns a ggplot object, it is very easy to customize and change labels, etc.
#'
#' @param x a data.frame object
#'
#' @importFrom tidyr gather
#' @import dplyr
#' @import ggplot2
#'
#' @export
vis_miss <- function(x){

  x %>%
    is.na %>%
    as.data.frame %>%
    mutate(rows = 1:n()) %>%
    # gather the variables together for plotting
    # we now have a column for the row number (rows), the variable name (variables), and whether that cell is missing (value)
    gather(key = variables,
           value = value,
           -rows) %>%
    # then we plot it
    ggplot(data = .,
           aes(x = variables,
               y = rows)) +
    geom_raster(aes(fill = value)) +
    # change the colour, so that missing is grey, present is black
    scale_fill_grey(name = "",
                    labels = c("Present",
                               "Missing")) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45,
                                     vjust = 0.5)) +
    labs(x = "Variables in Dataset",
         y = "Rows / observations")

}
47 changes: 47 additions & 0 deletions README.md
@@ -0,0 +1,47 @@
# visdat

This package is the second iteration of my attempt at cloning the super cool and way sexier "csv-fingerprint", featured on FlowingData - see [here](https://github.com/setosa/csv-fingerprint) and [here](https://flowingdata.com/2014/08/14/csv-fingerprint-spot-errors-in-your-data-at-a-glance/). Initially I had named the package "footprintr", in keeping with the spirit of the name "csv-fingerprint". However, after a little more thought and usage, I felt that "footprintr" didn't actually describe what the package does, and so "visdat" was born.

# What does it do?

visdat is a small R package that visualises a dataframe, displaying missing data and variable classes in different colours. Future work will allow each cell to be coloured according to its type (e.g., strings, factors, integers, decimals, dates, missing data). It would also be really cool to get this function to "intelligently" read in data types.

Part of the name suggests that it could be integrated with testdat and testthat. The idea is that you first visualise your data, then run tests to fix any problems you find.


# How to install

```
# install.packages("devtools")
library(devtools)
install_github("njtierney/visdat")
```

# Example

Let's explore the missing data

```
library(visdat)
vis_miss(airquality)
```

Let's see what's inside airquality

```
vis_dat(airquality)
```
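
For a rough idea of what `vis_dat` is colouring, the fingerprint step can be tried on its own in base R. This is only a minimal sketch: `fingerprint_sketch` is a hypothetical stand-in that mirrors the logic in `R/fingerprint.R`, and taking just the first element of `class(x)` is an assumption made here so that columns with more than one class are not recycled oddly.

```r
# Hypothetical stand-in for fingerprint(): NA stays NA,
# every other value is replaced by the class of its column.
fingerprint_sketch <- function(x) {
  ifelse(is.na(x), yes = NA, no = class(x)[1])
}

# airquality ships with base R (datasets package)
fp <- lapply(airquality, fingerprint_sketch)

# Ozone is an integer column with some missing values,
# so its fingerprint holds "integer" and NA
table(fp$Ozone, useNA = "ifany")

# Wind has no missing values, so every cell is "numeric"
unique(fp$Wind)
```

Each element of `fp` has the same length as the original column, which is what lets `vis_dat` draw one coloured cell per value.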

# Known Issues

**Individual cells do not have an individual class**

Because R coerces every element of a vector to the same type, you cannot keep something like c("a", 1L, 10.555) together as a vector of mixed types; R converts it to `[1] "a" "1" "10.555"`. This means that you don't get the ideal feature of picking up on nuances such as individual cells that are a different class from the rest of their column. Perhaps there is a way to read in a csv as a list so that these features are preserved?
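
The coercion described above can be checked directly in base R. The sketch below also shows that a list keeps each element's own class, which is the behaviour the csv-as-a-list idea is hoping for; the variable names are only for illustration.

```r
# An atomic vector is coerced to one common type (here, character)
mixed_vector <- c("a", 1L, 10.555)
class(mixed_vector)

# so asking each element for its class gives the same answer every time
unname(vapply(mixed_vector, class, character(1)))

# A list does not coerce, so each element keeps its own class
mixed_list <- list("a", 1L, 10.555)
vapply(mixed_list, class, character(1))
```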

**Missing Data not listed in legend**

When running the `vis_dat` example above, the grey cells indicate missing values, but the legend does not currently label them as missing.
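
One possible workaround, sketched here as an assumption and not something the package does, is to give missing cells an explicit label so that they become an ordinary category of the fill variable. `fingerprint_labelled` is a hypothetical name used only for this illustration.

```r
# Hypothetical variant of the fingerprint step: missing values get the
# literal label "NA" (a string) instead of a true NA.
fingerprint_labelled <- function(x) {
  ifelse(is.na(x), yes = "NA", no = class(x)[1])
}

labelled_ozone <- fingerprint_labelled(airquality$Ozone)

# No true NAs remain, so "NA" is counted like any other category
table(labelled_ozone)
```

Whether this is the right fix is an open question, since turning NA into a string could hide the difference between a real missing value and a cell that contains the text "NA".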


Binary file added data/example2.rda
Binary file not shown.
18 changes: 18 additions & 0 deletions man/fingerprint.Rd


18 changes: 18 additions & 0 deletions man/vis_dat.Rd


18 changes: 18 additions & 0 deletions man/vis_miss.Rd


21 changes: 21 additions & 0 deletions visdat.Rproj
@@ -0,0 +1,21 @@
Version: 1.0

RestoreWorkspace: No
SaveWorkspace: No
AlwaysSaveHistory: Default

EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8

RnwWeave: knitr
LaTeX: pdfLaTeX

AutoAppendNewline: Yes
StripTrailingWhitespace: Yes

BuildType: Package
PackageUseDevtools: Yes
PackageInstallArgs: --no-multiarch --with-keep.source
PackageRoxygenize: rd,collate,namespace
