# Visualization

Visualizing data can be a great way of understanding both variance within variables and covariance between variables. A typical workflow consists of:

1. Coming up with questions about your data
1. Filtering, mutating, and visualizing in different ways
1. Refining existing questions or generating new ones based on curiosity or skepticism

This is a bit like a multi-armed bandit problem: initially we explore freely, but over time we want to home in on promising ideas and refine them for further processing and modeling.

In [None]:
library(dplyr)
library(ggplot2)

In [None]:
str(anscombe)

In [None]:
summary(anscombe)

In [None]:
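# Anscombe's quartet: four x/y pairs with nearly identical summary
# statistics but very different shapes. Each pair gets its own colour.
ggplot(data = anscombe) +
  geom_point(aes(x = x1, y = y1), colour = "red") +
  geom_point(aes(x = x2, y = y2), colour = "blue") +
  geom_point(aes(x = x3, y = y3), colour = "green") +
  geom_point(aes(x = x4, y = y4), colour = "black")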

In [None]:
library(nycflights13)

In [None]:
str(flights)

### Histograms

In [None]:
ggplot(data = flights) +
  geom_bar(mapping = aes(x = carrier))

*Question*: How would you sort this plot?

In [None]:
ggplot(data = flights) +
  geom_histogram(mapping = aes(x = distance), binwidth = 100)

In [None]:
# Use geom_freqpoly() if you need to overlay multiple histograms
# Why do we have to use factor(month)?
ggplot(data = flights) +
  geom_freqpoly(mapping = aes(x = distance, color = factor(month)), binwidth = 100)

In [None]:
ggplot(data = flights) +
  geom_freqpoly(mapping = aes(x = distance, color = factor(month)), binwidth = 100) +
  coord_cartesian(xlim = c(2400, 2600), ylim = c(1000, 2000))

In [None]:
ggplot(data = faithful, mapping = aes(x = eruptions)) + 
  geom_histogram(binwidth = 0.25)

Can you explain the distribution? If not, what steps could you take to explore the relationship?

In [None]:
flights %>% 
  mutate(
    cancelled = is.na(dep_time),
    sched_hour = sched_dep_time %/% 100,
    sched_min = sched_dep_time %% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>% 
  ggplot(mapping = aes(sched_dep_time)) + 
    geom_freqpoly(mapping = aes(colour = cancelled), binwidth = 1/4)

### Boxplots

Boxplots are a more abstract way of visualizing distributions.
They display the 25th, 50th, and 75th percentiles as a box, and usually add whiskers that extend to the farthest non-outlier point on each side.

Points that lie more than 1.5 times the IQR (interquartile range) beyond either edge of the box are treated as outliers and plotted individually, which makes them easy to distinguish.
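The quantities a boxplot draws can be computed by hand. The following is a minimal sketch using the built-in `mtcars` data and base R only. The names `fences`, `whiskers`, and `outliers` are illustrative choices, not something `geom_boxplot()` exposes.

```r
# Sketch: compute the pieces of a boxplot by hand for one numeric variable.
x <- mtcars$hp

# Box: 25th, 50th, and 75th percentiles
q <- quantile(x, probs = c(0.25, 0.50, 0.75), names = FALSE)
iqr <- q[3] - q[1]

# Fences sit 1.5 * IQR beyond the edges of the box
fences <- c(lower = q[1] - 1.5 * iqr, upper = q[3] + 1.5 * iqr)

# Points outside the fences are drawn individually as outliers
is_outlier <- x < fences["lower"] | x > fences["upper"]
outliers <- x[is_outlier]

# Whiskers extend to the most extreme points that are not outliers
whiskers <- range(x[!is_outlier])

print(q)
print(whiskers)
print(outliers)
```

Comparing these numbers with `bwplot(~hp, data = mtcars)` or a `geom_boxplot()` of the same variable is a quick way to check your reading of the plot.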

In [None]:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

In [None]:
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))

In [None]:
ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()

### Comparing two discrete variables

In [None]:
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

In [None]:
diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
    geom_tile(mapping = aes(fill = n))

In [None]:
ggplot(data = flights) + 
  geom_point(mapping = aes(x = dep_delay, y = arr_delay))

In [None]:
ggplot(data = flights) +
  geom_bin2d(mapping = aes(x = arr_delay, y = dep_delay))

*Note*: Fitting a model to the relationship between two variables and then examining the residuals (what is left after subtracting the model's predictions) can help disentangle effects that are otherwise confounded.
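As a minimal sketch of this idea, following the approach in the R for Data Science chapter listed under Sources: in `diamonds`, price is dominated by carat, which obscures the effect of cut. The log-log linear model below is an assumption chosen for illustration, not the only reasonable model, and the names `mod` and `diamonds_resid` are hypothetical.

```r
library(ggplot2)

# Model price as a function of carat on the log-log scale
mod <- lm(log(price) ~ log(carat), data = diamonds)

# Residuals: the part of (log) price that carat does not explain
diamonds_resid <- as.data.frame(diamonds)
diamonds_resid$resid <- residuals(mod)

# With carat accounted for, compare the leftover price variation across cuts
ggplot(data = diamonds_resid) +
  geom_boxplot(mapping = aes(x = cut, y = resid))
```

Plotting `price` against `cut` directly and comparing it with this residual plot shows how much of the apparent pattern was driven by carat.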

In [None]:
library(lattice)

## Lattice

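Usage: `graph_type(formula, data = )`, where `graph_type` is a lattice plotting function and the formula names the variables to plot. A `|` introduces conditioning variables, which split the plot into one panel per level. Typical pairings of function and formula:

* `xyplot(y ~ x)` -> scatterplot
* `cloud(z ~ x * y | A)` -> 3D scatterplot, one panel per level of `A`
* `contourplot(z ~ x * y)` or `wireframe(z ~ x * y)` -> 3D contour plot or wireframe
* `densityplot(~ x | A * B)` -> KDE, one panel per combination of `A` and `B`
* `histogram(~ x)` -> histogram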

In [None]:
str(mtcars)

In [None]:
# create factors with value labels 
gear_f <- factor(mtcars$gear, levels = c(3,4,5), labels = c("3_gears","4_gears","5_gears")) 
cyl_f <- factor(mtcars$cyl, levels = c(4,6,8), labels=c("4_cyl","6_cyl","8_cyl")) 

In [None]:
mtcars <- mtcars %>%
    mutate(gear_f = gear_f,
           cyl_f = cyl_f)

In [None]:
mtcars %>% head(3)

In [None]:
splom(mtcars[c(1,3,4,5,6)], main="MTCARS Data")

In [None]:
xyplot(mpg~wt, data = mtcars)

In [None]:
barchart(mpg~wt,
         data = mtcars)

In [None]:
xyplot(mpg~wt|gear_f, data = mtcars)

In [None]:
xyplot(mpg~wt|cyl_f*gear_f,
    data = mtcars,
    main = "Every combination of conditioning vars",
    ylab = "MPG", xlab = "Weight")

In [None]:
xyplot(mpg~wt, data = mtcars,
       groups = cyl_f,
       auto.key = list(columns = 3),
       type=c("p","g"))

In [None]:
histogram(~mpg|cyl_f,
          data = mtcars)

In [None]:
bwplot(~hp|cyl_f,
       data = mtcars)

### Kernel Density Estimation

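Kernel density estimation (KDE) estimates a probability density function (PDF) from a finite sample by smoothing: a small kernel (typically a Gaussian bump) is centered on each observation, and the kernels are averaged into a continuous curve.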

* [Michael Lerner's motivation of KDE based on histograms](http://www.mglerner.com/blog/?p=28)
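
To make the smoothing concrete, here is a minimal sketch of a Gaussian KDE computed by hand in base R and overlaid on the result of `density()`. `bw.nrd0()` is the default bandwidth rule used by `density()`. The grid size and the names `grid` and `kde_manual` are arbitrary choices for illustration.

```r
x <- mtcars$mpg
h <- bw.nrd0(x)  # rule-of-thumb bandwidth, the default for density()

# Evaluate the estimate on a grid that extends past the data range
grid <- seq(min(x) - 3 * h, max(x) + 3 * h, length.out = 512)

# KDE: centre a Gaussian bump on every observation and average the bumps
kde_manual <- sapply(grid, function(g) mean(dnorm(g, mean = x, sd = h)))

# Compare with the built-in estimator
plot(density(x, bw = h), main = "Manual KDE vs density()")
lines(grid, kde_manual, col = "red", lty = 2)
```

Changing `h` shows the usual trade-off: a small bandwidth produces a spiky curve that follows individual points, while a large one smooths away real structure.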

In [None]:
densityplot(~mpg,
            data = mtcars)

In [None]:
densityplot(~mpg|cyl_f,
            data = mtcars,
            layout=c(1,3))

In [None]:
help(lattice)

In [None]:
levelplot(mpg~wt*hp,
          data = mtcars)

In [None]:
cloud(mpg~cyl_f*gear_f,
      data = mtcars)

## Publishing to the web

[ggvis](http://ggvis.rstudio.com/) and [rCharts](https://ramnathv.github.io/rCharts/) are visualization packages for R that focus on publishing to the web.

ggvis renders plots in HTML and can plug into Shiny to make them interactive for a public audience.

rCharts focuses on JavaScript and lets you use libraries like NVD3 to create embeddable visualizations.

*Exercise*: Load the iris dataset and explore the distribution of the variables conditioned by the type of flower. Produce some violin plots which illustrate the differences.

Sources:

* https://www.r-bloggers.com/conditioning-and-grouping-with-lattice-graphics/
* http://www.statmethods.net/advgraphs/trellis.html
* http://r4ds.had.co.nz/exploratory-data-analysis.html

*Copyright &copy; 2016 The Data Incubator.  All rights reserved.*