# Setup Steps

1. Open RStudio on your machine or choose RStudio from the "New" dropdown menu on ACCRE (it is the last option)
2. Run `install.packages("ggplot2")` and `install.packages("png")` in RStudio
3. If using ACCRE, upload this notebook, the `Metro_Nashville_Public_Schools_Enrollment_and_Demographics.csv` file, and the `nashville_map.png` image
4. At this point all the cells should run
    - You will need to go through them in order, as some refer to previous outputs

In [None]:
# load libraries

# need to install
library("ggplot2") # visualization library
library("png") # library to read png images

# come pre-installed
library("plyr") # tools for splitting, applying and combining data
library("repr") # to set plot dimensions
options(repr.plot.width=10, repr.plot.height=6)

# [Grammar of Graphics](http://vita.had.co.nz/papers/layered-grammar.pdf)
[**R**](https://www.r-project.org) is a programming language written by and for academics. Thus, many libraries in R have an underpinning theory. For [ggplot2](https://github.com/tidyverse/ggplot2) that theory is the Grammar of Graphics.

You can purchase the full Grammar of Graphics book [here](https://www.amazon.com/Grammar-Graphics-Statistics-Computing/dp/0387245448/ref=as_li_ss_tl?ie=UTF8&qid=1477928463&sr=8-1&keywords=the+grammar+of+graphics&linkCode=sl1&tag=ggplot2-20&linkId=f0130e557161b83fbe97ba0e9175c431).

For some other great tutorials follow the links below:
- [Andrew Zieffler](https://www.dataplusscience.com/files/ggplot%20and%20violin%20plot.pdf)
- [Data Visualization with ggplot2](https://www.datacamp.com/courses/data-visualization-with-ggplot2-part-1)
- [ggplot2 open source book](https://github.com/hadley/ggplot2-book)
- [Examples of cool visualizations in R](https://www.r-graph-gallery.com)

## Core Elements

### Data
Dataset being plotted

- variables of interest

### Aesthetics
Scales onto which we map our data

- axes
    - x-axis
    - y-axis
- color
    - fill
    - border
    - alpha
- size
- labels
- shape
- lines
    - width
    - type

### Geometries
Visual elements used

- point
    - scatter plot
    - dotplot
- line
    - line chart
    - best fit plotting
- area
    - histogram
    - bar chart
    - pie chart

The examples below use the [Metro Nashville school data](https://data.nashville.gov/browse?category=Education).

In [None]:
# read in data downloaded from https://data.nashville.gov/browse?category=Education
schools <- read.csv('Metro_Nashville_Public_Schools_Enrollment_and_Demographics.csv')

# have a column with the total number of students in each school
schools$Total <- schools$Male + schools$Female

# zip codes should be categorical, not numerical
schools$Zip.Code <- as.factor(schools$Zip.Code)

# order school types in a way that makes sense
school_levels <- c('Elementary School',
                   'Middle School',
                   'High School',
                   'Charter',
                   'Non-Traditional',
                   'Non-Traditional - Hybrid',
                   'Alternative Learning Center',
                   'Special Education',
                   'GATE Center',
                   'Adult')
schools$School.Level <- factor(schools$School.Level, levels = school_levels)

In [None]:
head(schools)

Now let us define the data, aesthetic and geometry for a plot.

We put these together into a "sentence" by running `ggplot(data, aesthetic) + geometry`.

At the end, `+ coord_flip()` can be added to swap the x and y axes.
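
As a minimal sketch of this sentence structure, the snippet below uses R's built-in `mtcars` dataset rather than the schools data (an assumption, so that it runs without the csv). The names `toy_data`, `toy_aesthetic` and `toy_geometry` are illustrative only and are not used elsewhere in this notebook.

```r
library(ggplot2)

# data: a built-in dataset, used here only for illustration
toy_data <- mtcars
# aesthetic: map the number of cylinders (treated as a category) to the x axis
toy_aesthetic <- aes(x = factor(cyl))
# geometry: bars whose height is the number of rows in each category
toy_geometry <- geom_bar(stat = "count")

# the "sentence": data and aesthetic first, then add the geometry
p <- ggplot(toy_data, toy_aesthetic) + toy_geometry

# adding coord_flip() swaps the x and y axes
p_flipped <- p + coord_flip()
print(p_flipped)
```

Keeping the three parts in separate variables makes it easy to swap one of them, for example a different geometry, without touching the other two.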

In [None]:
# set the data
data <- schools #[schools$Total > 300, ]
# choose the aesthetics
aesthetic <- aes(x=School.Level)
# set the geometry (sometimes we have to specify a statistic)
geometry <- geom_bar(stat="count") #, width=.1)

ggplot(data, aesthetic) #+
# geometry #+
# geom_point(stat="count", size=5) + 
# coord_flip()

Turning to numerical data, let's see how many students attend each level of school in Nashville.

In [None]:
# use Nashville schools as data again
data <- schools #[schools$School.Level %in% c("High School", "Elementary School", "Middle School", "Charter"), ]

# this time put the total number of students in each school on the x axis
# use fill to indicate the type of school
aesthetic <- aes(x=Total, fill=School.Level)

# use the histogram geometry
geometry <- geom_histogram(binwidth=100)
# uncomment the line below to use dotplot 
# geometry <- geom_dotplot(aes(color=School.Level), binwidth=100, dotsize=.4, stackgroups=TRUE, binpositions="all")

# plot it
ggplot(data, aesthetic) + geometry

Now let's look at another way to plot that information by changing the geometry and aesthetics.

In [None]:
# use Nashville schools as data again
data <- schools #[schools$School.Level %in% c("High School", "Elementary School", "Middle School", "Charter"), ]

# map the school type to the x axis and the total number of students in each school to the y axis
# (coord_flip() below swaps them, so the school type ends up on the vertical axis)
aesthetic <- aes(x=School.Level, y=Total)

# use the boxplot geometry
geometry <- geom_boxplot()

ggplot(data, aesthetic) + 
geometry + 
# geom_point(stat = "identity", alpha=.5) + # uncomment to show the values as points
# note: fun replaced the deprecated fun.y argument in ggplot2 3.3.0; use fun.y on older versions
# geom_point(stat = "summary", fun = "mean", size=3, shape="square", color="blue") + # uncomment to show the mean as a blue square
# geom_point(stat = "summary", fun = "median", shape="|", size=10, color="red") + # uncomment to show the median as a red line
# swap the x and y axes
coord_flip()

Sometimes it is necessary to reformat data for different visualizations.

In our original csv, each grade is a column. However, to stack up counts from each grade, we want each (school, grade) pair to be a single row.

This way we can look at where each grade goes to school by aggregating on grade instead of school.

Let's look at an example where we figure out which grades each school type serves.
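
Before applying this to the schools data, here is a minimal sketch of the same wide-to-long reshaping on a toy table. The data frame `wide`, its made-up numbers, and the names `grade_cols`, `grade_labels` and `long` are all hypothetical and exist only for this illustration.

```r
# toy wide table: one row per school, one column per grade (made-up numbers)
wide <- data.frame(
    School.Name = c("School A", "School B"),
    Grade.1 = c(30, 25),
    Grade.2 = c(28, 0),
    stringsAsFactors = FALSE
)

# names of the grade columns, and the grade labels with the prefix removed
grade_cols <- grep("^Grade\\.", colnames(wide), value = TRUE)
grade_labels <- sub("^Grade\\.", "", grade_cols)

# build one small frame per grade, then stack the frames row-wise
long <- do.call(rbind, lapply(seq_along(grade_cols), function(i) {
    data.frame(
        School.Name = wide$School.Name,
        Grade = grade_labels[i],
        Students = wide[[grade_cols[i]]],
        stringsAsFactors = FALSE
    )
}))

# each (school, grade) pair is now a single row
print(long)

# aggregating on grade instead of school is now straightforward
print(aggregate(Students ~ Grade, data = long, FUN = sum))
```

The long table has one row per (school, grade) pair, which is the shape a stacked bar chart needs.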

In [None]:
# find the columns that start with "Grade." and strip that prefix to get the grade names
grade_names <- sub("^Grade\\.", "",
                   grep("^Grade\\.", colnames(schools), value = TRUE))

# make a function that makes a (school, grade) row
getFrameForGrade <- function(grade_name) {
    key <- paste("Grade.", grade_name, sep="")
    grade_counts <- data.frame(schools[ , c("School.Level", "School.Name", key)])
    grade_counts$Grade <- grade_name
    names(grade_counts)[names(grade_counts) == key] <- "Students"
    return(grade_counts)
}

grade_info <- Reduce(rbind, lapply(grade_names, getFrameForGrade))

# make sure the grades are plotted in the correct order
grade_info$Grade <- factor(grade_info$Grade, levels = grade_names)

# head(schools)
head(grade_info)

In [None]:
# use the grade_info as the data
data <- grade_info #[complete.cases(grade_info), ]

# map grade to the x axis, the number of students to the y axis, use the fill color to indicate the school level
aesthetic <- aes(x=Grade, y=Students, fill=School.Level)

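# use the bar chart geometry
# stat="identity" uses each Students value as a bar height and position="stack" adds them up
# (stat="sum" would merge rows with identical values and undercount)
geometry <- geom_bar(stat="identity", position = "stack")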

# plot it
ggplot(data, aesthetic) + geometry + coord_flip()

## Optional Elements

### Facets
Plotting small multiples

- subplots
    - columns
    - rows

### Statistics
Representations to aid understanding

- descriptive
    - mean
    - median
- inferential
    - confidence interval
- binning
    - grouping
    - bin shapes
- smoothing
    - curve fitting

### Coordinates
Space in which data is plotted

- coordinate systems
    - cartesian
    - polar
    - spherical
- fixed
    - fixed ratio between axes (such as latitude and longitude)
- limits
    - edges of chart

### Themes
Non-data ink

- labels
    - call outs
    - captions
- graphics
    - icons
- font

Let's start with an example of how extra information and fixed coordinate ratios can be helpful.

In [None]:
# load a map image and turn it into a geometry (the next cell also needs map_plot)
map_imp <- readPNG('nashville_map.png')
map_plot <- annotation_raster(map_imp,
    xmin = min(schools$Longitude, na.rm=TRUE), xmax = max(schools$Longitude, na.rm=TRUE),
    ymin = min(schools$Latitude, na.rm=TRUE), ymax = max(schools$Latitude, na.rm=TRUE))

# use the high schools as the data
data <- schools[schools$School.Level %in% c("High School"), ]

# map latitude to the y axis and longitude to the x axis
aesthetic <- aes(y=Latitude, x=Longitude)

# make the plot
ggplot(data, aesthetic) + 
# map_plot + # uncomment to add the map
# coord_fixed() + # uncomment to fix the coordinates
# ggtitle("Where Nashville Goes to School") + # uncomment to add a title
geom_point(color="red", size=5)

Can we encode some more information on this map?

In [None]:
# use the elementary, middle and high schools as the data
data <- schools[schools$School.Level %in% c("Elementary School", "Middle School", "High School"), ]

# map latitude to the y axis and longitude to the x axis
# commented are some aesthetics to add
aesthetic <- aes(y=Latitude, x=Longitude) #, shape=School.Level, color=White/Total, size=Total)

# races
# White
# Black.or.African.American
# Hispanic.Latino
# Asian
# Native.Hawaiian.or.Other.Pacific.Islander
# American.Indian.or.Alaska.Native

geometry <- geom_point()

ggplot(data, aesthetic) + 
map_plot + 
geometry + 
# scale_colour_gradient(low = "blue", high = "red") + # uncomment to change the color gradient
# ggtitle("How segregated is Nashville?") + 
coord_fixed()

Are students more segregated in elementary school than high school?

In [None]:
race_names <- c(
    "White",
    "Black.or.African.American",
    "Hispanic.Latino",
    "Asian",
    "Native.Hawaiian.or.Other.Pacific.Islander",
    "American.Indian.or.Alaska.Native"
)

getFrameForRace <- function(race_name) {
    race_counts <- data.frame(schools[ , c("School.Level", "School.Name", "School.ID", race_name, "Total")])
    race_counts$Race <- race_name
    race_counts$Fraction <- race_counts[ , race_name]/race_counts$Total
    names(race_counts)[names(race_counts) == race_name] <- "Students"
    return(race_counts)
}

race_info <- Reduce(rbind, lapply(race_names, getFrameForRace))

# make sure the races are plotted in the correct order
race_info$Race <- factor(race_info$Race, levels = race_names)

# order by fraction of <race_to_order_by> students
race_to_order_by <- "White"
school_names <- schools[,c("School.Name", race_to_order_by, "Total")]
school_names$Fraction <- school_names[,race_to_order_by]/school_names$Total
school_names <- school_names[order(-school_names[, "Fraction"]), ]
school_names <- school_names$School.Name

race_info$School.Name <- factor(race_info$School.Name, levels = school_names)

# head(schools)
head(race_info)

In [None]:
data <- race_info[race_info$School.Level %in% c("Elementary School", "Middle School", "High School"), ]
data <- data[complete.cases(data), ]
aesthetic <- aes(x=School.Name, y=Fraction, fill=Race)#, width=Total/1800)
geometry <- geom_bar(stat="identity", position="stack")

ggplot(data, aesthetic) + geometry + 
# facet_grid( ~ School.Level) + 
# theme(axis.title.y=element_blank(),
#         axis.text.y=element_blank(),
#         axis.ticks.y=element_blank()) +
coord_flip()

# Closing Thoughts

Even if you don't use R, the Grammar of Graphics is a useful framework for thinking about how to construct visualizations.

Many studies have measured how accurately people read information encoded in different ways, such as color, length and angle. When building graphs, put the most important information in the encoding that is easiest to read, such as position along an axis.
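
To make the comparison of encodings concrete, the sketch below draws the same made-up values twice: once as bar lengths and once as pie-chart angles. The data frame `shares` and the plot names are hypothetical and exist only for this illustration.

```r
library(ggplot2)

# made-up shares that are deliberately close to one another
shares <- data.frame(
    group = c("A", "B", "C", "D"),
    value = c(23, 26, 24, 27)
)

# length encoding: values are read from bar heights on a common axis
bar_plot <- ggplot(shares, aes(x = group, y = value)) +
    geom_bar(stat = "identity")

# angle encoding: the same data as a pie chart
# (a single stacked bar drawn in polar coordinates)
pie_plot <- ggplot(shares, aes(x = "", y = value, fill = group)) +
    geom_bar(stat = "identity", width = 1) +
    coord_polar(theta = "y")

print(bar_plot)
print(pie_plot)
```

Try ranking the four groups from each plot separately to see which encoding makes the small differences easier to judge.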