## Creating Publication-Quality Graphics with ggplot2



Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.

There are three main plotting systems in R: the base plotting system, the lattice package, and the ggplot2 package.

In [None]:
library("ggplot2")
library(gapminder)
library(tibble)

Here is an example:

In [None]:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

ggplot2 is built on the grammar of graphics, the idea that any plot can be expressed from the same set of components: a <b>data</b> set, a <b>coordinate system</b>, and a set of <b>geoms</b> – the visual representation of data points.
The key to understanding ggplot2 is thinking about a figure in layers. This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.
So the first thing we do is call the `ggplot` function. This function lets R know that we’re creating a new plot, and any of the arguments we give the `ggplot` function are the global options for the plot: they apply to all layers on the plot.

We’ve passed in two arguments to `ggplot`. First, we tell `ggplot` what data we want to show on our figure, in this example the gapminder data we loaded earlier. For the second argument, we passed in the `aes` function, which tells `ggplot` how variables in the <b>data</b> map to aesthetic properties of the figure, in this case the <b>x</b> and <b>y</b> locations. Here we told `ggplot` we want to plot the “gdpPercap” column of the gapminder data frame on the x-axis, and the “lifeExp” column on the y-axis. Notice that we didn’t need to explicitly pass `aes` these columns (e.g. `x = gapminder[, "gdpPercap"]`). This is because `ggplot` is smart enough to know to look in the <b>data</b> for that column!

By itself, the call to `ggplot` isn’t enough to draw a figure:

In [None]:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

We need to tell `ggplot` how we want to visually represent the data, which we do by adding a new <b>geom</b> layer. In our example, we used `geom_point`, which tells `ggplot` we want to visually represent the relationship between <b>x</b> and <b>y</b> as a scatterplot of points:

In [None]:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
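
Because each `+` adds a layer to a plot object, the layered structure described above can be made explicit by storing the base plot in a variable first. The following is a minimal sketch; the variable names `base_plot` and `scatter_plot` are illustrative and not part of the lesson.

```r
library(ggplot2)
library(gapminder)

# The base plot holds the data and the global aesthetic mappings, but no geoms yet
base_plot <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

# Adding a geom layer returns a new plot object; base_plot itself is unchanged
scatter_plot <- base_plot + geom_point()

# Printing a plot object draws it
print(scatter_plot)
```

Keeping the base plot in a variable makes it easy to try several geoms on the same data and mappings.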

## Layers
Using a scatterplot probably isn’t the best for visualizing change over time. Instead, let’s tell `ggplot` to visualize the data as a line plot:

In [None]:
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country, color=continent)) +
  geom_line()

Instead of adding a `geom_point` layer, we’ve added a `geom_line` layer.

We’ve added the `group` aesthetic, which tells `ggplot` to draw a line for each country.

But what if we want to visualize both lines and points on the plot? We can add another layer to the plot:

In [None]:
ggplot(data = gapminder, mapping = aes(x=year, y=lifeExp, group=country)) +
  geom_line(mapping = aes(color=continent)) + geom_point()

In this example, the aesthetic mapping of color has been moved from the global plot options in `ggplot` to the `geom_line` layer so it no longer applies to the points. Now we can clearly see that the points are drawn on top of the lines.
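
Layers are drawn in the order they are added. As a sketch of this (a variation for illustration, with `lines_on_top` as a made-up name), adding the point layer before the line layer draws the lines over the points instead:

```r
library(ggplot2)
library(gapminder)

# Points are added first, so the coloured lines are drawn on top of them
lines_on_top <- ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, group = country)) +
  geom_point() +
  geom_line(mapping = aes(color = continent))

print(lines_on_top)
```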

## Transformations and statistics
ggplot2 also makes it easy to overlay statistical models over the data. To demonstrate we’ll go back to our first example:

In [None]:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

+ It’s hard to see the relationship between the points due to some strong outliers in GDP per capita. 
+ We can change the scale of units on the x axis using the scale functions. These control the mapping between the data values and visual values of an aesthetic. 
+ We can also modify the transparency of the points, using the `alpha` argument, which is especially helpful when you have a large amount of data which is very clustered.

In [None]:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) + scale_x_log10()

The `scale_x_log10` function applied a log10 transformation to the values of the gdpPercap column before rendering them on the plot, so that each multiple of 10 corresponds to an increase of 1 on the transformed scale: a GDP per capita of 1,000 sits at position 3 on the transformed x axis, a value of 10,000 at position 4, and so on. This makes it easier to visualize the spread of data on the x-axis.
We can fit a simple relationship to the data by adding another layer, `geom_smooth`:

In [None]:
ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point() + scale_x_log10() + geom_smooth(method="lm")
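
To see what the log scale and the `lm` smoother are doing numerically, the same steps can be reproduced outside the plot. This is a sketch under the assumption that `geom_smooth(method = "lm")` fits the default formula `y ~ x` to the log10-transformed x values; `fit` is an illustrative name.

```r
library(gapminder)

# log10 turns each multiple of 10 into a step of 1
log10(c(1000, 10000, 100000))   # 3 4 5

# A straight-line fit of life expectancy against log10(GDP per capita)
fit <- lm(lifeExp ~ log10(gdpPercap), data = gapminder)
coef(fit)
```

The slope returned by `coef(fit)` is the expected change in life expectancy for each tenfold increase in GDP per capita.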

## Multi-panel figures
Earlier we visualized the change in life expectancy over time across all countries in one plot. Alternatively, we can split this out over multiple panels by adding a layer of <b>facet</b> panels.

In [None]:
americas <- gapminder[gapminder$continent == "Americas",]
ggplot(data = americas, mapping = aes(x = year, y = lifeExp)) +
  geom_line() +
  facet_wrap( ~ country) +
  theme(axis.text.x = element_text(angle = 45))

The `facet_wrap` layer took a “formula” as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the country column of the `americas` data frame.
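
`facet_wrap` accepts other columns too, and its `ncol` argument controls how the panels are laid out. As an illustrative variation (the name `by_continent` is an assumption), the sketch below draws one panel per continent for the full dataset:

```r
library(ggplot2)
library(gapminder)

# One panel per continent; group = country keeps one line per country
by_continent <- ggplot(data = gapminder, mapping = aes(x = year, y = lifeExp, group = country)) +
  geom_line() +
  facet_wrap(~ continent, ncol = 3)

print(by_continent)
```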

## Modifying text 

To clean this figure up for a publication we need to change some of the text elements. The x-axis is too cluttered, and the y axis should read “Life expectancy”, rather than the column name in the data frame.

We can do this by adding a couple of different layers. The theme layer controls the axis text, and overall text size. Labels for the axes, plot title and any legend can be set using the `labs` function. Legend titles are set using the same names we used in the `aes` specification. Thus below the color legend title is set using `color = "Continent"`, while the title of a fill legend would be set using `fill = "MyTitle"`.

In [None]:
ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
  geom_line() + facet_wrap( ~ country) +
  labs(
    x = "Year",              # x axis title
    y = "Life expectancy",   # y axis title
    title = "Figure 1",      # main title of figure
    color = "Continent"      # title of legend
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

## Exporting the plot

The `ggsave()` function allows you to export a plot created with ggplot. You can specify the dimension and resolution of your plot by adjusting the appropriate arguments (`width`, `height` and `dpi`) to create high quality graphics for publication. In order to save the plot from above, we first assign it to a variable `lifeExp_plot`, then tell `ggsave` to save that plot in `png` format to a directory called `r_images`. (Make sure you have an `r_images/` folder in your working directory.)

In [None]:
lifeExp_plot <- ggplot(data = americas, mapping = aes(x = year, y = lifeExp, color=continent)) +
  geom_line() + facet_wrap( ~ country) +
  labs(
    x = "Year",              # x axis title
    y = "Life expectancy",   # y axis title
    title = "Figure 1",      # main title of figure
    color = "Continent"      # title of legend
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggsave(filename = "r_images/lifeExp.png", plot = lifeExp_plot, width = 12, height = 10, dpi = 300, units = "cm")

There are two nice things about `ggsave`. First, it defaults to the last `plot`, so if you omit the plot argument it will automatically save the last plot you created with `ggplot`. Secondly, it tries to determine the format you want to save your plot in from the file extension you provide for the filename (for example `.png` or `.pdf`). If you need to, you can specify the format explicitly in the `device` argument.
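
The format detection described above can be sketched by saving a plot with a `.pdf` extension. The plot, the file name `lifeExp_scatter.pdf`, and the use of a temporary directory are all assumptions made so that the example runs without any project folders.

```r
library(ggplot2)
library(gapminder)

small_plot <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.5) +
  scale_x_log10()

# Write to a temporary directory so the sketch does not depend on any project folders
pdf_path <- file.path(tempdir(), "lifeExp_scatter.pdf")

# The .pdf extension selects the PDF device; width and height are in cm here
ggsave(filename = pdf_path, plot = small_plot, width = 12, height = 10, units = "cm")

file.exists(pdf_path)
```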

## Vectorization

Most of R’s functions are vectorized, meaning that the function will operate on all elements of a vector without needing to loop through and act on each element one at a time. This makes writing code more concise, easy to read, and less error prone.

In [None]:
x <- 1:4
x * 2

The multiplication happened to each element of the vector.

We can also add two vectors together:

In [None]:
y <- 6:9
x + y

Each element of `x` was added to its corresponding element of `y`. Here is how we would add the two vectors together using a for loop:

In [None]:
output_vector <- c()
for (i in 1:4) {
  output_vector[i] <- x[i] + y[i]
}
output_vector

Compare this to the output using vectorised operations.

In [None]:
sum_xy <- x + y
sum_xy
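
Vectorization is not limited to arithmetic. As a short sketch (the names `loop_sum` and `vec_sum` are illustrative), comparison operators and many functions also work element by element, and the loop and vectorized versions of the sum give the same values:

```r
x <- 1:4
y <- 6:9

# Loop version, with the output vector allocated up front
loop_sum <- numeric(length(x))
for (i in seq_along(x)) {
  loop_sum[i] <- x[i] + y[i]
}

# Vectorized version
vec_sum <- x + y

all(loop_sum == vec_sum)   # TRUE

# Comparisons and functions are vectorized too
x > 2      # FALSE FALSE TRUE TRUE
log(x)
```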

## Writing Data

At some point, you’ll also want to write out data from R.

We can use the `write.table` function for this, which is very similar to `read.table` from before.

Let’s create a data-cleaning script. For this analysis, we only want to focus on the gapminder data for Australia:

In [None]:
aust_subset <- gapminder[gapminder$country == "Australia",]

write.table(aust_subset,
  file="data/gapminder-aus.csv",
  sep=","
)

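The resulting file is not quite what we want: it contains quotation marks and row names. Let’s look at the help file to work out how to change this behaviour.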

In [None]:
?write.table

By default R will wrap character vectors with quotation marks when writing out to file. It will also write out the row and column names.

Let’s fix this:

In [None]:
write.table(
  gapminder[gapminder$country == "Australia",],
  file="data/gapminder-aus.csv",
  sep=",", quote=FALSE, row.names=FALSE
)
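
One way to check the written file is to read it back in. The sketch below assumes a temporary file in place of `data/gapminder-aus.csv` so that it runs without a `data/` folder; `csv_path` and `aust_check` are illustrative names.

```r
library(gapminder)

aust_subset <- gapminder[gapminder$country == "Australia", ]

# A temporary file stands in for data/gapminder-aus.csv
csv_path <- tempfile(fileext = ".csv")
write.table(aust_subset, file = csv_path, sep = ",", quote = FALSE, row.names = FALSE)

# The first line is the header, with no quotation marks or row names
readLines(csv_path, n = 2)

# Reading the file back gives the same number of rows and the same column names
aust_check <- read.csv(csv_path)
dim(aust_check)
names(aust_check)
```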

## Splitting and Combining Data Frames with plyr



In [None]:
# Takes a dataset and multiplies the population column
# with the GDP per capita column.
calcGDP <- function(dat, year=NULL, country=NULL) {
  if(!is.null(year)) {
    dat <- dat[dat$year %in% year, ]
  }
  if (!is.null(country)) {
    dat <- dat[dat$country %in% country,]
  }
  gdp <- dat$pop * dat$gdpPercap

  new <- cbind(dat, gdp=gdp)
  return(new)
}

A common task you’ll encounter when working with data is running calculations on different groups within the data. In the above, we were calculating the GDP by multiplying two columns together.
+ But what if we wanted to calculate the mean GDP per continent?
   + We could run `calcGDP` and then take the mean of each continent:

In [None]:
withGDP <- calcGDP(gapminder)
mean(withGDP[withGDP$continent == "Africa", "gdp"])


In [None]:
mean(withGDP[withGDP$continent == "Americas", "gdp"])

In [None]:
mean(withGDP[withGDP$continent == "Asia", "gdp"])

But this isn’t very nice. Yes, by using a function, you have reduced a substantial amount of repetition. That is nice. But there is still repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

We could write a new function that is flexible like `calcGDP`, but this also takes a substantial amount of effort and testing to get right.

The abstract problem we’re encountering here is known as “<b>split-apply-combine</b>”:

![image9](r_images\image9.png)

We want to split our data into groups, in this case continents, apply some calculations on that group, then optionally combine the results together afterwards.
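
The three steps can be sketched with base R alone, before bringing in any package. This is only an illustration of the pattern; the variable names are assumptions, and `gdp` is computed directly rather than through `calcGDP` so that the block is self-contained.

```r
library(gapminder)

# Total GDP for every row (population times GDP per capita)
gdp <- gapminder$pop * gapminder$gdpPercap

# Split: one vector of GDP values per continent
gdp_by_continent <- split(gdp, gapminder$continent)

# Apply and combine: take the mean of each piece and collect the results in a named vector
mean_gdp <- sapply(gdp_by_continent, mean)
mean_gdp
```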

### The `plyr` package

While R’s built in functions do work, we’re going to introduce you to another method for solving the “split-apply-combine” problem. The plyr package provides a set of functions that we find more user friendly for solving this problem.

In [None]:
library("plyr")

Plyr has functions for operating on `lists`, `data.frames` and `arrays` (matrices, or n-dimensional vectors). Each function performs three steps:

+ Split the data into groups.
+ Apply a function to each split in turn.
+ Recombine the output data as a single data object.

The functions are named based on the data structure they expect as input, and the data structure you want returned as output: [a]rray, [l]ist, or [d]ata.frame. The first letter corresponds to the input data structure, the second letter to the output data structure, and then the rest of the function is named “ply”.

This gives us 9 core functions `**ply`. There are an additional three functions which will only perform the split and apply steps, and not any combine step. They’re named by their input data type and represent null output by a `_` (see table).

Note here that plyr’s use of “array” is different to R’s: an array in plyr can include a vector or matrix.

![image10](r_images\image10.png)

Each of the xxply functions (`daply`, `ddply`, `llply`, `laply`, …) has the same structure, with 4 key features:

```r
xxply(.data, .variables, .fun)
```

+ The first letter of the function name gives the input type and the second gives the output type.
+ .data - gives the data object to be processed
+ .variables - identifies the splitting variables
+ .fun - gives the function to be called on each piece

Now we can quickly calculate the mean GDP per continent:

In [None]:
ddply(
 .data = calcGDP(gapminder),
 .variables = "continent",
 .fun = function(x) mean(x$gdp)
)

Going through the code snippet above:

+ The `ddply` function feeds in a `data.frame` (function starts with d) and returns another `data.frame` (2nd letter is a d).
+ The first argument we gave was the data.frame we wanted to operate on: in this case the gapminder data. We called `calcGDP` on it first so that it would have the additional `gdp` column added to it.
+ The second argument indicated our split criteria: in this case the “continent” column. Note that we gave the name of the column, not the values of the column like we had done previously with subsetting. Plyr takes care of these implementation details for you.
+ The third argument is the function we want to apply to each grouping of the data. We had to define our own short function here: each subset of the data gets stored in `x`, the first argument of our function. This is an anonymous function: we haven’t defined it elsewhere, and it has no name. It only exists in the scope of our call to `ddply`.
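
When the anonymous function returns a single number, `ddply` gives the result column a default name (`V1`). A small variation, sketched here with illustrative names (`with_gdp`, `mean_gdp`), is to return a one-row data frame so that the output column is named explicitly. The `gdp` column is added directly so the block does not depend on `calcGDP`.

```r
library(plyr)
library(gapminder)

# Add the gdp column directly (the same product that calcGDP computes)
with_gdp <- as.data.frame(gapminder)
with_gdp$gdp <- with_gdp$pop * with_gdp$gdpPercap

# Returning a one-row data frame lets us choose the output column name
mean_gdp_by_continent <- ddply(
  .data = with_gdp,
  .variables = "continent",
  .fun = function(x) data.frame(mean_gdp = mean(x$gdp))
)
mean_gdp_by_continent
```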

If we want a different type of output data structure:

In [None]:
dlply(
 .data = calcGDP(gapminder),
 .variables = "continent",
 .fun = function(x) mean(x$gdp)
)

We called the same function again, but changed the second letter to an `l`, so the output was returned as a list.

We can specify multiple columns to group by:

In [None]:
ddply(
 .data = calcGDP(gapminder),
 .variables = c("continent", "year"),
 .fun = function(x) mean(x$gdp)
)

In [None]:
daply(
 .data = calcGDP(gapminder),
 .variables = c("continent", "year"),
 .fun = function(x) mean(x$gdp)
)

You can use these functions in place of `for` loops (and it is usually faster to do so). To replace a `for` loop, put the code that was in the body of the for loop inside an anonymous function.

In [None]:
d_ply(
  .data=gapminder,
  .variables = "continent",
  .fun = function(x) {
    meanGDPperCap <- mean(x$gdpPercap)
    print(paste(
      "The mean GDP per capita for", unique(x$continent),
      "is", format(meanGDPperCap, big.mark=",")
   ))
  }
)
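
For comparison, here is a sketch of the same report written as an explicit `for` loop in base R. The body of the loop is what went into the anonymous function above; the subsetting line is the "split" step that `d_ply` performs for us. The names `continents`, `in_continent` and `messages` are illustrative.

```r
library(gapminder)

continents <- levels(gapminder$continent)
messages <- character(length(continents))

for (i in seq_along(continents)) {
  # Subset by hand: this is the split step that d_ply does for us
  in_continent <- gapminder$continent == continents[i]
  meanGDPperCap <- mean(gapminder$gdpPercap[in_continent])
  messages[i] <- paste(
    "The mean GDP per capita for", continents[i],
    "is", format(meanGDPperCap, big.mark = ",")
  )
}
print(messages)
```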

## Dataframe Manipulation with dplyr
Manipulation of dataframes means many things to many researchers: we often select certain observations (rows) or variables (columns), group the data by one or more variables, or calculate summary statistics.

### The `dplyr` package

The `dplyr` package provides a number of very useful functions for manipulating dataframes in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the `dplyr` grammar easier to read.

Here we’re going to cover 5 of the most commonly used functions as well as using pipes (`%>%`) to combine them.

1. `select()`
2. `filter()`
3. `group_by()`
4. `summarize()`
5. `mutate()`

If you have not installed this package earlier, please do so:

```r
install.packages('dplyr')
```

Now let’s load the package:

In [None]:
library("dplyr")

### Using `select()`

If, for example, we wanted to move forward with only a few of the variables in our dataframe, we could use the `select()` function. This will keep only the variables you select.

In [None]:
year_country_gdp <- select(gapminder, year, country, gdpPercap)

![image11](r_images\image11.png)

If we open up `year_country_gdp` we’ll see that it only contains the year, country and gdpPercap. Above we used ‘normal’ grammar, but the strengths of `dplyr` lie in combining several functions using pipes. Since the pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.

In [None]:
year_country_gdp <- gapminder %>% select(year, country, gdpPercap)

Let’s walk through it step by step.
First we summon the gapminder dataframe and pass it on, using the pipe symbol `%>%`, to the next step, which is the `select()` function. In this case we don’t specify which data object we use in the `select()` function since it gets that from the previous pipe. Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is `%>%` while in the shell it is `|` but the concept is the same!
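
Put differently, `x %>% f(y)` is another way of writing `f(x, y)`. The sketch below checks that the piped and the 'normal' forms of the `select()` call give the same result; `piped` and `nested` are illustrative names.

```r
library(dplyr)
library(gapminder)

piped <- gapminder %>% select(year, country, gdpPercap)
nested <- select(gapminder, year, country, gdpPercap)
identical(piped, nested)

# Pipes chain left to right: take gapminder, select columns, then show the first rows
gapminder %>%
  select(year, country, gdpPercap) %>%
  head(3)
```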

In [None]:
tidy_gdp <- year_country_gdp %>% rename(gdp_per_capita = gdpPercap)

head(tidy_gdp)


### Using `filter()`

If we now wanted to move forward with the above, but only with European countries, we can combine `select` and `filter`:

In [None]:
year_country_gdp_euro <- gapminder %>%
    filter(continent == "Europe") %>%
    select(year, country, gdpPercap)

As with last time, first we pass the gapminder dataframe to the `filter()` function, then we pass the filtered version of the gapminder dataframe to the `select()` function.
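
`filter()` can also take several conditions at once. As a sketch (the year 2007 and the name `europe_2007` are choices made for illustration), conditions separated by commas must all hold for a row to be kept:

```r
library(dplyr)
library(gapminder)

# Conditions separated by commas are combined with AND
europe_2007 <- gapminder %>%
  filter(continent == "Europe", year == 2007) %>%
  select(country, lifeExp)

head(europe_2007)
```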

### Using `group_by()` and `summarize()`

Now, we were supposed to be reducing the error prone repetitiveness of what can be done with base R, but up to now we haven’t done that since we would have to repeat the above for each continent. Instead of `filter()`, which will only pass observations that meet your criteria (in the above: `continent=="Europe"`), we can use `group_by()`, which will essentially use every unique criterion that you could have used in `filter()`.

In [None]:
str(gapminder)

In [None]:
str(gapminder %>% group_by(continent))

You will notice that the structure of the dataframe where we used `group_by()` (`grouped_df`) is not the same as the original `gapminder` (`data.frame`). A `grouped_df` can be thought of as a `list` where each item in the `list` is a `data.frame` which contains only the rows that correspond to a particular value of `continent`.

![image12](r_images\image12.png)

### Using `summarize()`

The above was a bit on the uneventful side but `group_by()` is much more exciting in conjunction with `summarize()`. This will allow us to create new variable(s) by using functions that repeat for each of the continent-specific data frames. That is to say, using the `group_by()` function, we split our original dataframe into multiple pieces, then we can run functions (e.g. `mean()` or `sd()`) within `summarize()`.

In [None]:
gdp_bycontinents <- gapminder %>%
    group_by(continent) %>%
    summarize(mean_gdpPercap = mean(gdpPercap))
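
The same pattern extends to several grouping variables and several summary columns, and it combines with `mutate()`, which adds a new column computed from existing ones. The following is a sketch with illustrative column names (`gdp_billion`, `sd_gdpPercap`, `mean_gdp_billion`); the `.groups = "drop"` argument assumes dplyr 1.0 or later.

```r
library(dplyr)
library(gapminder)

gdp_by_continent_year <- gapminder %>%
  mutate(gdp_billion = gdpPercap * pop / 10^9) %>%   # new column from existing ones
  group_by(continent, year) %>%                      # one group per continent-year pair
  summarize(
    mean_gdpPercap = mean(gdpPercap),
    sd_gdpPercap = sd(gdpPercap),
    mean_gdp_billion = mean(gdp_billion),
    .groups = "drop"                                 # return an ungrouped result
  )

head(gdp_by_continent_year)
```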

## Dataframe Manipulation with tidyr

Researchers often want to reshape their dataframes from ‘wide’ to ‘longer’ layouts, or vice-versa. The ‘long’ layout or format is where:

+ each column is a variable
+ each row is an observation

In the purely ‘long’ (or ‘longest’) format, you usually have 1 column for the observed variable and the other columns are ID variables.

For the ‘wide’ format each row is often a site/subject/patient and you have multiple observation variables containing the same type of data. These can be either repeated observations over time, or observation of multiple variables (or a mix of both). You may find data input may be simpler or some other applications may prefer the ‘wide’ format. However, many of `R`’s functions have been designed assuming you have ‘longer’ formatted data. This tutorial will help you efficiently transform your data shape regardless of original format.

![image14](r_images\image14.png)

Long and wide dataframe layouts mainly affect readability. For humans, the wide format is often more intuitive since we can often see more of the data on the screen due to its shape. However, the long format is more machine readable and is closer to the formatting of databases. The ID variables in our dataframes are similar to the fields in a database and observed variables are like the database values.

### From wide to long format with pivot_longer()

In [None]:
library("tidyr")
str(gapminder)

In [None]:
gap_wide <- read.csv("data/gapminder_all.csv", stringsAsFactors = FALSE)
str(gap_wide)

![image15](r_images\image15.png)

To change this very wide dataframe layout back to our nice, intermediate (or longer) layout, we will use one of the two available `pivot` functions from the `tidyr` package. To convert from wide to a longer format, we will use the `pivot_longer()` function. `pivot_longer()` makes datasets longer by increasing the number of rows and decreasing the number of columns, or ‘lengthening’ your observation variables into a single variable.

![image16](r_images\image16.png)

In [None]:
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(starts_with('pop'), starts_with('lifeExp'), starts_with('gdpPercap')),
    names_to = "obstype_year", values_to = "obs_values"
  )
str(gap_long)

Here we have used piping syntax which is similar to what we were doing in the previous lesson with dplyr. In fact, these are compatible and you can use a mix of tidyr and dplyr functions by piping them together.

We first provide to `pivot_longer()` a vector of column names that will be pivoted into longer format. We could type out all the observation variables, but as in the `select()` function (see dplyr lesson), we can use the `starts_with()` helper to select all variables that start with the desired character string. `pivot_longer()` also allows the alternative syntax of using the `-` symbol to identify which variables are not to be pivoted (i.e. ID variables).

The next arguments to `pivot_longer()` are `names_to` for naming the column that will contain the new ID variable (`obstype_year`) and `values_to` for naming the new amalgamated observation variable (`obs_values`). We supply these new column names as strings.
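
The effect of these arguments is easier to see on a tiny dataframe. The sketch below uses made-up values in a hypothetical `toy_wide` table with the same naming pattern as `gap_wide`:

```r
library(tidyr)

# Two countries, one observation type, two years (made-up values)
toy_wide <- data.frame(
  country = c("A", "B"),
  pop_1952 = c(10, 20),
  pop_2007 = c(15, 30)
)

toy_long <- pivot_longer(
  toy_wide,
  cols = starts_with("pop"),
  names_to = "obstype_year",   # old column names end up here
  values_to = "obs_values"     # old cell values end up here
)
toy_long
```

Each of the two wide rows becomes two long rows, one per pivoted column.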

![image17](r_images\image17.png)

In [None]:
gap_long <- gap_wide %>%
  pivot_longer(
    cols = c(-continent, -country),
    names_to = "obstype_year", values_to = "obs_values"
  )
str(gap_long)

That may seem trivial with this particular dataframe, but sometimes you have 1 ID variable and 40 observation variables with irregular variable names.

Now `obstype_year` actually contains 2 pieces of information, the observation type (`pop`, `lifeExp`, or `gdpPercap`) and the `year`. We can use the `separate()` function to split the character strings into multiple variables:

In [None]:
gap_long <- gap_long %>% separate(obstype_year, into = c('obs_type', 'year'), sep = "_")
gap_long$year <- as.integer(gap_long$year)
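
As a sketch on made-up data (the `toy_long` table is hypothetical), `separate()` can also do the type conversion itself through its `convert` argument, which replaces the separate `as.integer()` step:

```r
library(tidyr)

toy_long <- data.frame(
  country = c("A", "A", "B", "B"),
  obstype_year = c("pop_1952", "pop_2007", "pop_1952", "pop_2007"),
  obs_values = c(10, 15, 20, 30)
)

# convert = TRUE turns the new year column from character into integer
toy_split <- separate(
  toy_long, obstype_year,
  into = c("obs_type", "year"), sep = "_", convert = TRUE
)
str(toy_split)
```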

### From long to intermediate format with pivot_wider()

Let’s use the second `pivot` function, `pivot_wider()`, to ‘widen’ our observation variables back out. `pivot_wider()` is the opposite of `pivot_longer()`, making a dataset wider by increasing the number of columns and decreasing the number of rows. We can use `pivot_wider()` to pivot or reshape our `gap_long` to the original intermediate format or the widest format. Let’s start with the intermediate format.

The `pivot_wider()` function takes `names_from` and `values_from` arguments.

To `names_from` we supply the column name whose contents will be pivoted into new output columns in the widened dataframe. The corresponding values will be added from the column named in the `values_from` argument.

In [None]:
gap_normal <- gap_long %>%
  pivot_wider(names_from = obs_type, values_from = obs_values)
dim(gap_normal)

In [None]:
dim(gapminder)

In [None]:
names(gap_normal)

In [None]:
names(gapminder)

Now we’ve got an intermediate dataframe `gap_normal` with the same dimensions as the original `gapminder`, but the order of the variables is different. Let’s fix that before checking if they are `all.equal()`.

In [None]:
gap_normal <- gap_normal[, names(gapminder)]
all.equal(gap_normal, gapminder)

In [None]:
head(gap_normal)

In [None]:
head(gapminder)

The original was sorted by `country`, then `year`.

In [None]:
gap_normal <- gap_normal %>% arrange(country, year)
all.equal(gap_normal, gapminder)

We’ve gone from the longest format back to the intermediate and we didn’t introduce any errors in our code.
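
The same round trip can be sketched on a tiny made-up table, which makes it easy to inspect every value by eye. All names here (`toy_wide`, `toy_long`, `toy_back`) are illustrative.

```r
library(tidyr)

toy_wide <- data.frame(
  country = c("A", "B"),
  pop_1952 = c(10, 20),
  pop_2007 = c(15, 30)
)

# Lengthen, then widen again
toy_long <- pivot_longer(toy_wide, cols = -country,
                         names_to = "obstype_year", values_to = "obs_values")
toy_back <- pivot_wider(toy_long, names_from = obstype_year, values_from = obs_values)

# pivot_wider() returns a tibble, so convert before comparing with the original
all.equal(as.data.frame(toy_back), toy_wide)
```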

Now let’s convert the long all the way back to the wide. In the wide format, we will keep country and continent as ID variables and pivot the observations across the 3 metrics (`pop`, `lifeExp`, `gdpPercap`) and time (`year`). First we need to create appropriate labels for all our new variables (time*metric combinations) and we also need to unify our ID variables to simplify the process of defining `gap_wide`.

In [None]:
gap_temp <- gap_long %>% unite(ID_var, continent, country, sep = "_")
str(gap_temp)

In [None]:
gap_temp <- gap_long %>%
    unite(ID_var, continent, country, sep = "_") %>%
    unite(var_names, obs_type, year, sep = "_")
str(gap_temp)
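
What the two `unite()` calls do is easiest to see on a couple of rows. The `toy` table below is a hypothetical stand-in with the same column names as `gap_long`:

```r
library(tidyr)

toy <- data.frame(
  continent = c("Asia", "Europe"),
  country = c("Japan", "France"),
  obs_type = c("pop", "pop"),
  year = c(1952L, 1952L)
)

# Each unite() pastes its input columns together and drops the originals
toy_united <- unite(toy, ID_var, continent, country, sep = "_")
toy_united <- unite(toy_united, var_names, obs_type, year, sep = "_")
toy_united
```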

In [None]:
gap_wide_new <- gap_long %>%
  unite(ID_var, continent, country, sep = "_") %>%
  unite(var_names, obs_type, year, sep = "_") %>%
  pivot_wider(names_from = var_names, values_from = obs_values)
str(gap_wide_new)

Now we have a great ‘wide’ format dataframe, but the `ID_var` could be more usable. Let’s separate it into 2 variables with `separate()`:

In [None]:
# Option 1: split the ID column of the wide data frame we already built
gap_wide_betterID <- separate(gap_wide_new, ID_var, c("continent", "country"), sep="_")
# Option 2: rebuild the same result from gap_long in a single pipeline
gap_wide_betterID <- gap_long %>%
    unite(ID_var, continent, country, sep = "_") %>%
    unite(var_names, obs_type, year, sep = "_") %>%
    pivot_wider(names_from = var_names, values_from = obs_values) %>%
    separate(ID_var, c("continent","country"), sep = "_")
str(gap_wide_betterID)

In [None]:
all.equal(gap_wide, gap_wide_betterID)


## Producing Reports With knitr

### Basic components of R Markdown

The initial chunk of text (header) contains instructions for R to specify what kind of document will be created, and the options chosen. You can use the header to give your document a title, author, date, and tell it that you’re going to want to produce html output (in other words, a web page).

```markdown
---

title: "Initial R Markdown document"
author: "Karl Broman"
date: "April 23, 2015"
output: html_document

---
```

You can delete any of those fields if you don’t want them included. The double-quotes aren’t strictly necessary in this case. They’re mostly needed if you want to include a colon in the title.

RStudio creates the document with some example text to get you started. Note below that there are chunks like:

![image18](r_images\image18.png)

These are chunks of R code that will be executed by `knitr` and replaced by their results. More on this later.
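
To make the "executed and replaced by their results" step concrete, the sketch below writes a minimal R Markdown file to a temporary location and runs `knitr::knit()` on it directly, outside RStudio. The file contents and all variable names are assumptions made for illustration; the three backticks that delimit a chunk are built with `strrep()` only to avoid nesting code fences in this example.

```r
library(knitr)

fence <- strrep("`", 3)   # three backticks, built here to avoid nesting code fences

rmd_lines <- c(
  "---",
  "title: \"Minimal demo\"",
  "output: html_document",
  "---",
  "",
  "Some text with **bold** Markdown.",
  "",
  paste0(fence, "{r}"),
  "x <- 1:4",
  "x * 2",
  fence
)

rmd_path <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_path)

# knit() runs the chunk and writes a Markdown file containing the code and its output
md_path <- knit(rmd_path, output = tempfile(fileext = ".md"), quiet = TRUE)
cat(readLines(md_path), sep = "\n")
```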

Also note the web address that’s put between angle brackets (`< >`) as well as the double-asterisks in `**Knit**`. This is Markdown.