Data exploration and visualization {#chap:R_data}
==================================

Before you do any fancy statistical analyses with data, you must clean,
explore, and visualize it. And eventually, you want to produce a
finished product that presents visualizations of your data and your
results clearly and concisely. Ultimately, at both the data exploration
and the finished product stages, the goal of graphics is to present
information such that it provides intuitive ideas. As Edward Tufte says:

> *“Graphical excellence is that which gives to the viewer the greatest
> number of ideas in the shortest time with the least ink in the
> smallest space.”*

This chapter aims to introduce you to principles, R packages, and
commands that will allow you to build a computational pipeline/workflow
for the critical steps of your data analysis and visualization.

We will start with some basic plotting and data exploration. Then, you
will learn to generate publication-quality graphics using the <span>
ggplot2</span> package. Finally, you will learn some principles and
methods for data processing and storage in R.

Basic plotting and graphical data exploration
---------------------------------------------

R can produce beautiful graphics without the time-consuming and fiddly
methods that you might have used in Excel or equivalent. You should also
make it a habit to quickly plot the data for exploratory analysis. So we
are going to learn some basic plotting first.

### Basic plotting commands

Here is a menu of basic R plotting commands (use
<span>?commandname</span> to learn more about it):

  --------------------------------- --------------------------------------------------------
  <span>plot(x,y)</span>            Scatterplot
  <span>plot(y \~ x)</span>         Scatterplot with <span>y</span> as a response variable
  <span>hist(mydata)</span>         Histogram
  <span>barplot(mydata)</span>      Bar plot
  <span>points(y1 \~ x1)</span>     Add another series of points
  <span>boxplot(y \~ x)</span>      Boxplot
  --------------------------------- --------------------------------------------------------


### R graphics devices

In all that follows, you may often end up plotting multiple plots on the
same graphics window without intending to do so, because R by default
keeps plotting in the most recent plotting window that was opened. You
can close a particular graphics window or “device” by using
<span>dev.off()</span>, and all open devices/windows with <span>
graphics.off()</span>. By default, <span>dev.off()</span> will close the
most recent figure device that was opened.

Note that there are invisible devices in <span>R</span>! For example,
if you are printing to pdf (coming up below), the device or graphics
window will not be visible on your computer screen.
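
The device bookkeeping described above can be tried out safely with off-screen devices. The following is a minimal sketch, assuming you start from a fresh R session; the temporary pdf files are throwaway targets used only for illustration.

```r
graphics.off()                     # start from a clean slate: no open devices

# Open two "invisible" pdf devices that write to temporary files
pdf(tempfile(fileext = ".pdf"))
pdf(tempfile(fileext = ".pdf"))

n_open <- length(dev.list())       # number of devices currently open
dev.cur()                          # the active device (the one opened last)

dev.off()                          # closes only the active device
n_after_one <- length(dev.list())

graphics.off()                     # closes every remaining device
n_after_all <- length(dev.list())  # dev.list() is NULL when nothing is open
```

If plots keep landing in the wrong window, checking <span>dev.list()</span> and <span>dev.cur()</span> like this is usually the quickest way to find out which device is active.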

Now let’s try some simple plotting for data exploration. As a case
study, we will use a dataset on Consumer-Resource (e.g., Predator-Prey)
body mass ratios taken from the Ecological Archives of the ESA (Barnes
<span>*et al.*</span> 2008, Ecology 89:881).

\[$\quad\star$\]

Copy the file <span>EcolArchives-E089-51-D1.csv</span> from the
<span>Data</span> directory in the master git repository on bitbucket to
your own <span>Data</span> directory.

Now, launch R and read in these data to a data frame:

In [None]:
> MyDF <- read.csv("../Data/EcolArchives-E089-51-D1.csv")
> dim(MyDF) #check the size of the data frame you loaded
[1] 34931    15

Let’s look at what the data contain: type <span>MyDF\$</span> and hit
the TAB key twice (if you are using RStudio, you only need to hit it
once). This will give the following result:

In [None]:
> MyDF$
MyDF$Record.number                MyDF$Predator.mass
MyDF$In.refID                     MyDF$Prey
MyDF$IndividualID                 MyDF$Prey.common.name
MyDF$Predator                     MyDF$Prey.taxon
MyDF$Predator.common.name         MyDF$Prey.mass
MyDF$Predator.taxon               MyDF$Prey.mass.unit
MyDF$Predator.lifestage           MyDF$Location
MyDF$Type.of.feeding.interaction  

<span>**In RStudio, you will see a drop-down list of all the column
headers when you hit <span>tab</span>**</span>.

You can also use the <span>str()</span> and <span>head()</span> commands
that you learned about in Chapter \[chap:RI\].

As you can see, these data contain predator-prey body size information.
This is an interesting dataset because it is huge, and covers a wide
range of body sizes of aquatic species involved in consumer-resource
interactions — from unicells to whales. Analyzing this dataset should
tell us a lot about what sizes of prey predators like to eat.

![A consumer-resource (predator-prey) interaction waiting to
happen.](SeaLion.pdf){width="80.00000%"}

### Scatter Plot

Let’s start by plotting Predator mass vs. Prey mass:

In [None]:
> plot(MyDF$Predator.mass,MyDF$Prey.mass)

![image](PPScat1.pdf){width="50.00000%"}

That doesn’t look very nice! Let’s try taking logarithms (why?).

In [None]:
> plot(log(MyDF$Predator.mass),log(MyDF$Prey.mass))

![image](PPScat2.pdf){width="50.00000%"}

We can change almost any aspect of the resulting graph; let’s change the
symbols by specifying the <span>p</span>lot <span>ch</span>aracters
using <span>pch</span>:

In [None]:
> plot(log(MyDF$Predator.mass),log(MyDF$Prey.mass),pch=20) # Change marker

![image](PPScat3.pdf){width="50.00000%"}

In [None]:
> plot(log(MyDF$Predator.mass),log(MyDF$Prey.mass),pch=20,
    xlab = "Predator Mass (kg)", ylab = "Prey Mass (kg)") # Add labels

![image](PPScat4.pdf){width="50.00000%"}

A really great summary of basic R graphical parameters can be found at
<https://www.statmethods.net/advgraphs/parameters.html>

### Histograms

Why did we have to take a logarithm to see the relationship between
predator and prey size? Plotting histograms of the two classes
(predator, prey) should be insightful, as we can then see the “marginal”
distributions of the two variables.

Let’s first plot a histogram of predator body masses:

In [None]:
> hist(MyDF$Predator.mass)

![image](PrHist1.pdf){width="50.00000%"}

Clearly, the data are heavily right skewed, with small body sized
organisms dominating (that’s a universal pattern on planet earth). Let’s
now take a logarithm and see if we can get a better idea of what the
distribution of predator sizes looks like:

In [None]:
> hist(log(MyDF$Predator.mass), 
   xlab = "Predator Mass (kg)", ylab = "Count") # include labels

![image](PrHist2.pdf){width="50.00000%"}

In [None]:
> hist(log(MyDF$Predator.mass),xlab="Predator Mass (kg)",ylab="Count", 
    col = "lightblue", border = "pink") # Change bar and borders colors 

![image](PrHist3.pdf){width="50.00000%"}

So, taking a log really makes clearer what the distribution of predator
body sizes looks like. <span>*Try the same with prey body
masses.*</span>
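
To see why the logarithm helps with right-skewed data in general, here is a minimal sketch using simulated values rather than the predator-prey dataset. The variable <span>mass</span> and the log-normal parameters are assumptions chosen purely for illustration.

```r
set.seed(1)                                     # make the simulation reproducible
mass <- rlnorm(10000, meanlog = 0, sdlog = 3)   # hypothetical right-skewed "masses"

# On the raw scale a few huge values drag the mean far above the median
raw_mean   <- mean(mass)
raw_median <- median(mass)

# On the log scale the values are roughly symmetric, so the two nearly agree
log_mean   <- mean(log(mass))
log_median <- median(log(mass))

# hist() can return the bin counts without drawing anything
raw_counts <- hist(mass, plot = FALSE)$counts
log_counts <- hist(log(mass), plot = FALSE)$counts

which.max(raw_counts)   # raw scale: almost everything sits in the first bin
```

On the raw scale nearly all observations fall into a single bin, which is why the untransformed histogram is so uninformative.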

#### Exercise

You can do a lot of beautification and fine-tuning of your R plots! As
an exercise, try adjusting the histogram bin widths to make them the
same for the predator and prey, and making the x and y labels larger and
in boldface. To get started, look at the help documentation of
<span>hist</span>.

### Subplots

We can also plot both predator and prey body masses in different
sub-plots using <span>par</span> so that we can compare them visually.

In [None]:
> par(mfcol=c(2,1)) #initialize multi-paneled plot
> par(mfg = c(1,1)) # specify which sub-plot to use first 
> hist(log(MyDF$Predator.mass),
    xlab = "Predator Mass (kg)", ylab = "Count", 
    col = "lightblue", border = "pink", 
    main = 'Predator') # Add title
> par(mfg = c(2,1)) # Second sub-plot
> hist(log(MyDF$Prey.mass),
    xlab="Prey Mass (kg)",ylab="Count", 
    col = "lightgreen", border = "pink", 
    main = 'Prey')

![image](PPHist1.pdf){width="50.00000%"}

Another option for making multi-panel plots is the <span>layout</span>
function.
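
As a rough illustration of <span>layout</span>, the sketch below arranges three panels of simulated data, with the third panel spanning the whole bottom row. The data and the temporary pdf file are assumptions for illustration only.

```r
set.seed(2)
x <- rnorm(200)          # hypothetical data
y <- x + rnorm(200)

pdf(tempfile(fileext = ".pdf"))   # off-screen device, for illustration

# The matrix says which panel occupies each cell of a 2 x 2 grid:
# panels 1 and 2 share the top row, panel 3 fills the bottom row
n_panels <- layout(matrix(c(1, 2,
                            3, 3), nrow = 2, byrow = TRUE))

hist(x, main = "x")               # drawn in panel 1
hist(y, main = "y")               # drawn in panel 2
plot(x, y, main = "y against x")  # drawn in panel 3

dev.off()
```

Unlike <span>par(mfcol = ...)</span>, the layout matrix lets panels have different sizes, because the same panel number can be repeated across cells.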

### Overlaying plots

Better still, we would like to see if the predator mass and prey mass
distributions are similar by overlaying them.

In [None]:
> hist(log(MyDF$Predator.mass), # Predator histogram
    xlab="Body Mass (kg)", ylab="Count", 
    col = rgb(1, 0, 0, 0.5), # Note 'rgb', fourth value is transparency
    main = "Predator-prey size Overlap") 
> hist(log(MyDF$Prey.mass), col = rgb(0, 0, 1, 0.5), add = TRUE) # Plot prey
> legend('topleft',c('Predators','Prey'),   # Add legend
    fill=c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5))) # Define legend colors

Plot annotation with text can be done with either single or double
quotes, i.e., ‘Plot Title’ or “Plot Title”, respectively. But it is
generally a good idea to use double quotes because sometimes you would
like to use an apostrophe in your title or axis label strings.

![image](PPOverlay.pdf){width="50.00000%"}

<span>*It would be nicer to have both the plots with the same bin sizes
– try to do it*</span>

### Boxplots

Now, let’s try plotting boxplots instead of histograms. These are useful
for getting a visual summary of the distribution of your data.

In [None]:
> boxplot(log(MyDF$Predator.mass),
    xlab = "Location", ylab = "Predator Mass",
    main = "Predator mass")

![image](PredBoxP1.pdf){width="50.00000%"}

Now let’s see how many locations the data are from:

In [None]:
> boxplot(log(MyDF$Predator.mass) ~ MyDF$Location, # Why the tilde?
    xlab = "Location", ylab = "Predator Mass",
    main = "Predator mass by location")

![image](PredBoxP2.pdf){width="100.00000%"}

Note the tilde (\~). This is to tell R to subdivide or categorize your
analysis and plot by the “Factor” location. More on this later.

That’s a lot of locations! You will need an appropriately wide plot to
see all the boxplots adequately. Now let’s try boxplots by feeding
interaction type:

In [None]:
> boxplot(log(MyDF$Predator.mass) ~ MyDF$Type.of.feeding.interaction,
    xlab = "Feeding interaction type", ylab = "Predator Mass",
    main = "Predator mass by feeding interaction type")
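
The formula notation used in these boxplots can be explored on a small example. This is a minimal sketch with a made-up data frame (<span>toy</span>, <span>mass</span> and <span>group</span> are hypothetical names, not part of the dataset).

```r
# Hypothetical toy data: a numeric response and a grouping factor
toy <- data.frame(
    mass  = c(1.2, 1.9, 2.4, 10.5, 12.1, 9.8),
    group = factor(c("small", "small", "small", "large", "large", "large"))
)

# 'mass ~ group' reads "mass as a function of group": one box per factor level.
# plot = FALSE returns the summary statistics instead of drawing the boxes.
bp <- boxplot(mass ~ group, data = toy, plot = FALSE)

bp$names        # the factor levels, in alphabetical order
bp$stats[3, ]   # third row of the statistics: the median of each group
```

Using the <span>data</span> argument, as here, also avoids repeating the data frame name on both sides of the tilde.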

### Combining plot types

It would be nice to see both the predator and prey marginal
distributions as well as the scatterplot for an exploratory analysis. We
can do this by adding boxplots of the marginal variables to the
scatterplot.

In [None]:
> par(fig=c(0,0.8,0,0.8)) # specify figure size as proportion
> plot(log(MyDF$Predator.mass),log(MyDF$Prey.mass),
    xlab = "Predator Mass (kg)", ylab = "Prey Mass (kg)") # Add labels
> par(fig=c(0,0.8,0.55,1), new=TRUE)
> boxplot(log(MyDF$Predator.mass), horizontal=TRUE, axes=FALSE)
> par(fig=c(0.65,1,0,0.8),new=TRUE)
> boxplot(log(MyDF$Prey.mass), axes=FALSE)
> mtext("Fancy Predator-prey scatterplot", side=3, outer=TRUE, line=-3)

![image](PPScatFancy.pdf){width="60.00000%"}

To understand this plotting method, think of the full graph area as
going from (0,0) in the lower left corner to (1,1) in the upper right
corner. The format of the <span>fig=</span> parameter is a numerical
vector of the form <span>c(x1, x2, y1, y2)</span>. The first <span>fig=
</span> sets up the scatterplot going from 0 to 0.8 on the x axis and 0
to 0.8 on the y axis. The top boxplot goes from 0 to 0.8 on the x axis
and 0.55 to 1 on the y axis. The right hand boxplot goes from 0.65 to 1
on the x axis and 0 to 0.8 on the y axis. You can experiment with these
proportions to change the spacings between plots.

### Lattice plots

You can also make lattice graphs to avoid the somewhat laborious <span>
par()</span> approach above of getting multi-panel plots. For this, you
will need to “load” a “library” that isn’t included by default when you
run R:

In [None]:
> library(lattice)

\[Fig-Lattice-1\]

A lattice plot of the above data for predator mass could look like Fig.
\[Fig-Lattice-1\] (as a density plot). This was generated using (and
printing to a pdf with particular dimensions):

In [None]:
> densityplot(~log(Predator.mass) | Type.of.feeding.interaction, 
    data=MyDF)

Look up <http://www.statmethods.net/advgraphs/trellis.html> and the
<span>lattice</span> package help.

### Saving your graphics

You can also save a figure in a vector graphics format such as pdf. It
is important to learn to do this, because you want to be able to save
your plots in good resolution, and want to avoid the manual steps of
clicking on the figure, doing “save as”, etc. So let’s save the overlaid
histograms as a PDF:

In [None]:
> pdf("../Results/Pred_Prey_Overlay.pdf", # Open blank pdf page 
    11.7, 8.3) # These numbers are page dimensions in inches
> hist(log(MyDF$Predator.mass), # Plot predator histogram (note 'rgb')
    xlab="Body Mass (kg)", ylab="Count", 
    col = rgb(1, 0, 0, 0.5), 
    main = "Predator-Prey Size Overlap") 
> hist(log(MyDF$Prey.mass), # Plot prey weights
    col = rgb(0, 0, 1, 0.5), 
    add = TRUE)  # Add to the same plot
> legend('topleft',c('Predators','Prey'), # Add legend
    fill=c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5))) 
> dev.off() 

Always try to save results in a vector format, which can be scaled up to
any size. For more on vector vs raster images/graphics, see:
<https://en.wikipedia.org/wiki/Vector_graphics>.

Note that you are saving to the <span>Results</span> directory now. This
should always be your workflow: store and retrieve data from a <span>
Data</span> directory, keep your code and work from a <span>Code</span>
directory, and save outputs to a <span>Results</span> directory.

You can also try other graphic output formats. For example,
<span>png()</span> (a raster format) instead of <span>pdf()</span>. As
always, look at the help documentation of each of these commands!
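
The save-to-<span>Results</span> workflow can be sketched in a self-contained way as follows. A temporary folder stands in for the <span>Results</span> directory, and the file name <span>Toy\_Histogram.pdf</span> is hypothetical; in your own scripts you would use a relative path such as <span>../Results/</span> instead.

```r
# A temporary folder stands in for your project's Results directory here
results_dir <- file.path(tempdir(), "Results")
dir.create(results_dir, showWarnings = FALSE)

out_file <- file.path(results_dir, "Toy_Histogram.pdf")  # hypothetical file name

# Naming width and height (in inches) makes the call easier to read
pdf(out_file, width = 11.7, height = 8.3)
hist(rnorm(1000), xlab = "Value", ylab = "Count", main = "Toy histogram")
dev.off()   # the file is only complete once the device has been closed

file.exists(out_file)
```

Forgetting <span>dev.off()</span> is a common reason for ending up with an empty or unreadable pdf.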

Practicals
----------

In this practical, you will write a script that draws and saves three
lattice graphs by feeding interaction type: one of predator mass, one of
prey mass and one of the size ratio of prey mass over predator mass.
Note that you would want to use logarithms of masses (or mass-ratios)
for all three plots. In addition, the script will calculate the mean and
median predator mass, prey mass and predator-prey size-ratios, and save
them to a csv file. So:

\[$\quad\star$\]

Write a script file called <span>PP\_Lattice.R</span> and save it in the
<span>Code</span> directory — sourcing or running this script should
result in three files called <span>Pred\_Lattice.pdf</span>, <span>
Prey\_Lattice.pdf</span>, and <span>SizeRatio\_Lattice.pdf</span> being
saved in the <span>Results</span> directory (the names are
self-explanatory, I hope).

In addition, the script should calculate the mean and median log
predator mass, prey mass, and predator-prey size ratio, <span>*by
feeding type*</span>, and save it as a single csv output table called
<span> PP\_Results.csv</span> to the <span>Results</span> directory. The
table should have appropriate headers (e.g., Feeding type, mean,
median). (Hint: you will have to initialize a new dataframe in the
script to first store the calculations)

The script should be self-sufficient and not need any external inputs —
it should import the above predator-prey dataset from the appropriate
directory, and save the graphic plots to the appropriate directory
(Hint: use relative paths!).

There are multiple ways to do this practical. The plotting and saving
component is simple enough. For calculating the statistics by feeding
type, you can either use the “loopy” way (first obtain a list of
feeding types using the <span>unique</span> or <span>levels</span>
functions, then loop over them, using <span>subset</span> to extract
the data for each feeding type at each iteration), or the R-savvy way,
using <span>tapply</span> or <span>ddply</span> (the latter from the
<span>plyr</span> package) and avoiding looping altogether (Chapter
\[chap:R\_II\]).
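
To make the difference between the two approaches concrete, here is a minimal sketch on a made-up data frame. The names <span>toy</span>, <span>feeding</span> and <span>mass</span> are hypothetical, and the sketch only computes a mean; it is not a solution to the practical.

```r
# Hypothetical toy data, not the predator-prey dataset
toy <- data.frame(
    feeding = c("piscivorous", "planktivorous", "piscivorous",
                "planktivorous", "piscivorous"),
    mass    = c(100, 1, 1000, 10, 10)
)

# The "loopy" way: subset the data for each category in turn
types <- unique(toy$feeding)
loop_means <- numeric(length(types))
names(loop_means) <- types
for (type in types) {
    sub <- subset(toy, feeding == type)
    loop_means[type] <- mean(log(sub$mass))
}

# The vectorized way: tapply(values, grouping, function)
tapply_means <- tapply(log(toy$mass), toy$feeding, mean)

loop_means
tapply_means
```

Both give the same numbers; <span>tapply</span> simply does the splitting and looping for you.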

Publication-quality figures in R
--------------------------------

<span>R</span> can produce beautiful graphics, but it takes a lot of
work to obtain the desired result. This is because the starting point is
pretty much a “bare” plot, and adding features commonly required for
publication-grade figures (legends, statistics, regressions, etc.) can
be quite involved.

Moreover, it is very difficult to switch from one representation of the
data to another (i.e., from boxplots to scatterplots), or to plot
several datasets together. The <span>R</span> package
<span>ggplot2</span> overcomes these issues, and produces truly
high-quality, publication-ready graphics suitable for papers, theses and
reports.

Currently, <span>ggplot2</span> cannot be used to create 3D graphs or
mosaic plots. In any case, most of you won’t be needing 3D plots! If you
do, there are many ways to do 3D plots using other plotting packages in
R. In particular, look up the <span>scatterplot3d</span> and
<span>plot3D</span> packages.

<span>ggplot2</span> differs from other approaches as it attempts to
provide a “grammar” for graphics in which each layer is the equivalent
of a verb, subject etc. and a plot is the equivalent of a sentence. All
graphs start with a layer showing the data; other layers and commands
are then added to modify the plot. Specifically, according to this grammar, a
statistical graphic is a “mapping” from data to aesthetic attributes
(colour, shape, size; set using <span>aes</span>) of geometric objects
(points, lines, bars; set using <span>geom</span>).

For more on the ideas underlying ggplot, see the book “ggplot2: Elegant
Graphics for Data Analysis”, by H. Wickham (in your Reading directory).
Also, the website [ggplot2.org](ggplot2.org) is a great resource.

<span>ggplot2</span> should already be available on the college
computers. If you are using your own computer, look up the section on
installing packages in Chapter \[chap:RI\].

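ggplot can be used in two ways: with <span>qplot</span> (for
<span>q</span>uick <span> plot</span>ting) and <span>ggplot</span> for
full-blown, customizable plotting. Note that <span>qplot</span> has been
deprecated since <span>ggplot2</span> version 3.4.0, so recent versions
may show a deprecation warning when you use it.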

Note that <span>ggplot2</span> only accepts data in data frames.

### Basic plotting with <span>qplot</span>

<span>qplot</span> can be used to quickly produce graphics for
exploratory data analysis, and as a base for more complex graphics. It
uses syntax that is closer to the standard R plotting commands.

We will use the same predator-prey body size dataset again. You will
soon see how much nicer the same types of plots you made above look when
done with ggplot!

#### Scatterplots

Let’s start by plotting <span>Predator.mass</span> vs
<span>Prey.mass</span>:

In [None]:
> require(ggplot2)  ## Load the package
Loading required package: ggplot2
> qplot(Prey.mass, Predator.mass, data = MyDF)

As before, let’s take logarithms and plot:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), data = MyDF)

Now, color the points according to the type of feeding interaction:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, 
    colour = Type.of.feeding.interaction)

The same as above, but changing the shape:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, 
    shape = Type.of.feeding.interaction)

#### Aesthetic mappings

These examples demonstrate a key difference between <span>qplot</span>
and the standard <span>plot</span> command: When you want to assign
colours, sizes or shapes to the points on your plot, using the
<span>plot</span> command, it’s your responsibility to convert (i.e.,
“map”) a categorical variable in your data (e.g., type of feeding
interaction in the above case) onto colors (or shapes) that
<span>plot</span> knows how to use (e.g., by specifying “red”, “blue”,
“green”, etc).

ggplot does this mapping for you automatically, and also provides a
legend! This makes it really easy to quickly include additional data
(e.g., if a new feeding interaction type was added to the data) on the
plot.
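
For comparison, this is roughly what the manual mapping looks like with the standard <span>plot</span> command. It is a minimal sketch on a made-up data frame; the colour choices and the names <span>toy</span> and <span>palette\_by\_type</span> are assumptions for illustration.

```r
# Hypothetical toy data with a categorical variable
toy <- data.frame(
    x    = c(1, 2, 3, 4, 5, 6),
    y    = c(2, 4, 3, 8, 7, 9),
    type = factor(c("insectivorous", "piscivorous", "predacious",
                    "piscivorous", "insectivorous", "predacious"))
)

# Step 1: decide yourself which colour each category gets
palette_by_type <- c(insectivorous = "red",
                     piscivorous   = "blue",
                     predacious    = "green")

# Step 2: look up one colour per data point
point_cols <- palette_by_type[as.character(toy$type)]

# Step 3: plot, and build the legend by hand as well
pdf(tempfile(fileext = ".pdf"))   # off-screen device, for illustration
plot(toy$x, toy$y, col = point_cols, pch = 20)
legend("topleft", legend = names(palette_by_type),
       col = palette_by_type, pch = 20)
dev.off()
```

If a new category appears in the data, both the palette and the legend have to be updated by hand, which is exactly the bookkeeping that the automatic mapping removes.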

Instead of using ggplot’s automatic mapping, if you want to manually set
a color or a shape, you have to use <span>I()</span> (meaning
“Identity”). To see this in practice, try the following:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), 
    data = MyDF, colour = "red")

You chose red, but ggplot treated “red” as a data value and mapped it to
its own default colour (and added a legend for it). To set the colour
manually to the real red, do this:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), 
    data = MyDF, colour = I("red"))

Similarly, for point size, compare these two:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), 
    data = MyDF, size = 3) #with ggplot size mapping
> qplot(log(Prey.mass), log(Predator.mass), 
    data = MyDF, size = I(3)) #no mapping

But for shape, ggplot doesn’t have a continuous mapping because shapes
are a discrete variable. To see this, compare these two:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), 
    data = MyDF, shape = 3) #will give error
> qplot(log(Prey.mass), log(Predator.mass), 
    data = MyDF, shape= I(3))

#### Setting transparency

Because there are so many points, we can make them semi-transparent
using <span>alpha</span> so that the overlaps can be seen:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, 
    colour = Type.of.feeding.interaction, alpha = I(.5))

Here try using <span>alpha = .5</span> instead of <span>alpha =
I(.5)</span> and see what happens.

#### Adding smoothers and regression lines

Now add a smoother to the points:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, 
    geom = c("point", "smooth"))

If we want to have a linear regression, we need to specify the method as
being <span>lm</span>:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, 
    geom = c("point", "smooth")) + geom_smooth(method = "lm")

<span>lm</span> stands for <span>l</span>inear <span>m</span>odels
(linear regression is a type of linear model). You will learn about
linear models and fitting them to data (as you have done here) in the
Stats in R week.

We can also add a “smoother” for each type of interaction:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, 
    geom = c("point", "smooth"), colour = Type.of.feeding.interaction) +
    geom_smooth(method = "lm")

To extend the lines to the full range, use <span>fullrange =
TRUE</span>:

In [None]:
> qplot(log(Prey.mass), log(Predator.mass), data = MyDF, 
    geom = c("point", "smooth"),
    colour = Type.of.feeding.interaction) + 
    geom_smooth(method = "lm",fullrange = TRUE)

Now we want to see how the ratio between prey and predator mass changes
according to the type of interaction:

In [None]:
> qplot(Type.of.feeding.interaction, 
        log(Prey.mass/Predator.mass), data = MyDF)

Because there are so many points, we can “jitter” them to get a better
idea of the spread:

In [None]:
> qplot(Type.of.feeding.interaction, 
        log(Prey.mass/Predator.mass), data = MyDF,
        geom = "jitter")

#### Boxplots

Or we can draw a boxplot of the data (note the <span>geom</span>
argument, which stands for <span>geom</span>etry):

In [None]:
> qplot(Type.of.feeding.interaction, 
        log(Prey.mass/Predator.mass), data = MyDF,
        geom = "boxplot")

#### Histograms and density plots

Now let’s draw a histogram of predator-prey mass ratios:

In [None]:
> qplot(log(Prey.mass/Predator.mass), data = MyDF, 
    geom =  "histogram")

Color the histogram according to the interaction type:

In [None]:
> qplot(log(Prey.mass/Predator.mass), data = MyDF, 
    geom =  "histogram", 
    fill = Type.of.feeding.interaction)

You may want to define the bin width (in units of the x axis):

In [None]:
> qplot(log(Prey.mass/Predator.mass), data = MyDF, 
    geom =  "histogram", 
    fill = Type.of.feeding.interaction,
    binwidth = 1)

To make it easier to read, we can plot the smoothed density of the data:

In [None]:
> qplot(log(Prey.mass/Predator.mass), data = MyDF, 
    geom =  "density", fill = Type.of.feeding.interaction)

And you can make the densities transparent so that the overlaps are
visible:

In [None]:
> qplot(log(Prey.mass/Predator.mass), data = MyDF, 
    geom =  "density", fill = Type.of.feeding.interaction, 
    alpha = I(0.5))

Using <span>colour</span> instead of <span>fill</span> draws only the
edge of the curve:

In [None]:
> qplot(log(Prey.mass/Predator.mass), data = MyDF, 
    geom =  "density", colour = Type.of.feeding.interaction)

Similarly, <span>geom = "bar"</span> produces a barplot, <span>geom =
"line"</span> a series of points joined by a line, etc.

#### Multi-faceted plots

An alternative way of displaying data belonging to different classes is
using “faceting”. We did this using the <span>lattice</span> package
previously, but ggplot does a much nicer job:

In [None]:
> qplot(log(Prey.mass/Predator.mass), 
    facets = Type.of.feeding.interaction ~., 
    data = MyDF, geom =  "density")

The position of the variable relative to the <span>\~</span> tells
ggplot whether to facet by row or by column (the spacing around the
<span>\~</span> is not important). <span>variable \~ .</span> gives one
row per category; for a by-column configuration, put the
<span>.</span> on the left and the variable on the right:

In [None]:
> qplot(log(Prey.mass/Predator.mass), 
    facets =  .~ Type.of.feeding.interaction, 
    data = MyDF, geom =  "density")

You can also facet by a combination of categories (this is going to be a
big plot!):

In [None]:
> qplot(log(Prey.mass/Predator.mass), 
    facets = .~ Type.of.feeding.interaction + Location, 
    data = MyDF, geom =  "density")

And you can also change the order of the combination:

In [None]:
> qplot(log(Prey.mass/Predator.mass), 
    facets = .~ Location + Type.of.feeding.interaction, 
    data = MyDF, geom =  "density")

For more fine-tuned faceting, look up the <span>facet\_grid()</span> and
<span>facet\_wrap()</span> functions within <span>ggplot2</span>. Also
see <http://www.cookbook-r.com/Graphs/Facets_(ggplot2)>.

See Fig. \[PPRegress\] for an example result.

#### Logarithmic axes

A more elegant way of drawing logarithmic quantities is to set the axes
to be logarithmic:

In [None]:
> qplot(Prey.mass, Predator.mass, data = MyDF, log="xy")

#### Plot annotations

Let’s add a title and labels:

In [None]:
> qplot(Prey.mass, Predator.mass, data = MyDF, log="xy",
    main = "Relation between predator and prey mass", 
    xlab = "Prey mass (g)", 
    ylab = "Predator mass (g)")

Adding <span>+ theme\_bw()</span> makes it suitable for black and white
printing.

In [None]:
> qplot(Prey.mass, Predator.mass, data = MyDF, log="xy",
    main = "Relation between predator and prey mass", 
    xlab = "Prey mass (g)", 
    ylab = "Predator mass (g)") + theme_bw()

#### Saving your plots

Finally, let’s save a pdf file of the figure (same approach as we used
before):

In [None]:
> pdf("../Results/MyFirst-ggplot2-Figure.pdf")
> print(qplot(Prey.mass, Predator.mass, data = MyDF,log="xy",
    main = "Relation between predator and prey mass", 
    xlab = "Prey mass (g)", 
    ylab = "Predator mass (g)") + theme_bw())
> dev.off()

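Wrapping the plot in <span>print</span> is needed when the command runs
inside a script: ggplot objects are only drawn when they are printed,
which happens automatically at the console but, by default, not when a
script is sourced.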

### Some more important ggplot options

Other important options to keep in mind:

  ---------------------- ------------------------------------------------------------------------------------------------
  <span>xlim</span>      limits for x axis: <span>xlim = c(0,12)</span>
  <span>ylim</span>      limits for y axis
  <span>log</span>       log transform variable <span>log = "x"</span>, <span>log = "y"</span>, <span>log = "xy"</span>
  <span>main</span>      title of the plot <span>main = "My Graph"</span>
  <span>xlab</span>      x-axis label
  <span>ylab</span>      y-axis label
  <span>asp</span>       aspect ratio <span>asp = 2</span>, <span>asp = 0.5</span>
  <span>margins</span>   whether or not margins will be displayed
  ---------------------- ------------------------------------------------------------------------------------------------


### Various <span>geom</span>

The <span>geom</span> option specifies the geometric objects that define
the graph type. It is expressed as a character vector with one or more
entries. geom values include “point”, “smooth”, “boxplot”, “line”,
“histogram”, “density”, “bar”, and “jitter”.

### Advanced plotting: <span>ggplot</span>

The command <span>qplot</span> allows you to use only a single dataset
and a single set of “aesthetics” (x, y, etc.). To make full use of
<span>ggplot2</span>, we need to use the command <span>ggplot</span>,
which allows you to use “layering”. Layering is the mechanism by which
additional data elements are added to a plot. Each layer can come from a
different dataset and have a different aesthetic mapping, allowing us to
create plots that could not be generated using <span>qplot()</span>.

For a <span>ggplot</span> plotting command, we need at least:

-   The data to be plotted, in a data frame;

-   Aesthetic mappings, specifying which variables we want to plot, and
    how;

-   The <span>geom</span>, defining the geometry for representing the
    data;

-   (Optionally) some <span>stat</span> that transforms the data or
    performs statistics using the data.

To start a graph, we must specify the data and the aesthetics:

In [None]:
> p <- ggplot(MyDF, aes(x = log(Predator.mass),
                y = log(Prey.mass),
                colour = Type.of.feeding.interaction ))

Here we have created a graphics object <span>p</span> to which we can
add layers and other plot elements.

Now try to plot the graph:

In [None]:
> p
Error: No layers in plot

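That is because we are yet to specify a geometry. (Depending on your
<span>ggplot2</span> version, you may see an empty plot panel here
instead of an error.) Only once a geometry is added can we see the
graph: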

In [None]:
> p + geom_point()

We can use the “plus” sign to concatenate different commands:

In [None]:
> p <- ggplot(MyDF, aes(x = log(Predator.mass),
                y = log(Prey.mass),
                colour = Type.of.feeding.interaction ))
> q <- p + geom_point(size=I(2), shape=I(10)) + theme_bw()
> q

Let’s remove the legend:

In [None]:
> q + theme(legend.position = "none")

We will now look at some case studies to see some useful ways in which
you can use ggplot.

### Case study 1: plotting a matrix

Here we will plot a matrix of random values taken from a uniform
distribution $\mathcal U [0,1]$. Our goal is to produce the plot in
Figure \[MatPlot\]. Because we want to plot a matrix, and <span>
ggplot2</span> accepts only dataframes, we use the package
<span>reshape2</span> that can “melt” a matrix into a dataframe:

In [None]:
require(ggplot2)
require(reshape2)

GenerateMatrix <- function(N){
    M <- matrix(runif(N * N), N, N)
    return(M)
}

> M <- GenerateMatrix(10)

> M[1:3, 1:3]
            [,1]      [,2]      [,3]
[1,] 0.2700254 0.8686728 0.7365857
[2,] 0.1744879 0.8488169 0.4165879
[3,] 0.3980783 0.7727821 0.4271121

> Melt <- melt(M)

> Melt[1:3,]
  Var1 Var2     value
1    1    1 0.2700254
2    2    1 0.1744879
3    3    1 0.3980783

> ggplot(Melt, aes(Var1, Var2, fill = value)) + geom_tile()

# adding a black line dividing cells
> p <- ggplot(Melt, aes(Var1, Var2, fill = value))
> p <- p + geom_tile(colour = "black")

# removing the legend
> q <- p + theme(legend.position = "none")

# removing all the rest
> q <- p + theme(legend.position = "none", 
     panel.background = element_blank(),
     axis.ticks = element_blank(), 
     panel.grid.major = element_blank(),
     panel.grid.minor = element_blank(),
     axis.text.x = element_blank(),
     axis.title.x = element_blank(),
     axis.text.y = element_blank(),
     axis.title.y = element_blank())

# exploring the colors
> q + scale_fill_continuous(low = "yellow",
                        high = "darkgreen")
> q + scale_fill_gradient2()
> q + scale_fill_gradientn(colours = grey.colors(10))
> q + scale_fill_gradientn(colours = rainbow(10))
> q + scale_fill_gradientn(colours =
                c("red", "white", "blue"))
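
To get a feel for what “melting” a matrix produces, the long format can be rebuilt by hand in base R. This is a minimal sketch on a tiny made-up matrix; it is a stand-in for illustration and not a description of how <span>melt</span> works internally.

```r
# A tiny matrix; R fills it column by column, so M is
#      [,1] [,2]
# [1,]  0.1  0.3
# [2,]  0.2  0.4
M <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2, ncol = 2)

# One row of the data frame per cell of the matrix
long <- data.frame(
    Var1  = as.vector(row(M)),   # row index of each cell
    Var2  = as.vector(col(M)),   # column index of each cell
    value = as.vector(M)         # cell values, read column by column
)

long
```

Each row of <span>long</span> identifies a cell by its row and column index, which is the form that <span>geom\_tile</span> needs for the x, y and fill aesthetics.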

### Case study 2: plotting two dataframes

According to Girko’s circular law, the eigenvalues of a matrix $M$ of
size $N \times N$ whose entries are drawn independently from a
distribution with mean 0 and variance 1 (e.g., a standard normal) are
approximately contained in a circle in the complex plane with radius
$\sqrt{N}$. We are going to draw a simulation displaying this result
(Figure \[Girko\]).
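
The data behind such a figure come from two separate data frames: one holding the eigenvalues and one holding the points of the circle. The sketch below builds both and, to stay self-contained, draws them with base graphics rather than <span>ggplot2</span>. The matrix size, the seed and all object names are assumptions chosen for illustration.

```r
set.seed(42)
N <- 250
M <- matrix(rnorm(N * N), N, N)   # entries with mean 0 and variance 1

# Data frame 1: the (complex) eigenvalues, split into real and imaginary parts
ev <- eigen(M, only.values = TRUE)$values
eigvals <- data.frame(Real = Re(ev), Imaginary = Im(ev))

# Data frame 2: points on a circle of radius sqrt(N)
theta <- seq(0, 2 * pi, length.out = 200)
circle <- data.frame(Real = sqrt(N) * cos(theta),
                     Imaginary = sqrt(N) * sin(theta))

# Fraction of eigenvalues within 10% of the predicted radius
frac_inside <- mean(Mod(ev) <= 1.1 * sqrt(N))

pdf(tempfile(fileext = ".pdf"))   # off-screen device, for illustration
plot(eigvals$Real, eigvals$Imaginary, pch = 3, asp = 1,
     xlab = "Real", ylab = "Imaginary")
lines(circle$Real, circle$Imaginary, col = "red")
dev.off()
```

In <span>ggplot2</span>, the same two data frames would each feed their own layer, which is what makes this a natural case for layering.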

### Case study 3: annotating plots

In the plot in Figure \[Linebar\], we use the geometry “text” to
annotate the plot.
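
As a rough illustration of the “text” geometry, here is a sketch on made-up data (not the data behind Figure \[Linebar\]). The dataframe `toy` and its column names are hypothetical.

```r
library(ggplot2)

# Hypothetical toy data: positions, heights and a label for each bar
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(2.5, 4.1, 3.2, 5.0),
                  Label = c("a", "b", "c", "d"))

p <- ggplot(toy, aes(x = x, y = y, label = Label)) +
    geom_linerange(aes(ymin = 0, ymax = y), colour = "steelblue") +
    geom_text(vjust = -0.5) +    # draw each label just above its line
    theme_bw()
print(p)
```

`geom_text()` takes its text from the `label` aesthetic, so the annotation is driven by a column of the dataframe rather than typed in by hand.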

### Case study 4: mathematical display

In Figure \[LinReg\], you can see the mathematical annotation on the
axis and in the plot area itself.
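
One way to get this kind of mathematical display is through R's plotmath expressions. The sketch below uses simulated data and arbitrary label positions; it is not the code behind Figure \[LinReg\].

```r
library(ggplot2)

set.seed(1)  # arbitrary seed for reproducibility

# Simulated data from a known linear relationship plus noise
x <- seq(0, 100, by = 0.1)
y <- -4 + 0.25 * x + rnorm(length(x), mean = 0, sd = 2.5)
my_data <- data.frame(x = x, y = y)

my_lm <- lm(y ~ x, data = my_data)

p <- ggplot(my_data, aes(x = x, y = y)) +
    geom_point(alpha = 0.3) +
    geom_abline(intercept = unname(coef(my_lm)[1]),
                slope = unname(coef(my_lm)[2]), colour = "red") +
    # parse = TRUE makes the label be read as a plotmath expression
    annotate("text", x = 60, y = 0, parse = TRUE,
             label = "sqrt(alpha) * 2 * pi") +
    xlab(expression(alpha)) +    # Greek letters on the axes
    ylab(expression(beta))
print(p)
```

See `?plotmath` for the expression syntax that `expression()` and `parse = TRUE` understand.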

### ggthemes

The package <span>ggthemes</span> provides some additional
<span>geom</span>s, <span>scale</span>s, and <span>theme</span>s for
<span>ggplot</span>. These include a theme based on Tufte’s <span>*The
Visual Display of Quantitative Information*</span> (see the readings
section at the end of this chapter). First install the package:

In [None]:
> install.packages("ggthemes")

Then try:

In [None]:
> library(ggthemes)

p <- ggplot(MyDF, aes(x = log(Predator.mass), y = log(Prey.mass),
                colour = Type.of.feeding.interaction )) +
                geom_point(size=I(2), shape=I(10)) + theme_bw()

> p + geom_rangeframe() + # now fine tune the geom to Tufte's range frame
        theme_tufte() # and theme to Tufte's minimal ink theme    

Go to <https://github.com/jrnold/ggthemes> for more information and a
list of <span>geom</span>s, <span>theme</span>s, and
<span>scale</span>s.

Both <span>library()</span> and <span>require()</span> are
commands/functions to load packages. The difference is that
<span>require()</span> is designed for use inside other functions, so it
returns <span>FALSE</span> and gives a warning if the package cannot be
loaded, whereas <span>library()</span> throws an error by default if the
package is not installed.
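
The difference is easy to see with a package that is not installed. In this sketch, `notARealPackage` is a made-up name that is assumed not to exist on your system.

```r
# "notARealPackage" is a made-up name, assumed not to be installed
pkg <- "notARealPackage"

# require() returns FALSE (with a warning), so a script can carry on
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
print(loaded)

# library() throws an error instead; we catch it here to inspect the outcome
result <- tryCatch({
    library(pkg, character.only = TRUE)
    "loaded"
}, error = function(e) "error")
print(result)
```

In a script that cannot work without the package, failing early with `library()` is usually the safer choice.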

Practicals {#sec:PPPrac2}
----------

In this practical, you will write a script that draws and saves a pdf file
of Fig. \[PPRegress\], and writes the accompanying regression results to
a formatted table in csv. Note that the plots show that the analysis
must be subsetted by the <span>Predator.lifestage</span> field of the
dataset. The guidelines are:

Write an <span>R</span> script file called <span>PP\_Regress.R</span> and
save it in the <span>Code</span> directory. Sourcing or running this
script should result in one pdf file containing the following figure
being saved in the <span>Results</span> directory. (Hint: Use the
<span>print()</span> command to write to the pdf.)

![Write a script that generates this figure<span
data-label="PPRegress"></span>](Figure1.pdf)

In addition, the script should calculate the regression results
corresponding to the lines fitted in the figure and save them to a
csv-delimited table called <span>PP\_Regress\_Results.csv</span>, in the
<span>Results</span> directory. (Hint: you will have to initialize a
new dataframe in the script to first store the calculations and then
<span>write.csv()</span> or <span>write.table()</span> it.)\
All that you are being asked for here is the results of a linear
regression on each subset of the data corresponding to an available
Feeding Type $\times$ Predator life Stage combination, not a
multivariate linear model with these two as separate covariates!

The regression results should include the following with appropriate
headers (e.g., slope, intercept, etc, in each Feeding type $\times$ life
stage category): regression slope, regression intercept, R$^2$,
F-statistic value, and p-value of the overall regression (Hint: Review
the Stats week!).
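
To show where these quantities live in a fitted model object, here is a sketch on R's built-in `iris` data, which stands in for the predator-prey dataset. The grouping variable, column names and output file name are placeholders, and this is only one of several ways to structure the calculation.

```r
# Fit one regression per group and collect the summary statistics.
# iris and its columns stand in for the predator-prey data here.
groups <- split(iris, iris$Species)

results <- do.call(rbind, lapply(names(groups), function(g) {
    fit <- lm(Sepal.Length ~ Sepal.Width, data = groups[[g]])
    s <- summary(fit)
    f <- s$fstatistic            # named vector: value, numdf, dendf
    data.frame(Group     = g,
               Intercept = unname(coef(fit)[1]),
               Slope     = unname(coef(fit)[2]),
               R2        = s$r.squared,
               Fstat     = unname(f["value"]),
               # p-value of the overall regression, from the F distribution
               Pvalue    = unname(pf(f["value"], f["numdf"], f["dendf"],
                                     lower.tail = FALSE)))
}))

# Hypothetical output location; the practical asks for ../Results/ instead
out_file <- file.path(tempdir(), "Regress_Results_demo.csv")
write.csv(results, out_file, row.names = FALSE)
print(results)
```

Note that `summary()` does not store the overall p-value directly, which is why it is recomputed from the F-statistic and its degrees of freedom.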

The script should be self-sufficient and not need any external inputs —
it should import the above predator-prey dataset from the appropriate
directory, and save the graphic plots to the appropriate directory
(Hint: use relative paths). I should be able to <span>source</span> it
without errors.

You can also use the <span>dplyr</span> package instead of looping (see
Chapter \[chap:R\_II\]), and the <span>ggplot</span> command instead of
<span>qplot</span>.

<span>**Extra Credit**</span>: Do the same as above, but the analysis
this time should be separated by the dataset’s <span>Location</span>
field. Call it <span>PP\_Regress\_loc.R</span>. No need to generate
plots for this (just the analysis results to a <span>.csv</span> file),
as a combination of <span>Type.of.feeding.interaction</span>,
<span>Predator.lifestage</span>, and <span>Location</span> will be far
too busy (faceting by three variables is one step too far)!

Data wrangling and exploration
------------------------------

You are likely to spend far more time than you think dredging through
your data manually, checking it, editing it, and reformatting it to make
it useful for the actual data exploration and statistical analysis. For
example, you may need to:

-   Identify the variables vs observations within the data (somebody else
    might have recorded the data, or you might have collected the data
    some time back!)

-   Fill in zeros (true measured/observed absences)

-   Identify and add a value (e.g., <span>-999999</span>) to denote
    missing observations

-   Derive or calculate new variables from the raw observations (e.g.,
    convert measurements to SI units; kilograms, meters, seconds, etc.)

-   Reshape/reformat your data into a layout that works best for analysis
    (e.g., for <span>R</span> itself), such as from wide to long data
    format for repeated (across sites, plots, plates, chambers, etc.)
    measures data

-   Merge multiple datasets together into a single data sheet

And this is far from an exhaustive list. Doing so many different things
to your raw data is both time-consuming and risky. Why risky? Because to
err is very human, and every new, tired mouse-click and/or keyboard-stab
has a high probability of being incorrect!

![An illustration of a (metaphorical) datum being wrangled into
submission.](Wrangling2.jpg){width=".6\textwidth"}

### Some data wrangling principles

So you would like to keep a record of the data wrangling process (so that it
is repeatable and even reversible), and automate it to the extent
possible. To this end, here are some guidelines:

-   Store data in universally-readable, nonproprietary software formats
    (e.g., <span>.csv</span>)

-   Use plain ASCII text for your file names, variable/field/column
    names, and data values — make sure the “text encoding” is correct
    and standard (e.g., <span>UTF-8</span>)

-   Keep a metadata file for each unique dataset (again, in
    non-proprietary format)

-   Don’t (over-)modify your raw data by hand — use scripts instead —
    keep a copy of the data as they were recorded.

-   Use meaningful names for your data files and field (column)
    names

-   When you add data, try not to add columns (widening the format);
    rather, design your tables/datasheets so that you add only rows
    (lengthening the format) — and convert “wide format data” to “long
    format data” using scripts, not by hand!

-   All cells within a data column should contain only one type of
    information (i.e., either text (character), numeric, etc.).

-   Ultimately, consider creating a relational database for your data
    (see the last section of this Chapter).

This is not an exhaustive list either — see the Borer et al (2009) paper
in your readings list.

We will use the Pound Hill dataset collected by students in a past
Silwood Field Course week for understanding/illustrating some of these
principles.

To start with, we need to import the <span>raw</span> data file.

\[$\quad\star$\]

Go to the bitbucket repository and navigate to the <span>Data</span>
directory.

Copy the files <span>PoundHillData.csv</span> and
<span>PoundHillMetaData.csv</span> into your own R module’s
<span>Data</span> directory.

Now load the data in R:

In [None]:
> MyData <- as.matrix(read.csv("../Data/PoundHillData.csv",header = F, 
stringsAsFactors = F))
> MyMetaData <- read.csv("../Data/PoundHillMetaData.csv",header = T, 
sep=";", stringsAsFactors = F)

Note that:

-   Loading the data with <span>as.matrix()</span>, and setting the
    <span>header</span> and <span>stringsAsFactors</span> arguments to
    false, guarantees that the data are imported “as is” so that you can
    wrangle them. Otherwise <span>read.csv</span> will convert the first
    row to column headers and, in versions of R before 4.0.0, convert
    character columns to factors. Note that all the data will be
    converted to character class in the resulting matrix because at
    least some of the entries are already character class.

-   For the metadata loading, the <span>header</span> is set to true
    because we do have metadata headers (<span>FieldName</span> and
    <span>Description</span>), and I have used the semicolon
    (<span>;</span>) as delimiter because there are commas in one of the
    field descriptions.

-   I have avoided spaces in the column headers (so “FieldName” instead
    of “Field Name”). Please avoid spaces in field or column names as
    much as possible, as R will replace each space in a column header
    with a dot, which may be confusing.
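
You can check these behaviours without touching any file by reading from a small string. The contents of `csv_text` below are made up for illustration.

```r
# A small in-memory "file" (hypothetical contents) with a space in one header
csv_text <- "Field Name,Count\nQuadrat,3\nPlot,12"

# header = TRUE: the first row becomes column names, and the space
# in "Field Name" is replaced by a dot
with_header <- read.csv(text = csv_text, header = TRUE,
                        stringsAsFactors = FALSE)
print(names(with_header))

# header = FALSE plus as.matrix(): everything, headers included,
# is kept as text in a character matrix
as_is <- as.matrix(read.csv(text = csv_text, header = FALSE,
                            stringsAsFactors = FALSE))
print(as_is)
print(class(as_is[2, 2]))
```

The second import keeps the number 3 as the string `"3"`, so numeric columns need to be converted back explicitly once the wrangling is done.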

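In <span>R</span>, you can use <span>F</span> and <span>T</span> for
boolean <span>FALSE</span> and <span>TRUE</span> respectively. (Unlike
<span>TRUE</span> and <span>FALSE</span>, which are reserved words,
<span>T</span> and <span>F</span> are ordinary variables that can be
overwritten, so spelling the values out in full is safer in scripts.) Try: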

In [None]:
> a <- T
> a
[1] TRUE

We won’t do anything with the metadata file in this session except
inspect the information it contains.

#### Keep a metadata file for each unique dataset

Data wrangling really begins immediately after data collection. You may
collect data of different kinds (e.g., diversity, biomass, tree girth,
etc.). Keep the original spreadsheet well documented using a “metadata”
file that describes the data (you would hopefully have written the first
version of this even before you started collecting the data!). The
minimum information needed to make a metadata file useful is a
description of each of the <span>*fields*</span> — the column or row
headers under which the information is stored in your data/spreadsheet.
Here is the metadata file for the Pound Hill dataset:

  <span>**Field/Column Name**</span>   <span>**Description**</span>
  ------------------------------------ ---------------------------------------------------------------------
  Cultivation                          Cultivation treatments applied in three months: October, May, March
  Block                                Treatment block ids: a–d
  Plot                                 Plot ids under each treatment: 1–12
  Quadrat                              Sampling quadrats (25$\times$50 cm each) per plot: Q1–Q6
  Species data                         Number of individuals (count) per quadrat

Ideally, you would also like to add more information about the data,
such as the measurement units of each type of observation. Here, there
is just one type of observation: number of individuals of each species per
sample (quadrat), which is a count (integer, or <span>int</span> class).

#### Don’t (over-)modify your raw data by hand

When the dataset is large (e.g., thousands of rows), cleaning and exploring
it can get tricky, and you are very likely to make many mistakes. You
should record all the steps you used to process it with an R script
rather than risking a manual and basically irreproducible processing.
Most importantly, <span>*avoid or minimize editing your raw data
file*</span>. Let’s see how we can modify the data using <span>R</span>.
In fact, we should now start to keep a record of what we are doing to
the data. This is illustrated in a script file available on the
bitbucket repository.

Sometimes you may run into (unexpected) bugs when importing and running
scripts in <span>R</span> because your file has a non-standard text
encoding. You may need to specify the encoding in that case, using the
<span>encoding</span> (or <span>fileEncoding</span>) argument of
<span>read.csv()</span> and <span>read.table()</span>. You can check the
encoding of a file by using <span>file</span> in Linux/Mac. Try:

In [None]:
$ file -i ../Data/PoundHillData.csv

or, check encoding of all files in the <span>Data</span> directory:

In [None]:
$ file -i ../Data/*.csv     

(Use <span>file -I</span> instead of <span>file -i</span> in a Mac
terminal.)

\[$\quad\star$\]

Go to the bitbucket repository and navigate to the <span>Code</span>
directory.

Copy the script <span>DataWrang.R</span> into your own R module’s <span>
Code</span> directory and open it.

Go through the script carefully line by line, and make sure you
understand what’s going on. Read the comments and add to them if you want.
One of the examples of data modification that you must avoid doing by
hand is illustrated in the script: filling in zeros.
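
The idea can be sketched on a toy character matrix of the kind you get from an “as is” import. The species names and counts below are made up, and the course script may do this differently.

```r
# Toy stand-in for a raw data matrix read in "as is":
# blank cells mark true absences
raw <- matrix(c("Species",  "Q1", "Q2",
                "Achillea", "3",  "",
                "Bromus",   "",   "7"),
              nrow = 3, byrow = TRUE)

filled <- raw
filled[filled == ""] <- "0"   # replace every blank cell with a zero
print(filled)
```

Logical indexing fills all blanks in one step and leaves `raw` untouched, so the operation is both recorded and reversible.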

#### Convert wide format data to long format using scripts

You typically record data in the field or experiments using a “wide”
format, where a subject’s (e.g., habitat, plot, treatment, species etc)
repeated responses or observations (e.g., species count, biomass, etc)
will be in a single row, and each response in a separate column. The raw
Pound Hill data were recorded in precisely this way. However, the wide
format is not ideal for data analysis — instead you need the data in a
“long” format, where each row is one observation point per subject. So
each subject will have data in multiple rows. Any
measures/variables/observations that don’t change across a subject’s
repeated observations will have the same value in all of that subject’s rows.

For humans, the wide format is generally more intuitive for viewing and
recording (e.g., in field data sheets) since one can actually view more
of the data visually. However, the long format is more machine-readable
and is closer to the formatting of databases.

You can switch between wide and long formats using <span>melt()</span> and
<span>dcast()</span> from the <span>reshape2</span> package, as
illustrated in <span>DataWrang.R</span>.
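
A small round trip on made-up data shows what the two functions do. The dataframe `wide` and its column names are hypothetical, not the Pound Hill data.

```r
library(reshape2)

# Hypothetical wide-format counts: one row per species, one column per quadrat
wide <- data.frame(Species = c("sp_a", "sp_b"),
                   Q1 = c(3, 0),
                   Q2 = c(1, 5),
                   Q3 = c(0, 2))

# Wide to long: one row per species x quadrat observation
long <- melt(wide, id.vars = "Species",
             variable.name = "Quadrat", value.name = "Count")
print(long)

# Long back to wide: quadrats become columns again
wide_again <- dcast(long, Species ~ Quadrat, value.var = "Count")
print(wide_again)
```

In the long table the species name is repeated on every row for that species, which is exactly the redundancy described above.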

### And then came <span>dplyr</span> and <span>tidyr</span>

So if you think this is the end of the options you have for data
wrangling in R, think again. There are new kids on the block:
<span>dplyr</span> — the next iteration of <span>plyr</span> that
addresses the speed issues in the latter, and <span>tidyr</span>,
essentially a nicer wrapper to the <span> reshape2</span> package with
additional functions.

You will have to install these packages using <span>sudo apt-get
install</span> in Linux or <span>install.packages()</span> across all
platforms (see Chapter \[chap:RI\]).

Together, these two packages have many useful functions. Let’s have
a quick look at <span>dplyr</span>:

In [None]:
require(dplyr)
dplyr::tbl_df(iris) # like head(), but nicer! (deprecated in recent dplyr; use tibble::as_tibble())
dplyr::glimpse(iris) # like str(), but nicer!
utils::View(iris) # spreadsheet-style viewer; like fix(), but read-only
dplyr::filter(iris, Sepal.Length > 7) # like subset(), but nicer!
dplyr::slice(iris, 10:15) # select rows by position

Note that the double colon (<span>::</span>) notation used here is like
the dot notation in <span>python</span>: it allows you to access a
particular function from a particular package. So, for instance, if you
want to use <span>tbl\_df()</span> from <span>dplyr</span>, the command
syntax would be <span>dplyr::tbl\_df(MyData)</span>. This syntax is
useful for avoiding conflicts between the names of functions in these new
packages and function names that already exist in the base R packages.
For example, a function named <span>filter</span> already exists in the
base R package <span>stats</span>. Thus, when you first load
<span>dplyr</span>:

In [None]:
> library(dplyr)

you get:

In [None]:
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
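
Masking only changes which function a bare name finds first. The masked version is still reachable with the double colon, as this base R sketch shows (the vector `x` is arbitrary).

```r
x <- 1:10

# Explicitly call the stats version of filter(): a linear filter, here a
# 3-point moving average. This works whether or not dplyr is loaded.
smoothed <- stats::filter(x, rep(1/3, 3))
print(smoothed)

# The dplyr version subsets rows of a dataframe instead (needs dplyr):
# dplyr::filter(iris, Sepal.Length > 7)
```

Writing the package name explicitly also documents, for whoever reads your script later, which `filter` you meant.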

Learning to use <span>dplyr</span> and <span>tidyr</span> involves
learning some new syntax and a lot of new commands, but if you plan to
do a lot of data wrangling, you may want to get to know them well. Have
a look at <https://blog.rstudio.org/2014/01/17/introducing-dplyr> and
this cheatsheet:
<https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf>,
also available in the Readings directory under
<span>DataDataData!</span> in the course bitbucket repository.

### On to data exploration

Once you have wrangled the Pound Hill data to its long format, you are
ready to go! You may want to start by examining the basic properties of
the data, such as the number of tree species (41) in the dataset, number
of quadrats (replicates) per plot and cultivation treatment, etc.

The first plot you can try is a histogram of abundances of species,
grouped by different factors. For example, you can look at distributions
of species’ abundances grouped by <span>Cultivation</span>.

Practicals {#practicals-1}
----------

<span>**(Extra Credit)**</span> We used <span>reshape2</span> in
<span>DataWrang.R</span> for wrangling that dataset. Write a new script
called <span>DataWrangTidy.R</span> that uses <span>dplyr</span> and
<span>tidyr</span> for the same data wrangling steps. The best way to do
this is to <span>cp</span> <span>DataWrang.R</span> and rename it
<span>DataWrangTidy.R</span>. Then systematically redo the script from
start to end, looking for a function in <span>dplyr</span> and <span>
tidyr</span> that does the same thing in each wrangling step.

For example, to convert from wide to long format, instead of using
<span>melt()</span> from the <span>reshape2</span> package, you can use
<span>gather()</span> from <span>tidyr</span> (and, for the reverse
conversion, <span>spread()</span> in place of <span>dcast()</span>).
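
As a starting point, here is a sketch of the wide-to-long step on made-up data (the dataframe `wide` is hypothetical). Newer versions of <span>tidyr</span> (1.0.0 onwards) also offer `pivot_longer()`, which supersedes `gather()`; both are shown so you can compare.

```r
library(tidyr)

# Hypothetical wide-format counts: one row per species, one column per quadrat
wide <- data.frame(Species = c("sp_a", "sp_b"),
                   Q1 = c(3, 0), Q2 = c(1, 5), Q3 = c(0, 2))

# gather(): name the new key and value columns, and exclude the id column
long_gather <- gather(wide, key = "Quadrat", value = "Count", -Species)
print(long_gather)

# pivot_longer(): the newer equivalent (tidyr >= 1.0.0)
long_pivot <- pivot_longer(wide, cols = -Species,
                           names_to = "Quadrat", values_to = "Count")
print(long_pivot)
```

The two results hold the same observations, though the row order differs: `gather()` stacks column by column, while `pivot_longer()` goes row by row.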

Don’t forget the cheatsheet:
<https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf>
(also available in the Readings directory under
<span>DataDataData!</span> in the course bitbucket repository).

Handling Big Data in R
----------------------

The buzzword ‘Big Data’ basically refers to datasets that have the
following properties:

1.  A dataset that does not fit into available RAM on one system
    (say, &gt; 2 gigabytes).

2.  A dataset that has so many rows (when in its long format; see the
    sections above) that it significantly slows down your analysis or
    simulation without vectorization (that is, when looping).

Both these criteria are programming language- and computer
hardware-dependent, of course. For example, a 32-bit OS can address at
most 4 GB of RAM (and often only 2–3 GB per process), so your computer
screams “Big Data!” (slows down/hangs) every time you try to handle a
dataset in that range.

R reads data into RAM all at once when you use
<span>read.table()</span> (or its wrapper, <span>read.csv()</span>; maybe
you have realized by now that <span>read.csv()</span> is basically
calling <span>read.table()</span> with a particular set of options). That
is, objects in R live in memory entirely, and big-ish data in RAM will
cause R to choke. Python has similar problems, but you can circumvent
these to an extent by using <span>numpy</span> arrays (Chapter
\[chap:pythonII\]).

There are a few options (which you can combine, of course) if you are
actually using datasets that are so large:

-   Import large files smartly; e.g., using <span>scan()</span> in R,
    and then create subsets of the data that you need. Also, use the
    <span>reshape2</span> or <span>tidyr</span> package to convert your
    data to the most “square” (so neither too long nor too wide) format
    possible. Of course, you will need subsets of data in long format
    for analysis (see sections above).

-   Use the <span>bigmemory</span> package to load data in the GB range
    (e.g., use <span>read.big.matrix()</span> instead of
    <span>read.table()</span>). Related packages provide other useful
    functions, such as <span>foreach()</span> (from the
    <span>foreach</span> package) instead of <span>for()</span> for
    better memory management.

-   Use a 64 bit version of R with enough memory and preferably on
    Linux!

-   Vectorize your analyses/simulations to the extent possible (Chapters
    \[chap:pythonII\], \[chap:R\_II\]).

-   Use databases.

-   Use distributed computing (distribute the analysis/simulation across
    multiple CPUs).
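
As one illustration of importing “smartly”, base R's `read.csv()` can be told to read only some rows and to skip whole columns. The file below is a small temporary stand-in for a large dataset, and its column names are made up.

```r
# Write a toy file to a temporary location (stand-in for a large data file)
big_file <- tempfile(fileext = ".csv")
write.csv(data.frame(ID = 1:1000, Mass = runif(1000), Notes = "x"),
          big_file, row.names = FALSE)

# Read only the first 100 rows, and skip the Notes column entirely:
# a colClasses entry of "NULL" tells read.csv() to drop that column
subset_df <- read.csv(big_file, nrows = 100,
                      colClasses = c("integer", "numeric", "NULL"))
print(dim(subset_df))

unlink(big_file)  # remove the temporary file
```

Declaring `colClasses` up front also saves R from having to guess each column's type, which tends to speed up the import of large files.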

The next subsection superficially covers databases. We will cover memory
management in the advanced Python, HPC and C weeks.

### Databases and R

R can be used to link to and extract data from online databases such as
PubMed and GenBank, or to manipulate and access your own. Computational
Biology datasets are often quite large, and it makes sense to access
their data by querying the databases instead of manually downloading
them. Similarly, your own data may be complex and large, in which case you
may want to organize and manage those data in a proper relational
database.

Practically all the data wrangling principles in the previous sections
are part and parcel of relational databases.

There are many R packages that provide an interface to databases
(SQLite, MySQL, Oracle, etc.). Check out the R packages <span>DBI</span>
(<http://cran.r-project.org/web/packages/DBI/index.html>) and
<span>RMySQL</span>
(<https://cran.r-project.org/web/packages/RMySQL/index.html>).

And just like python (see Chapter \[chap:pythonII\]), R can also be used
to access, update and manage databases. In particular,
<span>SQLite</span> allows you to build, manipulate, and access
databases easily. Try the script available in the <span>Code</span>
directory of the master repo. This assumes that you are already
familiar with the databases section of Chapter \[chap:pythonII\].
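
To give a flavour of what working with SQLite from R looks like, here is a minimal sketch that is independent of the course script. It assumes the <span>DBI</span> and <span>RSQLite</span> packages are installed; the table name `Consumer` and its contents are made up.

```r
library(DBI)   # generic database interface; RSQLite supplies the SQLite driver

# An in-memory database, so that nothing is written to disk
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table of consumers and their body masses
consumers <- data.frame(Name = c("fox", "owl", "shrew"),
                        Mass = c(6.5, 1.2, 0.01))
dbWriteTable(con, "Consumer", consumers)

# Query the table with ordinary SQL; the result comes back as a dataframe
heavy <- dbGetQuery(con,
    "SELECT Name, Mass FROM Consumer WHERE Mass > 1 ORDER BY Mass DESC")
print(heavy)

dbDisconnect(con)
```

Replacing `":memory:"` with a file path would store the database on disk instead, so that it persists between R sessions.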

Practicals wrap-up
------------------

1.  Review and make sure you can run all the commands, code fragments,
    and named scripts we have built till now and get the
    expected outputs.

2.  Annotate/comment your code lines as much and as often as necessary
    using \#.

3.  Keep all code, data and results files organized in your R module
    directory.

*<span>git add</span>, <span>commit</span> and <span>push</span> all
your code and data from this chapter to your git repository.*

Readings
--------

Check out <span>DataDataData!</span>, <span>Visualization</span> and
<span>R</span> under <span>Readings</span> on the bitbucket master
repository

-   Brian McGill’s “Ten commandments for good data management”;
    <https://dynamicecology.wordpress.com/2016/08/22/ten-commandments-for-good-data-management/>

-   This paper covers similar ground (look in your readings directory):
    Borer, E. T., Seabloom, E. W., Jones, M. B., & Schildhauer, M.
    (2009). Some Simple Guidelines for Effective Data Management.
    Bulletin of the Ecological Society of America, 90(2), 205–214.

-   wrangler: <http://vis.stanford.edu/papers/wrangler>

-   An interactive framework for data cleaning:
    <https://www2.eecs.berkeley.edu/Pubs/TechRpts/2000/CSD-00-1110.pdf>

-   <http://www.theanalysisfactor.com/wide-and-long-data/>

-   Rolandi et al. “A Brief Guide to Designing Effective Figures for the
    Scientific Paper”, doi:10.1002/adma.201102518

-   The classic Tufte:
    [www.edwardtufte.com/tufte/books\_vdqi](www.edwardtufte.com/tufte/books_vdqi).
    Available in the Central Library; I have also added extracts and a
    related book in pdf on the master repository. (By the way, check out
    what Tufte thinks of PowerPoint:
    <https://www.edwardtufte.com/tufte/powerpoint>.)

-   Franzblau & Chung, “Graphs, Tables, and Figures in
    Scientific Publications: The Good, the Bad, and How Not to Be the
    Latter”, doi:10.1016/j.jhsa.2011.12.041

-   Effective scientific illustrations:
    [www.labtimes.org/labtimes/issues/lt2008/lt05/lt\_2008\_05\_52\_53.pdf](www.labtimes.org/labtimes/issues/lt2008/lt05/lt_2008_05_52_53.pdf)

-   <https://web.archive.org/web/20120310121708/http://addictedtor.free.fr/graphiques/thumbs.php>