Introduction to R: October 13, 2021

Introduction to R

Wednesday October 13, 2021 10am-noon PDT/ 1-3pm EDT

Instructors: Saranya Canchi

Helpers: Rayna Harris, Jeremy Walter

Description:

This free two hour workshop will provide an overview of the basics of R programming language along with RStudio which is a user-friendly environment for working with R. You will be introduced to the syntax, variables, functions, packages along with the various data structures in R. You will also learn basics of data wrangling from reading data from files, subsetting, merging and exporting data out of the R environment.

While we wait to get started --

✔️ Have you checked out the pre-workshop resources page?
📝 Please fill out our pre-workshop survey if you have not already done so! Click here
💻 Open a web browser - e.g., Firefox, Chrome, or Safari. Open the binder by clicking this button:

Introduction

We are part of the training and engagement team for the NIH Common Fund Data Ecosystem, a project supported by the NIH to increase data reuse and cloud computing for biomedical research.

Have you heard of the NIH Common Fund Data Ecosystem?

Put up a ✔️ for yes and a ❎ for no!

You can contact us at training@cfde.atlassian.net

How to ask questions

If you have questions at any point,

Drop them in the chat, or
Direct messages to the helpers are welcome, or
Unmute yourself and ask during the workshop

We're going to use the "raise hand" reaction in zoom to make sure people are on board during the hands-on activities.

Some background

What is R?

R is a statistical computing and data visualization programming language.

What is RStudio?

R is often used via integrated development environments (IDE), and RStudio is probably the most popular option. R and RStudio work on Mac, Linux, and Windows operating systems. The RStudio layout displays lots of useful information and allows us to run R code from a script and view and save outputs all from one interface.

RStudio panels
The 4 panels:
- View R scripts
- Console to run code
- Environment, history and collections
  - Environment: collection of objects, variables, and functions
  - View/load/manage R package list
- View file system, help docs, view/save plots

Let's customize the layout to see code better!

Terminology

Function: A key feature of R is functions. Functions are “self contained” modules of code that accomplish a specific task. Functions usually take in some sort of data structure (value, vector, dataframe etc.), process it, and return a result.
Variable/object: Variables are containers for storing data values
Assignment operator: The <- is called an assignment operator and it is an R convention for assigning values to variable names (i.e., in Python it is =).
Packages: R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN (the Comprehensive R Archive Network)
Library: Often used interchangeably with package. A library is usually a single file containing many functions or objects, and a package can contain libraries and other files.

Our example dataset

We are using sequencing data from a European Nucleotide Archive dataset (PRJEB5348).

This dataset is for an RNA-Seq experiment comparing two yeast strains, SNF2 (mut) and WT. They sequenced 96 samples in 7 different lanes, with 48 wt and 48 mut strains.

So there are 48 biological replicates and 7 technical replicates, with a total of 672 samples!

Loading packages

Let's start by loading the packages we will be using!

📓 Please note!

The RStudio binder comes pre-loaded with R and the R packages we're using in today's workshop. On your local computer, you need to install R, RStudio, and the R packages before using them. Here's the code to install R packages:

# install
install.packages('ggplot2')
install.packages('readr')
install.packages('dplyr')

All of these packages belong to a giant package called tidyverse. On your local computer, you can install all these useful packages and more by simply intalling tidyverse.

# install.packages('tidyverse')

We are not doing this today because tidyverse makes the RStudio binder very slow.

Some Useful Info 🏝️

Nucleic Acid Conc. = RNA concentration in the sample in units of nanogram per microliter (ng/ul)
260/280 = ratio measured with a spectrophotometer (i.e., Nanodrop) to assess RNA sample purity. For RNA, the ratio should be ~2.0
Total RNA = total RNA molecular weight in the sample in units of microgram (ugm)
General information about RNA quality for RNA-Seq

The `library()` function:

library(dplyr)
library(readr)
library(ggplot2)

The notes about masked objects mean that two or more R packages have functions with the same name

For example, both stats and dplyr have a functions called filter

check them out with ?stats::filter and ?dplyr::filter)

# are used for notes in scripts or "commented out" lines of code that we don't want to run

Reading in the data

Next, we will examine the RNA quality and quantity that the authors measured for all their samples.

We will explore this dataset using some base R and an RStudio package called dplyr.

The data file we're using today is loaded in the binder in the home directory and it is a tab-delimited .tsv file. To read in the dataset in R, we are using the read_delim() function from the Base R package.

read.delim(file = '~/sample_details.tsv', 
                              check.names=FALSE)

Functions like read_delim specify required and optional inputs within ( ).

Let's assign the input data table values to the variable experiment_info.

experiment_info <- read.delim(file = '~/sample_details.tsv', 
                              check.names=FALSE)

Now let's look at the data!

# how big is the data set ?
dim(experiment_info)

# examine the first 6 rows across all columns of the data
head(experiment_info)

Tibbles are a type of data table -- a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. data.frame is another type of data table in R.

Data Wrangling with `dplyr`

Why might you need to wrangle data?

Real world data is messy and needs some formatting before you can use it for analysis.
Often, you are only interested in a subset of the data and want to avoid loading the entire data set.
Some programs/visualizations require data to be in a specific format.

dplyr is a R package that contains sets of tools for dataframe manipulation.

For this workshop we will cover three dplyr functions:

dplyr verb	Description
`select()`	subset columns
`filter()`	subset rows matching some condition
`mutate()`	create new columns using information from other columns

`select()`

There are some empty columns in the experiment data file we have been working with. Let's create a subset of data without those empty columns using the select function.

experiment_info_cleaned <- select(experiment_info,                                   
                            Sample, 
                            Yeast Strain, 
                            Nucleic Acid Conc., 
                            Unit, 
                            A260, 
                            A280,
                            260/280,
                            Total RNA)

# As before, since R cannot parse spaces in column names, we need to enclose them 
# in backticks to indicate that these words belong together.

experiment_info_cleaned <- select(experiment_info,                                   
                            Sample, 
                            `Yeast Strain`, 
                            `Nucleic Acid Conc.`, 
                            Unit, 
                            A260, 
                            A280,
                            260/280,
                            `Total RNA`)

# As a general rule, it is best to avoid column names that start with a number;
# we can use backticks for this column name.

experiment_info_cleaned <- select(experiment_info,                                   
                            Sample, 
                            `Yeast Strain`, 
                            `Nucleic Acid Conc.`, 
                            Unit, 
                            A260, 
                            A280,
                            `260/280`,
                            `Total RNA`)

# Check the dimensions of the subsetted dataframe
dim(experiment_info_cleaned)

To make it easier to specify column names, let's rename to remove spaces (you could also specify the column names using backticks as we did above):

# current column names
colnames(experiment_info_cleaned)

# new names specified with a vector
colnames(experiment_info_cleaned) <- ('Sample', 
'Yeast_strain',
'Nucleic_Acid_Conc.',
'ng/ul',
'A260', 
'A280',
'A260_280',
'Total_RNA',
'ugm')

You can also use select to specify the columns you don't want. This can be done with the - notation.

# Remove two columns: A260 and A280
select(experiment_info, -A260, -A280)

Exercise 1

Select the columns Sample, Yeast Strain, A260 and A280 and assign them to a new object called "experiment_data".

Solution

experiment_data <-  select(experiment_info_cleaned, 
                           Sample, 
                           Yeast_strain,
                           A260, 
                           A280)
# Check the data
head(experiment_data)

`filter()`

We learned to subset entire columns. Using the filter function, we can subset based on conditions, using a logical vector with TRUE and FALSE, or with operators.

The filter function works on rows, so we can use it to subset data based on conditions in a column. Here's how you would use the function:

filter(experiment_info_cleaned, 
       Nucleic_Acid_Conc. > 1500)
       
filter(experiment_info_cleaned, 
       A260_280 >= 2.1 & 
         Nucleic_Acid_Conc. > 1500)

Here's a list of operators you can use:

Conditional Subsetting	Description
>	greater than the value
>=	greater than or equal to the value
<	less than the value
<=	less than or equal to the value
l	applies logical operator `OR`
`&`	applies logical operator `AND`
`==`	equal to the value
`%in%`	operator used to identify if the value belongs to a vector
`!=`	not equal to the value

`mutate()`

mutate can be used to create new columns in our data frame using existing data.

experiment_info_wnewcolumn <- mutate(experiment_info_cleaned, 
                             conc_ug_uL=Nucleic_Acid_Conc./1000)

Pipes

So far, we have applied the actions individually. What if we wanted to apply select and filter at the same time?

We can use the concept of a pipe which is represented by %>% in R. This is equivalent to | in shell.

Pipes are available via the magrittr package, installed automatically with dplyr. If you use RStudio, the shortcut to produce the pipe is:

Ctrl + Shift + M on a PC or
Cmd + Shift + M on a Mac

Here's how you would use it:

experiment_info_wt <- experiment_info_cleaned %>% 
                      filter(Yeast_Strain == 'WT' & 
                               Nucleic_Acid_Conc. > 1500) %>% 
                      select(Sample,
                             Yeast_Strain,
                             A260, 
                             A280)

(when using pipes, the order of the command matters!)

Exercise 2

Create a new data table called exp_info_wrangled and only keep the samples that have 260/280 ratio values < 2.15. Add a new column called total_RNA_in_nano_grams that has total RNA in nano grams. Include the columns sample, yeast strain and 260/280 and Total_RNA_in_nano_grams

Solution

# create data object
exp_info_wrangled<- experiment_info_cleaned %>% 
  filter(`260/280` < 2.15) %>% 
  mutate(total_RNA_in_nano_grams = `Total RNA`*1000) %>% 
  select(Sample,
  Yeast_strain,
  A260_280, 
  total_RNA_in_nano_grams) 

# check the data
head(exp_info_wrangled)

Plotting with `ggplot`

ggplot and its related packages are a flexible and widely-used plotting R package. You can make standard plots (i.e., scatter, histogram, box, bar, density plots), as well as a variety of other plot types like maps and phylogenetic trees!

It's a very helpful tool for exploring your data/results and generating publication-ready figures.

We're going to create some plots with the dataframe from the previous section.

The ggplot syntax always begins with a function that specifies the data and then layers on additional functions to specify a whole bunch of customizable plot aesthetics. Here, we specify the data object as the experiment_info_cleaned dataframe and set the axes with the mapping= flag.

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA,
       y=Nucleic_Acid_Conc.))

RStudio has some really helpful features for finding help docs, and for autofilling. Use the tab key to autofill variable or R package function names for example. Find help docs about functions or R packages with the ? key:

# help docs for ggplot() function
? ggplot 

# help docs for ggplot2 R package
? ggplot2

Let's make a scatter plot by adding geom_point() with a + to the ggplot() function:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA,
       y=Nucleic_Acid_Conc.)) +
  geom_point() +
  ggtitle('Scatter plot')

Notice the indentation after the + sign.

Add transparency to points (also works for other plot types!) with the alpha= option (0=completely transparent, 1=solid color) and change point size with size=:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA,
       y=Nucleic_Acid_Conc.)) +
  geom_point(alpha=0.6, size=4) +
  ggtitle('Scatter with transparent points')

We can modify the code to fill points by yeast strain and make point size correspond to the 260/280 ratio:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA,
       y=Nucleic_Acid_Conc.,
       fill=Yeast_strain,
       size=A260_280)) +
  geom_point(alpha=0.6, pch=21, col='white') +
  ggtitle('Scatter: color by strain \n and
  size by 260/280 ratio')

There are 2 strains in this dataset, WT and snf2.

unique(experiment_info_cleaned$Yeast_strain)

Let's make a boxplot by strain:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Yeast_strain,
       y=Total_RNA)) +
  geom_boxplot() +
  ggtitle('Boxplot')

Boxplots can be more informative if we add points to see the distribution of the data:

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Yeast_strain,
       y=Total_RNA)) +
  geom_point(alpha=0.6) +
  geom_boxplot() +
  ggtitle('Boxplot w/ points')

Exercise 3

What happens if you reverse the order of the geom_point() and geom_boxplot() functions for the boxplot code above?

Solution

Above, the points are plotted first, then the boxplot. If we reverse the order, the points will be plotted on top of the boxplot.

Exercise 4

What happens if you use geom_jitter() instead of geom_point()?

Solution

geom_jitter() 'jitters' or spreads the points out so they are not overlapping as much.

More modifications and plot types!

Modifying axis label text: xlab() and ylab()
Facets: facet_wrap() or facet_grid()

ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=Total_RNA,
       y=Nucleic_Acid_Conc.,
       col=Yeast_strain)) +
  geom_point() +
  facet_wrap(~Yeast_strain) +
  xlab('Total RNA (ugm)') +
  ylab('Nucleic Acid Conc. (ng/ul)') +
  ggtitle('Facet scatter plot')

Exercise 5

Try making 2 plots of the A260_280 ratio values: 1) a histogram with geom_histogram() and 2) a density plot with geom_density().
Hint: a histogram plots counts on the y-axis so you only need to specify the x-axis variable in the ggplot() function.

Bonus: how would you make 1 plot with a histogram for each strain? Solutions Here is the basic histogram code - you may need to modify the bin size:

ggplot(data=experiment_info_cleaned, 
   mapping=aes(x=A260_280))+
  geom_histogram(bins=20) +
  ggtitle('Histogram')

Here is the basic density plot code:

ggplot(data=experiment_info_cleaned,
    mapping=aes(x=A260_280)) +
  geom_density() +
  ggtitle('Density plot')

Extras! Below are more ways to customize these plots.

You could add colors and plot by strain type like this:

ggplot(data=experiment_info_cleaned, 
   mapping=aes(x=A260_280, fill=Yeast_strain))+
  geom_histogram(bins=20, col='black') +
  ggtitle('Histogram by strain')

And we could facet the plot by strain like this:

ggplot(data=experiment_info_cleaned, 
   mapping=aes(x=A260_280,
   fill=Yeast_strain))+
  geom_histogram(bins=20,
  col='black') +
  facet_grid(Yeast_strain~.) + #this sets whether the facet is horizontal or vertical. here, we will get 2 rows of histograms by yeast strain
  ggtitle('Facet histogram')

One more edit! Let's change the colors on the histogram. There are many built-in color palettes! Check more out here:

ggplot(data=experiment_info_cleaned, 
   mapping=aes(x=A260_280, fill=Yeast_strain))+
  geom_histogram(bins=20, col='black') +
  facet_grid(Yeast_strain~.) +
  scale_fill_manual(values =c("cornflowerblue", 
  "goldenrod2")) + # Add blue and yellow colors that are more colorblind friendly for plotting
  ggtitle('Histogram w/ custom colors')

Now, let's switch this to a density plot and change the plot background

ggplot(data=experiment_info_cleaned, 
   mapping=aes(x=A260_280, fill=Yeast_strain))+
  geom_density(alpha=0.6) +
  facet_grid(Yeast_strain~.) +
  scale_fill_manual(values = c("cornflowerblue", "goldenrod2")) +
  theme_bw() +
  ggtitle('Density plot')

Hopefully this exercise demonstrates how flexible ggplot is!

Save plots

Use the ggsave() function to save images (variety of file options available), for example:

myplot <- ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280, fill=Yeast_strain))+
      geom_density(alpha=0.6) +
      facet_grid(Yeast_strain~.) +
      scale_fill_manual(values = c("cornflowerblue",
      "goldenrod2")) +
      theme_bw() +
      ggtitle('Density plot')
ggsave(filename='mydensityplot.jpg', plot=myplot,
height=4,
width=4, 
units='in',
dpi=600)

Want to save a figure with multiple plots? Check out cowplot!

# install cowplot package
install.packages('cowplot')

# load library
library(cowplot)

# assign plots to variables
hist_plt <- ggplot(data=experiment_info_cleaned, 
       mapping=aes(x=A260_280))+
      geom_histogram(bins=20) +
      ggtitle('Histogram')

density_plt <- ggplot(data=experiment_info_cleaned,
        mapping=aes(x=A260_280)) +
      geom_density() +
      ggtitle('Density plot')

# save plot as a jpeg file. note that the syntax is similar to ggsave() 
# but resolution is specified with res=, instead of dpi=
jpeg('histogram_and_density_plt.jpg',
height=5,
width=7, 
units='in',
res=500)

# arrange the plot objects in a grid, label plots, and specify plots in 1 row
plot_grid(hist_plt,
density_plt,
labels=c('A', 'B'),
nrow=1)

# shut down plotting device
dev.off()

There are many more options to edit plot aesthetics and there are lots of excellent tutorials and resources, including:

Rstudio cheatsheet for ggplot
R Cookbook graphs section
It's important to consider colorblind-friendly color palettes for plots too!
The ColorBrewer website is a fantastic tool for selecting nice color combinations
Viridis has one of the best color contrasts between light and dark colors; it works reasonably well for printing color plots in black and white
cowplot is a useful package for further customization of plot arrangements and publication-ready figures
ggplot tutorial from Cedric Scherer

Be sure to save any work/notes you took in the binder to your computer. Any new files/changes are not available when the binder session is closed!

For example, select a file, click "More", click "Export":

Wrap up

📝 Please fill out our post-workshop survey! We use the feedback to help improve our training materials and future workshops. Click here!

Questions?

Acknowledgements: This lesson was inspired by material from 2019 ANGUS Intro to R lesson

Resources

Schedule

clock time	time	section
10:00 - 10:15	15 mins	intro
10:15 - 11:00	45 mins	data types/wrangling
11:00 - 11:45	45 mins	plotting
11:45 - 12:00	15 mins	wrap up/Q&A

Appendix

More about data types

There are six data types in R:

Data Type	Description	Examples
Numeric	numerical values	1, 1.5, 20, pi
Character	letters or words, enclosed in quotes	“random”, “2”, “TRUE”, “@”, “#”
Integer	for integer numbers; the L indicates to R that it’s an integer)	2L, 500L, -17L
Logical	for boolean type data	TRUE, FALSE, T, F
Complex	numbers with real and imaginary parts	4+ 3i
Raw	contains raw bytes encoding

# what type of data structure is this ?
class(experiment_info)

Some useful commands

There are other useful commands to get attribute information about a dataset.

# how many number of rows ?
nrow(experiment_info)

# how many number of columns ?
ncol(experiment_info)

# what are the column names of your dataset ?
colnames(experiment_info)

# what are the rownames of your dataset ?
rownames(experiment_info)

# what does the last few lines of my dataset look like ?
tail(experiment_info)

# what is the structure of my dataset ?
str(experiment_info)

data.frame is tabular data similar to an excel spreadsheet which can contain mixed data types.

vectors are uni-dimensional data structures with uniform data types.

Let's create a vector from the table we have.

a260_vector <- experiment_info$A260

# confirm the data structure
class(a260_vector)

# get the data type of the vector
typeof(a260_vector)

We can access individual elements within the vector using it's position. Indices start at 1 and use the [] notation.

# obtain the fifth element in the vector
a260_vector[5]

# subset the fifth and tenth element
a260_vector[c(5,10)]

# subset the fifth through tenth elements
a260_vector[c(5:10)]

Another, two dimensional data structure is the matrix. The major difference between matrix and dataframe is that matrix can have only one type of data type i.e. either character or numeric but not both!

Let's create a matrix containing the Nucleic Acid Conc., 260/280 ratio and Total RNA values.

# Create a matrix by subsetting the data frame by selecting 
# the appropriate columns by column names
yeast_strain_matrix <- data.matrix(experiment_info[,
c("Nucleic Acid Conc.",
"A260_280", 
"Total_RNA")])
# View the data
head(yeast_strain_matrix )

Split-apply-combine

The approach of split-apply-combine allows you to split the data into groups, apply a function/analysis and then combine the results. We can do this using group_by() and summarize()

group_by() takes as argument the column name/s that contain categorical variables that can be used to group the data.

experiment_info_cleaned %>% 
  group_by(Yeast_strain) %>% 
  summarize(mean_conc. = mean(Nucleic_Acid_Conc.))
  
# summarize using more than one column
experiment_info_cleaned %>% 
  group_by(Yeast_strain) %>% 
  summarize(mean_conc. = mean(Nucleic_Acid_Conc.),
  mean_total_RNA = mean(Total_RNA))

Next, we can use arrange() to sort the data in either ascending (defalt order) or descending order.

# arrange new table in ascending mean concentrations
experiment_info_cleaned %>% 
  group_by(Yeast_strain) %>% 
  summarize(mean_conc. = mean(Nucleic_Acid_Conc.),
  mean_total_RNA = mean(Total_RNA)) %>% 
  arrange(mean_concentration) 
  
# arrange new table in descending mean concentrations 
experiment_info_cleaned %>% 
  group_by(Yeast_strain) %>% 
  summarize(mean_conc. = mean(Nucleic_Acid_Conc.),
  mean_total_RNA = mean(Total_RNA)) %>% 
  arrange(desc(mean_concentration))

We can also examine the number of categorical values using count()

experiment_info_cleaned %>% 
  count(Yeast_strain)

Bonus exercises

Exercise 6

Generate a vector containing the yeast strain.

Solution

yeast_strain_vector <- experiment_info$Yeast_strain

Alternatively, you could also use the index of the column:

# dataframe[row number, column number]
# subset based on column index
yeast_strain_vector <- experiment_info[,1]

Exercise 7

Create a new object called wt_high_conc that contains data from WT strains and contain nucleic acid concentration of more than 1500 ng/uL.

Solution

wt_high_conc<- filter(experiment_info_cleaned,
Yeast_strain == 'WT' &
Nucleic_Acid_Conc. > 1500)

# Check the data
head(wt_high_conc)

Exercise 8

Create a new table called library_start that includes the columns sample, yeast strain and two new columns called RNA_100 with the calculation of microliters to have 100ng of RNA and another column called water that says how many microliters of water we need to add to that to reach 50uL.

Solution

# create data object
library_start <- experiment_info_cleaned %>% 
  mutate(RNA_100 = 100/ Nucleic_Acid_Conc.,
          water = 50 - RNA_100) %>% 
  select(Sample, Yeast_strain, RNA_100, water) 

# check the data
head(library_start)

Exercise 9

Calculate the mean value for A260/A280 for each of the yeast strains. Can these sample be used for downstream analysis ?

Solution

# calculate mean values
experiment_info_cleaned %>% 
  group_by(Yeast_strain`) %>% 
  summarize(mean_ratio = mean(A260_280))

Home
Resources for Attendees
Resources for Instructors
Training Workshop Notes
- Amazon Web Services (AWS)
- Conda
- HuBMAP Tools
  - April 28, 2021
- R
- RNA-Seq Concepts, Design and Workflows
- RNA-Seq in the Cloud
  - June 21 & 23, 2021
- Snakemake Part I & II
  - May 12 & 14
- UNIX

Introduction to R: October 13, 2021

Introduction to R

Introduction

How to ask questions

Some background

Terminology

Our example dataset

Loading packages

The library() function:

Reading in the data

Now let's look at the data!

Data Wrangling with dplyr

Why might you need to wrangle data?

select()

Exercise 1

filter()

mutate()

Pipes

Exercise 2

Plotting with ggplot

Exercise 3

Exercise 4

More modifications and plot types!

Exercise 5

Save plots

Wrap up

Resources

Schedule

Appendix

More about data types

Some useful commands

Split-apply-combine

Bonus exercises

Exercise 6

Exercise 7

Exercise 8

Exercise 9

Clone this wiki locally

The `library()` function:

Data Wrangling with `dplyr`

`select()`

`filter()`

`mutate()`

Plotting with `ggplot`