# DSI Summer Workshops Series

## June 21, 2018

Peggy Lindner<br>
Center for Advanced Computing & Data Science (CACDS)<br>
Data Science Institute (DSI)<br>
University of Houston  
plindner@uh.edu 


This tutorial is available at:
http://130.211.184.150/hub/login


## Computational Genomics with R

A basic understanding of genomic data analysis using R.

### Goals

* If you are not familiar with R, you will get the basics of R and dive right into specialized uses of R for computational genomics.
* You will understand genomic intervals and operations on them, such as overlap.
* You will be able to use R and its vast package library to do sequence analysis, such as calculating GC content for given segments of a genome or finding transcription factor binding sites.
* You will be familiar with visualization techniques used in genomics, such as heatmaps, meta-gene plots and genomic track visualization.
* You will be familiar with supervised and unsupervised learning techniques, which are important in data modelling and exploratory analysis of high-dimensional data.


![](Images/DataAnalysis.png)

## Some R Basics 
### Packages and functions

In [None]:
library(MASS)
ls("package:MASS") # functions in the package
ls() # objects in your R environment
# get help on hist() function
?hist
help("hist")
# search the word "hist" in help pages
help.search("hist")
??hist


### Basic Computations in R

In [None]:
2 + 3 * 5       # Note the order of operations.
log(10)        # Natural logarithm with base e
5^2            # 5 raised to the second power
3/2            # Division
sqrt(16)      # Square root
abs(3-7)      # Absolute value of 3-7
pi             # The number pi
exp(2)        # exponential function
# This is a comment line

### Data Structures
#### Vectors

In [None]:
x <- c(1, 3, 2, 10, 5)  #create a vector x with 5 components
x
## [1]  1  3  2 10  5
y <- 1:5  #create a vector of consecutive integers y
y + 2  #scalar addition
## [1] 3 4 5 6 7
2 * y  #scalar multiplication
## [1]  2  4  6  8 10
y^2  #raise each component to the second power
## [1]  1  4  9 16 25
2^y  #raise 2 to the first through fifth power
## [1]  2  4  8 16 32
y  #y itself has not been changed
## [1] 1 2 3 4 5
y <- y * 2
y  #it is now changed
## [1]  2  4  6  8 10
r1 <- rep(1, 3)  # create a vector of 1s, length 3
length(r1)  #length of the vector
## [1] 3
class(r1)  # class of the vector
## [1] "numeric"
a <- 1  # this is actually a vector length one

#### Matrix

In [None]:
x <- c(1, 2, 3, 4)
y <- c(4, 5, 6, 7)
m1 <- cbind(x, y)
m1
##      x y
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [4,] 4 7
t(m1)  # transpose of m1
##   [,1] [,2] [,3] [,4]
## x    1    2    3    4
## y    4    5    6    7
dim(m1)  # 4 by 2 matrix
## [1] 4 2

#### Data Frames

In [None]:
chr <- c("chr1", "chr1", "chr2", "chr2")
strand <- c("-","-","+","+")
start<- c(200,4000,100,400)
end<-c(250,410,200,450)
mydata <- data.frame(chr,start,end,strand)
# change column names
names(mydata) <- c("chr","start","end","strand")
mydata
# OR name the columns when creating the data frame
mydata <- data.frame(chr=chr,start=start,end=end,strand=strand)
mydata

#### Slicing and Dicing

![](Images/slicingDataFrames.png)

In [None]:
mydata[,2:4] # columns 2,3,4 of data frame
mydata[,c("chr","start")] # columns chr and start from data frame
mydata$start # variable start in the data frame
mydata[c(1,3),] # get 1st and 3rd rows
mydata[mydata$start>400,] # get all rows where start>400
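
Row and column selections can be combined, and several conditions can be joined with `&` or `|`. The following is a minimal sketch that re-creates the same `mydata` table so it runs on its own; the filter itself is chosen only for illustration.

```r
# re-create the example data frame so this cell is self-contained
mydata <- data.frame(chr=c("chr1","chr1","chr2","chr2"),
                     start=c(200,4000,100,400),
                     end=c(250,410,200,450),
                     strand=c("-","-","+","+"))
# rows on chr2 that start before position 300
sel <- mydata[mydata$chr=="chr2" & mydata$start<300,]
sel
# same rows, keeping only two columns
mydata[mydata$chr=="chr2" & mydata$start<300, c("start","end")]
# subset() expresses the same filter without repeating mydata$
subset(mydata, chr=="chr2" & start<300)
```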


#### List

In [None]:
# example of a list with 4 components
# a string, a numeric vector, a matrix, and a scalar
w <- list(name="Fred",
       mynumbers=c(1,2,3),
       mymatrix=matrix(1:4,ncol=2),
       age=5.3)
w

In [None]:
w[[3]] # 3rd component of the list
w[["mynumbers"]] # component named mynumbers in list
w$age

#### Factors

In [None]:
features=c("promoter","exon","intron")
f.feat=factor(features)
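
Factors store categorical data as integer codes plus a set of labels (levels). A small sketch of how to inspect them, using a slightly longer `features` vector made up for illustration:

```r
features <- c("promoter","exon","intron","exon","exon")
f.feat <- factor(features)
levels(f.feat)      # distinct categories, sorted alphabetically by default
as.integer(f.feat)  # integer code of each element
table(f.feat)       # number of elements per category
```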

### Data types
 * numeric
 * logical
 * character
 * integer

In [None]:
#create a numeric vector x with 5 components
x<-c(1,3,2,10,5)
x
#create a logical vector x
x<-c(TRUE,FALSE,TRUE)
x
# create a character vector
x<-c("sds","sd","as")
x
class(x)
# create an integer vector
x<-c(1L,2L,3L)
x
class(x)

### Reading and Writing Data

Most genomics data sets come as genomic intervals associated with a score. That means the data will mostly be in a table format with columns denoting chromosome, start position, end position, strand and score. One popular format is BED, used primarily by the UCSC Genome Browser, but most other genome browsers and tools support it as well. All the annotation data used here is in BED format. In R, you can easily read tabular data with the read.table() function.

In [None]:
enh.df <- read.table("data/subset.enhancers.hg18.bed", header = FALSE)  # read enhancer marker BED file
cpgi.df <- read.table("data/subset.cpgi.hg18.bed", header = FALSE) # read CpG island BED file
# check the first lines to see what the data looks like
head(enh.df)
head(cpgi.df)
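
BED files have no header line, so read.table() names the columns V1, V2, and so on. The following is a minimal sketch of assigning meaningful names while reading. It first writes a tiny made-up BED file to a temporary location so that it runs without the workshop data, and the column names are our own choice rather than anything the file provides.

```r
# write a small, made-up BED file (chromosome, start, end, name, score, strand)
bed.file <- tempfile(fileext=".bed")
writeLines(c("chr21\t100\t200\tfeat1\t10\t+",
             "chr21\t500\t650\tfeat2\t25\t-"), bed.file)
# read it with explicit column names and without converting strings to factors
bed.df <- read.table(bed.file, header=FALSE, sep="\t",
                     stringsAsFactors=FALSE,
                     col.names=c("chr","start","end","name","score","strand"))
str(bed.df)
# interval lengths; BED starts are 0-based and ends are exclusive
bed.df$end - bed.df$start
```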

In [None]:
write.table(cpgi.df,file="cpgi.txt",quote=FALSE,
            row.names=FALSE,col.names=FALSE,sep="\t")

In [None]:
save(cpgi.df,enh.df,file="mydata.RData")
load("mydata.RData")
# saveRDS() can save one object at a time
saveRDS(cpgi.df,file="cpgi.rds")
x=readRDS("cpgi.rds")
head(x)

One important thing is that with save() you can save many objects at a time, and when they are loaded into memory with load() they retain their variable names. For example, when you run load("mydata.RData") in a fresh R session, objects named “cpgi.df” and “enh.df” will be created. That means you have to know what names you gave the objects before saving them. In contrast, when you save an object with saveRDS() and read it with readRDS(), the name of the object is not retained; you need to assign the output of readRDS() to a new variable (“x” in the above code chunk).
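
The difference can be seen in a small self-contained sketch. The objects `a` and `b`, the temporary file paths, and the use of a separate environment to stand in for a fresh session are all assumptions made for illustration.

```r
a <- 1:3
b <- c("x","y")
rdata.file <- tempfile(fileext=".RData")
rds.file <- tempfile(fileext=".rds")
save(a, b, file=rdata.file)
saveRDS(a, file=rds.file)
# load into a separate, empty environment to mimic a fresh session
fresh <- new.env()
restored <- load(rdata.file, envir=fresh)
restored          # load() returns the names of the objects it created
ls(fresh)
# readRDS() returns the object itself; we choose the name
z <- readRDS(rds.file)
identical(z, a)
```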

### Plotting in R
Let us sample 50 values from a normal distribution and make some plots.

In [None]:
# setting figure size in notebook
options(repr.plot.width = 4, repr.plot.height = 4)
# sample 50 values from a normal distribution
# and store them in vector x
x<-rnorm(50)
hist(x) # plot the histogram of those values

In [None]:
#let's add a title and change the color
hist(x,main="Hello histogram!!!",col="red")

#### Scatterplot

In [None]:
# randomly sample 50 points from a normal distribution
y<-rnorm(50)
#plot a scatter plot
# control x-axis and y-axis labels
plot(x,y,main="scatterplot of random samples",
        ylab="y values",xlab="x values")

#### Boxplot

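In a boxplot, the box spans the first quartile (Q1) to the third quartile (Q3), and the whiskers reach to the most extreme values that still lie within the following limits, where IQR = Q3 - Q1 is the interquartile range:

lowerWhisker = Q1 - 1.5*IQR and upperWhisker = Q3 + 1.5*IQR

In addition, outliers can be depicted as dots. In this case, outliers are the values that remain outside the whiskers.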

In [None]:
 boxplot(x,y,main="boxplots of random samples")
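
The whisker limits can also be computed by hand. The sketch below uses made-up data with one deliberately extreme value. Note that boxplot() takes Q1 and Q3 from the "hinges" returned by fivenum(), which are close to, but not always identical to, the quartiles from quantile().

```r
set.seed(1)                 # fixed seed so the sketch is reproducible
v <- c(rnorm(50), 6)        # 50 random values plus one clear outlier
h <- fivenum(v)             # min, lower hinge, median, upper hinge, max
iqr <- h[4] - h[2]          # interquartile range based on the hinges
lower <- h[2] - 1.5*iqr     # lower whisker limit
upper <- h[4] + 1.5*iqr     # upper whisker limit
outliers <- v[v < lower | v > upper]
outliers
# boxplot.stats() is what boxplot() uses internally
boxplot.stats(v)$out
```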

#### Barplot

In [None]:
perc=c(50,70,35,25)
barplot(height=perc,names.arg=c("CpGi","exon","CpGi","exon"),
        ylab="percentages",main="imagine %s",
        col=c("red","red","blue","blue"))
legend("topright",legend=c("test","control"),fill=c("red","blue"))

#### Saving plots
If you want to save your plots to an image file, there are a couple of ways of doing that. Normally, you will have to do the following:
1. Open a graphics device
2. Create the plot
3. Close the graphics device

In [None]:
pdf("myplot.pdf",width=5,height=5)
plot(x,y)
dev.off()

Alternatively, you can first create the plot and then copy it to a graphics device.

In [None]:
plot(x,y)
dev.copy(pdf,"myplot.pdf",width=7,height=5)
dev.off()
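
Forgetting to close the device leaves the file incomplete. One way to guard against that is to wrap the open, plot and close steps in a small function. `save_pdf` below is a hypothetical helper written for this sketch, not part of base R, and the output path is a temporary file.

```r
# hypothetical helper: open a PDF device, draw, and always close the device
save_pdf <- function(file, draw, width=5, height=5) {
  pdf(file, width=width, height=height)  # 1. open a graphics device
  on.exit(dev.off())                     # 3. close it, even if drawing fails
  draw()                                 # 2. create the plot
  invisible(file)
}
out.file <- tempfile(fileext=".pdf")
save_pdf(out.file, function() plot(1:10, (1:10)^2, main="saved plot"))
file.exists(out.file)
```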

## Operations on Genomic Intervals
### GenomicRanges package


The [Bioconductor](http://bioconductor.org) project has a dedicated package called **GenomicRanges** to deal with genomic intervals. In this section, we will provide use cases involving operations on genomic intervals. The main reason we will stick to this package is that it provides tools to do overlap operations. However, the package requires that users operate on specific data types that are conceptually similar to a tabular data structure but are implemented in a way that makes overlapping and related operations easier. The main object we will be using is called a GRanges object, and we will also see some other related objects from the GenomicRanges package.

#### How to create and manipulate a GRanges object

In [None]:
library(GenomicRanges)
gr=GRanges(seqnames=c("chr1","chr2","chr2"),
           ranges=IRanges(start=c(50,150,200),end=c(100,200,300)),
           strand=c("+","-","-")
)
gr

In [None]:
# subset like a data frame
gr[1:2,]

In [None]:
gr=GRanges(seqnames=c("chr1","chr2","chr2"),
           ranges=IRanges(start=c(50,150,200),end=c(100,200,300)),
           names=c("id1","id3","id2"),
           scores=c(100,90,50)
)
# or add metadata later (this replaces the existing metadata)
mcols(gr)=DataFrame(name2=c("pax6","meis1","zic4"),
                    score2=c(1,2,3))

gr=GRanges(seqnames=c("chr1","chr2","chr2"),
           ranges=IRanges(start=c(50,150,200),end=c(100,200,300)),
           names=c("id1","id3","id2"),
           scores=c(100,90,50)
)

# or append new columns to the existing metadata
mcols(gr)=cbind(mcols(gr),
                DataFrame(name2=c("pax6","meis1","zic4")) )
gr

In [None]:
# elementMetadata() and values() do the same thing
elementMetadata(gr)

In [None]:
values(gr)

#### Getting genomic regions into R as GRanges objects

There are multiple ways to read your genomic features into R and create a GRanges object. Most genomic interval data comes in a tabular format that holds the location of each interval along with some other information. We already showed how to read BED files as a data frame. Now we will show how to convert a data frame to a GRanges object.

In [None]:
# read CpGi data set
cpgi.df = read.table("data/cpgi.hg19.chr21.bed", header = FALSE,
                     stringsAsFactors=FALSE) 
# remove chr names with "_"
cpgi.df =cpgi.df [grep("_",cpgi.df[,1],invert=TRUE),]

cpgi.gr=GRanges(seqnames=cpgi.df[,1],
                ranges=IRanges(start=cpgi.df[,2],
                              end=cpgi.df[,3]))

cpgi.gr
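
GenomicRanges also offers makeGRangesFromDataFrame(), which picks up the chromosome, start, end and strand columns by name. The sketch below uses a made-up table rather than the workshop files. BED start coordinates are 0-based while GRanges coordinates are 1-based, so the `starts.in.df.are.0based` argument is used here to shift the starts; whether you need that shift depends on how your own file was produced.

```r
library(GenomicRanges)
# made-up table in a BED-like layout (0-based starts)
df <- data.frame(chr=c("chr21","chr21"),
                 start=c(100,500),
                 end=c(200,650),
                 strand=c("+","-"),
                 score=c(10,25))
gr2 <- makeGRangesFromDataFrame(df,
                                keep.extra.columns=TRUE,       # keep "score" as metadata
                                starts.in.df.are.0based=TRUE)  # shift starts by +1
gr2
```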

Sometimes pre-processing is necessary. Below, we read RefSeq annotations and reduce each interval to its transcription start site (TSS).

In [None]:
# read refseq file
ref.df = read.table("data/refseq.hg19.chr21.bed", header = FALSE,
                     stringsAsFactors=FALSE) 
ref.gr=GRanges(seqnames=ref.df[,1],
               ranges=IRanges(start=ref.df[,2],
                              end=ref.df[,3]),
               strand=ref.df[,6],name=ref.df[,4])
# get TSS
tss.gr=ref.gr
# the end of the + strand genes must be set to the start position
end(tss.gr[strand(tss.gr)=="+",])  =start(tss.gr[strand(tss.gr)=="+",])
# the start of the - strand genes must be set to the end position
start(tss.gr[strand(tss.gr)=="-",])=end(tss.gr[strand(tss.gr)=="-",])
# remove duplicated TSSes, i.e. alternative transcripts
# this keeps the first instance and removes duplicates
tss.gr=tss.gr[!duplicated(tss.gr),]
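
As an alternative sketch of the same idea, resize() from GenomicRanges can collapse each interval to a single position. With `fix="start"` it is strand-aware, so it anchors the 5' end of each interval. The `genes` object below is made up for illustration.

```r
library(GenomicRanges)
genes <- GRanges(seqnames=c("chr1","chr2"),
                 ranges=IRanges(start=c(50,150),end=c(100,200)),
                 strand=c("+","-"))
# fix="start" keeps the start of + strand intervals and the end of - strand ones
tss <- resize(genes, width=1, fix="start")
tss
start(tss)
```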

Reading the genomic features as text files and converting them to GRanges is not the only way to create a GRanges object. With the help of the rtracklayer package, we can import BED files directly.

In [None]:
library(rtracklayer)
import.bed("data/refseq.hg19.chr21.bed")

Now we will show how to use other packages to automatically obtain the data in GRanges format. You will not be able to use these methods for every data set, so it is good to know how to read data from flat files as well. First, we will use the rtracklayer package to download data from the UCSC browser. We will download CpG islands as GRanges objects.

In [None]:
library(rtracklayer)
session <- browserSession()
genome(session) <- "mm9"
## choose CpG island track on chr12
query <- ucscTableQuery(session, track="CpG Islands",table="cpgIslandExt",
        range=GRangesForUCSCGenome("mm9", "chr12"))
## get the GRanges object for the track
track(query)

Next, we will show how to use the GenomicFeatures package to import genomic features.

## Finding regions that (do/do not) overlap with another set of regions

This is one of the most common tasks in genomics. Usually, you have a set of regions that you are interested in and you want to see whether they overlap with another set of regions, or how many of them do. A good example is transcription factor binding sites determined by ChIP-seq experiments. In these experiments and the subsequent analysis, one usually ends up with genomic regions that are enriched in bound transcription factors (sites of transcription factor binding).

The relevant operations are:

* subsetByOverlaps
* countOverlaps
* findOverlaps and its uses
* finding the nearest regions
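
The overlap operations named above can be sketched on two small, made-up sets of intervals. Here `peaks` stands in for binding sites and `cpgi` for CpG islands; neither comes from the workshop data.

```r
library(GenomicRanges)
# made-up intervals: "peaks" stand in for binding sites, "cpgi" for CpG islands
peaks <- GRanges(seqnames=c("chr1","chr1","chr2"),
                 ranges=IRanges(start=c(100,500,100),end=c(200,600,150)))
cpgi <- GRanges(seqnames=c("chr1","chr1","chr2"),
                ranges=IRanges(start=c(150,1000,400),end=c(300,1100,500)))
# peaks that overlap at least one CpG island
subsetByOverlaps(peaks, cpgi)
n.ov <- countOverlaps(peaks, cpgi)   # number of CpG islands hit by each peak
n.ov
peaks[n.ov == 0]                     # peaks that overlap no CpG island
hits <- findOverlaps(peaks, cpgi)    # all overlapping (peak, island) pairs
queryHits(hits)                      # indices into peaks
subjectHits(hits)                    # indices into cpgi
near <- nearest(peaks, cpgi)         # index of the closest island for each peak
near
```

findOverlaps() is the most general of these: the pairs of indices it returns can be used to look up metadata on either side, whereas subsetByOverlaps() and countOverlaps() only summarize the result.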

## Finding coverage of intervals over the genome
* find canonical binding sites
* find coverage of binding sites on promoters
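
As a rough sketch of what computing coverage looks like, coverage() counts how many intervals cover each base, and reduce() merges overlapping intervals. The `sites` object is made up, and this is only one possible way to approach the tasks listed in this section.

```r
library(GenomicRanges)
sites <- GRanges(seqnames=c("chr1","chr1","chr2"),
                 ranges=IRanges(start=c(1,5,3),end=c(10,15,6)))
cov <- coverage(sites)            # one run-length encoded vector per chromosome
cov
depth <- as.integer(cov[["chr1"]])  # per-base depth on chr1, positions 1 to 15
depth
merged <- reduce(sites)           # merge overlapping intervals into one
merged
```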


Tutorial based on input from:

https://al2na.github.io/compgenr/