Kasper D. Hansen
This document has the following dependencies:
Use the following commands to install these packages in R.
source("http://www.bioconductor.org/biocLite.R") biocLite(c("BSgenome", "BSgenome.Scerevisiae.UCSC.sacCer2"))
r Biocpkg("BSgenome") package contains infrastructure for representing genome sequences in Bioconductor.
r Biocpkg("BSgenome") package provides support for genomes. In Bioconductor, we have special classes for genomes, because the chromosomes can get really big. For example, the human genome takes up several GB of memory.
available.genomes() function lists which genomes are currently available from from Bioconductor (it is possible to make your own genome package). Note that there are several so-called "masked" genomes, where some parts of the genome are masked. We will avoid this subject for now. We can
grep() for known organisms.
allgenomes <- available.genomes() grep("Hsapiens", allgenomes, value = TRUE) grep("Scerevisiae", allgenomes, value = TRUE)
Let us load the latest yeast genome
r Biocpkg("BSgenome") package contains a single object which is the second component of the name. At first, nothing is loaded into memory, which makes it very fast. You can get the length and names of the chromosomes without actually loading them.
We load a chromosome by using the
We can now do things like compute the GC content of the first chromosome
letterFrequency(Scerevisiae$chrI, "CG", as.prob = TRUE)
To iterate over chromosomes seems straightforward with
lapply. However, this function may end up using a lot of memory because the entire genome is loaded. Instead there is the
bsapply function which handles loading and unloading of different chromosomes. The interface to
bsapply is weird at first; you set up a
BSparams object which contains which function you are using and which genome you are using it on (and a bit more information). This paradigm is being used in other packages these days, for example
r Biocpkg("BiocParallel"). An example will make this clear:
param <- new("BSParams", X = Scerevisiae, FUN = letterFrequency) head(bsapply(param, letters = "GC"))
note how the additional argument
letters to the
letterFrequency function is given as an argument to
bsapply, not to the
BSParams object. This gives us a list; you can simplify the output (like the difference between
param <- new("BSParams", X = Scerevisiae, FUN = letterFrequency, simplify = TRUE) bsapply(param, letters = "GC")
Note how the mitochondria chromosome is very different. To conclude, the GC percentage of the genome is
sum(bsapply(param, letters = "GC")) / sum(seqlengths(Scerevisiae))