# Annotation Resources

***Bioconductor annotation resources*** have traditionally been used near the end of an analysis. After the bulk of the data analysis, annotations would be used interpretatively to learn about the most significant results. 

But increasingly, they are also used as a starting point or even as an intermediate step to help guide a study that is still in progress. What counts as an annotation is also less clear than it once was. Annotations used to be only those things that had been established after multiple independent studies (such as the primary role of a gene product).

But today many large data sets are treated by communities in much the same way that classic annotations once were: as a reference for additional comparisons.

Another change underway in Bioconductor is the way annotations are obtained. In the past, annotations existed almost exclusively as separate annotation packages[2,3,4]. Today packages are still an enormous source of annotations: the current release repository contains over eight hundred annotation packages. This table summarizes some of the more important classes of annotation objects that are often accessed using packages:

  Object Type  | Example Package Name | Contents 
  ------------- | ------------- | -------------
  TxDb  | TxDb.Hsapiens.UCSC.hg19.knownGene |  Transcriptome ranges for the known gene track of Homo sapiens, e.g., introns, exons, UTR regions.
  OrgDb | org.Hs.eg.db | Gene-based information for Homo sapiens; useful for mapping between gene IDs, Names, Symbols, GO and KEGG identifiers, etc.
  BSgenome | BSgenome.Hsapiens.UCSC.hg19 | Full genome sequence for Homo sapiens.
  OrganismDb | Homo.sapiens | Collection of multiple annotations for a common organism and genome build. 
  AnnotationHub | AnnotationHub | Provides a convenient interface to annotations from many different sources; objects are returned as fully parsed Bioconductor data objects or as the name of a file on disk.
  
 
But in spite of the popularity of annotation packages, annotations are increasingly also being pulled down from web services through packages like ***biomaRt***[5,6,7] or from the ***AnnotationHub***[8]. Both of these represent enormous resources for annotation data.

## Set Up

In [16]:
## try http:// if https:// URLs are not supported
source("http://bioconductor.org/biocLite.R")
## annotation packages used in this workflow
anno_lib <- c("AnnotationHub", "Homo.sapiens", "TxDb.Hsapiens.UCSC.hg19.knownGene",
              "BSgenome.Hsapiens.UCSC.hg19", "biomaRt", "TxDb.Athaliana.BioMart.plantsmart22")

## install them (this produced the output shown below)
biocLite(anno_lib)

Bioconductor version 3.1 (BiocInstaller 1.18.5), ?biocLite for help
A newer version of Bioconductor is available for this version of R,
  ?BiocUpgrade for help
BioC_mirror: http://bioconductor.org
Using Bioconductor version 3.1 (BiocInstaller 1.18.5), R version 3.2.2.
Installing package(s) ‘AnnotationHub’, ‘Homo.sapiens’,
  ‘TxDb.Hsapiens.UCSC.hg19.knownGene’, ‘BSgenome.Hsapiens.UCSC.hg19’,
  ‘biomaRt’, ‘TxDb.Athaliana.BioMart.plantsmart22’



The downloaded source packages are in
	‘/tmp/RtmpcUOOTq/downloaded_packages’


Old packages: 'BH', 'boot', 'car', 'caret', 'digest', 'evaluate', 'formatR',
  'ggplot2', 'glmnet', 'gtable', 'htmltools', 'htmlwidgets', 'jsonlite',
  'knitr', 'lme4', 'maps', 'Matrix', 'mgcv', 'munsell', 'nlme', 'nnet',
  'quantreg', 'R6', 'rbokeh', 'Rcpp', 'RcppEigen', 'rmarkdown', 'scales',
  'shiny', 'tidyr', 'TTR', 'xtable'
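
The session above ran on Bioconductor 3.1, where `biocLite()` was the standard installer. From Bioconductor 3.8 onward, `biocLite()` was replaced by the `BiocManager` package. The following is a minimal sketch of the same setup under that assumption; it is not part of the original session, and the availability of each package (in particular `TxDb.Athaliana.BioMart.plantsmart22`) may vary between releases.

```r
# Assumes R >= 3.5 with Bioconductor >= 3.8, where BiocManager replaced biocLite().
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}

# Same package list as in the cell above.
anno_lib <- c("AnnotationHub", "Homo.sapiens",
              "TxDb.Hsapiens.UCSC.hg19.knownGene",
              "BSgenome.Hsapiens.UCSC.hg19",
              "biomaRt", "TxDb.Athaliana.BioMart.plantsmart22")

# Work out which packages are not yet installed, then install only those.
installed <- vapply(anno_lib, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- anno_lib[!installed]
if (length(missing_pkgs) > 0) {
    BiocManager::install(missing_pkgs)
}
```

Checking with `requireNamespace()` first avoids re-downloading large packages such as the BSgenome sequence data.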


## Using AnnotationHub

The first place to look when learning about annotation resources is the relatively new ***AnnotationHub*** package[8]. 

The AnnotationHub was created to provide a convenient access point for end users to find a large range of different annotation objects for use with Bioconductor. Resources found in the AnnotationHub are easy to discover and are presented to the user as familiar Bioconductor data objects. Because it is a recent addition, the AnnotationHub allows access to a broad range of annotation-like objects, some of which may not have been considered annotations even a few years ago. To get started with the AnnotationHub, users only need to load the package and then create a local AnnotationHub object like this:

In [17]:
library("AnnotationHub") 
ah <- AnnotationHub() 

: 'AnnotationHub' database may not be current
  database: ‘/home/madbunny/.AnnotationHub/annotationhub.sqlite3’
  reason: Problem with the SSL CA cert (path? access rights?)

In [18]:
ah

AnnotationHub with 35307 records
# snapshotDate(): 2016-03-07 
# $dataprovider: BroadInstitute, UCSC, Ensembl, NCBI, Haemcode, Inparanoid8,...
# $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Danio r...
# $rdataclass: GRanges, FaFile, BigWigFile, OrgDb, ChainFile, Inparanoid8Db,...
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH2"]]' 

            title                                               
  AH2     | Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa   
  AH3     | Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa
  AH4     | Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa
  AH5     | Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa          
  AH6     | Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa        
  ...       ...                                                 
  AH49436 | Xiphophorus_maculatus.Xipmac4.4.2.dna_rm.toplevel.fa
  AH49437 | Xiphophorus_maculatus.Xipm

In [19]:
unique(ah$dataprovider) 
unique(ah$rdataclass)

Once you have identified which sorts of metadata you would like to use to find your data of interest, you can then use the subset or query methods to reduce the size of the hub object to something more manageable. For example, you could select only those records where the string ‘GRanges’ appears in the metadata. As you can see, GRanges is one of the more popular formats for data that comes from the AnnotationHub.

In [20]:
grs <- query(ah, "GRanges") 
grs 
grs <- ah[ah$rdataclass == "GRanges",] 
grs

AnnotationHub with 17365 records
# snapshotDate(): 2016-03-07 
# $dataprovider: BroadInstitute, UCSC, Ensembl, Haemcode, Pazar, EncodeDCC
# $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Canis f...
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH3166"]]' 

            title                                   
  AH3166  | wgEncodeRikenCageSknshraCellPapTssHmm   
  AH3912  | wgEncodeUwDgfTregwb78495824Hotspots     
  AH3913  | wgEncodeUwDgfTregwb78495824Pk           
  AH4368  | wgEncodeUwDnaseWi38PkRep1               
  AH4369  | wgEncodeUwDnaseWi38PkRep2               
  ...       ...                                     
  AH48001 | Tupaia_belangeri.TREESHREW.81.gtf       
  AH48002 | Tursiops_truncatus.turTru1.81.gtf       
  AH48003 | Vicugna_pacos.vicPac1.81.gtf            
  AH48004 | Xenopus_tropicalis.JGI_4.2.81.gtf       
  AH48005 | Xiphophorus_maculatus.

AnnotationHub with 17365 records
# snapshotDate(): 2016-03-07 
# $dataprovider: BroadInstitute, UCSC, Ensembl, Haemcode, Pazar, EncodeDCC
# $species: Homo sapiens, Mus musculus, Bos taurus, Pan troglodytes, Canis f...
# $rdataclass: GRanges
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH3166"]]' 

            title                                   
  AH3166  | wgEncodeRikenCageSknshraCellPapTssHmm   
  AH3912  | wgEncodeUwDgfTregwb78495824Hotspots     
  AH3913  | wgEncodeUwDgfTregwb78495824Pk           
  AH4368  | wgEncodeUwDnaseWi38PkRep1               
  AH4369  | wgEncodeUwDnaseWi38PkRep2               
  ...       ...                                     
  AH48001 | Tupaia_belangeri.TREESHREW.81.gtf       
  AH48002 | Tursiops_truncatus.turTru1.81.gtf       
  AH48003 | Vicugna_pacos.vicPac1.81.gtf            
  AH48004 | Xenopus_tropicalis.JGI_4.2.81.gtf       
  AH48005 | Xiphophorus_maculatus.
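
The `query` method also accepts several search terms at once, in which case a record must match all of them. The sketch below narrows the hub to human GRanges records in two ways. It assumes the hub can be reached over the network, and the variable names `human_grs` and `human_grs2` are only illustrative. The two results need not be identical, because `query` searches every metadata column while the second form tests two specific columns.

```r
library(AnnotationHub)

ah <- AnnotationHub()   # needs network access (or an existing local cache)

# Free-text search: every pattern must match somewhere in a record's metadata.
human_grs <- query(ah, c("GRanges", "Homo sapiens"))

# Explicit test on two metadata columns; which() drops records with NA species.
keep <- which(ah$rdataclass == "GRanges" & ah$species == "Homo sapiens")
human_grs2 <- ah[keep]

length(human_grs)
length(human_grs2)
```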

The subset function is also provided.

In [21]:
orgs <- subset(ah, ah$rdataclass == "OrgDb") 
orgs

AnnotationHub with 1145 records
# snapshotDate(): 2016-03-07 
# $dataprovider: NCBI
# $species: 'Nostoc azollae'_0708, Acaryochloris marina_MBIC11017, Acetobact...
# $rdataclass: OrgDb
# additional mcols(): taxonomyid, genome, description, tags, sourceurl,
#   sourcetype 
# retrieve records with, e.g., 'object[["AH12818"]]' 

            title                                                    
  AH12818 | org.Pseudomonas_mendocina_NK-01.eg.sqlite                
  AH12819 | org.Streptomyces_coelicolor_A3(2).eg.sqlite              
  AH12820 | org.Cricetulus_griseus.eg.sqlite                         
  AH12821 | org.Streptomyces_cattleya_NRRL_8057_=_DSM_46488.eg.sqlite
  AH12822 | org.Cavia_porcellus.eg.sqlite                            
  ...       ...                                                      
  AH13958 | org.Ochotona_princeps.eg.sqlite                          
  AH13959 | org.Aeromonas_veronii_B565.eg.sqlite                     
  AH13960 | org.Oryctolagus_cuniculus.eg.s

In [22]:
meta <- mcols(ah) 
meta

DataFrame with 35307 rows and 10 columns
                                                       title dataprovider
                                                 <character>  <character>
AH2        Ailuropoda_melanoleuca.ailMel1.69.dna.toplevel.fa      Ensembl
AH3     Ailuropoda_melanoleuca.ailMel1.69.dna_rm.toplevel.fa      Ensembl
AH4     Ailuropoda_melanoleuca.ailMel1.69.dna_sm.toplevel.fa      Ensembl
AH5               Ailuropoda_melanoleuca.ailMel1.69.ncrna.fa      Ensembl
AH6             Ailuropoda_melanoleuca.ailMel1.69.pep.all.fa      Ensembl
...                                                      ...          ...
AH49436 Xiphophorus_maculatus.Xipmac4.4.2.dna_rm.toplevel.fa      Ensembl
AH49437 Xiphophorus_maculatus.Xipmac4.4.2.dna_sm.toplevel.fa      Ensembl
AH49438    Xiphophorus_maculatus.Xipmac4.4.2.dna.toplevel.fa      Ensembl
AH49439           Xiphophorus_maculatus.Xipmac4.4.2.ncrna.fa      Ensembl
AH49440         Xiphophorus_maculatus.Xipmac4.4.2.pep.all.fa      Ensem
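
Because `mcols` returns an ordinary DataFrame, the usual tabulation tools can be used to get an overview of the hub before subsetting it. The following is a small sketch under the assumption that the hub is reachable; the counts will differ between snapshots.

```r
library(AnnotationHub)

ah <- AnnotationHub()
meta <- mcols(ah)

# Number of records per data class, most common first.
class_counts <- sort(table(meta$rdataclass), decreasing = TRUE)
head(class_counts)

# Which providers contribute the OrgDb records?
table(meta$dataprovider[meta$rdataclass == "OrgDb"])
```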

Also, if you are a fan of GUIs, ***you can use the display method to look at your data in a browser*** and return selected rows back as a smaller AnnotationHub object like this:

In [9]:
# sah <- display(ah) 

![result](http://bioconductor.org/help/workflows/annotation/Annotation_Resources/display.png)

Once you have the AnnotationHub object pared down to a reasonable size, and are sure about which records you want to retrieve, then you only need to use the ‘[[’ operator to extract them. Using the ‘[[’ operator, you can extract by numeric index (1,2,3) or by AnnotationHub ID. If you choose to use the former, you simply extract the element that you are interested in. So for the GRanges records selected above, you might just want the first one, like this:

In [23]:
library(GenomicRanges)  ## needed to work with the returned GRanges object

res <- grs[[1]] 
head(res, n=3) 

retrieving 1 resources
: download failed
  hub path: ‘https://annotationhub.bioconductor.org/fetch/3166’
  cache path: ‘/home/madbunny/.AnnotationHub/3166’
  reason: Problem with the SSL CA cert (path? access rights?)

ERROR: Error in value[[3L]](cond): failed to load hub resource ‘wgEncodeRikenCageSknshraCellPapTssHmm’ of
    class GRanges; reason: 1 resources failed to download


ERROR: Error in head(res, n = 3): error in evaluating the argument 'x' in selecting a method for function 'head': Error: object 'res' not found
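
The errors above come from the SSL certificate problem on this machine, not from the extraction syntax. The sketch below shows the distinction between ‘[’ and ‘[[’ when selecting by AnnotationHub ID. It assumes a working network connection, and it assumes that the ID ‘AH3166’ from the listing above is still present in the snapshot being used, which may not hold for later snapshots.

```r
library(AnnotationHub)

ah <- AnnotationHub()

# Single bracket: the result is still an AnnotationHub object; nothing is downloaded.
one_record <- ah["AH3166"]
one_record$title
one_record$rdataclass

# Double bracket: downloads the resource (or reads it from the local cache)
# and returns the data object itself.
res_by_id <- ah[["AH3166"]]
class(res_by_id)
```

Inspecting the one-record hub first is a cheap way to confirm the title and class before committing to a download.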



Or you might have decided that you want to see the data for the European rabbit that you spotted in the orgs subset under the ID ‘AH13960’. That data could also be extracted like this:

In [29]:
httr::handle_reset(paste0(hubUrl(), "/")); file.remove(dbfile(orgs))
rabbit <- query(orgs, "Oryctolagus")[[1]] 

retrieving 1 resources
: download failed
  hub path: ‘https://annotationhub.bioconductor.org/fetch/13960’
  cache path: ‘/home/madbunny/.AnnotationHub/13960’
  reason: Problem with the SSL CA cert (path? access rights?)

ERROR: Error in value[[3L]](cond): failed to load hub resource ‘org.Oryctolagus_cuniculus.eg.sqlite’ of
    class SQLiteFile; reason: 1 resources failed to download


## OrgDb objects

At this point you might be wondering: What is this OrgDb object about? ***OrgDb objects are one member of a family of annotation objects that all represent hidden data through a shared set of methods.*** So if you look closely at the rabbit object in the example above, you can see that it contains data for the European rabbit Oryctolagus cuniculus (taxonomy ID = 9986). You can learn a little more about it by calling the columns method.

In [25]:
columns(rabbit) 

ERROR: Error in eval(expr, envir, enclos): could not find function "columns"


The ***columns method gives you a vector of data types that can be retrieved from the object that you call it on.*** So the above call indicates which data types can be retrieved from the rabbit object.


A very similar method is the ***keytypes method, which will list all the data types that can also be used as keys.***

In [None]:
keytypes(rabbit) 

In many cases most of the things that are listed as columns will also come back from a ***keytypes call***, but since these two things are not guaranteed to be identical, we maintain two separate methods.
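
To see how the two lists differ for a given object, they can be compared directly. The sketch below is an illustration under several assumptions: the hub is reachable, a rabbit OrgDb record is found by the query, and the `columns` and `keytypes` generics are made available by loading `AnnotationDbi` (the "could not find function" error above suggests that package was not attached in the original session).

```r
library(AnnotationHub)
library(AnnotationDbi)   # provides columns(), keytypes(), keys(), select()

ah <- AnnotationHub()
rabbit <- query(ah, c("OrgDb", "Oryctolagus cuniculus"))[[1]]

cols <- columns(rabbit)
kts  <- keytypes(rabbit)

# Data types that can be retrieved but not used as keys, and the reverse.
setdiff(cols, kts)
setdiff(kts, cols)
```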


Now that you can see what kinds of things can be used as ***keys, you can call the keys method to extract out all the keys of a given key type.***

In [30]:
head(keys(rabbit, keytype="ENTREZID")) 

ERROR: Error in head(keys(rabbit, keytype = "ENTREZID")): error in evaluating the argument 'x' in selecting a method for function 'head': Error: could not find function "keys"



This is useful if you need to get all the IDs of a particular kind, but the keys method has a few extra arguments that can make it even more flexible. 

For example, using the ***keys method*** you could also extract the gene SYMBOLs that contain “COX” like this:

Or if you really needed another keytype, you can use the column argument to extract the Entrez Gene IDs for those gene SYMBOLs that contain the string “COX”:

In [None]:
keys(rabbit, keytype="SYMBOL", pattern="COX")
keys(rabbit, keytype="ENTREZID", pattern="COX", column="SYMBOL")

But often, you will really want to extract other data that matches a particular key or set of keys. 

For that there are two methods which you can use. ***The more powerful of these is probably select.***
Here is how you would look up the gene SYMBOL and REFSEQ ID for the Entrez Gene IDs “808231” and “808233”.

In [None]:
select(rabbit, keys=c("808231","808233"), columns=c("SYMBOL","REFSEQ"), keytype="ENTREZID")
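
The second method is not named in this section. In `AnnotationDbi`, the simpler companion to `select` is `mapIds`, and the sketch below assumes that is the one meant. `mapIds` retrieves a single column and returns one value per key. The keys are taken from the object itself instead of being hard-coded, and the example assumes the hub is reachable and that the rabbit OrgDb offers a SYMBOL column.

```r
library(AnnotationHub)
library(AnnotationDbi)

ah <- AnnotationHub()
rabbit <- query(ah, c("OrgDb", "Oryctolagus cuniculus"))[[1]]

# Take a few Entrez Gene IDs that are known to exist in this object.
ids <- head(keys(rabbit, keytype = "ENTREZID"), 3)

# One SYMBOL per key; multiVals = "first" keeps the first match if there are several.
sym <- mapIds(rabbit, keys = ids, column = "SYMBOL",
              keytype = "ENTREZID", multiVals = "first")
sym
```

Where `select` returns a data.frame that can contain several rows per key, `mapIds` returns a vector named by the input keys, which is convenient for adding a single annotation column to an existing table.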