Ruby UCSC API: An API for the UCSC Genome Database

bio-ucsc-api version 0.3.2

The Ruby UCSC API: accessing the UCSC Genome Database using Ruby.

Your comments, suggestions and requests are welcome. Documentation and feedback are available at the UserEcho site at rubyucscapi.userecho.com/.

Install

$ gem install bio-ucsc-api --no-ri --no-rdoc

You may need to be root or use “sudo”. The “--no-ri” and “--no-rdoc” options are recommended because generating the ri/rdoc files takes considerable time.

Features

  • Supporting all organisms in the UCSC genome database.

  • Using ActiveRecord as an O/R mapping framework. Basically, each table can be accessed using ActiveRecord method conventions.

  • Using the Bin index system to improve query performance. This is one of the reasons to use the Ruby UCSC API instead of submitting SQL queries directly.

  • Supporting genomic sequence query using locally downloaded “2bit” files. Genomic sequences are not stored in UCSC's official MySQL database.

  • Automatic conversion of “1-based full-closed intervals” to internal “0-based left-closed right-open intervals” (see also bioruby-genomic-interval)

  • Supporting non-official full/partial mirror MySQL hosts (e.g. local servers)

  • Using RSpec for the testing framework

  • Written in pure Ruby and supporting multiple Ruby interpreter implementations including Ruby 1.8, Ruby 1.9, and JRuby 1.6

  • Designed as a BioRuby plugin

  • Current version does not support table-linked bigWIG/bigBED/BAM files.
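The interval conversion listed above can be illustrated with a minimal sketch. The helper names below are hypothetical and are not part of the bio-ucsc-api gem; they only show the arithmetic between the “1-based full-closed” form that users write and the “0-based left-closed right-open” form stored in UCSC tables.

```ruby
# Hypothetical helpers (not the gem's API) illustrating the two conventions.

# "chr17:7,579,614-7,579,700" (1-based, fully closed)
#   -> [chrom, chromStart, chromEnd] in 0-based, half-open form.
def one_based_to_zero_based(region)
  chrom, range = region.split(":")
  first, last = range.delete(",").split("-").map { |s| s.to_i }
  [chrom, first - 1, last] # only the start shifts; the end stays the same
end

# The reverse conversion, back to a 1-based fully closed string.
def zero_based_to_one_based(chrom, chrom_start, chrom_end)
  "#{chrom}:#{chrom_start + 1}-#{chrom_end}"
end

p one_based_to_zero_based("chr17:7,579,614-7,579,700")
# => ["chr17", 7579613, 7579700]
```

The length of an interval is last - first + 1 in the 1-based form and chromEnd - chromStart in the 0-based form, which is why only the start coordinate changes.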

Supported databases (genome assemblies)

human

Hg19, Hg18

mammals

chimp (PanTro3), orangutan (PonAbe2), rhesus (RheMac2), marmoset (CalJac3), mouse (Mm9), rat (Rn4), guinea pig (CavPor3), rabbit (OryCun2), cat (FelCat4), panda (AilMel1), dog (CanFam2), horse (EquCab2), pig (SusScr2), sheep (OviAri1), cow (BosTau4), elephant (LoxAfr3), opossum (MonDom5), platypus (OrnAna1)

vertebrates

chicken (GalGal3), zebra finch (TaeGut1), lizard (AnoCar2), X. tropicalis (XenTro2), zebrafish (DanRer7), tetraodon (TetNig2), fugu (Fr2), stickleback (GasAcu1), medaka (OryLat2), lamprey (PetMar1)

deuterostomes

lancelet (BraFlo1), sea squirt (Ci2), sea urchin (StrPur2)

insects

D.melanogaster (Dm3), D.simulans (DroSim1), D.sechellia (DroSec1), D.yakuba (DroYak2), D.erecta (DroEre1), D.ananassae (DroAna2), D.pseudoobscura (Dp3), D.persimilis (DroPer1), D.virilis (DroVir2), D.mojavensis (DroMoj2), D.grimshawi (DroGri1), Anopheles mosquito (AnoGam1), honey bee (ApiMel2)

nematodes

C.elegans (Ce6), C.brenneri (CaePb3), C.briggsae (Cb3), C.remanei (CaeRem3), C.japonica (CaeJap1), P.pacificus (PriPac1)

others

sea hare (AplCal1), yeast (SacCer2)

genome assembly independent

Go, HgFixed, Proteome, UniProt, VisiGene

Implementation

This package is based on the following:

Supported Ruby interpreter implementations:

  • Ruby version 1.9.2 or later

  • Ruby version 1.8.7 or later

  • JRuby version 1.6.3 or later - An appropriate Java heap size may have to be specified when invoking JRuby, especially when you use Bio::Ucsc::Reference. Try “jruby -J-Xmx3g your_script.rb” to allow a heap of up to 3 GB.

Major dependent gems:

See also:

  • Strozzi F, Aerts J: A Ruby API to query the Ensembl database for genomic features. Bioinformatics 2011, 27:1013-1014.

  • UCSCBin library - github.com/misshie/UCSCBin

Change Log

  • UPDATE (v.0.3.2): Genomic interval queries are implemented using ARel's relation objects instead of (named) scopes. Usage of the API is not changed.

  • BUG (v.0.3.1): Does not work with ActiveRecord version 3.1.0. Data retrieval methods raise the error “(Object doesn't support #inspect)”. The author is working on this bug. For now, please use the ActiveRecord 3.0 series. The Gemfile for gem dependencies is updated. Thanks to Diego F. Pereira for the bug reports.

  • BUG-FIX (v.0.3.1): “func” fields in tables did not work. The bug was fixed.

  • BUG-FIX (v.0.3.1): PredGene-type tables without the bin index did not work. The bug was fixed.

  • NEW (v.0.3.0): Genomic interval queries are now expressed using the named scope “with_interval”. Table#find_(all_)by_interval is now deprecated. Sorry for the inconsistent API. However, this change enables queries that combine genomic intervals with any other fields.

  • NEW (v.0.3.0): Bio::GenomicInterval#bin_all and Bio::GenomicInterval#bin return the bin index for the given interval.

  • NEW (v.0.3.0): Supporting JRuby 1.6.3 or later. An appropriate Java heap size may have to be specified when invoking JRuby, especially when you use Bio::Ucsc::Reference. Try “jruby -J-Xmx3g your_script.rb” to allow a heap of up to 3 GB.

  • NEW (v.0.2.1): New genome assemblies are supported: [chimp] PanTro3, [orangutan] PonAbe2, [rhesus] RheMac2, [marmoset] CalJac3, [rat] Rn4, [guinea pig] CavPor3, [rabbit] OryCun2, [cat] FelCat4, [panda] AilMel1, [dog] CanFam2, [horse] EquCab2, [pig] SusScr2, [sheep] OviAri1, [cow] BosTau4, [elephant] LoxAfr3, [opossum] MonDom5, [platypus] OrnAna1, [chicken] GalGal3, [zebra finch] TaeGut1, [lizard] AnoCar2, [X. tropicalis] XenTro2, [zebrafish] DanRer7, [tetraodon] TetNig2, [fugu] Fr2, [stickleback] GasAcu1, [medaka] OryLat2, [lamprey] PetMar1, [lancelet] BraFlo1, [sea squirt] Ci2, [sea urchin] StrPur2, [D.simulans] DroSim1, [D.sechellia] DroSec1, [D.yakuba] DroYak2, [D.erecta] DroEre1, [D.ananassae] DroAna2, [D.pseudoobscura] Dp3, [D.persimilis] DroPer1, [D.virilis] DroVir2, [D.mojavensis] DroMoj2, [D.grimshawi] DroGri1, [Anopheles mosquito] AnoGam1, [honey bee] ApiMel2, [C.brenneri] CaePb3, [C.briggsae] Cb3, [C.remanei] CaeRem3, [P.pacificus] PriPac1, [sea hare] AplCal1, [yeast] SacCer2

  • NEW (v.0.2.1): Supporting Ruby 1.8.7 or later

  • NEW: In addition to human Hg19 and Hg18, the following genome assemblies are supported: [mouse] Mm9, [fruitfly] Dm3, [C. elegans] Ce6, [genome assembly independent] Go, HgFixed, Proteome, UniProt, VisiGene

  • UPDATE (v0.2.0): The internal table class mapping algorithm is changed. Table types are now automatically detected and dynamically defined as classes. Previous versions used static class definitions for all tables.

  • MODIFIED (v0.2.0): Bio::Ucsc::::ReferenceSequence is removed. Use Bio::Ucsc::Reference instead. This class is more object-oriented.

  • MODIFIED (v0.1.0): The name of this library is now “Ruby UCSC API”. The RubyGem name and the GitHub account and the library name are not changed.

  • MODIFIED (v0.1.0): Bio::Ucsc::::Reference is replaced by Bio::Ucsc::::ReferenceSequence.

  • UPDATE (v0.0.5): Almost all hg18 tables are supported.

  • UPDATE (v0.0.5): The find_by_interval and find_all_by_interval class methods accept the “partial” option. The default is true. When “partial: false” is specified, only fully included (not partially included) records are returned.

  • UPDATE (v0.0.4): Almost all hg19 tables are supported. “filename” tables in ENCODE dataset are omitted. Each of them contains only single record of a path to the raw data file. Definitions of table relations are incomplete.

  • NEW (v0.0.3): Supporting locally stored '2bit' files, which can be downloaded from the UCSC site, to retrieve reference sequences. Unknown “N” nucleotide blocks are now supported. However, “mask-blocks”, which are shown in lower case in UCSC's DNA function, are not supported yet.

  • MODIFIED (v0.0.3): For the “TABLE” class and the “column” column, TABLE.find_by_column retrieves the first record, and TABLE.find_all_by_column retrieves all the records as an Array.

  • NEW (v0.0.3-0.0.4): Supporting tables that are split by chromosome, such as “*_RmsK” and “*_gold”. Their actual names are like “chr1_Rmsk”, “chr2_Rmsk”… They can be accessed without chromosome names, simply as “Rmsk” and “Gold”.

How to Use

Basics

  • A database of a genome assembly is represented as a module in the Bio::Ucsc module. For example, the human hg19 database is referred to as “Bio::Ucsc::Hg19”.

  • Before using a database, establish a connection to the database. For example, “Bio::Ucsc::Hg19::DBConnection.connect”.

  • A table in a database is represented as a class in the database module. For example, the snp132 table in the hg19 database is referred to as “Bio::Ucsc::Hg19::Snp132”.

  • Queries to a field (column) in a table are represented by class methods of the table class. For example, finding the first record (row) of the snp132 table in the hg19 database is “Bio::Ucsc::Hg19::Snp132.first”.

  • Queries using genomic intervals are supported by the named scopes “.with_interval” and “.with_interval_excl” (the latter omits partially included annotations) of the table class. Each accepts a Bio::GenomicInterval object containing a genomic interval such as “chr1:1233-5678”. If the table being queried has the “bin” column, the bin index system is automatically used to speed up the query.

  • Fields in a retrieved record can be accessed by using instance methods of a record object. For example, the name field of a table record stored in the “result” variable is “result.name”.
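As background for the bin index mentioned above, the following is a minimal sketch of the standard UCSC binning scheme (five levels, finest bins of 128 kb, each coarser level eight times wider, covering coordinates up to 512 Mb). The method name is hypothetical and this is not the gem's own implementation; it only shows how the list of candidate bins for an interval can be derived.

```ruby
# Hypothetical helper, not the gem's code: list every bin that can hold a
# feature overlapping the 0-based half-open interval [chrom_start, chrom_end).
def bins_overlapping(chrom_start, chrom_end)
  offsets     = [512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0] # finest level first
  first_shift = 17 # 2**17 = 128 kb, the width of the finest bins
  next_shift  = 3  # each coarser level is 2**3 = 8 times wider

  start_bin = chrom_start >> first_shift
  end_bin   = (chrom_end - 1) >> first_shift
  offsets.map do |offset|
    bins = (start_bin..end_bin).map { |b| offset + b }
    start_bin >>= next_shift # move up to the next, coarser level
    end_bin   >>= next_shift
    bins
  end.flatten
end

p bins_overlapping(7_579_613, 7_579_700) # => [642, 80, 9, 1, 0]
```

A query can then restrict the “bin” column to this short list before comparing start and end coordinates, so the database inspects only a small fraction of the rows.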

Sample Code

First, require the API and establish a connection to a database.

require 'bio-ucsc'

include Bio # To short-cut the class path
Ucsc::Hg19::DBConnection.connect

On the first reference to a table class, the following does not work:

include Bio::Ucsc::Hg19
Snp131.first # The Ruby interpreter searches for Snp131 at the top level

But the following line works because the reference goes through the database module, so the API can prefetch the table and define the appropriate class dynamically. “include Bio” or “include Bio::Ucsc” will work.

Ucsc::Hg19::Snp131 # This line works
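The difference can be reproduced with a stand-alone sketch. It assumes that classes are defined lazily through Ruby's Module#const_missing hook; whether the API uses exactly this hook is an assumption, and DemoDb with its table list is made up for illustration.

```ruby
# Stand-alone illustration; DemoDb and its table names are hypothetical.
module DemoDb
  # Called by Ruby only when DemoDb::Something is not defined yet.
  def self.const_missing(name)
    known_tables = ["snp131", "knownGene"] # pretend the database has these
    table = known_tables.find { |t| t[0, 1].upcase + t[1..-1] == name.to_s }
    return super unless table # unknown table: the usual NameError

    klass = Class.new do
      define_singleton_method(:table_name) { table }
    end
    const_set(name, klass) # later references find the constant directly
  end
end

include DemoDb
first_try = begin
  KnownGene.table_name
rescue NameError
  "NameError"
end
puts first_try                    # NameError
puts DemoDb::KnownGene.table_name # knownGene
puts KnownGene.table_name         # knownGene
```

The bare KnownGene fails at first because a top-level constant lookup never reaches DemoDb.const_missing. Once the qualified reference has defined the class, the included module makes the short name resolvable as well.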

Table search using genomic intervals:

gi =  GenomicInterval.parse("chr1:1-11,000")
Ucsc::Hg19::Snp131.with_interval(gi).find(:all).each do |e|
  i = GenomicInterval.zero_based(e.chrom, e.chromStart, e.chromEnd)
  puts "#{i.chrom}\t#{i.chr_start}\t#{e.name}\t#{e[:class]}" # "e.class" does not work
end

gi = GenomicInterval.parse("chr17:7,579,614-7,579,700")
Ucsc::Hg19::Snp131.with_interval(gi).find(:all)

Ucsc::Hg19::Snp131.with_interval_excl(gi).find(:all)

relation = Ucsc::Hg19::Snp131.with_interval(gi).select(:name)
relation.to_sql
 # => SELECT name FROM `snp131`
 #      WHERE (chrom = 'chr17' AND bin in (642,80,9,1,0)
 #      AND ((chromStart BETWEEN 7579613 AND 7579700) AND
 #           (chromEnd   BETWEEN 7579613 AND 7579700)))
relation.find_all_by_class_and_strand("in-del", "+").size # => 1

Ucsc::Hg19::Snp131.find_by_name("rs56289060")

Sometimes a raw SQL query provides a more elegant solution.

sql = <<'SQL'
SELECT name,chrom,chromStart,chromEnd,observed
FROM snp131 
WHERE name="rs56289060"
SQL
p Ucsc::Hg19::Snp131.find_by_sql(sql)

Retrieve a reference sequence from a locally stored 2bit file. The “hg19.2bit” file can be downloaded from hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.2bit

hg19ref = Ucsc::Reference.load("hg19.2bit")
gi = GenomicInterval.parse("chr1:9,500-10,999")
hg19ref.find_by_interval(gi)
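For background, a 2bit file packs four bases into each byte (T = 00, C = 01, A = 10, G = 11, with the first base in the most significant bits). The sketch below decodes such packed bytes. The method name is hypothetical, it is not how Bio::Ucsc::Reference is implemented, and it ignores the file header as well as the “N” and mask blocks.

```ruby
# Hypothetical helper: decode 2bit-packed bytes into a base string.
# bytes is an Array of Integers (0..255); n_bases is the number of bases.
def decode_2bit(bytes, n_bases)
  codes = "TCAG" # index 0 => T, 1 => C, 2 => A, 3 => G
  (0...n_bases).map do |i|
    byte  = bytes[i / 4]     # four bases per byte
    shift = 6 - 2 * (i % 4)  # first base sits in the two highest bits
    codes[(byte >> shift) & 3, 1]
  end.join
end

puts decode_2bit([0b00011011], 4) # TCAG
```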

Connecting to non-official or local full/partial mirror MySQL servers

Ucsc::Hg18::DBConnection.db_host = 'localhost'
Ucsc::Hg18::DBConnection.db_username = 'genome'
Ucsc::Hg18::DBConnection.db_password = ''
Ucsc::Hg18::DBConnection.connect

Ucsc::Hg18::DBConnection.default # reset to connect to UCSC's public MySQL server
Ucsc::Hg18::DBConnection.connect

See also the sample scripts in the samples directory.

  • num-gene-exon.rb - calculation of total number of genes and exons using genomic interval

  • symbol2summary.rb - getting summary descriptions using gene symbol

  • hg19-2bit-retrieve - outputting reference sequence in FASTA format

  • bed2refseq - getting unique gene symbols in the genomic intervals in a BED file.

  • snp2gene - sample for retrieving fields from associated tables

Notes on Exceptions in Table Support

  • Table names starting with a number: Because Ruby class names cannot start with a number, use the table class name starting with “T” (T for Table). Thus, the “2micron_est” table is supported by the “T2micron_est” class.

  • Table names starting with uppercase character: Classes for “HInv” and “NIAGene” tables are “HInv” and “NIAGene”, respectively

  • Accessing chromosome-specific tables: For example, the 'rmsk' table in hg18 is actually split into 'chr1_rmsk', 'chr2_rmsk'… There are two ways to access them. (1) Access the separate tables directly. They behave like any other regular table, but you have to manage each separate table yourself. (2) Use abstract table classes (e.g., 'Rmsk') and their class methods '.find_by_interval' or '.find_all_by_interval'. These methods look up the corresponding separate tables automatically. However, they cannot be combined with other 'find_by_' methods. Moreover, if you have to perform a single- or multi-chromosome search, you have to access the separate tables individually and integrate the results yourself. Fortunately, recent databases, including hg19, do not seem to use chromosome-specific tables.

  • For honey bee ApiMel2 database, Group*_chainDm2 and Group*_chainDm2Link tables are accessible using find(_all)_by_interval class methods of the ChainDm2 and ChainDm2Link classes.

  • Special field (column) names: Field names such as 'attribute', 'valid', 'validate', 'class', 'method', 'methods', and 'type' cannot be accessed using instance methods. This restriction exists because these names collide with method names that ActiveRecord uses internally. Instead, use hash-style access to the field, such as “result[:class]”.
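The naming rules in the notes above can be summarized in a small sketch. The helper is hypothetical and only mirrors the examples given in this README (first letter in upper case, a “T” prefix for names starting with a digit); it is not the API's actual mapping code.

```ruby
# Hypothetical helper: derive the table class name from a UCSC table name.
def table_class_name(table)
  name = table[0, 1].upcase + table[1..-1] # upper-case the first letter only
  name =~ /\A\d/ ? "T" + name : name       # class names cannot start with a digit
end

%w[snp132 knownToEnsembl 2micron_est HInv NIAGene].each do |table|
  puts "#{table} -> #{table_class_name(table)}"
end
# snp132 -> Snp132
# knownToEnsembl -> KnownToEnsembl
# 2micron_est -> T2micron_est
# HInv -> HInv
# NIAGene -> NIAGene
```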

Details in “with_interval”

  • When a table class is referenced for the first time, the API prefetches the table to get a list of fields and dynamically defines a class using the following algorithm.

  • If chrom/chromStart/chromEnd fields exist (BED table), the API uses them for interval queries.

  • When tName/tStart/tEnd fields exist (PSL table), the API uses them for interval queries.

  • When chrom/txStart/txEnd fields exist (genePred table), the API uses them for interval queries.

  • When genoName/genoStart/genoEnd fields exist (RMSK table), the API uses them for interval queries.

  • If the table has the “bin” column, the API calculates the bin index to build the query.

  • Otherwise, the API does not support interval queries and supports only ActiveRecord's standard methods such as “find_(all_)by_[field name]”.
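The detection steps above can be sketched as a plain function over a list of field names. The method name and the returned hash layout are hypothetical; the sketch only shows the order of the checks, not the API's real class-definition code.

```ruby
# Hypothetical helper: pick the interval columns for a table from its fields.
# Returns nil when the table does not support interval queries.
def interval_columns(fields)
  candidates = [
    [:bed,      %w[chrom chromStart chromEnd]],
    [:psl,      %w[tName tStart tEnd]],
    [:genepred, %w[chrom txStart txEnd]],
    [:rmsk,     %w[genoName genoStart genoEnd]]
  ]
  # The first candidate whose three columns are all present wins.
  type, columns = candidates.find { |_, cols| (cols - fields).empty? }
  return nil unless type

  { :type => type, :columns => columns, :use_bin => fields.include?("bin") }
end

p interval_columns(%w[bin chrom chromStart chromEnd name])
p interval_columns(%w[name value])
```

The first call reports a BED-type table that can use the bin index, and the second returns nil, which corresponds to tables that offer only the standard ActiveRecord finders.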

Table Associations

See samples/snp2gene.rb. Associations defined with the “has_one/has_many” methods are shown below. class_eval is used so that the definitions are added to the existing classes rather than replacing them.

Bio::Ucsc::Hg19::KnownGene.class_eval %!
  has_one :knownToEnsembl, {:primary_key => :name, :foreign_key => :name}
!
Bio::Ucsc::Hg19::KnownToEnsembl.class_eval %!
  has_one :ensGtp, {:primary_key => :value, :foreign_key => :transcript}
  has_one :kgXref, {:primary_key => :name, :foreign_key => :kgID}
!
Bio::Ucsc::Hg19::KgXref.class_eval %!
  has_one :refLink, {:primary_key => :mRNA, :foreign_key => :mrnaAcc}
!

Fields can then be referenced as follows:

kg.knownToEnsembl.ensGtp.gene
kg.knownToEnsembl.kgXref.geneSymbol
kg.knownToEnsembl.kgXref.refLink.mrnaAcc

ActiveRecord::Base#find can be used with the :include option to perform “eager fetching”, which reduces the number of SQL statements submitted.

# Both nested associations are grouped under one :knownToEnsembl key,
# because a repeated hash key would overwrite the earlier entry.
kg = Bio::Ucsc::Hg19::KnownGene.with_interval(gi).
       find(:first,
            :include => {:knownToEnsembl => [:ensGtp,
                                             {:kgXref => :refLink}]})

Copyright

Copyright © 2011 MISHIMA, Hiroyuki (missy at be.to / hmishima at nagasaki-u.ac.jp / @mishimahryk on Twitter)

Copyright © 2010 Jan Aerts

License

Ruby license (Ruby's / GPLv2 dual). See COPYING and COPYING.ja for further details.