Tutorial1
Under construction. Please come back later…
In this first tutorial, we will see how to parse a UniProt file. The full code can be found here: http://github.com/paradigmatic/SwissParser/blob/master/examples/tutorial_1.rb
Before starting, you can have a look at the result and try to guess how it works.
We want to parse a UniProt flat file and retrieve the following information for every entry:
- the UniProt ID
- the species
- the full taxonomy
- the protein sequence
- the protein sequence length
This information will be stored in instances of a Protein class, simply defined as:
class Protein
  attr_accessor :id, :size, :species, :taxonomy, :sequence
  def initialize
    @taxonomy = []
    @sequence = ""
  end
end
We start by outlining the application:
require 'yaml'
require 'swissparser'
class Protein
  #...
end
module Uniprot
  Rules = Swiss::Rules.define do
    #...
  end
  Parser = Rules.make_parser do
    # insert here parser workflow
  end
end
# MAIN METHOD
if $0 == __FILE__
  #...
end
Next, we declare the parsing rules. They must be enclosed in a rules definition block such as:
require 'yaml'
require 'swissparser'
class Protein
  #...
end
module Uniprot
  Rules = Swiss::Rules.define do
    # Add parsing rules here
  end
end
Inside the rules block, the order of declarations does not matter. There are three kinds of rules:
- separator rules, which specify the line separating entries in the input file.
- key rules, which parse lines starting with a key, such as:
FT SIGNAL 1 17
where ‘FT’ is the key and ‘SIGNAL 1 17’ the content.
- text-after-key rules, which parse lines without keys, like the sequence lines in the UniProt format.
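To make the key/content distinction concrete, here is a minimal plain-Ruby sketch of how such a line could be split. It is only an illustration and not SwissParser's actual implementation: the `split_key_line` helper and its regular expression are assumptions made for this example.

```ruby
# Hypothetical helper (not part of SwissParser): splits a flat-file line
# into its two-letter key and its content.
# Assumption: a key is two uppercase letters followed by whitespace.
def split_key_line(line)
  match = line.chomp.match(/\A([A-Z]{2})\s+(.*)\z/)
  return nil unless match # lines without a key, e.g. sequence lines

  [match[1], match[2].strip]
end

key, content = split_key_line("FT   SIGNAL 1 17")
puts key     # the key part
puts content # the content part
```

Lines that do not start with a key, such as the indented sequence lines, make the helper return nil.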
You can specify a separator with a set_separator( "//" ) directive inside the rules block. However, the default separator is already “//”, so we skip this part.
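As an illustration of what a separator rule does, the following plain-Ruby sketch groups the lines of a flat file into entries, closing an entry at each “//” line. The `split_entries` helper and the shortened sample text (whose second entry is made up) are assumptions for this example; SwissParser handles this step internally.

```ruby
# Hypothetical helper (not part of SwissParser): groups the lines of a
# flat file into entries, where each entry ends with a separator line.
def split_entries(text, separator = "//")
  text.each_line(chomp: true)
      .slice_after { |line| line.strip == separator }
      .map { |lines| lines.reject { |line| line.strip == separator } }
      .reject(&:empty?)
end

# Shortened sample; the second entry is made up for illustration.
SAMPLE_FLATFILE = <<~FLATFILE
  ID   PPBT_HUMAN Reviewed; 524 AA.
  OS   Homo sapiens (Human).
  //
  ID   XXXX_HUMAN Reviewed; 100 AA.
  //
FLATFILE

entries = split_entries(SAMPLE_FLATFILE)
puts entries.length
entries.each { |lines| puts lines.first }
```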
In UniProt we can retrieve the ID and the sequence length in the ‘ID’ line:
ID PPBT_HUMAN Reviewed; 524 AA.
We can extract both by adding the following rule inside the rules block:
with("ID") do |content|
  content =~ /([A-Z]\w+)\D+(\d+)/
  @id = $1
  @size = $2.to_i
end
Here, the with directive declares a rule to parse lines with a given key. It takes two arguments: first the key (here ‘ID’), and then a code block specifying how to parse the content. This code block takes in turn one argument: the content of the line, without the key and the whitespace that follows it.
We simply use a regular expression to find the values we want and set the corresponding fields in the protein instance.
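The regular expression can be tried on its own, outside of any parser. The sketch below applies it to the content of the sample ID line; the `parse_id_content` helper is a hypothetical name introduced for this example and uses a MatchData object instead of the `$1` and `$2` globals.

```ruby
# The rule's regular expression, tried on the content of the sample ID line.
ID_PATTERN = /([A-Z]\w+)\D+(\d+)/

# Hypothetical helper (not part of SwissParser): returns [id, size] or nil.
def parse_id_content(content)
  match = ID_PATTERN.match(content)
  return nil unless match

  [match[1], match[2].to_i]
end

id, size = parse_id_content("PPBT_HUMAN Reviewed; 524 AA.")
puts id   # → PPBT_HUMAN
puts size # → 524
```

The first group captures the entry name, `\D+` skips everything up to the first digit, and the second group captures the sequence length.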
We use the same strategy to parse the species and the taxonomy of UniProt entries:
with("OS") do |content|
  content =~ /(\w+ \w+)/
  @species = $1
end
with("OC") do |content|
  # split on ";" plus optional whitespace so that a trailing ";" is dropped
  ary = content.gsub(".", "").split(/;\s*/)
  @taxonomy ||= [] # initialize the array if needed
  @taxonomy += ary
end
Note that the taxonomy can span several lines in the UniProt file, such as:
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC Catarrhini; Hominidae; Homo
This is not a problem: the rule will be called once per line, and each call simply appends the new taxonomic fields to the existing array.
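The accumulation over several lines can be checked in plain Ruby, without the parser. The sketch below feeds the content of the three sample OC lines to a hypothetical `taxonomy_fields` helper, one call per line, as the rule would be called. It splits on a semicolon followed by optional whitespace, an assumption made here so that a semicolon ending a line does not stay attached to the last field.

```ruby
# Content of the three sample OC lines, with the key already removed.
OC_CONTENTS = [
  "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;",
  "Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;",
  "Catarrhini; Hominidae; Homo"
].freeze

# Hypothetical helper (not part of SwissParser): splits one OC content line
# into its taxonomic fields, dropping periods and separators.
def taxonomy_fields(content)
  content.delete(".").split(/;\s*/)
end

taxonomy = []
OC_CONTENTS.each { |content| taxonomy += taxonomy_fields(content) }

puts taxonomy.length # → 14
puts taxonomy.first  # → Eukaryota
puts taxonomy.last   # → Homo
```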
Parsing the sequence is a bit different, because sequence lines have no key. However, we know that they come after a line with the key ‘SQ’, such as:
SQ SEQUENCE 524 AA; 57305 MW; 71B45F17F6211900 CRC64;
MISPFLVLAI GTCLTNSLVP EKEKDPKYWR DQAQETLKYA LELQKLNTNV AKNVIMFLGD
GMGVSTVTAA RILKGQLHHN PGEETRLEMD KFPFVALSKT YNTNAQVPDS AGTATAYLCG
VKANEGTVGV SAATERSRCN TTQGNEVTSI LRWAKDAGKS VGIVTTTRVN HATPSAAYAH
We can then use the with_text_after directive to identify them:
with_text_after("SQ") do |content|
  seq = content.strip.gsub(" ", "")
  @sequence ||= "" # initialize a blank string if needed
  @sequence += seq
end
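The same clean-up can be tried in isolation. The sketch below strips the whitespace from two of the sample sequence lines and concatenates them, once per line, as the rule would do; the `clean_sequence_line` helper is a hypothetical name used only for this example.

```ruby
# Two of the sample sequence lines, as they appear after the SQ line.
SEQUENCE_LINES = [
  "     MISPFLVLAI GTCLTNSLVP EKEKDPKYWR DQAQETLKYA LELQKLNTNV AKNVIMFLGD",
  "     GMGVSTVTAA RILKGQLHHN PGEETRLEMD KFPFVALSKT YNTNAQVPDS AGTATAYLCG"
].freeze

# Hypothetical helper (not part of SwissParser): removes all whitespace
# from one sequence line.
def clean_sequence_line(content)
  content.gsub(/\s+/, "")
end

sequence = ""
SEQUENCE_LINES.each { |line| sequence += clean_sequence_line(line) }

puts sequence[0, 20] # the first two blocks, joined
puts sequence.length
```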
Parsers are created from a set of rules with the make_parser method. It takes a code block that specifies the parser workflow, that is, what to do with the parsed entries.
We can now specify how our parser behaves.
Parser = Rules.make_parser do |entries|
  proteins = []
  entries.each do |entry|
    #assign
  end
end
if $0 == __FILE__
  filename = ARGV.shift
  proteins = Uniprot::Parser.parse_file( filename )
  proteins.each do |p|
    puts p.to_yaml
  end
end
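To get an idea of what the main method prints, the sketch below fills a Protein by hand and dumps it with to_yaml. The Protein class is the one defined at the top of this tutorial and the values come from the sample lines shown earlier; no parsing is involved. The exact YAML layout may vary with the Ruby version.

```ruby
require 'yaml'

class Protein
  attr_accessor :id, :size, :species, :taxonomy, :sequence
  def initialize
    @taxonomy = []
    @sequence = ""
  end
end

# Filled by hand for illustration; in the tutorial the parser does this.
protein = Protein.new
protein.id = "PPBT_HUMAN"
protein.size = 524
protein.taxonomy += ["Eukaryota", "Metazoa"]
protein.sequence += "MISPFLVLAI"

puts protein.to_yaml
```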
We are done! In just xx lines of Ruby we have solved the problem…
As an exercise, you can modify the parser to extract other data.
The next tutorial will show how to extend an existing parser and how to change the workflow.