paradigmatic edited this page Dec 18, 2010 · 21 revisions

Under construction. Please come back later…

In this first tutorial, we will see how to parse a UniProt file. The full code can be found here: http://github.com/paradigmatic/SwissParser/blob/master/examples/tutorial_1.rb

Before starting, you can have a look at the result and try to guess how it works.

The goal

We want to parse a UniProt flat file and retrieve the following information for every entry:

  • the UniProt ID
  • the species
  • the full taxonomy
  • the protein sequence
  • the protein sequence length

This information will be stored in instances of a Protein class, simply defined as:

class Protein
  attr_accessor :id, :size, :species, :taxonomy, :sequence
  def initialize
    @taxonomy = []
    @sequence = ""
  end
end
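As a quick check of the class above, the following sketch fills a Protein by hand with values taken from the sample entry used later in this tutorial. The helper name sample_protein is made up for this illustration; in the real program the parser provides these values.

```ruby
class Protein
  attr_accessor :id, :size, :species, :taxonomy, :sequence
  def initialize
    @taxonomy = []
    @sequence = ""
  end
end

# Hypothetical helper, only for this illustration.
def sample_protein
  protein = Protein.new
  protein.id = "PPBT_HUMAN"
  protein.size = 524
  protein.species = "Homo sapiens"
  protein.taxonomy += ["Eukaryota", "Metazoa"]   # the array starts empty
  protein.sequence += "MISPFLVLAI"               # the string starts empty
  protein
end

p sample_protein.taxonomy
```

Because initialize sets the taxonomy and the sequence to empty values, both can be appended to without any nil check.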

Application outlines

We start by sketching the outline of the application:

require 'yaml'
require 'swissparser'
class Protein
  #...
end
module Uniprot
  Rules = Swiss::Rules.define do
    #...
  end
  Parser = Rules.make_parser do
    # insert here parser workflow 
  end  
end 
# MAIN METHOD 
if $0 == __FILE__    
  #...
end

The parsing rules

We start by specifying the parsing rules. They must be enclosed in a rules definition such as:

require 'yaml'
require 'swissparser'
class Protein
  #...
end
module Uniprot
  Rules = Swiss::Rules.define do
    # Add parsing rules here
  end
end

Inside the rules block the order of declaration has no particular meaning. There are three kinds of rules:

  1. separator rules, which specify the line separating entries in the input file.
  2. key rules, which parse the lines starting with a key, such as: FT SIGNAL 1 17 where ‘FT’ is the key and ‘SIGNAL 1 17’ the content.
  3. text-after-key rules, which parse lines without keys, like the sequence lines in the UniProt format.
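To make the difference between the three kinds concrete, here is a plain-Ruby sketch that classifies a line in the way the list above describes. It does not use SwissParser: the helper classify_line and the symbols it returns are hypothetical names for this illustration, and the pattern assumes the UniProt layout in which a two-letter key is followed by spaces and then the content.

```ruby
# Hypothetical helper: tells which kind of rule a line would fall under.
# Assumes the default "//" separator and two-letter uppercase keys.
def classify_line(line)
  line = line.chomp
  case line
  when %r{\A//\s*\z}
    [:separator]
  when /\A([A-Z]{2})\s+(.*)\z/
    # key, then the content without the whitespace that separates them
    [:key, $1, $2.rstrip]
  else
    # no key: plain text such as a sequence line
    [:text, line.strip]
  end
end

p classify_line("FT   SIGNAL        1     17")
p classify_line("     MISPFLVLAI GTCLTNSLVP")
p classify_line("//")
```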

Entry separator

You can specify a separator with a set_separator( "//" ) directive inside the rules block. However, by default the separator is already equal to “//” so we skip this part.
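What the separator does can be pictured with a small plain-Ruby sketch. This is not SwissParser's implementation: split_entries is a hypothetical helper, and the two short entries are made-up test data.

```ruby
# Hypothetical helper: groups the lines of a flat file into entries,
# closing the current entry each time a separator line is met.
def split_entries(text, separator = "//")
  entries = []
  current = []
  text.each_line do |line|
    line = line.chomp
    if line.strip == separator
      entries << current unless current.empty?
      current = []
    else
      current << line
    end
  end
  entries << current unless current.empty?   # entry without a final separator
  entries
end

text = "ID   A_HUMAN\nOS   Homo sapiens\n//\nID   B_MOUSE\n//\n"
split_entries(text).each { |entry| p entry }
```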

ID and protein length

In UniProt, both the ID and the sequence length can be read from the ‘ID’ line:

ID   PPBT_HUMAN              Reviewed;         524 AA.

To parse it, we add the following directive to the rules block:

with("ID") do |content|
  content =~ /([A-Z]\w+)\D+(\d+)/
  @id = $1
  @size = $2.to_i
end

Here, the with directive declares a rule for lines that start with a key. It takes two arguments: the key (here ‘ID’) and a code block specifying how to parse the content. The block in turn takes one argument: the content of the line, without the key and without the separating whitespace.

We simply use a regular expression to find the values we want and set the corresponding fields (here @id and @size).
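The regular expression can be tried on its own, outside of SwissParser. In the sketch below, parse_id_content is a hypothetical helper that wraps the same pattern and returns the two captured values in a hash.

```ruby
# Same regular expression as in the "ID" rule, wrapped in a
# hypothetical helper so it can be tested in isolation.
def parse_id_content(content)
  match = content.match(/([A-Z]\w+)\D+(\d+)/)
  return nil unless match
  # group 1: the entry name, group 2: the first number after it (the length)
  { id: match[1], size: match[2].to_i }
end

content = "PPBT_HUMAN              Reviewed;         524 AA."
p parse_id_content(content)
```

Note that the pattern expects the entry name to start with an uppercase letter. For a name that starts with a digit, such as 1433B_HUMAN, it would capture only B_HUMAN, so a stricter pattern may be needed for real data.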

Species and Taxonomy

We use the same strategy to parse the species and the taxonomy of UniProt entries:

with("OS") do |content|
  content =~ /(\w+ \w+)/
  @species = $1
end
with("OC") do |content|
  # split on ";" plus optional whitespace, so a ";" ending the line is dropped
  ary = content.gsub(".", "").split(/;\s*/)
  @taxonomy ||= []    # initialize the array if needed
  @taxonomy += ary
end

Note that the taxonomy can span several lines in the UniProt file, such as:

OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC   Catarrhini; Hominidae; Homo

This is not a problem: the rule is called once per line, and each call simply appends the new taxonomic names to the existing array.
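The accumulation over several lines can be checked with a plain-Ruby sketch. The helper add_taxonomy is hypothetical and only mimics repeated calls of the ‘OC’ rule. It splits on a semicolon followed by any whitespace, which also discards the ‘;’ that ends a continued line.

```ruby
# Hypothetical helper mimicking one call of the "OC" rule:
# returns the taxonomy extended with the names found in one line.
def add_taxonomy(taxonomy, content)
  taxonomy + content.gsub(".", "").split(/;\s*/)
end

oc_contents = [
  "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;",
  "Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;",
  "Catarrhini; Hominidae; Homo"
]

# One call per line, each one extending the same array.
taxonomy = oc_contents.reduce([]) { |acc, content| add_taxonomy(acc, content) }
puts taxonomy.join(" > ")
```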

The sequence

Parsing the sequence is a bit different, because sequence lines have no key. However, we know that they come after the line with the key ‘SQ’, such as:

SQ   SEQUENCE   524 AA;  57305 MW;  71B45F17F6211900 CRC64;
     MISPFLVLAI GTCLTNSLVP EKEKDPKYWR DQAQETLKYA LELQKLNTNV AKNVIMFLGD
     GMGVSTVTAA RILKGQLHHN PGEETRLEMD KFPFVALSKT YNTNAQVPDS AGTATAYLCG
     VKANEGTVGV SAATERSRCN TTQGNEVTSI LRWAKDAGKS VGIVTTTRVN HATPSAAYAH

We can then use the with_text_after directive to identify them:

with_text_after("SQ") do |content|
  seq = content.strip.gsub(" ","")
  @sequence ||= ""   # initialize a blank string if needed
  @sequence += seq
end
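The effect of this rule on the sample lines can be reproduced in plain Ruby. In the sketch below, append_sequence is a hypothetical helper that mimics one call of the rule; the two lines are taken from the example above.

```ruby
# Hypothetical helper mimicking one call of the text-after-"SQ" rule:
# removes the indentation and the spaces between blocks of residues.
def append_sequence(sequence, content)
  sequence + content.strip.gsub(" ", "")
end

lines = [
  "     MISPFLVLAI GTCLTNSLVP EKEKDPKYWR DQAQETLKYA LELQKLNTNV AKNVIMFLGD",
  "     GMGVSTVTAA RILKGQLHHN PGEETRLEMD KFPFVALSKT YNTNAQVPDS AGTATAYLCG"
]

# One call per sequence line, each one extending the same string.
sequence = lines.reduce("") { |acc, line| append_sequence(acc, line) }
puts sequence
puts sequence.length
```

Each full sequence line holds six blocks of ten residues, so two lines give 120 residues once the spaces are removed.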

The Parser

A parser is created from the rules with the make_parser method, as in the outline above. It takes a code block that specifies the parser workflow.

Writing the parser workflow

We can now define how our parser behaves.

Parser = Rules.make_parser do |entries|
  proteins = []
  entries.each do |entry|
    #assign
  end
end
if $0 == __FILE__
  filename = ARGV.shift
  proteins = Uniprot::Parser.parse_file( filename ) 
  proteins.each do |p|
    puts p.to_yaml
  end
end
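The main block prints each protein with to_yaml. To see what that call produces without running the parser, the following sketch dumps a Protein filled by hand. The helper demo_yaml is hypothetical, and the exact layout of the YAML text may vary with the Ruby version.

```ruby
require 'yaml'

class Protein
  attr_accessor :id, :size, :species, :taxonomy, :sequence
  def initialize
    @taxonomy = []
    @sequence = ""
  end
end

# Hypothetical helper: builds a small Protein and returns its YAML dump.
def demo_yaml
  protein = Protein.new
  protein.id = "PPBT_HUMAN"
  protein.size = 524
  protein.taxonomy += ["Eukaryota", "Metazoa"]
  protein.to_yaml   # every instance variable becomes a YAML field
end

puts demo_yaml
```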

That’s all

We are done! In just xx lines of Ruby we have solved the problem…

If you want to practice, you can modify the parser to extract other data.

The next tutorial will show how to extend an existing parser and how to change the workflow.