Bio::Graphics is a ruby library for visualizing annotations on chromosomes

README.DEV

 # 
 # = README.DEV - Information for developers
 #
 # Copyright::   Copyright (C) 2007, 2008
 #               Jan Aerts <jan.aerts@gmail.com>
 # License::     The Ruby License
 #

= README for developers
This README is mainly meant to explain (a) how the code works (rather than how to _use_ the library), and (b) how to contribute to the project.

== ABOUT ARGUMENTS

As there can be a host of optional arguments for several of the methods (e.g. Track#add_feature), it becomes unwieldy to have them all defined using the standard ruby way. The Track#add_feature method for example would then have been defined like this:

  def add_feature(feature_object, label = 'anonymous', link = nil, glyph = @glyph, colour = @colour)
    ...
  end
  
So if the user would like to use all default settings except for the colour, they would have to specify all those defaults:

  my_track.add_feature(my_feature_object, 'anonymous', nil, my_track.glyph, [0,1,1])
  
To make the library a bit more user-friendly, I've decided to use keyword arguments for all optional arguments. The above line then becomes:

  my_track.add_feature(my_feature_object, :colour => [0,1,1])
  
The construct you'll see for this is as follows (for Track#new):

  def initialize(panel, name, opts = {})
    @panel = panel
    @name = name
  
    opts = {
      :label => true,
      :glyph => :generic,
      :colour => [0,0,1]
    }.merge(opts)
    
    @show_label = opts[:label]
    @glyph = opts[:glyph]
    @colour = opts[:colour]
    
    ...
  end

Basically, this means that the 'panel' and 'name' arguments are mandatory and the other ones are optional. Which optional arguments are accepted should be listed in the description comments just above the method code. The Hash#merge that's used here just assigns a default value to each argument that is not mentioned.
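
To see the merge in isolation, here is a minimal sketch. The class OptsDemo and the constant DEFAULTS are hypothetical and only mirror the three defaults shown above; they are not part of Bio::Graphics.

```ruby
# Hypothetical stand-in for a class that takes an options hash.
class OptsDemo
  DEFAULTS = { :label => true, :glyph => :generic, :colour => [0, 0, 1] }

  attr_reader :name, :show_label, :glyph, :colour

  def initialize(name, opts = {})
    @name = name
    # Keys given by the caller win; missing keys fall back to DEFAULTS.
    opts = DEFAULTS.merge(opts)
    @show_label = opts[:label]
    @glyph = opts[:glyph]
    @colour = opts[:colour]
  end
end

# Only the colour is overridden; label and glyph keep their defaults.
demo = OptsDemo.new('genes', :colour => [0, 1, 1])
```

Because the defaults are the receiver of the merge, any key the caller passes replaces the default, and any key the caller omits is left untouched.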

== A. FLOW OF THE CODE

=== Overview
I've tried to document as much as possible in the code itself; see for example the comments that accompany the setting of the defaults for Bio::Graphics in the panel.rb file. However, the bigger picture cannot be explained that way.

=== The files
There's one file for each class: panel, track, feature, subfeature, ruler and image_map. See the TUTORIAL for a breakdown of what each of these does. All of these except the image_map make up a picture. The image_map is used to describe the HTML map that can be created to make a picture clickable.

Classes are:
  Bio::Graphics::Panel
  Bio::Graphics::Ruler
  Bio::Graphics::Track
  Bio::Graphics::Feature
  Bio::Graphics::SubFeature
  several Bio::Graphics::Glyph::<something>

=== The workflow
==== 1. Creating the panel
The user has to start with a
  my_panel = Bio::Graphics::Panel.new(length, width, clickable, display_start, display_stop, vertical)

When this happens, among other things, the instance variable @tracks is created that will later contain the actual Track objects. In addition, there's @number_of_feature_rows. You'll later see that each Track object also has its @number_of_feature_rows. The panel needs this information to know how far it has to go down before it can start drawing a track: the first track will be just below the ruler, but the vertical coordinates of the second one depend on the height of all the ones that were drawn previously. And _that_ in turn is defined by the number of times a feature would overlap with another one and therefore had to be _bumped_ down.

@display_start and @display_stop are used for zooming in on a region. Even though the full @length of the sequence can be really long, setting @display_start and @display_stop will only consider that region.

Then there is @rescale_factor, which plays a crucial role in drawing the stuff: it tells the script how many basepairs are contained in one pixel. This variable will be used _very_ extensively in the drawing code.
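
As a rough illustration of what "basepairs per pixel" means, the sketch below computes such a factor and uses it to turn a sequence position into a horizontal pixel. The method names and the exact formula are assumptions made for this example; check panel.rb for the real definition.

```ruby
# Assumed definition: basepairs per pixel for the visible region.
def rescale_factor(display_start, display_stop, width)
  (display_stop - display_start).to_f / width
end

# Convert a basepair position into a horizontal pixel offset within the view.
def bp_to_pixel(position, display_start, factor)
  ((position - display_start) / factor).round
end

factor = rescale_factor(1000, 5000, 800)   # 4000 bp spread over 800 pixels
pixel = bp_to_pixel(3000, 1000, factor)    # halfway through the view
```

With these numbers each pixel covers 5 basepairs, so position 3000 ends up in the middle of an 800-pixel-wide picture.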

So this covered the Panel#initialize...

==== 2. Adding tracks to the panel
Because tracks are inherently part of a panel and cannot exist on their own, they can only be created by using a Panel method rather than a Track method.
  my_track_1 = my_panel.add_track(name, label = false, feature_glyph = :generic, feature_colour = [0,0,1])
Feature_glyph and feature_colour are the default values to use for all features in this track.

The feature_glyph can either be one of the approved symbols (:generic, :spliced, ...; see documentation Bio::Graphics::Panel#add_track) or a hash. The keys of the hash refer to the types of subfeature; the values are these symbols again. For how that's used, see below (5. Drawing the thing).

This creates a new Track object and adds it to the @tracks array of the Panel object. Several instance variables are set for the Track object, including @features (which is an array of Feature objects for that track) and @number_of_feature_rows. Every time a feature cannot be drawn because it would overlap with another one, it will be 'bumped' down until it can be drawn. This effectively results in _rows_ that contain the features. The @number_of_feature_rows is just this number of rows (to be able to calculate the height of the track afterwards).

  ------------------------------------------------------
    *******    ****  *********         *****    *****
         *****                       ********
                                    **

The Panel#add_track method returns the Track object itself, because the latter has to be accessible to be able to assign features to it.

==== 3. Adding features to a track
Same thing as adding a track to a panel: the feature can only be added by the user by using the Track#add_feature method. Arguments are a Bio::Feature object (which itself has a type and location), a label, and the glyph and colour. The glyph can either be one of the approved symbols, or a hash (see 2. Adding tracks to the panel).

The location of a Bio::Feature object can be something like 'complement(join(10..20,50..70))'. To be able to parse this, I use the Bio::Locations object from bioruby (see http://www.bioruby.org). A Bio::Locations (plural) object contains one or more Bio::Location (singular) objects, which are the subfeatures: 10..20 and 50..70. It's these Bio::Location objects we use to calculate the ultimate start and stop of the feature.

The Track#add_feature method returns the Track object itself.

Now let's look at the other end: the Feature object that gets created. In the Feature#initialize method, you'll notice, apart from the obvious variables, the following instance variables: @subfeatures, @left_pixel_of_subfeatures and @right_pixel_of_subfeatures. The @subfeatures thing is quite important. In some cases (e.g. mRNAs that consist of 5'UTR, CDS and 3'UTR), we will want to make a distinction between the UTRs and the CDS for drawing (see TUTORIAL). To make this possible, we will have Feature#new always create an Array of SubFeature objects. Often this array will only contain one object. Drawing of the glyphs is done on a subfeature-by-subfeature basis.
The @left_pixel_of_subfeatures and @right_pixel_of_subfeatures just represent the outermost pixels for this feature.

==== 4. Creating subfeatures (done automatically by Feature#new)
For each SubFeature#new several instance variables are created: @pixel_range_collection, @chopped_at_start, @chopped_at_stop, @hidden_subfeatures_at_start and @hidden_subfeatures_at_stop. Let's take these one by one:

===== @pixel_range_collection
Now _this_ is the crucial bit: it will hold the information on what pixels (on the horizontal axis) should be covered. This means that any part of the feature that does not fall within the view is _not_ in this collection. Basically, for every subfeature (e.g. exon for a gene), the location of that subfeature is compared to the region of the view. If a subfeature is not in the view at all, its positions are discarded (but other stuff does happen, see below); if a subfeature is at the left of the picture but actually extends outwith the view, the start pixel will become 1. You get the picture. Also see the mini diagrams in the code itself.

These start and stop positions are used to create Bio::Graphics::Panel::Track::Feature::SubFeature::PixelRange objects. Unspliced objects will have an array @pixel_range_collection with just one element.

===== @chopped_at_start and @chopped_at_stop
Suppose you've got a directed feature (so one with an arrow), and the 3' end falls outside of the view. What would happen is that the 3' end that's out of view would be chopped off (that's good), but also that the end of the glyph (which is _not_ the end of the feature) becomes an arrow. We don't want that; instead, the arrow should be removed.

That's where the @chopped_at_start and @chopped_at_stop come in. If these are set to true (while building the @pixel_range_collection), the arrow is not drawn.
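
The clipping described for @pixel_range_collection and the setting of the chopped flags can be sketched together. This is a simplified illustration, not the library's code: the method name clip_to_view, the returned hash and the rounding are all assumptions, and the flags here simply mean "cut off at the left" and "cut off at the right" of the view.

```ruby
# Clip one subfeature (in basepairs) to the view and report pixel positions.
# Returns nil when the subfeature lies completely outside the view.
def clip_to_view(start, stop, display_start, display_stop, factor)
  return nil if stop < display_start || start > display_stop

  chopped_at_start = start < display_start
  chopped_at_stop = stop > display_stop
  visible_start = [start, display_start].max
  visible_stop = [stop, display_stop].min

  {
    # The leftmost visible pixel is 1, as described above.
    :start_pixel => [((visible_start - display_start) / factor).round, 1].max,
    :stop_pixel => ((visible_stop - display_start) / factor).round,
    :chopped_at_start => chopped_at_start,
    :chopped_at_stop => chopped_at_stop
  }
end

inside = clip_to_view(2000, 3000, 1000, 5000, 5.0)  # fully visible
cut = clip_to_view(500, 1500, 1000, 5000, 5.0)      # sticks out on the left
hidden = clip_to_view(6000, 7000, 1000, 5000, 5.0)  # not in view at all
```

A drawing routine could then skip the arrow whenever the flag for the side that carries the arrow is true.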

===== @hidden_subfeatures_at_start and @hidden_subfeatures_at_stop
For spliced features, it might be that one or more of the subfeatures (e.g. exons) lies outwith the view. We normally draw e.g. genes by drawing the exons as boxes and connecting them with small lines. The drawing code itself (see later) takes all exons within view and draws those connections. However, if an exon is outside of the viewing area, this line is not drawn. The @hidden_subfeatures_at_start and @hidden_subfeatures_at_stop are just flags to capture this.

==== 5. Drawing the thing
The Cairo and Pango libraries (http://cairographics.org, http://www.pango.org) are used for the actual drawing. The main concepts in the Cairo drawing model are (please also see http://cairographics.org/tutorial):
* *source*: the _paint_ you'll be using
* *destination*: the _surface_ (Cairo::ImageSurface) that you want to draw onto
* *mask*: controls where you apply the source to the destination. Stuff like 'line_to'.
* *context*: tracks one source, one mask and one destination.

From the cairo tutorial: "Before you can start to draw something with cairo, you need to create the context. <SNIP> When you create a cairo context, it must be tied to a specific surface - for example, an image surface if you want to create a PNG file." So that's what we have to do: create a Cairo::ImageSurface and connect a Cairo::Context to it.

Now let's walk through the drawing code itself...

When a user draws a panel, the first thing that happens is the creation of a Cairo::ImageSurface (the _destination_). To be able to do this, we need to know the dimensions. But there's a slight problem: we can't know the height of the picture until it's actually drawn. The way we circumvent this is to create a very tall picture (called "huge_panel_drawing") that we crop afterwards.

===== Drawing the ruler
A ruler consists of a line with tickmarks on it. The major issue with drawing the ruler is determining the distance between those ticks. If we have zoomed into a small region, we still want to see usable ticks; and if we've zoomed out to a huge region, we don't want those ticks all bumping into each other.

To calculate the distance between consecutive ticks, we start with a distance of 1 basepair, and increase it until the minimal distance criterion is met. We also set the distance between major tickmarks (which are the ones that will get a number).
There's a small issue when you actually start drawing the ticks. Most of the time, we don't want the first tick on the very first basepair of the view. Suppose that would be position 333 in the sequence. Then the numbers under the major tickmarks would be: 343, 353, 363, 373 and so on. Instead, we want 350, 360, 370, 380. So we want to find the position of the first tick. If we've found that one, it's simple to add the rest of them.
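
Both steps can be sketched in a few lines. The method names, the growth factor of 10 and the minimum spacing of 5 pixels are assumptions chosen for this example; the ruler code itself may use different values.

```ruby
# Grow the tick distance (in basepairs) until ticks are far enough apart
# on screen. Growing by a factor of 10 is an assumption of this sketch.
def tick_distance(factor, min_pixels)
  distance = 1
  distance *= 10 while distance / factor < min_pixels
  distance
end

# First tick at or after the start of the view: round up to the next
# multiple of the tick distance (integer arithmetic).
def first_tick(display_start, distance)
  ((display_start + distance - 1) / distance) * distance
end

distance = tick_distance(5.0, 5)   # 5 bp per pixel, ticks at least 5 px apart
ticks = (0...4).map { |i| first_tick(1234, distance) + i * distance }
```

Here the view starts at position 1234, the tick distance comes out as 100 basepairs, and the first four ticks fall on 1300, 1400, 1500 and 1600.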

The ruler height @height consists of the height of the ruler itself plus the height of the numbers.

===== Drawing the tracks
Drawing each track starts out with the general header: a line above it and the title. We also 'translate' the track down to not let it overlap with previously drawn tracks. Obviously, the more challenging part is drawing the features themselves.

===== Drawing the (sub)features.
First thing we have to do is figure out what the *vertical* *coordinates* of the glyph should be (i.e. the row). To keep track of what parts of the screen are already occupied by features (so that we know when a new feature has to be bumped down), we make use of a *grid*. The grid is basically a hash with the keys being the row number, and the values arrays of ranges. (These ranges use basepair units rather than pixels, but that's completely arbitrary.) For each feature, we first check if we can draw it at the top of the track (i.e. row 1) and, if we can't, move it down a row at a time until there's room for it.

So for example, suppose we've already drawn two features that have the following positions: 100..150 and 200..225. The grid would then look like this:

  grid = { 1 => [(100..150),(200..225)] }

If we'd like to draw a new feature from 125..175 (which overlaps the first of the two ranges above), we see that row_available becomes false, and the row number is increased. The grid after adding this feature looks like:

  grid = { 1 => [(100..150),(200..225)],
           2 => [(125..175)] }
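
The bumping logic can be reproduced with a small standalone sketch. The method names overlap? and place are hypothetical; the library keeps this logic inside its drawing code and uses a row_available flag instead.

```ruby
# Two ranges overlap when neither ends before the other starts.
def overlap?(a, b)
  a.first <= b.last && b.first <= a.last
end

# Find the first row with room for +range+, record it there, return the row.
def place(grid, range)
  row = 1
  row += 1 while grid.fetch(row, []).any? { |taken| overlap?(taken, range) }
  (grid[row] ||= []) << range
  row
end

grid = {}
rows = [(100..150), (200..225), (125..175)].map { |r| place(grid, r) }
```

Running this with the three features from the example puts the first two in row 1 and the third in row 2, which gives the same grid as shown above.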

So now we know what the vertical coordinates of the glyph should be. But the Bio::Graphics::Feature does not do the drawing... The reason for this is that a Feature can have SubFeatures (e.g. a transcript feature can have UTR and CDS subfeatures). If that's not the case, the Feature is seen as having one SubFeature.

Next step is to check if there are reasons we would like to *change* *the* *requested* *glyph* *type* *from* *directed* *to* *undirected*. If the user asks for directed glyphs (i.e. ones with an arrow at the end), but the view is zoomed _way_ out, there's no way the arrow will be visible. If we'd try to draw that arrow anyway, it would become bigger than the feature itself. Another reason would be if the feature's 3' end extends outwith the picture.
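
A minimal sketch of that decision follows. Everything in it is an assumption made for illustration: the UNDIRECTED mapping (including the glyph name :directed_generic), the method name effective_glyph and the arrow width of 4 pixels are not taken from the library.

```ruby
# Assumed mapping from directed glyphs to their undirected counterparts.
# :directed_generic is an assumption of this sketch.
UNDIRECTED = { :directed_generic => :generic, :directed_spliced => :spliced }

# +arrow_width+ (in pixels) is a hypothetical parameter.
# +three_prime_chopped+ says whether the 3' end falls outside the view.
def effective_glyph(glyph, pixel_width, three_prime_chopped, arrow_width = 4)
  return glyph unless UNDIRECTED.key?(glyph)

  too_small = pixel_width < arrow_width
  (too_small || three_prime_chopped) ? UNDIRECTED[glyph] : glyph
end

zoomed_out = effective_glyph(:directed_spliced, 2, false)
normal = effective_glyph(:directed_spliced, 50, false)
chopped = effective_glyph(:directed_generic, 50, true)
```

Undirected glyphs pass through unchanged, so the check can be applied to every feature regardless of its requested glyph.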

Finally, we can *draw*. As there are several types of glyphs, this is handled by a Glyph object. These are defined in the lib/bio/graphics/glyphs directory. These files are automatically read when requiring lib/bio-graphics.
The actual drawing bit should be quite self-explanatory (_move_to_, _line_to_, ...).

For the spliced features (:spliced and :directed_spliced), we first draw the components (i.e. the exons) keeping track of the start and stop positions of the gaps (i.e. introns). We then add the connections in those gaps. In addition, we draw a line that extends to the side of the picture if there are exons out of view. This flag was set when the feature was created (see above: @hidden_subfeatures_at_start and @hidden_subfeatures_at_stop).
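
The bookkeeping for the connections can be sketched as follows, assuming the visible exons are available as pixel ranges sorted from left to right. The method names intron_gaps and edge_connectors are hypothetical, and the actual Cairo drawing calls are left out.

```ruby
# Given the visible exon pixel ranges (sorted left to right), return the
# gaps between consecutive exons; these are where connector lines go.
def intron_gaps(exon_pixels)
  exon_pixels.each_cons(2).map { |left, right| (left.last..right.first) }
end

# Extend a connector to the picture edge when exons are hidden on that side.
def edge_connectors(exon_pixels, width, hidden_at_start, hidden_at_stop)
  lines = []
  lines << (1..exon_pixels.first.first) if hidden_at_start
  lines << (exon_pixels.last.last..width) if hidden_at_stop
  lines
end

exons = [(50..120), (200..260), (400..430)]
gaps = intron_gaps(exons)
edges = edge_connectors(exons, 800, false, true)
```

In this example two connectors are drawn between the three visible exons, and one extra line runs from the last exon to the right edge of an 800-pixel-wide picture because an exon is hidden on that side.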

When the user wants a clickable map, we also have to record that this region should be added to the image map.

When everything has been drawn, we finally know the number of rows for that track (i.e. the number_of_times_bumped).

===== Finalizing the panel
So now we have a huge panel (see "huge_panel_drawing" above) which is way too high. This is converted to a panel of the right size by creating a new panel (i.e. the cairo destination), and then using the huge panel as a source to be transferred onto that new destination.

Finally, we write the PNG to a file. If the user wanted a clickable map, we also create the HTML file.

== B. CONTRIBUTING TO THE PROJECT
(This text is adapted from the bioruby README.DEV)

There are many possible ways to contribute to the Bio::Graphics project, such as:

* Join the discussion on the BioRuby mailing list
* Send a bug report or write a bug fix patch
* Add and correct documentation
* Develop code for new features, etc.

All of these are welcome!  However, this document describes the last option, how to contribute your code to Bio::Graphics.

== Coding style

You will need to follow the typical coding styles of the BioRuby modules:

=== Use the following naming conventions

* CamelCase for module and class names
* '_'-separated_lowercase for method names
* '_'-separated_lowercase for variable names
* all UPPERCASE for constants

=== Indentation must not include tabs

* Use 2 spaces for indentation.
* Don't replace spaces with tabs.

=== Comments

Don't use <tt>=begin</tt> and <tt>=end</tt> blocks for comments.  If you need to add comments, include them in the RDoc documentation.

=== Tutorials

Additional tutorial documentation and working examples are encouraged with your contribution.  You may use the header part of the file for this purpose.

=== Standard documentation

==== of files

Each file should start with a header, which covers the following topics:
* copyright
* license
* description of the file (_not_ the classes; see below)
* any references, if appropriate

The header should be formatted as follows:

  #
  # = bio/graphics/ruler - class to draw ruler
  #
  # Copyright::  Copyright (C) 2001, 2003-2005 Bio R. Hacker <brh@example.org>,
  # Copyright::  Copyright (C) 2006 Chem R. Hacker <crh@example.org>
  #
  # License::    The Ruby License
  #
  # == Description
  #
  # This file contains classes that implement the ruler of a graphic.
  #
  module Bio
    module Graphics
      class Ruler
      
      end
    end
  end

==== of classes and methods within those files

Classes and methods should be documented in a standardized format, as in the following example (from bioruby's lib/bio/sequence.rb):

  # == Description
  #
  # Bio::Sequence objects represent annotated sequences in bioruby.
  # A Bio::Sequence object is a wrapper around the actual sequence, 
  # represented as either a Bio::Sequence::NA or a Bio::Sequence::AA object.
  # For most users, this encapsulation will be completely transparent.
  # Bio::Sequence responds to all methods defined for Bio::Sequence::NA/AA
  # objects using the same arguments and returning the same values (even though 
  # these methods are not documented specifically for Bio::Sequence).
  #
  # == Usage
  # 
  #   require 'bio'
  #   
  #   # Create a nucleic or amino acid sequence
  #   dna = Bio::Sequence.auto('atgcatgcATGCATGCAAAA')
  #   rna = Bio::Sequence.auto('augcaugcaugcaugcaaaa')
  #   aa = Bio::Sequence.auto('ACDEFGHIKLMNPQRSTVWYU')
  # 
  #   # Print in FASTA format
  #   puts dna.output(:fasta)
  # 
  #   # Print all codons
  #   dna.window_search(3,3) do |codon|
  #     puts codon
  #   end
  # 
  class Sequence
  
    # Create a new Bio::Sequence object
    #
    #   s = Bio::Sequence.new('atgc')
    #   puts s                                  # => 'atgc'
    #
    # Note that this method does not initialize the contained sequence
    # as any kind of bioruby object, only as a simple string
    #
    #   puts s.seq.class                        # => String
    #
    # See Bio::Sequence#na, Bio::Sequence#aa, and Bio::Sequence#auto 
    # for methods to transform the basic String of a just created 
    # Bio::Sequence object to a proper bioruby object
    # ---
    # *Arguments*:
    # * (required) _str_: String or Bio::Sequence::NA/AA object
    # *Returns*:: Bio::Sequence object
    def initialize(str)
      @seq = str
    end
  
    # The sequence identifier.  For example, for a sequence
    # of Genbank origin, this is the accession number.
    attr_accessor :entry_id
  
    # An Array of Bio::Feature objects
    attr_accessor :features
  end # Sequence

Preceding the class definition (<tt>class Sequence</tt>), there is at least a description and a usage example. Please use the +Description+ and +Usage+ headings. If appropriate, refer to other classes that interact with or are related to the class.

The code in the usage example should, if possible, be in a format that a user can copy-and-paste into a new script to run. It should illustrate the most important uses of the class. If possible and if it would not clutter up the example too much, try to provide any input data directly in the usage example, instead of referring to ARGV or ARGF for input.

  dna = Bio::Sequence.auto('atgcatgcATGCATGCAAAA')

Otherwise, briefly describe the input, for example:

  # input should be string consisting of nucleotides
  dna = Bio::Sequence.auto(ARGF.read)

Methods should be preceded by a comment that describes what the method does, including any relevant usage examples. (In contrast to the documentation for the class itself, headings are not required.) In addition, any arguments should be listed, as well as the type of thing that is returned by the method. The format of this information is as follows:

  # ---
  # *Arguments*:
  # * (required) _str_: String or Bio::Sequence::NA
  # * (optional) _nr_: a number that means something
  # *Returns*:: true or false

Attribute accessors can be preceded by a short description.

=== Exception handling

Don't use

  $stderr.puts "WARNING"

in your code. Instead, try to avoid printing error messages. For fatal errors, use +raise+ with an appropriate message.
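
As an illustration, a method that validates its input could raise instead of printing a warning. The helper check_colour below is hypothetical and not part of Bio::Graphics.

```ruby
# Hypothetical validation helper: raise instead of printing a warning.
def check_colour(colour)
  unless colour.is_a?(Array) && colour.length == 3
    raise ArgumentError, "colour must be an array of three numbers, got #{colour.inspect}"
  end
  colour
end

# The caller decides how to handle the problem.
begin
  check_colour(:red)
rescue ArgumentError => e
  message = e.message
end
```

This leaves the decision about what to do with the error (log it, show it, abort) to the code that uses the library.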

=== Testing code should use 'test/unit'

Your modules should come with unit tests that show what you intend each method to do.  The test code makes maintenance easier and helps ensure stability. The use of

  if __FILE__ == $0

is deprecated.

== Maintenance

Finally, please maintain the code you've contributed. Please let us know (on the bioruby list) before you commit, so that users can discuss the change.