An extension to rake that can be used to build database-backed workflows

Why?

There is some interest in the bioinformatics community in using rake
as a workflow tool (see e.g. this blog post from BioinformaticsZen).

Rake could be ideal for this type of work: a typical workflow will
take data and perform a first set of conversions on it (i.e. a task),
followed by a second set of conversions (that is dependent on the
first task), and so on. And obviously, bioinformaticians want to keep their data
in databases rather than files…

A typical Rakefile could look like this:


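 # Names that start with a digit are written as strings,
 # because :001_load_data is not a valid Ruby symbol literal.
 task "001_load_data" do
   # ...
 end

 task "002_calculate_averages" => ["001_load_data"] do
   # ...
 end

 task "003_make_histogram_of_averages" => ["002_calculate_averages"] do
   # ...
 end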

The trouble is that there is no way yet to check whether a task has to
be rerun or not, because there are no timestamps. Regular rake will
rerun all three tasks from the example above, regardless of whether
some of them have already been completed.

BioRake adds this task timestamp functionality to rake for working with
databases. The functionality needed is very similar to the one available for
FileTasks.

So if we had reloaded the data (001), the timestamp for that task in
the metadata would be later than the one for task 002. As a result, task
002 would automatically have to be rerun if we were to run task 003.
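
The comparison described above can be sketched in a few lines of plain Ruby. This is only an illustration of the rule, not BioRake's code: the needs_rerun? helper and the timestamps hash are hypothetical names, and the times are made up.

```ruby
# Hypothetical sketch of the timestamp rule; not BioRake's actual implementation.
# timestamps maps a task name to the time it last completed (absent = never run).
def needs_rerun?(task, prerequisites, timestamps)
  own = timestamps[task]
  return true if own.nil? # the task has never been run

  # Stale if any prerequisite never ran or completed after this task did.
  prerequisites.any? do |pre|
    stamp = timestamps[pre]
    stamp.nil? || stamp > own
  end
end

base = Time.at(1_000_000)
timestamps = {
  "001_load_data"                  => base + 300, # reloaded most recently
  "002_calculate_averages"         => base + 100,
  "003_make_histogram_of_averages" => base + 200
}

stale_002 = needs_rerun?("002_calculate_averages", ["001_load_data"], timestamps)
stale_003 = needs_rerun?("003_make_histogram_of_averages",
                         ["002_calculate_averages"], timestamps)
```

With these made-up times, 002 is stale because 001 completed later. 003 only becomes stale once 002 has been rerun and received a newer timestamp, which is why prerequisites have to be brought up to date before the task that depends on them is checked.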

Install

 gem sources -a http://gems.github.com (you only have to do this once)
 sudo gem install jandot-biorake

Implementation

I’ve started to implement an additional type of task, called
event. The above snippet from a Rakefile would actually contain

 event "001_load_data" do
   # ...
 end

 event "002_calculate_averages" => ["001_load_data"] do
   # ...
 end

instead of using the task method.

Similar to a FileTask, timestamps are used to check if certain tasks
have to be re-run or not. FileTasks have the advantage that every file
has a timestamp. Database events have no such built-in timestamp, so the
completion time of each event is stored as metadata in the .rake
directory inside the current directory.

An event task automatically:

  1. checks the metadata to see whether the task has already been run
  2. if so, checks whether any prerequisite has a timestamp newer than that of the task itself
  3. (re)runs the task if necessary
  4. updates the metadata
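
The four steps above can be sketched in plain Ruby as follows. This is a minimal sketch under stated assumptions, not BioRake's implementation: the SketchEvent class, the one-file-per-event metadata layout and the injected clock are all hypothetical, chosen only to make the example self-contained and deterministic.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical sketch; BioRake's real classes and metadata format may differ.
class SketchEvent
  attr_reader :name, :run_count

  def initialize(name, prerequisites, metadata_dir, &action)
    @name = name
    @prerequisites = prerequisites
    @metadata_dir = metadata_dir
    @action = action
    @run_count = 0
  end

  # Step 1: read the recorded completion time (nil if the event never ran).
  def timestamp
    File.exist?(stamp_path) ? Float(File.read(stamp_path)) : nil
  end

  # Step 2: needed if never run, or if a prerequisite is missing or newer.
  def needed?
    own = timestamp
    return true if own.nil?

    @prerequisites.any? { |pre| pre.timestamp.nil? || pre.timestamp > own }
  end

  # clock is any callable returning an increasing number (assumption for the sketch).
  def invoke(clock)
    @prerequisites.each { |pre| pre.invoke(clock) } # bring prerequisites up to date first
    return unless needed?

    @action.call                            # Step 3: (re)run the event
    @run_count += 1
    FileUtils.mkdir_p(@metadata_dir)
    File.write(stamp_path, clock.call.to_s) # Step 4: update the metadata
  end

  private

  def stamp_path
    File.join(@metadata_dir, "#{name}.timestamp")
  end
end

dir = Dir.mktmpdir
meta = File.join(dir, ".rake")
tick = 0
clock = -> { tick += 1 } # deterministic stand-in for Time.now
log = []

load_data = SketchEvent.new("load_data", [], meta) { log << "load_data" }
averages = SketchEvent.new("calculate_averages", [load_data], meta) do
  log << "calculate_averages"
end

averages.invoke(clock) # first run: both events run
averages.invoke(clock) # everything is up to date: nothing runs
counts_after_two = [load_data.run_count, averages.run_count]

FileUtils.rm_rf(meta)  # wipe the metadata
averages.invoke(clock) # both events run again
counts_after_reset = [load_data.run_count, averages.run_count]
FileUtils.rm_rf(dir)
```

Because the only state is the recorded completion time of each event, deleting the metadata directory is enough to make every event look as if it had never run.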

To re-run all tasks from scratch, call Rake::EventTask.clean or simply run

 rm -rf .rake

to reset the metadata to the state before any events occurred.

Status

Even though the tests seem to run and I’ve tried some things out, I
can’t guarantee production-level stability (well: call it beta). Use
at your own risk.

Sample

The sample/ directory contains an example Rakefile. Suppose a
researcher has intensities for a group of individuals on a number of
probes. This information should be loaded into a database with the
tables individuals, probes and intensities.

As the intensities table contains foreign keys for individual and
probe, the individuals and probes tables have to be loaded
before the intensities can be loaded.

In rake-speak, this would look like:


event :load_probes do
  # load the actual data
end

event :load_individuals do
  # load the actual data
end

event :load_intensities => [:load_probes, :load_individuals] do
  # load the actual data
end

In a later step, the researcher might want to calculate the average
intensity per probe. This would be a new task that depends on the
intensities being loaded:

event :calculate_averages => [:load_intensities] do
  # calculate averages and store in probes table
end
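
The order in which these prerequisites are resolved can be checked with plain Rake tasks. The sketch below uses Rake::Task.define_task as a stand-in for BioRake's event, so it shows only the invocation order and none of the timestamp handling; the order array is a hypothetical helper for recording what ran.

```ruby
require "rake"

# Plain Rake tasks standing in for BioRake events, only to show invocation order.
order = []

Rake::Task.define_task(:load_probes)      { order << :load_probes }
Rake::Task.define_task(:load_individuals) { order << :load_individuals }

Rake::Task.define_task(:load_intensities => [:load_probes, :load_individuals]) do
  order << :load_intensities
end

Rake::Task.define_task(:calculate_averages => [:load_intensities]) do
  order << :calculate_averages
end

# Invoking the last task pulls in its prerequisites first.
Rake::Task[:calculate_averages].invoke
```

Invoking calculate_averages first loads the probes and individuals, then the intensities, and only then calculates the averages.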

In this sample, the database that will contain the data is called
sample.sqlite3. The metadata about completed events is stored in the
.rake directory.

Try running rake -T in the sample/ directory.