The library bistro.bioinfo
offers a handful of functions to call various tools in computational biology, but of course many are missing. The purpose of this chapter is to demonstrate the few steps required to make a new tool available in bistro
(a.k.a. wrapping).
As a starting example, let's see how we'd proceed with a very silly example, wrapping the touch
command. To do so, we will use the Bistro.Shell_dsl
module which provides many convenient functions to create new workflow
values. Here's what it looks like:
open Bistro.Shell_dsl
let touch =
Workflow.shell ~descr:"touch" [
cmd "touch" [ dest ] ;
]
- Let's describe what we wrote:
- the first line (open statement) makes all the many handy functions from
Bistro.Shell_dsl
visible in the current scope; many functions we describe below come from this module - we define
touch
by calling a function fromBistro.Workflow
namedshell
. As the name suggests, workflow steps it defines are built calling a command line on a shell. - this function takes an argument
descr
which can be used to give a name to the workflow. This argument is optional and is only used for display purpose, but it helpsbistro
to display readable information when logging - the second and last argument of
Workflow.shell
is a list of commands that will be executed when the workflow is run - a command can be built with the
cmd
function fromBistro.Shell_dsl
, which takes a string providing the name of the executable to run and a list of arguments - arguments are of type
Bistro.Shell_dsl.template
, which can be seen as a representation of text with some special tokens inside, that can be replaced by some value when we try to execute the command - the single argument to our command (
dest
) is an example of these special tokens, and represents a path wherebistro
expects to find the result file or directory of the workflow
- the first line (open statement) makes all the many handy functions from
Basically defining a workflow amounts to providing a list of commands that are expected to produce a result at the location represented by the token dest
. Note that a workflow that doesn't use ``dest`` is necessarily incorrect since it has no means to produce its output at the expected location. The value touch
we have defined has type 'a path workflow
, and represents a recipe (right, a very simple one) to produce a result file. This type is too general and we'd have to restrict it to prevent run-time error, but we'll see that later. Let's now see how we make make a pipeline on some parameter.
Our touch
workflow is a very normal OCaml value. It's a datastructure that describes a recipe to produce a file. Let's write another one which is very similar:
let echo_hello =
workflow ~descr:"echo_hello" [
cmd "echo" ~stdout:dest [ string "hello" ] ;
]
- There are a few newcomers here:
- there is an argument
stdout
to thecmd
function, which adds to the command what's necessary to redirect its standard output to a file. Here we redirect todest
- we see that we can form arguments from simple strings with the
string
function. There are other such argument constructors, likeint
,float
and other more sophisticated ones
- there is an argument
With this wrapper, we've encoded the following command line:
$ echo "hello" > $DEST
So far so good. But do we really have to write a new wrapper each time we want to change a small detail in the workflow? Of course not, instead we can simply write a function that produces our workflow:
let echo msg =
workflow ~descr:"echo" [
cmd "echo" ~stdout:dest [ string msg ] ;
]
Our workflow is now a lot more generic, since it can be used to produce files with any content. Well saying workflow here is slightly incorrect, because the value echo
has type string -> 'a path workflow
. It's a function that produces workflows, but since it will be so common, I'll just call them workflows. To put it another way, instead of writing a single script, we now have a function that can produce a particular kind of script given a string.
Most of the time, a computational step in a workflow will take as an input the results obtained from some other. This can be expressed thanks to the function dep
. Let's see right away how it can be used to wrap the program sort
:
let sort text_file =
workflow ~descr:"sort" [
cmd "sort" ~stdout:dest [ dep text_file ] ;
]
The value sort
thus defined is again a function, but this time its argument is a workflow. If you ask OCaml, it will say that sort
has type 'a path workflow -> 'b path workflow
. That is, given a first workflow, this function is able to build a new one. This new workflow will call sort
redirecting the standard output to the expected destination and giving it text_file
as an argument. More precisely, bistro
will inject the location it decided for the output of workflow text_file
in the command invocating sort
. By combining the use of dep
and dest
, you can write entire collections of interdependent scripts without ever caring about where the generated files are stored.
The functions string
and dep
are enough to describe virtually any command-line argument to a program. In addition, the module Bistro.Shell_dsl
provides a few more utility functions that help writing concise and readable wrappers. The following code illustrates the use of a few of them on a simplified wrapper for the bowtie
command:
let bowtie ?v index fq1 fq2 =
workflow ~descr:"bowtie" [
cmd "bowtie" [
string "-S" ;
opt "-1" dep fq1 ;
opt "-2" dep fq2 ;
option (opt "-v" int) v ;
seq ~sep:"" [ dep index ; string "/index" ] ;
dest ;
]
]
- Let us examine each parameter to this command from top to bottom:
- the first argument is a simple
-S
switch, we encode it directly with thestring
function - the second and third arguments are paths to input files introduces with a switch; here writing
[ ... ; opt "-1" dep fq1 ; ... ]
is equivalent to writing[ ... ; string "-1" ; dep fq1 ; ... ]
but is shorter and more readable - the fourth argument is optional; notice that the variable
v
is an optional argument to thebowtie
function, so it is of type'a option
; theoption
function fromBistro.Shell_dsl
will add nothing to the command line ifv
isNone
or else apply its first argument to the value if holds. In that case, the applied function adds an integer argument introduced by a-v
switch - the fifth argument features a constructor called
seq
that can be used to concatenate a list of other chunks interspersed with a string (here the empty string); here we use it to describe a subdirectory of a workflow result - the last argument is simply the destination where to build the result.
- the first argument is a simple
We have seen that the Workflow.shell
function from Bistro.Shell_dsl
can be used to make new workflows that call external programs. This function has of course no means to know what the format of the result file or directory will be. For this reason, it outputs a value of type 'a path workflow
, which means a result whose format is compatible with any other. This is obviously wrong in the general case, and could lead to run-time errors by feeding a tool with inputs of an unsupported format. In order to prevent such run-time errors, we can provide more precise types to our functions producing workflows, when we have more information. Let's see that on an example. FASTA files have the property that when you concatenate several of them, the result is still a FASTA file (this is false in general case of course). We are now going to write a workflow that concatenates several FASTA files, and make sure its typing reflects this property.
Both Bistro
and Bistro_bioinfo
define a few type definitions for annotating workflows. In particular we'll use Bistro_bioinfo.fasta
for our example. Here's how it looks:
open Bistro
open Bistro.Shell_dsl
open Bistro_bioinfo
let fasta_concat (x : fasta pworkflow) (y : fasta pworkflow) : fasta pworkflow =
workflow ~descr:"fasta-concat" [
cmd "cat" ~stdout:dest [ dep x ; dep y ] ;
]
Note the 'a pworkflow
type which is used here, and which is synonym for 'a path workflow
. Alternatively, you can define your workflow in a .ml
file:
open Bistro.Shell_dsl
let fasta_concat x y =
workflow ~descr:"fasta-concat" [
cmd "cat" ~stdout:dest [ dep x ; dep y ] ;
]
and constraint its type in the corresponding .mli
file:
open Bistro
open Bistro_bioinfo
val fasta_concat : fasta pworkflow -> fasta pworkflow -> fasta pworkflow