CHANGES.TXT

= v. 2.4=
    _21/January/2014_
    ==Naive support for drmaa / running Ruffus on a cluster submission node==
        * Requires multithreading rather than multiprocessing. Added `pipeline_run(..., use_multi_threading = True)`.
          Multithreading allows a drmaa session to be shared and prevents the submission node from being overloaded by completion notifications from the cluster.
        * `ruffus.drmaa_wrapper.run_job()` is a convenience function, much like `qrsh`, which can be used as a drop-in replacement for `os.system()`, `subprocess.check_output()` etc.
        * It takes a drmaa queue name, priority, job name, session etc., creates a temporary script file, runs the specified command on the cluster, and returns the stdout and stderr lines only when the command has finished.
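        The "temporary script file, run, then return stdout and stderr lines" behaviour can be
        pictured with a purely local stand-in. This is a sketch only: `run_job_locally` is a
        hypothetical name, it runs the script with `sh` on the local machine rather than through
        drmaa, and it is not the signature of `ruffus.drmaa_wrapper.run_job()`.

```python
import os
import subprocess
import tempfile


def run_job_locally(cmd_str):
    """Hypothetical local stand-in (assumption: NOT the Ruffus / drmaa API).

    Writes cmd_str to a temporary script, runs it with sh, waits for it to
    finish and returns (stdout_lines, stderr_lines).
    """
    fd, script_path = tempfile.mkstemp(suffix=".sh", text=True)
    try:
        with os.fdopen(fd, "w") as script:
            script.write("#!/bin/sh\n" + cmd_str + "\n")
        # Blocks until the command has finished, capturing both streams
        completed = subprocess.run(["sh", script_path],
                                   capture_output=True, text=True)
    finally:
        # The temporary script is always cleaned up
        os.remove(script_path)
    if completed.returncode != 0:
        raise RuntimeError("command failed with exit code %d"
                           % completed.returncode)
    return (completed.stdout.splitlines(True),
            completed.stderr.splitlines(True))


stdout_lines, stderr_lines = run_job_locally("echo hello; echo oops >&2")
```

        Because the call only returns once the command has finished, it can stand in for
        `os.system()` style calls inside a task function.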
= v. 2.3=
    _03/October/2011_
    ==`@active_if` turns off tasks at runtime==
        * Issue 36
        * Design and initial implementation from Jacob Biesinger
        * Takes one or more parameters which can be either booleans or functions or callable objects which return True / False
        * The expressions inside @active_if are evaluated each time 
          `pipeline_run`, `pipeline_printout` or `pipeline_printout_graph` is called.
        * Dormant tasks behave as if they are up to date and have no output.
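        A minimal sketch of how mixed boolean / callable conditions might be evaluated afresh on
        each call. `is_task_active` is a hypothetical helper, not part of Ruffus, and it assumes
        that every condition must hold for the task to be active.

```python
def is_task_active(*conditions):
    """Hypothetical helper (assumption: all conditions must be true).

    Each condition is either a plain boolean or a zero-argument callable
    returning True / False. Callables are re-evaluated on every call.
    """
    return all(cond() if callable(cond) else bool(cond)
               for cond in conditions)


settings = {"run_qc": False}
conditions = (True, lambda: settings["run_qc"])

before = is_task_active(*conditions)   # the callable currently returns False
settings["run_qc"] = True              # state changes between pipeline calls
after = is_task_active(*conditions)    # the callable is evaluated again
```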
    ==Command line parsing for pipelines==
        * From "Issue 44"
        * Added ruffus/cmdline.py
        * Supports both argparse (python 2.7) and optparse (python 2.6):
        * The following options are defined by default:
          {{{
    --verbose
    --version
    --log_file

-t, --target_tasks
-j, --jobs
-n, --just_print
    --flowchart
    --key_legend_in_graph
    --draw_graph_horizontally
    --flowchart_format
    --forced_tasks
           }}}
        * Usage with argparse (Python >= 2.7):
          {{{
from ruffus import *

parser = cmdline.get_argparse(   description='WHAT DOES THIS PIPELINE DO?')

# for example...
parser.add_argument("--input_file")

options = parser.parse_args()

#  optional logger which can be passed to ruffus tasks
logger, logger_mutex = cmdline.setup_logging (__name__, options.log_file, options.verbose)

#_____________________________________________________________________________________
#   pipelined functions go here
#_____________________________________________________________________________________

cmdline.run (options)
          }}}
        * Usage with optparse (Python 2.6):
          {{{
from ruffus import *

parser = cmdline.get_optparse(version="%prog 1.0", usage = "\n\n    %prog [options]")

# for example...
parser.add_option("-c", "--custom", dest="custom", action="count")

(options, remaining_args) = parser.parse_args()

#  logger which can be passed to ruffus tasks
logger, logger_mutex = cmdline.setup_logging ("this_program", options.log_file, options.verbose)

#_____________________________________________________________________________________
#   pipelined functions go here
#_____________________________________________________________________________________

cmdline.run (options)
          }}}
    ==Optionally terminate pipeline after first exception==
        * To have all exceptions interrupt immediately:
          {{{
pipeline_run(..., exceptions_terminate_immediately = True)
          }}}
        * From "Issue 43"
        * By default ruffus accumulates `NN` errors before interrupting the pipeline prematurely. `NN` is the specified parallelism for `pipeline_run(...)`. 
        * By default, a pipeline will only be interrupted immediately if exceptions of type `ruffus.JobSignalledBreak` are thrown.
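        The difference between the two modes can be illustrated with a simplified, serial
        simulation. `run_jobs` is a hypothetical function written for this note; the real
        scheduler runs jobs in parallel and is more involved.

```python
def run_jobs(jobs, parallelism, exceptions_terminate_immediately=False):
    """Hypothetical serial simulation, not the Ruffus scheduler.

    Returns (number of jobs attempted, list of collected exceptions).
    """
    # Assumption: stop after `parallelism` errors by default,
    # or after the very first error when terminating immediately.
    limit = 1 if exceptions_terminate_immediately else parallelism
    errors = []
    attempted = 0
    for job in jobs:
        attempted += 1
        try:
            job()
        except Exception as err:
            errors.append(err)
            if len(errors) >= limit:
                break
    return attempted, errors


def failing_job():
    raise ValueError("boom")


jobs = [failing_job] * 10
default_run = run_jobs(jobs, parallelism=4)
immediate_run = run_jobs(jobs, parallelism=4,
                         exceptions_terminate_immediately=True)
```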
    ==Display exceptions without delay==
        * To see exceptions as they occur:
          {{{
pipeline_run(..., log_exceptions = True)
          }}}
        * From "Issue 43"
        * By default, Ruffus re-throws exceptions in ensemble after pipeline termination.
        * `logger.error(...)` will be invoked with the string representation of each exception, and the associated stack trace.
        * The default logger prints to sys.stderr, but this can be changed to any class from the logging module or compatible object via `pipeline_run(..., logger = ???)`
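        For instance, a logger from the standard `logging` module could be prepared as below
        before being handed to `pipeline_run(..., logger = ...)`. The sketch only uses the
        standard library. The logger name and the in-memory stream are arbitrary choices for
        illustration, and the `try` / `except` block merely imitates an error being reported.

```python
import io
import logging

# Send log records to an in-memory stream instead of sys.stderr
stream = io.StringIO()
logger = logging.getLogger("my_pipeline_sketch")
logger.setLevel(logging.ERROR)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def failing_job():
    raise ValueError("bad input file")


try:
    failing_job()
except Exception as err:
    # String representation of the exception plus its stack trace
    logger.error("%s", err, exc_info=True)

logged = stream.getvalue()
```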
    ==`@split` operations now show the 1->many output in pipeline_printout==
        * From "Issue 45"
        * New output
          {{{
Task = split_animals
     Job = [None
           -> cows
           -> horses
           -> pigs
            , any_extra_parameters]
          }}}
    ==Improved display from `pipeline_printout()`==
        * File date and time are displayed in human readable form and out of date
          files are flagged with asterisks. 


= v. 2.2=
    _21/July/2010_
    ==Parameter substitution for `inputs(...)` / `add_inputs(...)`==
        `glob`s and tasks can be added as the prerequisites / input files using
        `inputs(...)` and `add_inputs(...)`. `glob` expansions will take place when the task
        is run.
    ==Simplifying `@transform` syntax with suffix==
        Regular expressions within ruffus are very powerful, and can allow files to be moved
        from one directory to another and renamed at will.<br><br>
        However, using consistent file extensions and
        `@transform(..., suffix(...))` makes the code much simpler and easier to read. <br><br>
        Previously, `suffix(...)` did not cooperate well with `inputs(...)`.
        For example, finding the corresponding header file (``'.h'``) for the matching input
        required a complicated `regex(...)` regular expression and `inputs(...)`. This simple case,
        e.g. matching ``'something.c'`` with ``'something.h'``, is now much easier in Ruffus.<br><br>
        For example:
          {{{
source_files = ["something.c", "more_code.c"]
@transform(source_files, suffix(".c"), add_inputs(r"\1.h", "common.h"), ".o")
def compile(input_files, output_file):
    ( source_file,
      header_file,
      common_header) = input_files
    # call compiler to make object file
          }}}
          This is equivalent to calling:
          {{{
compile(["something.c", "something.h", "common.h"], "something.o")
compile(["more_code.c", "more_code.h", "common.h"], "more_code.o")
          }}}

        The `\1` matches everything *but* the suffix and will be applied to both `glob`s and file names.<br>
        For simplicity and compatibility with previous versions, there is always an implied `r"\1"` before
        the output parameters. I.e. output parameters strings are *always* substituted.<br>
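        The substitution rule can be emulated with the standard `re` module. This is a sketch
        of the idea only: `job_parameters` is a hypothetical function and does not show how
        Ruffus implements `suffix(...)` internally.

```python
import re


def job_parameters(input_file, suffix, extra_inputs, output_suffix):
    """Hypothetical emulation of suffix substitution (not Ruffus code)."""
    # Group 1 captures everything but the suffix
    regex = re.compile("(.*)" + re.escape(suffix) + "$")
    # Extra inputs are substituted; strings without a group reference
    # (e.g. "common.h") come out unchanged
    inputs = [input_file] + [regex.sub(extra, input_file)
                             for extra in extra_inputs]
    # The output always has an implied group 1 in front of it
    output = regex.sub(r"\1" + output_suffix, input_file)
    return inputs, output


params = [job_parameters(source, ".c", [r"\1.h", "common.h"], ".o")
          for source in ["something.c", "more_code.c"]]
```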
        
    ==Advanced form of `@split`:==
        The standard `@split` divided one set of inputs into multiple outputs (the number of which
        can be determined at runtime).<br>
        This is a `one->many` operation.<br><br>
        An advanced form of `@split` has been added which can split each of several files further.<br>
        In other words, this is a `many->"many more"` operation.<br><br>
        For example, given three starting files:
        {{{
original_files = ["original_0.file",
                  "original_1.file",
                  "original_2.file"]
        }}}
        We can split each into its own set of sub-sections:
        {{{
@split(original_files,
   regex(r"original_(\d+).file"),                       # match original files
         r"files.split.\1.*.fa",                        # glob pattern
         r"\1")                                         # index of original file
def split_files(input_file, output_files, original_index):
    """
        Code to split each input_file
            "original_0.file" -> "files.split.0.*.fa"
            "original_1.file" -> "files.split.1.*.fa"
            "original_2.file" -> "files.split.2.*.fa"
    """
        }}}
        This is, conceptually, the reverse of the @collate(...) decorator
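        The mapping described in the docstring above can be checked with the standard `re`
        module. The sketch assumes that the pattern is meant to match the `original_N.file`
        names and that the index is the first captured group. It only computes the parameters
        and does not split any files.

```python
import re

original_files = ["original_0.file",
                  "original_1.file",
                  "original_2.file"]

# Assumed pattern: capture the numeric index of each original file
pattern = re.compile(r"original_(\d+)\.file")

split_parameters = []
for file_name in original_files:
    output_glob = pattern.sub(r"files.split.\1.*.fa", file_name)
    original_index = pattern.sub(r"\1", file_name)
    split_parameters.append((file_name, output_glob, original_index))
```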
    ==Ruffus will complain about unescaped regular expression special characters:==
        Ruffus uses ``'\1'`` and ``'\2'`` in regular expression substitutions. Even seasoned python
        users may not remember that these have to be 'escaped' in strings. The best option is
        to use 'raw' python strings e.g. `r"\1_substitutes\2correctly\3four\4times"`.<br>
        Ruffus will throw an exception if it sees an unescaped ``'\1'`` or ``'\2'`` in a file name,
        which should catch most of these bugs.
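        The pitfall is easy to reproduce with the standard library alone: without the `r`
        prefix, Python reads the backslash and digit as an octal escape for the character
        with code 1, so the back-reference never reaches the regular expression engine.

```python
import re

unescaped = "\1.h"    # backslash + digit is an octal escape: chr(1)
raw = r"\1.h"         # backslash and "1" are kept as two characters

wrong = re.sub(r"(.*)\.c$", unescaped, "something.c")
right = re.sub(r"(.*)\.c$", raw, "something.c")
```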
    ==Flowchart changes:==
        Changed to nicer colours, symbols etc. for a more professional look.
        Colours, size and resolution are now fully customisable. An svg bug in firefox has
        been worked around so that font sizes are displayed correctly.
                {{{
pipeline_printout_graph( #...
                        user_colour_scheme = {
                                              "colour_scheme_index": 1,
                                              "Task to run"  : {"fillcolor": "blue"}},
                        pipeline_name = "My flowchart",
                        size          = (11,8),
                        dpi           = 120)
                }}}
    ==Bug Fix:==
        * From "Issue 27"
            Previously, Ruffus paused for one second after each job.
            This accommodated the poor (one second) timestamp precision of some older file systems (ext3?),
            and made sure that output from the previous task had a different
            timestamp from that of the following task.<br><br>
            Unfortunately, Ruffus (was too clever by half and) paused only when the jobs were less
            than a second in duration.
            Output files may be created at the end of a task, and
            the timestamps checked at the beginning of the following task. We thus *always* need a
            gap of more than one second between tasks in older file systems, whether the jobs are long or short.<br><br>
            The fix is to introduce a pause before the first job of each task.
            (See `one_second_per_job` in `task.py:make_job_parameter_generator(...)`)<br><br>
            As before, if you are using a modern file system (e.g. ext4 / JFS / NTFS), you can avoid these unnecessary pauses by setting the `one_second_per_job` flag:
            {{{
pipeline_run(one_second_per_job=False)
            }}}
        * From "Issue 30"
            @split with empty input files crashes Ruffus 
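        The timestamp problem behind "Issue 27" can be reproduced by forcing whole-second
        modification times. `needs_update` is a hypothetical check written for this note
        (it assumes an output only counts as up to date if it is strictly newer than its
        input) and is not the Ruffus implementation.

```python
import os
import tempfile


def needs_update(input_path, output_path):
    """Hypothetical check: rerun unless the output is strictly newer."""
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(output_path) <= os.path.getmtime(input_path)


with tempfile.TemporaryDirectory() as tmp:
    stage1 = os.path.join(tmp, "stage1.output")
    stage2 = os.path.join(tmp, "stage2.output")
    for path in (stage1, stage2):
        open(path, "w").close()

    # Both files written within the same second on a coarse file system:
    # e.g. real times 1000.2 and 1000.9 are both stored as 1000
    os.utime(stage1, (1000, 1000))
    os.utime(stage2, (1000, 1000))
    same_second = needs_update(stage1, stage2)

    # With a pause between tasks the second file lands in the next second
    os.utime(stage2, (1001, 1001))
    after_pause = needs_update(stage1, stage2)
```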
    ==Documentation changes:==
        * New bioinformatics example
        * New contributed Gallery of flowcharts


= v. 2.1.1=
    _12/March/2010_
    ==Bug Fix:==
        * From "Issue 26"
          The code "with job_limit_semaphore" breaks compatibility with python 2.5<br>
          Thanks to patch from S. Binet
        * From "Issue 25"
          @merge forwarding single arguments to @merge erroneously as lists<br>
          Thanks to A. Heger.
        * @transform(..., suffix(...), inputs(...))
          Suffix substitution should not have been taking place within `inputs()`. This
          makes it pass a file name to `inputs()` without suffix substitution.
          `Regex()` regular expression substitution continues to take place within `inputs()`<br>
          However, see changes in v. 2.2
    ==Documentation changes:==
        * New step in tutorial emphasising the value of `pipeline_printout(...)` in pipeline development
        * pipeline_printout discussion in the manual.
        * @jobs_limit directive described in the manual.
        * Advanced uses of @split described in the manual.
        * touch_files_only parameter described in the manual.
        * `add_inputs(...)` parameter described in the manual.
    ==`@transform(.., add_inputs(...))`==
        * `inputs(...)` allowed the addition of extra input dependencies / parameters for each job.
          For example, compiling a source file might require pulling in a corresponding 
          header file.
          However, replacing all the input parameters always seemed a very blunt instrument
          just to inject an extra dependency (e.g. a header file).
        * `add_inputs(...)`, as the name suggests, just adds the additional items as the input parameter
          {{{
from ruffus import *
@transform(["a.input", "b.input"], suffix(".input"), add_inputs("just.1.more","just.2.more"), ".output")
def task(i, o):
  ""
          }}}
          produces:
          {{{
Job = [[a.input, just.1.more, just.2.more] ->a.output]
Job = [[b.input, just.1.more, just.2.more] ->b.output]
          }}}
        * like `inputs`, `add_inputs` accepts strings, tasks and globs
          This minor syntactic change promises to add much clarity to some of our
          Ruffus code.
        * `add_inputs()` is available for `@transform`, `@collate` and `@split`
   

= v. 2.1.0=
    _2/March/2010_
    ==Bug Fix:==
        * From "Issue 25".
          Regression for v. 2.0.10
          @files forwarding single arguments as lists.
          (Thanks to A. Heger)
    ==@jobs_limit directive==
        * Some tasks are resource intensive and too many jobs should not be run at the 
          same time. Examples include disk intensive operations such as unzipping, or 
          downloading from FTP sites. 
          Adding 
          {{{
@jobs_limit(4)
@transform(new_data_list, suffix(".big_data.gz"), ".big_data")
def unzip(i, o):
  "unzip code goes here"
          }}}
          would limit the unzip operation to 4 jobs at a time, even if the rest of the
          pipeline runs highly in parallel.
          (Thanks to R. Young for suggesting this.)
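          The effect is similar to guarding each job with a counting semaphore. The sketch
          below is a thread-based analogue using only the standard library. The names
          `unzip_job`, `running` and `peak` are hypothetical, and it does not show how
          `@jobs_limit` is implemented.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

limit = threading.BoundedSemaphore(4)   # at most 4 jobs at a time
counter_lock = threading.Lock()
running = 0
peak = 0


def unzip_job(name):
    """Stand-in for a resource intensive job."""
    global running, peak
    with limit:
        with counter_lock:
            running += 1
            peak = max(peak, running)
        time.sleep(0.01)                # pretend to do some work
        with counter_lock:
            running -= 1
    return name


# The pool itself allows 10 jobs in parallel; the semaphore caps this task
with ThreadPoolExecutor(max_workers=10) as pool:
    finished = list(pool.map(unzip_job, ["file%d" % i for i in range(20)]))
```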

= v. 2.0.10=
    _27/February/2010_
    ==pipeline_run(..., touch_files_only = True)==
        * This will only `touch` output files for each job without running the 
          python function. I.e. The output files are updated if they are old, or created
          if missing.
          This can be useful for simulating a pipeline run so that all files look as
          if they are up-to-date.
        Caveats:
        * This may not work correctly where output files are only determined at runtime, e.g. with @split
        * Only the output from pipelined jobs which are currently out-of-date will be touched.
          In other words, the pipeline runs *as normal*, the only difference is that the
          output files are touched instead of being created by the python task functions
          which would otherwise have been invoked.
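        What "touching" means for a single output file can be shown with `pathlib`. This is
        a sketch with a hypothetical `touch_outputs` helper and made-up file names. It does
        not decide which jobs are out of date, which remains the job of the pipeline.

```python
import os
import tempfile
from pathlib import Path


def touch_outputs(output_paths):
    """Create missing files and refresh the timestamp of existing ones."""
    for path in output_paths:
        Path(path).touch()


with tempfile.TemporaryDirectory() as tmp:
    stale = os.path.join(tmp, "stale.output")
    missing = os.path.join(tmp, "missing.output")

    Path(stale).touch()
    os.utime(stale, (1000, 1000))       # make the existing file look old

    touch_outputs([stale, missing])
    stale_mtime = os.path.getmtime(stale)
    missing_exists = os.path.exists(missing)
```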
    ==parameter substitution for inputs(...)==
        * The inputs(...) parameter in @transform, @collate can now take tasks and globs,
          and these will be substituted appropriately (after regular expression replacement).
    ==Bug Fix:==
        * From "Issue 21".
          Empty @files specifications no longer throw exceptions.
          If verbose logging is on, a warning is printed.
          (Thanks to A. Heger)


= v. 2.0.9=
    _25/February/2010_
    ==Bug Fix:==
        * From "Issue 10".
          Source code directory under svn is now in /ruffus rather than src/ruffus
          (Thanks to P.J. Davis)
        * Better display of @split parameters when logging output
          The output parameters in @split should not be expanded 
          if they are wildcards. This was previously handled as a special case. Now
          all parameter factories return two sets of parameters:
          The first to go to jobs, the second for displaying in trace logs.
        * `pipeline_printout` defaults to a verbose level of 1. A verbose level of 0 does nothing
          (Thanks to L.S.G)
        * The "Start Task" log message at verbosity of 3 was misleading.
          This is only when the task enters the queue. 
          If there are multiple independent tasks, they may all enter the queue at the 
          same time even with multiprocess=1. Jobs will be run one at a time.                                     
          (Thanks to C. Nellaker.)
    ==Advanced form of split:==
        * Previously, split only took one set of inputs (tasks/files/globs) and split these into an indeterminate number of outputs.
          The new advanced form of split takes multiple input, and splits EACH of these
          further. I.e. it is like a combination of @split and @transform.
          For example:
          {{{
@split(get_files, regex(r"(.+).original"), r"\1.*.split")
def split_files(i, o): 
     pass
          }}}
          This experimental feature will be in beta without documentation. Caveat utilitor!
            

= v. 2.0.8=
    _22/January/2010_
    ==Bug Fix:==
        * Now accepts unicode file names: 
          Change `isinstance(x,str)` to `isinstance(x, basestring)`
          (Thanks to P.J. Davis for contributing this.)
        * inputs(...) now raises an exception when passed multiple arguments.
          If the input parameter is being passed a tuple, add an extra set of enclosing
          brackets. Documentation updated accordingly.
          (Thanks to Z. Harris for spotting this.)
        * tasks where regular expressions are incorrectly specified are a great source of frustration
          and puzzlement.
          Now if no regular expression matches occur, a warning is printed
          (Thanks to C. Nellaker for suggesting this)

= v. 2.0.7=
    _11/December/2009_
    ==Bug Fix:==
        * graph printout blows up because of missing run time data error
          (Thanks to A. Heger for reporting this!)


= v. 2.0.6=
    _10/December/2009_
    ==Bug Fix:==
        * several minor bugs
        * better error messages when encountering decorator problems while checking if the pipeline is up to date
        * Exception when output specifications in @split were expanded (unnecessarily) in logging.
          (Thanks to N. Spies for reporting this!)

= v. 2.0.4=
    _22/November/2009_
    ==Bug Fix:==
        * task.get_job_names() dies for jobs with no parameters
        * JobSignalledBreak was not exported

= v. 2.0.3=
    _18/November/2009_
    ==Bug Fix:==
        * @transform accepts single file names. Thanks Chris N.

= v. 2.0.2=
    _18/November/2009_
    ==Better Logging:==
        * pipeline_printout output much prettier
        * pipeline_run at high verbose levels 

          Shows which tasks are being checked
          to see if they are up-to-date or not
    ==Documentation:==
        * New tutorial
        * New manual
        * pretty code figures

= v. 2.0.1=
    _18/November/2009_
    All unit tests passed
    ==Bug Fix:==
        * Numerous bugs to do with ordering of glob / job output consistency

= v. 2.0.1 beta4=
    _16/November/2009_
    ==Bug Fix:==
        * Fixed problems with tasks depending on @split

= v. 2.0 beta=
    _30/October/2009_
    With the experience and feedback over the past few months, I have reworked **Ruffus** 
    completely mainly to make the syntax more transparent, with fewer gotchas.
    Previous limitations to do with parameters have been removed.
    The experience with what *Ruffus* calls "Indicator Objects" has been very positive
    and there are more of them. 
    These are dummy classes with obvious names like "regex" or "suffix" which indicate the
    type of optional parameters much like named parameters.

    ==New Decorators:==
        * @split
        * @merge
        * @transform
        * @collate

    ==Deprecated Decorators:==
        * @files_re
          Functionality is divided among the new decorators
            
    ==New Features:==
        * Files can be chained from task to task, implicit dependencies are inferred automatically
        * Limitations on parameters removed. Any degree of nesting is allowed.
        * Strings containing glob characters ``[]?*`` are automatically inferred to be globs and expanded
        * Input and output parameters containing strings are assumed to be file names, whatever the nested data structures they are found in

    ==Documentation:==
        * New documentation almost complete
        * New Simplified 7 step tutorial
        * New manual work in progress

    ==Bug Fix:==
        * Scheduling errors

= v. 1.1.4=
    _15/October/2009_

    ==New Feature:==
        * Tasks can get their input by automatically chaining to the output from one or more parent tasks using `@files_re`
        * Added example showing how files can be split up into multiple jobs and then recombined
           # Run `test/test_filesre_split_and_combine.py` with `-v|--verbose` `-s|--start_again`
           # Run with `-D|--debug` to test.
        * Documentation to follow

    ==Bug Fix:==
        * Scheduling race conditions

= v. 1.1.3=
    _14/October/2009_

    ==Bug Fix:==
        * Minor (but show stopping) bug in task.generic_job_descriptor

= v. 1.1.2=
    _9/October/2009_
    
    ==Bug Fix:==
        * Nasty (long-standing) bug which caused single-job tasks decorated only with `@follows(mkdir(...))` to be caught in an infinite loop

    ==Code Changes:==
        * Add example of combining multiple input files depending on a regular expression pattern. 
           # Run `test/test_filesre_combine.py` with -v (verbose)
           # Run with -D (debug) to test.


= v. 1.1.1=
    _8/October/2009_
    
    ==New Feature:==
        * _Combine multiple input files using a regular expression_
        * Added `combine` syntax to `@files_re` decorators:
        * Documentation to follow...
        * Example from `src/ruffus/test/test_branching_dependencies.py`:
        {{{
@files_re('*.*', r'(.*/)(.*\.[345]$)', combine(r'\1\2'), r'\1final.6')
def test(input_files, output_files):
  pass
        }}}     
        * will take all files in the current directory
        * will identify files which end in `.3`,  `.4` and `.5` as input files
        * will use `final.6` as the output file
        * `input_files  == [a.3, a.4, b.3, b.5]`  (for example)
        * `output_files == [final.6]` 
           
    ==Bug Fix:==
        * All (known) bugs for running jobs from independent tasks in parallel


= v. 1.0.9=
    _8/October/2009_
    
    ==New Feature:==
        _Multitasking independent tasks_
        * In a major piece of retooling, jobs from independent tasks which do not depend on each other will be run in parallel.
        * This involves major changes to the scheduling code. 
        * Please contact me asap if anything breaks.

    ==Code Changes:==
        * Add example of independent tasks running concurrently in
          `test/test_branching_dependencies.py`
          * Run with -v (verbose) and -j 1 or -j 10 to show the indeterminacy of multiprocessing.
          * Run with -D (debug) to test.

= v. 1.0.8=
    _12/August/2009_
    
    ==Documentation:==
        * Errors fixed. Thanks to Matteo Bertini!

    ==Code Changes:==
        * Added functions which print out job parameters more prettily.
        * `task.shorten_filenames_encoder`
        * `task.ignore_unknown_encoder`
        * Parameters which look like file paths will only have the file part printed
          (i.e. `"/a/b/c" -> 'c'`)
        * Test scripts `simpler_with_shared_logging.py` and `test_follows_mkdir.py`
          have been changed to test for this.
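        The shortening rule can be sketched with `os.path.basename`. The `shorten` function
        below is a hypothetical illustration for nested parameters on a POSIX system. It is
        not `task.shorten_filenames_encoder`, whose behaviour may differ in detail.

```python
import os


def shorten(param):
    """Hypothetical sketch: keep only the file part of path-like strings."""
    if isinstance(param, str):
        return os.path.basename(param)
    if isinstance(param, (list, tuple)):
        # Recurse into nested parameters, preserving list / tuple type
        return type(param)(shorten(item) for item in param)
    return param    # numbers and other objects are left untouched


single = shorten("/a/b/c")
nested = shorten(["/a/b/c", ("x/y.txt", 3)])
```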


= v. 1.0.7=
    _17/June/2009_
    
    ==Code Changes:==
        * Added `proxy_logger` module for accessing a shared log across multiple jobs in
          different processes.

= v. 1.0.6=
    _12/June/2009_
    
    ==Bug fix:==
        * _Ruffus_ version module (`ruffus_version.py`) links fixed
          Soft links in linux do not travel well
        * `mkdir` now can take a list of strings
          added test case

    ==Documentation:==
        * Added history of changes

= v. 1.0.5=
    _11/June/2009_

    ==Bug fix:==
        * Changed "graph_printout" to `pipeline_printout_graph` in documentation.
          This function had been renamed in the code but not in the documentation :-(

    ==Documentation:==
        * Added example for sharing and synchronising data between jobs.
          This shows how different jobs can write to a common log file while still leveraging the full power of _ruffus_.
        

    ==Code Changes:==
        * The graph and print_dependencies modules are no longer exported by default from task.
          Please email me if this breaks anything.
        * More informative error message when referring to unadorned (without _Ruffus_ decorators) python functions as pipelined Tasks
        * Added Ruffus version module `ruffus_version.py`


= v. 1.0.4=
    _05/June/2009_
    ==Bug fix: ==
        * `task.task_names_to_tasks` did not include tasks specified by function rather than name
        * `task.run_all_jobs_in_task` did not work properly without multiprocessing (# of jobs = 1)
        * `task.pipeline_run` only uses multiprocessing pools if `multiprocess` (# of jobs)  > 1
    
    ==Changes to allow python 2.4/2.5 to run:==
        * `setup.py` changed to remove dependency
        * `simplejson` can be loaded instead of python 2.6 `json` module
        * Changed `NamedTemporaryFile` to `mkstemp` because delete parameter is not available before python 2.6

    ==Windows programmes==
        It is necessary to protect the "entry point" of the program under Windows.
        Otherwise, a new process will be created recursively, like the magician's apprentice.
        See: http://docs.python.org/library/multiprocessing.html#multiprocessing-programming

= v. 1.0.3=
    _04/June/2009_
    ==Documentation ==
        
        Including SGE `qrsh` workaround in FAQ.

= v. 1.0.1=
    _22/May/2009_
    ==Add simple tutorial.==
    
        No major bugs so far...!!

= v. 1.0.0 beta =
    _28/April/2009_

    Initial Release in Oxford