CHANGES.TXT
= v. 2.4=
_21/January/2014_
==Naive support for drmaa / running Ruffus on a cluster submission node==
* Needs multithreading instead of multiprocessing: added `pipeline_run(..., use_multi_threading = True)`.
Multithreading allows a drmaa session to be shared and prevents the submission node from being overloaded by completion notifications from the cluster.
* ruffus.drmaa_wrapper.run_job() is a convenience function much like `qrsh` which can be used as a drop in replacement for `os.system()` or `subprocess.check_output()` etc.
* Takes the drmaa queue name, priority, job name, session etc., creates a temporary script file, runs the specified command on the cluster, and returns the stdout and stderr lines only when the command has finished.
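The behaviour described for `run_job()` (write a temporary script, run it, return the stdout and stderr lines once the command has finished) can be sketched locally without a cluster. The helper `run_script_and_wait` below is hypothetical and uses only the standard library; it is not the `ruffus.drmaa_wrapper` implementation, takes no drmaa parameters, and assumes a POSIX `sh` is available.

```python
import os
import subprocess
import tempfile


def run_script_and_wait(cmd_str):
    """Hypothetical local stand-in, NOT ruffus.drmaa_wrapper.run_job().

    Writes cmd_str to a temporary shell script, runs it with `sh`, and
    returns (stdout_lines, stderr_lines) only after the command finishes.
    """
    fd, script_path = tempfile.mkstemp(suffix=".sh")
    try:
        with os.fdopen(fd, "w") as script:
            script.write(cmd_str + "\n")
        # check=True raises CalledProcessError on a non-zero exit status
        completed = subprocess.run(["sh", script_path],
                                   capture_output=True, text=True, check=True)
    finally:
        os.remove(script_path)
    return completed.stdout.splitlines(), completed.stderr.splitlines()


stdout_lines, stderr_lines = run_script_and_wait("echo hello; echo oops 1>&2")
```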
= v. 2.3=
_03/October/2011_
==`@active_if` turns off tasks at runtime==
* Issue 36
* Design and initial implementation from Jacob Biesinger
* Takes one or more parameters which can be either booleans or functions or callable objects which return True / False
* The expressions inside @active_if are evaluated each time
`pipeline_run`, `pipeline_printout` or `pipeline_printout_graph` is called.
* Dormant tasks behave as if they are up to date and have no output.
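The run-time evaluation described above can be sketched with a plain decorator. `active_if_sketch` is a hypothetical stand-in written for illustration; it is not how Ruffus implements `@active_if`, and it only shows that conditions are re-evaluated on every call and that a dormant task does nothing.

```python
import functools


def active_if_sketch(*conditions):
    """Hypothetical stand-in for @active_if; not the Ruffus implementation."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Conditions are re-evaluated on every call, so a task can be
            # switched on or off between pipeline runs.
            for cond in conditions:
                value = cond() if callable(cond) else cond
                if not value:
                    return None  # dormant: do nothing, produce no output
            return func(*args, **kwargs)
        return wrapper
    return decorator


run_extra_step = False


@active_if_sketch(True, lambda: run_extra_step)
def extra_step():
    return "ran"


first = extra_step()     # dormant while run_extra_step is False
run_extra_step = True
second = extra_step()    # active once the flag is flipped
```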
==Command line parsing for pipelines==
* From "Issue 44"
* Added ruffus/cmdline.py
* Supports both argparse (python 2.7) and optparse (python 2.6):
* The following options are defined by default:
{{{
--verbose
--version
--log_file
-t, --target_tasks
-j, --jobs
-n, --just_print
--flowchart
--key_legend_in_graph
--draw_graph_horizontally
--flowchart_format
--forced_tasks
}}}
* Usage with argparse (Python >= 2.7):
{{{
from ruffus import *
parser = cmdline.get_argparse( description='WHAT DOES THIS PIPELINE DO?')
# for example...
parser.add_argument("--input_file")
options = parser.parse_args()
# optional logger which can be passed to ruffus tasks
logger, logger_mutex = cmdline.setup_logging (__name__, options.log_file, options.verbose)
#_____________________________________________________________________________________
# pipelined functions go here
#_____________________________________________________________________________________
cmdline.run (options)
}}}
* Usage with optparse (Python 2.6):
{{{
from ruffus import *
parser = cmdline.get_optgparse(version="%prog 1.0", usage = "\n\n %prog [options]")
# for example...
parser.add_option("-c", "--custom", dest="custom", action="count")
(options, remaining_args) = parser.parse_args()
# logger which can be passed to ruffus tasks
logger, logger_mutex = cmdline.setup_logging ("this_program", options.log_file, options.verbose)
#_____________________________________________________________________________________
# pipelined functions go here
#_____________________________________________________________________________________
cmdline.run (options)
}}}
==Optionally terminate pipeline after first exception==
* To have all exceptions interrupt immediately:
{{{
pipeline_run(..., exceptions_terminate_immediately = True)
}}}
* From "Issue 43"
* By default ruffus accumulates `NN` errors before interrupting the pipeline prematurely. `NN` is the specified parallelism for `pipeline_run(...)`.
* By default, a pipeline will only be interrupted immediately if exceptions of type `ruffus.JobSignalledBreak` are thrown.
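The difference between accumulating `NN` errors and terminating on the first one can be sketched as a simple serial loop. `run_jobs` and its parameters are hypothetical names for illustration only; the real scheduler runs jobs in parallel and is considerably more involved.

```python
def run_jobs(jobs, parallelism, terminate_immediately=False):
    """Hypothetical serial loop illustrating the error limit only."""
    errors = []
    completed = 0
    # Default: tolerate as many errors as the specified parallelism.
    limit = 1 if terminate_immediately else parallelism
    for job in jobs:
        try:
            job()
            completed += 1
        except Exception as exc:
            errors.append(exc)
            if len(errors) >= limit:
                break  # interrupt the pipeline prematurely
    return completed, errors


def ok():
    pass


def bad():
    raise ValueError("job failed")


jobs = [bad, ok, bad, ok, ok]
default_result = run_jobs(jobs, parallelism=2)
immediate_result = run_jobs(jobs, parallelism=2, terminate_immediately=True)
```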
==Display exceptions without delay==
* To see exceptions as they occur:
{{{
pipeline_run(..., log_exceptions = True)
}}}
* From "Issue 43"
* By default, Ruffus re-throws exceptions in ensemble after pipeline termination.
* `logger.error(...)` will be invoked with the string representation of each exception, and the associated stack trace.
* The default logger prints to sys.stderr, but this can be changed to any class from the logging module or compatible object via `pipeline_run(..., logger = ???)`
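Logging each exception as it occurs, while still keeping it for later re-throwing, might look like the sketch below. It uses only the standard `logging` module; `run_and_log` and the logger name are hypothetical, and the log is sent to an in-memory stream rather than sys.stderr so that the result can be inspected.

```python
import io
import logging

log_stream = io.StringIO()
logger = logging.getLogger("pipeline_sketch")
logger.setLevel(logging.ERROR)
logger.addHandler(logging.StreamHandler(log_stream))
logger.propagate = False


def run_and_log(job):
    """Run one job; log any exception straight away and hand it back."""
    try:
        job()
    except Exception as exc:
        # String representation plus stack trace, logged without delay
        logger.error(str(exc), exc_info=True)
        return exc
    return None


def failing_job():
    raise RuntimeError("disk full")


# Exceptions are still collected so they can be re-raised together later
deferred = [exc for exc in [run_and_log(failing_job)] if exc is not None]
logged_text = log_stream.getvalue()
```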
==`@split` operations now show the 1->many output in pipeline_printout==
* From "Issue 45"
* New output
{{{
Task = split_animals
Job = [None
-> cows
-> horses
-> pigs
, any_extra_parameters]
}}}
==Improved display from `pipeline_printout()`==
* File date and time are displayed in human readable form and out of date
files are flagged with asterisks.
= v. 2.2=
_21/July/2010_
==Parameter substitution for `inputs(...)` / `add_inputs(...)`==
`glob`s and tasks can be added as the prerequisites / input files using
`inputs(...)` and `add_inputs(...)`. `glob` expansions will take place when the task
is run.
==Simplifying `@transform` syntax with suffix==
Regular expressions within ruffus are very powerful, and can allow files to be moved
from one directory to another and renamed at will.<br><br>
However, using consistent file extensions and
`@transform(..., suffix(...))` makes the code much simpler and easier to read. <br><br>
Previously, `suffix(...)` did not cooperate well with `inputs(...)`.
For example, finding the corresponding header file (``'.h'``) for the matching input
required a complicated `regex(...)` regular expression and `inputs(...)`. This simple case,
e.g. matching ``'something.c'`` with ``'something.h'``, is now much easier in Ruffus.<br><br>
For example:
{{{
source_files = ["something.c", "more_code.c"]
@transform(source_files, suffix(".c"), add_inputs(r"\1.h", "common.h"), ".o")
def compile(input_files, output_file):
    ( source_file,
      header_file,
      common_header) = input_files
    # call compiler to make object file
}}}
This is equivalent to calling:
{{{
compile(["something.c", "something.h", "common.h"], "something.o")
compile(["more_code.c", "more_code.h", "common.h"], "more_code.o")
}}}
The `\1` matches everything *but* the suffix and will be applied to both `glob`s and file names.<br>
For simplicity and compatibility with previous versions, there is always an implied `r"\1"` before
the output parameters. I.e. output parameter strings are *always* substituted.<br>
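The substitution rule can be illustrated without Ruffus. In this sketch `job_parameters` is a hypothetical helper: it treats `\1` as "everything but the suffix" in the extra inputs, and applies the implied `r"\1"` to the output.

```python
def job_parameters(filename, suffix, extra_inputs, output_suffix):
    """Hypothetical helper mimicking suffix() + add_inputs() substitution."""
    stem = filename[:-len(suffix)]           # everything *but* the suffix
    inputs = [filename] + [extra.replace(r"\1", stem) for extra in extra_inputs]
    output = stem + output_suffix            # implied r"\1" before the output
    return inputs, output


params = [job_parameters(name, ".c", [r"\1.h", "common.h"], ".o")
          for name in ["something.c", "more_code.c"]]
```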
==Advanced form of `@split`:==
The standard `@split` divided one set of inputs into multiple outputs (the number of which
can be determined at runtime).<br>
This is a `one->many` operation.<br><br>
An advanced form of `@split` has been added which can split each of several files further.<br>
In other words, this is a `many->"many more"` operation.<br><br>
For example, given three starting files:
{{{
original_files = ["original_0.file",
                  "original_1.file",
                  "original_2.file"]
}}}
We can split each into its own set of sub-sections:
{{{
@split(original_files,
       regex(r"original_(\d+).file"),   # match starting files
       r"files.split.\1.*.fa",          # glob pattern
       r"\1")                           # index of original file
def split_files(input_file, output_files, original_index):
    """
    Code to split each input_file
        "original_0.file" -> "files.split.0.*.fa"
        "original_1.file" -> "files.split.1.*.fa"
        "original_2.file" -> "files.split.2.*.fa"
    """
}}}
This is, conceptually, the reverse of the `@collate(...)` decorator.
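How each starting file maps to its own glob pattern and index can be checked with the standard `re` module alone. This is only a sketch of the substitution step, assuming the regular expression is written to match the `original_N.file` names; it does not run Ruffus or split any files.

```python
import re

original_files = ["original_0.file", "original_1.file", "original_2.file"]
matcher = re.compile(r"original_(\d+)\.file")

split_jobs = []
for name in original_files:
    match = matcher.search(name)
    if match is None:
        continue                                         # file is not a starting file
    glob_pattern = match.expand(r"files.split.\1.*.fa")  # outputs, found by glob
    original_index = match.expand(r"\1")                 # extra parameter
    split_jobs.append((name, glob_pattern, original_index))
```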
==Ruffus will complain about unescaped regular expression special characters:==
Ruffus uses ``'\1'`` and ``'\2'`` in regular expression substitutions. Even seasoned python
users may not remember that these have to be 'escaped' in strings. The best option is
to use 'raw' python strings e.g. `r"\1_substitutes\2correctly\3four\4times"`.<br>
Ruffus will throw an exception if it sees an unescaped ``'\1'`` or ``'\2'`` in a file name,
which should catch most of these bugs.
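The pitfall is easy to reproduce in plain Python. The check function `has_unescaped_backreference` below is a hypothetical illustration of the idea (look for the control characters that an unescaped ``'\1'`` turns into); it is not the test Ruffus performs.

```python
import re

plain = "\1.h"    # NOT a back-reference: "\1" is the control character chr(1)
raw = r"\1.h"     # backslash followed by "1": a back-reference for re.sub()


def has_unescaped_backreference(text):
    """Hypothetical check: flag control characters chr(1) to chr(7)."""
    return any("\x01" <= ch <= "\x07" for ch in text)


result = re.sub(r"(.+)\.c$", raw, "something.c")
```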
==Flowchart changes:==
Changed to nicer colours, symbols etc. for a more professional look.
Colours, size and resolution are now fully customisable. An SVG bug in Firefox has
been worked around so that font sizes are displayed correctly:
{{{
pipeline_printout_graph( #...
                         user_colour_scheme = {
                                                "colour_scheme_index": 1,
                                                "Task to run"        : {"fillcolor": "blue"}},
                         pipeline_name = "My flowchart",
                         size = (11,8),
                         dpi = 120)
}}}
==Bug Fix:==
* From "Issue 27"
Previously, Ruffus paused for one second after each job.
This accommodates poor (one second) timestamp precision in some older file systems (ext3?),
and makes sure that output from the previous task has a different
timestamp from that of the following task.<br><br>
Unfortunately, Ruffus (was too clever by half and) paused only when the jobs were less
than a second in duration.
Output files may be created at the end of a task, and
the timestamps checked at the beginning of the following task. We thus *always* need a
gap of > 1 second between tasks in older file systems, whether the jobs are long or short.<br><br>
The fix is to introduce a pause before the first job of each task.
(See `one_second_per_job` in `task.py:make_job_parameter_generator(...)`)<br><br>
As before, if you are using a modern file system (e.g. ext4 / JFS / NTFS), you can avoid these unnecessary pauses by clearing the `one_second_per_job` flag:
{{{
pipeline_run(one_second_per_job=False)
}}}
* From "Issue 30"
@split with empty input files crashes Ruffus
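The timestamp problem behind "Issue 27" can be sketched numerically. `recorded_mtime` and `looks_newer` are hypothetical helpers that simulate a file system storing modification times at a given resolution; no files are touched.

```python
import math


def recorded_mtime(true_time, resolution=1.0):
    """Timestamp as stored by a file system with the given resolution (seconds)."""
    return math.floor(true_time / resolution) * resolution


def looks_newer(output_time, input_time, resolution=1.0):
    """True if the output appears strictly newer than the input."""
    return recorded_mtime(output_time, resolution) > recorded_mtime(input_time, resolution)


# Output written 0.7 s after its input: both are stored as 10.0
no_pause = looks_newer(10.9, 10.2)
# Output written after a pause of more than one second: stored as 11.0 vs 10.0
with_pause = looks_newer(11.3, 10.2)
# Millisecond resolution tells them apart without any pause
fine_grained = looks_newer(10.9, 10.2, resolution=0.001)
```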
==Documentation changes:==
* New bioinformatics example
* New contributed Gallery of flowcharts
= v. 2.1.1=
_12/March/2010_
==Bug Fix:==
* From "Issue 26"
The code "with job_limit_semaphore" breaks compatibility with python 2.5<br>
Thanks to a patch from S. Binet
* From "Issue 25"
@merge forwarding single arguments to @merge erroneously as lists<br>
Thanks to A. Heger.
* @transform(..., suffix(...), inputs(...))
Suffix substitution should not have been taking place within `inputs()`. This
makes it pass a file name to `inputs()` without suffix substitution.
`Regex()` regular expression substitution continues to take place within `inputs()`<br>
However, see changes in v. 2.2
==Documentation changes:==
* New step in tutorial emphasising the value of `pipeline_printout(...)` in pipeline development
* pipeline_printout discussion in the manual.
* @jobs_limit directive described in the manual.
* Advanced uses of @split described in the manual.
* touch_files_only parameter described in the manual.
* `add_inputs(...)` parameter described in the manual.
==`@transform(.., add_inputs(...))`==
* `inputs(...)` allowed the addition of extra input dependencies / parameters for each job.
For example, compiling a source file might require pulling in a corresponding
header file.
However, replacing all the input parameters always seemed a very blunt instrument
just to inject an extra dependency (e.g. a header file).
* `add_inputs(...)`, as the name suggests, just adds the additional items as the input parameter
{{{
from ruffus import *
@transform(["a.input", "b.input"], suffix(".input"), add_inputs("just.1.more","just.2.more"), ".output")
def task(i, o):
    ""
}}}
produces:
{{{
Job = [[a.input, just.1.more, just.2.more] ->a.output]
Job = [[b.input, just.1.more, just.2.more] ->b.output]
}}}
* like `inputs`, `add_inputs` accepts strings, tasks and globs
This minor syntactic change promises to add much clarity to some of our
Ruffus code.
* `add_inputs()` is available for `@transform`, `@collate` and `@split`
= v. 2.1.0=
_2/March/2010_
==Bug Fix:==
* From "Issue 25".
Regression for v. 2.0.10
@files forwarding single arguments as lists.
(Thanks to A. Heger)
==@jobs_limit directive==
* Some tasks are resource intensive and too many jobs should not be run at the
same time. Examples include disk intensive operations such as unzipping, or
downloading from FTP sites.
Adding
{{{
@jobs_limit(4)
@transform(new_data_list, suffix(".big_data.gz"), ".big_data")
def unzip(i, o):
    "unzip code goes here"
}}}
would limit the unzip operation to 4 jobs at a time, even if the rest of the
pipeline runs highly in parallel.
(Thanks to R. Young for suggesting this.)
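The effect of such a limit can be sketched with a semaphore from the standard `threading` module. This is an illustration of the idea only, not the Ruffus implementation: the job function, the counter and the use of threads are all assumptions made for the example.

```python
import threading
import time

limit = threading.BoundedSemaphore(4)    # at most 4 jobs at a time
lock = threading.Lock()
state = {"running": 0, "peak": 0}


def unzip_job(job_number):
    """Stand-in for a resource-intensive job."""
    with limit:                          # blocks while 4 jobs are already running
        with lock:
            state["running"] += 1
            state["peak"] = max(state["peak"], state["running"])
        time.sleep(0.01)                 # pretend to do disk-intensive work
        with lock:
            state["running"] -= 1


threads = [threading.Thread(target=unzip_job, args=(i,)) for i in range(12)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

peak_concurrency = state["peak"]
```

However many jobs are queued, `peak_concurrency` never exceeds the limit of 4.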
= v. 2.0.10=
_27/February/2010_
==pipeline_run(..., touch_files_only = True)==
* This will only `touch` output files for each job without running the
python function. I.e. the output files are updated if they are old, or created
if missing.
This can be useful for simulating a pipeline run so that all files look as
if they are up-to-date.
Caveats:
* This may not work correctly where output files are only determined at runtime, e.g. with @split
* Only the output from pipelined jobs which are currently out-of-date will be touched.
In other words, the pipeline runs *as normal*, the only difference is that the
output files are touched instead of being created by the python task functions
which would otherwise have been invoked.
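What "touching" means for an old and for a missing output file can be shown with `pathlib`. `touch_outputs` is a hypothetical helper for illustration; the file names and temporary directory are made up, and Ruffus itself is not involved.

```python
import os
import tempfile
from pathlib import Path


def touch_outputs(output_files):
    """Create missing files and refresh the timestamps of existing ones."""
    for name in output_files:
        Path(name).touch()


with tempfile.TemporaryDirectory() as tmp:
    stale = Path(tmp) / "stale.output"
    missing = Path(tmp) / "missing.output"
    stale.write_text("old result")
    os.utime(stale, (0, 0))              # pretend the file is very old

    touch_outputs([stale, missing])

    stale_refreshed = stale.stat().st_mtime > 0
    missing_created = missing.exists()
    stale_contents = stale.read_text()   # contents are left untouched
```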
==parameter substitution for inputs(...)==
* The inputs(...) parameter in @transform, @collate can now take tasks and globs,
and these will be substituted appropriately (after regular expression replacement).
==Bug Fix:==
* From "Issue 21".
Empty @files specifications no longer throw exceptions.
If verbose logging is on, a warning is printed.
(Thanks to A. Heger)
= v. 2.0.9=
_25/February/2010_
==Bug Fix:==
* From "Issue 10".
Source code directory under svn is now in /ruffus rather than src/ruffus
(Thanks to P.J. Davis)
* Better display of @split parameters when logging output
The output parameters in @split should not be expanded
if they are wildcards. This was previously handled as a special case. Now
all parameter factories return two sets of parameters:
The first to go to jobs, the second for displaying in trace logs.
* Pipeline_printout defaults to verbose of 1. Verbose of 0 does nothing
(Thanks to L.S.G)
* The "Start Task" log message at verbosity of 3 was misleading.
This is only when the task enters the queue.
If there are multiple independent tasks, they may all enter the queue at the
same time even with multiprocess=1. Jobs will be run one at a time.
(Thanks to C. Nellaker.)
==Advanced form of split:==
* Previously, split took only one set of inputs (tasks/files/globs) and split these into an indeterminate number of outputs.
The new advanced form of split takes multiple inputs, and splits EACH of these
further. I.e. it is like a combination of @split and @transform.
For example:
{{{
@split(get_files, regex(r"(.+).original"), r"\1.*.split")
def split_files(i, o):
    pass
}}}
This experimental feature will be in beta without documentation. Caveat utilitor!
= v. 2.0.8=
_22/January/2010_
==Bug Fix:==
* Now accepts unicode file names:
Change `isinstance(x,str)` to `isinstance(x, basestring)`
(Thanks to P.J. Davis for contributing this.)
* inputs(...) now raises an exception when passed multiple arguments.
If the input parameter is being passed a tuple, add an extra set of enclosing
brackets. Documentation updated accordingly.
(Thanks to Z. Harris for spotting this.)
* tasks where regular expressions are incorrectly specified are a great source of frustration
and puzzlement.
Now if no regular expression matches occur, a warning is printed
(Thanks to C. Nellaker for suggesting this)
= v. 2.0.7=
_11/December/2009_
==Bug Fix:==
* graph printout blows up because of missing run time data error
(Thanks to A. Heger for reporting this!)
= v. 2.0.6=
_10/December/2009_
==Bug Fix:==
* several minor bugs
* better error messages when encountering decorator problems when checking if the pipeline is up to date
* Exception when output specifications in @split were expanded (unnecessarily) in logging.
(Thanks to N. Spies for reporting this!)
= v. 2.0.4=
_22/November/2009_
==Bug Fix:==
* task.get_job_names() dies for jobs with no parameters
* JobSignalledBreak was not exported
= v. 2.0.3=
_18/November/2009_
==Bug Fix:==
* @transform accepts single file names. Thanks Chris N.
= v. 2.0.2=
_18/November/2009_
==Better Logging:==
* pipeline_printout output much prettier
* pipeline_run at high verbose levels
Shows which tasks are being checked
to see if they are up-to-date or not
==Documentation:==
* New tutorial
* New manual
* pretty code figures
= v. 2.0.1=
_18/November/2009_
All unit tests passed
==Bug Fix:==
* Numerous bugs to do with ordering of glob / job output consistency
= v. 2.0.1 beta4=
_16/November/2009_
==Bug Fix:==
* Fixed problems with tasks depending on @split
= v. 2.0 beta=
_30/October/2009_
With the experience and feedback over the past few months, I have reworked **Ruffus**
completely, mainly to make the syntax more transparent, with fewer gotchas.
Previous limitations to do with parameters have been removed.
The experience with what *Ruffus* calls "Indicator Objects" has been very positive
and there are more of them.
These are dummy classes with obvious names like "regex" or "suffix" which indicate the
type of optional parameters much like named parameters.
==New Decorators:==
* @split
* @merge
* @transform
* @collate
==Deprecated Decorators:==
* @files_re
Functionality is divided among the new decorators
==New Features:==
* Files can be chained from task to task, implicit dependencies are inferred automatically
* Limitations on parameters removed. Any degree of nesting is allowed.
* Strings containing glob letters ``[]?*`` are automatically inferred to be globs and expanded
* Input and output parameters containing strings are assumed to be filenames, whatever the nested data structures they are found in
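The last two features can be sketched in a few lines. Both helpers are hypothetical: `strings_in` walks arbitrarily nested lists and tuples and yields every string (a potential file name), and `looks_like_glob` applies the "contains a glob letter" rule stated above. The actual rules Ruffus applies may differ in detail.

```python
import re

GLOB_LETTERS = re.compile(r"[\[\]?*]")


def looks_like_glob(text):
    """True if the string contains any of the glob letters []?*"""
    return bool(GLOB_LETTERS.search(text))


def strings_in(params):
    """Yield every string found in nested lists / tuples, in order."""
    if isinstance(params, str):
        yield params
    elif isinstance(params, (list, tuple)):
        for item in params:
            yield from strings_in(item)
    # anything else (numbers, None, ...) is not a file name


nested = ["a.txt", ("b.*", [3, "c?.fa"]), 4.5]
file_names = list(strings_in(nested))
globs = [name for name in file_names if looks_like_glob(name)]
```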
==Documentation:==
* New documentation almost complete
* New Simplified 7 step tutorial
* New manual work in progress
==Bug Fix:==
* Scheduling errors
= v. 1.1.4=
_15/October/2009_
==New Feature:==
* Tasks can get their input by automatically chaining to the output from one or more parent tasks using `@files_re`
* Added example showing how files can be split up into multiple jobs and then recombined
# Run `test/test_filesre_split_and_combine.py` with `-v|--verbose` `-s|--start_again`
# Run with `-D|--debug` to test.
* Documentation to follow
==Bug Fix:==
* Scheduling race conditions
= v. 1.1.3=
_14/October/2009_
==Bug Fix:==
* Minor (but show stopping) bug in task.generic_job_descriptor
= v. 1.1.2=
_9/October/2009_
==Bug Fix:==
* Nasty (long-standing) bug which caused single job tasks decorated only with `@follows(mkdir(...))` to be caught in an infinite loop
==Code Changes:==
* Add example of combining multiple input files depending on a regular expression pattern.
# Run `test/test_filesre_combine.py` with -v (verbose)
# Run with -D (debug) to test.
= v. 1.1.1=
_8/October/2009_
==New Feature:==
* _Combine multiple input files using a regular expression_
* Added `combine` syntax to `@files_re` decorators:
* Documentation to follow...
* Example from `src/ruffus/test/test_branching_dependencies.py`:
{{{
@files_re('*.*', '(.*/)(.*\.[345]$)', combine(r'\1\2'), r'\1final.6')
def test(input_files, output_files):
    pass
}}}
* will take all files in the current directory
* will identify files which end in `.3`, `.4` and `.5` as input files
* will use `final.6` as the output file
* `input_files == [a.3, a.4, b.3, b.5]` (for example)
* `output_files == [final.6]`
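The grouping performed by `combine` can be imitated with the standard `re` module: every matching input is filed under the output name its substitution produces. This is a sketch of the idea using made-up file names with an explicit `./` directory prefix; it is not the `@files_re` implementation.

```python
import re
from collections import defaultdict

pattern = re.compile(r"(.*/)(.*\.[345]$)")
candidates = ["./a.3", "./a.4", "./b.3", "./b.5", "./c.7"]

combined = defaultdict(list)
for name in candidates:
    match = pattern.search(name)
    if match:                                     # "./c.7" does not match
        output_file = match.expand(r"\1final.6")
        combined[output_file].append(match.expand(r"\1\2"))

jobs = dict(combined)
```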
==Bug Fix:==
* All (known) bugs for running jobs from independent tasks in parallel
= v. 1.0.9=
_8/October/2009_
==New Feature:==
_Multitasking independent tasks_
* In a major piece of retooling, jobs from independent tasks which do not depend on each other will be run in parallel.
* This involves major changes to the scheduling code.
* Please contact me asap if anything breaks.
==Code Changes:==
* Add example of independent tasks running concurrently in
`test/test_branching_dependencies.py`
* Run with -v (verbose) and -j 1 or -j 10 to show the indeterminacy of multiprocessing.
* Run with -D (debug) to test.
= v. 1.0.8=
_12/August/2009_
==Documentation:==
* Errors fixed. Thanks to Matteo Bertini!
==Code Changes:==
* Added functions which print out job parameters more prettily.
* `task.shorten_filenames_encoder`
* `task.ignore_unknown_encoder`
* Parameters which look like file paths will only have the file part printed
(i.e. `"/a/b/c" -> 'c'`)
* Test scripts `simpler_with_shared_logging.py` and `test_follows_mkdir.py`
have been changed to test for this.
= v. 1.0.7=
_17/June/2009_
==Code Changes:==
* Added `proxy_logger` module for accessing a shared log across multiple jobs in
different processes.
= v. 1.0.6=
_12/June/2009_
==Bug fix:==
* _Ruffus_ version module (`ruffus_version.py`) links fixed
Soft links in linux do not travel well
* `mkdir` now can take a list of strings
added test case
==Documentation:==
* Added history of changes
= v. 1.0.5=
_11/June/2009_
==Bug fix:==
* Changed "graph_printout" to `pipeline_printout_graph` in documentation.
This function had been renamed in the code but not in the documentation :-(
==Documentation:==
* Added example for sharing and synchronising data between jobs.
This shows how different jobs can write to a common log file while still leveraging the full power of _ruffus_.
==Code Changes:==
* The graph and print_dependencies modules are no longer exported by default from task.
Please email me if this breaks anything.
* More informative error message when referring to unadorned (without _Ruffus_ decorators) python functions as pipelined Tasks
* Added Ruffus version module `ruffus_version.py`
= v. 1.0.4=
_05/June/2009_
==Bug fix: ==
* `task.task_names_to_tasks` did not include tasks specified by function rather than name
* `task.run_all_jobs_in_task` did not work properly without multiprocessing (# of jobs = 1)
* `task.pipeline_run` only uses multiprocessing pools if `multiprocess` (# of jobs) > 1
==Changes to allow python 2.4/2.5 to run:==
* `setup.py` changed to remove dependency
* `simplejson` can be loaded instead of python 2.6 `json` module
* Changed `NamedTemporaryFile` to `mkstemp` because delete parameter is not available before python 2.6
==Windows programmes==
It is necessary to protect the "entry point" of the program under Windows.
Otherwise, a new process will be created recursively, like the magician's apprentice.
See: http://docs.python.org/library/multiprocessing.html#multiprocessing-programming
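Protecting the entry point means putting anything that starts processes behind the usual `__main__` guard. The sketch below uses the standard `multiprocessing` module directly rather than Ruffus; the function names are made up for the example. In a pipeline script, the call that starts the pipeline would go in the same guarded position.

```python
import multiprocessing


def square(x):
    return x * x


def main():
    # Worker processes re-import this module under Windows, so nothing that
    # starts processes may run at import time.
    with multiprocessing.Pool(2) as pool:
        return pool.map(square, [1, 2, 3])


if __name__ == "__main__":
    # Only the original process gets here; re-imported copies skip this block
    print(main())
```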
= v. 1.0.3=
_04/June/2009_
==Documentation ==
Including SGE `qrsh` workaround in FAQ.
= v. 1.0.1=
_22/May/2009_
==Add simple tutorial.==
No major bugs so far...!!
= v. 1.0.0 beta =
_28/April/2009_
Initial Release in Oxford