
Add -j <max_jobs> and -m (all tasks multitasks) -- solution included #113

Closed
wants to merge 10 commits into from

9 participants

@michaeljbishop

PROBLEM SUMMARY (THE CASE FOR -j and -m)

Rake can be unusable for builds invoking large numbers of concurrent external processes.

PROBLEM DESCRIPTION

Rake makes it easy to maximize concurrency in builds with its "multitask" function. When Rake is used to build non-Ruby projects, it often needs to run shell commands to process files. Unfortunately, when executing multitasks, Rake spawns a new thread for each task prerequisite. This shouldn't cause problems when the build code is pure Ruby (because of green threads), but when the tasks execute external processes, the sheer number of spawned processes can cause the machine to thrash. Additionally, Ruby can reach the maximum number of open files (presumably because it's reading stdout for all those processes).

SOLUTION SUMMARY

This request includes the code to add support for a --jobs NUMBER (-j) command-line option to specify the number of simultaneous tasks to execute.

  • To maintain backward compatibility, not passing -j reverts to the old behavior of unlimited concurrent tasks.

As a nod to drake, a --multitask (-m) flag is also included which, when supplied, changes all tasks into multitasks.
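
As an illustration of the two flags described above, here is a minimal sketch using Ruby's standard optparse library. It is not the option-handling code from this pull request: the Options struct and parse_build_options are hypothetical names, and the sketch does not reproduce the fallback for bad -j input mentioned in the commit notes.

```ruby
require "optparse"

# Hypothetical container for the two settings; not Rake's real options object.
Options = Struct.new(:thread_pool_size, :always_multitask)

# Parses -j/--jobs and -m/--multitask from an argv-style array.
def parse_build_options(argv)
  options = Options.new(nil, false) # nil pool size stands for "no limit"
  parser = OptionParser.new do |opts|
    opts.on("-j", "--jobs NUMBER", Integer,
            "Maximum number of tasks to run at once") do |n|
      options.thread_pool_size = n
    end
    opts.on("-m", "--multitask", "Treat every task as a multitask") do
      options.always_multitask = true
    end
  end
  parser.parse(argv) # non-destructive: argv itself is left untouched
  options
end

opts = parse_build_options(%w[-j 4 -m])
puts "pool size: #{opts.thread_pool_size.inspect}, multitask: #{opts.always_multitask}"
```

Leaving the pool size as nil when -j is absent mirrors the backward-compatible "unlimited" default described above.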

SOLUTION

Rather than spawning a new thread per prerequisite, MultiTask now sends its prerequisites to a WorkerPool object. WorkerPool.new(n).execute_blocks has the same semantics as Thread.new...join but caps the thread count at n.

Core Change

threads = @prerequisites.collect { |p|
  Thread.new(p) { |r| application[r, @scope].invoke_with_call_chain(args, invocation_chain) }
}
threads.each { |t| t.join }

...becomes...

@@wp ||= WorkerPool.new(application.options.thread_pool_size)

blocks = @prerequisites.collect { |r|
  lambda { application[r, @scope].invoke_with_call_chain(args, invocation_chain) }
}
@@wp.execute_blocks blocks

To support -m, the MultiTask implementation has moved to Task#invoke_prerequisites_concurrently and is called from MultiTask#invoke_prerequisites. This enables concurrent behavior for Task when -m is used.
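
The dispatch described above could look roughly like the following sketch. SketchTask and SketchMultiTask are simplified stand-ins, not Rake's real classes, and the concurrent path uses one plain thread per prerequisite rather than a worker pool to keep the example short.

```ruby
# Simplified stand-ins for Task and MultiTask; names are hypothetical.
class SketchTask
  attr_reader :name

  def initialize(name, prerequisites = [], always_multitask: false)
    @name = name
    @prerequisites = prerequisites
    @always_multitask = always_multitask # stands in for the -m flag
  end

  # Runs prerequisites first, then records this task's own name.
  def invoke(log)
    invoke_prerequisites(log)
    log << name
  end

  # A plain task runs prerequisites serially unless -m was given.
  def invoke_prerequisites(log)
    if @always_multitask
      invoke_prerequisites_concurrently(log)
    else
      @prerequisites.each { |p| p.invoke(log) }
    end
  end

  # Shared concurrent implementation, usable by any task.
  def invoke_prerequisites_concurrently(log)
    @prerequisites.map { |p| Thread.new { p.invoke(log) } }.each(&:join)
  end
end

# A multitask always takes the concurrent path.
class SketchMultiTask < SketchTask
  def invoke_prerequisites(log)
    invoke_prerequisites_concurrently(log)
  end
end

log = Queue.new # thread-safe, so concurrent prerequisites can append safely
a = SketchTask.new("a")
b = SketchTask.new("b")
SketchMultiTask.new("build", [a, b]).invoke(log)
```

Keeping the concurrent implementation in the base class is what lets a single flag switch every task over without touching the multitask subclass.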

Details

WorkerPool#execute_blocks adds the passed-in blocks to a queue, ensures there are enough threads to execute them (under the maximum), and sleeps the current thread until the blocks are processed.
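
To make the queue-plus-capped-threads idea concrete, here is a minimal sketch. CappedRunner is a hypothetical class, not the WorkerPool from this pull request: it starts fresh workers on every call instead of keeping a shared pool, so it caps concurrency only within a single call and assumes the blocks do not raise.

```ruby
# Hypothetical sketch: runs blocks on at most max_threads threads per call.
class CappedRunner
  def initialize(max_threads)
    @max_threads = max_threads
  end

  # Queues every block, starts up to @max_threads workers to drain the
  # queue, and blocks the caller until all workers have finished.
  def execute_blocks(blocks)
    queue = Queue.new
    blocks.each { |b| queue << b }
    worker_count = [@max_threads, blocks.size].min
    workers = Array.new(worker_count) do
      Thread.new do
        loop do
          block = begin
            queue.pop(true) # non-blocking pop raises ThreadError when empty
          rescue ThreadError
            break           # nothing left to do, so this worker stops
          end
          block.call
        end
      end
    end
    workers.each(&:join)
    nil
  end
end

runner = CappedRunner.new(2)
runner.execute_blocks(Array.new(4) { |i| lambda { sleep 0.01; i } })
```

Because each call creates its own workers, nested calls in this sketch would multiply threads rather than share one limit, which is the gap a shared pool has to close.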

This creates a few potential problems:

What if all of the blocks then called #execute_blocks? Wouldn't that sleep all the threads?

Yes, it would. This is solved because #execute_blocks removes the current thread from the thread pool just before it sleeps and creates a new one in its place. When all the blocks are processed, the current thread is added back to the pool (adjusting for the maximum size). As a result, there are always enough available threads in the thread pool for processing.
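
One way to picture the "step out while waiting" idea is the sketch below. CompensatingPool is hypothetical and is not the WorkerPool from this pull request. It assumes a maximum of at least 1 and blocks that do not raise, and its cap is soft: the limit can be exceeded briefly when a waiting caller resumes.

```ruby
require "set"

# Hypothetical pool in which a worker that waits on nested blocks gives up
# its slot so that another worker can run in its place.
class CompensatingPool
  def initialize(max_threads)
    @max = max_threads # assumed to be at least 1
    @lock = Mutex.new
    @jobs = []         # pending jobs, guarded by @lock
    @workers = Set.new # live worker threads, guarded by @lock
    @active = 0        # workers not currently waiting in execute_blocks
  end

  def execute_blocks(blocks)
    return if blocks.empty?
    done = Queue.new
    from_worker = false
    @lock.synchronize do
      blocks.each { |b| @jobs << lambda { b.call; done << true } }
      from_worker = @workers.include?(Thread.current)
      @active -= 1 if from_worker # the caller gives up its slot
      spawn_workers
    end
    blocks.size.times { done.pop } # sleep until every block has run
    @lock.synchronize { @active += 1 } if from_worker # take a slot back
    nil
  end

  private

  # Call with @lock held: start workers while work is pending and the
  # number of non-waiting workers is below the cap.
  def spawn_workers
    [@jobs.size, @max - @active].min.times do
      @active += 1
      @workers << Thread.new { work }
    end
  end

  def work
    loop do
      job = @lock.synchronize do
        # Stop when idle, or when a returning caller pushed us over the cap.
        if @jobs.empty? || @active > @max
          @active -= 1
          @workers.delete(Thread.current)
          nil
        else
          @jobs.shift
        end
      end
      break unless job
      job.call
    end
  end
end

pool = CompensatingPool.new(1)
inner = [lambda { :inner_work }]
pool.execute_blocks([lambda { pool.execute_blocks(inner) }])
```

The closing usage runs a nested call with a limit of one thread, the case in which a fixed pool that never replaced its waiting thread would stall.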

When do the threads shut down?

WorkerPool#execute_blocks knows how many threads are waiting for their blocks to be processed. If, upon its awakening, it notices there are no threads waiting on blocks, it shuts down the thread pool.

Statistics

 ---LINES--         ----LOC---
  old   new  diff    old   new  diff  File Name
 ----------  ----   ----------  ----  ----------
  598   605    +7    477   484    +7  lib/rake/application.rb
   16    13    -3     11     8    -3  lib/rake/multi_task.rb
  327   341   +14    210   222   +12  lib/rake/task.rb
        111                 80        lib/rake/worker_pool.rb
 4264  4393         2696  2792        TOTAL
 ------------------------------------------------
             +129                +96  SUMMARY

TESTS

Tests are included for all new functionality

REQUIREMENTS

The Ruby version requirements remain the same. lib/rake/worker_pool.rb adds two new requirements: thread and set

michaeljbishop added some commits Apr 18, 2012
@michaeljbishop michaeljbishop Rake now supports a --jobs <n> command-line option.
DESCRIPTION
-----------
The new option: "--jobs number (-j)" specifies the maximum number of
concurrent tasks. The suggested value is equal to the number of CPUs.

Sample values:
  default: unlimited concurrent tasks (standard rake behavior)
  1: one task at a time

The code consists of two major edits. The first is a change to
`application.rb` to support the parsing of the option.

The second is a more substantial change to `multi_task.rb` which
replaces the multi-task scheduling algorithm. Instead of spawning a new
thread for every prerequisite that needs to be executed, a block that
calls the prerequisite is created and added to a Queue.

Additionally, a thread-pool is created to pull the blocks off the queue
and execute them. Finally, the MultiTask queueing up its prerequisites
will itself participate in the block-processing while waiting for its
prerequisites to finish processing.

It can tell when its prerequisites are finished by enveloping the
queued blocks in another block that adds a little bookkeeping.

VERSION REQUIREMENTS
--------------------
Rake ruby version requirements remain unchanged.
295c7a4
@michaeljbishop michaeljbishop Fixed a bug where the MultiTask tests were not testing the thread pool
This is because I left in the original code which just passed through
to spawn unlimited threads.

Now the code always uses the thread pool, and sets the initial
limit to be the maximum Fixnum (which means virtually unlimited)

This is nice because it means the rake MultiTask tests are now
stressing the thread pool implementation.

Additionally, the thread pool size can now be changed dynamically
to adjust to load by changing 'application.options.thread_pool_size'
while rake is running.

There is no code that observes load, but it certainly could and
adjust as it saw fit.

Notes:
  While the threads are in their processing loop, they check whether
  other threads need to be added to the pool to meet the limit.
  Additionally, if the thread pool size is larger than the
  application preference, the thread exits.
d1229f0
@michaeljbishop michaeljbishop Fixed bug where the stack was being blown
The previous "add blocks to queue" method worked, but had the
unwanted side effect of blowing the stack for large numbers of
prerequisites.

This is a new thread pool implementation which retains all the
advantages of the original, but keeps the stack size the same as the
pre-thread-pool rake.

It's closer to the pre-thread-pool rake implementation with the
addition of checking the thread pool size before spawning a new thread.
578b637
@michaeljbishop michaeljbishop Added -m option which forces every task to be a multitask
This change pulls the concurrent implementation of
'invoke_prerequisites' out of MultiTask into Task while changing the
method name to 'invoke_prerequisites_concurrently'

Then, if -m is passed, Task calls 'invoke_prerequisites_concurrently'
so everything is multithreaded.

MultiTask always calls 'invoke_prerequisites_concurrently'
0ee43db
@michaeljbishop michaeljbishop Forgot to update the rdoc for the -m option c3f5260
@michaeljbishop michaeljbishop Rake now has a throttle on simultaneous task count
This is implemented by adding Rake::WorkerPool which provides a way for
callers to synchronously execute blocks by a thread pool.

What has not changed (and is still problematic) is that Rake creates a
new thread for each and every MultiTask prerequisite.

NOTES: Passes all rake tests
faa1ff1
@michaeljbishop michaeljbishop MultiTask no longer spawns a thread for each prerequisite
WorkerPool:
- now has the ability to execute an array of blocks and wait for them
all to execute.
- only adds a new thread when there is no thread waiting for action.
This slows the ramp up and threads are better reused
- has a minimum and maximum size. By default, minimum is 1 and maximum
is the maximum fixnum
- removed unused #wait call

MultiTask:
- Now uses WorkerPool#execute_blocks to execute its prerequisites
7fa886d
@michaeljbishop michaeljbishop Fixed WorkerPool so it can synchronously execute a group of blocks amongst a thread pool.

Since the WorkerPool was only needed in MultiTask, I changed Task back to simply executing its actions and calling its prerequisites.

Merge branch 'master' into everything_is_a_multitask

Conflicts:
	lib/rake/application.rb
	lib/rake/multi_task.rb
2a2bb2f
@michaeljbishop michaeljbishop Added tests. Tidied up command-line ops. WorkerPool default size is FIXNUM_MAX.

* The default WorkerPool maximum size was changed to FIXNUM_MAX. This
  is OK because only enough threads are added to the thread pool to
  cover the requested number of blocks (but not past the maximum).
* Added a #join call which clears the thread pool of all threads. This
  is called inside #execute_blocks when there are no more threads
  waiting on the thread pool.
* Suppressed "multithreads" output when specifying -m
* Removed unnecessary exception backtrace concatenating
* -j now is kinder when receiving bad input. If it has no parameter
  or the parameter can't be parsed, it defaults to 2.
* Added WorkerPool test
* added tests for -j and -m to the application options test.
f40087d
@michaeljbishop michaeljbishop -m hooked up. test included
Added a test for -m in task.rb. Hooked it up.
Condensed code in application.rb
Added documentation for -j
b828747
@pkondzior

+1, is this going to be merged?

@michaeljbishop
@nexussays

Any movement on this? This is incredibly useful, however we depend on the distributed rake gem so we can't directly use @michaeljbishop's changes until they are integrated.

@jaisingh

this would be really helpful and give us a comparable feature to make

@amcbride

Please merge! This would be awesome

@psywombats

This feature would be very useful, hopefully it can be merged in shortly.

@nickma
nickma commented Jun 29, 2012

This will be really helpful for us.

@larsburgess

+1, this would be awesome!

@jimweirich
Owner

First, sorry for the delay in responding to this... I'm going through the backlog of Rake work and it's taking some time.

Second, I like the functionality of the patch, but I find the worker pool logic a bit difficult to work through. I'm reluctant to merge this until I have a better handle on what's going on.

@michaeljbishop
@jimweirich
Owner

Excellent. Quick question: If multitask is waiting for the futures, I'm assuming it won't be available for job processing. Is it possible to get into a state where all the threads in the pool are waiting for futures and nothing is available to actually do the work?

This is the point I reached in my own simple-minded implementation of a thread pool. It seems your initial version side-stepped this problem, but I wasn't clear on how it did it.

@michaeljbishop
@michaeljbishop
@michaeljbishop

Hi Jim. As promised, I have a new implementation that I feel is a much improved version of this pull request (and a tidier git history to boot). I have removed this repository, created a new one, and added a pull request for that implementation here: #131
