Add -j <max_jobs> (solution included) #131

Merged
jimweirich merged 2 commits into jimweirich:master on Oct 22, 2012

Contributor

michaeljbishop commented Oct 19, 2012

PROBLEM SUMMARY (THE CASE FOR -j)

Rake can be unusable for builds invoking large numbers of concurrent external processes.

PROBLEM DESCRIPTION

Rake makes it easy to maximize concurrency in builds with its "multitask" function. When Rake is used to build non-Ruby projects, it often needs to execute shell commands to process files. Unfortunately, when executing multitasks, Rake spawns a new thread for each task prerequisite. This shouldn't cause problems when the build code is pure Ruby (for green threads), but when the tasks execute external processes, the sheer number of spawned processes can cause the machine to thrash. Additionally, Ruby can reach the maximum number of open files (presumably because it is reading stdout for all of those processes).
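To make the scale of the problem concrete, here is a minimal sketch in plain Ruby (no Rake involved) of the thread-per-prerequisite pattern. The prerequisite names and the `sleep` standing in for an external process are assumptions made for illustration only.

```ruby
require 'thread'

# Hypothetical stand-ins for a multitask's prerequisites.
prerequisites = (1..20).map { |i| "file#{i}.o" }

lock    = Mutex.new
running = 0   # threads currently "building"
peak    = 0   # highest value of running observed

# One thread per prerequisite, mirroring the unbounded behavior.
threads = prerequisites.map do |name|
  Thread.new(name) do |_target|
    lock.synchronize do
      running += 1
      peak = running if running > peak
    end
    sleep 0.05  # stands in for an external process such as a compiler
    lock.synchronize { running -= 1 }
  end
end
threads.each(&:join)

puts "prerequisites: #{prerequisites.size}, peak concurrency: #{peak}"
```

With nothing limiting the threads, the peak usually approaches the number of prerequisites, which is what becomes a problem once each thread owns a real child process.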

SOLUTION SUMMARY

This request includes the code to add support for a --jobs NUMBER (-j) command-line option to specify the number of simultaneous tasks to execute.

  • To maintain backward compatibility, omitting -j keeps the old behavior of unlimited concurrent tasks.

SOLUTION

Rather than spawning a new thread per prerequisite, MultiTask now sends its prerequisites to a Rake::ThreadPool object. ThreadPool#future(*args,&block) has the same semantics as Thread.new, but the block passed to future is added to a queue consumed by a limited number of threads. The returned future can be waited on by issuing #call, which halts the current thread until the future has completed and then returns its value (much like calling Thread#value).
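The semantics described above can be sketched roughly as follows. This is not the code in lib/rake/thread_pool.rb: SketchPool and SketchFuture are hypothetical names, and the pool here has a fixed size with none of the growth behavior discussed under Details.

```ruby
require 'thread'

# Hypothetical future: a one-shot result slot that callers can block on.
class SketchFuture
  def initialize
    @lock  = Mutex.new
    @done  = ConditionVariable.new
    @state = :pending
  end

  # Called by a worker thread once the block has run.
  def resolve(state, value)
    @lock.synchronize do
      @state = state
      @value = value
      @done.broadcast
    end
  end

  # Blocks until resolved, then returns the value or re-raises the error.
  def call
    @lock.synchronize do
      @done.wait(@lock) while @state == :pending
      raise @value if @state == :error
      @value
    end
  end
end

# Hypothetical fixed-size pool; an illustration, not Rake's implementation.
class SketchPool
  def initialize(thread_count)
    @jobs = Queue.new
    @workers = Array.new(thread_count) do
      Thread.new do
        while (job = @jobs.pop)   # a nil job tells the worker to stop
          job.call
        end
      end
    end
  end

  # Queues the block and immediately returns a future for its result.
  def future(*args, &block)
    result = SketchFuture.new
    @jobs << lambda do
      begin
        result.resolve(:ok, block.call(*args))
      rescue StandardError => e
        result.resolve(:error, e)
      end
    end
    result
  end

  # Stops every worker after the queued jobs have drained.
  def join
    @workers.size.times { @jobs << nil }
    @workers.each(&:join)
  end
end

pool    = SketchPool.new(2)
futures = (1..5).map { |i| pool.future(i) { |n| n * n } }
squares = futures.map(&:call)
pool.join
puts squares.inspect   # prints [1, 4, 9, 16, 25]
```

Only two blocks run at a time here, yet the caller collects results in submission order, just as it would by joining five separate threads.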

Core Change (in multi_task.rb)

threads = @prerequisites.collect { |p|
  Thread.new(p) { |r| application[r, @scope].invoke_with_call_chain(args, invocation_chain) }
}
threads.each { |t| t.join }

...becomes...

futures = @prerequisites.collect do |p|
  application.thread_pool.future(p) do |r|
    application[r, @scope].invoke_with_call_chain(args, invocation_chain)
  end
end
futures.each { |f| f.call }

Details

ThreadPool#future(*args,&block) adds the passed-in block to a queue, and returns a future that can be used to realize the result.

This creates a potential problem:

What if we pass 100 blocks through ThreadPool#future and each one is waiting on the value of a single previous future? Wouldn't that starve the pool of threads because rather than dequeuing blocks, they would all be asleep waiting on a single future?

Yes, it would. To prevent this, the future returned by ThreadPool#future detects when a thread is about to sleep while waiting on it. If that thread belongs to the pool, another thread is spawned in its place just before it sleeps, raising the pool's maximum by one. When the sleeping thread wakes because its future has finished, the pool's maximum is decremented by one again. This way there are always threads available in the pool to process queued blocks.
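One way to picture this mechanism is the sketch below. It is written for illustration and is not the code in this pull request: GrowingPool, GrowingFuture, thread_sleeping and thread_awake are all hypothetical names, and worker threads simply exit when the queue is empty or the pool is over its limit.

```ruby
require 'thread'
require 'set'

# Hypothetical pool that grows while one of its own threads sleeps on a future.
class GrowingPool
  def initialize(thread_count)
    @max     = thread_count
    @lock    = Mutex.new
    @jobs    = []
    @threads = Set.new
  end

  def future(*args, &block)
    result = GrowingFuture.new(self)
    @lock.synchronize do
      @jobs << lambda { result.run(args, block) }
      start_thread
    end
    result
  end

  def pool_thread?
    @lock.synchronize { @threads.include?(Thread.current) }
  end

  # A pool thread is about to sleep: allow one extra thread to replace it.
  def thread_sleeping
    @lock.synchronize do
      @max += 1
      start_thread
    end
  end

  # The sleeper is awake again: shrink the limit back.
  def thread_awake
    @lock.synchronize { @max -= 1 }
  end

  private

  # Caller must hold @lock, so the new thread cannot run before it is registered.
  def start_thread
    return if @jobs.empty? || @threads.size >= @max
    thread = Thread.new do
      loop do
        job = @lock.synchronize do
          surplus = @threads.size > @max
          next @jobs.shift unless @jobs.empty? || surplus
          @threads.delete(Thread.current)  # no work, or pool is over its limit
          nil
        end
        break unless job
        job.call
      end
    end
    @threads << thread
  end
end

# Hypothetical future that notifies the pool before a pool thread waits on it.
class GrowingFuture
  def initialize(pool)
    @pool  = pool
    @lock  = Mutex.new
    @done  = ConditionVariable.new
    @state = :pending
  end

  # Runs on a pool thread; the block executes outside the future's lock.
  def run(args, block)
    outcome = begin
      [:ok, block.call(*args)]
    rescue StandardError => e
      [:error, e]
    end
    @lock.synchronize do
      @state, @value = outcome
      @done.broadcast
    end
  end

  def call
    @lock.synchronize do
      if @state == :pending
        from_pool = @pool.pool_thread?
        @pool.thread_sleeping if from_pool
        @done.wait(@lock) while @state == :pending
        @pool.thread_awake if from_pool
      end
      raise @value if @state == :error
      @value
    end
  end
end

# A pool of one thread whose only job waits on another job in the same pool.
pool  = GrowingPool.new(1)
outer = pool.future do
  inner = pool.future { 21 }
  inner.call * 2
end
puts outer.call   # prints 42
```

With a strictly fixed pool of one thread, the outer block would wait forever on the inner future. Here the waiting thread temporarily raises the limit so that a second thread can run the inner block.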

Accessing the thread pool

The thread pool is owned by the application and is accessible as the public read-only attribute: Rake.application.thread_pool. In this way Rakefile authors can create futures for their own use, sharing the pool with the Rake system.

TESTS

Tests are included for all new functionality and have been run under Ruby 1.9.3 and 1.8.7.

REQUIREMENTS

The Ruby version requirements remain the same. lib/rake/thread_pool.rb adds two new requirements: thread and set

michaeljbishop added some commits Oct 15, 2012

Added -j support to rake.
Rake now has a thread_pool implementation which returns futures when passed args
and a block. MultiTask has been changed to ask the thread pool for a list of
futures in which inside each a prerequisite is completed. MultiTask then waits
on each future until it is complete.

The number of threads in the pool is controlled with the new -j option at the
command-line.

The thread pool is now a member of Rake.application and rakefile authors can request
futures for their own operations, participating in the pool.

The thread pool is special in that it will spawn a new thread when a thread in the pool
is sleeping because it is waiting for a future being completed by another thread. When
the new thread is finished, the pool size will shrink to where it was previously.

With this change, the pool always has a number of threads actively doing work (that
number being equal to the -j parameter).

This commit also includes documentation for the new -j parameter and a test for the
ThreadPool implementation.

OH wow I could totally use this on my project.

@jimweirich jimweirich merged commit a8bdcf0 into jimweirich:master Oct 22, 2012

Owner

jimweirich commented Oct 22, 2012

Ok, I've merged the -j option and have pushed a beta (0.9.3.beta.2) with the changes. I would appreciate if you take a look and give it a spin.

Contributor

michaeljbishop commented Oct 22, 2012

Hey, that's great news!

I've been using the ThreadPool version at work, where we execute thousands of multitasks that use Open3.capture3 and sh(). I was using the old WorkerPool code until I submitted my new ThreadPool version.

I'll upgrade to Rake 0.9.3.beta.2 and see how everything hangs together...

I also left out the -m option that I put in my original pull-request. If specified, it would make every task into a multitask (like drake). It looks like people on the dev list miss that functionality so I'll make a pull request for that feature. You'll then have the code and can decide how you want to move forward.

_ michael

Owner

jimweirich commented Oct 23, 2012

On Oct 22, 2012, at 4:20 PM, Michael Bishop notifications@github.com wrote:

I also left out the -m option that I put in my original pull-request. If specified, it would make every task into a multitask (like drake). It looks like people on the dev list miss that functionality so I'll make a pull request for that feature. You'll then have the code and can decide how you want to move forward.

Yes, I would like a separate pull request for that feature.

-- Jim Weirich
-- jim.weirich@gmail.com

Contributor

michaeljbishop commented Oct 23, 2012

I added the pull request (for -m).

One other thing occurred to me: I wanted to make sure you knew that Rake::ThreadPool, as it is checked in, is public and adds three API methods to Rake.

  • initialize( thread_count=nil )
  • future(*args,&block)
  • join

I'm not sure what the default behavior of ThreadPool.new should be. Currently it places no limit on the number of threads in the pool, which is essentially the same as not having a thread pool. On the other hand, I didn't know what default pool size I should assume.

Perhaps the thread_count parameter should not be optional?

So glad to see this finally made it to mainline!
