Combined parfor chunking and caching PRs. #7625
Conversation
…re calling set_parallel_chunksize.
…size, 2) set chunksize back to the default of 0 and then after the gufunc returns, restore the chunksize back to the previously saved value. This way, the current thread gets its default chunksize behavior inside the parallel region but goes back to its previous value when the region is over.
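The save/restore behaviour described above can be sketched in plain Python (illustrative only, not Numba's actual implementation; all names here are hypothetical stand-ins):

```python
_chunksize = 0  # module-level "current chunksize"; 0 means default behavior

def set_parallel_chunksize(n):
    """Set the chunksize and return the previous value (mirrors the PR's API)."""
    global _chunksize
    if n < 0:
        raise ValueError("chunksize must be non-negative")
    prev, _chunksize = _chunksize, n
    return prev

def run_gufunc_region(body):
    # 1) save the caller's chunksize, 2) reset to the default (0) for the
    # parallel region, then restore the saved value once the region is over.
    saved = set_parallel_chunksize(0)
    try:
        return body()
    finally:
        set_parallel_chunksize(saved)
```

Because the setter returns the previous value, the save/restore needs no separate getter, which matches the later change of making set_parallel_chunksize return the previous chunk size.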
Co-authored-by: stuartarchibald <stuartarchibald@users.noreply.github.com>
More details on how actual chunksize can differ from specification. Moved code examples in docs to tests/doc_examples/test_parallel_chunksize.py. Export (g,s)et_parallel_chunksize from numba.np.ufunc. Fix withcontext parallel_chunksize doc string. Change set_parallel_chunksize to return previous chunk size. Use that return value to remove need for get_parallel_chunksize in some places. Raise exception if negative value to set_parallel_chunksize.
…, the full reduction array is passed to all gufunc workers. They each get their threadid to work on just their slice of the full reduction array. This simplifies some of the internal reduction code. This frees the reduction array length from any association with the size of the schedule.
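As an illustration of the layout described above (plain Python with hypothetical helper names, not Numba's gufunc machinery): every worker receives the full reduction array but writes only the slice indexed by its thread id, and the per-thread slices are combined after the parallel region.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, num_threads):
    # One partial-result slot per worker thread; the array length is tied
    # to the thread count, not to the size of the schedule.
    partial = np.zeros(num_threads, dtype=data.dtype)
    chunks = np.array_split(data, num_threads)

    def worker(tid):
        # Each worker reduces its chunk into its own slice of `partial`.
        partial[tid] = chunks[tid].sum()

    with ThreadPoolExecutor(max_workers=num_threads) as ex:
        list(ex.map(worker, range(num_threads)))
    # Final cross-thread combination happens outside the parallel region.
    return partial.sum()
```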
…e. Use the dynamic thread count when constructing the schedule so that the parallel=True function can be correctly cacheable.
…n array allocation.
The latest changes look good to me.
Thanks for the update @DrTodd13. I just noticed there's a near-duplicate cache test file name that needs addressing (see inline comment). I also reviewed all the outstanding queries that got lost in the long review and have commented on those; all are resolved with the exception of #7625 (comment), which is still of concern. I'm going to give this a run through the build farm now, on the basis that public CI will be sufficient to deal with the minor change resulting from merging the cache test files. Thanks again!
numba/tests/parfor_cache_usecases.py
Outdated
@@ -0,0 +1,37 @@
"""
I just noticed that there's another file called parfors_cache_usecases.py (the difference is the "s" after "parfor"). I think this file should be merged into that one and the corresponding cache test updated to reflect the change.
See IntelLabs#72
@@ -464,3 +466,70 @@ def _mutate_with_block_callee(blocks, blk_start, blk_end, inputs, outputs):
        block=ir.Block(scope=scope, loc=loc),
        outputs=outputs,
    )

class _ParallelChunksize(WithContext):
I think as it's private we can just move it later as needed.
@@ -648,6 +663,18 @@ def impl():
    return impl


@intrinsic
def _iget_num_threads(typingctx):
    _launch_threads()
parfor lowering could use get_num_threads, but I'm reluctant to add more typing queries into lowering; it makes things harder to debug.
static void
add_task(void *fn, void *args, void *dims, void *steps, void *data)
{
    add_task_internal(fn, args, dims, steps, data, 0);
}
Agree.
gufunc_txt += " " + param_dict[var] + \
    "=" + param_dict[arr] + "[" + gufunc_thread_id_var + "]\n"
Thanks for explaining.
get_num_threads = cgutils.get_or_insert_function(
    builder.module,
    llvmlite.ir.FunctionType(llvmlite.ir.IntType(types.intp.bitwidth), []),
    "get_num_threads")

num_threads = builder.call(get_num_threads, [])
current_chunksize = builder.call(get_chunksize, [])
I think we covered this in #7625 (comment)
@DrTodd13 I've opened IntelLabs#72 to address the duplication of test files, please could you take a look and if you approve merge in, many thanks.
RE the outstanding comment from: #7625 (comment), PR #8186 has 991a965, which removes the proposed unification method on the …
gpuci run tests (just running this as there are some changes to …
Refactor parfor cache tests to make use of existing code.
Co-authored-by: stuartarchibald <stuartarchibald@users.noreply.github.com>
one unresolved comment: https://github.com/numba/numba/pull/7625/files#r904748398
Co-authored-by: stuartarchibald <stuartarchibald@users.noreply.github.com>
A windows test failed:
It's strange that this should suddenly start failing. It looks like a minor numerical error, probably just from using reductions/accumulation in here: numba/tests/parfors_cache_usecases.py, lines 9 to 25 at 33e94b0.
I think the "fix" is to use …
…ly equal check for output of summations.
Thanks for all your work on this @DrTodd13!
smoketesting at BFID …
🎉
This replaces #6025 and #7522. There was overlap between these two PRs around using the dynamic thread count so rather than delaying the merge I went ahead and combined them.
This combined PR provides an API for selecting a parfor chunk size to deal with load balancing issues, and it eliminates all use of static thread counts in generated parfor code. Thus, parfor code (even with reductions) is now cacheable, and if you change the chunksize or thread count after reloading from cache, the new values are applied correctly.
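To illustrate how a chunk size changes the schedule (a hypothetical sketch, not Numba's actual scheduler): with the default chunksize of 0 the iteration space is split roughly evenly across the threads, while a positive chunksize hands out fixed-size chunks, which helps load balancing when iteration costs vary.

```python
def schedule(n_iters, num_threads, chunksize=0):
    """Return (start, stop) pairs covering range(n_iters)."""
    if chunksize == 0:
        # Default: one roughly equal chunk per thread.
        base, rem = divmod(n_iters, num_threads)
        sizes = [base + (1 if i < rem else 0) for i in range(num_threads)]
    else:
        # User-selected chunksize: hand out fixed-size chunks until the
        # iteration space is exhausted (the last chunk may be smaller).
        sizes, left = [], n_iters
        while left > 0:
            sizes.append(min(chunksize, left))
            left -= sizes[-1]
    bounds, start = [], 0
    for s in sizes:
        bounds.append((start, start + s))
        start += s
    return bounds
```

Note that with a positive chunksize the number of chunks is decoupled from the thread count, which is why smaller chunks let faster threads pick up extra work.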
Closes #2556
Closes #3144