
Combined parfor chunking and caching PRs. #7625

Merged: 113 commits merged into numba:main on Jun 24, 2022

Conversation

@DrTodd13 (Collaborator) commented Dec 6, 2021

This replaces #6025 and #7522. There was overlap between these two PRs around using the dynamic thread count, so rather than delaying the merge I went ahead and combined them.

This combined PR provides an API for selecting a parfor chunk size, to address load balancing issues, and it eliminates all use of static thread counts in generated parfor code. As a result, parfor code (even with reductions) is now cacheable, and if you change the chunk size or thread count after reloading from the cache, the new values are applied correctly by the generated code.

Closes #2556
Closes #3144
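
For illustration, here is a minimal sketch of the chunk size API this PR adds (based on the description above and the set/get functions named in the commit log; treat the exact import locations as an assumption):

import numpy as np
from numba import njit, prange, set_parallel_chunksize

@njit(parallel=True, cache=True)
def triangular_work(n):
    # Iteration cost grows with i, so a schedule that splits the range
    # evenly load-balances poorly; smaller chunks spread the work.
    acc = 0.0
    for i in prange(n):
        acc += np.sin(0.001 * i) * i  # stand-in for increasingly costly work
    return acc

old = set_parallel_chunksize(8)       # returns the previous chunk size
try:
    result = triangular_work(100_000)
finally:
    set_parallel_chunksize(old)       # restore the saved value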

DrTodd13 and others added 30 commits July 22, 2020 15:34
…size, 2) set chunksize back to the default of 0 and then after the gufunc returns, restore the chunksize back to the previously saved value. This way, the current thread gets its default chunksize behavior inside the parallel region but goes back to its previous value when the region is over.
Co-authored-by: stuartarchibald <stuartarchibald@users.noreply.github.com>
Co-authored-by: stuartarchibald <stuartarchibald@users.noreply.github.com>
More details on how actual chunksize can differ from specification.
Moved code examples in docs to tests/doc_examples/test_parallel_chunksize.py.
Export (g,s)et_parallel_chunksize from numba.np.ufunc.
Fix withcontext parallel_chunksize doc string.
Change set_parallel_chunksize to return previous chunk size.
Use that return value to remove need for get_parallel_chunksize in some places.
Raise exception if negative value to set_parallel_chunksize.
…, the full reduction array is passed to all gufunc workers. They each get their threadid to work on just their slice of the full reduction array. This simplifies some of the internal reduction code. This frees the reduction array length from any association with the size of the schedule.
…e. Use the dynamic thread count when constructing the schedule so that the parallel=True function can be correctly cacheable.
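
The chunk size save/restore protocol described in the commits above can be sketched as follows (illustrative pseudocode only; the real sequence is emitted into Numba's generated parfor code, and run_parallel_region is a hypothetical helper):

from numba import get_parallel_chunksize, set_parallel_chunksize

def run_parallel_region(gufunc, *args):
    saved = get_parallel_chunksize()   # 1) save the current chunk size
    set_parallel_chunksize(0)          # 2) default (0) inside the region
    try:
        return gufunc(*args)           # workers run with default chunking
    finally:
        set_parallel_chunksize(saved)  # 3) restore after the gufunc returns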
@DrTodd13 added the "4 - Waiting on reviewer" label Jun 22, 2022
@sklam (Member) commented Jun 22, 2022

The latest changes look good to me.

@stuartarchibald (Contributor) left a comment

Thanks for the update @DrTodd13. I just noticed there's a near-duplicate cache test file name that needs addressing (see inline comment). I also reviewed all the outstanding queries that got lost in the long review and have commented on those; all are resolved with the exception of #7625 (comment), which is still of concern. I'm going to give this a run through the build farm now, on the basis that public CI will be sufficient to cover the minor change resulting from merging the cache test files. Thanks again!

@@ -0,0 +1,37 @@
"""

I just noticed that there's another file called parfors_cache_usecases.py (the difference is the s after parfor). I think this file should be merged into that one and the corresponding cache test updated to reflect the change.


docs/source/developer/threading_implementation.rst (outdated; resolved)
numba/core/types/functions.py (outdated; resolved)
@@ -464,3 +466,70 @@ def _mutate_with_block_callee(blocks, blk_start, blk_end, inputs, outputs):
block=ir.Block(scope=scope, loc=loc),
outputs=outputs,
)

class _ParallelChunksize(WithContext):

I think, as it's private, we can just move it later as needed.
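
For context, the with-statement form this class implements looks like the following in use (a sketch assuming parallel_chunksize is importable from the top-level numba namespace, per this PR's docs):

import numpy as np
from numba import njit, prange, parallel_chunksize

@njit(parallel=True)
def scaled(arr):
    out = np.empty_like(arr)
    # Only the prange inside the with-block uses 4-iteration chunks;
    # the previous chunk size is restored when the block exits.
    with parallel_chunksize(4):
        for i in prange(arr.size):
            out[i] = arr[i] * 2.0
    return out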

@@ -648,6 +663,18 @@ def impl():
return impl


@intrinsic
def _iget_num_threads(typingctx):
_launch_threads()

parfor lowering could use get_num_threads, but I'm reluctant to add more typing queries into lowering; it makes things harder to debug.
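
For reference, the dynamic thread count query being discussed is the public numba.get_num_threads API; a small sketch of how it behaves:

from numba import njit, get_num_threads, set_num_threads

@njit(parallel=True)
def report_threads():
    # Resolved at run time rather than baked in at compile time, which
    # is what keeps the generated parfor code safely cacheable.
    return get_num_threads()

set_num_threads(4)
print(report_threads())  # 4, even if the function was compiled earlier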

Comment on lines +526 to +531
static void
add_task(void *fn, void *args, void *dims, void *steps, void *data)
{
add_task_internal(fn, args, dims, steps, data, 0);
}


Agree.

Comment on lines +1316 to +1317
gufunc_txt += " " + param_dict[var] + \
"=" + param_dict[arr] + "[" + gufunc_thread_id_var + "]\n"

Thanks for explaining.
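
To unpack what the generated line above does: each assignment has the form var = reduction_array[thread_id], so every worker reads and writes only its own slice of the full reduction array. A hypothetical sketch of a generated sum-reduction worker (all names illustrative, not Numba's actual identifiers):

def gufunc_body(sched, redarr, arr):
    tid = _get_thread_id()    # assumed helper returning this worker's id
    acc = redarr[tid]         # this thread's slot in the reduction array
    for i in range(sched[0], sched[1] + 1):
        acc += arr[i]
    redarr[tid] = acc         # write back the partial result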


get_num_threads = cgutils.get_or_insert_function(
builder.module,
llvmlite.ir.FunctionType(llvmlite.ir.IntType(types.intp.bitwidth), []),
"get_num_threads")

num_threads = builder.call(get_num_threads, [])
current_chunksize = builder.call(get_chunksize, [])

I think we covered this in #7625 (comment)

@stuartarchibald (Contributor)

@DrTodd13 I've opened IntelLabs#72 to address the duplication of test files; please could you take a look and, if you approve, merge it in. Many thanks.

@stuartarchibald stuartarchibald mentioned this pull request Jun 23, 2022
@stuartarchibald (Contributor)

RE the outstanding comment from #7625 (comment): PR #8186 has 991a965, which removes the proposed unification method on the ExternalFunctionPointer type, and the tests at least pass in public CI. Is this sufficient evidence to suggest it is not needed?

@gmarkall (Member)

gpuci run tests (just running this as there are some changes to numba.core - I'm pretty sure it won't affect anything negatively in CUDA, but just double-checking here)

Refactor parfor cache tests to make use of existing code.
Co-authored-by: stuartarchibald <stuartarchibald@users.noreply.github.com>
@sklam (Member) commented Jun 23, 2022

one unresolved comment: https://github.com/numba/numba/pull/7625/files#r904748398

Co-authored-by: stuartarchibald <stuartarchibald@users.noreply.github.com>
@sklam (Member) commented Jun 23, 2022

A Windows test failed:


======================================================================
FAIL: test_caller (numba.tests.test_parfors_caching.TestParforsCache)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\a\1\s\numba\tests\test_parfors_caching.py", line 45, in test_caller
    self.run_test(f, num_funcs=3)
  File "D:\a\1\s\numba\tests\test_parfors_caching.py", line 22, in run_test
    self.assertPreciseEqual(f(ary), f.py_func(ary))
  File "D:\a\1\s\numba\tests\support.py", line 390, in assertPreciseEqual
    self.fail("when comparing %s and %s: %s" % (first, second, failure_msg))
AssertionError: when comparing [0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1] and [0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1]: 0.1 != 0.10000000000000003

----------------------------------------------------------------------

@stuartarchibald (Contributor)

> A Windows test failed: FAIL: test_caller (numba.tests.test_parfors_caching.TestParforsCache) ... 0.1 != 0.10000000000000003 (full traceback quoted above)

It's strange that this should suddenly start failing. It looks like a minor numerical error, probably just from using reductions/accumulation in here:

import numpy as np
from numba import njit

@njit(parallel=True, cache=True)
def arrayexprs_case(arr):
    return arr / arr.sum()

@njit(parallel=True, cache=True)
def prange_case(arr):
    out = np.zeros_like(arr)
    c = 1 / arr.sum()
    for i in range(arr.size):
        out[i] = arr[i] * c
    return out

@njit(cache=True)
def caller_case(arr):
    return prange_case(arrayexprs_case(arr))

I think the "fix" is to use np.testing.assert_allclose in the check_module and run_module methods in https://github.com/numba/numba/pull/7625/files#diff-89333c093ac43075778fe1a5bdd16ed10fd6380e4a76dc5e9e8fdb148596f679

@stuartarchibald (Contributor) left a comment

Thanks for all your work on this @DrTodd13!

@stuartarchibald added the "4 - Waiting on CI" label and removed the "4 - Waiting on reviewer" label Jun 24, 2022
@sklam (Member) commented Jun 24, 2022

smoketesting at BFID numba_smoketest_cpu_yaml_110

@sklam added the "Pending BuildFarm", "5 - Ready to merge", and "BuildFarm Passed" labels and removed the "4 - Waiting on CI" and "Pending BuildFarm" labels Jun 24, 2022
@sklam sklam merged commit 2236cd2 into numba:main Jun 24, 2022
@gmarkall (Member)

🎉
