Add a chapter about threading layer into the docs #2848
Conversation
Pull the changes from main branch
@emmwalsh Please review this PR, as it mostly deals with oneDAL docs.
Produced docs can be found here: https://dev.azure.com/daal/DAAL/_build/results?buildId=36713&view=artifacts&pathAsName=false&type=publishedArtifacts
CONTRIBUTING.md
Outdated
### Threading Layer

In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form so called [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
- In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form so called [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
+ In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form are called the [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
cpp/daal/src/threading/threading.h
Outdated
virtual ~tls()
{
    d->del(voidLambda);
    delete d;
    _daal_del_tls_ptr(tlsPtr);
}

/// Access a local data of a thread by value
///
/// @return When first ionvoced by a thread, a lambda provided to the constructor is
- /// @return When first ionvoced by a thread, a lambda provided to the constructor is
+ /// @return When first invoked by a thread, a lambda provided to the constructor is
threader_for
************

Lets consider you need to compute an elementwise sum of two arrays and store the results
- Lets consider you need to compute an elementwise sum of two arrays and store the results
+ Consider a case where you need to compute an elementwise sum of two arrays and store the results
.. include:: ../includes/threading/sum-sequential.rst

There are several options available in oneDAL's threading layer to let the iterations of this code
- There are several options available in oneDAL's threading layer to let the iterations of this code
+ There are several options available in the threading layer of oneDAL to let the iterations of this code
The term "oneDAL's" is not allowed by namesdb.intel.com. There might be different rules if you are referring to the open source version, but for the most part, the open source name should follow the same rules as the branded name.
That is new information for me. Thank you!
I have fixed all the mentions in this PR.
Parallel computations can be performed in two steps:

1. Compute partial dot product at each threaded.
2. Perform a reduction: Sum the partial results from all threads to compute the final dot product.
Instead of "sum the partial results" should it be "add the partial results"?
Probably yes. Not sure as I am not a native speaker. Modified to "add the partial results".
by each iteration.

In the cases when it is known that the iterations perform equal amount of work it
might be benefitial to use pre-defined mapping of the loop's iterations to threads.
- might be benefitial to use pre-defined mapping of the loop's iterations to threads.
+ might be benefitial to use predefined mapping of the loop's iterations to threads.
******************

It is allowed to have nested parallel loops within oneDAL.
What is important to know is that
- What is important to know is that
+ It is important to know that:
-- `oneTBB documentation <https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-api-reference/2021-13/work-isolation.html>`_

In practice this means that, for example, a thread-local variable might unexpectedly
- In practice this means that, for example, a thread-local variable might unexpectedly
+ In practice, this means that a thread-local variable might unexpectedly
calls another parallel algorithm, the inner one, within a parallel region.

The inner algorithm in this case can also be called solely, without additional nesting.
And we do not want to always make it isolated.
- And we do not want to always make it isolated.
+ And we do not always want to make it isolated.
The inner algorithm in this case can also be called solely, without additional nesting.
And we do not want to always make it isolated.

For the cases like that oneDAL provides ``daal::ls``. Its ``local()`` method always
- For the cases like that oneDAL provides ``daal::ls``. Its ``local()`` method always
+ For the cases like that, oneDAL provides ``daal::ls``. Its ``local()`` method always
LGTM!!
cpp/daal/src/threading/threading.h
Outdated
/// Let ``t`` be the number of threads available to oneDAL.
///
/// Then the number of iterations processed by each threads (except maybe the last one)
/// is copmputed as:
- /// is copmputed as:
+ /// is computed as:
cpp/daal/src/threading/threading.h
Outdated
/// is copmputed as:
/// ``nI = (n + t - 1) / t``
///
/// Here is how the work in split across the threads:
- /// Here is how the work in split across the threads:
+ /// Here is how the work is split across the threads:
computations on CPU.

But oneTBB is not used in the code of oneDAL algorithms directly. The algorithms rather
use custom primitives that either wrap oneTBB functionality or are inhome developed.
- use custom primitives that either wrap oneTBB functionality or are inhome developed.
+ use custom primitives that either wrap oneTBB functionality or are developed in-house.
|
||
The API of the layer is defined in | ||
`threading.h <https://github.com/oneapi-src/oneDAL/blob/main/cpp/daal/src/threading/threading.h>`_. | ||
Please be aware that those APIs are not publicly defined. So they can be changed at any time |
- Please be aware that those APIs are not publicly defined. So they can be changed at any time
+ Please be aware that those APIs are not publicly defined, so they can be changed at any time
``daal::static_threader_for`` and ``daal::static_tls`` allow to implement static
work scheduling within oneDAL.
- ``daal::static_threader_for`` and ``daal::static_tls`` allow to implement static
- work scheduling within oneDAL.
+ ``daal::static_threader_for`` and ``daal::static_tls`` allow implementation of
+ static work scheduling within oneDAL.
Nested parallelism
******************

It is allowed to have nested parallel loops within oneDAL.
- It is allowed to have nested parallel loops within oneDAL.
+ OneDAL supports nested parallel loops.
Thank you @bdmoore1, @ethanglaser and @emmwalsh for the thorough reviews.
@keeranroth @rakshithgb-fujitsu - please check on this one - how clearly the topic is described and whether you would like any additional details. And for the future - topics that might be worth adding beyond the threading topic.
by each iteration.

In the cases when it is known that the iterations perform an equal amount of work, it
might be benefitial to use predefined mapping of the loop's iterations to threads.
- might be benefitial to use predefined mapping of the loop's iterations to threads.
+ might be beneficial to use predefined mapping of the loop's iterations to threads.
Just one more minor spelling adjustment mentioned above. Otherwise this looks great!
cpp/daal/src/threading/threading.h
Outdated
/// @tparam F Lambda function of type ``[/* captures */](int i) -> void``,
///           where ``i`` is the loop's iteration index, ``0 <= i < n``.
Just double checking, but I've always used single backtick for monospaced in doxygen. Is the double backtick valid?
Thanks for noticing this. We are using not only doxygen, but also reStructuredText + some Python pre-processing.
Double backtick is valid but it results in italic text.
I'm going to replace it with single backtick to have monospace.
cpp/daal/src/threading/threading.h
Outdated
@@ -223,14 +223,33 @@ inline void threader_func_break(int i, bool & needBreak, const void * a)
    lambda(i, needBreak);
}

/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is 2^31 - 1.
/// The work is scheduled dynamically across threads.
The comment suggests that the work will be executed in parallel, but it may not be the case. It depends on the implementation behind the abstraction. This function passes the parameters to the threading layer, which may execute the work in parallel. And must the work be scheduled dynamically? Can the threading layer decide to just split n equally into however many threads there are (this is the static schedule in OpenMP parlance)? Rewording this to something like:
Pass a function to be executed in a for loop to the threading layer. The work is scheduled over threads by the threading layer.
Also, it would be good to mention loop-carried dependencies. Is the assumption here that the work in each loop iteration is independent?
The idea of this API is that it uses the default scheduling provided by the underlying implementation.
Of course, the execution might not be parallel for various reasons. For example, if only one thread is available.
It is expected that the iterations of the loop are logically independent, i.e. there are no recurrences among the iterations, but they might access the same data on read or write.
In the latter case, additional synchronization like the use of atomics, mutual exclusion or thread-local storage is required.
I will reword the description to make it more clear.
cpp/daal/src/threading/threading.h
Outdated
///
/// @param[in] n        Number of iterations in the for loop.
/// @param[in] reserved Parameter reserved for the future. Currently unused.
/// @param[in] lambda   Lambda function that defines iteration's body.
- /// @param[in] lambda Lambda function that defines iteration's body.
+ /// @param[in] lambda Lambda function that defines the loop body.
cpp/daal/src/threading/threading.h
Outdated
/// Execute the for loop defined by the input parameters in parallel.
/// The maximal number of iterations in the loop is 2^31 - 1.
/// The work is scheduled dynamically across threads.
/// The iteration space is chunked using oneTBB ``simple_partitioner``
Is TBB always going to be the threading layer? It is now, but can this be different in the future? If this function or interface is TBB specific, is this documented somewhere? Might also be worth considering changing the name of the functions, e.g. tbb_for_simple
I think it is impossible to say whether TBB will always be the threading layer or not. For now we do not have plans to migrate to other threading technologies like OpenMP, C++ std threads and so on, but who knows what happens in the future?
I think I need to reword this and remove the TBB mention from the description.
The idea of this API is that the iterations are not grouped together; each iteration is considered as a separate task.
Other APIs can group iterations so that a consecutive block of iterations is executed by one thread. This API tries to avoid this if possible.
cpp/daal/src/threading/threading.h
Outdated
@@ -321,10 +397,20 @@ class tls_deleter_ : public tls_deleter
    virtual void del(void * a) { delete static_cast<lambdaType *>(a); }
};

/// Thread-local storage (TLS).
/// Can change its local variable after a nested parallel constructs.
/// @note Use carefully in case of nested parallel regions.
How common is it to have nested parallelism, and how much do we want users of oneDAL to use it? If at all possible, I recommend not using nested parallelism, as this gets confusing quickly, and requires a lot of care. I would suggest either putting more of a health warning on here, e.g.
@note Thread local storage in nested parallel regions is, in general, not thread local. Nested parallelism should not be used in most cases, and if it is, extra care must be taken with thread local values.
But maybe just leaving out the note is better, as this starts to open a discussion point
I think the confusion here is because those APIs are not intended to be used by oneDAL's users.
The APIs are just not available as part of oneDAL's release header files.
Those are only available to the developers.
Inside oneDAL we use nested parallelism in many cases. So it is important to warn the developers when using certain constructs might lead to issues.
Will reformulate the note as well.
.. include:: ../includes/threading/dot-parallel-partial-compute.rst

To compute the final result it is requred to reduce TLS's partial results over all threads
- To compute the final result it is requred to reduce TLS's partial results over all threads
+ To compute the final result it is required to reduce each thread's partial results
.. include:: ../includes/threading/dot-parallel-reduction.rst

Local memory of the threads should also be released when it is no longer needed.
- Local memory of the threads should also be released when it is no longer needed.
+ Local memory of the threads should be released when it is no longer needed.
Local memory of the threads should also be released when it is no longer needed.

The complete parallel verision of dot product computations would look like:
- The complete parallel verision of dot product computations would look like:
+ The complete parallel version of dot product computations would look like:
This strategy is beneficial when it is difficult to estimate the amount of work performed
by each iteration.

In the cases when it is known that the iterations perform an equal amount of work, it
In all cases where you know how much work there is to do in the parallel regions, it will be faster to map the work onto threads using a static schedule. Drop the use of "it might be beneficial to use ..." in favour of "it is more performant to use ...".
-- `oneTBB documentation <https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-api-reference/2021-13/work-isolation.html>`_

In practice, this means that a thread-local variable might unexpectedly
Can you define what you mean by thread-local? I usually think of thread-local data to mean that only a single thread can read or write the value in thread-local storage. The thread-local storage class `tls` is intended to be used for thread-local storage, but it cannot do so in the case of nested parallelism. The only thread-local storage which is guaranteed to be thread-local is in the leaf nodes of a graph of threads.
So something along the lines of:
If we define thread-local data to be data that can only be read and written by a single thread, then any thread which spawns other threads in general cannot have thread-local variables. This is true for variables which are defined as thread-local. This means that great care is required when using thread-local storage constructs in nested parallel regions to avoid issues caused by concurrent data access and ownership (e.g. data corruption due to races; segmentation faults; and deadlocks).
By thread-local data I mean the data that is managed by a particular thread.
The issue here is that there is a difference between the threads and logical tasks (loop's iterations here).
One thread can be assigned to execute several iterations of the loop.
In case of nested loops, the specifics of oneTBB is that when the thread executes an outer iteration `i` of a nested loop, it becomes a so-called waiting thread while it waits for the completion of the inner parallel loop. And then this waiting thread can be assigned to perform another iteration `j` of the outer loop.
So, it might first complete the iteration `j` of the outer loop, which will change the state of thread-local values, and then return back to iteration `i`.
I tried to reflect this behavior in the documentation's text.
In general I do not agree that a thread that spawns other threads cannot have thread-local data. But I agree that thread-local data should be used with great care in those cases.
@keeranroth, I think I have addressed most of your feedback.
Looking good. Mostly picked up on some typos, but wanted to double check on the documentation mentioning passing lambda functions to the threading layer. I don't think that is a restriction, and using a functor would have a similar effect, and there is no restriction on using a function that updates global state either (although that seems like a bad idea, it's not explicitly forbidden)?
### Threading Layer

In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form are called the [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
- In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form are called the [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the layer to implement parallel algorithms.
+ In the source code of the algorithms, oneDAL does not use threading primitives directly. All the threading primitives used within oneDAL form the [threading layer](http://oneapi-src.github.io/oneDAL/contribution/threading.html). Contributors should leverage the primitives from the threading layer to implement parallel algorithms.
Also, if you are defining the name, maybe `Threading Layer` should be a proper noun with capitalization? I think this might be a nit too far, but may as well leave a comment whilst I'm in the area.
cpp/daal/src/threading/threading.h
Outdated
/// Data dependencies between the iterations are allowed, but may requre the use
/// of synchronization primitives.
///
/// @tparam F Lambda function of type `[/* captures */](int i) -> void`,
Nit: This doesn't need to be a lambda, right? It can be any callable object, so a function or functor?
Changed the mentions of 'lambda function' to the more general 'callable object'. Is it more or less OK?
To your main comment about the possibility of updating the global state:
Yes, there are currently no programmatic restrictions on updating the global state. But this PR is focused on documenting the present state of things, not on fixing/improving the code.
Also, this API is not exposed to oneDAL users. Only oneDAL developers can use this functionality. And I hope it is used and will be used with care.
F local(size_t tid)
{
-    if (_storage)
+    if (_storage && tid < _nThreads)
Is it possible for this function to be called from within a nested parallel region, where the thread local storage has not yet been instantiated? If so, it could be that nested threads race to create the storage, and this function needs to be made thread safe.
Also, can this be used to get the thread local storage of another thread? Seems as though it might be possible from a quick glance.
If either of the above situations occur, there is a data race in here. I think the documentation should state that it is not safe to call this function from within a parallel region
As I mentioned previously, this PR is focused on documenting the present state of things.
Yes, it is not a good idea to call this `local(tid)` method from a nested parallel region.
I tried to describe this issue in `threading.rst` (the "Nested Parallelism" chapter) to make it more clear.
The purpose of adding this comparison is not to change the logic of the code, but just to prevent out-of-range memory access. I can revert it to the mainline version, because the intention was not to touch the code at all - only to document it.
cpp/daal/src/threading/threading.h
Outdated
@@ -499,13 +668,23 @@ class ls : public tlsBase
    lsPtr = _isTls ? _daal_get_tls_ptr(a, tls_func<lambdaType>) : _daal_get_ls_ptr(a, tls_func<lambdaType>);
}

/// Destroys the memory associated with local storage.
///
/// @note `ls` does not release the memory allocated by a lambda-function
nit: Make sure that uses of `lambda-function` or `lambda function` are consistent. I prefer `lambda function`, but it's not a strong preference.
Looks good to me. Thanks for the update