OpenMP Support of NNPACK #26

Closed
Darwin2011 opened this issue Aug 22, 2016 · 7 comments

Comments

@Darwin2011

Hi, @Maratyszcza

Currently NNPACK uses its own thread pool implementation rather than OpenMP.
Why does NNPACK use pthreads rather than OpenMP? Will OpenMP parallelism be supported in the future?
For me, OpenMP is considerably easier to use.

Thanks.
Best Regards

@Maratyszcza
Owner

Here is the list of reasons why NNPACK doesn't use OpenMP (ordered by importance):

  1. Not all platforms support OpenMP. In particular, the default compilers in Xcode and (Portable) Native Client do not support it.
  2. pthreadpool is a small threading library with well-documented source code. It is easy to modify for the additional needs of NNPACK. In contrast, OpenMP implementations are huge, intertwined with compiler front-ends and internal representations, and have a steep learning curve.
  3. NNPACK uses size_t everywhere, but OpenMP requires int loop counters. Threading on top of OpenMP would be a source of subtle bugs.
  4. Unlike OpenMP, pthreadpool can parallelize 2D loops (and does it without division for each loop iteration).
  5. pthreadpool uses work stealing for balancing load between different threads. It is a more efficient scheduling strategy than the ones implemented in OpenMP, and it produces predictable memory access patterns, which NNPACK relies on.
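The point about 2D loops without a per-iteration division can be illustrated with a minimal sketch. This is not pthreadpool's actual code; the names `task_2d_fn`, `run_chunk_2d`, and `accumulate_index` are hypothetical. It assumes a worker is handed a contiguous chunk of the flattened 2D iteration space, recovers the starting (i, j) pair with one division, and then advances by increment-and-carry.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: invoked once per (i, j) pair. */
typedef void (*task_2d_fn)(void *context, size_t i, size_t j);

/*
 * Process linear indices [start, end) of an iteration space that is
 * range_j elements wide. One division and one modulo recover the starting
 * (i, j); afterwards the indices advance by increment-and-carry, so the
 * loop body itself performs no division.
 */
static void run_chunk_2d(task_2d_fn fn, void *context,
                         size_t range_j, size_t start, size_t end) {
    assert(range_j > 0);
    size_t i = start / range_j;
    size_t j = start % range_j;
    for (size_t linear = start; linear < end; linear++) {
        fn(context, i, j);
        if (++j == range_j) {
            j = 0;
            i++;
        }
    }
}

/* Example callback: adds 10 * i + j to the size_t that context points to. */
static void accumulate_index(void *context, size_t i, size_t j) {
    *(size_t *)context += 10 * i + j;
}
```

A thread pool built this way would split the flattened range into one chunk per thread and call `run_chunk_2d` on each chunk; all counters stay `size_t` throughout.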

@Darwin2011
Author

Thanks!

@Darwin2011
Author

Thanks for your explanation, @Maratyszcza

  • OpenMP 3.0 now supports both signed and unsigned loop counters.
  • OpenMP can also collapse a two-level loop nest, even though it is not graceful.
  • For me, the real problem is that on a dual-socket Haswell machine the CPU utilization is low. NNPACK cannot fully utilize all cores for thread parallelism.
  • I have recently been trying to tune NNPACK with OpenMP and hope you can give me some help.
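A minimal sketch of the two OpenMP features mentioned above, assuming a compiler with OpenMP 3.0 or later: `size_t` loop counters and the `collapse(2)` clause. The function `scale_matrix` is a hypothetical example and is not part of NNPACK.

```c
#include <stddef.h>

/*
 * Scale every element of a rows x cols row-major matrix in place.
 * collapse(2) asks OpenMP to fuse both loops into one iteration space
 * that is divided among the threads; unsigned counters such as size_t
 * are accepted since OpenMP 3.0. When the code is compiled without
 * OpenMP enabled, the pragma is ignored and the loops run serially.
 */
static void scale_matrix(float *data, size_t rows, size_t cols, float factor) {
    #pragma omp parallel for collapse(2)
    for (size_t i = 0; i < rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            data[i * cols + j] *= factor;
        }
    }
}
```

With GCC or Clang, OpenMP is enabled with the `-fopenmp` flag. How the collapsed iteration space is mapped back to (i, j) is left to the OpenMP implementation.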

@Maratyszcza
Owner

@Darwin2011 Poor dual-socket performance is not related to the threading library, but rather the result of the assumption in NNPACK that all cores share L3 cache. When this assumption doesn't hold, the cores evict each other's cache lines.

@Darwin2011
Author

@Maratyszcza
Any plan to fix this? I can also work on this if you can give me some help.

@Maratyszcza
Owner

@Darwin2011 I'm thinking about a plan to improve multi-socket scaling. Fundamentally, two problems need to be solved:

  1. NNPACK assumes that all threads in a thread pool share L3 cache. NNPACK arranges computations in such a way that blocks of L3 cache prefetched by different cores are reused by all cores.
  2. NNPACK's memory allocation is not NUMA-aware, and all memory allocation is done on the calling thread, which means memory is allocated on the NUMA node that called the NNPACK function. Very likely, you could get better performance by running the NNPACK-linked application with numactl --interleave=all.

@Darwin2011
Author

Can I just split the input images into two streams (one stream per socket) and prepare a separate thread pool for each stream?
