Parallel execution, Multithreading #255

Open · 4 tasks · breznak opened this issue Feb 9, 2019 · 6 comments
Labels: code, optimization, question
Milestone: optimization

Comments
@breznak (Member) commented Feb 9, 2019

Aim: run algorithms in parallel as much as possible, with maximal efficiency.

This will be an umbrella issue based on the following idea:

> I see 3 different levels of parallelisation that could be implemented:

— from #214 (comment)

EDIT:

TODO:

  • can we estimate the number of threads on the system? (see the sketch below)
    • at runtime
    • at compile time with CMake?
    • worst case, we pass a param -DNUMTHREADS=8
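
A minimal sketch of how the two options could combine, assuming a hypothetical NUMTHREADS define supplied by CMake (e.g. `-DNUMTHREADS=8`) with `std::thread::hardware_concurrency()` as the runtime fallback:

```cpp
// Sketch only: NUMTHREADS is a hypothetical compile-time define
// (e.g. passed through CMake as -DNUMTHREADS=8); the name is illustrative.
#include <thread>
#include <cstdio>

static unsigned numThreads() {
#ifdef NUMTHREADS
  return NUMTHREADS;  // compile-time override from the build system
#else
  const unsigned n = std::thread::hardware_concurrency();  // runtime query
  return (n == 0) ? 1u : n;  // may legally return 0 if unknown; fall back to 1
#endif
}

int main() {
  std::printf("usable threads: %u\n", numThreads());
}
```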
@ctrl-z-9000-times (Collaborator)

Thanks for starting a tracker issue for this subject!

A computer has a fixed number of hardware threads, so it does not make sense to parallelize everything at every level. We should focus on 1 or 2 high-impact areas.

@dkeeney commented Feb 9, 2019

I agree, and NetworkAPI Regions seem to be a major candidate, mostly because there is very little interaction between them except for the data buffers.
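
To illustrate, a minimal sketch of region-level parallelism, assuming a hypothetical Region type whose compute() touches only its own buffers; the names are placeholders, not the actual NetworkAPI:

```cpp
// Hypothetical sketch: execute independent regions concurrently.
// "Region" and "compute()" are placeholders, not the real NetworkAPI types.
#include <thread>
#include <vector>

struct Region {
  void compute() { /* algorithm step; touches only this region's buffers */ }
};

void runRegionsInParallel(std::vector<Region>& regions) {
  std::vector<std::thread> workers;
  workers.reserve(regions.size());
  for (auto& r : regions)
    workers.emplace_back([&r] { r.compute(); });  // one thread per region
  for (auto& w : workers)
    w.join();  // barrier: all regions finish before data buffers are exchanged
}
```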

@breznak (Member, Author) commented Feb 9, 2019

> so it does not make sense to parallelize everything at every level.

True, I was considering this too.
On the other hand, you can now easily get a (Ryzen) CPU with 16/32+ threads, and there is specialized HW, clusters, ...

I think this ("do not parallelize too much") will be a disadvantage of the low-level #214 approach.

For the latter two, I'd suggest something like:

  • -DNUM_PARALLEL_REGIONS
  • NUM_THREADS_PER_REGION (better name? not per region, but per algorithm: SP, TM)

So if you have a large network, you'll get the best utilization by giving all threads to the simultaneously running regions. But in a case like the MNIST example (my motivation), we have one algorithm/region (SP), so one could instead give the threads to that.

Now, giving it a priority would make the process seamless (see the sketch after this list):

  • satisfy all running Regions first
    • give the rest to threads within an algorithm (SP)
    • or a region could specify its own demand (for the case where some region/load is significantly slower)
      • even leave the rest for threads-per-loop (the low-level Parallelism TS case, which is not coming anytime soon)
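
A minimal sketch of this thread-budget idea, assuming hypothetical parameters that mirror the flags proposed above (names are illustrative, not part of any actual API):

```cpp
// Hypothetical sketch: split a fixed thread budget between simultaneously
// running regions first, then hand the remainder to the algorithm inside each.
#include <algorithm>
#include <cstdio>

struct ThreadBudget {
  unsigned parallelRegions;   // regions executed simultaneously
  unsigned threadsPerRegion;  // threads given to the algorithm (SP, TM) inside
};

ThreadBudget allocate(unsigned totalThreads, unsigned runnableRegions) {
  ThreadBudget b;
  // Priority 1: satisfy all simultaneously runnable regions.
  b.parallelRegions = std::min(totalThreads, runnableRegions);
  // Priority 2: the rest goes to threads within each algorithm.
  b.threadsPerRegion =
      std::max(1u, totalThreads / std::max(1u, b.parallelRegions));
  return b;
}

int main() {
  // MNIST-like case: 16 threads, a single SP region -> all threads go to the SP.
  ThreadBudget b = allocate(16, 1);
  std::printf("regions=%u threads/region=%u\n",
              b.parallelRegions, b.threadsPerRegion);
}
```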

> and NetworkAPI Regions seem to be a major candidate, mostly because there is very little interaction between them except for the data buffers.

Definitely, and it would probably also be the first thing we achieve.
But see the MNIST case, and Region(preferredNumThreads=N) for slow workloads.

@dkeeney commented Feb 9, 2019

All of the algorithms are basically compute bound. This means that priorities will not get things done any faster, but spreading the work over multiple threads would help. My machine has 6 cores (12 logical threads) and they could all be put to work, but even there the best I could expect is around 10 times as fast.

Ultimately the biggest advantage would be to reduce the amount of work that needs to be done. Replace loops with other ways of organizing data so loops are not needed. Reduce the number of times that data must be copied.
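
On the copy-reduction point, a generic sketch (not code from this repo) of how passing buffers by reference, or moving them, avoids a copy at every pipeline stage:

```cpp
// Generic sketch: avoiding needless buffer copies between processing stages.
#include <utility>
#include <vector>

using Buffer = std::vector<float>;

// Taking the parameter by value copies the caller's buffer on every call.
Buffer processByValue(Buffer data) {
  for (float& x : data) x *= 2.0f;  // some per-element work
  return data;
}

// Borrowing by reference mutates in place: no copy at all.
void processInPlace(Buffer& data) {
  for (float& x : data) x *= 2.0f;
}

int main() {
  Buffer buf(1024, 1.0f);
  processInPlace(buf);                          // zero copies
  Buffer out = processByValue(std::move(buf));  // move instead of copy on handoff
  (void)out;
}
```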

@breznak (Member, Author) commented Feb 9, 2019

Yes, but even a speedup linear in the number of threads would be a nice win.

> Replace loops with other ways of organizing data so loops are not needed. Reduce the number of times that data must be copied.

Yep, that goes in parallel (sic) with this, #3. It would be nice if we could vectorize the for loops; that was my intention with SDR_t, but I'm not sure SDR helps with that (at least it prepares us for sparse data structures).
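
For context, a generic sketch (not the actual SDR_t code) of the kind of dense loop a compiler can auto-vectorize, and which sparse index lists tend to defeat:

```cpp
// Generic sketch of an auto-vectorizable loop over a dense SDR-like buffer.
// Contiguous data and a branch-free body let the compiler emit SIMD
// instructions (e.g. at -O2/-O3); indirect sparse indexing generally cannot.
#include <cstddef>
#include <cstdint>
#include <vector>

// Count the overlap between two dense binary vectors of equal size.
std::size_t overlap(const std::vector<uint8_t>& a,
                    const std::vector<uint8_t>& b) {
  std::size_t sum = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += a[i] & b[i];  // simple, branch-free: vectorizer-friendly
  return sum;
}
```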

@dkeeney commented Feb 9, 2019

Hmmm, 'vectorizing' the for loops does not get rid of the loops; they are just harder to find. What I mean is using things like map and set objects in places where it makes sense, to avoid having to iterate: go from O(n) to O(log n) types of things. But I don't know, it might require a whole new way of looking at the algorithms to make that work.
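
As a generic illustration of that O(n) → O(log n) shift (example code, not from this repo):

```cpp
// Generic sketch: membership test via linear scan vs. an ordered set.
#include <algorithm>
#include <set>
#include <vector>

// O(n): must walk the whole vector to find one element.
bool containsLinear(const std::vector<int>& v, int x) {
  return std::find(v.begin(), v.end(), x) != v.end();
}

// O(log n): the ordered container does the lookup; no explicit loop.
bool containsLogN(const std::set<int>& s, int x) {
  return s.find(x) != s.end();
}
```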
