Parallel execution, Multithreading #255

Open · 4 tasks · breznak opened this issue Feb 9, 2019 · 6 comments
Labels: code, optimization, question
Milestone: optimization

Comments
@breznak (Member) commented Feb 9, 2019

Aim: run algorithms in parallel as much as possible, with maximal efficiency.

This will be an umbrella issue based on the following idea:

> I see 3 different levels of parallelisation that could be implemented:

— from #214 (comment)

EDIT:

TODO:

  • can we estimate the number of threads on the system? (see the sketch below)
    • at runtime
    • at compile time with CMake?
    • worst case, we pass a param -DNUMTHREADS=8
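
A minimal sketch of how the two options could combine, assuming a hypothetical NUMTHREADS define supplied by CMake (e.g. `-DNUMTHREADS=8`) with `std::thread::hardware_concurrency()` as the runtime fallback:

```cpp
// Sketch only: NUMTHREADS is a hypothetical compile-time define
// (e.g. passed through CMake as -DNUMTHREADS=8); the name is illustrative.
#include <thread>
#include <cstdio>

static unsigned numThreads() {
#ifdef NUMTHREADS
  return NUMTHREADS;  // compile-time override from the build system
#else
  const unsigned n = std::thread::hardware_concurrency();  // runtime query
  return (n == 0) ? 1u : n;  // may legally return 0 if unknown; fall back to 1
#endif
}

int main() {
  std::printf("usable threads: %u\n", numThreads());
}
```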
@ctrl-z-9000-times (Collaborator)

Thanks for starting a tracker issue for this subject!

A computer has a fixed number of hardware threads, so it does not make sense to parallelize everything at every level. We should focus on 1 or 2 high-impact areas.

@dkeeney commented Feb 9, 2019

I agree, and NetworkAPI Regions seem to be a major candidate, mostly because there is very little interaction between them except for the data buffers.
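
To illustrate, a minimal sketch of region-level parallelism, assuming a hypothetical Region type whose compute() touches only its own buffers; the names are placeholders, not the actual NetworkAPI:

```cpp
// Hypothetical sketch: execute independent regions concurrently.
// "Region" and "compute()" are placeholders, not the real NetworkAPI types.
#include <thread>
#include <vector>

struct Region {
  void compute() { /* algorithm step; touches only this region's buffers */ }
};

void runRegionsInParallel(std::vector<Region>& regions) {
  std::vector<std::thread> workers;
  workers.reserve(regions.size());
  for (auto& r : regions)
    workers.emplace_back([&r] { r.compute(); });  // one thread per region
  for (auto& w : workers)
    w.join();  // barrier: all regions finish before data buffers are exchanged
}
```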

@breznak (Member, Author) commented Feb 9, 2019

> so it does not make sense to parallelize everything at every level.

True, I was considering this too.
On the other hand, you can now easily get a (Ryzen) CPU with 16/32+ threads, and there is specialized HW, clusters, ...

I think this ("do not parallelize too much") will be a disadvantage of the low-level #214 approach.

For the latter two, I'd suggest something like:

  • -DNUM_PARALLEL_REGIONS
  • NUM_THREADS_PER_REGION (better name? not per region, but per algorithm: SP, TM)

So if you have a large network, you'll get the best utilization by giving all threads to the simultaneously running regions. But in a case like the MNIST example (my motivation), we have one algorithm/region (SP), so one could instead give the threads to that.

Now, giving it a priority would make the process seamless (see the sketch after this list):

  • satisfy all running Regions first
    • give the rest to threads within an algorithm (SP)
    • or a region could specify its own demand (for the case where some region/load is significantly slower)
      • even leave the rest for threads-per-loop (the low-level Parallelism TS case, which is not coming anytime soon)
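
A minimal sketch of this thread-budget idea, assuming hypothetical parameters that mirror the flags proposed above (names are illustrative, not part of any actual API):

```cpp
// Hypothetical sketch: split a fixed thread budget between simultaneously
// running regions first, then hand the remainder to the algorithm inside each.
#include <algorithm>
#include <cstdio>

struct ThreadBudget {
  unsigned parallelRegions;   // regions executed simultaneously
  unsigned threadsPerRegion;  // threads given to the algorithm (SP, TM) inside
};

ThreadBudget allocate(unsigned totalThreads, unsigned runnableRegions) {
  ThreadBudget b;
  // Priority 1: satisfy all simultaneously runnable regions.
  b.parallelRegions = std::min(totalThreads, runnableRegions);
  // Priority 2: the rest goes to threads within each algorithm.
  b.threadsPerRegion =
      std::max(1u, totalThreads / std::max(1u, b.parallelRegions));
  return b;
}

int main() {
  // MNIST-like case: 16 threads, a single SP region -> all threads go to the SP.
  ThreadBudget b = allocate(16, 1);
  std::printf("regions=%u threads/region=%u\n",
              b.parallelRegions, b.threadsPerRegion);
}
```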

> and NetworkAPI Regions seem to be a major candidate, mostly because there is very little interaction between them except for the data buffers.

Definitely, and it would probably also be the first thing we achieve.
But see the MNIST case, and Region(preferredNumThreads=N) for slow workloads.

@dkeeney commented Feb 9, 2019

All of the algorithms are basically compute bound. This means that priorities will not get things done any faster, but spreading the work over multiple threads would help. My machine has 6 cores (12 logical threads) and they could all be put to work, but even there the best I could expect is around 10 times as fast.

Ultimately the biggest advantage would be to reduce the amount of work that needs to be done. Replace loops with other ways of organizing data so loops are not needed. Reduce the number of times that data must be copied.
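
On the copy-reduction point, a generic sketch (not code from this repo) of how passing buffers by reference, or moving them, avoids a copy at every pipeline stage:

```cpp
// Generic sketch: avoiding needless buffer copies between processing stages.
#include <utility>
#include <vector>

using Buffer = std::vector<float>;

// Taking the parameter by value copies the caller's buffer on every call.
Buffer processByValue(Buffer data) {
  for (float& x : data) x *= 2.0f;  // some per-element work
  return data;
}

// Borrowing by reference mutates in place: no copy at all.
void processInPlace(Buffer& data) {
  for (float& x : data) x *= 2.0f;
}

int main() {
  Buffer buf(1024, 1.0f);
  processInPlace(buf);                          // zero copies
  Buffer out = processByValue(std::move(buf));  // move instead of copy on handoff
  (void)out;
}
```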

@breznak (Member, Author) commented Feb 9, 2019

Yes, but even a speedup linear in the number of threads would be a nice win.

> Replace loops with other ways of organizing data so loops are not needed. Reduce the number of times that data must be copied.

Yep, that goes in parallel (sic) with this, #3. It would be nice if we could vectorize the for loops; that was my intention with SDR_t, but I'm not sure SDR helps with that (at least it prepares us for sparse data structures).
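
For context, a generic sketch (not the actual SDR_t code) of the kind of dense loop a compiler can auto-vectorize, and which sparse index lists tend to defeat:

```cpp
// Generic sketch of an auto-vectorizable loop over a dense SDR-like buffer.
// Contiguous data and a branch-free body let the compiler emit SIMD
// instructions (e.g. at -O2/-O3); indirect sparse indexing generally cannot.
#include <cstddef>
#include <cstdint>
#include <vector>

// Count the overlap between two dense binary vectors of equal size.
std::size_t overlap(const std::vector<uint8_t>& a,
                    const std::vector<uint8_t>& b) {
  std::size_t sum = 0;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += a[i] & b[i];  // simple, branch-free: vectorizer-friendly
  return sum;
}
```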

@dkeeney commented Feb 9, 2019

Hmmm, 'vectorizing' the for loops does not get rid of the loops; they are just harder to find. What I mean is using things like map and set objects in places where it makes sense, to avoid having to iterate: go from O(n) to O(log n) types of things. But I don't know, it might require a whole new way of looking at the algorithms to make that work.
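
As a generic illustration of that O(n) → O(log n) shift (example code, not from this repo):

```cpp
// Generic sketch: membership test via linear scan vs. an ordered set.
#include <algorithm>
#include <set>
#include <vector>

// O(n): must walk the whole vector to find one element.
bool containsLinear(const std::vector<int>& v, int x) {
  return std::find(v.begin(), v.end(), x) != v.end();
}

// O(log n): the ordered container does the lookup; no explicit loop.
bool containsLogN(const std::set<int>& s, int x) {
  return s.find(x) != s.end();
}
```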
