Para

This project is a work in progress. Feel free to contribute.

The goal

Efficient parallelization of pipelined computations on multiple cores with a simple interface.

The interface should look something like:

let data = vec!(0, 1, 2);
let mut sum = 0;
pipeline!(4, data
          => (|x| (x * 2, x) => |(x2, x)| x2 - x,
              |x| -x)
          => |x| sum += x);

In this example, the tasks to be done are:

Iterate over the elements in data
Clone each element and pass it to both the |x| (x * 2, x) and |x| -x closures.
Apply the output of |x| (x * 2, x) to |(x2, x)| x2 - x
Sum all outputs of |(x2, x)| x2 - x and |x| -x into the sum variable.

This constructs a graph in which each node is a closure. Data flows between the closures and gets processed. Except for the first and the last nodes in this example (the iteration and the sum nodes), all nodes are completely parallelizable. The pipeline! macro will instantiate a scheduler which will use 4 threads to run all nodes and maximize computation throughput.

Current state

See the integration tests

TODO

Potential bugs

Make sure usage of atomics is correct

Interface

Support fanout in macro
Support multiple producers in macro
Support thread num in macro
Make it easier to pass producers to schedule() (why not consume? maybe if we want to stop before the end?)
Support adding priority for nodes
Support marking nodes as running on external HW (e.g. GPU)
Support stateless producers? Rayon-style splittable iterators?
Implement Fanout for any iterable of consumers

Optimization

When a local thread is overflown, dump half of tasks to global queue
Make circus sized with power of two, so instead of using modulus with can use & with a mask
Use a custom allocator for the tasks
Inline some funcs or enable automatic inlining
A LOT of time is spent on creating tasks and moving them around. If we could make them not dyn (enum?) we could save both the dynamic dispatch and allocation.
Circus reads and writes take time. We could probably optimize that.
Designate cores for threads
When running short tasks, threads spend significant time synchronizing pushing and popping from the task queue. The usual solution for that would be to implement work stealing.
Stateful consumers are held inside a mutex, so threads might spend time blocking. A possible improvement is to modify the consume interface to return a boolean and use try_lock when executing functions in a mutex. This might be less problematic once we implement work stealing. Also, we might wanna use a refcell instead of a mutex.
Tasks contain a pointer to a node, which is wasteful as it is has 64 bits while we usually have around 3 bits of nodes. We could use an id instead of reference, or collect tasks per-node. Collecting tasks per node might allow us to further optimize running of stateful tasks.
We want all cores to always work. There could be a situation in which there are many tasks, but all are for a specific Stateful node, so they can't be parallelized. For that reason, stateful nodes should always be prioritized.
Smartly manage the balance between having many ready tasks and memory usage, by knowing which nodes produce more work and which nodes consume more work, and prioritizing them according to current workload.

Testing

Performance regression tests
CI/CD
Add benchmark that can be optimized using work-stealing
Benchmark two producers

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
benches		benches
profiling		profiling
src		src
tests		tests
.gitignore		.gitignore
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Para

The goal

Current state

TODO

Potential bugs

Interface

Optimization

Testing

About

Releases

Packages

Languages

kmaork/para

Folders and files

Latest commit

History

Repository files navigation

Para

The goal

Current state

TODO

Potential bugs

Interface

Optimization

Testing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages