Project Picasso - A multithreading runtime for Nim #160

Open
mratsim opened this issue Aug 9, 2019 · 12 comments

@mratsim
commented Aug 9, 2019

Project Picasso - a multithreading runtime for Nim

"Good artists borrow, great artists steal." -- Pablo Picasso

Introduction

The Nim destructors and new runtime were introduced
to provide a GC-less path forward for Nim libraries and applications
where it made sense. One of their explicit use cases is making threading easier.

RFC goals

This RFC aims

  • to present the current challenges and the design space of
    a multithreading runtime.
  • to collect use-cases and discuss goals and non-goals of a multithreaded runtime.
  • to understand if we need compiler support for some features and if not:
    • discuss if we should allow competing runtimes and allow switching
      just like Nim allows multiple GCs (refcounting, mark-and-sweep, boehm, no gc).
  • to gather some metrics ideas to benchmark runtime systems.
  • ultimately, to have people implement a runtime system or part of one (there are plenty of pieces needed).

The problem domain:

The word "thread", and words closely related to it, have had many meanings over time (green threads vs heavy threads, coroutines, fibers, ...).

In the broad sense, threading is about how to interleave different routines and their contexts of execution.

This RFC focuses on "heavy" threads as used for computation on multi-core systems.

Why Project Picasso?

The new runtime introduced a borrow-checker and most successful
multithreading runtimes use work-stealing for load balancing.
Now re-read the quote 😉.

Reading on Nim related concepts

Where are we now?

If you want to use multiple cores in Nim you can currently use

  • Raw threads via createThread (pthreads on Unix, native Win32 threads on Windows)
  • Threadpool with
    • the spawn/^ functions
    • The parallel statement
    • channels for inter-thread communication
  • OpenMP with
    • The || OpenMP operator for parallel for-loops or task-loops
    • Emitting OpenMP blocks

However, I'd argue that

  • createThread is a too low-level abstraction for most.
  • The threadpool has contention issues due to using a global queue, and it has no load balancing either.
  • OpenMP does not support nested parallelism. The implementation of tasks varies wildly (GCC's uses a global queue as well, so load balancing and contention are issues) and cannot be built upon (for example for task graphs).
    Furthermore, OpenMP requires going through C/C++ compilation
    and cannot be used with nlvm or projects that would want to JIT parallel code.

Brief overview of the types of parallelism

There are several kinds of parallelism, some addressed at the hardware level
and some addressed at the software level.

Let's start with hardware level not addressed by this RFC:

Instruction-Level Parallelism:

Modern superscalar processors have multiple execution ports and can schedule multiple instructions at the
same time if they don't use the same port and there is no data dependency between them.

SIMD: Single Instruction Multiple Data:

Often called vectorization, this is SSE, AVX, etc.: a single instruction that applies to a vector of 4, 8 or 16 integers or floats.

SIMT: Single Instruction Multiple Threads:

That is the threading model of a GPU. Threads are organized at the level of a Warp (Nvidia) or Wavefront (AMD) and they all execute the same instructions (with obvious bad implications for branching code).

SMT: Simultaneous Multi-Threading:

In Intel speak, "Hyperthreading". While a superscalar processor can execute multiple instructions in parallel,
it will sometimes sit idle due to instruction latency or waiting for memory.
One way to reclaim performance for a limited increase in chip size is
Hyperthreading, with each physical core having 2 (usually) to 4 (Xeon Phi) logical sibling cores that share the same hardware resources to execute multiple threads.

Further information: https://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html

What are we interested in?

Exploiting multiple cores:

Recent laptops now ship with 4 cores, even phones ship with 4 cores; we need to provide tools for developers to use them.

At the software level

Data parallelism:

The easy part: you work on elements and apply the same operation to all of them. For example, incrementing all elements of an array by one.

Task Parallelism:

The complex part, you have tasks (jobs) that are usually different in terms of computation, resources, time required but can be scheduled in parallel. Those can produce new tasks. For example, issuing a parallel search on an unbalanced tree data-structure.

What are we less interested in

Stream Parallelism:

You have a data stream and apply a pipeline of transformations on it, possibly with forks in the stream and joins. An example would be a parallel iterator library or a parallel stream program that takes an input compressed image archive, decompresses it, applies transformations to some images and then recompresses them into a new archive.

I believe that stream parallelism is sufficiently similar to data parallelism and task graphs
that addressing data and task parallelism will make stream processing much easier.

Use-cases

I will need your help for this section.
Some obvious needs are:

  1. spawn computeIntensiveTask() (Task-parallelism)
  2. Array processing in numerical computing (Data parallelism)

In both cases parallelism can be nested if a parallel Nim library
calls another parallel Nim library. The system should behave properly
if a parallel GUI calls a parallel image library for example.

API

Having good features will draw people, having good APIs will make them stay.

Here is an overview of the design space.

Data parallelism only needs 5 primitives:

  • parallel section (to setup thread local values)
  • parallel for
  • parallel reduce
  • barrier
  • critical section

Task parallelism has many more needs:

  • spawning a new job
  • Representing a future value with Flowvar
  • blocking (^) until the child task is finished
  • alternatively polling with isReady
  • scheduling continuations
  • cancel a computation (user changed image on the GUI so compute is cancelled)

As you can see, there are a lot of parallels with async/await IO. This is probably a good thing, i.e. use async/await for blocking IO and spawn/^ for non-blocking compute.

For the rest, I will assume that threads are too low-level of an abstraction
and that parallel annotation (for data parallelism) and tasks (for task parallelism) are much easier and more natural to manipulate for a developer.
A runtime system should figure how to distribute those on the hardware.

Furthermore, data parallel primitives can be expressed in terms of task primitives so I will focus on tasks.
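
To illustrate how a data-parallel primitive can be expressed with tasks, here is a minimal sequential sketch of a "parallel for" built by recursively splitting a range into chunks. The names Chunk, splitIntoTasks and grain are hypothetical, and the chunks are simply run one after the other here; a real runtime would spawn each chunk (or each half) as a stealable task.

```nim
type Chunk = tuple[lo, hi: int]  # half-open range [lo, hi), hypothetical task payload

proc splitIntoTasks(lo, hi, grain: int, tasks: var seq[Chunk]) =
  ## Recursively halve [lo, hi) until a chunk holds at most `grain` elements.
  ## Each leaf chunk stands in for one task a runtime could schedule.
  if hi - lo <= grain:
    tasks.add((lo: lo, hi: hi))
  else:
    let mid = lo + (hi - lo) div 2
    splitIntoTasks(lo, mid, grain, tasks)
    splitIntoTasks(mid, hi, grain, tasks)

var data = newSeq[int](10)
var tasks: seq[Chunk]
splitIntoTasks(0, data.len, 3, tasks)

# Stand-in for the scheduler: run every task, here sequentially.
for t in tasks:
  for i in t.lo ..< t.hi:
    data[i] += 1

echo tasks.len, " tasks, data = ", data
```

The grain parameter is the knob discussed under loop splitting in the load-balancing section: too small and task overhead dominates, too large and idle workers have nothing to steal.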

On the non-obvious choices, there is:

  • How to communicate between threads
    • Message passing (i.e. Channels): Share by communicating instead of communicate by sharing (from Rust and Go)
    • Shared memory:
      • atomics and locks
  • For channels:
    • Have an object shared by producer(s) and consumer(s)
    • Have a Sender object and a Receiver object that statically ensure
      that it's correctly used
  • How to represent a task:
    • An object
    • A concept/interface/trait
    • A closure (that captures its context)
    • A pure function
    • Note that the choice may have impact on:
      • Nim DLLs
      • C interface, which is valuable for Nim as a Python backend
        or for JIT code to tie back to Nim.
      • Hot-code reloading
  • An error model:
    • No exceptions in the runtime, unless we know we have thread-safe exceptions
    • Error codes?
      • If yes, we need a spawn that accepts a Flowvar for in-place modification
    • Options?
    • A richer API like nim-result
    • Note that Nim enums can use strings

      type PicassoError = enum
        Ok = "All is well"
        ThreadMemError = "Could not allocate memory to create a thread"
        TaskMemError = "Could not allocate memory to create a task"
        AlreadyCancelledError = "Task was cancelled"

      And those can be preformatted for printf

      TaskMemError = "Thread %d: could not allocate memory to create a task"

  • How to ensure composition?
  • How to transfer ownership between threads?
  • Are there use cases where lower-level access to the threadpool is desirable?
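
As an illustration of the error-code option in the error model above, here is a minimal sketch where a spawn-like proc reports failure through the string-valued enum instead of raising. TaskQueue and trySpawn are hypothetical names used only for this example; the bounded seq stands in for whatever storage a real runtime would use.

```nim
type
  PicassoError = enum
    Ok = "All is well"
    TaskMemError = "Could not allocate memory to create a task"
    AlreadyCancelledError = "Task was cancelled"

  # Hypothetical bounded task storage, not part of any existing library.
  TaskQueue = object
    capacity: int
    pending: seq[int]

proc trySpawn(q: var TaskQueue, taskId: int): PicassoError =
  ## Enqueue a task id; report failure with an error code, never an exception.
  if q.pending.len >= q.capacity:
    return TaskMemError
  q.pending.add taskId
  result = Ok

var queue = TaskQueue(capacity: 1)
let first = trySpawn(queue, 1)
let second = trySpawn(queue, 2)
echo first    # `$` on the enum yields its string value
echo second
```

The caller can branch on the enum value cheaply and still get a readable message for logging.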

In terms of robustness:

  • message passing benefits from CSP (Communicating Sequential Processes), which provides a formal verification framework for concurrent systems that communicate via channels
  • Haskell inspired C# with the Continuation Monad. If there is one thing that Haskell does well it's composition, and also having a solid type system.

Load-balancing

Work-stealing won both in theory and in practice. It has been proven asymptotically optimal in terms of performance.

However there are plenty
of implementation subtleties that can have heavy influence on workloads:

  • What to do after spawning work:
    • Help-first: continue on the current execution context (also called child-stealing). Breadth-first task creation: on a single-thread context, with a for loop for N tasks, N tasks will be created and live before the thread does the jobs one by one.
    • Work-first: jump on the freshly spawned work (also called parent-stealing or continuation stealing). This requires compiler support similar to coroutines for restoring stackframes. Depth-first task creation: on a single-thread context, with a for loop for N tasks, only 1 task will be live at a time, and it is resolved before the thread goes to the next.
  • Steal-one vs steal-half: does a thief take a single task or half of the victim's tasks?
  • Leapfrogging: work-stealing allows an idle worker to steal from a busy one, but what if a busy worker is blocked by an unresolved Flowvar? Allowing it to keep working on other tasks (typically ones stolen back from the thief that holds the awaited task) instead of blocking is called leapfrogging.
  • Loop splitting: some tasks include loops which, for efficiency reasons, are not split into one task per element. But when a loop is big, it might be worth splitting it to allow other worker threads to steal part of it. However, the operation within a loop might be either very cheap or very costly, so the grain size matters, and adaptive splitting would be very nice.
  • Hierarchical work-stealing: high-end processors like AMD Threadripper or Intel Xeon Bronze/Silver/Gold/Platinum have a Non-Uniform Memory Access (NUMA) architecture. This means that cores have significantly more affinity with the memory directly attached to them, and accessing "far" memory incurs a significant penalty. In that case it is important to only steal work corresponding to the local fast memory.
  • CPU consumption and latency: when a worker finds no work, does it poll, how frequently, does it yield?
  • How to select theft victims?
  • How to detect work termination?
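
To make the steal-one vs steal-half choice concrete, here is a minimal single-threaded model of per-worker deques. It is only a sketch of the policy, with no synchronization at all; the Worker type and the push, popOwn, stealOne and stealHalf procs are hypothetical names. The owner works on the newest task at the back, thieves take the oldest tasks from the front.

```nim
import std/deques

type Worker = object
  tasks: Deque[int]  # task ids; owner uses the back, thieves take from the front

proc push(w: var Worker, task: int) =
  w.tasks.addLast(task)

proc popOwn(w: var Worker): int =
  ## The owner takes its most recently pushed task (LIFO).
  w.tasks.popLast()

proc stealOne(thief, victim: var Worker): int =
  ## Move the oldest task of the victim to the thief; returns the number moved.
  if victim.tasks.len == 0:
    return 0
  thief.tasks.addLast(victim.tasks.popFirst())
  result = 1

proc stealHalf(thief, victim: var Worker): int =
  ## Move the oldest half (rounded up) of the victim's tasks to the thief.
  result = (victim.tasks.len + 1) div 2
  for _ in 0 ..< result:
    thief.tasks.addLast(victim.tasks.popFirst())

var busy = Worker(tasks: initDeque[int]())
var idle = Worker(tasks: initDeque[int]())
for id in 1 .. 8:
  busy.push(id)

let stolen = stealHalf(idle, busy)
echo "stolen: ", stolen, ", left with victim: ", busy.tasks.len
```

Steal-half amortizes the cost of a steal over several tasks, at the risk of moving work that the victim would have finished faster itself. A concurrent version needs a lock-free deque or channels, which is where most of the implementation subtleties listed above come from.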

Interested and not feeling overwhelmed yet? I have gathered an extensive body of literature in my research repo.

Scheduler implementation

Like the choice of communication between threads, for synchronization
a scheduler needs to choose between:

  • Shared memory
  • Message passing
  • Software Transactional Memory (database like commits and rollback based on transaction logs)

The traditional focus has been shared memory, involving atomics and locks. However, I read and ported the code of a very inspirational thesis on a message-passing-based work-stealing scheduler in my experimental repo.

Haskell is the only production grade user of Software Transactional Memory.
It has caught C++ interest, here is a good overview of the model and the C++ proposal sponsored by Michael and Scott (of Michael-Scott concurrent queue fame). One of the main difficulties with STM is that you cannot replay side-effects.

Note that for scheduler implementation all three strategies can be formally verified as the synchronization between threads is done through a very specific data structure:

Also all 3 already had hardware support in the past (in either experimental hardware for message passing or buggy hardware for transactional memory).

Which brings us to ...

Hardware

The hardware we choose to target will greatly influence the runtime.

Scheduling differs for a weak memory model like ARM, a strong memory model like x86,
a workstation with 2 CPUs or a cluster for distributed computing.

For example, the Cell processor (for Playstation 3) made it impossible to implement efficient concurrent data structures. Likewise, shared memory is impossible for distributed computing or heterogeneous architectures with GPU nodes.

Message passing is often associated with overhead.

Hardware transactional memory is only supported on recent Intel chips, is GCC-only, and was notoriously buggy for several chip generations (notably Haswell and Broadwell).

Note that in all cases, implementation "details" matter a lot and message passing can be as fast as shared-memory as shown by my proof-of-concept channel-based work stealing scheduler.

Let's talk about the biggest implementation "detail".

Memory

For compute-intensive operations the bottleneck is often not the CPU GFlop/s but the memory bandwidth needed to keep the processor fed with data. This has been captured by the roofline model and the notion of arithmetic intensity (the ratio of compute operations to bytes moved to carry them out). Only operations with high arithmetic intensity can use the CPU at 100%; most are bottlenecked by memory and can only use 10-20% of the compute.
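
As a worked example of the roofline bound, attainable performance is min(peak compute, memory bandwidth × arithmetic intensity). The sketch below uses made-up machine numbers (100 GFlop/s peak, 20 GB/s bandwidth) and an assumed intensity of 16 flop/byte for a well-blocked matrix multiplication; only the formula itself comes from the model.

```nim
proc attainableGflops(peakGflops, bandwidthGBs, intensity: float): float =
  ## Roofline bound: min(compute roof, memory bandwidth * arithmetic intensity).
  min(peakGflops, bandwidthGBs * intensity)

const
  peak = 100.0      # GFlop/s, assumed machine peak
  bandwidth = 20.0  # GB/s, assumed memory bandwidth

# Incrementing a float32 array: 1 operation per 8 bytes moved (4 read + 4 written).
let increment = attainableGflops(peak, bandwidth, 1.0 / 8.0)

# Blocked matrix multiplication reuses each byte many times (assumed 16 flop/byte).
let matmul = attainableGflops(peak, bandwidth, 16.0)

echo "increment: ", increment, " GFlop/s, matmul: ", matmul, " GFlop/s"
```

With these assumed numbers the array increment is capped at 2.5% of peak no matter how many cores are used, while the matrix multiplication can reach the compute roof.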

This means that memory locality and efficient memory allocation and reuse are key: memory pools, object pools, stack arrays with alloca, ...

Also for NUMA architecture, a NUMA aware allocator would be helpful.

I.e. concurrent data structures should probably accept an "allocator" argument.

Extras

Some extras that are not in scope but interesting nonetheless

  • relation with the async/await event loops
  • fiber/coroutine pools as in Boost::fibers or the Naughty Dog presentation (video and slides)
  • Task Graphs
  • Dealing with GC types (as GC will still be useful)
  • Mapping with GPU: beyond the obvious offloading of for-loops to GPU, Cuda and OpenCL provide an async stream and event API to offload, provide continuations and then block or poll until the computation stream has finished.

Benchmarking

Once we have designed our unicorn™, we need to make sure it meets our performance requirements: its overhead, its scalability and how it fares against other close-to-metal languages.

Here are a couple of ideas:

  • Runtime overhead (Task Parallelism):
    A recursive fibonacci benchmark will quickly tell
    how much overhead the framework has because the task is completely trivial.
    It will also tell us the scalability of the task system as the number
    of tasks grows exponentially with N (on the order of 2^N).
    Key for performance:

    • Memory allocators
    • Having distributed task queues/deques to limit contention
  • High-performance computing (Data Parallelism)
    I have implemented a matrix multiplication in pure Nim as fast
    as industry-standard OpenBLAS, which is Assembly + raw pthreads.
    It requires 2 nested parallel for loops and can also be called from
    outside parallel regions as it's a basic building block for
    many scientific and machine learning workloads.
    Key for performance:

    • As long as the matrix multiplication is well implemented it's an easy task
      as workload is completely balanced (no need for stealing), tasks are long-running (work is much bigger than overhead)
      and complex enough to maximize compute as long as memory is fast enough.
    • Thread pinning will help a lot as it is very memory intensive
      and optimizations are done to keep data in L1, L2 caches and the TLB
    • Being aware of and not using hyperthreading will help because
      otherwise the physical core will be bottlenecked by memory bandwidth
      to retrieve data from 2 threads operating on different matrix sections.
      An extra would be to test on a NUMA machine.
  • Load balancing (Task parallelism)
    Tree algorithms create a lot of tasks, but if the tree is unbalanced
    idle workers will need to find new work.
    An example use-case is Monte-Carlo Tree Search, used in decision processes and reinforcement learning for game AI and recently in finance. In short,
    you launch simulations on different branches of a tree, stopping if one is not deemed interesting but searching deeper on interesting branches.
    The Unbalanced Tree Search benchmark is described in this paper.
    Key for performance:

    • load balancing
  • Energy usage (Task parallelism):

    When workers find no work they should not uselessly consume CPU. A backoff mechanism is needed that still preserves latency if new work suddenly becomes available.
    A benchmark of energy usage while idle can be done by just checking the cpuTime (not epochTime/wallTime)
    of a workload with a single long task compared to serial.

  • Single loop generating tasks (Task Parallelism)

    Such a benchmark will challenge the runtime to bundle or potentially
    split work in response to incoming steal requests. This stresses how many consumers
    a single producer can sustain, see Nim implementation.

  • A divide-and-conquer benchmark like parallel sort

  • Black-Scholes: The Black-Scholes equation is the building block of financial modeling.

  • Wavefront scheduling (Task Graphs)
    Wavefront is a pattern that often emerges in image processing when, after computing pixel [i, j], you can compute pixels [[i+1, j], [i, j+1]], then [[i+2, j], [i+1, j+1], [i, j+2]]. This is also a key optimization for recurrent neural networks (Nvidia optimization blog - step 3).
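
For the runtime overhead benchmark listed first, here is a minimal sequential sketch that counts how many tasks a naive recursive Fibonacci would create if every call were spawned as one task. The spawned counter is a stand-in for the runtime's task creation; no actual parallelism is involved.

```nim
proc fib(n: int, spawned: var int): int =
  ## Naive recursive Fibonacci; `spawned` counts one task per call.
  inc spawned
  if n < 2:
    return n
  result = fib(n - 1, spawned) + fib(n - 2, spawned)

var taskCount = 0
let value = fib(20, taskCount)
echo "fib(20) = ", value, " using ", taskCount, " tasks"
# prints: fib(20) = 6765 using 21891 tasks
```

Each task does a single addition, so nearly all the measured time in a parallel version is scheduling and allocation overhead, which is exactly what the benchmark is meant to expose.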

See also: A Comparative Critical Analysis of Modern Task-Parallel Runtimes

Community challenges

Let's step back from the nitty-gritty details and look at the challenges for Nim.

  • Given the breadth of the needs and design space: do we want to allow multiple libraries, or do we try our hand at a one-size-fits-all?
    • Example: real-time systems and games might want scheduling with priority queues, which are hard to make concurrent, and I'm not even sure they can be made work-stealable.
  • Assuming we allow multiple libraries, how to make sure end-users can use one or the other with minimal cost, does the standard library enforce an interface/concept?
  • When do we ship it?

I hope you enjoyed the read.

TL;DR: Designing a multithreading runtime involves many choices, probably some conflicting ones in terms of performance, ergonomics, complexity, theoretical properties (formal verification) and hardware support.

@krux02

Collaborator

commented Aug 9, 2019

Wow, I can see you put a lot of work into this PR. Yes, threading is something we should improve in Nim and this document is great work.

In my experience, a good API only bubbles up when we have a task where we want to use this API. Creating a multithreading API in isolation doesn't work. So my suggestion is that we also need to specify a problem that this API should be able to solve. This problem should be an interesting problem to solve, not one of these micro benchmarks that sort a list of random integers in parallel, and then we can see how well the patterns work out.

I am saying this because the road to Blender's success was the open movie projects. These projects helped the Blender developers to focus on what is really important. They were not made for the sake of making a movie, they were made to improve Blender. So I think we need the equivalent of a Blender Open Movie for Nim to improve the multithreading functionality.

@c-blake

commented Aug 9, 2019

Minor pet peeve - in terms of binomial trees for option pricing, there really are many numerical methods that scale better (e.g. Broadie & Detemple 1996 https://sci-hub.tw/10.1093/rfs/9.4.1211, but there is a whole cottage industry of methods here - BBSR is just super-intuitive/easy to explain to anyone at all familiar with the basic problem).

So, to me, at best that feels more like the Fibonacci benchmark - "a well known but bad way to get an answer", (differing only in that it seems to be much less well known how bad a way, hence this comment).

@awr1

commented Aug 9, 2019

Great proposal.

Asked this on Gitter but I thought I should restate it here: w/r/t the load-balancing and the hardware section, I'm curious what precisely do you intend here: do you want to approach some sort of compile-time/init-time fine-tuning for the underlying implementation or just an idealized general implementation?

@mikra01

commented Aug 9, 2019

Indeed very impressive. A "one-size fits all" will be hard to get. I think you will need a lot of metadata at compile-time to get the best out of the hardware and the business case. But now I have started to dream about an "RTOS-less" runtime - a lightweight, easily portable HAL (in Nim?) and Nim....

@mratsim

Author

commented Aug 10, 2019

Edited to add the table of contents and a section of the parallelism options that we have currently in Nim (createThread, threadpool, OpenMP).

@krux02 I do have non-toy uses:

  • Data Parallelism:
    • I can switch all OpenMP uses in Arraymancer and Laser to the Picasso runtime provided the performance is there.
  • Task Parallelism:
    • cryptographic signatures and verifications are our number one bottleneck (~30% processing time) for Ethereum 2 at Status, being able to process multiple in parallel would be very helpful. As they come asynchronously from the network an API similar to async/await but for compute would be a great match.
    • DAG and tree algorithms power several state-of-the-art machine learning and reinforcement learning algorithms:
      • Gradient Boosted Trees for everything but perception (vision/text/sound) i.e. predicting price, quantity, web visits, subscriptions, sales, ...
      • Beam search to improve natural language models
      • Computation graphs for neural networks
      • Monte-Carlo Tree Search for robotics, game AI and reinforcement learning

Regarding non-toy uses outside of my expertise, I'd say a parallel ray-tracer (see ray-tracing in one weekend) would be pretty nice and it's visual.

@c-blake noted. Your link is dead though.

@awr1 @mikra01: For now, I have an implementation idea that should cover from phones/raspberry Pi to laptops to single-socket workstations. I think a library with a set of APIs that you can import picasso is the most flexible (instead of compiler builtins like spawn and parallel) and when you want something suited for NUMA or distributed computing you can do import picasso_distributed.
Regarding compile-time fine-tuning, there is not a lot to fine-tune at compile-time for a mature library. There will be:

  • the array sizes, for example Nim threadpool and my PoC can hold up to 256 threads, but it's overkill in most cases
  • maybe the polling/yielding frequency.
  • the cache line size for padding to prevent false sharing. It's 64 bytes for almost all common platforms except Samsung phones that use 128 bytes
  • toggling profiling and multithreading asserts
  • memory allocator related (like use jemalloc, tcmalloc, a custom pool_allocator when dynamic is not allowed, the object pools size)
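
As an illustration of the cache line size knob, here is a minimal sketch of a compile-time constant that can be overridden from the command line, used to pad a per-thread counter so that neighbours in an array do not share a cache line. CacheLineSize and PaddedCounter are hypothetical names; the intdefine pragma is standard Nim.

```nim
const CacheLineSize {.intdefine.} = 64  # override with -d:CacheLineSize=128

type
  PaddedCounter = object
    value: int
    pad: array[CacheLineSize - sizeof(int), byte]  # filler up to one cache line

# Checked at compile time: one counter occupies exactly one cache line.
static:
  doAssert sizeof(PaddedCounter) == CacheLineSize

# One slot per worker thread; adjacent slots are CacheLineSize bytes apart.
var counters: array[4, PaddedCounter]
counters[0].value = 1
counters[1].value = 2
echo "bytes per counter: ", sizeof(PaddedCounter)
```

Padding only fixes the distance between slots. Making sure the array itself starts on a cache line boundary is a separate concern (for example via an aligned allocator) that this sketch does not address.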

For research purposes it will be implementation dependent; in my PoC, you can tune the loop splitting and work-stealing strategies and the number of outstanding steal requests.

At runtime there are some things to detect at init time:

  • an environment variable with the max number of threads to use
  • find hyperthread siblings, and allow at init a parameter for not using hyperthreads
  • NUMA (if we go that far)
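
A minimal sketch of the first init-time item could look like the following. The variable name PICASSO_NUM_THREADS and the workerCount proc are assumptions made for this example, not an existing setting or API; the 256 cap mirrors the array size mentioned above.

```nim
import std/[os, strutils, cpuinfo]

const
  EnvVarName = "PICASSO_NUM_THREADS"  # hypothetical name, not an existing setting
  MaxWorkers = 256                    # upper bound quoted above for the threadpool and PoC

proc workerCount(): int =
  ## Number of worker threads to start: the detected core count, unless the
  ## environment variable holds a valid positive integer.
  result = clamp(countProcessors(), 1, MaxWorkers)
  let raw = getEnv(EnvVarName).strip()
  if raw.len > 0:
    try:
      let requested = parseInt(raw)
      if requested >= 1:
        result = min(requested, MaxWorkers)
    except ValueError:
      discard  # malformed value: keep the detected default

putEnv(EnvVarName, "3")
echo "workers: ", workerCount()
```

Detecting hyperthread siblings and NUMA topology is platform specific and is left out of this sketch.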

Due to the adaptive nature of work-stealing, good defaults should cover from the current dual-core to 22-core single-socket machines.

@c-blake

commented Aug 10, 2019

@mratsim - huh, works for me. Taiwan may be getting blocked for you?

For the curious there's also a book 10 years later covering the paper and amazon's preview (for me) covers and mentions the basic idea https://www.amazon.com/dp/158488567X/ref=rdr_ext_tmb { which is simply use the closed form BS formula for the very last time period to smooth convergence to the point that Richardson extrapolation is workable. In a sense two distinct extrapolate-to-zero-stepsize tricks cooperate to squash error(stepsize|N steps) }. That paper also gives pseudocode for linear-in-memory storage -- also too uncommon, making even 1000 level "trees" fully L1-resident. It's by no means the last word in pricing models/algo efficiency, but it shows how almost trivial upgrades from a defining recursion shift scaling in time/space, much as Newton's method vs. binary search { or naive Fibonacci vs memoized/array/matrix/formulaic, etc., but I realize inefficiency may be "the point" as with Fibonacci }.

@krux02

Collaborator

commented Aug 10, 2019

@c-blake I can't see the link either. I see the left border decoration with sci-hub, but everything on the right is like a broken link.

@mratsim Regarding the ray tracer: I have only implemented toy ray tracers so far, and for that the fragment shader is enough. A computation kernel that runs for every pixel on the screen is exactly what you need for a ray tracer. If you want to go for photorealism with light bounces everywhere, this approach has its limits though.

@c-blake

commented Aug 10, 2019

Huh. For me it auto-downloads the PDF (which is the real content, not any HTML). You can also try just http://sci-hub.tw/ and manually enter the DOI (which is "10.1093/rfs/9.4.1211"). You may also need javascript enabled for sci-hub.tw & maybe cyber.sci-hub.tw? It looks like this is a direct link to the pdf https://dacemirror.sci-hub.tw/journal-article/9bc250c4bfa2abe7c3ee89ed32b59609/broadie1996.pdf . It is a nice introductory survey paper for the field (as it was 25 years ago) with charts, graphs, pseudocode, etc. as well as a great practical example of Richardson extrapolation. Anyway, apologies if it's hard to access! I didn't think it would be.

@csajedi

commented Aug 13, 2019

This is a great collection of knowledge. I've let my HPC knowledge decay but I'm still interested in it and as I learn nim I think about how powerful it could be for HPC. If I was going to take on a hobby project to refamiliarize myself I'd try to write nim bindings or a macro for OpenACC- I'm still such a greenhorn I don't even know which would be more appropriate!
In short, OpenACC is an accelerator primitive toolkit that works along certain compilers to parallelize C,C++ and I think Fortran code for accelerators and the CPU. It's like scripty CUDA (it even has async/await) but can run just as fast on AMD, NVIDIA and multicore CPU targets. Now is a great time to bring support to Nim as GCC 9 has really stepped up support for it and AMD recently contributed better backends for their newer cards.

As I find some free time I'll try to parse Picasso and think of a demonstration or experiment to build against. If you've got any thoughts in general I'm curious to hear them.

@kobi2187

commented Aug 13, 2019

I have two thoughts here:
1) There are many abstractions nowadays, for example channels, that can use so-called green threads, or fibers, instead of a full OS thread. This is much easier for the user/dev, but perhaps you are speaking solely of the underlying implementation that enables these abstractions.
2) If you think of an operating system, let's say you have two hard drives and copy files to both of them - the copies to each drive can be done at the same time, but if it's to the same hard drive, it's better to be sequential. This is true for IO of all kinds, be it network connections, storage, or memory operations. Is there a need for some "manager" to collect those requests and build some kind of dependency graph, to determine which resources can be put into action next?
Maybe the design should include such an overseer, as a more complete solution for optimal decisions.
3) OK, I have a third thought: the world is trying to parallelize, and Nim is compiled to C/C++ - surely there is some very optimized library that you can target. I know of libmill for example; surely there are many others.

@mratsim

Author

commented Aug 13, 2019

@csajedi OpenACC is probably relatively easy.

You can reuse the OpenMP codepath for OpenACC for loops which I touched in those 2 PRs: https://github.com/nim-lang/Nim/pull/10891/files and https://github.com/nim-lang/Nim/pull/9493/files. And for pragma directives (without for loops) you can follow the techniques I use in my future HPC backend.

Anyway for HPC and scientific computing, I have something much better than Picasso planned, you can read more in the markdown files in Laser

@kobi2187

  1. So I indeed thought of fibers, especially due to the inspiration from the Naughty Dog talk and slides and boost::fiber, however there are a few caveats:

    • It makes the runtime more complex. The abstraction becomes Task -> Fiber -> scheduler -> thread instead of Task -> scheduler -> thread. I'm not sure there are gains for pure compute tasks.
    • The more focused the runtime is, the easier it is to make it play nice with the async/await for IO i.e. this is compute => Picasso and this is IO => async/await.
    • Historically fiber multiplexing on a threadpool is called M:N threading and was tried by many languages: Rust, Java, Glibc before being abandoned. See:
  2. The overseer is the Task Graph part I mentioned as out-of-scope. Mature C++ threading libraries offer them, like Intel's TBB Flowgraph and Cpp-Taskflow or even OpenMP via the depend clause.
    I think task graphs can be built on top of a good low-level task abstraction at a later time. Basically, the difference with the current proposal is that in the proposal you eagerly create tasks while with a task graph, you lazily describe your operations then execute the graph. Conceptually it's the same difference as a dynamic language vs a statically compiled one.

  3. Yes there are very optimized C/C++ libraries but the main blocker would probably be how to pass Nim closures to those libraries, this probably would require compiler support. With a pure Nim implementation, the implementation can be done as a library.
    Furthermore, my proof-of-concept Picasso implementation has similar (on my 2-core laptop) to much lower (18-core workstation) overhead compared to Intel TBB or LLVM OpenMP.
    This also avoids distribution issues of .dll/.so and would also help a WASM backend.

@csajedi

commented Aug 15, 2019

@mratsim I'm pretty psyched about Laser. I am following along a bit out of sync. I was going to suggest you add it to "Are we scientists yet" as that page drew me in to learn more about Nim for my own HPC stuff. I see you're already in the mix there so I'll just wait patiently. WASM is going to be fire for nim once it matures past the V1 design.
