Parallel statement timing model #1555

Closed
lriesebos opened this issue Dec 2, 2020 · 27 comments

@lriesebos
Contributor

Bug Report

One-Line Summary

@sbourdeauducq, maybe I am misunderstanding something, but the timing model of the parallel statement might be fundamentally off.

Issue Details

Steps to Reproduce

from artiq.experiment import *


class ParallelLoopExperiment(EnvExperiment):

    def build(self):
        self.setattr_device('core')
        self.delay = 100 * ms

    def prepare(self):
        print(f'Expected delay is {self.delay}')

    @kernel
    def run(self):
        self.core.reset()

        # Scenario 1
        t0 = now_mu()
        with parallel:
            for _ in range(2):
                delay(self.delay)
        print(self.core.mu_to_seconds(now_mu() - t0))

        # Scenario 2
        t0 = now_mu()
        with parallel:
            delay(self.delay)
            delay(self.delay)
        print(self.core.mu_to_seconds(now_mu() - t0))

Expected Behavior

I expect both scenarios to be equivalent from a timeline perspective: an expected delay of 0.1 s in both cases.

Actual (undesired) Behavior

The for loop in the parallel statement is not parallel from a timeline perspective.

INFO:worker(139,parallel_loop.py):print:Expected delay is 0.1
INFO:worker(139,parallel_loop.py):print:0.2
INFO:worker(139,parallel_loop.py):print:0.1

Your System (omit irrelevant parts)

  • Operating System: Ubuntu 20.04 or 18.04
  • ARTIQ version: v5.7137.90fbc58a
  • Version of the gateware and runtime loaded in the core device: latest gateware, behavior reproduced on both Kasli and KC705
@sbourdeauducq
Member

A parallel statement resets the timeline at each of the statements immediately under it. In your first example there is only one such statement, the for loop. In your second example, there are two statements, the delays.
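
Roughly speaking (as a sketch, not literal compiler output), the two scenarios above behave as if they were written like this:

# Scenario 1: the for loop is the only statement immediately under the
# parallel block, so the timeline is reset only once and the two delays
# inside the loop still accumulate sequentially.
t0 = now_mu()
for _ in range(2):
    delay(self.delay)          # advances the cursor by 2 * self.delay in total

# Scenario 2: each delay() is its own top-level statement, so the cursor
# is reset to the start of the block before each of them.
t0 = now_mu()
at_mu(t0)
delay(self.delay)
at_mu(t0)
delay(self.delay)              # the block ends at t0 + self.delay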

@lriesebos
Contributor Author

Thanks for the quick reply; then it was definitely a misunderstanding on my side. Though I do have to admit that, in my opinion, this limits the power of the parallel statement. I can imagine the semantics of the parallel statement will not be changed at this point in time, so I guess the alternative is to "simulate" parallel behaviour in the case of a for loop by altering the timeline manually.

@sbourdeauducq
Member

Where do you draw the line?
If you actually parallelize every RTIO operation in a parallel statement, then function calls that rely on sequential execution, such as SPI accesses to Urukul, will break inside a parallel statement unless they contain an explicit sequential statement, which is easy to forget.
Simply parallelizing the top-level statements was the compromise found.

@lriesebos
Contributor Author

Sounds fair, I understand the interpretation that is implemented.

The model I had in mind was parallelizing every function call in the parallel statement. So in the timing model I had in mind, the contents inside the for loop are parallelized. In this model, "classical" constructs such as a for loop do not influence the timeline. I might be biased by my own previous work, but to me such a model sounds more natural.

For example, the (incomplete) snippet below would do a parallel detection using an array of edge counters. Like I said, it might be that we have a different opinion on the timing model, though I would argue that drawing the line at function calls (where I consider the delay() statement to be a function call as well) results in a more powerful parallel statement.

from artiq.experiment import *


class ParallelLoopExperiment(EnvExperiment):

    def build(self):
        self.setattr_device('core')
        self.pmt_array = [self.get_device(f'edge_counter{i}') for i in range(4)]

    @kernel
    def run(self):
        self.core.reset()

        with parallel:
            for pmt in self.pmt_array:
                pmt.gate_rising(100 * ms)

@sbourdeauducq
Member

Well, technically that can be done easily; the issue is breaking legacy code. Let's open this for discussion, as we may want to introduce it with the other breaking changes in the upcoming compiler rewrite.

@lriesebos
Contributor Author

I appreciate the open mindset and I am definitely willing to contribute to such discussions. Two things I wanted to add:

  1. I am not a PI and I do not call the shots regarding finances, but I do know that my lab at Duke is open to discuss funding opportunities for ARTIQ development and especially the ARTIQ compiler.
  2. In the shorter term, it would be amazing if there were compiler flags/configuration options to control this kind of behaviour. That way we can test and use new compiler features while leaving legacy code untouched.

@dhslichter
Contributor

dhslichter commented Dec 2, 2020

To me this indicates a misunderstanding of what parallel is intended to do (and not do), and I am strongly opposed to a breaking change to ARTIQ, because the existing behavior is correct and desirable in a number of use cases.

There are sensible existing solutions for the kind of programmatic timing tasks you want to perform (which I can understand and appreciate the utility of, especially for larger systems). For example, using the at_mu() method, one could write the following very simple modified version of the code above to achieve exactly the functionality you desire, without having to change parallel:

from artiq.experiment import *


class ParallelLoopExperiment(EnvExperiment):

    def build(self):
        self.setattr_device('core')
        self.pmt_array = [self.get_device(f'edge_counter{i}') for i in range(4)]

    @kernel
    def run(self):
        self.core.reset()

        t_start = self.core.now_mu()
        for pmt in self.pmt_array:
            self.core.at_mu(t_start)
            pmt.gate_rising(100 * ms)

I am not sure that adding compiler flags will help the situation; it's a lot of clutter. Part of the issue here is a more fundamental one: people expect Python programming to be "pythonic", while ARTIQ kernel programming needs to be a little more spelled out in a lot of instances. There are tough choices to be made here. See for example the discussion in #1542.

@dnadlinger
Collaborator

dnadlinger commented Dec 2, 2020

Yes, just calling at_mu at the beginning of the loop (self.core not needed) is a simple way to achieve the desired outcome.
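
For reference, the snippet above rewritten with the global timeline functions (a sketch, reusing the hypothetical pmt_array attribute from the earlier example):

    @kernel
    def run(self):
        self.core.reset()

        t_start = now_mu()          # global function, no self.core needed
        for pmt in self.pmt_array:
            at_mu(t_start)          # rewind the cursor for each counter
            pmt.gate_rising(100 * ms)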

And as @dhslichter points out, the current behaviour is desirable in a lot of cases, and very simple to explain (time is reset at the beginning of each top-level statement, i.e. usually each line directly underneath the with statement).

Another type of context along the lines of what you are suggesting could be added in addition to the current ones, but it would be more complex to specify, and more complex to implement in the compiler than the current types (with the exception of interleave, which is problematic and which nobody seems to use anyway). As such, I'm really not sure it would be worth it – in fact, personally I'd gladly write out one extra line to explicitly call at_mu here and there if it means I don't have to explain complicated semantics to people.

Clarifying the with parallel documentation, e.g. by explicitly mentioning the (non-)interaction with loops, might be a good idea, though – do you want to submit a PR?

@dnadlinger
Collaborator

(Note aside: In many cases, it is preferable to add a coarse RTIO clock cycle (typically 8 ns) of delay between similar events submitted in a loop, as it is easy to exhaust the number of available RTIO lanes otherwise.)
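
In code, that staggering could look like this sketch (assuming the usual 1 ns machine unit, so one coarse RTIO cycle is 8 mu; the pmt_array name is taken from the examples above):

t_start = now_mu()
for i in range(len(self.pmt_array)):
    # offset each gate by one coarse RTIO cycle so the events spread
    # across SED lanes instead of piling up on a single timestamp
    at_mu(t_start + i * 8)
    self.pmt_array[i].gate_rising(100 * ms)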

@lriesebos
Contributor Author

Thanks for contributing to the discussion. I agree with @dhslichter that breaking changes should not be taken lightly, but that does not mean that my suggestion is invalid. Especially when there is a plan to make breaking changes anyway, the interpretation of parallel could be reconsidered.

For the current compiler infrastructure, there are multiple options mentioned here to implement such a change without influencing existing code (e.g. a compiler flag added to the arguments of the core device, or a separate context), and I think some of these options could be reasonable, but that is not my decision to make.

Regarding the semantics, I still stand by my view that the implicit sequential context for each top-level statement is less powerful and intuitive than my interpretation, though I can imagine it might be challenging to convince people who are already used to the current semantics. I added some snippets to fuel the discussion.

"""One or three parallel pulses.
For current semantics, we need multiple parallel contexts.
"""
# current semantics
with parallel:
    self.device_a.pulse(10*ms)
    if self.device_b_enabled:
        with parallel:  # Have to enter parallel context again...
            self.device_b0.pulse(10*ms)
            self.device_b1.pulse(10*ms)

# alternative semantics
with parallel:
    self.device_a.pulse(10*ms)
    if self.device_b_enabled:
        self.device_b0.pulse(10*ms)
        self.device_b1.pulse(10*ms)
"""One or three pulses of which the two additional ones are sequential.
Current semantics is more compact due to the "implicit" sequential timing
"""
# current semantics
with parallel:
    self.device_a.pulse(10*ms)
    if self.device_b_enabled:  # Implicit sequential
        self.device_b0.pulse(10*ms)
        self.device_b1.pulse(10*ms)

# alternative semantics
with parallel:
    self.device_a.pulse(10*ms)
    if self.device_b_enabled:
        with sequential:  # Explicit sequential, improves readability
            self.device_b0.pulse(10*ms)
            self.device_b1.pulse(10*ms)
"""Parallel detection.
For current semantics, the parallel context lacks power and can not be used
"""
# current semantics
t = now_mu()
for pmt in self.pmt_array:
    at_mu(t)
    pmt.gate_rising(10*ms)
    
# alternative semantics
with parallel:
    for pmt in self.pmt_array:
        pmt.gate_rising(10*ms)

I would argue that the current parallel context lacks power, which limits its usability, though I would like to see examples where the current semantics are more powerful and/or improve readability.

Regarding explaining the semantics of the parallel statement, I would argue that my interpretation is easier to explain as it is closer to the semantics of the Python with statement. The entered context applies until (1) we leave the scope of the with statement, (2) we explicitly change the timing of the scope using sequential, or (3) we leave the local scope due to a function call. Points 1 and 3 hold for both semantic interpretations, but in the current semantics point 2 becomes implicit, which seems to defeat the purpose and limit the power of the parallel context.

Finally, I wanted to mention that the current ARTIQ kernel language is designed as a subset of Python, and staying as close as possible to the Python semantics would therefore make sense. The ARTIQ manual does not describe every aspect of the kernel language because the implemented language subset is supposed to work similarly to Python. Apart from that, "pythonic" programming does not conflict with low-level programming in kernels and even encourages being explicit.

@dnadlinger
Collaborator

staying as close as possible to the Python semantics would therefore make sense

This argument doesn't support your suggested interpretation either, as Python with statements/context managers don't at all have semantics of "modify every statement in the block". That would be AST macros. Rather, the rewrite for with ctx: body() is (very loosely) ctx.enter(); try: body(); except: ctx.exit(exc_info()); finally: ctx.exit(None). As such, both the current and the proposed interpretation are outside host Python semantics, making appeals to that a moot point. Neither semantics, whether it's "modify the meaning of each top-level statement", or "modify the meaning of each statement lexically in the same block", is representable in regular Python.

In general, I'd avoid introducing multiple language variants (e.g. via compiler flags) as much as possible, since it carries a hefty cost, not only in terms of compiler maintenance (an exponential number of combinations now needs to be tested), but also in terms of language complexity affecting teachability (documentation needs to mention that there are two distinct meanings that look visually identical) and programmer workload (reading code, one needs to watch out for the different meanings), as well as shareability/reusability of code (one might be unable to use two libraries that require disparate language variants).

In this particular instance, most of these issues (except for the increase in overall compiler/language complexity) could be addressed by simply making the distinction local, e.g. by locally specifying a language variant (@kernel(features={"deep_parallel"}), or specifying it at the site of use (with parallel(deep=True): …) – I just wanted to mention this, as variants specified via core device arguments/… were suggested.

@dnadlinger
Collaborator

I've just had a quick look through part of our code base (some 60k LOC), and there was only a single case where the behaviour would actually change, as we tend to write out with sequential: for clarity anyway.

As such, I'm not necessarily opposed to this because of backwards compatibility (although other code bases might look different); rather, I'm mostly just not sure whether the added complexity is worth it.

@lriesebos
Contributor Author

I think we interpret the application of the with statement a bit differently. Let me try to clarify that. I am not arguing that statements in a with block are modified by the parallel context. I see entering the with parallel context as adding a time context to your "timing context stack". It means that timeline alterations within the context are interpreted differently, as the top of the stack decides how a delay interacts with the timeline. This is, by the way, exactly what is described in https://github.com/m-labs/artiq/blob/master/artiq/sim/time.py . In that interpretation, the semantics of the parallel statement I describe match how it functions in that simulation. And just to finish the explanation: each time we call another kernel function, a sequential context is pushed onto the stack. That would exactly implement the semantics I described. I hope that clarifies my interpretation and why I think it is closer to the Python language.
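
For concreteness, a minimal host-side sketch of that stack model (illustrative only, not the actual artiq/sim code):

class SequentialTimeContext:
    def __init__(self, current_time):
        self.current_time = current_time
        self.block_duration = 0.0

    def take_time(self, amount):
        # sequential: delays advance the cursor and add up
        self.current_time += amount
        self.block_duration += amount


class ParallelTimeContext(SequentialTimeContext):
    def take_time(self, amount):
        # parallel: the cursor stays put; only the longest sub-duration
        # determines how long the whole block takes
        if amount > self.block_duration:
            self.block_duration = amount


class TimeManager:
    def __init__(self):
        self.stack = [SequentialTimeContext(0.0)]

    def enter_parallel(self):
        self.stack.append(ParallelTimeContext(self.stack[-1].current_time))

    def enter_sequential(self):
        self.stack.append(SequentialTimeContext(self.stack[-1].current_time))

    def exit(self):
        ctx = self.stack.pop()
        self.stack[-1].take_time(ctx.block_duration)

    def delay(self, duration):
        self.stack[-1].take_time(duration)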

Regarding legacy code, I have the impression that the change is not as invasive as it sounds, though obviously other users will have to agree with that. As you mention, the semantics change would require adding some with sequential: lines, which I can imagine people might have already added for clarity anyway.

I do not think I am in a position to talk about implementation, though I do agree with the points you made about "multiple language variants".

@dnadlinger
Collaborator

The time manager example, however, doesn't actually model the semantics as currently implemented (and neither as proposed here). Really, we should probably delete artiq/sim entirely; I haven't seen it used in the wild, and it hasn't been kept up to date in years.

As for which interpretation is closer to regular Python, semantically, in both cases implicit with sequential: blocks are inserted – in one, on each top-level statement, and in the other, on each function call. Neither are implementable in regular Python (without horrid hacks making use of interpreter trace hooks anyway), so it doesn't seem at all obvious to me why one should be more natural than the other. This line of argument doesn't seem productive to me.

What is, however, the case, is that the proposed semantics are more complex to implement in the compiler, and define in a hypothetical formal language specification, than the current ones. For the current "shallow" case, the implementation for with parallel statements literally boils down to inserting a call to at_mu before every statement in the block body. In the proposed "deep" case, with parallel would need to save some global context information, such that e.g. the implementation for emitting the body of an if statement contained within the with would also know to insert those at_mu calls before the statement in its body.

That being said, "deep" with parallel would still fall under the category "sufficiently easy to implement", so it's not a strong argument against those semantics. However, I think the conceptual simplicity/locality does slightly point towards preferring the "shallow" semantics, just as the potential intuition of less technical users regarding the behaviour of with parallel: for …: … does towards the "deep" one.

@sbourdeauducq
Member

Neither are implementable in regular Python (without horrid hacks making use of interpreter trace hooks anyway)

It's probably not that horrid, but yes, nobody seems to care about artiq/sim anymore (it dates from the first draft of the with parallel/with sequential ideas...).

@lriesebos
Contributor Author

Based on your explanation, it does seem that the timing model with a "timing stack" is not how it is currently implemented in the compiler. Still, even if the artiq/sim files are very outdated, the conceptual model with the timing stack behaves very closely to the timing semantics currently implemented in ARTIQ. I would argue the only difference is the semantics of with parallel, as we are discussing here.

The implicit adding of with sequential for each function call is actually trivial in a simulation. The @kernel decorator forwards each function call to the run() function of the core device. There, the actual call to the kernel function is done inside a with sequential statement. We have actually implemented such a functional simulator and it works without any hacks. It is actually the current behaviour of the parallel statement that requires a hack, which was for me the reason to start this discussion. Without a hack, the simulator handles the parallel context naturally with the "deep" semantics. Apart from that, it seems that the timing stack mimics the ARTIQ timing model perfectly. (FYI, we are willing to share access to this code if anyone is interested.)
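
As an illustration of that point, the simulator-side decorator could look roughly like this (names are made up; time_manager is an instance of a stack manager like the one sketched above):

import functools

time_manager = TimeManager()    # the timing stack sketched earlier

def sim_kernel(fn):
    """Hypothetical simulation counterpart of @kernel: every kernel call
    runs inside an implicit sequential context, so a surrounding parallel
    block never reaches into the callee."""
    @functools.wraps(fn)
    def wrapper(self, *args, **kwargs):
        time_manager.enter_sequential()
        try:
            return fn(self, *args, **kwargs)
        finally:
            time_manager.exit()
    return wrapper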

I do not know enough about the implementation of the ARTIQ compiler to argue if a deep parallel would be easy to implement, you would know better. Though I would guess it would not be too hard from a compiler perspective. I would guess that in the AST/IR, each function call under a with parallel statement is followed by an at_mu call while this behaviour does not propagate through function calls themselves. Probably no global data structure needed, only passing of "time context information" while traversing the graph.

Finally, I wanted to state that with either semantics for parallel, experiments can be implemented just fine. I do still think that the deep parallel would be the more powerful and natural implementation and that it would be worth considering for the near or further future. But if I can not convince readers of this thread, I am fine with letting it go, though a healthy discussion never hurts!

@sbourdeauducq
Member

we are willing to share access to this code if anyone is interested

Yes, please post it. That helps us know which parts of ARTIQ are important or not... before your messages we thought nobody was using artiq/sim anymore.

@dnadlinger
Collaborator

Yes, please post it.

Or, even better, consider developing such general components with the idea to submit them back to upstream ARTIQ to begin with. :)


As for implementability, I was referring to similar behaviour in regular host Python. If every function you call is annotated with a decorator (@kernel) such that you can modify global state there, of course you don't need to hook function calls in the interpreter (because that's already what you are effectively doing).

That being said, if simulation in this fashion is actually useful to people for work on real experiments, easily being able to simulate "deep" parallel is indeed an argument in favour of the latter. For us, we'd typically need to mock out a considerable amount of extra hardware interfaces and complexity in order to be able to usefully simulate kernels, so we tend just to test on actual hardware.

There is still the question of backwards-compatibility of course, which I can't answer for the wider ARTIQ community. For our code bases, quickly auditing every with parallel wouldn't be too bad, but perhaps others might be less happy about this?


Probably no global data structure needed, only passing of "time context information" while traversing the graph.

Global state as in state kept while translating the AST into IR, yes.

I wonder where the timeline resets/duration tracking would best be inserted in the deep model. Before/after each CallT node, as each interaction with the timeline (syntactically) happens through function calls?

each function call under a with parallel statement is followed by an at_mu call

(Just for the sake of completeness: There is an at_mu call before every statement, and an assignment to a local variable keeping track of the thus-far longest duration after each one.)
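
Putting that together, the "shallow" lowering amounts to something like the following source-level sketch (the compiler works on the IR; the names here are illustrative):

with parallel:
    stmt_a()
    stmt_b()

# behaves approximately like:

t_start = now_mu()
t_end = t_start

at_mu(t_start)
stmt_a()
if now_mu() > t_end:
    t_end = now_mu()

at_mu(t_start)
stmt_b()
if now_mu() > t_end:
    t_end = now_mu()

at_mu(t_end)    # the block ends at the latest end time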

@dhslichter
Contributor

dhslichter commented Dec 4, 2020

Changing to "deep" parallel necessarily requires the use of explicit sequential. Basically none of our code uses explicit sequential, so it would be a huge rewrite. I am sure that there are other groups in the same boat. Implicit sequential saves extra lines of code and extra levels of indenting. Personally, I find it preferable from a logical-flow standpoint, but others may feel differently.

I think if anything is to be considered, it should be the addition of a new timing context deep_parallel, in addition to (and not to replace) the existing parallel. If the use case is really there that one wants to implement this distinct new context, then I don't have an objection, although it does represent some additional clutter. A few comments:

  • the way that SED is set up, having a number of events at the same timestamp can cause sequence errors (if there are more events than lanes, it's guaranteed). Having the deep_parallel context available thus invites people to do lots of things in parallel without realizing the underlying fragility of SED when lots of things are done at once. The addition of deep_parallel might then require some rethinking of how SED works, or at a minimum, larger default numbers of SED lanes (a more fragile fix).
  • Furthermore, deep_parallel increases the chances that a subtle code issue that was not a bug with parallel (failing to insert an explicit sequential) will end up causing this kind of sequence error, confusing the living daylights out of lots of new users just finding their way. This could be especially pernicious if one calls a function inside a deep_parallel context, which itself calls a subfunction, etc. Unless ALL of your code is done with explicit sequential, the deep_parallel will move down through functions and parallel-ize pulses you did not intend to have parallel. I don't see how one can really support deep_parallel without eliminating implicit sequential.
  • No matter what, the documentation needs to be improved to describe better the current parallel behavior. If deep_parallel ends up being implemented, there needs to be explicit and bright red bold discussion next to it that if you want to use this, you should (must?) use explicit sequential everywhere in your code.

In short, the fact that ARTIQ has been used in dozens of experimental setups in many different laboratories over 5+ years without issues being filed about how parallel behaves suggests that, at least for most people, the "shallow" parallel works just fine.
Given that there are simple ways of implementing deep parallel as described above, I don't see that we should be in a rush to replace the existing parallel, but I would say we can consider adding in a special deep_parallel if that is important to some people.

@lriesebos
Contributor Author

As mentioned before, I am ok with the current implementation of parallel though I do think changing its semantics to "deep" would make it more useful/powerful. We could consider adding this functionality to the existing compiler or decide to keep the idea for a future breaking change to the compiler.

For the existing compiler, I understand we cannot suddenly change the semantics of parallel due to legacy code. I think having deep_parallel as an alternative could be interesting and would allow us to play with the concept.
At this moment I am not capable of adding such a feature to the ARTIQ compiler, so if the ARTIQ maintainers agree to give this idea a swing, someone else would have to implement it.

@dnadlinger regarding the implementation in the compiler, I also think the timeline "reset" would be inserted before each call node inside a parallel block. (thanks for the correction)

For the "deep" vs "shallow" parallel discussion:

  1. @dhslichter I think we still have a misunderstanding on "deep" parallel. It does not propagate through functions. On entry to any kernel function, the timing context will always be sequential. So it is not required to use sequential "everywhere" in your code. As mentioned below, maybe give the plugin a try to see which lines of code would potentially be affected by a "deep" parallel implementation.
  2. Both deep and shallow parallel could take extra lines or extra indenting based on the desired functionality. I guess they kind of break even (see the examples above)
  3. That "deep" is easier to simulate is definitely a pro, though of course only if you use a simulator. I don't think the "shallow" parallel can be simulated without parsing the code of the kernel.

@sbourdeauducq regarding artiq/sim, we are not using those files.
Instead we have rewritten the simulation backend, though the timing model is pretty much the same as described in artiq/sim.
Our functional ARTIQ simulator is part of a larger project known as Duke ARTIQ Extensions.
At this moment, our project is private and requires Python 3.7+.
We are planning to open-source various parts of the project in 2021.
We are definitely considering submitting some components upstream once the development on our side stabilizes and the ARTIQ stable release fully moves to Python 3.7.
I will invite people who have shown interest to our project.

Just as a bonus, I quickly wrote a flake8 plugin that can check ARTIQ code and flag lines that potentially rely on the "shallow" parallel semantics. It also checks for a couple of other potential bugs specific to ARTIQ code. See https://gitlab.com/duke-artiq/flake8-artiq .
I checked our own code and only found a handful of situations where "shallow" and "deep" parallel would have behaved differently.
@sbourdeauducq feel free to add flake8-artiq to the ARTIQ nix channels if you like the plugin.

@sbourdeauducq
Member

@sbourdeauducq regarding artiq/sim, we are not using those files.

OK to delete them then?

We are definitely considering submitting some components upstream once the development on our side stabilizes and the ARTIQ stable release fully moves to Python 3.7.

This should not be connected to ARTIQ stable - in fact we do not accept new features in release branches, only bug fixes.

@lriesebos
Contributor Author

@sbourdeauducq if it was up to me, feel free to delete artiq/sim.

Once we are ready, we can contribute components of our code to the master/development branch. Though in our lab we run on the ARTIQ (5) stable channel. Hence we decided to set up our project as an extension to ARTIQ.

@sbourdeauducq
Member

Hmm, the SAWG gateware simulation uses artiq.sim...

@sbourdeauducq
Member

@sbourdeauducq feel free to add flake8-artiq to the ARTIQ nix channels if you like the plugin.

Looks good – I have added it to Nix, to conda since that seems simple enough (not tested, though; I'm just running the generic conda package builder in Nix), and to https://m-labs.hk/experiment-control/resources/

@dhslichter
Contributor

If deep_parallel does not propagate through functions then that addresses a lot of my concern. I would still strongly prefer that deep_parallel be implemented as a separate context, rather than clobbering shallow parallel.

@sbourdeauducq
Member

sbourdeauducq commented Sep 20, 2021

The simulator can probably use this: https://rushter.com/blog/python-bytecode-patch/
The bytecode module https://github.com/MatthieuDartiailh/bytecode looks better maintained than the others.

AST macros are perhaps unwieldy, as we would have to handle function calls in expressions, e.g.

with parallel:
  x = a() + b()

would have to be transformed to:

with parallel:
  with sequential:
    a_result = a()
  with sequential:
    b_result = b()
  x = a_result + b_result

With lazy-evaluated boolean operators (and, or) it gets worse. This is not an issue with the stack-based bytecode.
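
For example, to preserve short-circuiting, an AST transformation could no longer just hoist both calls unconditionally; a sketch of what x = a() or b() would have to turn into:

with parallel:
    with sequential:
        a_result = a()
    if a_result:
        x = a_result
    else:
        with sequential:
            b_result = b()
        x = b_result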

The debugger doesn't look like it can be used, as I don't see an API for setting breakpoints at specific places in the AST or bytecode.
