RTIO collision fails to produce an exception #1644

evilmav · 2021-03-25T13:18:09Z

Bug Report

One-Line Summary

Updating two Fastino DACs at once causes a silent RTIO collision which does not rise any exceptions and lets the kernel continue.

Issue Details

Steps to Reproduce

Run the experiment with artiq_run:

from artiq.experiment import kernel, EnvExperiment, parallel, ms, ns, delay, now_mu;


class Issue1644(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("fastino") 

    @kernel
    def run(self):
        self.core.reset()

        for i in range(10):
            delay(10 * ms)

            # set 2 channels with two commands
            self.fastino.set_dac(0, 0.) 
            self.fastino.set_dac(1, 1.)
            self.core.wait_until_mu(now_mu()) 
            
            # should have failed by now
            print("Not crashed at cycle ", i)

        print("Kernel finished")

Expected Behavior

Kernel should crash with some uncaught exception during cycle 1.

Actual (undesired) Behavior

Kernel runs all the way to the finish even though Fastino outputs are not updated
Log shows 10x ERROR(runtime::rtio_mgt): RTIO collision involving channel 36

Your System (omit irrelevant parts)

Operating System: NixOS 20.09
ARTIQ version: v6.7592.3f6840b7
Hardware involved: Kasli v1.1 with Fastino

The text was updated successfully, but these errors were encountered:

sbourdeauducq · 2021-03-25T14:29:36Z

Log shows 10x ERROR(runtime::rtio_mgt): RTIO collision involving channel 36

...which is how the error in your kernel is reported.

Expected Behavior
Kernel should crash with some uncaught exception during cycle 1.

Exceptions would either:

be asynchronous, which cannot be done safely in Python
cause a slowdown of every RTIO operation since the gateware is unable to determine if there is a collision before several cycles after an event is submitted. On DRTIO it is even worse.

hartytp · 2021-03-25T14:44:08Z

c.f. https://github.com/m-labs/artiq/blob/9214e0f3e25dc8f0036970213acf90ee0cc56fb5/doc/manual/rtio.rst#collisions

evilmav · 2021-03-25T14:47:12Z

Thank you for the docs; my bad, I've indeed missed it. Wouldn't it be possible to set a flag that there were errors during a last kernel/experiment execution and make artiq_run at least throw a "there were errors" into the console if there were any?
From a beginner end-user perspective, who will usually use artiq_run as recommended in the docs for "the first steps", you execute you kernel, you get no explicit errors and will not realize analog channels were not updated.

hartytp · 2021-03-25T14:49:12Z

IMO in an ideal world, the real time guarantee that ARTIQ provides would be "everything happens exactly as your program dictates or there is an exception". In practice, there are some kinds of error that are hard to catch synchronously without incurring significant performance (latency) penalties (the same is true of sequence errors).

asynchronous exceptions would be ...messy

Thank you for the docs; I've indeed missed it. Wouldn't it be possible to set a flag that there were errors during a last kernel/experiment execution and make artiq_run at least throw a "there were errors" into the console if there were any?
From a beginner end-user perspective, who will usually use artiq_run as recommended in the docs for "the first steps", you execute you kernel, you get no explicit errors and will not realize analog channels were not updated.

Yes, I did wonder whether an API with a flag that one can query would be helpful here. In which case, one can add some api that raises exceptions synchronously but after the fact.

sbourdeauducq · 2021-03-25T14:55:43Z

I'm aware that the current behavior isn't ideal, and I was considering terminating the kernel and running a (possibly user-defined) error handler asynchronously. There could be critical sections in the main kernel where execution of the error handlers is suspended, so race conditions could be avoided.
Note that if you have DRTIO with a lot of latency, or if you have collisions with timestamps far in the future, errors could even happen after your kernel has terminated. It's not so easy to reliably report them at the end of artiq_run, you'd basically have to query and synchronize every DRTIO node.

hartytp · 2021-03-25T14:58:18Z

Note that if you have DRTIO with a lot of latency, errors could even happen after your kernel has terminated. It's not so easy to reliably report them at the end of artiq_run, you'd basically have to query and synchronize every DRTIO node.

Indeed, thank you for the reminder

sbourdeauducq · 2021-03-25T14:58:58Z

And this reliable reporting of collisions would also not play nice with seamless kernel handover.

cjbe · 2021-03-25T16:50:25Z

How about having a run-time managed async exception set-on-error, cleared-on-read flag - this could be polled by the user at sensible times (eg after each measurement, to chuck away data on error), and returned to the host when the experiment ends (the host could then report the error to the log / dashboard, or be reported by artiq_run).

This would mean that users see that a bad thing happened without having to do anything, while allowing more sophisticated users to handle this in a custom fashion to exclude bad data-points / report / exit on error

dnadlinger · 2021-03-25T16:58:17Z

Agreed, I was thinking something like this as well, as it nicely avoids the "exception origin problem". Unless specifically disabled, this would also be checked at the end of experiments by default.

(Or maybe the default should be on for artiq_run, where seamless handover seems like a very rare use case, and off for regular master experiments unless enabled in the device db such as not to break backwards-compatibility with experiments actually using seamless handover. I can't really comment on the latter, as we aren't (intentionally) using seamless handover here at all.)

From a user's perspective, this would also integrate nicely with the event submission performance improvements we discussed a year ago or so, where we would add a context manager/… that can be used to temporarily disable synchronous RTIO exceptions until end of the block (increasing the steady-state event rate by avoiding the CDC round-trip latency penalty on each write), and only query for errors at the end.

dnadlinger · 2021-03-25T17:01:31Z

I should add that, in my experience, this isn't just a new user's problem; even people who have been working with ARTIQ experiments daily for a few months and have read the RTIO manual invariably run into mysterious "disappearing events"/… at some point. I would personally find it much more ergonomic if my first reaction to "something weird is happening" wouldn't need to be "check if the core log controller is still working" too.

sbourdeauducq · 2021-03-25T22:28:42Z

Collisions are detected by the sorting network in RTIO, i.e. only when events are executed. So your API will still be tricky to use and your automatic check at the end of the experiment will not be reliable.

cjbe · 2021-03-25T22:35:29Z

In practise I think the above outlined approach will cover most cases. Normally experiments generate a lot of output events, then block on input events (of one kind or another) to read back the results, hence all events will have executed before the experiment ends.

IMHO having a mental model where, if we don't get any errors, we know that "all events that have executed more the 10us before the experiment ends completed successfully" this would cover the overwhelming majority of use cases.
(10us here is a random choice for an upper-bound on DRTIO error propagation latency)

dnadlinger · 2021-03-25T23:24:41Z

Collisions are detected by the sorting network in RTIO, i.e. only when events are executed. So your API will still be tricky to use and your automatic check at the end of the experiment will not be reliable.

The API would synchronize the CPU with the RTIO timeline, that is, wait for the counter to reach the last event (plus enough slack to get all the exceptions).

Sure, there are cases where you don't want to do this (e.g. seamless handover), but it seems to be an eminently sensible default.

dhslichter · 2021-09-14T15:27:47Z

ping @sbourdeauducq @dnadlinger @cjbe has there been agreement/implementation of a solution here?

sbourdeauducq · 2021-09-15T03:09:18Z

We can probably print the warning message on the host at the end of kernels as @cjbe suggested.

Closes #1644

sbourdeauducq closed this as completed Mar 25, 2021

sbourdeauducq reopened this Sep 15, 2021

sbourdeauducq assigned stevefan1999 Sep 15, 2021

sbourdeauducq added good first issue area:runtime labels Sep 15, 2021

stevefan1999 mentioned this issue Dec 1, 2021

Host report for async error upon kernel termination #1791

Merged

8 tasks

sbourdeauducq closed this as completed in #1791 Dec 4, 2021

sbourdeauducq pushed a commit that referenced this issue Dec 4, 2021

Host report for async error upon kernel termination (#1791)

4a6bea4

Closes #1644

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RTIO collision fails to produce an exception #1644

RTIO collision fails to produce an exception #1644

evilmav commented Mar 25, 2021 •

edited

Loading

sbourdeauducq commented Mar 25, 2021

hartytp commented Mar 25, 2021

evilmav commented Mar 25, 2021 •

edited

Loading

hartytp commented Mar 25, 2021

sbourdeauducq commented Mar 25, 2021 •

edited

Loading

hartytp commented Mar 25, 2021

sbourdeauducq commented Mar 25, 2021

cjbe commented Mar 25, 2021

dnadlinger commented Mar 25, 2021

dnadlinger commented Mar 25, 2021

sbourdeauducq commented Mar 25, 2021

cjbe commented Mar 25, 2021

dnadlinger commented Mar 25, 2021

dhslichter commented Sep 14, 2021

sbourdeauducq commented Sep 15, 2021

RTIO collision fails to produce an exception #1644

RTIO collision fails to produce an exception #1644

Comments

evilmav commented Mar 25, 2021 • edited Loading

Bug Report

One-Line Summary

Issue Details

Steps to Reproduce

Expected Behavior

Actual (undesired) Behavior

Your System (omit irrelevant parts)

sbourdeauducq commented Mar 25, 2021

hartytp commented Mar 25, 2021

evilmav commented Mar 25, 2021 • edited Loading

hartytp commented Mar 25, 2021

sbourdeauducq commented Mar 25, 2021 • edited Loading

hartytp commented Mar 25, 2021

sbourdeauducq commented Mar 25, 2021

cjbe commented Mar 25, 2021

dnadlinger commented Mar 25, 2021

dnadlinger commented Mar 25, 2021

sbourdeauducq commented Mar 25, 2021

cjbe commented Mar 25, 2021

dnadlinger commented Mar 25, 2021

dhslichter commented Sep 14, 2021

sbourdeauducq commented Sep 15, 2021

evilmav commented Mar 25, 2021 •

edited

Loading

evilmav commented Mar 25, 2021 •

edited

Loading

sbourdeauducq commented Mar 25, 2021 •

edited

Loading