
Long dead time between kernels on Windows #407

Closed

r-srinivas opened this issue Apr 25, 2016 · 91 comments

@r-srinivas

I'm using 1.0 rc3 on Windows. I have the following experiment IonMonitor in the background,

from artiq.experiment import *

class IonMonitor(EnvExperiment):

    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl2")
        self.setattr_device("scheduler")

    def run(self):
        while True:
            self.core.comm.close()
            self.scheduler.pause()
            self.pulse()

    @kernel
    def pulse(self):
        self.core.break_realtime()
        for i in range(1):
            self.ttl2.pulse(10*ms)
            delay(10*ms)

And then I run another experiment TTL_test with a higher priority,

from artiq.experiment import *

class TTL_test(EnvExperiment):

    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl1")

    @kernel
    def run(self):
        self.ttl1.pulse(10*us)

This experiment executes and the previous experiment IonMonitor turns on again. However, between the last IonMonitor pulse and the TTL_test pulse, there's a gap of about 1.2s, and 1.1s between the TTL_test pulse and the next IonMonitor pulse. Is there a way to shorten this gap?

Using 1.0rc3 on Linux, this gap is about 140 ms before and 170 ms after the TTL_test pulse.

@whitequark
Contributor

I would expect this to result from an additional constant latency of running the linker on Windows.

Generally, you should expect a relatively high latency every time kernel mode is entered. Type checking and compilation take a while, especially once you move past trivial experiments.

@r-srinivas
Author

Is there a better way to code the background experiment then?

@whitequark
Contributor

Yes. The loop that performs the pulses should not preempt itself after every pulse. Instead, it should just run forever in the background, and be preempted once there is a higher priority experiment scheduled.

However, I think this is not currently possible. @sbourdeauducq ?

@r-srinivas
Author

Could programming the idle kernel solve this? Would that turn on whenever there's dead time between experiments, or only when the Ethernet cable is disconnected?

@whitequark
Contributor

Yes, I think so. There's very little dead time between stopping the idle kernel and starting the newly uploaded one.

@sbourdeauducq changed the title from "Long dead time between experiments on Windows" to "Long dead time between kernels on Windows" on Apr 26, 2016
@sbourdeauducq
Member

What about adding a function self.scheduler.has_higher_priorities() that returns whether a higher-priority experiment is present, i.e. whether self.scheduler.pause() would preempt the current one?

This way IonMonitor can call it via RPC from its kernel, the kernel will only return when needed, and there will only be a small gap (~1 ms) when there is no preemption.

There will still be a large gap when it is preempted, though.
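
For illustration, a minimal sketch of how IonMonitor could use such an RPC from its kernel; has_higher_priorities() is the name proposed above, not an API that exists at this point (the feature eventually landed in ARTIQ as scheduler.check_pause()):

from artiq.experiment import *


class IonMonitor(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl2")
        self.setattr_device("scheduler")

    def run(self):
        while True:
            self.scheduler.pause()  # yield once the kernel has returned
            self.pulse()

    @kernel
    def pulse(self):
        self.core.break_realtime()
        # Stay inside the kernel; each iteration asks the scheduler over RPC
        # whether a higher-priority experiment is waiting, and only returns
        # to the host when preemption is actually needed.
        while not self.scheduler.has_higher_priorities():
            self.ttl2.pulse(10*ms)
            delay(10*ms)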

@sbourdeauducq
Member

sbourdeauducq commented Apr 26, 2016

And the problem with the idle kernel is that it cannot access datasets.

@whitequark
Contributor

Sounds reasonable. And we can remove that gap easily by only preempting after compilation is done.

@sbourdeauducq
Member

How exactly? Add a feature to pre-compile kernels, which can be invoked in the prepare stage?
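
Something along these lines, as a sketch assuming a hypothetical core.precompile() that compiles a kernel ahead of time and returns a callable (ARTIQ did eventually gain such a function, much later, in release 7):

from artiq.experiment import *


class PrecompiledPulse(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl2")

    def prepare(self):
        # Compile the kernel while a lower-priority experiment may still
        # be running; run() then only pays for upload and execution.
        self.pulse_compiled = self.core.precompile(self.pulse)

    def run(self):
        self.pulse_compiled()

    @kernel
    def pulse(self):
        self.core.break_realtime()
        self.ttl2.pulse(10*ms)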

@r-srinivas
Author

That seems reasonable. How short could the gap become if preemption only happens after compilation? But this wouldn't reduce the dead time for two non-looping experiments sequenced one after another, right?

@whitequark
Contributor

@sbourdeauducq I thought about some kind of hack to return from the has_higher_priorities RPC early from the kernel's perspective, let it do a few more loops while we're compiling, and then finally request termination. But it sounded much better in my head; we probably shouldn't actually do that.

@jordens
Member

jordens commented Apr 30, 2016

About 200 ms of the dead time is due to #398.

@jordens
Member

jordens commented May 3, 2016

And the rest of the delay is likely due to Windows generally being slow at creating new processes, especially Python processes, which open many files.

I would shape the behavior of yielding from a kernel (by checking for higher priorities) a bit differently: handle both TerminationRequested and a new YieldRequested through that channel. A preempt() called from a kernel would RPC, potentially raise either of those, or potentially return; the host side can then handle them or convert them, e.g. into pause(), maybe with a nice and compact context manager idiom. (Suggestions for better names welcome.)
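
None of these names exist yet; purely as a sketch of that idiom, the host side could read as follows (YieldRequested, preemptible() and preempt() are all hypothetical here):

from contextlib import contextmanager

from artiq.experiment import *


class YieldRequested(Exception):
    # Hypothetical counterpart to TerminationRequested: raised out of the
    # kernel when preempt() decides the experiment should yield.
    pass


@contextmanager
def preemptible(scheduler):
    # Hypothetical helper: convert a YieldRequested escaping the kernel
    # into a scheduler.pause(), after which the caller loops again.
    try:
        yield
    except YieldRequested:
        scheduler.pause()


class IonMonitor(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl2")
        self.setattr_device("scheduler")

    def preempt(self):
        # RPC target: raise if the scheduler wants this experiment to yield.
        if self.scheduler.has_higher_priorities():  # hypothetical, as above
            raise YieldRequested

    def run(self):
        while True:
            with preemptible(self.scheduler):
                self.pulse()

    @kernel
    def pulse(self):
        self.core.break_realtime()
        while True:
            self.ttl2.pulse(10*ms)
            delay(10*ms)
            self.preempt()  # may raise YieldRequested out of the kernel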

@sbourdeauducq
Member

Windows generally being slow at creating new processes, especially Python processes, which open many files

The pipelining should absorb the Python process creation. It could be the linker, as @whitequark suggests.

@whitequark
Contributor

Now that we're on LLVM 3.8 we can use the LLVM linker, and rip out a whole bunch of cruft at the same time as making it all faster.

@whitequark
Contributor

That will also:

  • remove the binutils dependency;
  • make the runtime smaller, since we can make use of LTO;
  • improve error messages, as LLVM's symbolizer works better than GNU's (not strictly related to the linker, but an adjacent API that would be convenient to adopt at the same time).

@jordens
Member

jordens commented May 3, 2016

Pipelining does not absorb anything significant for the experiments in question; their run() lasts only a few µs to ms. It is no secret that Python startup is dead slow on Windows.

@jordens
Member

jordens commented May 3, 2016

Maybe the pipelining can play a role in some pause() cases. But in general, for a queue of experiments lasting a few ms each, it won't help.

@sbourdeauducq
Member

sbourdeauducq commented May 5, 2016

Pipelining does not help if the desired experiment rate is higher than the process creation rate. But for pauses, the scheduler only preempts the low-priority experiment when the high-priority one is ready to execute run (and build and prepare have been executed), so it does absorb process creation.

@r-srinivas
Author

We tried loading an idle_kernel that pulses a TTL, but it doesn't seem to run in between kernels of the IonMonitor loop. Is that something that should happen?

@sbourdeauducq
Member

Is the IonMonitor code still exactly as above?

@r-srinivas
Author

No, but similar.

from artiq.experiment import *

class Ion_Monitor_Test(EnvExperiment):

    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl_bdd")
        self.setattr_device("ttl_pmt")
        self.setattr_device("scheduler")

    def run(self):
        while True:
            self.scheduler.pause()
            self.pulse()

    @kernel
    def pulse(self):
        self.core.break_realtime()
        self.ttl_bdd.pulse(100*ms)
        with parallel:
            self.ttl_pmt.gate_rising(100*us)
            self.ttl_bdd.pulse(100*us)
        counts = self.ttl_pmt.count()
        self.set_dataset("counts", counts, broadcast=True)

Our idle_kernel is,

from artiq.experiment import *

class IdleKernel(EnvExperiment):

    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl_bdd")

    @kernel
    def run(self):
        while True:
            self.ttl_bdd.pulse(990*us)
            delay(10*us)

If you have a look at the BDD on the scope, there's about 1.5 s of dead time between ttl_bdd pulses.

@r-srinivas
Author

Okay, looks like putting in self.core.comm.close() is still required for the idle_kernel to run. I thought scheduler.pause() called that automatically.

@sbourdeauducq
Member

With this code, it is expected that the idle kernel does not run. The idle kernel only runs when there is no valid TCP connection to the control PC, and on ARTIQ 1.x scheduler.pause() does not implicitly disconnect.
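
So on ARTIQ 1.x the loop has to drop the connection itself before pausing, as in the original IonMonitor at the top of this thread:

    def run(self):
        while True:
            self.core.comm.close()  # drop the TCP connection so the idle kernel starts
            self.scheduler.pause()
            self.pulse()  # reconnects and uploads the kernel again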

@sbourdeauducq
Member

sbourdeauducq commented May 26, 2016

@whitequark 1.5s is particularly long. Could you do a quick profiling of the compiler on Windows to see if there is any low-hanging fruit?

@r-srinivas
Author

The desired behavior of the idle kernel would be for it to run whenever no other kernel is running, not only when there is no TCP connection. This would be less of an issue if the delay between kernels were ~100 ms, but at 1-2 seconds it is a problem for keeping ions in surface traps between experiments.

With self.core.comm.close() the delay is still 750 ms before the idle kernel starts.

@sbourdeauducq
Member

No, use seamless handover. That will even cover your ~140ms delay on Linux.

@whitequark
Contributor

whitequark commented Oct 2, 2017

@r-srinivas Can you please recheck using latest builds? We've done a lot of work making the networking side more robust.

@r-srinivas
Author

Had a look at it using artiq 3.0 py_0+git04a9a0ce on Windows. The following experiment,

from artiq.experiment import *


class TTL_Test(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl4")

    def run(self):
        self.reset()
        for i in range(100):
            self.pulse()

    @kernel
    def reset(self):
        self.core.reset()
       
    @kernel
    def pulse(self):
        self.core.break_realtime()
        self.ttl4.pulse(10*ms)

has about 680 ms between pulses on the scope. That's better than the ~1 s we saw when we last checked. What's it like on your system?

@dhslichter
Contributor

Definitely an improvement over the ~3.5 s that was being seen with 3.0 previously (see the post in this issue from @r-srinivas on 12/1/16). It would still be nice to understand how this compares with what you see, @whitequark, on your Linux and Windows machines.

@sbourdeauducq
Member

sbourdeauducq commented Oct 3, 2017

Isn't the "normal" Windows performance 500-750 ms, due mostly to AV slowdown?

@r-srinivas
Author

r-srinivas commented Oct 3, 2017

@r-srinivas OK, this is definitely the AV in NIST. But that's not the whole story. Even with the AV slowdown it should give you not much more than 300ms (I estimate maybe 100ms more taken by stripping, not accounted for in perf_embedding) delay between pulses, whereas you observe 500-700ms. Something else is afoot.

I'm not sure that's entirely right. Something else seems to be causing a 200-450 ms delay?

@whitequark
Contributor

whitequark commented Oct 3, 2017

@r-srinivas Can you please recheck with the AV off? It's clear that performance is not going to be acceptable with it on anyway, so all the speculation on exactly how much delay it adds is unhelpful.

@dhslichter
Contributor

dhslichter commented Oct 4, 2017

AV off isn't an option for us at NIST, unfortunately, without jumping through a large number of administrative hoops, so we can't do any tests without AV in the near term. @whitequark what do you see on your Windows machine without AV running the same code?

@whitequark
Contributor

AV off isn't an option for us at NIST, unfortunately

Then, as per #407 (comment), you won't get any lower than ~280ms between kernel runs until #733 is done.

@dhslichter
Contributor

dhslichter commented Oct 4, 2017

ack. Is there a timeline estimate for #733? Has it been funded yet @sbourdeauducq @jordens ?

@whitequark
Contributor

@dhslichter Not currently. We'll need to move to LLVM 4.0 first (not a lengthy process). Then implement and test an LLD backend. I'll need to look at it in-depth to give an estimate.

@dhslichter
Contributor

What kind of an improvement would you expect from #733? If it's nontrivial, it's worth putting an estimate together for a contract.

@jordens
Member

jordens commented Oct 4, 2017

@dhslichter See #407 (comment), #407 (comment) for an old measurement on our VM. You should expect somewhere between 100 ms and 160 ms. But keep in mind that the NIST AV probably also penalizes the current (external) linker quite a bit beyond those numbers; we can't measure that easily.
I don't know how much faster the LLVM linker would be.

@dhslichter
Contributor

@jordens ack. Would #733 address this AV penalty for external linker too?

@whitequark
Contributor

I don't know how much faster the LLVM linker would be.

The LLVM linker should link (per se) about as quickly.

Would #733 address this AV penalty for external linker too?

This is the main reason to implement it. The LLVM linker (and stripper) run in-process so there is no penalty for running them. Think in the range of 5ms at most for each step for small kernels.

@dhslichter
Contributor

@whitequark aside from #733 and the AV, are there any other low-hanging fruit for reducing this dead time?

@whitequark
Contributor

Not that I recall.

@sbourdeauducq
Member

sbourdeauducq commented Jan 27, 2018

@r-srinivas @dhslichter I suppose the following experiment can be used instead of a scope:

import time

from artiq.experiment import *


class KernelLoadTime(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    def run(self):
        n = 100
        t0 = time.monotonic()
        for _ in range(n):
            self.dummy_kernel()
        t1 = time.monotonic()
        print((t1 - t0) / n)

    @kernel
    def dummy_kernel(self):
        pass

What is your result on Windows? I'm getting 62ms with Kasli, ARTIQ master, and Linux.

@sbourdeauducq
Member

And 68ms with KC705, ARTIQ master and Windows (no AV).

@dhslichter
Contributor

@sbourdeauducq what changed between your original measurement of 83 ms (before you edited the comment) and the edited value of 62 ms above?

@sbourdeauducq
Member

It fluctuates; I had 83 ms the first time, but it is usually more like 60-65 ms. Anyway, either 83 or 62 is a big difference from 500.

@dhslichter
Contributor

As a benchmark for comparison, @r-srinivas ran the test on our ARTIQ 2 setup which runs the magtrap experiment (Windows 7 with NIST AV), and we got 450-470 ms. Will run on the new ARTIQ 3 setup today and let you know how that looks.

@jonaskeller

With 3.2 py_8+gitaa64b8ad (the most recent flashable gateware), it takes between 520ms and 570ms on our Windows system.

@sbourdeauducq
Member

Okay, so this is most likely the AV, and #733 should address it.
