
Long dead time between kernels on Windows #407

Closed

r-srinivas opened this issue Apr 25, 2016 · 91 comments

@r-srinivas

I'm using 1.0 rc3 on Windows. I have the following experiment IonMonitor in the background,

from artiq.experiment import *

class IonMonitor(EnvExperiment):

    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl2")
        self.setattr_device("scheduler")

    def run(self):
        while True:
            self.core.comm.close()
            self.scheduler.pause()
            self.pulse()

    @kernel
    def pulse(self):
        self.core.break_realtime()
        for i in range(1):
            self.ttl2.pulse(10*ms)
            delay(10*ms)

And then I run another experiment TTL_test with a higher priority,

from artiq.experiment import *

class TTL_test(EnvExperiment):

    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl1")

    @kernel
    def run(self):
        self.ttl1.pulse(10*us)

This experiment executes and the previous experiment IonMonitor turns on again. However, between the last IonMonitor pulse and the TTL_test pulse, there's a gap of about 1.2s, and 1.1s between the TTL_test pulse and the next IonMonitor pulse. Is there a way to shorten this gap?

Using 1.0rc3 on Linux, this gap is about 140 ms before and 170 ms after the TTL_test pulse.

@whitequark
Contributor

I would expect this to result from an additional constant latency of running the linker on Windows.

Generally, you should expect a relatively high latency every time kernel mode is entered. Type checking and compilation take a while, especially once you move past trivial experiments.

@r-srinivas
Author

Is there a better way to code the background experiment then?

@whitequark
Contributor

Yes. The loop that performs the pulses should not preempt itself after every pulse. Instead, it should just run forever in the background, and be preempted once there is a higher priority experiment scheduled.

However, I think this is not currently possible. @sbourdeauducq ?

@r-srinivas
Author

Could programming the idle kernel solve this? Would that turn on whenever there's dead time between experiments, or only when the Ethernet cable is disconnected?

@whitequark
Contributor

Yes, I think so. There's very little dead time between stopping the idle kernel and starting the newly uploaded one.

@sbourdeauducq changed the title from "Long dead time between experiments on Windows" to "Long dead time between kernels on Windows" on Apr 26, 2016
@sbourdeauducq
Member

What about adding a function self.scheduler.has_higher_priorities() that returns whether a higher-priority experiment is present, i.e. whether self.scheduler.pause() would preempt the current one?

This way IonMonitor can call it via RPC from its kernel, the kernel will only return when needed, and there will only be a small gap (~1 ms) when there is no preemption.

There will still be a large gap when it is preempted, though.
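
For illustration, a minimal sketch of how IonMonitor could use such an RPC from its kernel; has_higher_priorities() is the name proposed above, not an API that exists at this point (the feature eventually landed in ARTIQ as scheduler.check_pause()):

from artiq.experiment import *


class IonMonitor(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl2")
        self.setattr_device("scheduler")

    def run(self):
        while True:
            self.scheduler.pause()  # yield once the kernel has returned
            self.pulse()

    @kernel
    def pulse(self):
        self.core.break_realtime()
        # Stay inside the kernel; each iteration asks the scheduler over RPC
        # whether a higher-priority experiment is waiting, and only returns
        # to the host when preemption is actually needed.
        while not self.scheduler.has_higher_priorities():
            self.ttl2.pulse(10*ms)
            delay(10*ms)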

@sbourdeauducq
Member

sbourdeauducq commented Apr 26, 2016

And the problem with the idle kernel is that it cannot access datasets.

@whitequark
Contributor

Sounds reasonable. And we can remove that gap easily by only preempting after compilation is done.

@sbourdeauducq
Member

How exactly? Add a feature to pre-compile kernels, which can be invoked in the prepare stage?
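
Something along these lines, as a sketch assuming a hypothetical core.precompile() that compiles a kernel ahead of time and returns a callable (ARTIQ did eventually gain such a function, much later, in release 7):

from artiq.experiment import *


class PrecompiledPulse(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl2")

    def prepare(self):
        # Compile the kernel while a lower-priority experiment may still
        # be running; run() then only pays for upload and execution.
        self.pulse_compiled = self.core.precompile(self.pulse)

    def run(self):
        self.pulse_compiled()

    @kernel
    def pulse(self):
        self.core.break_realtime()
        self.ttl2.pulse(10*ms)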

@r-srinivas
Author

That seems reasonable. How short could the gap become if preemption only happens after compilation? But this wouldn't reduce the dead time for two non-looping experiments sequenced one after another, right?

@whitequark
Contributor

@sbourdeauducq I thought about some kind of hack to return from the has_higher_priorities RPC early from the kernel's perspective, let it do a few more loops while we're compiling, and then finally request termination. But it sounded much better in my head; we probably shouldn't actually do that.

@jordens
Member

jordens commented Apr 30, 2016

About 200 ms of the dead time is due to #398.

@jordens
Member

jordens commented May 3, 2016

And the rest of the delay is likely due to Windows generally being slow at creating new processes, especially Python processes, which open many files.

I would shape the behavior of yielding from a kernel (by checking for higher priorities) a bit differently: handle both TerminationRequested and a new YieldRequested through that channel. A preempt() called from a kernel would RPC, potentially raise either of those, or potentially return; the host side can then handle them or convert them, e.g. into pause(), maybe with a nice and compact context manager idiom. (Suggestions for better names welcome.)
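
None of these names exist yet; purely as a sketch of that idiom, the host side could read as follows (YieldRequested, preemptible() and preempt() are all hypothetical here):

from contextlib import contextmanager

from artiq.experiment import *


class YieldRequested(Exception):
    # Hypothetical counterpart to TerminationRequested: raised out of the
    # kernel when preempt() decides the experiment should yield.
    pass


@contextmanager
def preemptible(scheduler):
    # Hypothetical helper: convert a YieldRequested escaping the kernel
    # into a scheduler.pause(), after which the caller loops again.
    try:
        yield
    except YieldRequested:
        scheduler.pause()


class IonMonitor(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl2")
        self.setattr_device("scheduler")

    def preempt(self):
        # RPC target: raise if the scheduler wants this experiment to yield.
        if self.scheduler.has_higher_priorities():  # hypothetical, as above
            raise YieldRequested

    def run(self):
        while True:
            with preemptible(self.scheduler):
                self.pulse()

    @kernel
    def pulse(self):
        self.core.break_realtime()
        while True:
            self.ttl2.pulse(10*ms)
            delay(10*ms)
            self.preempt()  # may raise YieldRequested out of the kernel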

@sbourdeauducq
Member

Windows generally being slow at creating new processes, especially Python processes, which open many files

The pipelining should absorb the Python process creation. It could be the linker, as @whitequark suggests.

@whitequark
Contributor

Now that we're on LLVM 3.8 we can use the LLVM linker, and rip out a whole bunch of cruft at the same time as making it all faster.

@whitequark
Contributor

That will also:

  • remove the binutils dependency;
  • make the runtime smaller, since we can make use of LTO;
  • improve error messages, as LLVM's symbolizer works better than GNU's (not strictly related to the linker, but an adjacent API that would be convenient to adopt at the same time).

@jordens
Member

jordens commented May 3, 2016

Pipelining does not absorb anything significant for the experiments in question; their run() lasts only a few µs to ms. It is no secret that Python startup is dead slow on Windows.

@jordens
Member

jordens commented May 3, 2016

Maybe the pipelining can play a role in some pause() cases. But in general, for a queue of experiments lasting a few ms each, it won't help.

@sbourdeauducq
Member

sbourdeauducq commented May 5, 2016

Pipelining does not help if the desired experiment rate is higher than the process creation rate. But for pauses, the scheduler only preempts the low-priority experiment when the high-priority one is ready to execute run (and build and prepare have been executed), so it does absorb process creation.

@r-srinivas
Author

We tried loading an idle_kernel that pulses a TTL, but it doesn't seem to run in between kernels of the IonMonitor loop. Is that something that should happen?

@sbourdeauducq
Member

Is the IonMonitor code still exactly as above?

@r-srinivas
Author

No, but similar.

from artiq.experiment import *

class Ion_Monitor_Test(EnvExperiment):

    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl_bdd")
        self.setattr_device("ttl_pmt")
        self.setattr_device("scheduler")

    def run(self):
        while True:
            self.scheduler.pause()
            self.pulse()

    @kernel
    def pulse(self):
        self.core.break_realtime()
        self.ttl_bdd.pulse(100*ms)
        with parallel:
            self.ttl_pmt.gate_rising(100*us)
            self.ttl_bdd.pulse(100*us)
        counts = self.ttl_pmt.count()
        self.set_dataset("counts", counts, broadcast=True)

Our idle_kernel is,

from artiq.experiment import *

class IdleKernel(EnvExperiment):

    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl_bdd")

    @kernel
    def run(self):
        while True:
            self.ttl_bdd.pulse(990*us)
            delay(10*us)

If you have a look at the BDD on the scope, there's about 1.5 s of dead time between ttl_bdd pulses.

@r-srinivas
Author

Okay, looks like putting in self.core.comm.close() is still required for the idle_kernel to run. I thought scheduler.pause() called that automatically.

@sbourdeauducq
Member

With this code, it is expected that the idle kernel does not run. The idle kernel only runs when there is no valid TCP connection to the control PC, and on ARTIQ 1.x scheduler.pause() does not implicitly disconnect.
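
So on ARTIQ 1.x the loop has to drop the connection itself before pausing, as in the original IonMonitor at the top of this thread:

    def run(self):
        while True:
            self.core.comm.close()  # drop the TCP connection so the idle kernel starts
            self.scheduler.pause()
            self.pulse()  # reconnects and uploads the kernel again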

@sbourdeauducq
Member

sbourdeauducq commented May 26, 2016

@whitequark 1.5s is particularly long. Could you do a quick profiling of the compiler on Windows to see if there is any low-hanging fruit?

@r-srinivas
Author

The desired behavior of the idle kernel would be for it to run whenever no other kernel is running, not only when there is no TCP connection. This would be less of an issue if the delay between kernels were ~100 ms, but at 1-2 seconds it is a problem for keeping ions in surface traps between experiments.

With self.core.comm.close() the delay is still 750 ms before the idle kernel starts.

@sbourdeauducq
Member

No, use seamless handover. That will even cover your ~140ms delay on Linux.

@whitequark
Contributor

whitequark commented Oct 2, 2017

@r-srinivas Can you please recheck using latest builds? We've done a lot of work making the networking side more robust.

@r-srinivas
Author

Had a look at it using artiq 3.0 py_0+git04a9a0ce on Windows. The following experiment,

from artiq.experiment import *


class TTL_Test(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl4")

    def run(self):
        self.reset()
        for i in range(100):
            self.pulse()

    @kernel
    def reset(self):
        self.core.reset()
       
    @kernel
    def pulse(self):
        self.core.break_realtime()
        self.ttl4.pulse(10*ms)

has about 680 ms between pulses on the scope. That's better than the ~1 s we saw when we last checked. What's it like on your system?

@dhslichter
Contributor

Definitely an improvement over the ~3.5 s that was being seen with 3.0 previously (see the post in this issue from @r-srinivas on 12/1/16). It would still be nice to understand how this compares with what you see, @whitequark, on your Linux and Windows machines.

@sbourdeauducq
Member

sbourdeauducq commented Oct 3, 2017

Isn't the "normal" Windows performance 500-750 ms, due mostly to AV slowdown?

@r-srinivas
Author

r-srinivas commented Oct 3, 2017

@r-srinivas OK, this is definitely the AV in NIST. But that's not the whole story. Even with the AV slowdown it should give you not much more than 300ms (I estimate maybe 100ms more taken by stripping, not accounted for in perf_embedding) delay between pulses, whereas you observe 500-700ms. Something else is afoot.

I'm not sure that's entirely right. Something else seems to be causing a 200-450 ms delay?

@whitequark
Contributor

whitequark commented Oct 3, 2017

@r-srinivas Can you please recheck with the AV off? It's clear that performance is not going to be acceptable with it on anyway, so all the speculation on exactly how much delay it adds is unhelpful.

@dhslichter
Contributor

dhslichter commented Oct 4, 2017

AV off isn't an option for us at NIST, unfortunately, without jumping through a large number of administrative hoops, so we can't do any tests without AV in the near term. @whitequark what do you see on your Windows machine without AV running the same code?

@whitequark
Contributor

AV off isn't an option for us at NIST, unfortunately

Then, as per #407 (comment), you won't get any lower than ~280ms between kernel runs until #733 is done.

@dhslichter
Contributor

dhslichter commented Oct 4, 2017

ack. Is there a timeline estimate for #733? Has it been funded yet @sbourdeauducq @jordens ?

@whitequark
Contributor

@dhslichter Not currently. We'll need to move to LLVM 4.0 first (not a lengthy process). Then implement and test an LLD backend. I'll need to look at it in-depth to give an estimate.

@dhslichter
Contributor

What kind of an improvement would you expect from #733? If it's nontrivial, it's worth putting an estimate together for a contract.

@jordens
Member

jordens commented Oct 4, 2017

@dhslichter See #407 (comment), #407 (comment) for an old measurement on our VM. You should expect somewhere between 100 ms and 160 ms. But keep in mind that the NIST AV probably also penalizes the current (external) linker quite a bit beyond those numbers; we can't measure that easily.
I don't know how much faster the LLVM linker would be.

@dhslichter
Contributor

@jordens ack. Would #733 address this AV penalty for external linker too?

@whitequark
Contributor

I don't know how much faster the LLVM linker would be.

The LLVM linker should link (per se) about as quickly.

Would #733 address this AV penalty for external linker too?

This is the main reason to implement it. The LLVM linker (and stripper) run in-process so there is no penalty for running them. Think in the range of 5ms at most for each step for small kernels.

@dhslichter
Contributor

@whitequark aside from #733 and the AV, are there any other low-hanging fruit for reducing this dead time?

@whitequark
Contributor

Not that I recall.

@sbourdeauducq
Member

sbourdeauducq commented Jan 27, 2018

@r-srinivas @dhslichter I suppose the following experiment can be used instead of a scope:

import time

from artiq.experiment import *


class KernelLoadTime(EnvExperiment):
    def build(self):
        self.setattr_device("core")

    def run(self):
        n = 100
        t0 = time.monotonic()
        for _ in range(n):
            self.dummy_kernel()
        t1 = time.monotonic()
        print((t1 - t0) / n)

    @kernel
    def dummy_kernel(self):
        pass

What is your result on Windows? I'm getting 62ms with Kasli, ARTIQ master, and Linux.

@sbourdeauducq
Member

And 68ms with KC705, ARTIQ master and Windows (no AV).

@dhslichter
Contributor

@sbourdeauducq what changed between your original measurement of 83 ms (before you edited the comment) and the edited value of 62 ms above?

@sbourdeauducq
Member

It fluctuates; I had 83 ms the first time, but it is usually more like 60-65 ms. Anyway, either 83 or 62 is a big difference from 500.

@dhslichter
Contributor

As a benchmark for comparison, @r-srinivas ran the test on our ARTIQ 2 setup which runs the magtrap experiment (Windows 7 with NIST AV), and we got 450-470 ms. Will run on the new ARTIQ 3 setup today and let you know how that looks.

@jonaskeller

With 3.2 py_8+gitaa64b8ad (the most recent flashable gateware), it takes between 520ms and 570ms on our Windows system.

@sbourdeauducq
Member

Okay, so this is most likely the AV, and #733 should address it.
