corelog chokes coredevice #986

jordens · 2018-04-23T12:46:51Z

from artiq.experiment import *

class TestPause(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.setattr_device("scheduler")

    def run(self):
        for i in range(10):
            print(self.k()*1e-9)

    @kernel
    def k(self):
        self.core.break_realtime()
        t0 = self.core.get_rtio_counter_mu()
        self.scheduler.check_pause()
        return self.core.get_rtio_counter_mu() - t0

On a freshly started coredevice with artiq_run and nothing else running:

(py35) rj@murray:~/src/artiq-setup (master)$ artiq_run repository/test_pause.py
WARNING:artiq.coredevice.comm_kernel:Mismatch between gateware (4.0.dev+849.g4fe09fdd.dirty) and software (4.0.dev+850.gada0a278) versions
0.00244876
0.00230096
0.002301528
0.00260904
0.002303112
0.002300456
0.002300416
0.002300376
0.002609592
0.002606632

Now starting aqctl_corelog:

(py35) rj@murray:~/src/artiq-setup (master)$ aqctl_corelog -p 1068 --bind ::1 10.0.16.118 &
[2] 3890
(py35) rj@murray:~/src/artiq-setup (master)$ INFO:firmware.runtime:ARTIQ runtime starting...
INFO:firmware.runtime:software version 4.0.dev+849.g4fe09fdd
INFO:firmware.runtime:gateware version 4.0.dev+849.g4fe09fdd.dirty
INFO:firmware.runtime:log level set to INFO by default
INFO:firmware.runtime:UART log level set to INFO by default
INFO:firmware.runtime:press 'e' to erase startup and idle kernels...
INFO:firmware.runtime:continuing boot
INFO:firmware.board_artiq.si5324:Si5324 is locked
INFO:firmware.runtime:using MAC address 54-10-ec-aa-09-72
INFO:firmware.runtime:using IP address 10.0.16.118
INFO:firmware.runtime.rtio_mgt:using external startup RTIO clock
INFO:firmware.runtime.mgmt:management interface active
INFO:firmware.runtime.session:accepting network sessions
INFO:firmware.runtime.session:running startup kernel
INFO:firmware.runtime.kern_hwreq:resetting RTIO
INFO:firmware.runtime.session:startup kernel finished
INFO:firmware.runtime.session:no connection, starting idle kernel
INFO:firmware.runtime.session:no idle kernel found
INFO:firmware.runtime.session:new connection from 10.0.16.10:56778
INFO:firmware.runtime.session:no connection, starting idle kernel
INFO:firmware.runtime.session:no idle kernel found
INFO:firmware.runtime.mgmt:new connection from 10.0.16.10:41518

(py35) rj@murray:~/src/artiq-setup (master)$ ps ax | grep aqctl
 3890 pts/4    Sl     0:00 /home/rj/src/conda/envs/py35/bin/python /home/rj/src/conda/envs/py35/bin/aqctl_corelog -p 1068 --bind ::1 10.0.16.118
 4046 pts/4    S+     0:00 grep --color=auto aqctl
(py35) rj@murray:~/src/artiq-setup (master)$ ps ax | grep artiq
 4051 pts/4    S+     0:00 grep --color=auto artiq
(py35) rj@murray:~/src/artiq-setup (master)$

And reattempting the experiment:

(py35) rj@murray:~/src/artiq-setup (master)$ artiq_run repository/test_pause.py
INFO:firmware.runtime.session:new connection from 10.0.16.10:56786
WARNING:artiq.coredevice.comm_kernel:Mismatch between gateware (4.0.dev+849.g4fe09fdd.dirty) and software (4.0.dev+850.gada0a278) versions
0.454824784
0.4547608
0.454760808
0.454776616
0.454777624
0.454917736
0.454914816
0.454761288
0.454760688
0.45493312

Persists after killing aqctl_corelog:

(py35) rj@murray:~/src/artiq-setup (master)$ kill 3890
[2]-  Terminated              aqctl_corelog -p 1068 --bind ::1 10.0.16.118
(py35) rj@murray:~/src/artiq-setup (master)$ artiq_run repository/test_pause.py
WARNING:artiq.coredevice.comm_kernel:Mismatch between gateware (4.0.dev+849.g4fe09fdd.dirty) and software (4.0.dev+850.gada0a278) versions
0.454673616
0.454611728
0.45461164
0.454633296
0.454632704
0.454683088
0.454769352
0.454617472
0.454620032
0.45477808

But after a while (in this case ~5 minutes) it recovers:

(py35) rj@murray:~/src/artiq-setup (master)$ artiq_run repository/test_pause.py
WARNING:artiq.coredevice.comm_kernel:Mismatch between gateware (4.0.dev+849.g4fe09fdd.dirty) and software (4.0.dev+850.gada0a278) versions
0.002502056
0.002502744
0.002680816
0.002681608
0.002683608
0.002680888
0.002681624
0.00268192
0.002680776
0.002683944

Maybe or may not be identical or related to Device slowdown after extended run time #979
No dashboard, no master, running in the repro above.
Experiments submitted through artiq_master and artiq_run are also different. On artiq_master the RPCs take ~5ms while on artiq_run they are around 2.5ms
I haven't checked whether the break_realtime() and/or the fact that the scheduler.check_pause() is used as an RPC matter.

The text was updated successfully, but these errors were encountered:

whitequark · 2018-04-23T13:16:57Z

Experiments submitted through artiq_master and artiq_run are also different. On artiq_master the RPCs take ~5ms while on artiq_run they are around 2.5ms

That's probably because of moninj. Moninj, session, and corelog all compete for CPU time, and since we use a round-robin scheduler without any kind of readiness notification mechanism, adding one more thread increases the average latency by poll time plus schedule time plus (in case of corelog) two context switch times.

jordens · 2018-04-23T18:53:03Z

And poll time plus schedule time plus two ctx switches are 400k cycles? Or is that the scheduler tick?

whitequark · 2018-04-23T18:59:51Z

The scheduler is cooperative so it doesn't have ticks. It switches to the next thread once the current one blocks on something.

whitequark · 2018-05-05T01:20:43Z

400k cycles were spent rotating the log ring buffer in place. I am now observing exact same performance with or without aqctl_corelog.

sbourdeauducq · 2018-05-06T03:09:19Z

@whitequark Please fix this in release-3 as well.

Closes #986.

whitequark · 2018-05-06T14:31:10Z

Ah sorry, didn't notice the milestone.

jordens added prio:3-serious area:runtime area:coredevice area:speed labels Apr 23, 2018

jordens added this to the 4.0 milestone Apr 23, 2018

jordens assigned whitequark Apr 23, 2018

jordens added this to pain in dashboard-jordens Apr 23, 2018

jordens modified the milestones: 4.0, 3.7 Apr 23, 2018

whitequark closed this as completed in 4f29d91 May 5, 2018

sbourdeauducq reopened this May 6, 2018

whitequark added a commit that referenced this issue May 6, 2018

Update log_buffer.

9589dec

Closes #986.

whitequark closed this as completed May 6, 2018

jordens moved this from pain to done in dashboard-jordens May 9, 2018

jordens removed this from done in dashboard-jordens May 9, 2018

jordens mentioned this issue Jul 9, 2018

FPGA refuses connection #1100

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

corelog chokes coredevice #986

corelog chokes coredevice #986

jordens commented Apr 23, 2018 •

edited

whitequark commented Apr 23, 2018

jordens commented Apr 23, 2018

whitequark commented Apr 23, 2018

whitequark commented May 5, 2018

sbourdeauducq commented May 6, 2018

whitequark commented May 6, 2018

corelog chokes coredevice #986

corelog chokes coredevice #986

Comments

jordens commented Apr 23, 2018 • edited

whitequark commented Apr 23, 2018

jordens commented Apr 23, 2018

whitequark commented Apr 23, 2018

whitequark commented May 5, 2018

sbourdeauducq commented May 6, 2018

whitequark commented May 6, 2018

jordens commented Apr 23, 2018 •

edited