Urukul sync issues (SMP_ERR) #1692

Closed
airwoodix opened this issue Jun 4, 2021 · 30 comments · Fixed by #2374

Comments

@airwoodix
Contributor

Bug Report

One-Line Summary

When using DRTIO, inter-channel synchronization with Urukul doesn't work, although the same configuration produces the expected results on a standalone crate.

Issue Details

Steps to Reproduce

  1. A single crate with the following hardware configuration:
{
    "_description": "The description",
    "target": "kasli",
    "variant": "my-variant",
    "hw_rev": "v2.0",
    "base": "master",
    "peripherals": [
        {
            "type": "urukul",
            "ports": [3, 4],
            "hw_rev": "v1.4",
            "dds": "ad9910",
            "refclk": 125e6,
            "clk_sel": 2,
            "clk_div": 0,
            "pll_n": 32,
            "synchronization": true
        }
    ]
}

(+ more, hopefully irrelevant peripherals)

The Urukul v1.4 card has IFC mode 1010 (en_ad9910 and en_eem1 activated).

  2. device_db generated by artiq_ddb_template
  3. Run artiq_sinara_tester to calibrate the io_update and sync delays and write the results to EEPROM
  4. The following experiment:
from artiq.experiment import *

class DDSPhase(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.rf0 = self.get_device("urukul0_ch1")
        self.rf1 = self.get_device("urukul0_ch2")

    @kernel
    def run(self):
        self.core.reset()
        self.rf0.cpld.init()

        self.rf0.init()
        self.rf0.set_att_mu(160)
        self.rf0.set_phase_mode(2)

        self.rf1.init()
        self.rf1.set_att_mu(160)
        self.rf1.set_phase_mode(2)

        self.rf0.set(100*MHz)
        self.rf1.set(100*MHz, phase=0.5)

        self.rf0.sw.on()
        self.rf1.sw.on()

Expected Behavior

100 MHz sines on CH1 and CH2 have 180° phase difference
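
For reference, the AD9910 driver's `set()` takes `phase` in turns, so `phase=0.5` in the experiment above means half a cycle. A minimal sketch of the conversion follows; the helper names are hypothetical and not part of ARTIQ.

```python
def turns_to_degrees(turns):
    """Convert a phase in turns (1 turn = one full cycle) to degrees."""
    return turns * 360.0


def turns_to_delay(turns, frequency_hz):
    """Time shift in seconds equivalent to a phase offset at frequency_hz."""
    return turns / frequency_hz


# Values from the experiment above: phase=0.5 at 100 MHz.
print(turns_to_degrees(0.5))       # 180.0
print(turns_to_delay(0.5, 100e6))  # 5e-09, i.e. 5 ns of the 10 ns period
```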

Actual (undesired) Behavior

  • The phase difference is not always 180°. Occurrence of the correct phase difference between re-executions of the experiment seems random. The behavior is the same when a satellite is present (and also on a Urukul on the satellite)
  • When substituting "base": "standalone", rebuilding and reflashing, the system behaves as expected all the time.

Questions / comments

  • Is this a known, reproducible issue with DRTIO master/satellite configurations?
  • We seem to have an issue with the SYNC tree: with Urukul v1.4 and v1.5, channel 0 is always faulty because the PLL doesn't lock and the SMP_ERR flag is up (according to the CPLD status register). The other channels are fine though. And in standalone mode, the phase relationships between channels 1,2,3 are correct. This seems unrelated to this particular issue though. Is there something I missed on referencing CH0 to the FPGA sync and not using it as master for the SYNC tree?

Your System (omit irrelevant parts)

  • Operating System: Linux / nix
  • ARTIQ version: v7.7636.ea1dd2da.beta
  • Version of the gateware and runtime loaded in the core device: same as software, built using Vivado 2020.1
  • Hardware involved: Kasli 2.0.1, Urukul 1.4
  • Urukul CPLD image: v1.4.0 (build 169624)
@airwoodix
Contributor Author

@dnadlinger @jordens Is this a known issue? or a problem with our particular setup?

@jordens
Member

jordens commented Jun 21, 2021

I think:

  • The phase difference of 180 deg isn't worrying per se, as long as it's deterministic. The latter is the problem.
  • en_eem1 isn't useful here. The signals are not used on Kasli and might cause crosstalk.
  • There is no actual satellite here. But if there were, to use the eeprom seeds, the eeprom read would have to work over drtio aux. Maybe that's a bit unexplored. But shouldn't be relevant here.
  • Could you try using the seed values the calibration routine spits out directly? I.e. pass them in device_db.
  • I have seen that SMP_ERR failure on ch0 as well with factory-flashed devices. I have no idea what was flashed exactly and haven't looked how that's compiled. But the PLL always locks. But this may or may not be a different problem. Could you try the released binaries? https://github.com/quartiq/urukul/releases/tag/v1.4.0
  • I don't see why ch0 would be different in the code. Maybe try a few ms or 100 ms delay between cpld.init() and rf0.init() or swap rf0 and rf1.
  • If this holds up against the tests above, we should look at whether the clocking of the sync pulse generator on kasli matches between standalone and master.
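
A sketch of what passing the seed values directly in device_db could look like. `sync_delay_seed` and `io_update_delay` are arguments of the AD9910 driver class; everything else here (device names, `chip_select`, and the values 18 and 0, which are the ones reported later in this thread) is a placeholder that has to match the actual system.

```python
device_db = {}

# Hypothetical channel entry; names and chip_select must match your own
# generated device_db. Only the last two arguments are the point here.
device_db["urukul0_ch1"] = {
    "type": "local",
    "module": "artiq.coredevice.ad9910",
    "class": "AD9910",
    "arguments": {
        "pll_n": 32,
        "chip_select": 5,
        "cpld_device": "urukul0_cpld",
        "sw_device": "ttl_urukul0_sw1",
        # Literal integers instead of a reference to EEPROM storage:
        "sync_delay_seed": 18,
        "io_update_delay": 0,
    },
}
```

With literal integers the driver does not need to read the seeds from EEPROM at init time, which takes that path out of the picture while debugging.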

@dnadlinger
Collaborator

@airwoodix: We've definitely tested this on DRTIO master builds in the past, but it's been a few years since the initial bring-up.

@HarryMakes
Contributor

Hi @airwoodix and all, have you been able to replicate this issue again with ARTIQ-7 (e.g. 43eab14)?

With ARTIQ-7 and ARTIQ-6 I have tried your code on a DRTIO master setup with a single Urukul card, but I have never seen absurd variance in terms of phase difference across power cycles and reboots. The worst STD I can get on the DRTIO master is ~0.01 rad. I also compared the phase with a standalone setup on the same hardware, and I can't see significant difference in the results.

On the other hand, I noted that one of @airwoodix's observations was that the red LED for CH0 turned on when sync'ed. I confirm that this can be reproduced, and I made the following observations (NB the criterion for the channel fault indicator is SYNC_SMP_ERR | ~PLL_LOCK):

  1. Given calibrated sync, as long as set_att() is not called on any channel, DDS0 SYNC_SMP_ERR stays 0, and DDS0 PLL_LOCK stays 1.
  2. Given calibrated sync, if set_att() is called on any channel, the first read of the CPLD status register (using urukul.CPLD.sta_read()) right after urukul.CPLD.init() always returns a value where DDS0 PLL_LOCK is 0, but every subsequent read returns DDS0 PLL_LOCK = 1. PLL_LOCK = 1 for DDS1 to DDS3.

I suspect there might be some issue doing SPI transactions within a Urukul card. I could look into that, but that will not be related to this very issue about Urukul AD9910 sync phase errors.
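
The indicator logic quoted above can be sketched in plain Python. The function below is hypothetical; it assumes the per-channel PLL_LOCK and SMP_ERR fields have already been extracted from the status word as 4-bit masks (bit 0 = channel 0), for example with `urukul_sta_pll_lock()` and `urukul_sta_smp_err()`.

```python
def channel_faults(pll_lock, smp_err, n_channels=4):
    """Per-channel fault mask: SMP_ERR set or PLL not locked."""
    mask = (1 << n_channels) - 1
    return (smp_err | ~pll_lock) & mask


# Channel 0 unlocked and flagging a sample error, channels 1-3 fine.
faults = channel_faults(pll_lock=0b1110, smp_err=0b0001)
print(f"{faults:04b}")  # 0001
```

A channel that is locked but has SMP_ERR set shows the same fault bit, so the LED alone cannot distinguish the two cases described in the list above.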

@systemofapwne

Hello @HarryMakes, thanks for your feedback.
This issue is still open in our group. It has been a low priority over the past months, so I have not worked on it much, but it sounds like a good Debugging-Friday job ;)

@HarryMakes

This comment has been minimized.

@HarryMakes
Contributor

Correction: I have not found issues with writing and reading the sync_delay_seed and io_update_delay values. There was a mistake in my test code as I forgot to call init() on the channels to read from EEPROM before printing the values.

Therefore, my current findings show that this Urukul sync issue cannot be replicated on DRTIO master on ARTIQ-7 or ARTIQ-6.

@sbourdeauducq
Member

sbourdeauducq commented Dec 13, 2021

> With ARTIQ-7 and ARTIQ-6 I have tried your code on a DRTIO master setup with a single Urukul card, but I have never seen absurd variance in terms of phase difference across power cycles and reboots.

Which channels did you test? It seems there is an issue with ch0 which does not have deterministic phase wrt the others (also on non-DRTIO systems). @Spaqin

@HarryMakes
Contributor

HarryMakes commented Dec 14, 2021

> > With ARTIQ-7 and ARTIQ-6 I have tried your code on a DRTIO master setup with a single Urukul card, but I have never seen absurd variance in terms of phase difference across power cycles and reboots.
>
> Which channels did you test? It seems there is an issue with ch0 which does not have deterministic phase wrt the others (also on non-DRTIO systems). @Spaqin

I did test the phase stability involving Ch0 of a single Urukul AD9910 card (e.g. between Ch0 and Ch1, between Ch0 and Ch2), which is configured on a Kasli 2.0 DRTIO master.

In my previous comment, I mentioned the worst standard deviation (std) I have seen was ~0.01 rad, but this applied only to @airwoodix's original code where the two Urukul channels (one of which is Ch0) are offset by 180°. I also separately tested with 0° offset involving Ch0, and there the std is significantly lower at ~0.001 rad.
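
To put these numbers in units of time, here is a small conversion sketch. It assumes the 100 MHz output of the original experiment; the helper name is hypothetical.

```python
import math


def phase_to_time(phase_rad, frequency_hz):
    """Time offset in seconds equivalent to phase_rad at frequency_hz."""
    return phase_rad / (2 * math.pi * frequency_hz)


for std_rad in (0.01, 0.001):
    t_ps = phase_to_time(std_rad, 100e6) * 1e12
    print(f"{std_rad} rad at 100 MHz -> {t_ps:.1f} ps")
```

At 100 MHz, 0.01 rad corresponds to roughly 16 ps and 0.001 rad to roughly 1.6 ps.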

@Spaqin
Collaborator

Spaqin commented Dec 14, 2021

I have redone the tests today, and I could not reproduce any issues with phase. Tested it with a system that has 3 AD9910s and a Kasli 2.0; on both ARTIQ 6 and ARTIQ 7. With the system configured both as DRTIO Master, and as Standalone.

Once calibrated, using the simple experiment code from OP (and even modifying it to include more channels across all three cards), I got very consistent results, even after power cycling. To be exact: two cards had very close phases, third one was using longer cables and thus had a significant phase shift, but still consistent across the reboots.

Which is weird - I could've sworn we couldn't get consistent results last week with ch0... All I did was tighten the screws?

Also, red light on ch0 is more like a red herring. It pops up after set_mu function is called. It causes a SYNC_SMP_ERROR - PLL_LOCK is at 1 at all times. Modifying the AD9910 code to clear smp flag after that causes the light to go green again and no other issues pop up. However, the code is the same for all channels. I could not figure out why only first channel was affected. It may be a bug with communication with the CPLD as @HarryMakes mentioned - but other than being slightly misleading, the red LED does not change anything in the behavior.

@Spaqin
Collaborator

Spaqin commented Dec 15, 2021

Previously I was worried that the calibration data was lost or incorrect and on multiple power cycles it would cause a shift.
The phase shift between power cycles, on both cold and hot devices, is constant. But that's the key word here - between power cycles. So, with the same system as previously (3 Urukul AD9910 cards, Kasli 2.0), on ARTIQ 6, when I do the following, I get consistent results:

  1. Power on the system
  2. Run the test code (no calibration)
  3. Verify phase shift
  4. Power off the system
  5. Repeat 1-4

Again, phase differences seem to be identical every time. That is true for both DRTIO Master and Standalone configurations.

However, I managed to reproduce the issue - it pops up within a power cycle, e.g.

  1. Power on the system
  2. Run the test code (no calibration)
  3. Verify phase shift
  4. Repeat 2-3 ...

The phase shift may differ from run to run, but only on DRTIO Masters - Standalone systems seem not to be affected. Running either cpld init or channel init more than once within one powerup may change the phase shift. Still not sure why exactly, but I can reproduce the issue more reliably.

@airwoodix
Contributor Author

Thanks @Spaqin and @HarryMakes for the investigation! Sorry for the very, very delayed feedback.

I got back into this because of issues observed on standalone systems with Urukul v1.5 and v1.5.1 (ARTIQ v7.0.b02abc2.beta, Kasli v1.1). The observation is the same as that of @Spaqin (changes in the intra-card phase offset within one power cycle) except that it systematically fails to lock the urukul2_ch0 (v1.5.1) PLL:

urukul1: lock=11, smp_err=00, sync_sel=0, proto_rev=8
urukul2: lock=10, smp_err=01, sync_sel=0, proto_rev=8

Experiment code (urukul1 is v1.3, urukul2 is v1.5.1):

from artiq.experiment import *
from artiq.coredevice import ad9910, urukul


class TwoPulses(EnvExperiment):
    def build(self):
        self.setattr_device("core")

        self.ddses = [
            # urukul1 is v1.3
            self.get_device("urukul1_ch0"),
            self.get_device("urukul1_ch1"),
            # urukul 2 is v1.5
            self.get_device("urukul2_ch0"),
            self.get_device("urukul2_ch1"),
        ]

        self.trigger = self.get_device("ttl4")

        self.urukul_status = [0, 0]

    @kernel
    def run(self):
        self.core.reset()

        for dds in self.ddses:
            dds.cpld.init()
            dds.init()
            dds.set_att(10.0 * dB)

            dds.set(180.0 * MHz, phase_mode=ad9910.PHASE_MODE_TRACKING)

        self.trigger.pulse(1.0 * us)
        t_pulse = now_mu()

        for dds in self.ddses:
            at_mu(t_pulse)
            dds.sw.pulse(10.0 * us)

        self.store_urukul_status()

    @kernel
    def store_urukul_status(self):
        for i in range(2):
            cpld = self.ddses[2 * i].cpld
            self.urukul_status[i] = cpld.sta_read()
            delay(10 * us)

    def analyze(self):
        for i, sta in enumerate(self.urukul_status):
            lock = urukul.urukul_sta_pll_lock(sta) & 3
            smp_err = urukul.urukul_sta_smp_err(sta) & 3
            proto_rev = urukul.urukul_sta_proto_rev(sta)
            sync_sel = (self.ddses[i * 2].cpld.cfg_reg >> urukul.CFG_SYNC_SEL) & 1
            print(f"urukul{i+1}: {lock=:02b}, {smp_err=:02b}, {sync_sel=}, {proto_rev=}")

Relevant part of device_db (same for urukul2_cpld):

device_db["urukul1_cpld"] = {
    "type": "local",
    "module": "artiq.coredevice.urukul",
    "class": "CPLD",
    "arguments": {
        "spi_device": "spi_urukul1",
        "sync_device": "ttl_urukul1_sync",
        "io_update_device": "ttl_urukul1_io_update",
        "refclk": 125000000.0,
        "clk_sel": 2
    }
}

With same-length cables, the scope picture looks like:

image

The phase offset between urukul1_ch0 (CH1) / urukul1_ch1 (CH2) and urukul2_ch0 (CH3) is always the same over repeated calls to artiq_run (no power cycle, no reboot) but not that between urukul2_ch0 (CH3) and urukul2_ch1 (CH4) which oscillates with unstable period between the previous configuration and this one:

image

The non-deterministic phase between channels (intra- and inter-Urukul) is very problematic.

  • Could it be a hardware issue @gkasprow ? I have observed this on at least three different Urukul v1.4 and v1.5 boards.
  • Could it be an issue with the CPLD gateware @jordens ? The v1.4 boards in the first post were flashed manually, those in this message by Creotech.
  • Has this been reported by other users?

Thanks!

@jordens
Member

jordens commented Apr 5, 2022

> Could it be an issue with the CPLD gateware @jordens ? The v1.4 boards in the first post were flashed manually, those in this message by Creotech.

Seems unlikely but it's certainly not impossible.

@marmeladapk
Contributor

marmeladapk commented Jun 6, 2023

I did a few more tests with Urukul v1.5.4 and Kasli standalone.

  • It's always the 0th channel that has a phase difference to the others; channels 1-3 were always synced in my tests
  • 0th channel delay to other channels is a multiple of some value, dependent on output frequency:
    • n*4 ns for 25, 31.25, 62.5, 125, 250 MHz
    • n*1 ns for 100, 200 MHz
    • no delay for 250 MHz
    • for odd frequencies like 127 or 255 MHz the delay also jumps by some multiple, for example for 255 MHz this multiple was in the neighbourhood of 200 ps
  • as others said, 0th channel reports no pll lock and smp error
  • Now the best part - if I reverse the order in which channels are set then they are all in sync, always. This was tested across many power cycles, output frequencies and with iterations without power cycles. Pll lock and smp error remain on 0th channel (also after power cycling)
  • Order of initialisation does not matter
  • adding ~30 us delay in the loop which sets outputs causes pll lock status to be high on 0th channel
from artiq.experiment import *
from artiq.coredevice import ad9910, urukul

class SystemExample(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.urukul_cpld = [self.get_device("urukul{}_cpld".format(i)) for i in range(1)]
        self.urukul_9910 = [self.get_device("urukul{}_ch{}".format(i // 4, i % 4)) for i in range(4)]

        self.urukul_status = [0]

    @kernel
    def run(self):
        self.core.break_realtime()
        self.core.reset()
        self.init()
        self.core.break_realtime()

        t = now_mu()
        for i in range(4):
        # for i in range(3, -1, -1):
            self.urukul_set_output(self.urukul_9910[i], 31.25 * MHz, 2.0, t)

        self.store_urukul_status()

    @kernel
    def init(self):
        for urukul_cpld in self.urukul_cpld:
            urukul_cpld.init()
        for i in range(4):
        # for i in range(3, -1, -1):
            channel = self.urukul_9910[i]
            channel.init()
            channel.sw.off()
            channel.set_phase_mode(1)
        self.core.break_realtime()

    @kernel
    def urukul_set_output(self, channel, freq, attenuation, t):
        channel.set_att(attenuation)
        channel.sw.on()
        channel.set(freq, ref_time_mu=t)

    @kernel
    def store_urukul_status(self):
        for i in range(1):
            cpld = self.urukul_cpld[i]
            self.urukul_status[i] = cpld.sta_read()
            delay(10 * us)

    def analyze(self):
        for i, sta in enumerate(self.urukul_status):
            lock = urukul.urukul_sta_pll_lock(sta)
            smp_err = urukul.urukul_sta_smp_err(sta)
            proto_rev = urukul.urukul_sta_proto_rev(sta)
            sync_sel = (self.urukul_cpld[0].cfg_reg >> urukul.CFG_SYNC_SEL) & 1
            print(f"urukul{i + 1}: {lock=:04b}, {smp_err=:04b}, {sync_sel=}, {proto_rev=}")

@marmeladapk
Contributor

marmeladapk commented Jun 6, 2023

Passing values from tune_sync_delay and tune_io_update_delay to device_db did not change anything.

@jordens
Member

jordens commented Jun 6, 2023

Off the top of my head, places where I'd look:

  1. Check whether the sync tuning and io_update tuning are stable. What's the value it finds? Is that stable? Do you write it to the eeprom, i.e. does it use the last value as the seed for next time?
  2. Which ARTIQ version is this? Which gateware on Urukul?
  3. Check io_update and especially SYNC SI through the chain from Kasli through the CPLD (io_update)/the fanout (SYNC) to the DDS. Signal levels safe? Terminations ok?
  4. May well also be just some DDS state machinery sequencing issue that we unintentionally or unknowingly violate. The order-dependence hints at that. Add delays in init() and its constituents.
  5. Reduce the code (also in the coredevice driver) to just reproduce the PLL lock failure or the SYNC failure on one channel. Minimize the amount of commands. Get an analyzer dump.

@marmeladapk
Contributor

  1. Yes, they are. Values are: sync delay: 18 on all, io_update 0 on all. I don't write them to EEPROM, I set them in device_db. These values seem stable.
  2. artiq-7, 3a9213d, urukul 1.4.0 cpld code
  3. Pending.
  4. Adding delays between inits does not change anything. So right now the status register shows that the pll is locked, but the smp error still remains. It can be cleared, but any set on this channel will set it again. I'll add delays in the cpld init and channel init itself.
  5. adding ~30 us delay in the loop which sets outputs causes pll lock status to be high on 0th channel. Other points are pending.

@jordens changed the title from "Urukul sync issues with DRTIO" to "Urukul sync issues (SMP_ERR)" on Jun 15, 2023
@jordens
Member

jordens commented Jun 15, 2023

@marmeladapk
Contributor

marmeladapk commented Jun 15, 2023

I found out that setting attenuators is what affects the phase of the output signal on channel 0. It's also correlated with data that is sent to attenuators: 0xffffffff very rarely triggers the glitch, 0xaaaaaaaa does it almost every time.
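
One way to see why the data pattern might matter: the words differ in how many edges they put on the serial data line. The sketch below counts bit-to-bit transitions in a 32-bit word. It only compares the patterns mentioned in this comment and in the test code, and is not a model of the coupling mechanism; the function name is hypothetical.

```python
def bit_transitions(word, width=32):
    """Count adjacent-bit changes in word, i.e. edges on a serial data line."""
    # XOR with the shifted word marks every position where neighbours differ;
    # the top bit has no neighbour above it, so it is masked out.
    mask = (1 << (width - 1)) - 1
    return bin((word ^ (word >> 1)) & mask).count("1")


for pattern in (0xFFFFFFFF, 0xF0F0F0F0, 0xAAAAAAAA):
    print(f"{pattern:#010x}: {bit_transitions(pattern)} transitions")
```

The all-ones word has no transitions at all, while the alternating word toggles on every bit (31 transitions).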

SPI bus on attenuators changed between 1.3 and 1.4, now there are dedicated buses to each attenuator, however it seems that they are driven all at once. These signals are routed in parallel over long distances, however they are not in direct vicinity of sync signals.

Also it seems that spi signals do not affect sync in clk on channel 0 - I had a differential probe on sync in clk and I didn't notice any glitches on it, despite jumping phase of the output signal. Same for the DDS CLK.

I also noticed that DDS SPI bus has the same clock and data as attenuators and I thought that maybe activity on the digital side of the DDS was causing some glitches. But when I set 3rd DDS channel in loop instead of attenuators there are no glitches in phase.

This suggests crosstalk issues, but I still have to pinpoint where exactly this is happening.

from artiq.experiment import *
from artiq.coredevice import ad9910, urukul, spi2

class SystemExample(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.urukul_cpld = [self.get_device("urukul{}_cpld".format(i)) for i in range(1)]
        self.urukul_9910 = [self.get_device("urukul{}_ch{}".format(i // 4, i % 4)) for i in range(4)]
        self.urukul_status = [0]

    @kernel
    def run(self):
        self.core.break_realtime()
        self.core.reset()
        for i in range(4):
            self.urukul_9910[i].set_att(4.0)
        delay(10*ms)
        self.init()
        self.core.break_realtime()

        t = now_mu()
        for i in range(4):
            self.urukul_9910[i].set(31.25 * MHz, ref_time_mu=t)
            delay(300 * us)
        # No glitch up to this point
        delay(2*s)
        self.urukul_cpld[0].bus.set_config_mu(urukul.SPI_CONFIG | spi2.SPI_END, 32,
                                              urukul.SPIT_ATT_WR, urukul.CS_ATT)
        for i in range(10000):
            # Glitch if writing to attenuators, no glitch if writing to DDS
            self.urukul_cpld[0].bus.write(0xf0f0f0f0)
            # self.urukul_9910[3].set(31.25 * MHz, ref_time_mu=t)
            delay(1 * ms)

        self.store_urukul_status()

    @kernel
    def init(self):
        for urukul_cpld in self.urukul_cpld:
            urukul_cpld.init()
        for i in range(4):
            channel = self.urukul_9910[i]
            channel.init()
            channel.sw.on()
            channel.set_phase_mode(1)
        self.core.break_realtime()
        delay(10*ms)

    @kernel
    def store_urukul_status(self):
        for i in range(1):
            cpld = self.urukul_cpld[i]
            self.urukul_status[i] = cpld.sta_read()
            delay(10 * us)

    def analyze(self):
        for i, sta in enumerate(self.urukul_status):
            lock = urukul.urukul_sta_pll_lock(sta)
            smp_err = urukul.urukul_sta_smp_err(sta)
            proto_rev = urukul.urukul_sta_proto_rev(sta)
            sync_sel = (self.urukul_cpld[0].cfg_reg >> urukul.CFG_SYNC_SEL) & 1
            print(f"urukul{i + 1}: {lock=:04b}, {smp_err=:04b}, {sync_sel=}, {proto_rev=}")

@marmeladapk
Contributor

I modified CPLD code to single out signals responsible for this issue. I disconnected miso from attenuators and drove them from cpld one by one. It seems that problematic signals are:

  • att s out 0
  • att s in 1
  • att s out 1
  • att s in 2
  • att s out 2
  • att le 2

Of those, att s in 2 is the one that causes phase shifts. The other signals only cause SMP err to go high, but I didn't notice any changes in phase. I tried setting a slow slew rate (and verified that it slows down the edges) but it didn't help. One thing that all these signals have in common is that they all run parallel to SMP err for a few millimeters at some point. However, I couldn't observe any crosstalk on this signal.

Everything seems to point to crosstalk issues, but I still can't pinpoint it. I also don't know why att s in 2 is the one that affects dds0 the most.

@dnadlinger
Collaborator

dnadlinger commented Jun 19, 2023

This all seems a bit strange. If I understand correctly, from the single-signal tests, it really is the signals to the attenuators that cause the issues, not some power/RF glitch from the attenuators switching?

@marmeladapk
Contributor

It seems so, because if I disable all of the above signals then attenuator 0 still switches (since its mosi, clk and le are active) and I don't observe any problems on channel 0.

@marmeladapk
Contributor

It seems that I found the culprit, or at least some part of it. All of the aforementioned 6 lines run under the pll filter of dds0. I was able to induce smp error and phase glitches on other channels by probing their filters with an oscilloscope probe (8 pF).

I lifted loop filter pin from the pad and assembled the filter in air, connecting it to capacitor nearby, so electrically it's the same connection. By lifting the pin I should avoid pickup from antenna that is now formed by floating pads.

It helped to some extent. Now att s in 2 and att le 2 no longer cause smp error or phase glitches. This means that there shouldn't be any phase glitches but smp error will still remain. However, s out 0, s in 1, s out 1 and s out 2 still cause smp error to occur.

I'm fairly certain that fixing the stackup, rerouting these signals to run somewhere else and giving them a proper, continuous reference plane (sinara-hw/Urukul#72) to couple into will help. I think it's safe to close this issue, or transfer it to Urukul repository (@jordens, do you have such powers?).

@jordens
Member

jordens commented Jun 20, 2023

Doesn't look like it. Either I lack the perms on Urukul or it can't cross orgs.

@jordens
Member

jordens commented Jun 20, 2023

I was about to point out that the SMP errors can be caused indirectly by PLL glitches/transients. The P1V8_SYSCLK net is also a sensitive node here.

@marmeladapk
Contributor

I think this issue can be closed.

@jordens jordens closed this as completed Jun 20, 2023
@nkrackow
Contributor

nkrackow commented Mar 21, 2024

We (and I think quite a few other people) still saw this issue over a large range of hardware circumstances:

  • different Kasli versions (several 2.0.2 and Kasli-SoCs)
  • different Urukul versions (several 1.3, 1.5, 1.5.4, 1.5.6)
  • often seemed stable one day, not stable the next etc.
  • almost never stable for the first ~30 seconds after a power cycle

The root of the problem is the way the "eye" center find algorithm in tune_sync_delay() in init() works:
It progressively widens the setup/hold violation detection window in the AD9910 and checks for a sync_delay that still doesn't yield a violation error signal. While it does start the search around a seed that deterministically (across kernel restarts and power cycles) lies in the same "eye", the following condition can sporadically occur:

0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
1 [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
2 [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
3 [1, 1, 1, 1, 1, 0*, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
4 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
5 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

0 [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
1 [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1]
2 [1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
3 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0*, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
4 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
5 [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

This is the output of a sync_delay scan with progressively larger windows. The first integer is the window size and the following list is the output of the s/h violation detection circuit. The dynamics of the jitter and detection circuit are such that in very rare cases there is a valid delay for a wider window on the other side of the edge and the algorithm will happily jump over. I have marked the problematic 0s with a *. If the seed were, e.g., delay tap 5 in both cases, we would jump over the edge in the second case and land at tap 16.

This is why you can get the phase jumps without ever landing in a problematic region or seeing the s/h error LED come up. Note that just init(blind=True) also yields deterministic phases.
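
The jump can be reproduced offline with a simplified model of the search described above. This is a sketch, not the ARTIQ driver code: `find_delay` and the data layout are assumptions made for illustration, the rows are transcribed from the two scans above, and taps are counted as 0-based list indices (so the marked entries sit at indices 5 and 15).

```python
def row(bits):
    """Turn a string such as "1001" into a list of ints."""
    return [int(c) for c in bits]


# scan[window][tap] == 1 means a setup/hold violation was flagged.
ROW0 = "10000000011000000000110000000001"
SCAN_A = [
    row(ROW0),
    row("11000000111110000001111100000011"),
    row("11100001111111000011111110001111"),
    row("1" * 5 + "0" + "1" * 26),
    row("1" * 32),
    row("1" * 32),
]
SCAN_B = [
    row(ROW0),
    row("11000000111100000001111100000011"),
    row("11100011111111000011111110001111"),
    row("1" * 15 + "0" + "1" * 16),
    row("1" * 32),
    row("1" * 32),
]


def find_delay(scan, seed):
    """Widen the window step by step, re-seeding on the nearest clean tap."""
    n_taps = len(scan[0])
    for window, flags in enumerate(scan):
        found = -1
        # Visit taps in order of increasing distance from the current seed.
        for dist in range(n_taps):
            for tap in (seed - dist, seed + dist):
                if 0 <= tap < n_taps and flags[tap] == 0:
                    found = tap
                    break
            if found >= 0:
                break
        if found < 0:
            if window == 0:
                raise ValueError("no valid window/delay")
            # No clean tap left: keep the last delay that worked.
            return seed, window - 1
        seed = found
    return seed, len(scan) - 1


print(find_delay(SCAN_A, seed=5))  # (5, 3)
print(find_delay(SCAN_B, seed=5))  # (15, 3)
```

With the same seed, the first scan stays at index 5 while the second ends at index 15, on the far side of the edge, because the nearest-clean-tap search has no limit on how far it may move between windows.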

@gkasprow
Collaborator

Sync issue caused by trace crosstalk was fixed in 1.5.5

@gkasprow
Collaborator

sinara-hw/Urukul#73

@nkrackow
Contributor

Yes, but this is an independent issue. I also saw it (and verified that init(blind=True) or a tweaked algorithm fixes it) on 1.5.6. Sorry, initially had the wrong rev in the comment above.
