Kasli intermittent SDRAM failure #1149

Closed
sbourdeauducq opened this issue Sep 9, 2018 · 17 comments

@sbourdeauducq
Member

Sometimes happens on one of the boards:

Bootloader CRC passed
Gateware ident 4.0.dev+1405.gdf61b859.dirty;satellite
Initializing SDRAM...
Read leveling scan:
Module 1:
00000000000000000000111111000000
Module 0:
00000000000000000000111111000000
Read leveling: 22+-3 Small window: 0: 21-25 (5)
SDRAM initialization failed
Halting.
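
A minimal sketch of how a scan line like the ones above can be reduced to a window, assuming each character is one delay tap and `1` marks a tap at which the read-back test passed. The function name `longest_window` is made up for illustration and is not part of the ARTIQ firmware.

```rust
/// Returns the inclusive (first, last) tap indices of the longest run of
/// passing taps ('1') in a scan string, or None if no tap passed.
/// On ties, the earliest run is kept.
fn longest_window(scan: &str) -> Option<(usize, usize)> {
    let mut best: Option<(usize, usize)> = None;
    let mut start: Option<usize> = None;
    // A sentinel '0' is appended so that a run reaching the last tap is closed too.
    for (i, c) in scan.chars().chain(std::iter::once('0')).enumerate() {
        match (c, start) {
            // A passing tap opens a new run.
            ('1', None) => start = Some(i),
            // A passing tap inside a run: nothing to do.
            ('1', Some(_)) => {}
            // A failing tap closes the current run, which spans s..=i-1.
            (_, Some(s)) => {
                let longer = best.map_or(true, |(b0, b1)| i - s > b1 - b0 + 1);
                if longer {
                    best = Some((s, i - 1));
                }
                start = None;
            }
            // A failing tap outside a run: nothing to do.
            (_, None) => {}
        }
    }
    best
}
```

For the scan above (20 zeroes, 6 ones, 6 zeroes) this returns `Some((20, 25))`, i.e. six passing taps, which is consistent with the reported 22+-3.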
@jordens
Member

jordens commented Sep 9, 2018

I think I have seen the window reduce from about +-5 to about +-4 to systematically +-3 around the time of each of the two frequency changes. I didn't narrow it down.

@sbourdeauducq
Member Author

How many affected boards do you have? I've only seen it on one so far.

@cjbe
Contributor

cjbe commented Sep 9, 2018

I see this on 4 out of 4 boards I tested (2x hw-rev 1.1, 2x hw-rev 1.0). On all boards this error occurred on every reboot. Moving back to a build just before the CPU frequency changes (i.e. just before ea71a04) this problem is not present.
(FWIW all my Kasli are running rather warm)

@jordens
Member

jordens commented Sep 9, 2018

All boards I have access to (~6) currently have window size 6. All are cooled.

@cjbe
Contributor

cjbe commented Sep 9, 2018

@jordens interesting - do you know what temperature? Mine are around 85 deg. C.

On all the Kasli I tried, if I flash pre ea71a04 gateware I have a window of 8, whereas with the new gateware I have 5.

@jordens
Member

jordens commented Sep 10, 2018

Around 70 °C

@enjoy-digital
Contributor

Please find my results and patches for this here: https://github.com/enjoy-digital/kasli_sdram

I did a first test with a simple LiteX design and a MiSoC design. The LiteX design worked fine while the MiSoC design did not (the same behaviour as with ARTIQ).

Since LiteX/LiteDRAM introduced many changes around the DRAM PHY that could also benefit MiSoC, I ported the code to MiSoC and verified that it works correctly.

Before (113MHz):

MiSoC BIOS
(c) Copyright 2007-2017 M-Labs Limited
Built Nov 13 2018 11:20:51

BIOS CRC passed (08398e93)
Initializing SDRAM...
Read delays: 1:20-26  0:21-26  completed
Memtest OK
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
Timeout
Booting from flash...
Error: Invalid flash boot image length 0x98000000
No boot medium found

After (113MHz):

MiSoC BIOS
(c) Copyright 2007-2017 M-Labs Limited
Built Nov 14 2018 18:16:33

BIOS CRC passed (ba103d9f)
Initializing SDRAM...
SDRAM now under software control
Read leveling:
m0, b0: |11111111110000000000000000000000| delays: 05+-05
m0, b1: |00000000000011111111111000000000| delays: 18+-06
m0, b2: |00000000000000000000000000011111| delays: 29+-02
m0, b3: |00000000000000000000000000000000| delays: 32+-00
m0, b4: |00000000000000000000000000000000| delays: 32+-00
m0, b5: |00000000000000000000000000000000| delays: 32+-00
m0, b6: |00000000000000000000000000000000| delays: 32+-00
m0, b7: |00000000000000000000000000000000| delays: 32+-00
best: m0, b1 delays: 17+-05
m1, b0: |11111111100000000000000000000000| delays: 04+-04
m1, b1: |00000000000111111111110000000000| delays: 17+-06
m1, b2: |00000000000000000000000000111111| delays: 29+-03
m1, b3: |00000000000000000000000000000000| delays: 32+-00
m1, b4: |00000000000000000000000000000000| delays: 32+-00
m1, b5: |00000000000000000000000000000000| delays: 32+-00
m1, b6: |00000000000000000000000000000000| delays: 32+-00
m1, b7: |00000000000000000000000000000000| delays: 32+-00
best: m1, b1 delays: 17+-05
SDRAM now under hardware control
Memtest OK
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
Timeout
No boot medium found

And a test at 125 MHz, to be sure you won't have trouble later:

MiSoC BIOS
(c) Copyright 2007-2017 M-Labs Limited
Built Nov 14 2018 18:49:04

BIOS CRC passed (c41e6528)
Initializing SDRAM...
SDRAM now under software control
Read leveling:
m0, b0: |11111100000000000000000000000000| delays: 03+-03
m0, b1: |00000001111111111100000000000000| delays: 12+-05
m0, b2: |00000000000000000000011111111100| delays: 25+-04
m0, b3: |00000000000000000000000000000000| delays: 32+-00
m0, b4: |00000000000000000000000000000000| delays: 32+-00
m0, b5: |00000000000000000000000000000000| delays: 32+-00
m0, b6: |00000000000000000000000000000000| delays: 32+-00
m0, b7: |00000000000000000000000000000000| delays: 32+-00
best: m0, b1 delays: 12+-05
m1, b0: |11111100000000000000000000000000| delays: 03+-03
m1, b1: |00000000111111111100000000000000| delays: 12+-05
m1, b2: |00000000000000000000011111111000| delays: 25+-04
m1, b3: |00000000000000000000000000000000| delays: 32+-00
m1, b4: |00000000000000000000000000000000| delays: 32+-00
m1, b5: |00000000000000000000000000000000| delays: 32+-00
m1, b6: |00000000000000000000000000000000| delays: 32+-00
m1, b7: |00000000000000000000000000000000| delays: 32+-00
best: m1, b1 delays: 12+-05
SDRAM now under hardware control
Memtest OK
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
Timeout
No boot medium found
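
A rough sketch of the per-bitslip selection seen in the two logs above, assuming the `best:` line reports the bitslip whose scan has the widest run of passing taps. The actual selection rule used by the BIOS is not shown in this thread, and the names `window_len` and `best_bitslip` are hypothetical. The sketch does not try to reproduce the printed `delays:` values.

```rust
/// Number of taps in the longest contiguous run of '1' in a scan string
/// (the string is assumed to contain only '0' and '1').
fn window_len(scan: &str) -> usize {
    scan.split('0').map(|run| run.len()).max().unwrap_or(0)
}

/// Index of the bitslip whose scan has the widest window.
/// Returns None if no scan has any passing tap; the first bitslip wins on ties.
fn best_bitslip(scans: &[&str]) -> Option<usize> {
    let mut best: Option<(usize, usize)> = None; // (bitslip index, window width)
    for (b, scan) in scans.iter().enumerate() {
        let w = window_len(scan);
        if w > 0 && best.map_or(true, |(_, bw)| w > bw) {
            best = Some((b, w));
        }
    }
    best.map(|(b, _)| b)
}
```

Fed the m0 scans from the 113 MHz log (windows of 10, 11 and 5 taps for b0, b1 and b2, none for the rest), `best_bitslip` picks index 1, matching the `best: m0, b1` line.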

This still needs some adaptation of the Rust code for ARTIQ, which I'd prefer someone else to do (it should not be complicated, and it would allow a review of the calibration). It may not be perfect for you, but I won't be able to allocate more time to this now. I think it gives you something that works and that you can improve.

@sbourdeauducq
Member Author

@enjoy-digital thanks, do you have an idea of what exactly is causing this behavior?

@enjoy-digital
Contributor

@sbourdeauducq: no, but I'll spend an hour this morning going in the other direction (from the patched version that works back to the one that does not) to try to find it, if you want a minimal patch for now.
Later on, the big patch could still be interesting since it improves or fixes a number of other points.

@sbourdeauducq
Member Author

I'm mostly interested in understanding what caused such strange behavior. It is pretty puzzling that the read leveling window becomes smaller when the frequency decreases, and I want to avoid similar issues in the future.

@enjoy-digital
Contributor

@sbourdeauducq: I'm on it.

@enjoy-digital
Contributor

enjoy-digital commented Nov 15, 2018

The minimal misoc patch is here: https://github.com/enjoy-digital/kasli_sdram/blob/master/0001-cores-sdram_phy-a7ddrphy-change-rdphase-rdcmdphase-r.patch

It gives the following results:

MiSoC BIOS
(c) Copyright 2007-2017 M-Labs Limited
Built Nov 15 2018 09:17:28

BIOS CRC passed (a2fad1f4)
Initializing SDRAM...
Read delays: 1:00-09  0:00-10  completed
Memtest OK
Booting from serial...
Press Q or ESC to abort boot completely.
sL5DdSMmkekro
Timeout
No boot medium found

It changes rdcmdphase/rdphase to better values and also removes the initialization of IDELAY_VALUE to 6, which I think is not needed (it is better to let the software manage rdly itself) and which can create corner cases that the software would have to handle.

For example, with IDELAY_VALUE=6 we would have the following result:

m0, b0: |11110000000000000000000000111111| delays: 02+-02

With IDELAY_VALUE=0, we have:

m0, b0: |11111111110000000000000000000000| delays: 05+-05

This is easier to manage from the software. Now, if you look at your previous results:

Read leveling scan:
Module 1:
00000000000000000000111111000000
Module 0:
00000000000000000000111111000000

The last 6 zeroes are related to this IDELAY_VALUE, which was giving the following "real" scan:

m0, b2: |00000000000000000000000000111111| delays: 29+-02

We were just not in the optimal read window (not enough taps).
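
A toy model of the effect described above, under the assumption that the delay counter starts at IDELAY_VALUE and wraps around after the last tap, so that software scan index `i` corresponds to hardware tap `(i + IDELAY_VALUE) mod 32`. This is one reading of the example lines, not code taken from MiSoC or ARTIQ, and both function names are hypothetical.

```rust
/// Scan as the software would see it when it starts counting from hardware
/// tap `initial` instead of tap 0, with the tap counter wrapping around.
fn software_view(hardware: &str, initial: usize) -> String {
    let bits: Vec<char> = hardware.chars().collect();
    let n = bits.len();
    // Software index i lands on hardware tap (i + initial) mod n.
    (0..n).map(|i| bits[(i + initial) % n]).collect()
}

/// Inverse mapping: recover the scan indexed by hardware tap from a scan
/// that the software recorded starting at hardware tap `initial`.
fn hardware_view(software: &str, initial: usize) -> String {
    let bits: Vec<char> = software.chars().collect();
    let n = bits.len();
    // Hardware tap t was visited at software index (t - initial) mod n.
    (0..n).map(|t| bits[(t + n - initial % n) % n]).collect()
}
```

With `initial = 6`, `software_view` turns the IDELAY_VALUE=0 line (10 ones followed by 22 zeroes) into the split window quoted for IDELAY_VALUE=6 (4 ones, 22 zeroes, 6 ones), and `hardware_view` shifts the original failing scan (ones at indices 20 to 25) to taps 26 to 31, the "real" scan pressed against the last tap.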

All the results of the tests I did are here: https://github.com/enjoy-digital/kasli_sdram/blob/master/README
and all tests are committed so that you can also have a look if you want.

@enjoy-digital
Contributor

Note: on Kasli, you could also set the bitslip to 1 in software to be sure of being in the best read window (one with leading and trailing zeroes):

m0, b1: |00000000000111111111111100000000| delays: 17+-06
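
The "leading and trailing zeroes" criterion can be sketched as a small check, assuming the scan string contains only `0` and `1`. The name `window_is_bounded` is hypothetical and not part of the firmware.

```rust
/// True if the scan contains exactly one run of passing taps ('1') and that
/// run has at least one failing tap ('0') on each side, i.e. both edges of
/// the read window are visible within the available taps.
fn window_is_bounded(scan: &str) -> bool {
    let runs = scan.split('0').filter(|r| !r.is_empty()).count();
    runs == 1 && scan.starts_with('0') && scan.ends_with('0')
}
```

The line above (11 zeroes, 13 ones, 8 zeroes) passes this check, while a window that starts at tap 0, ends at tap 31, or is split across the wrap-around does not.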

@sbourdeauducq
Member Author

Ah okay, I did suspect the delay maxing out at some point, but I thought that the reset was setting it to 0, not the gateware IDELAY_VALUE...
Thanks!

@sbourdeauducq
Member Author

m0, b1: |00000000000111111111111100000000| delays: 17+-06

@enjoy-digital How do you get this result? I added this but it had no effect.

    for n in 0..DQS_SIGNAL_COUNT {
        ddrphy::dly_sel_write(1 << n);
        for _ in 0..3 {
            ddrphy::rdly_dq_bitslip_write(1);
        }
    }

@sbourdeauducq
Member Author

okay, another I/O block reset issue...

@sbourdeauducq
Member Author

No more SDRAM issue on any board here! Thanks again for the debugging @enjoy-digital

pmldrmota pushed a commit to pmldrmota/artiq that referenced this issue Jan 17, 2021
Resolved conflicts:

 - coredevice.urukul.CPLD and AD9910 APIs have changed to Robert's
   upstream version, pulse_io_update() and emit_io_update arguments
   no longer exist. Experiment code needs to be updated to use new
   API for phase control.

 - gateware.eem.Urukul reset pin pad name has changed to harmonise
   with upstream phase synchronisation implementation. Need to supply
   ttl_simple.ClockGen as argument in our target configuration.

 - Updated MiSoC to latest, as SDRAM read leveling bug should be
   fixed (cf. m-labs#1149).