Improve robustness of OpenOCD connection in SRAM execution tests #15174

Closed
dmcardle opened this issue Sep 28, 2022 · 7 comments · Fixed by #15163
Labels
Component:FPGA FPGA related issues Component:Software Issue related to Software Type:Bug Bugs

@dmcardle
Contributor

Switching from Test ROM to ROM in #15024 caused our tower of GDB/OpenOCD/JTAG/FPGA to become flaky.

I summarized my thoughts in the commit message for 473261d, which I'll copy here.

Prior to this commit, we relied on the startup routine in the Test ROM
that configures the PMP to make the entire SRAM region RWX. The ROM
does not have the same "gadget" for us to reuse, so we needed to figure
out how to do the same thing with OpenOCD commands. This proved very
tricky due to implementation details of OpenTitan's debug module[^0],
but we did find a workaround.

The test is a little flaky now that we're using the ROM. I think there
are at least two sources of flakiness: (1) the ROM bootloops because it
cannot find ROM_EXT in either flash slot, and (2) the watchdog timer.

  * I think that establishing the OpenOCD connection races with the
    device rebooting, so if we're unlucky the connection will fail. If
    this is the case, we may be able to automatically retry OpenOCD
    until the connection succeeds.

  * I've addressed the timer issue by adding a sequence of GDB commands
    that disable the watchdog timer.

While trying to reduce flakiness, it's easy to succumb to confirmation
bias. To smooth out the noise a bit, I've been repeating the test 20
times with the command below. Anecdotally, I'm seeing somewhere between
0 and 7 flakes per 20 runs.

    ./bazelisk.sh test --runs_per_test=20 \
      --cache_test_results=no --test_output=streamed \
      //sw/device/examples/sram_program:sram_program_fpga_cw310_test
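To put a number on the confirmation-bias concern, here is a minimal sketch of how little a single clean batch proves. It assumes each run flakes independently with a fixed probability, which may not hold on shared FPGA hardware; the function names are hypothetical and not part of the OpenTitan tree.

```python
import math


def prob_no_flakes(flake_rate, runs):
    """Probability that `runs` independent runs all pass at the given flake rate."""
    return (1.0 - flake_rate) ** runs


def runs_needed(flake_rate, confidence=0.95):
    """Smallest number of consecutive passes that would be seen with probability
    at most (1 - confidence) if the true flake rate were `flake_rate`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - flake_rate))


if __name__ == "__main__":
    # With a true 10% flake rate, a fully green batch of 20 is still fairly likely.
    print(f"{prob_no_flakes(0.10, 20):.3f}")
    # Consecutive passes needed before a 10% flake rate becomes implausible.
    print(runs_needed(0.10))
```

Under these assumptions, a batch of 20 with zero flakes is weak evidence that a fix worked, so a larger `--runs_per_test` value may be worth the extra time when confirming one.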

I think it might be possible to eliminate the flakiness by repeating the OpenOCD connection step until it succeeds. This is a little trickier than it sounds because I believe OpenOCD will hang rather than give up with a meaningful exit code. We may have to make an OpenOCD wrapper that runs the command and watches stdout/stderr to decide whether to wait or retry.
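The wrapper idea could be sketched roughly as follows. This is an illustration, not the approach that was eventually merged: the function name, the timeout values, and the "ready" marker are all assumptions, and the demo uses a stand-in child process rather than a real OpenOCD invocation.

```python
import queue
import subprocess
import sys
import threading
import time


def _pump(stream, out_q):
    """Forward each line of `stream` into `out_q`; push None at EOF."""
    for line in stream:
        out_q.put(line)
    out_q.put(None)


def start_when_ready(cmd, ready_marker, attempt_timeout_s=5.0, max_attempts=5):
    """Start `cmd` and return the live process once `ready_marker` shows up
    in its output. If the marker does not appear within `attempt_timeout_s`
    (for example because the process hangs), or the process exits first,
    kill it and try again, up to `max_attempts` times."""
    for _ in range(max_attempts):
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge streams so both are watched
            text=True,
        )
        lines = queue.Queue()
        threading.Thread(target=_pump, args=(proc.stdout, lines), daemon=True).start()
        deadline = time.monotonic() + attempt_timeout_s
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # attempt timed out: treat as a hang
            try:
                line = lines.get(timeout=remaining)
            except queue.Empty:
                break  # no output before the deadline
            if line is None:
                break  # process closed its output without the marker
            if ready_marker in line:
                return proc  # caller owns the process from here on
        proc.kill()
        proc.wait()
    raise RuntimeError(f"no {ready_marker!r} after {max_attempts} attempts")


if __name__ == "__main__":
    # Stand-in for OpenOCD: a child that prints the marker and exits.
    demo = start_when_ready([sys.executable, "-c", "print('ready', flush=True)"], "ready")
    demo.wait()
```

For real use, `cmd` would be the OpenOCD command line and `ready_marker` a log line indicating that the JTAG connection came up; which line is reliable would need to be checked against actual OpenOCD output. The reader thread keeps draining the pipe after the function returns, so the child does not block on a full pipe.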

@dmcardle dmcardle added Component:FPGA FPGA related issues Component:Software Issue related to Software Type:Bug Bugs Component:Rom/E2e/Test ROM end-to-end test (please use Component:Rom/e2e/Task for non-test tasks) labels Sep 28, 2022
@dmcardle dmcardle self-assigned this Sep 28, 2022
@dmcardle
Contributor Author

@alphan FYI

@tjaychen

@dmcardle / @alphan for your use case and testing, would it be easier to just have ROM_EXEC_EN = 0?
At least for the cases we are concerned about in manufacturing, we expect most of the SRAM injection to happen with ROM_EXEC_EN=0, precisely because the ROM would shut down if it cannot find a valid image.

@dmcardle
Contributor Author

Yeah, I should have mentioned that actually. Disabling execution prevents the ROM from boot-looping and prevents it from setting up the watchdog timer. We could disable execution in a special OTP image and splice it into the CW310 bitstreams, but OTP splicing needs a little work first (#15162).

@tjaychen

ah sounds good. Yeah it would just be good to confirm that having ROM_EXEC_EN=0 actually makes the connection more stable. If even then it's still flaky, we may have some more issues we need to track down.

@dmcardle
Contributor Author

I kind of simulated ROM_EXEC_EN=0 by commenting out the condition on the wfi instruction in _rom_start_boot and I believe that eliminated the flakiness. I should confirm that again, though.

@tjaychen

ah sounds good...at least there's no other hidden issue hehe.

@dmcardle dmcardle added this to the Project: M2 milestone Sep 28, 2022
@alphan
Contributor

alphan commented Sep 28, 2022

Yeah, both #14484 and #14486 say:

CREATOR_SW_CFG_ROM_EXEC_EN should be set to 0.

@dmcardle I think you can close this once you can confirm that everything is stable.

We can either

  • wait for the OTP issue to be resolved,
  • simply comment out the check before wfi and --define bitstream=gcp_splice as you suggested, or
  • set the corresponding entry in the RMA HJSON file to 0 and test using the resulting bitstream.

It's good to know that we can attach even when the ROM is running but the plan is to enable execution only after we are confident.

@alphan alphan removed the Component:Rom/E2e/Test ROM end-to-end test (please use Component:Rom/e2e/Task for non-test tasks) label Sep 29, 2022
dmcardle added a commit to dmcardle/opentitan that referenced this issue Oct 15, 2022
Now that execution is disabled, OpenOCD connection should not be flaky
anymore! This should fix lowRISC#15174.

Signed-off-by: Dan McArdle <dmcardle@google.com>
dmcardle added a commit to dmcardle/opentitan that referenced this issue Oct 16, 2022
Now that execution is disabled, OpenOCD connection should not be flaky
anymore! This should fix lowRISC#15174.

Signed-off-by: Dan McArdle <dmcardle@google.com>
dmcardle added a commit to dmcardle/opentitan that referenced this issue Oct 17, 2022
Now that execution is disabled, OpenOCD connection should not be flaky
anymore! This should fix lowRISC#15174.

Signed-off-by: Dan McArdle <dmcardle@google.com>
dmcardle added a commit to dmcardle/opentitan that referenced this issue Oct 18, 2022
Now that execution is disabled, OpenOCD connection should not be flaky
anymore! This should fix lowRISC#15174.

Signed-off-by: Dan McArdle <dmcardle@google.com>
drewmacrae pushed a commit that referenced this issue Oct 19, 2022
Now that execution is disabled, OpenOCD connection should not be flaky
anymore! This should fix #15174.

Signed-off-by: Dan McArdle <dmcardle@google.com>

3 participants