Hang encountered with pdti8 tests using proj/mnv2_first/. #47
I looked through the assembly and found a sequence known to cause problems: a conditional branch immediately followed by a CFU instruction. So I added some no-ops in the source code in front of the CFU instruction, and verified that these translated into assembly (they did). But the hang still occurred, so this sequence does not seem to be the cause.
The hang occurs in the pixel loop, 100+ pixels into the computation. Each iteration, there are …
In all earlier pixels, including the first full iteration shown in the image above, all CFU instructions respond immediately with a … In the final pixel, the first …
More findings: The "level" of the fifo for the output queue seems corrupted from the start of the waveform capture. Its starting value is 0x1f8, out of a maximum 0x1ff. So apparently it underflowed at some earlier point, or was left almost full. (I started the waveform capture on the second full inference, since this is where the hang typically occurred.) So for the first 100+ pixels, the output data comes out immediately with no delay, since the output queue contains data. When I observed things "slow down", the output queue had finally emptied and was waiting for new items to be produced. That timing is probably what we should expect for the entire run. The actual error likely occurred before the waveform capture started.
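One plausible way to get a starting level of 0x1f8 is an unguarded pop on an empty queue: a small wrap-around counter goes negative and wraps to near its maximum. A minimal Python sketch of that failure mode, assuming a 9-bit level counter with no empty-guard (the class and names are hypothetical, not taken from the actual gateware):

```python
# Toy model of a 9-bit FIFO "level" counter (max 0x1ff, as in the
# waveform above). Popping an empty FIFO without an emptiness check
# wraps the level around to a large value instead of raising an error.

LEVEL_BITS = 9
LEVEL_MASK = (1 << LEVEL_BITS) - 1  # 0x1ff

class FifoLevel:
    def __init__(self):
        self.level = 0

    def push(self):
        self.level = (self.level + 1) & LEVEL_MASK

    def pop(self):
        # Missing "if self.level == 0: error" guard: underflow wraps.
        self.level = (self.level - 1) & LEVEL_MASK

fifo = FifoLevel()
for _ in range(8):      # eight spurious pops from an empty FIFO...
    fifo.pop()
print(hex(fifo.level))  # 0x1f8 — the corrupted level observed above
```

Eight extra pops before the capture started would exactly produce the observed 0x1f8; any mechanism that issues pops the producer never matched would leave the same signature.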
Ok, I had missed some key data and misdiagnosed the situation here. I had noticed that adding a printf inside the pixel loop caused the hang to disappear. BUT, the results were still incorrect. So now I think it IS a logic/software error that is causing the CFU to get into a bad state when running the pdti8 model, presumably because of tensor shapes that occur only in the pdti8 model.
To verify, I programmed the board to ensure the CFU was in its initial state. I ran the mnv2 golden tests twice, and got correct results. Then I ran the pdti8 "zeros input" test. Then I ran the mnv2 golden test again; this time incorrect results were reported.
@AvG, I suppose we could argue that we have specialized the CFU to run only mnv2 correctly, that behavior on other models is undefined, and that the system does not even check whether unsupported tensor shapes occur. Certainly in a production case where code size is limited, we may want to eliminate the non-specialized version of a kernel if it is never called. But I'm not sure that's appropriate here in CFU-Playground.
Agreed that this is a bug I'd like to find and fix. Even if it does not
fully accelerate the pdti8 model, the accelerator should run correctly.
…On Thu, Apr 8, 2021 at 1:23 PM TCal wrote: [quoted text of the comment above omitted]
--
Alan Green
Looking at a simulator trace, I'm seeing 2 invocations of CFU_RUN_MACC per conv 2D pixel instead of one. According to the code, which specifies the order:
In summary, it seems that speculative execution and stateful CFUs don't mix. Could we maybe try a completely unpipelined VexRiscv?
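Why a replayed instruction is harmless for a stateless CFU but fatal for a stateful one can be shown in a few lines of Python (a toy model, with hypothetical names, not the real gateware):

```python
# Toy model of a stateful CFU: a MACC unit whose accumulator persists
# across instructions. Replaying RUN_MACC — e.g. via speculative
# execution or re-execution after a pipeline flush — corrupts the
# result even though each individual invocation behaves correctly.

class ToyCfu:
    def __init__(self):
        self.acc = 0

    def run_macc(self, a, b):
        self.acc += a * b  # stateful: result depends on prior acc
        return self.acc

reference = ToyCfu()   # executes each instruction exactly once
replayed = ToyCfu()    # suffers one spurious re-execution

for a, b in [(1, 2), (3, 4)]:
    reference.run_macc(a, b)
    replayed.run_macc(a, b)

replayed.run_macc(3, 4)  # spurious second invocation of RUN_MACC

print(reference.acc, replayed.acc)  # 14 vs 26 — state has diverged
```

A pure-function CFU (output depends only on operands) would return the same value on replay; it is the hidden accumulator that turns a replay into silent corruption, which matches the "2 invocations of CFU_RUN_MACC per pixel" seen in the trace.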
That is a very puzzling trace. I don't see any way an extra … There are two possible scenarios that we know of where a CFU instruction is executed when it shouldn't be:
Of course we might have multiple bugs here. Let me merge the VexRiscv with earlyBranch=true to eliminate one of the issues.
Are you also seeing the additional starts in the first conv2d?
I don't see the extra starts now. I did notice that f_count_max is zero where I notice things getting strange (6th conv2d). The count being loaded is 2048, which doesn't fit in the [10:0] representation.
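The zero f_count_max is consistent with plain bit truncation, which a one-line Python check confirms (field name taken from the comment above, width assumed to be [10:0]):

```python
# 2048 stored into an 11-bit field [10:0] silently wraps to zero,
# which would explain f_count_max being observed as 0: 2048 is
# exactly 1 << 11, so all of its set bits lie above bit 10.
count = 2048
f_count_max = count & 0x7FF  # keep only bits [10:0]
print(f_count_max)           # 0
```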
I am able to get correct behavior if I make these changes:
@AvG, I'll let you make the actual fix; I'm not sure my edits are the best solution.
Correctly track filter size. Expands several fields from 11 to 12 bits, which allows them to express the full range 0-2048, inclusive. This is a fix for bug google#47.
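A quick sanity check on the commit message's claim, in plain Python rather than the gateware: 11 bits cannot represent 2048, while 12 bits cover the full inclusive range 0..2048.

```python
# Verify the bit widths in the fix: a value "fits" in a field if
# masking it to that width leaves it unchanged.
def fits(value, bits):
    return value == (value & ((1 << bits) - 1))

print(fits(2048, 11))  # False — wraps in the old [10:0] fields
print(fits(2048, 12))  # True  — the widened 12-bit fields hold it
```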
Verified the fix works on the board.
On current main, after PR #40, a hang can be reliably reproduced (instructions below). If printf() statements are added, the hang disappears. Thus it seems there is not a logic error in the CFU or software, but perhaps a subtle hardware error, maybe with the CPU but maybe not. No hangs were seen with mnv2.
To reproduce: build and program the bitstream as usual; build the software as usual; load the software for live interaction ("make load"). Then type "1 1 1 1". These commands specify model tests, choose pdti8, run the inference with zeros input, and finally run inference with zeros input again, at which point the hang is encountered. If you exit the interactive session with ctrl-C, and then reload the software ("make load"), then type "1 1 1", you should see the hang the first time inference is run. But if you reprogram the bitstream, then you need to run inference twice to see the hang again.
The hang can be reproduced in Verilator simulation ("make PLATFORM=sim load"). It hangs at exactly the same place of the second inference: layer 5, which is the second CONV_2D layer.
If a printf is added inside the pixel loop of CONV_2D to print out the value of "p" each iteration, the hang disappears.
I experimented with building the CPU with no data cache, since data cache misses can cause instructions to be re-executed, which would in particular cause problems with stateful CFUs, as is the case here; but I still saw the hang with no data cache.