Hang encountered with pdti8 tests using proj/mnv2_first/. #47

Closed
tcal-x opened this issue Apr 5, 2021 · 13 comments

tcal-x commented Apr 5, 2021

On current main, after PR #40, a hang can be reliably reproduced (instructions below). If printf() statements are added, the hang disappears. Thus it seems there is not a logical error in the CFU or software, but perhaps a subtle hardware error, possibly in the CPU, possibly elsewhere. No hangs were seen with mnv2.

To reproduce: build and program the bitstream as usual, build the software as usual, then load the software for live interaction ("make load") and type "1 1 1 1". These commands select the model tests, choose pdti8, run inference with zeros input, and then run inference with zeros input again, at which point the hang occurs. If you exit the interactive session with Ctrl-C and reload the software ("make load"), then typing "1 1 1" should produce the hang the first time inference is run. But if you reprogram the bitstream, you again need to run inference twice to see the hang.

The hang can be reproduced in Verilator simulation ("make PLATFORM=sim load"). It hangs at exactly the same place of the second inference: layer 5, which is the second CONV_2D layer.

If a printf is added inside the pixel loop of CONV_2D to print out the value of "p" each iteration, the hang disappears.

I experimented with building the CPU with no data cache, since data-cache misses can cause instructions to be re-executed, which is a particular problem for a stateful CFU like this one; but I still saw the hang with no data cache.

tcal-x commented Apr 5, 2021

I looked through the assembly and found a sequence known to cause problems: a conditional branch immediately followed by a CFU instruction. So I added some no-ops in the source code in front of the CFU instruction and verified that they showed up in the generated assembly (they did). But the hang still occurred, so this sequence does not seem to be the cause.
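For reference, the change looked roughly like the sketch below; the macro here is a compile-only stand-in for the project's real CFU macro, and the surrounding function is illustrative rather than the actual kernel code.

```c
#include <stdint.h>

/* Stand-in so the sketch compiles; the real macro expands to a custom
 * RISC-V CFU instruction. */
#define CFU_MACC_RUN() ((void)0)

void run_pixel(int p, int num_pixels) {
  if (p != num_pixels - 1) {
    /* ... load input values ... */
  }
  /* Workaround attempt: no-ops between the conditional branch above and the
   * CFU instruction below, so the branch resolves before the CFU op issues. */
  __asm__ volatile("nop");
  __asm__ volatile("nop");
  CFU_MACC_RUN();
}
```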

tcal-x commented Apr 7, 2021

[waveform image: waves_hang]

tcal-x commented Apr 7, 2021

The hang occurs in the pixel loop, 100+ pixels into the computation.

Each iteration, there are

  • 4 x CFU_STORE_INPUT_VALUE
  • 1 x CFU_MACC_RUN
  • 8 x CFU_GET_OUTPUT

In all earlier pixels, including the first full iteration shown in the image above, every CFU instruction responds immediately with rsp_valid asserted. However, in the last two iterations things seem to slow down: starting at the 3rd of the 8 CFU_GET_OUTPUTs, there is a delay of about 10 cycles between cmd_valid being asserted and the corresponding rsp_valid.

In the final pixel, the first CFU_STORE_INPUT_VALUE is sent, but no reply ever occurs.
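Concretely, one pixel iteration issues the sequence below. This is just a sketch: the stub macro definitions and buffer names are placeholders so it compiles; the real macros come from the project's CFU header and issue the custom instructions seen in the trace.

```c
#include <stdint.h>

/* Placeholder stubs; each real macro issues one CFU instruction
 * (one cmd_valid / rsp_valid handshake on the CFU bus). */
#define CFU_STORE_INPUT_VALUE(v) ((void)(v))
#define CFU_MACC_RUN()           ((void)0)
#define CFU_GET_OUTPUT()         ((uint32_t)0)

void one_pixel(const uint32_t *in_words, uint32_t *out_words) {
  for (int i = 0; i < 4; i++)             /* 4 x CFU_STORE_INPUT_VALUE */
    CFU_STORE_INPUT_VALUE(in_words[i]);
  CFU_MACC_RUN();                         /* 1 x CFU_MACC_RUN          */
  for (int i = 0; i < 8; i++)             /* 8 x CFU_GET_OUTPUT        */
    out_words[i] = CFU_GET_OUTPUT();
}
```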

tcal-x commented Apr 8, 2021

More findings: the "level" of the FIFO for the output queue appears corrupted from the start of the waveform capture. Its starting value is 0x1f8, out of a maximum of 0x1ff, so apparently it either underflowed at some earlier point or was left almost full. (I started the waveform capture on the second full inference, since that is where the hang typically occurred.)

So for the first 100+ pixels, the output data comes out immediately with no delay, since the output queue contains data. When I observed things "slow down", the output queue had finally emptied and was waiting for new items to be produced. That timing is probably what we should expect for the entire run.

The actual error likely occurred before the waveform capture started.
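To illustrate the underflow theory: assuming the level is a 9-bit up/down counter (hence the 0x1ff maximum), popping 8 more words than were ever pushed wraps it around to exactly the 0x1f8 I'm seeing. A minimal sketch:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t level = 0;              /* FIFO starts empty */
  for (int i = 0; i < 8; i++)
    level = (level - 1) & 0x1ff;   /* each excess pop underflows the 9-bit counter */
  printf("level after 8 excess pops: 0x%03x\n", level);  /* prints 0x1f8 */
  return 0;
}
```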

tcal-x commented Apr 8, 2021

OK, I had missed some key data and misdiagnosed the situation. I had noticed that adding a printf inside the pixel loop made the hang disappear, BUT the results were still incorrect. So now I think it IS a logic/software error that puts the CFU into a bad state when running the pdti8 model, presumably because of tensor shapes that occur only in the pdti8 model.

To verify, I programmed the board to ensure the CFU was in its initial state. I ran the mnv2 golden tests twice, and got correct results. Then I ran the pdti8 "zeros input" test. Then I ran the mnv2 golden test again; this time incorrect results were reported.

@AvG, I suppose we could argue that we have specialized the CFU to run only mnv2 correctly, that behavior on other models is undefined, and that the system does not even check whether unsupported tensor shapes occur. Certainly, in a production setting where code size is limited, we may want to eliminate the non-specialized version of a kernel if it is never called. But I'm not sure that's appropriate here in CFU-Playground.
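If we did want to guard against this, a shape check with a fallback to the non-specialized kernel might look like the sketch below. The struct, field, and function names here are hypothetical (not the actual TFLite Micro / CFU-Playground API), and the specific shape conditions are only examples.

```c
#include <stdbool.h>

/* Hypothetical shape description; the real kernel gets this information from
 * the TFLite Micro conv parameters. */
typedef struct {
  int filter_width;
  int filter_height;
  int input_depth;
  int output_depth;
} ConvShape;

/* Only accept shapes the specialized mnv2 path was written for (example
 * conditions, not the real constraints). */
static bool cfu_kernel_supports(const ConvShape *s) {
  return s->filter_width == 1 && s->filter_height == 1 &&
         (s->input_depth % 4) == 0 && (s->output_depth % 4) == 0;
}

void conv2d_dispatch(const ConvShape *s) {
  if (cfu_kernel_supports(s)) {
    /* specialized, CFU-accelerated path */
  } else {
    /* reference (non-specialized) kernel */
  }
}
```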

tcal-x changed the title from "Hang encountered with pdti8 tests; cause unknown." to "Hang encountered with pdti8 tests using proj/mnv2_first/." on Apr 8, 2021
alanvgreen commented Apr 8, 2021 via email

alanvgreen commented Apr 8, 2021

Looking at a simulator trace, I'm seeing 2 invocations of CFU_MACC_RUN per conv 2D pixel instead of one.

[simulator trace image]

According to the code, which specifies the order:

  1. LoadInputs (ins_set)
  2. Run Calc (start_run)
  3. Get from output queue (oq_get)

    for (int p = 0; p < num_pixels; p++) {
      // Load twice on first loop, no load on last loop and once every other
      // time.
      if (p == 0) {
        LoadInputValues(input_ptr, input_depth_words);
      }
      if (p != num_pixels - 1) {
        LoadInputValues(input_ptr, input_depth_words);
      }

      CFU_MACC_RUN();
      UnloadOutputValues(output_ptr, batch_size / 4);
      output_ptr += (output_depth - batch_size) / 4;
    }

In summary, it seems that speculative execution and stateful CFUs don't mix.

Could we maybe try a completely unpipelined VexRiscv?

tcal-x commented Apr 8, 2021

That is a very puzzling trace. I don't see any way an extra CFU_MACC_RUN() could be executed between the unloading and the next loading.

There are two possible scenarios that we know of where a CFU instruction is executed when it shouldn't be:

  • A conditional branch is taken to hop over an immediately following CFU instruction, but because the branch is not resolved soon enough, the following instruction is incorrectly sent to the CFU anyway (see the sketch at the end of this comment). However, I have versions of VexRiscv that eliminate this possibility, and I still see the hang. Also, this is not exactly the scenario here, since the CFU_MACC_RUN() is not conditional.

  • A CFU instruction immediately follows an instruction that causes a data cache miss. In this case the instruction is squashed and re-executed, but, as above, we can't actually un-send a CFU instruction once it has gone out to the CFU, so it ends up executed twice in rapid succession when it should have been executed just once. But that is not the case here either.

Of course we might have multiple bugs here. Let me merge the VexRiscv with earlyBranch=true to eliminate one of the issues.
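For reference, scenario 1 corresponds to a pattern like the sketch below (the macro is a compile-only stand-in): the `if` compiles to a conditional branch whose taken path skips the very next instruction, which is the CFU opcode.

```c
#include <stdint.h>

/* Compile-only stand-in for a real CFU macro. */
#define CFU_GET_OUTPUT() ((uint32_t)0)

uint32_t maybe_get_output(int have_data) {
  uint32_t out = 0;
  /* The branch that implements this `if` is immediately followed by the CFU
   * opcode; if the branch resolves too late, that opcode can still be sent to
   * the CFU even on the path that was supposed to skip it. */
  if (have_data) {
    out = CFU_GET_OUTPUT();
  }
  return out;
}
```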

tcal-x commented Apr 8, 2021

[annotated waveform image: wave_annot]

This is a waveform I captured from the first pdti8 inference I ran. I captured waves only during conv_2d. The signals pointed out seem to get wonky after the 6th conv_2d layer.

@alanvgreen

Are you also seeing the additional starts in the first conv2d?

tcal-x commented Apr 8, 2021

I don't see the extra starts now.

I did notice that f_count_max is zero at the point where things get strange (the 6th conv2d). The count being loaded is 2048, which doesn't fit in the [10:0] representation.
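A quick sanity check of that truncation: 2048 is 0x800, which needs 12 bits, so loading it into an 11-bit [10:0] field leaves exactly the zero f_count_max I'm seeing.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t count = 2048;                        /* value being loaded        */
  printf("11-bit field: %u\n", count & 0x7ff);  /* prints 0 (truncated)      */
  printf("12-bit field: %u\n", count & 0xfff);  /* prints 2048 (fits intact) */
  return 0;
}
```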

tcal-x commented Apr 8, 2021

I am able to get correct behavior if I make these changes:

  • in gateware/sequencing.py, change Signal(11) to Signal(12) in three places
  • in gateware/mnv2_cfu.py, change FILTER_DATA_MEM_DEPTH from 512 to 2048

@AvG, I'll let you make the actual fix; I'm not sure my edits are the best solution.

alanvgreen added a commit to alanvgreen/CFU-Playground that referenced this issue Apr 11, 2021
Correctly track filter size. Expands several fields from 11 to 12 bits,
which allows them to express the full range 0-2048, inclusive.

This is a fix for bug google#47.
tcal-x commented Apr 12, 2021

Verified the fix works on the board.

tcal-x closed this as completed on Apr 12, 2021