Hang encountered with pdti8 tests using proj/mnv2_first/. #47

Closed
tcal-x opened this issue Apr 5, 2021 · 13 comments

tcal-x commented Apr 5, 2021

On current main, after PR #40, a hang can be reliably reproduced (instructions below). If printf() statements are added, the hang disappears. Thus it seems there is not a logical error in the CFU or software, but perhaps a subtle hardware error, possibly in the CPU, possibly elsewhere. No hangs were seen with mnv2.

To reproduce: build and program the bitstream as usual, build the software as usual, then load the software for live interaction ("make load") and type "1 1 1 1". These commands select the model tests, choose pdti8, run inference with zeros input, and then run inference with zeros input again, at which point the hang occurs. If you exit the interactive session with Ctrl-C and reload the software ("make load"), then typing "1 1 1" should produce the hang the first time inference is run. But if you reprogram the bitstream, you again need to run inference twice to see the hang.

The hang can be reproduced in Verilator simulation ("make PLATFORM=sim load"). It hangs at exactly the same place of the second inference: layer 5, which is the second CONV_2D layer.

If a printf is added inside the pixel loop of CONV_2D to print out the value of "p" each iteration, the hang disappears.

I experimented with building the CPU with no data cache, since data-cache misses can cause instructions to be re-executed, which is a particular problem for a stateful CFU like this one; but I still saw the hang with no data cache.

tcal-x commented Apr 5, 2021

I looked through the assembly and found a sequence known to cause problems: a conditional branch immediately followed by a CFU instruction. So I added some no-ops in the source code in front of the CFU instruction and verified that they showed up in the generated assembly (they did). But the hang still occurred, so this sequence does not seem to be the cause.
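For reference, the change looked roughly like the sketch below; the macro here is a compile-only stand-in for the project's real CFU macro, and the surrounding function is illustrative rather than the actual kernel code.

```c
#include <stdint.h>

/* Stand-in so the sketch compiles; the real macro expands to a custom
 * RISC-V CFU instruction. */
#define CFU_MACC_RUN() ((void)0)

void run_pixel(int p, int num_pixels) {
  if (p != num_pixels - 1) {
    /* ... load input values ... */
  }
  /* Workaround attempt: no-ops between the conditional branch above and the
   * CFU instruction below, so the branch resolves before the CFU op issues. */
  __asm__ volatile("nop");
  __asm__ volatile("nop");
  CFU_MACC_RUN();
}
```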

tcal-x commented Apr 7, 2021

[waveform image: waves_hang]

tcal-x commented Apr 7, 2021

The hang occurs in the pixel loop, 100+ pixels into the computation.

Each iteration, there are

  • 4 x CFU_STORE_INPUT_VALUE
  • 1 x CFU_MACC_RUN
  • 8 x CFU_GET_OUTPUT

In all earlier pixels, including the first full iteration shown in the image above, every CFU instruction responds immediately with rsp_valid asserted. However, in the last two iterations things seem to slow down: starting at the 3rd of the 8 CFU_GET_OUTPUTs, there is a delay of about 10 cycles between cmd_valid being asserted and the corresponding rsp_valid.

In the final pixel, the first CFU_STORE_INPUT_VALUE is sent, but no reply ever occurs.
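Concretely, one pixel iteration issues the sequence below. This is just a sketch: the stub macro definitions and buffer names are placeholders so it compiles; the real macros come from the project's CFU header and issue the custom instructions seen in the trace.

```c
#include <stdint.h>

/* Placeholder stubs; each real macro issues one CFU instruction
 * (one cmd_valid / rsp_valid handshake on the CFU bus). */
#define CFU_STORE_INPUT_VALUE(v) ((void)(v))
#define CFU_MACC_RUN()           ((void)0)
#define CFU_GET_OUTPUT()         ((uint32_t)0)

void one_pixel(const uint32_t *in_words, uint32_t *out_words) {
  for (int i = 0; i < 4; i++)             /* 4 x CFU_STORE_INPUT_VALUE */
    CFU_STORE_INPUT_VALUE(in_words[i]);
  CFU_MACC_RUN();                         /* 1 x CFU_MACC_RUN          */
  for (int i = 0; i < 8; i++)             /* 8 x CFU_GET_OUTPUT        */
    out_words[i] = CFU_GET_OUTPUT();
}
```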

tcal-x commented Apr 8, 2021

More findings: the "level" of the FIFO for the output queue appears corrupted from the start of the waveform capture. Its starting value is 0x1f8, out of a maximum of 0x1ff, so apparently it either underflowed at some earlier point or was left almost full. (I started the waveform capture on the second full inference, since that is where the hang typically occurred.)

So for the first 100+ pixels, the output data comes out immediately with no delay, since the output queue contains data. When I observed things "slow down", the output queue had finally emptied and was waiting for new items to be produced. That timing is probably what we should expect for the entire run.

The actual error likely occurred before the waveform capture started.
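To illustrate the underflow theory: assuming the level is a 9-bit up/down counter (hence the 0x1ff maximum), popping 8 more words than were ever pushed wraps it around to exactly the 0x1f8 I'm seeing. A minimal sketch:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t level = 0;              /* FIFO starts empty */
  for (int i = 0; i < 8; i++)
    level = (level - 1) & 0x1ff;   /* each excess pop underflows the 9-bit counter */
  printf("level after 8 excess pops: 0x%03x\n", level);  /* prints 0x1f8 */
  return 0;
}
```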

tcal-x commented Apr 8, 2021

OK, I had missed some key data and misdiagnosed the situation. I had noticed that adding a printf inside the pixel loop made the hang disappear, BUT the results were still incorrect. So now I think it IS a logic/software error that puts the CFU into a bad state when running the pdti8 model, presumably because of tensor shapes that occur only in the pdti8 model.

To verify, I programmed the board to ensure the CFU was in its initial state. I ran the mnv2 golden tests twice, and got correct results. Then I ran the pdti8 "zeros input" test. Then I ran the mnv2 golden test again; this time incorrect results were reported.

@AvG, I suppose we could argue that we have specialized the CFU to run only mnv2 correctly, that behavior on other models is undefined, and that the system does not even check whether unsupported tensor shapes occur. Certainly, in a production setting where code size is limited, we may want to eliminate the non-specialized version of a kernel if it is never called. But I'm not sure that's appropriate here in CFU-Playground.
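If we did want to guard against this, a shape check with a fallback to the non-specialized kernel might look like the sketch below. The struct, field, and function names here are hypothetical (not the actual TFLite Micro / CFU-Playground API), and the specific shape conditions are only examples.

```c
#include <stdbool.h>

/* Hypothetical shape description; the real kernel gets this information from
 * the TFLite Micro conv parameters. */
typedef struct {
  int filter_width;
  int filter_height;
  int input_depth;
  int output_depth;
} ConvShape;

/* Only accept shapes the specialized mnv2 path was written for (example
 * conditions, not the real constraints). */
static bool cfu_kernel_supports(const ConvShape *s) {
  return s->filter_width == 1 && s->filter_height == 1 &&
         (s->input_depth % 4) == 0 && (s->output_depth % 4) == 0;
}

void conv2d_dispatch(const ConvShape *s) {
  if (cfu_kernel_supports(s)) {
    /* specialized, CFU-accelerated path */
  } else {
    /* reference (non-specialized) kernel */
  }
}
```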

tcal-x changed the title from "Hang encountered with pdti8 tests; cause unknown." to "Hang encountered with pdti8 tests using proj/mnv2_first/." on Apr 8, 2021
alanvgreen commented Apr 8, 2021 via email

alanvgreen commented Apr 8, 2021

Looking at a simulator trace, I'm seeing 2 invocations of CFU_MACC_RUN per conv 2D pixel instead of one.

[simulator trace image]

According to the code, which specifies the order:

  1. LoadInputs (ins_set)
  2. Run Calc (start_run)
  3. Get from output queue (oq_get)

    for (int p = 0; p < num_pixels; p++) {
      // Load twice on first loop, no load on last loop and once every other
      // time.
      if (p == 0) {
        LoadInputValues(input_ptr, input_depth_words);
      }
      if (p != num_pixels - 1) {
        LoadInputValues(input_ptr, input_depth_words);
      }

      CFU_MACC_RUN();
      UnloadOutputValues(output_ptr, batch_size / 4);
      output_ptr += (output_depth - batch_size) / 4;
    }

In summary, it seems that speculative execution and stateful CFUs don't mix.

Could we maybe try a completely unpipelined VexRiscv?

tcal-x commented Apr 8, 2021

That is a very puzzling trace. I don't see any way an extra CFU_MACC_RUN() could be executed between the unloading and the next loading.

There are two possible scenarios that we know of where a CFU instruction is executed when it shouldn't be:

  • A conditional branch is taken to hop over an immediately following CFU instruction, but because the branch is not resolved soon enough, the following instruction is incorrectly sent to the CFU anyway (see the sketch at the end of this comment). However, I have versions of VexRiscv that eliminate this possibility, and I still see the hang. Also, this is not exactly the scenario here, since the CFU_MACC_RUN() is not conditional.

  • A CFU instruction immediately follows an instruction that causes a data cache miss. In this case the instruction is squashed and re-executed, but, as above, we can't actually un-send a CFU instruction once it has gone out to the CFU, so it ends up executed twice in rapid succession when it should have been executed just once. But that is not the case here either.

Of course we might have multiple bugs here. Let me merge the VexRiscv with earlyBranch=true to eliminate one of the issues.
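For reference, scenario 1 corresponds to a pattern like the sketch below (the macro is a compile-only stand-in): the `if` compiles to a conditional branch whose taken path skips the very next instruction, which is the CFU opcode.

```c
#include <stdint.h>

/* Compile-only stand-in for a real CFU macro. */
#define CFU_GET_OUTPUT() ((uint32_t)0)

uint32_t maybe_get_output(int have_data) {
  uint32_t out = 0;
  /* The branch that implements this `if` is immediately followed by the CFU
   * opcode; if the branch resolves too late, that opcode can still be sent to
   * the CFU even on the path that was supposed to skip it. */
  if (have_data) {
    out = CFU_GET_OUTPUT();
  }
  return out;
}
```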

tcal-x commented Apr 8, 2021

[annotated waveform image: wave_annot]

This is a waveform I captured from the first pdti8 inference I ran. I captured waves only during conv_2d. The signals pointed out seem to get wonky after the 6th conv_2d layer.

@alanvgreen

Are you also seeing the additional starts in the first conv2d?

tcal-x commented Apr 8, 2021

I don't see the extra starts now.

I did notice that f_count_max is zero at the point where things get strange (the 6th conv2d). The count being loaded is 2048, which doesn't fit in the [10:0] representation.
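A quick sanity check of that truncation: 2048 is 0x800, which needs 12 bits, so loading it into an 11-bit [10:0] field leaves exactly the zero f_count_max I'm seeing.

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
  uint32_t count = 2048;                        /* value being loaded        */
  printf("11-bit field: %u\n", count & 0x7ff);  /* prints 0 (truncated)      */
  printf("12-bit field: %u\n", count & 0xfff);  /* prints 2048 (fits intact) */
  return 0;
}
```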

tcal-x commented Apr 8, 2021

I am able to get correct behavior if I make these changes:

  • in gateware/sequencing.py, change Signal(11) to Signal(12) in three places
  • in gateware/mnv2_cfu.py, change FILTER_DATA_MEM_DEPTH from 512 to 2048

@AvG, I'll let you make the actual fix; I'm not sure my edits are the best solution.

alanvgreen added a commit to alanvgreen/CFU-Playground that referenced this issue Apr 11, 2021
Correctly track filter size. Expands several fields from 11 to 12 bits,
which allows them to express the full range 0-2048, inclusive.

This is a fix for bug google#47.
tcal-x commented Apr 12, 2021

Verified the fix works on the board.

tcal-x closed this as completed on Apr 12, 2021