Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

load_unit: Fix #356 uncached loads served from WB #368

Merged
merged 2 commits into from
Feb 10, 2020
Merged

Conversation

zarubaf
Copy link
Contributor

@zarubaf zarubaf commented Jan 23, 2020

This is a tentative fix for #356. I would very much appreciate your feedback @jrrk as I don't have an isolated test case for it.

What the fix does is basically checking whether the write buffer is empty in case we are performing an uncached load. If the WB should not be empty the load is killed and retried once the WB gets empty.

@jrrk2
Copy link

jrrk2 commented Jan 28, 2020

I have checked in a modified system containing a testbench with UART software. This appears to run in VCS, I would imagine it would be OK in modelsim as well. I have deleted the old cache because the definitions were conflicting (and it didn't work anyway ?). I haven't been able to find any difference in behaviour yet (other than reduced performance), so perhaps my theory about why it goes wrong is bogus, or else a more complicated interaction. To use, just check out recursively and run 'make simv',
followed by (for example) ./simv -nogui -l simv.log

https://github.com/jrrk2/ariane/tree/issue-356

@jrrk2
Copy link

jrrk2 commented Jan 29, 2020

I have added a simulation of the SD-Card bridge to my branch above. This was the peripheral that was giving the trouble previously. I have added code that initializes that peripheral also. A strange thing happens, a spurious 2-cycle read of the current address (highlighted in red on the attached scrrenshot) is added part way through my list of programmed instructions. This peripheral's interface is different in that it uses axi2mem instead of axi2apb. Interrupts are never enabled in this simulation. You need an installation of Vivado setup with the usual environment variables for this simulation to run.

Screenshot from 2020-01-29 15-49-26

Copy link

@jrrk2 jrrk2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have discovered that the miss unit can reorder uncachable transactions, the following might help:

@@ -388,6 +388,9 @@ module wt_dcache_missunit #(
      end else begin
        state_d     = DRAIN;
      end
    // check for non-cached read whilst write buffer is non-empty, if so drain but still allow more stores
    end else if ((|miss_req_masked_d) && (!miss_is_write) && miss_nc_i[miss_port_idx] && !wbuffer_empty_i) begin
       state_d = DRAIN;
    // we've got a miss to handle
    end else if (|miss_req_masked_d) begin
      // this is a write miss, just pass through (but check whether write collides with MSHR)

@zarubaf
Copy link
Contributor Author

zarubaf commented Feb 3, 2020

Looking at the AXI cache interface the cache seems to do the right thing (see attached picture). It is issuing exactly one load of size 3 (64 bit) (ld I guess). Or I am blind an not seeing the issue here.

peripheral_read

How is the SD Card module connected? My new working assumption would be that the AXI bus is somehow smaller (32 bit) and the downsizer module is somehow translating the 64bit access into two 32bit accesses to (incorrect) address 0x20 and 0x24 (the second getting somehow truncated).

Can you comment on how you connected the SD module?

@jrrk2
Copy link

jrrk2 commented Feb 3, 2020

As you can see from ariane_peripherals.sv:323, the sd_bus module just uses an axi2mem block from your library that does no downsizing. My complaint would be that the source code in disk_initialize at boot.c:194 calls for 8 writes before the first read. The code with just your fix (and without my fix) starts to read after only 5 writes. Therefore your fix alone is insufficient and can you review my fix to your fix (with Michael if necessary) and see if you think it is correct. If so, we have at least one other problem which may be unrelated, because Linux still does not boot correctly. This has to do with systemd behaviour and I will endeavour to isolate this as well in a small simulation.

@zarubaf
Copy link
Contributor Author

zarubaf commented Feb 3, 2020

I think I might now understand what the problem is: I guess the core is making speculative reads to the IO region. The fence after the stores to the array are flushing the WB and flushing the pipeline of the processor which re-triggers the execution of the load (which then leads to a double load).

@msfschaffner took a look at your patch and concluded that this is not the final fix, that is the reason why I am trying to understand what is happening here.

Can you please point me to the binary file. I can't reverse engineer this from the simv invocation.

@jrrk2
Copy link

jrrk2 commented Feb 3, 2020

The binary is checked in here:

https://github.com/jrrk2/ariane/blob/uncached_loads_bug/fpga/src/bootrom/bootrom.elf

Normally you can just recreate with:

make -C fpga/src/bootrom all

The disassembly with interspersed source is here:

https://github.com/jrrk2/ariane/blob/uncached_loads_bug/fpga/src/bootrom/bootrom.dis

@zarubaf
Copy link
Contributor Author

zarubaf commented Feb 3, 2020

Thanks for your support. I've pushed a commit which should fix the idempotency issue you were seeing. I've declared the whole NC region as non-idempotent in ariane_soc_pkg. You might need to adapt this for your particular instantiation.

  • Linux is still booting for me
  • The test-case you kindly provided now shows all eight stores before the memory fence and the subsequent load instruction.

Please let me know how this works for you and whether it magically solves the systemd issue.

P.S.: Somewhere you mentioned that the "wide-fetch" patch did not work for you. Which patch did you exactly mean?

@jrrk
Copy link
Contributor

jrrk commented Feb 3, 2020

The wide fetch patch I meant was the generalisation of the front end to receive 64 bits from the icache into the realigner instead of 32. But perhaps this is not your current thinking. The test case that appears to fail was a 32-bit instruction sandwiched between two 16-bit instructions. But it would take a while to reproduce this.

@zarubaf
Copy link
Contributor Author

zarubaf commented Feb 3, 2020

Yeah that was exactly what I meant, thanks for clarifying. That's the patch that we have on the ariane-dev branch, right? So the master is still "stable" for you?

@jrrk
Copy link
Contributor

jrrk commented Feb 3, 2020

I can confirm the latest patch is OK on FPGA to merge to master by the end of tomorrow I think. the dev branch doesn’t boot Linux if I remember rightly.

@zarubaf zarubaf merged commit bf71095 into master Feb 10, 2020
@zarubaf zarubaf deleted the issue-356 branch April 15, 2020 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants