Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoidable stalling in the RISC-V example #15

Open
mbty opened this issue Jul 15, 2021 · 2 comments
Open

Avoidable stalling in the RISC-V example #15

mbty opened this issue Jul 15, 2021 · 2 comments

Comments

@mbty
Copy link

mbty commented Jul 15, 2021

valid_rs1 and valid_rs2 are never read in the RISC-V example. Making sure that the values really correspond to rs1 or rs2 instead of stalling whenever at least one of them is associated with something other than 0 in the scoreboard yields better performance:

-- Running tests with Cuttlesim --
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/rvbench_qsort.rv32 -1
  [before] real: 0m0.018s	user: 0m0.017s	sys: 0m0.001s
  [after ] real: 0m0.013s	user: 0m0.013s	sys: 0m0.000s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/rvbench_median.rv32 -1
  [before] real: 0m0.012s	user: 0m0.012s	sys: 0m0.000s
  [after ] real: 0m0.009s	user: 0m0.009s	sys: 0m0.000s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/img.rv32 -1
  [before] real: 0m0.213s	user: 0m0.197s	sys: 0m0.016s
  [after ] real: 0m0.157s	user: 0m0.137s	sys: 0m0.019s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/primes.rv32 -1
  [before] real: 0m4.413s	user: 0m4.407s	sys: 0m0.003s
  [after ] real: 0m3.084s	user: 0m3.080s	sys: 0m0.001s
_objects/rv32i.v/rvcore.cuttlesim.opt tests/_build/rv32i/integ/morse.rv32 -1
  [before] real: 0m1.583s	user: 0m1.581s	sys: 0m0.000s
  [after ] real: 0m1.142s	user: 0m1.141s	sys: 0m0.000s
-- Running tests with Verilator --
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/rvbench_qsort.vmh -1
  [before] real: 0m0.033s	user: 0m0.033s	sys: 0m0.000s
  [after ] real: 0m0.031s	user: 0m0.030s	sys: 0m0.001s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/morse.vmh -1
  [before] real: 0m2.692s	user: 0m2.689s	sys: 0m0.001s
  [after ] real: 0m2.547s	user: 0m2.544s	sys: 0m0.000s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/rvbench_median.vmh -1
  [before] real: 0m0.022s	user: 0m0.022s	sys: 0m0.000s
  [after ] real: 0m0.021s	user: 0m0.021s	sys: 0m0.000s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/primes.vmh -1
  [before] real: 0m8.383s	user: 0m8.371s	sys: 0m0.002s
  [after ] real: 0m7.892s	user: 0m7.885s	sys: 0m0.001s
_objects/rv32i.v/obj_dir.opt/Vtop +VMH=tests/_build/rv32i/integ/img.vmh -1
  [before] real: 0m0.401s	user: 0m0.388s	sys: 0m0.013s
  [after ] real: 0m0.378s	user: 0m0.361s	sys: 0m0.017s

I kept only the tests that run for more than 2ms to keep this short. This results in tests taking roughly 25% less time with Cuttlesim and 5% less time for Verilator. Of course, your mileage may vary. I did not check the effects on synthesis.

See related commit 66b59e9.

@cpitclaudel
Copy link
Contributor

cpitclaudel commented Jul 15, 2021

Good stuff! :) Looping in @threonorm , who wrote the code; I wonder if the same pattern existed in the Bluespec version of the code.

This results in tests taking roughly 25% less time with Cuttlesim and 5% less time for Verilator.

I like this change ^^

@threonorm
Copy link
Contributor

threonorm commented Jul 15, 2021

Yes, we currently stall conservatively for source register (but do not add dependencies for destination registers which are not real destinations). Mainly because the design we started from was doing that (it is coming from a class).

I don't have a strong opinion, we can merge that change as it is not important to keep the comparison with the baseline anymore.

Here are the things I think we should look at:
1 - variations on the number of cycles (instead of simulation time, that can vary quite a bit and is not really architecturally meaningful)
2 - variation on the critical path both on FPGA and the open pdk synthesis
3 - variation on the type checking + compilation time
(4 - incompatibilities/porting work/variations on the symbolic evaluation time/use in already written proofs)

Maybe we can do a little experimentation for 1) and 2), I think the critical path is currently in decode on the equivalent pattern for the destination register and the scoreboard. Or at least that's what I remember, so it may move to go to the source instead, not sure.

For 3), hopefully it should not influence too much, but we have seen weirdness before, so it could be good to confirm it.

I think for 4), this could impact the proof we did for the enclave system, but we probably don't want to keep the code in sync anyway (and it is probably out of sync already), so I don't mind ignoring that dimension.

I can look at 1, 2, 3 but sadly not before august.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants