[SW] Initial support for compilation in Linux environment #312

Open · wants to merge 35 commits into main
Conversation

@mp-17 (Collaborator) commented Jun 25, 2024

#269 rework

Introduce initial support for kernel compilation under a Linux environment.

Changelog

Fixed

  • Description of changes

Added

  • Description of changes

Changed

  • Description of changes

Checklist

  • Automated tests pass
  • Changelog updated
  • Code style guideline is observed

Please check our contributing guidelines before opening a Pull Request.

MaistoV and others added 15 commits June 19, 2024 18:13
If LMUL_X has X > 1, Ara injects one reshuffle at a time for each register
within Vn and V(n+X-1) that has an EEW mismatch.
Each of these reshuffles targets a different Vm with LMUL_1, but from the
point of view of the hazard checks they all touch the same register
(Vn with LMUL_X) as seen by the next instruction that depends on Vn with
LMUL_X.

We cannot just inject one macro reshuffle, since the registers between
Vn and V(n+X-1) can have different encodings. So we need finer-grained
reshuffles, which mess up the dependency tracking.

For example,
vst @, v0 (LMUL_8)
will use the registers from v0 to v7. If they are all reshuffled, we
end up with 8 reshuffle instructions that get IDs from 0 to 7. The store
then sees a dependency only on the reshuffle ID that targets v0. This is
wrong: if the store operand requester (opreq) is faster than the slide
opreq once the v0 reshuffle is over, it will violate the RAW dependency.

To avoid this, the safest (and most suboptimal) fix is to just wait in
WAIT_IDLE after a reshuffle with LMUL > 1.

There are many possible optimizations to this (a behavioral sketch of the
problem follows the list):
 1) Check whether, when LMUL > 1, we reshuffled more than one register.
If we reshuffled only one register, we can also skip the WAIT_IDLE.
 2) Check whether all the X registers need to be reshuffled (the common
case). If so, inject one large reshuffle with LMUL_X and skip WAIT_IDLE.
 3) Instead of waiting until idle, inject the reshuffles starting from
V(n+X-1) rather than from Vn. This automatically adjusts the dependency
check and slightly speeds up the whole operation.
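A minimal behavioral sketch in Python (not the RTL; function names are illustrative, and it assumes the slide unit retires reshuffles in issue order) of why ascending injection breaks the hazard check and why optimization 3 repairs it:

```python
# Behavioral sketch, not the RTL: model the per-register reshuffle IDs
# and the hazard check described above. Assumes reshuffles retire in
# issue order; all names are illustrative.

def inject_reshuffles(vn, x, descending=False):
    """Issue order of the single-register reshuffles for v[vn..vn+x-1]."""
    regs = list(range(vn, vn + x))
    return list(reversed(regs)) if descending else regs

def store_wait_id(issue_order, vn):
    """The next instruction's hazard check records only the ID of the
    reshuffle that targets the register it names (vn). IDs follow issue
    order, so waiting for ID k means reshuffles 0..k are all retired."""
    return issue_order.index(vn)

# vst @, v0 with LMUL_8 reshuffles v0..v7.
asc = inject_reshuffles(0, 8)                    # [0, 1, ..., 7]
print(store_wait_id(asc, 0))                     # 0: store may start while
                                                 # v1..v7 are still being
                                                 # reshuffled -> RAW bug

desc = inject_reshuffles(0, 8, descending=True)  # [7, 6, ..., 0]
print(store_wait_id(desc, 0))                    # 7: all reshuffles retire
                                                 # before the store starts
```

Waiting in WAIT_IDLE buys the same safety by brute force; descending injection gets it from the ID ordering alone.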
* Add MMU interface (just mock)

* Refactoring
* Switch from pulp-platform/cva6 to MaistoV/cva6_fork

* Bump axi to v0.39.0
MaistoV and others added 14 commits June 25, 2024 12:17
* vstart support for vector unit-stride loads and stores

* vstart support for vector strided loads and stores

* vstart support for valu operations, mask operations not tested

* Preliminary work on vstart support for vector indexed loads and stores

* Minor fixes

* Refactoring

* Explanatory comments
- Restrict mem bus to EW if vstore, vstart > 0, and EW < 64 bits

If vstart > 0 and EW < 64, the situation is similar to when the memory
address is misaligned with respect to the memory bus. Because of the VRF
byte layout, and since the granularity of each lane's payload to the store
unit is 64 bits, all the packets can contain valid data while the beat is
still incomplete. So, either we calculate in the addrgen the effective
length of a burst with unequal beats, or we add a buffer and an aligner in
the store unit, or we handle the ready signals at byte level, or we simply
reduce the effective memory bus to the element width (the worst case).
We do the latter (see the sketch below). It costs performance, but vector
stores with vstart > 0 happen only after an exception, so the throughput
drop should be acceptable.
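An illustrative helper (not the RTL) for the worst-case fix; widths are in bits, and the 64-bit bus is an assumption of this sketch:

```python
# Illustrative helper, not the RTL: pick the effective memory-bus width
# for a store, per the worst-case fix described above.

def effective_bus_width(bus_width, ew, vstart, is_store):
    """Shrink the usable bus to one element per beat when a store
    resumes from vstart > 0 with elements narrower than the bus, so a
    beat never mixes valid and invalid bytes."""
    if is_store and vstart > 0 and ew < bus_width:
        return ew          # one element per beat: slow but simple
    return bus_width       # full-width beats otherwise

assert effective_bus_width(64, 16, vstart=3, is_store=True) == 16
assert effective_bus_width(64, 16, vstart=0, is_store=True) == 64
assert effective_bus_width(64, 64, vstart=3, is_store=True) == 64
```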

- Data packets from VRF to STU

Operand requesters now send balanced payloads from all the lanes when
vstart > 0. The store unit identifies the valid ones by itself and only
has to handshake balanced payloads (see the sketch below).
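A toy sketch (not the RTL) of the store-unit-side rule; the round-robin striping (element i lives in lane i % nr_lanes) is an assumption of this illustration:

```python
# Toy sketch, not the RTL: lanes keep their payload streams balanced
# even when vstart > 0, and the store unit itself decides which
# elements are architecturally valid.

def lane_elements(lane, nr_lanes, vl):
    """Element indices a lane streams to the store unit, in order
    (assumed striping: element i lives in lane i % nr_lanes)."""
    return list(range(lane, vl, nr_lanes))

def keep_valid(elems, vstart):
    """The STU drops the elements that precede vstart; the lanes still
    send them so that every lane's payload stream stays balanced."""
    return [e for e in elems if e >= vstart]

# 4 lanes, vl = 12, vstart = 5: lane 0 streams elements 0, 4, 8, but
# only element 8 is architecturally valid.
print(keep_valid(lane_elements(0, 4, 12), vstart=5))  # [8]
```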
- Time the STU exception flush with the opqueues

The vstart signal within the lanes is not the architectural vstart. For
all instructions, it corresponds to the architectural vstart manipulated
to reflect the "vstart" seen by every lane, for VRF fetch-address
calculation purposes. Memory instructions, which support an architectural
vstart > 0, can use that vstart signal to resize the number of elements
to fetch from the VRF. Slide instructions, instead, further modify the
vstart for addressing purposes only, and should not use the vstart signal
to resize the number of elements to fetch.
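A plausible model (not the actual signal) of how the architectural vstart could map to a per-lane value for fetch-address calculation, again assuming round-robin striping:

```python
# Plausible model, not the actual RTL signal: per-lane vstart under the
# assumed striping (element i lives in lane i % nr_lanes).

def lane_vstart(arch_vstart, lane, nr_lanes):
    """Every lane skips floor(vstart / nr_lanes) of its own elements,
    and the first (vstart mod nr_lanes) lanes skip one more."""
    return arch_vstart // nr_lanes + (1 if lane < arch_vstart % nr_lanes else 0)

# arch vstart = 10 on 4 lanes: lanes 0-1 skip 3 elements, lanes 2-3
# skip 2, for 10 skipped elements in total.
print([lane_vstart(10, lane, 4) for lane in range(4)])  # [3, 3, 2, 2]
```

Per the commit message, memory instructions may use such a per-lane value directly to shrink their VRF fetch, while slides only reuse it for address computation.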
* Added LINUX switch, default LINUX=0