Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RISCV] Remove SiFive7PipeV and replace it with SiFive7VCQ #73969

Merged
merged 2 commits into from
Dec 4, 2023

Conversation

michaelmaitland
Copy link
Contributor

The Arithmetic, Load, and Store sequencers can accept instructions in parallel. The PipeV blocked that from happening since it became busy if any of the sequencers were busy. This change allows the sequencers to accept instructions in parallel.

The VCQ accepts instructions from the the A Pipe and holds them until the vector unit is ready to dequeue them. The unit dequeues up to one instruction per cycle, in order, as soon as the sequencer for that type of instruction is avaliable. This resource is meant to be used for 1 cycle by all vector instructions, to model that only one vector instruction may be dequed at a time. The actual dequeueing into the sequencer is modeled by the VA, VL, and VS sequencer resources below. Each of them will only accept a single instruction at a time and remain busy for the number of cycles associated with that instruction.

@llvmbot
Copy link
Collaborator

llvmbot commented Nov 30, 2023

@llvm/pr-subscribers-backend-risc-v

Author: Michael Maitland (michaelmaitland)

Changes

The Arithmetic, Load, and Store sequencers can accept instructions in parallel. The PipeV blocked that from happening since it became busy if any of the sequencers were busy. This change allows the sequencers to accept instructions in parallel.

The VCQ accepts instructions from the the A Pipe and holds them until the vector unit is ready to dequeue them. The unit dequeues up to one instruction per cycle, in order, as soon as the sequencer for that type of instruction is avaliable. This resource is meant to be used for 1 cycle by all vector instructions, to model that only one vector instruction may be dequed at a time. The actual dequeueing into the sequencer is modeled by the VA, VL, and VS sequencer resources below. Each of them will only accept a single instruction at a time and remain busy for the number of cycles associated with that instruction.


Patch is 465.37 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/73969.diff

28 Files Affected:

  • (modified) llvm/lib/Target/RISCV/RISCVSchedSiFive7.td (+242-233)
  • (modified) llvm/lib/Target/RISCV/RISCVScheduleV.td (+10-6)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFive7/gpr-bypass-c.s (+2-2)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFive7/gpr-bypass.s (+2-2)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFive7/reductions.s (+211-211)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFive7/strided-load-x0.s (+53-53)
  • (modified) llvm/test/tools/llvm-mca/RISCV/SiFive7/vector-integer-arithmetic.s (+749-749)
  • (modified) llvm/test/tools/llvm-mca/RISCV/different-lmul-instruments.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/different-sew-instruments.s (+9-9)
  • (modified) llvm/test/tools/llvm-mca/RISCV/disable-im.s (+20-20)
  • (modified) llvm/test/tools/llvm-mca/RISCV/fractional-lmul-data.s (+9-9)
  • (modified) llvm/test/tools/llvm-mca/RISCV/lmul-instrument-at-start.s (+6-6)
  • (modified) llvm/test/tools/llvm-mca/RISCV/lmul-instrument-in-middle.s (+13-13)
  • (modified) llvm/test/tools/llvm-mca/RISCV/lmul-instrument-in-region.s (+6-6)
  • (modified) llvm/test/tools/llvm-mca/RISCV/lmul-instrument-straddles-region.s (+6-6)
  • (modified) llvm/test/tools/llvm-mca/RISCV/multiple-same-lmul-instruments.s (+26-26)
  • (modified) llvm/test/tools/llvm-mca/RISCV/multiple-same-sew-instruments.s (+15-15)
  • (modified) llvm/test/tools/llvm-mca/RISCV/needs-sew-but-only-lmul.s (+9-9)
  • (modified) llvm/test/tools/llvm-mca/RISCV/no-vsetvli-to-start.s (+13-13)
  • (modified) llvm/test/tools/llvm-mca/RISCV/sew-instrument-at-start.s (+6-6)
  • (modified) llvm/test/tools/llvm-mca/RISCV/sew-instrument-in-middle.s (+9-9)
  • (modified) llvm/test/tools/llvm-mca/RISCV/sew-instrument-in-region.s (+6-6)
  • (modified) llvm/test/tools/llvm-mca/RISCV/sew-instrument-straddles-region.s (+6-6)
  • (modified) llvm/test/tools/llvm-mca/RISCV/vle-vse.s (+407-407)
  • (modified) llvm/test/tools/llvm-mca/RISCV/vsetivli-lmul-instrument.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/vsetivli-lmul-sew-instrument.s (+9-9)
  • (modified) llvm/test/tools/llvm-mca/RISCV/vsetvli-lmul-instrument.s (+8-8)
  • (modified) llvm/test/tools/llvm-mca/RISCV/vsetvli-lmul-sew-instrument.s (+9-9)
diff --git a/llvm/lib/Target/RISCV/RISCVSchedSiFive7.td b/llvm/lib/Target/RISCV/RISCVSchedSiFive7.td
index 53ef9d1baf7b59a..6962546de7e0c48 100644
--- a/llvm/lib/Target/RISCV/RISCVSchedSiFive7.td
+++ b/llvm/lib/Target/RISCV/RISCVSchedSiFive7.td
@@ -208,20 +208,29 @@ def SiFive7Model : SchedMachineModel {
 // Pipe A can handle memory, integer alu and vector operations.
 // Pipe B can handle integer alu, control flow, integer multiply and divide,
 // and floating point computation.
-// Pipe V can handle the V extension.
+// The V pipeline is modeled by the VCQ, VA, VL, and VS resources.
 let SchedModel = SiFive7Model in {
 let BufferSize = 0 in {
 def SiFive7PipeA       : ProcResource<1>;
 def SiFive7PipeB       : ProcResource<1>;
-def SiFive7PipeV       : ProcResource<1>;
+def SiFive7VA          : ProcResource<1>; // Arithmetic sequencer
+def SiFive7VL          : ProcResource<1>; // Load sequencer
+def SiFive7VS          : ProcResource<1>; // Store sequencer
+// The VCQ accepts instructions from the the A Pipe and holds them until the
+// vector unit is ready to dequeue them. The unit dequeues up to one instruction
+// per cycle, in order, as soon as the sequencer for that type of instruction is
+// avaliable. This resource is meant to be used for 1 cycle by all vector
+// instructions, to model that only one vector instruction may be dequed at a
+// time. The actual dequeueing into the sequencer is modeled by the VA, VL, and
+// VS sequencer resources below. Each of them will only accept a single
+// instruction at a time and remain busy for the number of cycles associated
+// with that instruction.
+def SiFive7VCQ         : ProcResource<1>; // Vector Command Queue
 }
 
 let BufferSize = 1 in {
 def SiFive7IDiv        : ProcResource<1> { let Super = SiFive7PipeB; } // Int Division
 def SiFive7FDiv        : ProcResource<1> { let Super = SiFive7PipeB; } // FP Division/Sqrt
-def SiFive7VA          : ProcResource<1> { let Super = SiFive7PipeV; } // Arithmetic sequencer
-def SiFive7VL          : ProcResource<1> { let Super = SiFive7PipeV; } // Load sequencer
-def SiFive7VS          : ProcResource<1> { let Super = SiFive7PipeV; } // Store sequencer
 }
 
 def SiFive7PipeAB : ProcResGroup<[SiFive7PipeA, SiFive7PipeB]>;
@@ -433,21 +442,21 @@ def : WriteRes<WriteVSETVL, [SiFive7PipeA]>;
 foreach mx = SchedMxList in {
   defvar Cycles = SiFive7GetCyclesDefault<mx>.c;
   defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
-  let Latency = 4, ReleaseAtCycles = [Cycles] in {
-    defm "" : LMULWriteResMX<"WriteVLDE",    [SiFive7VL], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVLDFF",   [SiFive7VL], mx, IsWorstCase>;
+  let Latency = 4, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+    defm "" : LMULWriteResMX<"WriteVLDE",    [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVLDFF",   [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
   }
-  let Latency = 1, ReleaseAtCycles = [Cycles] in
-  defm "" : LMULWriteResMX<"WriteVSTE",    [SiFive7VS], mx, IsWorstCase>;
+  let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in
+  defm "" : LMULWriteResMX<"WriteVSTE",    [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
 }
 
 foreach mx = SchedMxList in {
   defvar Cycles = SiFive7GetMaskLoadStoreCycles<mx>.c;
   defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
-  let Latency = 4, ReleaseAtCycles = [Cycles] in
-  defm "" : LMULWriteResMX<"WriteVLDM",    [SiFive7VL], mx, IsWorstCase>;
-  let Latency = 1, ReleaseAtCycles = [Cycles] in
-  defm "" : LMULWriteResMX<"WriteVSTM",    [SiFive7VS], mx, IsWorstCase>;
+  let Latency = 4, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in
+  defm "" : LMULWriteResMX<"WriteVLDM",    [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+  let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in
+  defm "" : LMULWriteResMX<"WriteVSTM",    [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
 }
 
 // Strided loads and stores operate at one element per cycle and should be
@@ -466,17 +475,17 @@ foreach mx = SchedMxList in {
   defvar VLDSX0Cycles = SiFive7GetCyclesDefault<mx>.c;
   defvar Cycles = SiFive7GetCyclesOnePerElement<mx, 8>.c;
   defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
-  defm SiFive7 : LMULWriteResMXVariant<"WriteVLDS8",  VLDSX0Pred, [SiFive7VL],
-                                       4, [VLDSX0Cycles], !add(3, Cycles),
-                                       [Cycles], mx, IsWorstCase>;
-  let Latency = !add(3, Cycles), ReleaseAtCycles = [Cycles] in {
-    defm "" : LMULWriteResMX<"WriteVLDUX8", [SiFive7VL], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVLDOX8", [SiFive7VL], mx, IsWorstCase>;
+  defm SiFive7 : LMULWriteResMXVariant<"WriteVLDS8",  VLDSX0Pred, [SiFive7VCQ, SiFive7VL],
+                                       4, [0, 1], [1, !add(1, VLDSX0Cycles)], !add(3, Cycles),
+                                       [0, 1], [1, !add(1, Cycles)], mx, IsWorstCase>;
+  let Latency = !add(3, Cycles), AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+    defm "" : LMULWriteResMX<"WriteVLDUX8", [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVLDOX8", [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
   }
-  let Latency = 1, ReleaseAtCycles = [Cycles] in {
-    defm "" : LMULWriteResMX<"WriteVSTS8",  [SiFive7VS], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVSTUX8", [SiFive7VS], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVSTOX8", [SiFive7VS], mx, IsWorstCase>;
+  let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+    defm "" : LMULWriteResMX<"WriteVSTS8",  [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVSTUX8", [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVSTOX8", [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
   }
 }
 // TODO: The MxLists need to be filtered by EEW. We only need to support
@@ -486,72 +495,72 @@ foreach mx = ["MF4", "MF2", "M1", "M2", "M4", "M8"] in {
   defvar VLDSX0Cycles = SiFive7GetCyclesDefault<mx>.c;
   defvar Cycles = SiFive7GetCyclesOnePerElement<mx, 16>.c;
   defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
-  defm SiFive7 : LMULWriteResMXVariant<"WriteVLDS16",  VLDSX0Pred, [SiFive7VL],
-                                       4, [VLDSX0Cycles], !add(3, Cycles),
-                                       [Cycles], mx, IsWorstCase>;
-  let Latency = !add(3, Cycles), ReleaseAtCycles = [Cycles] in {
-    defm "" : LMULWriteResMX<"WriteVLDUX16", [SiFive7VL], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVLDOX16", [SiFive7VL], mx, IsWorstCase>;
+  defm SiFive7 : LMULWriteResMXVariant<"WriteVLDS16",  VLDSX0Pred, [SiFive7VCQ, SiFive7VL],
+                                       4, [0, 1], [1, !add(1, VLDSX0Cycles)], !add(3, Cycles),
+                                       [0, 1], [1, !add(1, Cycles)], mx, IsWorstCase>;
+  let Latency = !add(3, Cycles), AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+    defm "" : LMULWriteResMX<"WriteVLDUX16", [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVLDOX16", [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
   }
-  let Latency = 1, ReleaseAtCycles = [Cycles] in {
-    defm "" : LMULWriteResMX<"WriteVSTS16",  [SiFive7VS], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVSTUX16", [SiFive7VS], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVSTOX16", [SiFive7VS], mx, IsWorstCase>;
+  let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+    defm "" : LMULWriteResMX<"WriteVSTS16",  [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVSTUX16", [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVSTOX16", [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
   }
 }
 foreach mx = ["MF2", "M1", "M2", "M4", "M8"] in {
   defvar VLDSX0Cycles = SiFive7GetCyclesDefault<mx>.c;
   defvar Cycles = SiFive7GetCyclesOnePerElement<mx, 32>.c;
   defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
-  defm SiFive7 : LMULWriteResMXVariant<"WriteVLDS32",  VLDSX0Pred, [SiFive7VL],
-                                       4, [VLDSX0Cycles], !add(3, Cycles),
-                                       [Cycles], mx, IsWorstCase>;
-  let Latency = !add(3, Cycles), ReleaseAtCycles = [Cycles] in {
-    defm "" : LMULWriteResMX<"WriteVLDUX32", [SiFive7VL], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVLDOX32", [SiFive7VL], mx, IsWorstCase>;
+  defm SiFive7 : LMULWriteResMXVariant<"WriteVLDS32",  VLDSX0Pred, [SiFive7VCQ, SiFive7VL],
+                                       4, [0, 1], [1, !add(1, VLDSX0Cycles)], !add(3, Cycles),
+                                       [0, 1], [1, !add(1, Cycles)], mx, IsWorstCase>;
+  let Latency = !add(3, Cycles), AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+    defm "" : LMULWriteResMX<"WriteVLDUX32", [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVLDOX32", [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
   }
-  let Latency = 1, ReleaseAtCycles = [Cycles] in {
-    defm "" : LMULWriteResMX<"WriteVSTS32",  [SiFive7VS], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVSTUX32", [SiFive7VS], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVSTOX32", [SiFive7VS], mx, IsWorstCase>;
+  let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+    defm "" : LMULWriteResMX<"WriteVSTS32",  [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVSTUX32", [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVSTOX32", [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
   }
 }
 foreach mx = ["M1", "M2", "M4", "M8"] in {
   defvar VLDSX0Cycles = SiFive7GetCyclesDefault<mx>.c;
   defvar Cycles = SiFive7GetCyclesOnePerElement<mx, 64>.c;
   defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
-  defm SiFive7 : LMULWriteResMXVariant<"WriteVLDS64",  VLDSX0Pred, [SiFive7VL],
-                                       4, [VLDSX0Cycles], !add(3, Cycles),
-                                       [Cycles], mx, IsWorstCase>;
-  let Latency = !add(3, Cycles), ReleaseAtCycles = [Cycles] in {
-    defm "" : LMULWriteResMX<"WriteVLDUX64", [SiFive7VL], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVLDOX64", [SiFive7VL], mx, IsWorstCase>;
+  defm SiFive7 : LMULWriteResMXVariant<"WriteVLDS64",  VLDSX0Pred, [SiFive7VCQ, SiFive7VL],
+                                       4, [0, 1], [1, !add(1, VLDSX0Cycles)], !add(3, Cycles),
+                                       [0, 1], [1, !add(1, Cycles)], mx, IsWorstCase>;
+  let Latency = !add(3, Cycles), AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+    defm "" : LMULWriteResMX<"WriteVLDUX64", [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVLDOX64", [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
   }
-  let Latency = 1, ReleaseAtCycles = [Cycles] in {
-    defm "" : LMULWriteResMX<"WriteVSTS64",  [SiFive7VS], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVSTUX64", [SiFive7VS], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVSTOX64", [SiFive7VS], mx, IsWorstCase>;
+  let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+    defm "" : LMULWriteResMX<"WriteVSTS64",  [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVSTUX64", [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVSTOX64", [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
   }
 }
 
 // VLD*R is LMUL aware
-let Latency = 4, ReleaseAtCycles = [2] in
-  def : WriteRes<WriteVLD1R,  [SiFive7VL]>;
-let Latency = 4, ReleaseAtCycles = [4] in
-  def : WriteRes<WriteVLD2R,  [SiFive7VL]>;
-let Latency = 4, ReleaseAtCycles = [8] in
-  def : WriteRes<WriteVLD4R,  [SiFive7VL]>;
-let Latency = 4, ReleaseAtCycles = [16] in
-  def : WriteRes<WriteVLD8R,  [SiFive7VL]>;
+let Latency = 4, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, 2)] in
+  def : WriteRes<WriteVLD1R,  [SiFive7VCQ, SiFive7VL]>;
+let Latency = 4, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, 4)] in
+  def : WriteRes<WriteVLD2R,  [SiFive7VCQ, SiFive7VL]>;
+let Latency = 4, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, 8)] in
+  def : WriteRes<WriteVLD4R,  [SiFive7VCQ, SiFive7VL]>;
+let Latency = 4, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, 16)] in
+  def : WriteRes<WriteVLD8R,  [SiFive7VCQ, SiFive7VL]>;
 // VST*R is LMUL aware
-let Latency = 1, ReleaseAtCycles = [2] in
-  def : WriteRes<WriteVST1R,   [SiFive7VS]>;
-let Latency = 1, ReleaseAtCycles = [4] in
-  def : WriteRes<WriteVST2R,   [SiFive7VS]>;
-let Latency = 1, ReleaseAtCycles = [8] in
-  def : WriteRes<WriteVST4R,   [SiFive7VS]>;
-let Latency = 1, ReleaseAtCycles = [16] in
-  def : WriteRes<WriteVST8R,   [SiFive7VS]>;
+let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, 2)] in
+  def : WriteRes<WriteVST1R,   [SiFive7VCQ, SiFive7VS]>;
+let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, 4)] in
+  def : WriteRes<WriteVST2R,   [SiFive7VCQ, SiFive7VS]>;
+let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, 8)] in
+  def : WriteRes<WriteVST4R,   [SiFive7VCQ, SiFive7VS]>;
+let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, 16)] in
+  def : WriteRes<WriteVST8R,   [SiFive7VCQ, SiFive7VS]>;
 
 // Segmented Loads and Stores
 // Unit-stride segmented loads and stores are effectively converted into strided
@@ -564,22 +573,22 @@ foreach mx = SchedMxList in {
     defvar Cycles = SiFive7GetCyclesSegmentedSeg2<mx>.c;
     defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
     // Does not chain so set latency high
-    let Latency = !add(3, Cycles), ReleaseAtCycles = [Cycles] in {
-      defm "" : LMULWriteResMX<"WriteVLSEG2e" # eew,   [SiFive7VL], mx, IsWorstCase>;
-      defm "" : LMULWriteResMX<"WriteVLSEGFF2e" # eew, [SiFive7VL], mx, IsWorstCase>;
+    let Latency = !add(3, Cycles), AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+      defm "" : LMULWriteResMX<"WriteVLSEG2e" # eew,   [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+      defm "" : LMULWriteResMX<"WriteVLSEGFF2e" # eew, [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
     }
-    let Latency = 1, ReleaseAtCycles = [Cycles] in
-    defm "" : LMULWriteResMX<"WriteVSSEG2e" # eew,   [SiFive7VS], mx, IsWorstCase>;
+    let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in
+    defm "" : LMULWriteResMX<"WriteVSSEG2e" # eew,   [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
     foreach nf=3-8 in {
       defvar Cycles = SiFive7GetCyclesSegmented<mx, eew, nf>.c;
       defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
       // Does not chain so set latency high
-      let Latency = !add(3, Cycles), ReleaseAtCycles = [Cycles] in {
-        defm "" : LMULWriteResMX<"WriteVLSEG" # nf # "e" # eew,   [SiFive7VL], mx, IsWorstCase>;
-        defm "" : LMULWriteResMX<"WriteVLSEGFF" # nf # "e" # eew, [SiFive7VL], mx, IsWorstCase>;
+      let Latency = !add(3, Cycles), AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+        defm "" : LMULWriteResMX<"WriteVLSEG" # nf # "e" # eew,   [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+        defm "" : LMULWriteResMX<"WriteVLSEGFF" # nf # "e" # eew, [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
       }
-      let Latency = 1, ReleaseAtCycles = [Cycles] in
-      defm "" : LMULWriteResMX<"WriteVSSEG" # nf # "e" # eew,   [SiFive7VS], mx, IsWorstCase>;
+      let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in
+      defm "" : LMULWriteResMX<"WriteVSSEG" # nf # "e" # eew,   [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
     }
   }
 }
@@ -589,15 +598,15 @@ foreach mx = SchedMxList in {
       defvar Cycles = SiFive7GetCyclesSegmented<mx, eew, nf>.c;
       defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
       // Does not chain so set latency high
-      let Latency = !add(3, Cycles), ReleaseAtCycles = [Cycles] in {
-        defm "" : LMULWriteResMX<"WriteVLSSEG" # nf # "e" # eew,  [SiFive7VL], mx, IsWorstCase>;
-        defm "" : LMULWriteResMX<"WriteVLUXSEG" # nf # "e" # eew, [SiFive7VL], mx, IsWorstCase>;
-        defm "" : LMULWriteResMX<"WriteVLOXSEG" # nf # "e" # eew, [SiFive7VL], mx, IsWorstCase>;
+      let Latency = !add(3, Cycles), AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+        defm "" : LMULWriteResMX<"WriteVLSSEG" # nf # "e" # eew,  [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+        defm "" : LMULWriteResMX<"WriteVLUXSEG" # nf # "e" # eew, [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
+        defm "" : LMULWriteResMX<"WriteVLOXSEG" # nf # "e" # eew, [SiFive7VCQ, SiFive7VL], mx, IsWorstCase>;
       }
-      let Latency = 1, ReleaseAtCycles = [Cycles] in {
-        defm "" : LMULWriteResMX<"WriteVSSSEG" # nf # "e" # eew,  [SiFive7VS], mx, IsWorstCase>;
-        defm "" : LMULWriteResMX<"WriteVSUXSEG" # nf # "e" # eew, [SiFive7VS], mx, IsWorstCase>;
-        defm "" : LMULWriteResMX<"WriteVSOXSEG" # nf # "e" # eew, [SiFive7VS], mx, IsWorstCase>;
+      let Latency = 1, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+        defm "" : LMULWriteResMX<"WriteVSSSEG" # nf # "e" # eew,  [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
+        defm "" : LMULWriteResMX<"WriteVSUXSEG" # nf # "e" # eew, [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
+        defm "" : LMULWriteResMX<"WriteVSOXSEG" # nf # "e" # eew, [SiFive7VCQ, SiFive7VS], mx, IsWorstCase>;
       }
     }
   }
@@ -607,41 +616,41 @@ foreach mx = SchedMxList in {
 foreach mx = SchedMxList in {
   defvar Cycles = SiFive7GetCyclesDefault<mx>.c;
   defvar IsWorstCase = SiFive7IsWorstCaseMX<mx, SchedMxList>.c;
-  let Latency = 4, ReleaseAtCycles = [Cycles] in {
-    defm "" : LMULWriteResMX<"WriteVIALUV",     [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIALUX",     [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIALUI",     [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVICALUV",    [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVICALUX",    [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVICALUI",    [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVShiftV",    [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVShiftX",    [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVShiftI",    [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMinMaxV",  [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMinMaxX",  [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMulV",     [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMulX",     [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMulAddV",  [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMulAddX",  [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMergeV",   [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMergeX",   [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMergeI",   [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMovV",     [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMovX",     [SiFive7VA], mx, IsWorstCase>;
-    defm "" : LMULWriteResMX<"WriteVIMovI",     [SiFive7VA], mx, IsWorstCase>;
+  let Latency = 4, AcquireAtCycles = [0, 1], ReleaseAtCycles = [1, !add(1, Cycles)] in {
+    defm "" : LMULWriteResMX<"WriteVIALUV",     [SiFive7VCQ, SiFive7VA], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVIALUX",     [SiFive7VCQ, SiFive7VA], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVIALUI",     [SiFive7VCQ, SiFive7VA], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVICALUV",    [SiFive7VCQ, SiFive7VA], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVICALUX",    [SiFive7VCQ, SiFive7VA], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVICALUI",    [SiFive7VCQ, SiFive7VA], mx, IsWorstCase>;
+    defm "" : LMULWriteResMX<"WriteVShiftV",    [SiFive7VCQ, SiFive7VA], mx, IsWorstCase>;
+    ...
[truncated]

@wangpc-pp
Copy link
Contributor

wangpc-pp commented Dec 1, 2023

I don't know the details about the implementation but the changes LGTM.
Have you tried to evaluate it on FPGA/chip and to see the improvements? I think these model changes will make the vector pipeline more efficient.

@michaelmaitland
Copy link
Contributor Author

Have you tried to evaluate it on FPGA/chip and to see the improvements?

We have evaluated on spec2006 and saw improvements. This change is important to take advantage of the parallelism in the vector pipeline.

Copy link
Contributor

@wangpc-pp wangpc-pp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

The Arithmetic, Load, and Store sequencers can accept instructions in
parallel. The PipeV blocked that from happening since it became busy if
any of the sequencers were busy. This change allows the sequencers to
accept instructions in parallel.

The VCQ accepts instructions from the the A Pipe and holds them until the
vector unit is ready to dequeue them. The unit dequeues up to one instruction
per cycle, in order, as soon as the sequencer for that type of instruction is
avaliable. This resource is meant to be used for 1 cycle by all vector
instructions, to model that only one vector instruction may be dequed at a
time. The actual dequeueing into the sequencer is modeled by the VA, VL, and
VS sequencer resources below. Each of them will only accept a single
instruction at a time and remain busy for the number of cycles associated
with that instruction.
@michaelmaitland michaelmaitland merged commit d9570ba into llvm:main Dec 4, 2023
2 of 3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants