[tblgen][llvm-mca] Add the ability to describe move elimination candidates via tablegen.

This patch adds the ability to identify instructions that are "move elimination candidates". It also allows scheduling models to describe processor register files that allow move elimination.

A move elimination candidate is an instruction that can be eliminated at register renaming stage. Each subtarget can specify which instructions are move elimination candidates with the help of tablegen class "IsOptimizableRegisterMove" (see llvm/Target/TargetInstrPredicate.td).

For example, on X86, BtVer2 allows both GPR and MMX/SSE moves to be eliminated. The definition of 'IsOptimizableRegisterMove' for BtVer2 looks like this:

```
def : IsOptimizableRegisterMove<[
  InstructionEquivalenceClass<[
    // GPR variants.
    MOV32rr, MOV64rr,

    // MMX variants.
    MMX_MOVQ64rr,

    // SSE variants.
    MOVAPSrr, MOVUPSrr,
    MOVAPDrr, MOVUPDrr,
    MOVDQArr, MOVDQUrr,

    // AVX variants.
    VMOVAPSrr, VMOVUPSrr,
    VMOVAPDrr, VMOVUPDrr,
    VMOVDQArr, VMOVDQUrr
  ], CheckNot<CheckSameRegOperand<0, 1>> >
]>;
```

Definitions of IsOptimizableRegisterMove from processor models of the same Target are processed by the SubtargetEmitter to auto-generate a target-specific override for each of the following predicate methods:

```
bool TargetSubtargetInfo::isOptimizableRegisterMove(const MachineInstr *MI) const;
bool MCInstrAnalysis::isOptimizableRegisterMove(const MCInst &MI, unsigned CPUID) const;
```

By default, those methods return false (i.e. conservatively assume that there are no move elimination candidates).

Tablegen class RegisterFile has been extended with the following information:
 - The set of register classes that allow move elimination.
 - Maximum number of moves that can be eliminated every cycle.
 - Whether move elimination is restricted to moves from registers that are known to be zero.

This patch is structured in three parts:

A first part (which is mostly boilerplate) adds the new 'isOptimizableRegisterMove' target hooks, and extends existing register file descriptors in MC by introducing new fields to describe properties related to move elimination.

A second part uses the new tablegen constructs to describe move elimination in the BtVer2 scheduling model.

A third part teaches llvm-mca how to query the new 'isOptimizableRegisterMove' hook to mark instructions that are candidates for move elimination. It also teaches class RegisterFile how to describe constraints on move elimination at PRF granularity.

llvm-mca tests for btver2 show differences before/after this patch.

Differential Revision: https://reviews.llvm.org/D53134

llvm-svn: 344334
llvm · Oct 12, 2018 · 6eebbe0 · 6eebbe0
1 parent e02d09d
commit 6eebbe0
Showing 16 changed files with 315 additions and 195 deletions.
diff --git a/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h b/llvm/include/llvm/CodeGen/TargetSubtargetInfo.h
@@ -169,6 +169,19 @@ class TargetSubtargetInfo : public MCSubtargetInfo {
     return isZeroIdiom(MI, Mask);
   }
 
+  /// Returns true if MI is a candidate for move elimination.
+  ///
+  /// A candidate for move elimination may be optimized out at register renaming
+  /// stage. Subtargets can specify the set of optimizable moves by
+  /// instantiating tablegen class `IsOptimizableRegisterMove` (see
+  /// llvm/Target/TargetInstrPredicate.td).
+  ///
+  /// SubtargetEmitter is responsible for processing all the definitions of class
+  /// IsOptimizableRegisterMove, and auto-generate an override for this method.
+  virtual bool isOptimizableRegisterMove(const MachineInstr *MI) const {
+    return false;
+  }
+
   /// True if the subtarget should run MachineScheduler after aggressive
   /// coalescing.
   ///

diff --git a/llvm/include/llvm/MC/MCInstrAnalysis.h b/llvm/include/llvm/MC/MCInstrAnalysis.h
@@ -136,6 +136,17 @@ class MCInstrAnalysis {
     return isZeroIdiom(MI, Mask, CPUID);
   }
 
+  /// Returns true if MI is a candidate for move elimination.
+  ///
+  /// Different subtargets may apply different constraints to optimizable
+  /// register moves. For example, on most X86 subtargets, a candidate for move
+  /// elimination cannot specify the same register for both source and
+  /// destination.
+  virtual bool isOptimizableRegisterMove(const MCInst &MI,
+                                         unsigned CPUID) const {
+    return false;
+  }
+
   /// Given a branch instruction try to get the address the branch
   /// targets. Return true on success, and the address in Target.
   virtual bool
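
The constraint mentioned in the comment above (a candidate may not name the same register as both source and destination) can be sketched in plain C++. This is only an illustration under stated assumptions: `ToyInst` and the small `Opcode` enum are hypothetical stand-ins, not `llvm::MCInst` or the real X86 opcode table, and the function is a hand-written analogue of what a generated override might check, not the code that SubtargetEmitter emits.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration only; these are NOT LLVM types.
enum Opcode : unsigned { MOV32rr, MOV64rr, MOVAPSrr, ADD32rr };

struct ToyInst {
  Opcode Opc;
  std::vector<unsigned> Regs; // operand 0 = destination, operand 1 = source
};

// Hand-written analogue of a move elimination predicate: the opcode must be
// in the set of register moves, and operands 0 and 1 must be different
// registers (the CheckNot<CheckSameRegOperand<0, 1>> idea).
bool isOptimizableRegisterMove(const ToyInst &MI) {
  static constexpr std::array<Opcode, 3> MoveOpcodes = {MOV32rr, MOV64rr,
                                                        MOVAPSrr};
  if (std::find(MoveOpcodes.begin(), MoveOpcodes.end(), MI.Opc) ==
      MoveOpcodes.end())
    return false; // Not a register move at all.

  assert(MI.Regs.size() >= 2 && "a register move needs two operands");
  return MI.Regs[0] != MI.Regs[1];
}
```

With these toy definitions, a `MOV32rr` between two distinct registers is reported as a candidate, while a `MOV32rr` from a register to itself and an `ADD32rr` are not.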

diff --git a/llvm/include/llvm/MC/MCSchedule.h b/llvm/include/llvm/MC/MCSchedule.h
@@ -142,6 +142,7 @@ struct MCSchedClassDesc {
 struct MCRegisterCostEntry {
   unsigned RegisterClassID;
   unsigned Cost;
+  bool AllowMoveElimination;
 };
 
 /// A register file descriptor.
@@ -159,6 +160,12 @@ struct MCRegisterFileDesc {
   uint16_t NumRegisterCostEntries;
   // Index of the first cost entry in MCExtraProcessorInfo::RegisterCostTable.
   uint16_t RegisterCostEntryIdx;
+  // A value of zero means: there is no limit in the number of moves that can be
+  // eliminated every cycle.
+  uint16_t MaxMovesEliminatedPerCycle;
+  // True if this register file only knows how to optimize register moves from
+  // known zero registers.
+  bool AllowZeroMoveEliminationOnly;
 };
 
 /// Provide extra details about the machine processor.

diff --git a/llvm/include/llvm/Target/TargetInstrPredicate.td b/llvm/include/llvm/Target/TargetInstrPredicate.td
@@ -313,7 +313,7 @@ class STIPredicate<STIPredicateDecl declaration,
 }
 
 // Convenience classes and definitions used by processor scheduling models to
-// describe dependency breaking instructions.
+// describe dependency breaking instructions and move elimination candidates.
 let UpdatesOpcodeMask = 1 in {
 
 def IsZeroIdiomDecl : STIPredicateDecl<"isZeroIdiom">;
@@ -323,8 +323,14 @@ def IsDepBreakingDecl : STIPredicateDecl<"isDependencyBreaking">;
 
 } // UpdatesOpcodeMask
 
+def IsOptimizableRegisterMoveDecl
+    : STIPredicateDecl<"isOptimizableRegisterMove">;
+
 class IsZeroIdiomFunction<list<DepBreakingClass> classes>
     : STIPredicate<IsZeroIdiomDecl, classes>;
 
 class IsDepBreakingFunction<list<DepBreakingClass> classes>
     : STIPredicate<IsDepBreakingDecl, classes>;
+
+class IsOptimizableRegisterMove<list<InstructionEquivalenceClass> classes>
+    : STIPredicate<IsOptimizableRegisterMoveDecl, classes>;
diff --git a/llvm/include/llvm/Target/TargetSchedule.td b/llvm/include/llvm/Target/TargetSchedule.td
@@ -460,6 +460,10 @@ class SchedAlias<SchedReadWrite match, SchedReadWrite alias> {
 //  - The number of physical registers which can be used for register renaming
 //    purpose.
 //  - The cost of a register rename.
+//  - The set of registers that allow move elimination.
+//  - The maximum number of moves that can be eliminated every cycle.
+//  - Whether move elimination is limited to register moves whose input
+//    is known to be zero.
 //
 // The cost of a rename is the number of physical registers allocated by the
 // register alias table to map the new definition. By default, register can be
@@ -506,11 +510,35 @@ class SchedAlias<SchedReadWrite match, SchedReadWrite alias> {
 // partial write is combined with the previous super-register definition.  We
 // should add support for these cases, and correctly model merge problems with
 // partial register accesses.
+//
+// Field MaxMovesEliminatedPerCycle specifies how many moves can be eliminated
+// every cycle. A default value of zero for that field means: there is no limit
+// to the number of moves that can be eliminated by this register file.
+//
+// An instruction MI is a candidate for move elimination if a call to
+// method TargetSubtargetInfo::isOptimizableRegisterMove(MI) returns true (see
+// llvm/CodeGen/TargetSubtargetInfo.h, and llvm/MC/MCInstrAnalysis.h).
+//
+// Subtargets can instantiate tablegen class IsOptimizableRegisterMove (see
+// llvm/Target/TargetInstrPredicate.td) to customize the set of move elimination
+// candidates. By default, no instruction is a valid move elimination candidate.
+//
+// A register move MI is eliminated only if:
+//  - MI is a move elimination candidate.
+//  - The destination register is from a register class that allows move
+//    elimination (see field `AllowMoveElimination` below).
+//  - Constraints on the move kind, and the maximum number of moves that can be
+//    eliminated per cycle are all met.
+
 class RegisterFile<int numPhysRegs, list<RegisterClass> Classes = [],
-                   list<int> Costs = []> {
+                   list<int> Costs = [], list<bit> AllowMoveElim = [],
+                   int MaxMoveElimPerCy = 0, bit AllowZeroMoveElimOnly = 0> {
   list<RegisterClass> RegClasses = Classes;
   list<int> RegCosts = Costs;
+  list<bit> AllowMoveElimination = AllowMoveElim;
   int NumPhysRegs = numPhysRegs;
+  int MaxMovesEliminatedPerCycle = MaxMoveElimPerCy;
+  bit AllowZeroMoveEliminationOnly = AllowZeroMoveElimOnly;
   SchedMachineModel SchedModel = ?;
 }
 

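The three conditions listed in the TargetSchedule.td comment (candidate instruction, destination class that allows elimination, and the move-kind and per-cycle constraints) can be read as a single check. The following is a minimal sketch, assuming simplified types: `ToyRegisterFile`, `ToyMove`, `EliminatedThisCycle` and `tryEliminateMove` are hypothetical names introduced for illustration, and this is not how llvm-mca's RegisterFile class is actually implemented. Only the field names `AllowMoveElimination`, `MaxMovesEliminatedPerCycle` and `AllowZeroMoveEliminationOnly` come from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified register file state (not the llvm-mca class).
struct ToyRegisterFile {
  std::vector<bool> AllowMoveElimination; // indexed by a toy register class ID
  uint16_t MaxMovesEliminatedPerCycle;    // zero means "no limit"
  bool AllowZeroMoveEliminationOnly;      // only moves from known-zero regs
  unsigned EliminatedThisCycle;           // hypothetical per-cycle counter
};

// Hypothetical summary of a register move as seen at register renaming.
struct ToyMove {
  bool IsCandidate;     // what isOptimizableRegisterMove would have returned
  unsigned DestClassID; // toy register class of the destination register
  bool SourceKnownZero; // source was zeroed, e.g. by a zero idiom
};

// Returns true, and consumes one slot of the per-cycle budget, only if every
// condition from the comment holds.
bool tryEliminateMove(ToyRegisterFile &RF, const ToyMove &MI) {
  if (!MI.IsCandidate)
    return false;
  assert(MI.DestClassID < RF.AllowMoveElimination.size());
  if (!RF.AllowMoveElimination[MI.DestClassID])
    return false;
  if (RF.AllowZeroMoveEliminationOnly && !MI.SourceKnownZero)
    return false;
  if (RF.MaxMovesEliminatedPerCycle != 0 &&
      RF.EliminatedThisCycle >= RF.MaxMovesEliminatedPerCycle)
    return false;
  ++RF.EliminatedThisCycle;
  return true;
}
```

For instance, a toy register file built as `{{true, true, false}, 0, true, 0}` has no per-cycle limit, eliminates only moves from known-zero registers, and never eliminates moves into its third register class. A simulator using this sketch would also need to reset `EliminatedThisCycle` at the start of every cycle.
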
diff --git a/llvm/lib/Target/X86/X86ScheduleBtVer2.td b/llvm/lib/Target/X86/X86ScheduleBtVer2.td
@@ -48,12 +48,22 @@ def JFPU1 : ProcResource<1>; // Vector/FPU Pipe1: VALU1/STC/FPM
 // part of it.
 // Reference: Section 21.10 "AMD Bobcat and Jaguar pipeline: Partial register
 // access" - Agner Fog's "microarchitecture.pdf".
-def JIntegerPRF : RegisterFile<64, [GR64, CCR]>;
+def JIntegerPRF : RegisterFile<64, [GR64, CCR], [1, 1], [1, 0],
+                               0,  // Max moves that can be eliminated per cycle.
+                               1>; // Restrict move elimination to zero regs.
 
 // The Jaguar FP Retire Queue renames SIMD and FP uOps onto a pool of 72 SSE
 // registers. Operations on 256-bit data types are cracked into two COPs.
 // Reference: www.realworldtech.com/jaguar/4/
-def JFpuPRF: RegisterFile<72, [VR64, VR128, VR256], [1, 1, 2]>;
+
+// The PRF in the floating point unit can eliminate a move from a MMX or SSE
+// register that is known to be zero (i.e. it has been zeroed using a zero-idiom
+// dependency breaking instruction, or via VZEROALL).
+// Reference: Section 21.8 "AMD Bobcat and Jaguar pipeline: Dependency-breaking
+// instructions" - Agner Fog's "microarchitecture.pdf"
+def JFpuPRF: RegisterFile<72, [VR64, VR128, VR256], [1, 1, 2], [1, 1, 0],
+                          0,  // Max moves that can be eliminated per cycle.
+                          1>; // Restrict move elimination to zero regs.
 
 // The retire control unit (RCU) can track up to 64 macro-ops in-flight. It can
 // retire up to two macro-ops per cycle.
@@ -805,4 +815,24 @@ def : IsDepBreakingFunction<[
   ], ZeroIdiomPredicate>
 ]>;
 
+def : IsOptimizableRegisterMove<[
+  InstructionEquivalenceClass<[
+    // GPR variants.
+    MOV32rr, MOV64rr,
+
+    // MMX variants.
+    MMX_MOVQ64rr,
+
+    // SSE variants.
+    MOVAPSrr, MOVUPSrr,
+    MOVAPDrr, MOVUPDrr,
+    MOVDQArr, MOVDQUrr,
+
+    // AVX variants.
+    VMOVAPSrr, VMOVUPSrr,
+    VMOVAPDrr, VMOVUPDrr,
+    VMOVDQArr, VMOVDQUrr
+  ], TruePred >
+]>;
+
 } // SchedModel
diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/reg-move-elimination-1.s b/llvm/test/tools/llvm-mca/X86/BtVer2/reg-move-elimination-1.s
@@ -32,13 +32,13 @@ vaddps %xmm1, %xmm1, %xmm2
 # CHECK-NEXT:  1      3     1.00                        vaddps	%xmm1, %xmm1, %xmm2
 
 # CHECK:      Register File statistics:
-# CHECK-NEXT: Total number of mappings created:    6
-# CHECK-NEXT: Max number of mappings used:         5
+# CHECK-NEXT: Total number of mappings created:    3
+# CHECK-NEXT: Max number of mappings used:         3
 
 # CHECK:      *  Register File #1 -- JFpuPRF:
 # CHECK-NEXT:    Number of physical registers:     72
-# CHECK-NEXT:    Total number of mappings created: 6
-# CHECK-NEXT:    Max number of mappings used:      5
+# CHECK-NEXT:    Total number of mappings created: 3
+# CHECK-NEXT:    Max number of mappings used:      3
 
 # CHECK:      *  Register File #2 -- JIntegerPRF:
 # CHECK-NEXT:    Number of physical registers:     64
@@ -63,25 +63,25 @@ vaddps %xmm1, %xmm1, %xmm2
 
 # CHECK:      Resource pressure per iteration:
 # CHECK-NEXT: [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
-# CHECK-NEXT:  -      -      -     1.00   1.00   1.00   1.00    -      -      -      -      -      -      -
+# CHECK-NEXT:  -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -
 
 # CHECK:      Resource pressure by instruction:
 # CHECK-NEXT: [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     vxorps	%xmm0, %xmm0, %xmm0
-# CHECK-NEXT:  -      -      -      -     1.00    -     1.00    -      -      -      -      -      -      -     vmovaps	%xmm0, %xmm1
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     vmovaps	%xmm0, %xmm1
 # CHECK-NEXT:  -      -      -     1.00    -     1.00    -      -      -      -      -      -      -      -     vaddps	%xmm1, %xmm1, %xmm2
 
 # CHECK:      Timeline view:
 # CHECK-NEXT: Index     0123456789
 
 # CHECK:      [0,0]     DR   .   .   vxorps	%xmm0, %xmm0, %xmm0
-# CHECK-NEXT: [0,1]     DeER .   .   vmovaps	%xmm0, %xmm1
+# CHECK-NEXT: [0,1]     DR   .   .   vmovaps	%xmm0, %xmm1
 # CHECK-NEXT: [0,2]     .DeeeER  .   vaddps	%xmm1, %xmm1, %xmm2
 # CHECK-NEXT: [1,0]     .D----R  .   vxorps	%xmm0, %xmm0, %xmm0
-# CHECK-NEXT: [1,1]     . DeE--R .   vmovaps	%xmm0, %xmm1
-# CHECK-NEXT: [1,2]     . D=eeeER.   vaddps	%xmm1, %xmm1, %xmm2
+# CHECK-NEXT: [1,1]     . D----R .   vmovaps	%xmm0, %xmm1
+# CHECK-NEXT: [1,2]     . DeeeER .   vaddps	%xmm1, %xmm1, %xmm2
 # CHECK-NEXT: [2,0]     .  D----R.   vxorps	%xmm0, %xmm0, %xmm0
-# CHECK-NEXT: [2,1]     .  DeE---R   vmovaps	%xmm0, %xmm1
+# CHECK-NEXT: [2,1]     .  D----R.   vmovaps	%xmm0, %xmm1
 # CHECK-NEXT: [2,2]     .   DeeeER   vaddps	%xmm1, %xmm1, %xmm2
 
 # CHECK:      Average Wait times (based on the timeline view):
@@ -92,5 +92,5 @@ vaddps %xmm1, %xmm1, %xmm2
 
 # CHECK:            [0]    [1]    [2]    [3]
 # CHECK-NEXT: 0.     3     0.0    0.0    2.7       vxorps	%xmm0, %xmm0, %xmm0
-# CHECK-NEXT: 1.     3     1.0    1.0    1.7       vmovaps	%xmm0, %xmm1
-# CHECK-NEXT: 2.     3     1.3    0.0    0.0       vaddps	%xmm1, %xmm1, %xmm2
+# CHECK-NEXT: 1.     3     0.0    0.0    2.7       vmovaps	%xmm0, %xmm1
+# CHECK-NEXT: 2.     3     1.0    1.0    0.0       vaddps	%xmm1, %xmm1, %xmm2
diff --git a/llvm/test/tools/llvm-mca/X86/BtVer2/reg-move-elimination-2.s b/llvm/test/tools/llvm-mca/X86/BtVer2/reg-move-elimination-2.s
@@ -14,12 +14,12 @@ movdqu %xmm5, %xmm0
 
 # CHECK:      Iterations:        3
 # CHECK-NEXT: Instructions:      27
-# CHECK-NEXT: Total Cycles:      19
+# CHECK-NEXT: Total Cycles:      15
 # CHECK-NEXT: Total uOps:        27
 
 # CHECK:      Dispatch Width:    2
-# CHECK-NEXT: uOps Per Cycle:    1.42
-# CHECK-NEXT: IPC:               1.42
+# CHECK-NEXT: uOps Per Cycle:    1.80
+# CHECK-NEXT: IPC:               1.80
 # CHECK-NEXT: Block RThroughput: 4.5
 
 # CHECK:      Instruction Info:
@@ -42,13 +42,13 @@ movdqu %xmm5, %xmm0
 # CHECK-NEXT:  1      1     0.50                        movdqu	%xmm5, %xmm0
 
 # CHECK:      Register File statistics:
-# CHECK-NEXT: Total number of mappings created:    21
-# CHECK-NEXT: Max number of mappings used:         8
+# CHECK-NEXT: Total number of mappings created:    0
+# CHECK-NEXT: Max number of mappings used:         0
 
 # CHECK:      *  Register File #1 -- JFpuPRF:
 # CHECK-NEXT:    Number of physical registers:     72
-# CHECK-NEXT:    Total number of mappings created: 21
-# CHECK-NEXT:    Max number of mappings used:      8
+# CHECK-NEXT:    Total number of mappings created: 0
+# CHECK-NEXT:    Max number of mappings used:      0
 
 # CHECK:      *  Register File #2 -- JIntegerPRF:
 # CHECK-NEXT:    Number of physical registers:     64
@@ -73,51 +73,51 @@ movdqu %xmm5, %xmm0
 
 # CHECK:      Resource pressure per iteration:
 # CHECK-NEXT: [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]
-# CHECK-NEXT:  -      -      -     2.00   2.00   3.33   3.67    -      -      -      -     1.33   1.67    -
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -
 
 # CHECK:      Resource pressure by instruction:
 # CHECK-NEXT: [0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    [10]   [11]   [12]   [13]   Instructions:
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     pxor	%mm0, %mm0
-# CHECK-NEXT:  -      -      -      -      -      -     1.00    -      -      -      -      -     1.00    -     movq	%mm0, %mm1
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     movq	%mm0, %mm1
 # CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     xorps	%xmm0, %xmm0
-# CHECK-NEXT:  -      -      -      -     1.00   0.33   0.67    -      -      -      -      -      -      -     movaps	%xmm0, %xmm1
-# CHECK-NEXT:  -      -      -     1.00    -     0.33   0.67    -      -      -      -      -      -      -     movups	%xmm1, %xmm2
-# CHECK-NEXT:  -      -      -      -     1.00   0.67   0.33    -      -      -      -      -      -      -     movapd	%xmm2, %xmm3
-# CHECK-NEXT:  -      -      -     1.00    -     0.33   0.67    -      -      -      -      -      -      -     movupd	%xmm3, %xmm4
-# CHECK-NEXT:  -      -      -      -      -     1.00    -      -      -      -      -     1.00    -      -     movdqa	%xmm4, %xmm5
-# CHECK-NEXT:  -      -      -      -      -     0.67   0.33    -      -      -      -     0.33   0.67    -     movdqu	%xmm5, %xmm0
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     movaps	%xmm0, %xmm1
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     movups	%xmm1, %xmm2
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     movapd	%xmm2, %xmm3
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     movupd	%xmm3, %xmm4
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     movdqa	%xmm4, %xmm5
+# CHECK-NEXT:  -      -      -      -      -      -      -      -      -      -      -      -      -      -     movdqu	%xmm5, %xmm0
 
 # CHECK:      Timeline view:
-# CHECK-NEXT:                     012345678
+# CHECK-NEXT:                     01234
 # CHECK-NEXT: Index     0123456789
 
-# CHECK:      [0,0]     DR   .    .    .  .   pxor	%mm0, %mm0
-# CHECK-NEXT: [0,1]     DeER .    .    .  .   movq	%mm0, %mm1
-# CHECK-NEXT: [0,2]     .D-R .    .    .  .   xorps	%xmm0, %xmm0
-# CHECK-NEXT: [0,3]     .DeER.    .    .  .   movaps	%xmm0, %xmm1
-# CHECK-NEXT: [0,4]     . DeER    .    .  .   movups	%xmm1, %xmm2
-# CHECK-NEXT: [0,5]     . D=eER   .    .  .   movapd	%xmm2, %xmm3
-# CHECK-NEXT: [0,6]     .  D=eER  .    .  .   movupd	%xmm3, %xmm4
-# CHECK-NEXT: [0,7]     .  D==eER .    .  .   movdqa	%xmm4, %xmm5
-# CHECK-NEXT: [0,8]     .   D==eER.    .  .   movdqu	%xmm5, %xmm0
-# CHECK-NEXT: [1,0]     .   D----R.    .  .   pxor	%mm0, %mm0
-# CHECK-NEXT: [1,1]     .    DeE--R    .  .   movq	%mm0, %mm1
-# CHECK-NEXT: [1,2]     .    D----R    .  .   xorps	%xmm0, %xmm0
-# CHECK-NEXT: [1,3]     .    .DeE--R   .  .   movaps	%xmm0, %xmm1
-# CHECK-NEXT: [1,4]     .    .D=eE-R   .  .   movups	%xmm1, %xmm2
-# CHECK-NEXT: [1,5]     .    . D=eE-R  .  .   movapd	%xmm2, %xmm3
-# CHECK-NEXT: [1,6]     .    . D==eER  .  .   movupd	%xmm3, %xmm4
-# CHECK-NEXT: [1,7]     .    .  D==eER .  .   movdqa	%xmm4, %xmm5
-# CHECK-NEXT: [1,8]     .    .  D===eER.  .   movdqu	%xmm5, %xmm0
-# CHECK-NEXT: [2,0]     .    .   D----R.  .   pxor	%mm0, %mm0
-# CHECK-NEXT: [2,1]     .    .   DeE---R  .   movq	%mm0, %mm1
-# CHECK-NEXT: [2,2]     .    .    D----R  .   xorps	%xmm0, %xmm0
-# CHECK-NEXT: [2,3]     .    .    DeE---R .   movaps	%xmm0, %xmm1
-# CHECK-NEXT: [2,4]     .    .    .DeE--R .   movups	%xmm1, %xmm2
-# CHECK-NEXT: [2,5]     .    .    .D=eE--R.   movapd	%xmm2, %xmm3
-# CHECK-NEXT: [2,6]     .    .    . D=eE-R.   movupd	%xmm3, %xmm4
-# CHECK-NEXT: [2,7]     .    .    . D==eE-R   movdqa	%xmm4, %xmm5
-# CHECK-NEXT: [2,8]     .    .    .  D==eER   movdqu	%xmm5, %xmm0
+# CHECK:      [0,0]     DR   .    .   .   pxor	%mm0, %mm0
+# CHECK-NEXT: [0,1]     DR   .    .   .   movq	%mm0, %mm1
+# CHECK-NEXT: [0,2]     .DR  .    .   .   xorps	%xmm0, %xmm0
+# CHECK-NEXT: [0,3]     .DR  .    .   .   movaps	%xmm0, %xmm1
+# CHECK-NEXT: [0,4]     . DR .    .   .   movups	%xmm1, %xmm2
+# CHECK-NEXT: [0,5]     . DR .    .   .   movapd	%xmm2, %xmm3
+# CHECK-NEXT: [0,6]     .  DR.    .   .   movupd	%xmm3, %xmm4
+# CHECK-NEXT: [0,7]     .  DR.    .   .   movdqa	%xmm4, %xmm5
+# CHECK-NEXT: [0,8]     .   DR    .   .   movdqu	%xmm5, %xmm0
+# CHECK-NEXT: [1,0]     .   DR    .   .   pxor	%mm0, %mm0
+# CHECK-NEXT: [1,1]     .    DR   .   .   movq	%mm0, %mm1
+# CHECK-NEXT: [1,2]     .    DR   .   .   xorps	%xmm0, %xmm0
+# CHECK-NEXT: [1,3]     .    .DR  .   .   movaps	%xmm0, %xmm1
+# CHECK-NEXT: [1,4]     .    .DR  .   .   movups	%xmm1, %xmm2
+# CHECK-NEXT: [1,5]     .    . DR .   .   movapd	%xmm2, %xmm3
+# CHECK-NEXT: [1,6]     .    . DR .   .   movupd	%xmm3, %xmm4
+# CHECK-NEXT: [1,7]     .    .  DR.   .   movdqa	%xmm4, %xmm5
+# CHECK-NEXT: [1,8]     .    .  DR.   .   movdqu	%xmm5, %xmm0
+# CHECK-NEXT: [2,0]     .    .   DR   .   pxor	%mm0, %mm0
+# CHECK-NEXT: [2,1]     .    .   DR   .   movq	%mm0, %mm1
+# CHECK-NEXT: [2,2]     .    .    DR  .   xorps	%xmm0, %xmm0
+# CHECK-NEXT: [2,3]     .    .    DR  .   movaps	%xmm0, %xmm1
+# CHECK-NEXT: [2,4]     .    .    .DR .   movups	%xmm1, %xmm2
+# CHECK-NEXT: [2,5]     .    .    .DR .   movapd	%xmm2, %xmm3
+# CHECK-NEXT: [2,6]     .    .    . DR.   movupd	%xmm3, %xmm4
+# CHECK-NEXT: [2,7]     .    .    . DR.   movdqa	%xmm4, %xmm5
+# CHECK-NEXT: [2,8]     .    .    .  DR   movdqu	%xmm5, %xmm0
 
 # CHECK:      Average Wait times (based on the timeline view):
 # CHECK-NEXT: [0]: Executions
@@ -126,12 +126,12 @@ movdqu %xmm5, %xmm0
 # CHECK-NEXT: [3]: Average time elapsed from WB until retire stage
 
 # CHECK:            [0]    [1]    [2]    [3]
-# CHECK-NEXT: 0.     3     0.0    0.0    2.7       pxor	%mm0, %mm0
-# CHECK-NEXT: 1.     3     1.0    1.0    1.7       movq	%mm0, %mm1
-# CHECK-NEXT: 2.     3     0.0    0.0    3.0       xorps	%xmm0, %xmm0
-# CHECK-NEXT: 3.     3     1.0    1.0    1.7       movaps	%xmm0, %xmm1
-# CHECK-NEXT: 4.     3     1.3    0.0    1.0       movups	%xmm1, %xmm2
-# CHECK-NEXT: 5.     3     2.0    0.0    1.0       movapd	%xmm2, %xmm3
-# CHECK-NEXT: 6.     3     2.3    0.0    0.3       movupd	%xmm3, %xmm4
-# CHECK-NEXT: 7.     3     3.0    0.0    0.3       movdqa	%xmm4, %xmm5
-# CHECK-NEXT: 8.     3     3.3    0.0    0.0       movdqu	%xmm5, %xmm0
+# CHECK-NEXT: 0.     3     0.0    0.0    0.0       pxor	%mm0, %mm0
+# CHECK-NEXT: 1.     3     0.0    0.0    0.0       movq	%mm0, %mm1
+# CHECK-NEXT: 2.     3     0.0    0.0    0.0       xorps	%xmm0, %xmm0
+# CHECK-NEXT: 3.     3     0.0    0.0    0.0       movaps	%xmm0, %xmm1
+# CHECK-NEXT: 4.     3     0.0    0.0    0.0       movups	%xmm1, %xmm2
+# CHECK-NEXT: 5.     3     0.0    0.0    0.0       movapd	%xmm2, %xmm3
+# CHECK-NEXT: 6.     3     0.0    0.0    0.0       movupd	%xmm3, %xmm4
+# CHECK-NEXT: 7.     3     0.0    0.0    0.0       movdqa	%xmm4, %xmm5
+# CHECK-NEXT: 8.     3     0.0    0.0    0.0       movdqu	%xmm5, %xmm0