How to detect a memory dependence?

- Option 1: wait until all previous stores have committed.
- Option 2: keep a list of pending stores in a "store buffer" and check whether the load address matches a previous store address.
  - Store buffer entry fields: address, data, valid, size, sequence. Entries are ordered from older to younger; the remaining slots are unallocated.
  - "Store buffer": list of stores that are pending in the machine.

How to schedule a load?

- Option 1: assume it is dependent.
- Option 2: assume it is independent. Pretty simple.
- Option 3: predict. More accurate, but still needs recovery.

\* Data Forwarding Between Store & Load

- Modern processors use an LQ (load queue) & SQ (store queue): age-based comparison!!

Out-of-Order Completion of Mem Ops

1. Store: writes to the ROB.
2. Load: generates its address, searches the ROB, accesses memory, and receives the value from the youngest older instruction that wrote to that address (from the ROB) or from memory.
   - Search logic -> content-addressable memory.

Store-Load Forwarding Complexity

- Range search
- Age-based search
- Also: load data can come from the SQ and/or the cache.
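The age-based search above can be sketched in C. This is only an illustration under simplifying assumptions: every access is one aligned word (so no range search is needed), and the names `SqEntry`, `SQ_SIZE`, and `sq_forward` are hypothetical, not taken from the lecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SQ_SIZE 8 /* hypothetical store-queue capacity */

/* One pending store. `seq` is a program-order sequence number:
 * a smaller value means an older instruction. */
typedef struct {
    bool     valid;
    uint64_t seq;
    uint32_t addr;
    uint32_t data;
} SqEntry;

/* Age-based search: among the valid stores that are older than the load
 * and wrote `addr`, pick the youngest one. Returns true and sets *out when
 * the value can be forwarded from the store queue; returns false when the
 * load has to read the cache/memory instead. */
bool sq_forward(const SqEntry sq[SQ_SIZE], uint64_t load_seq,
                uint32_t addr, uint32_t *out)
{
    assert(out != NULL);
    const SqEntry *best = NULL;
    for (size_t i = 0; i < SQ_SIZE; i++) {
        const SqEntry *e = &sq[i];
        if (!e->valid || e->seq >= load_seq || e->addr != addr)
            continue; /* empty slot, younger store, or other address */
        if (best == NULL || e->seq > best->seq)
            best = e; /* keep the youngest matching older store */
    }
    if (best == NULL)
        return false;
    *out = best->data;
    return true;
}
```

In hardware the loop is a parallel compare across all entries (the content-addressable memory mentioned above); the sequential loop here only models the result.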

Other Approaches to Concurrency

(or Instruction Level Parallelism)

- Superscalar execution
- VLIW
- Fine-grained multithreading
- SIMD processing (vector & array processors, GPUs)
- Decoupled access/execute
- Systolic arrays

## \* Superscalar Execution

Idea: Fetch/decode/execute/retire multiple instructions per cycle.

- N-wide superscalar -> N instructions per cycle.
- In-order superscalar processor
  - Copies of the datapath.
  - Dependencies make it tricky.

Adv: high IPC. Disadv: dependency checking, more hardware.
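To see why dependencies make in-order superscalar issue tricky, here is a minimal C sketch of the check an N-wide machine has to perform on each fetched group. The `Instr` layout and the function names are hypothetical, and only register RAW and WAW hazards are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Register numbers; -1 means "field unused". */
typedef struct { int dst, src1, src2; } Instr;

/* True if `younger` must wait for `older` within the same issue group. */
static bool depends_on(const Instr *younger, const Instr *older)
{
    if (older->dst < 0)
        return false;                   /* older writes no register */
    return younger->src1 == older->dst  /* RAW on first source  */
        || younger->src2 == older->dst  /* RAW on second source */
        || younger->dst  == older->dst; /* WAW                  */
}

/* In-order issue: returns how many instructions at the front of `group`
 * can issue in the same cycle. Issue stops at the first instruction that
 * depends on an earlier one in the group. */
size_t issue_count(const Instr *group, size_t width)
{
    assert(group != NULL);
    for (size_t i = 1; i < width; i++)
        for (size_t j = 0; j < i; j++)
            if (depends_on(&group[i], &group[j]))
                return i;
    return width;
}
```

Every instruction is compared against all earlier ones in the group, so the number of comparisons grows roughly with the square of the issue width, which is one source of the extra hardware noted above.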

## Lecture 18. Branch Prediction



if (x == n) { A } else { B }

| tarent for MA Hor: -Bad! | if (pointer != NULL) { A } else { B }: A is more probable! -> the programmer needs to try to maximize that probability. |
|--------------------------|-----------------------------------------------------------------------------------------|
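One way a programmer can tell the compiler which side of such a branch is more probable is the GCC/Clang builtin `__builtin_expect`. Whether this is what the notes had in mind is an assumption; the `LIKELY` macro and `sum_or_zero` below are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* GCC/Clang builtin: tells the compiler the condition is usually true.
 * It influences code layout; it does not program the hardware predictor. */
#define LIKELY(x) __builtin_expect(!!(x), 1)

/* Sums n integers; the non-NULL case (path A) is marked as the common one. */
int sum_or_zero(const int *p, size_t n)
{
    if (LIKELY(p != NULL)) {    /* path A: expected */
        int s = 0;
        for (size_t i = 0; i < n; i++)
            s += p[i];
        return s;
    } else {                    /* path B: rare */
        return 0;
    }
}
```

The result is the same with or without the hint; only the layout of the generated code may change.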



## Lecture 19: Branch Prediction + VLIW, Fine-Grained Multithreading

Other control dependence handling

- Branch prediction
- Branch delay slot
- Fine-grained multithreading
- Predicated execution
- Multipath execution

VLIW (very long instruction word)

- vs. superscalar: a superscalar fetches multiple instructions & checks the dependencies between them in hardware.
- Software packs independent instructions into a larger "instruction bundle".
- One PC -> more than 1 instruction.
- No dependency checking => simple hardware (but complex compiler).
- Q. What about variable-latency operations?
- VLIW philosophy: impact => RISC, Intel IA-64, superblock.
- Trade-off
  - Adv: no need for dependency checking; no hardware to align instructions.
  - Disadv: compiler is more complex!
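The compiler's side of the VLIW bargain can be sketched as a packing pass. The following is a deliberately simple greedy, in-order packer, not how a production VLIW compiler schedules code; `Op`, `BUNDLE_WIDTH`, and `pack_bundles` are hypothetical, and every operation is assumed to write one register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BUNDLE_WIDTH 3 /* hypothetical number of slots per bundle */

/* Register numbers. Assumption: dst >= 0 always; a source of -1 is unused. */
typedef struct { int dst, src1, src2; } Op;

/* True if `later` cannot share a bundle with `earlier`. */
static bool conflicts(const Op *later, const Op *earlier)
{
    return later->src1 == earlier->dst   /* RAW */
        || later->src2 == earlier->dst   /* RAW */
        || later->dst  == earlier->dst;  /* WAW */
}

/* Greedy in-order packing. Stores the bundle index of each op in
 * bundle_of[] and returns the total number of bundles. */
size_t pack_bundles(const Op *ops, size_t n, size_t *bundle_of)
{
    assert(ops != NULL && bundle_of != NULL);
    size_t bundle = 0;
    size_t start = 0; /* index of the first op in the current bundle */
    for (size_t i = 0; i < n; i++) {
        bool fits = (i - start) < BUNDLE_WIDTH;
        for (size_t j = start; fits && j < i; j++)
            if (conflicts(&ops[i], &ops[j]))
                fits = false;
        if (!fits) {  /* close the current bundle and open a new one */
            bundle++;
            start = i;
        }
        bundle_of[i] = bundle;
    }
    return n == 0 ? 0 : bundle + 1;
}
```

Because the check happens once at compile time, the hardware can issue each bundle without comparing registers, which is the simple-hardware/complex-compiler trade-off listed above.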





## Lect 21 SIMD & GPUs

- Stride & banking issues (stride and number of banks should be relatively prime)
- Storage of a matrix
  - Row major: store data row by row.
  - Column major: store data column by column.
- Matrix multiply A×B
  - A: load a row as a vector w/ stride 1.
  - B: load a column as a vector w/ a non-unit stride.
  - Different strides can lead to "bank conflicts".
- Minimizing bank conflicts
  1. More banks.
  2. Better data layout, e.g. for A×B, transpose B (i.e., column major) so that it is also read w/ stride 1.
  3. Better mapping of address to bank, e.g. randomized mapping.
- Sparse vectors: gather/scatter.
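The effect of stride on banking can be checked with a few lines of C. This sketch assumes the simplest interleaving (element address modulo number of banks) and ignores bank busy time; `max_bank_load` and `MAX_BANKS` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BANKS 64 /* hypothetical upper bound for this sketch */

/* Element i of the vector lives at element address base + i * stride,
 * and an address maps to bank (address % num_banks). Returns the largest
 * number of vector elements that fall into a single bank; 1 means the
 * access is spread with no two elements sharing a bank. */
size_t max_bank_load(size_t base, size_t stride, size_t vlen,
                     size_t num_banks)
{
    assert(num_banks > 0 && num_banks <= MAX_BANKS);
    size_t hits[MAX_BANKS] = {0};
    size_t worst = 0;
    for (size_t i = 0; i < vlen; i++) {
        size_t bank = (base + i * stride) % num_banks;
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

With 8 banks and 8 elements, strides 1 and 3 touch every bank once, stride 2 puts two elements in each of four banks, and stride 8 sends every element to the same bank.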

- Modern SIMD = mixture of array processor & vector processor
  - (Diagram: load unit, multiply units, add units)
  - for (i = 0; i < N; i++)
      C[i] = A[i] + B[i];
    -> load, load, add, store as vector instructions; needs loop dependence analysis.
- SIMD ISA extensions (graphics)
  - Intel Pentium MMX operations.
    - Image overlaying (w/ a bitmask)
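Mask-based image overlaying can be emulated in portable C on eight packed 8-bit pixels. This is a sketch of the idea only: it does not use real MMX intrinsics, and the key-colour convention and the names `packed_eq_mask` and `overlay8` are assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates a packed byte compare-equal: each byte lane of the result is
 * 0xFF where the corresponding lane of x equals `key`, else 0x00. */
uint64_t packed_eq_mask(uint64_t x, uint8_t key)
{
    uint64_t mask = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t b = (uint8_t)(x >> (8 * lane));
        if (b == key)
            mask |= (uint64_t)0xFF << (8 * lane);
    }
    return mask;
}

/* Overlays 8 foreground pixels on 8 background pixels: wherever the
 * foreground pixel equals the key colour, the background shows through.
 * The select itself is three bitwise operations on the whole word. */
uint64_t overlay8(uint64_t fg, uint64_t bg, uint8_t key)
{
    uint64_t mask = packed_eq_mask(fg, key);
    return (mask & bg) | (~mask & fg);
}
```

The point of the mask is that the per-pixel `if` disappears: all eight pixels are selected with the same AND/AND-NOT/OR sequence, with no branches.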

GPU (Graphics Processing Units)

- Programming is done using threads, NOT SIMD instructions.
- Programming Model VS. Hardware Execution Model
  -> Next page!

Q. How to calculate

for (i = 0; i < N; i++) C[i] = A[i] + B[i];
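In the thread-based programming model, the loop above is written as the work of a single thread, and the loop counter becomes the thread's ID. The plain-C sketch below only emulates that model with a sequential loop; on a real GPU the threads run in parallel on the hardware. The names `vec_add_thread` and `launch_vec_add` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by ONE thread: it handles the element selected by its own ID.
 * The bounds check lets us launch more threads than there are elements. */
static void vec_add_thread(size_t tid, size_t n,
                           const float *a, const float *b, float *c)
{
    if (tid < n)
        c[tid] = a[tid] + b[tid];
}

/* Stand-in for a launch of `num_threads` threads. Here they simply run
 * one after another, which gives the same result because no thread
 * depends on another. */
void launch_vec_add(size_t num_threads, size_t n,
                    const float *a, const float *b, float *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t tid = 0; tid < num_threads; tid++)
        vec_add_thread(tid, n, a, b, c);
}
```

The programmer writes only the per-thread body; how the threads are grouped and executed is decided by the hardware execution model, which is the distinction drawn above.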