Thursday, March 23, 2017

3:53 PM



| Table 4. Register contents in matrix transpose    |          |                   |    |    |    |
|---------------------------------------------------|----------|-------------------|----|----|----|
| Instruction                                       |          | Register contents |    |    |    |
|                                                   |          | R1 = a1           | b1 | c1 | d1 |
|                                                   |          | R2 = a2           | b2 | c2 | d2 |
|                                                   |          | R3 = a3           | b3 | c3 | d3 |
|                                                   |          | R4 = a4           | b4 | c4 | d4 |
| MixH,L                                            | R1,R2,t1 | t1 = a1           | a2 | c1 | c2 |
| MixH,R                                            | R1,R2,t2 | t2 = b1           | b2 | d1 | d2 |
| MixH,L                                            | R3,R4,t3 | t3 = a3           | a4 | c3 | c4 |
| MixH,R                                            | R3,R4,t4 | t4 = b3           | b4 | d3 | d4 |
| MixW,L                                            | t1,t3,R1 | R1 = a1           | a2 | a3 | a4 |
| MixW,L                                            | t2,t4,R2 | R2 = b1           | b2 | b3 | b4 |
| MixW,R                                            | t1,t3,R3 | R3 = c1           | c2 | c3 | c4 |
| MixW,R                                            | t2,t4,R4 | R4 = d1           | d2 | d3 | d4 |
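
The table can be checked with a small simulation. This is only a sketch: the `mix` helper and its `width`/`side` parameters are hypothetical names, and it assumes that MixH interleaves single elements while MixW interleaves pairs of elements, which is what the register contents in the table imply.

```python
def mix(a, b, width, side):
    """Interleave `width`-element groups taken from registers a and b.

    side "L" picks the even-numbered groups, side "R" the odd-numbered ones.
    width=1 is assumed to model MixH, width=2 to model MixW.
    """
    start = 0 if side == "L" else width
    out = []
    for i in range(start, len(a), 2 * width):
        out += a[i:i + width] + b[i:i + width]
    return out


def transpose4(r1, r2, r3, r4):
    """Transpose a 4x4 matrix held in four registers using 8 mix steps."""
    t1 = mix(r1, r2, 1, "L")   # MixH,L R1,R2,t1
    t2 = mix(r1, r2, 1, "R")   # MixH,R R1,R2,t2
    t3 = mix(r3, r4, 1, "L")   # MixH,L R3,R4,t3
    t4 = mix(r3, r4, 1, "R")   # MixH,R R3,R4,t4
    return (
        mix(t1, t3, 2, "L"),   # MixW,L t1,t3,R1
        mix(t2, t4, 2, "L"),   # MixW,L t2,t4,R2
        mix(t1, t3, 2, "R"),   # MixW,R t1,t3,R3
        mix(t2, t4, 2, "R"),   # MixW,R t2,t4,R4
    )


if __name__ == "__main__":
    rows = [[f"{col}{n}" for col in "abcd"] for n in range(1, 5)]
    for reg in transpose4(*rows):
        print(reg)
```

Each of the eight calls corresponds to one row of the table, so the intermediate values `t1` to `t4` can be compared against it directly.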

★ 4\*4 matrix transpose (16 instructions):

Load into R1-R4: (sstt set by us)

VLD r, s+t: 1001-///-sstt-10//-0010-rrrr-///-///

->1:

->2:

->3:

->4:

32-bit shuffle: 000-nnnnnnnn-11-10-ssss-rrrr-vvvv-wwww

Mix L: 000-00001010-11-10-0101-rrrr-vvvv-wwww

Mix R: 000-01011111-11-10-0101-rrrr-vvvv-wwww

Assuming the rows are in R1-R4, with R5-R8 as temporaries and the result written back to R1-R4:

1,2->5, L:000-00001010-11-10-0101-0101-0001-0010

1,2->6, R:000-01011111-11-10-0101-0110-0001-0010

3,4->7, L:000-00001010-11-10-0101-0111-0011-0100

3,4->8, R:000-01011111-11-10-0101-1000-0011-0100

5,7->1, L:000-00001010-11-10-0101-0001-0101-0111

6,8->2, L:000-00001010-11-10-0101-0010-0110-1000

5,7->3, R:000-01011111-11-10-0101-0011-0101-0111

6,8->4, R:000-01011111-11-10-0101-0100-0110-1000
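
Writing these bit strings by hand is error-prone, so a small helper can assemble them from the register numbers. This is a sketch only: `encode_mix` and `MIX_PATTERN` are hypothetical names, and the field order (fixed prefix, 8-bit pattern, `11`, `10`, `0101`, then destination `rrrr` and sources `vvvv`, `wwww`) is taken from the Mix L / Mix R templates above.

```python
# 8-bit shuffle patterns copied from the Mix L / Mix R templates in these notes.
MIX_PATTERN = {"L": "00001010", "R": "01011111"}


def encode_mix(side, dest, src1, src2):
    """Return the dash-separated bit string for a Mix instruction.

    side is "L" or "R"; dest, src1 and src2 are register numbers 0-15.
    """
    for reg in (dest, src1, src2):
        if not 0 <= reg <= 15:
            raise ValueError(f"register number out of range: {reg}")
    fields = [
        "000",                 # fixed prefix
        MIX_PATTERN[side],     # shuffle pattern (nnnnnnnn)
        "11",
        "10",
        "0101",                # ssss field used by both Mix variants
        format(dest, "04b"),   # rrrr
        format(src1, "04b"),   # vvvv
        format(src2, "04b"),   # wwww
    ]
    return "-".join(fields)


if __name__ == "__main__":
    # The eight Mix steps of the transpose as (side, dest, src1, src2).
    steps = [("L", 5, 1, 2), ("R", 6, 1, 2), ("L", 7, 3, 4), ("R", 8, 3, 4),
             ("L", 1, 5, 7), ("L", 2, 6, 8), ("R", 3, 5, 7), ("R", 4, 6, 8)]
    for side, dest, a, b in steps:
        print(f"{a},{b}->{dest}, {side}:{encode_mix(side, dest, a, b)}")
```

For example, `encode_mix("L", 5, 1, 2)` yields `000-00001010-11-10-0101-0101-0001-0010`, the first encoding in the list.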

Store back to memory:

VST s+t, r: 1011-00//-sstt-10//-0011-////-vvvv-////

(Without SIMD: 32 instructions)

★ 4\*4 matrix multiplication (72 instructions):

**Load** into R1-R4, R5-R8: VLD (8 instructions)

**LOOP:** (4\*15 = 60)

**Multiply**: R1\*R5, R1\*R6, R1\*R7, R1\*R8 (4 instructions)

VMUL: 12\*-0110-rrrr-vvvv-wwww (32 bit)

1,5->9: 12\*-0110-1001-0001-0101

1,6->10: 12\*-0110-1010-0001-0110

1,7->11: 12\*-0110-1011-0001-0111

1,8->12: 12\*-0110-1100-0001-1000

**Transpose**: (R9-R12 in place, results back in R9-R12) (8 instructions)

**Add**: R9+R10+R11+R12 (3 instructions)

VADD: 12\*-0000-rrrr-vvvv-wwww (32 bit)

9,10->9:

9,11->9:

9,12->1:

Repeat

**Store** back (4 instructions)

(Without SIMD: 16+32+16\*7 = **160** instructions)
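
The multiply / transpose / add loop can be sanity-checked with a short simulation. This is a sketch under one assumption that the notes do not state explicitly: R5-R8 hold the *columns* of the second matrix, so that each element-wise product R1\*R5..R8 contains the four terms of one output element. The helper names (`vmul`, `vadd`, `transpose4`, `matmul4`) are hypothetical, and `transpose4` stands in for the 8 Mix instructions.

```python
def vmul(a, b):
    """Element-wise multiply of two 4-element registers (models VMUL)."""
    return [x * y for x, y in zip(a, b)]


def vadd(a, b):
    """Element-wise add of two 4-element registers (models VADD)."""
    return [x + y for x, y in zip(a, b)]


def transpose4(regs):
    """Stand-in for the 8-instruction Mix transpose of four registers."""
    return [list(col) for col in zip(*regs)]


def matmul4(A, B):
    """Return (A @ B, instruction count) following the loop in the notes."""
    rows = [list(r) for r in A]              # R1-R4: rows of A (4 loads)
    cols = [list(c) for c in zip(*B)]        # R5-R8: ASSUMED columns of B (4 loads)
    count = 8
    result = []
    for row in rows:                         # loop body runs 4 times
        prods = [vmul(row, c) for c in cols]             # 4 VMUL -> R9-R12
        t = transpose4(prods)                            # 8 Mix
        acc = vadd(vadd(vadd(t[0], t[1]), t[2]), t[3])   # 3 VADD
        result.append(acc)
        count += 4 + 8 + 3
    count += 4                               # store the 4 result rows
    return result, count


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    B = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    C, n = matmul4(A, B)
    print(C, n)
```

After the transpose, the vertical adds sum the four partial products of each output element, which is why three VADDs per row are enough. The counter adds up to 8 + 4\*15 + 4 = 72.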