Fix broken C version of MulMatricesC #37

Merged: 1 commit, Dec 31, 2014
src/Glide64/3dmath.cpp (29 additions, 8 deletions)

@@ -190,16 +190,37 @@ void InverseTransformVectorC (float *src, float *dst, float mat[4][4])

 void MulMatricesC(float m1[4][4],float m2[4][4],float r[4][4])
 {
-   for (int i=0; i<4; i++)
-   {
-      for (int j=0; j<4; j++)
+   float row[4][4];
+   register unsigned int i, j;
+
+   for (i = 0; i < 4; i++)
+      for (j = 0; j < 4; j++)
+         row[i][j] = m2[i][j];
+   for (i = 0; i < 4; i++)
    {
-      r[i][j] = m1[i][0] * m2[0][j] +
-                m1[i][1] * m2[1][j] +
-                m1[i][2] * m2[2][j] +
-                m1[i][3] * m2[3][j];
+      float leftrow[4], destrow[4];
Narann (Member) commented on this line:
float destrow[4] is unused.

cxd4 replied:

destrow was used in the original SSE algorithm in Glide64, before translation to portable C.
cxd4/mupen64plus-libretro@b08ab92#diff-bd9803dbb35139f5ba7c4e7a77557f76L239

It seems to have been meant as an intermediary for accumulating the shuffled result before eventually storing it to that two-dimensional r[4][4] array. While updating the pattern for readability I decided that this step wasn't needed, and instead used Gonetz's comment about the summands to store each multiplication into an array under that name.

I'd forgotten that I no longer needed their destrow array. Feel free to remove the unused declaration if you like; the makefile I used for the build didn't enable warnings that would have flagged it. The chief reason for my commit was that we previously couldn't get a 32-bit build going, and I needed to change the function pointer here to indicate that we had no reason to assume a dependency on SSE1.

Narann (Member):

Thanks for the explanations cxd4, I appreciate it.

I agree about removing SSE instructions in favor of portable C code. I removed destrow[4].

I just think the "auto-vectorized way" is kind of unnatural to read and not obvious to grasp, so I added a few comments to give a hint to potential future contributors:

// auto-vectorizable algorithm
// vectorized loop style, such that compilers can
// easily create optimized SSE instructions.

Thanks for contributing! :)

cxd4:

That's fine. My point, in particular, is that only the C code (also thought of as the "auto-vectorized" code) has universal meaning. Anything else, like inline assembly or assembly-language intrinsics, is in the eye of the beholder and not naturally readable code to everyone, because its readability is dictated more by the target processor.

For example, suppose you might prefer the intrinsic use:
#include <emmintrin.h>
#include <stdint.h>
#define ALIGNED __attribute__((aligned(16))) /* assumed 16-byte alignment macro */
ALIGNED int32_t a[4], b[4], c[4];
int main(void) {
    *(__m128i *)c = _mm_and_si128(*(__m128i *)a, *(__m128i *)b);
    return 0;
}

However, I find it even more readable, to both the human and the compiler (allowing better optimizations beyond the version of the SIMD instruction set architecture assumed by that intrinsic), to instead say:

int a[4], b[4], c[4];
int main(void) {
    register unsigned int i;
    for (i = 0; i < 4; i++)
        c[i] = a[i] & b[i];
    return 0;
}

Although the name of the _mm_and_si128 intrinsic is self-explanatory as to the operation of the code, the algorithm itself is more readable and more natural as C code (and the compiler is left to generate what should be the best choice of SIMD instruction for and'ing all the elements). This commit probably isn't the best example of that, however; you're welcome to disagree about whether my C loops are as readable as some of the old code. The gist of it was portability, but I also thought it was more readable to divide each parallelized step of the overall algorithm into its own loop. The original C loop code you had was much better than mine in terms of shortness and brevity, though; it's perhaps open to interpretation how clearly the parallel steps come through. So I think it's fine if certain sketchy examples of this practice come up that you want to comment on.
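
(As an aside, the kind of output cxd4 is describing for that four-element AND loop, assuming gcc -O3 -msse2 on x86-64, is roughly the following; this is an illustrative expectation, and the exact listing varies by GCC version:)

    movdqa  a(%rip), %xmm0    # load all four ints of a at once
    pand    b(%rip), %xmm0    # packed AND with all four ints of b
    movdqa  %xmm0, c(%rip)    # store the packed result to c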

Narann (Member):

I agree, there is a lot of SSE in many emulators. It's a shame, as emulators whose code will remain for years often have bad portability. It's also a reason why I tend to prefer shortness and brevity over compiler-friendly writing: emulator code should favor readability over performance, as it makes the lives of potential future contributors easier.

I'm not a compiler expert but if we convert this:

for (j = 0; j < 4; j++)
    summand[0][j] = leftrow[0] * row[0][j];
for (j = 0; j < 4; j++)
    summand[1][j] = leftrow[1] * row[1][j];
for (j = 0; j < 4; j++)
    summand[2][j] = leftrow[2] * row[2][j];
for (j = 0; j < 4; j++)
    summand[3][j] = leftrow[3] * row[3][j];

To this:

for (j = 0; j < 4; j++) {
    summand[0][j] = leftrow[0] * row[0][j];
    summand[1][j] = leftrow[1] * row[1][j];
    summand[2][j] = leftrow[2] * row[2][j];
    summand[3][j] = leftrow[3] * row[3][j];
}

Are we sure the compiler will still detect the vectorizable code and produce the same result as in the first case?

I'm just curious.

cxd4:

A fair question. I will check it now, but I anticipate the answer is that it will give the same result.

And yes, "readability over performance" is a good, slow but steady way to ultimately achieve better performance. I agree about prioritizing that over performance or premature optimization. :) That, and future developers understand the key to each algorithm like you said. Forcing non-portable code or prematurely optimizing for a limited span of targets (e.g. always use SSE intrinsics) generally worked better in the long run for closed-source projects or those that don't fully appreciate some benefits of open-source.

So, converting the 4 separate loops I wrote into the single loop you suggested, GCC yields the exact same output for both methods! So yes, your way of changing my loop produces the exact same output under almost certainly any compiler intelligent enough to vectorize them in the first place. To be more specific, the output it produces is:

movss   xmm0, DWORD PTR leftrow
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row
movaps  XMMWORD PTR summand, xmm0
movss   xmm0, DWORD PTR leftrow+4
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row+16
movaps  XMMWORD PTR summand+16, xmm0
movss   xmm0, DWORD PTR leftrow+8
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row+32
movaps  XMMWORD PTR summand+32, xmm0
movss   xmm0, DWORD PTR leftrow+12
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row+48
movaps  XMMWORD PTR summand+48, xmm0
ret

As to whether I'd prefer my way of writing the loops or your way of unifying them in a single loop, they both look good to me. Yours is probably better at organizing the same "type" of operation within the body of its own loop; mine mostly focuses on representing each individual SIMD instruction in C loop form, with only a theoretically better chance of enforcing the vectorization individually. It's a hard syntactic habit for me to break.

But there is also an option C) here as well, if you are interested. You could also do:

int j, k;
for (k = 0; k < 4; k++)
    for (j = 0; j < 4; j++)
        summand[k][j] = leftrow[k] * row[k][j];

Sorry about formatting/indentation issues. I'm still kind of new to GitHub features.
EDIT by Narann: I reformatted it for you; edit the comment to see how. :)

Narann (Member):

Yeah! I love your C option. Can you confirm it provides the same assembly results? :)

cxd4:

I like it, too. Unfortunately, I just checked, and it did not provide the same assembly language results.

It added dynamic shuffle instruction code due to the new k variable: broadcasting one element across all four single-precision floating-point elements of the 128-bit XMM register is no longer a fixed, compile-time shuffle when a loop variable is used for the array index instead of the literals 0, 1, 2, 3.

This is unfortunate, I think, since we seem to agree that this option C) at the bottom of my reply is more readable. Maybe the limitation will go away in a future version of GCC, and you will be able to use it without any performance loss whatsoever. I sometimes find that the current version of GCC is not good at auto-vectorizing something while the upcoming release has it fixed; you can try a newer one if you want. I am currently compiling on 32-bit Linux with GCC 4.8.2; you might have a version that handles this particular limitation better.

Here is the exact, full C code I am using to debug this btw:

float summand[4][4], leftrow[4], row[4][4];

int main(void)
{
    register unsigned int j, k;

    for (k = 0; k < 4; k++)
        for (j = 0; j < 4; j++)
            summand[k][j] = leftrow[k] * row[k][j];
    return 0;
}

Compiled with:
gcc -S -O3 -msse2 -o /home/no/m64/main.s /home/no/m64/main.c
Omit the -msse2 flag if compiling in GCC for 64-bit x86 code.
Check main.s for the assembly language results of the code.

The output was the same with the first two methods I gave, but the third method, for all its readability and flexibility, seems to weaken the compiler's ability to guarantee the single-instruction broadcast done by the SHUFPS opcode.

Narann (Member):

I just tested and yeah, GCC doesn't like nested loops, even small ones...

I just tested on an old GCC (4.4, 64-bit):

for (j = 0; j < 4; j++)
    summand[0][j] = leftrow[0] * row[0][j];
for (j = 0; j < 4; j++)
    summand[1][j] = leftrow[1] * row[1][j];
for (j = 0; j < 4; j++)
    summand[2][j] = leftrow[2] * row[2][j];
for (j = 0; j < 4; j++)
    summand[3][j] = leftrow[3] * row[3][j];

Results in this:

    movss   leftrow(%rip), %xmm0
    xorl    %eax, %eax
    movss   row(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand(%rip)
    movss   row+4(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+4(%rip)
    movss   row+8(%rip), %xmm1
    mulss   %xmm0, %xmm1
    mulss   row+12(%rip), %xmm0
    movss   %xmm1, summand+8(%rip)
    movss   row+16(%rip), %xmm1
    movss   %xmm0, summand+12(%rip)
    movss   leftrow+4(%rip), %xmm0
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+16(%rip)
    movss   row+20(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+20(%rip)
    movss   row+24(%rip), %xmm1
    mulss   %xmm0, %xmm1
    mulss   row+28(%rip), %xmm0
    movss   %xmm1, summand+24(%rip)
    movss   row+32(%rip), %xmm1
    movss   %xmm0, summand+28(%rip)
    movss   leftrow+8(%rip), %xmm0
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+32(%rip)
    movss   row+36(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+36(%rip)
    movss   row+40(%rip), %xmm1
    mulss   %xmm0, %xmm1
    mulss   row+44(%rip), %xmm0
    movss   %xmm1, summand+40(%rip)
    movss   row+48(%rip), %xmm1
    movss   %xmm0, summand+44(%rip)
    movss   leftrow+12(%rip), %xmm0
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+48(%rip)
    movss   row+52(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+52(%rip)
    movss   row+56(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+56(%rip)
    mulss   row+60(%rip), %xmm0
    movss   %xmm0, summand+60(%rip)

I'm not comfortable with asm, but this result seems "silly". Is it just me?

This:

for (j = 0; j < 4; j++){
    summand[0][j] = leftrow[0] * row[0][j];
    summand[1][j] = leftrow[1] * row[1][j];
    summand[2][j] = leftrow[2] * row[2][j];
    summand[3][j] = leftrow[3] * row[3][j];
}

Results in this (the same as your result):

    movss   leftrow(%rip), %xmm0
    xorl    %eax, %eax
    shufps  $0, %xmm0, %xmm0
    mulps   row(%rip), %xmm0
    movaps  %xmm0, summand(%rip)
    movss   leftrow+4(%rip), %xmm0
    shufps  $0, %xmm0, %xmm0
    mulps   row+16(%rip), %xmm0
    movaps  %xmm0, summand+16(%rip)
    movss   leftrow+8(%rip), %xmm0
    shufps  $0, %xmm0, %xmm0
    mulps   row+32(%rip), %xmm0
    movaps  %xmm0, summand+32(%rip)
    movss   leftrow+12(%rip), %xmm0
    shufps  $0, %xmm0, %xmm0
    mulps   row+48(%rip), %xmm0
    movaps  %xmm0, summand+48(%rip)

And:

    for (k = 0; k < 4; k++)
        for (j = 0; j < 4; j++)
            summand[k][j] = leftrow[k] * row[k][j];

Is the worst:

    movaps  row(%rip), %xmm0
    xorl    %eax, %eax
    movaps  row+16(%rip), %xmm3
    movaps  %xmm0, %xmm2
    movaps  row+32(%rip), %xmm5
    shufps  $221, %xmm3, %xmm0
    shufps  $136, %xmm3, %xmm2
    movaps  row+48(%rip), %xmm1
    movaps  %xmm5, %xmm6
    shufps  $136, %xmm1, %xmm6
    movaps  %xmm2, %xmm3
    shufps  $221, %xmm1, %xmm5
    movaps  %xmm0, %xmm1
    shufps  $136, %xmm6, %xmm3
    movaps  leftrow(%rip), %xmm4
    shufps  $136, %xmm5, %xmm1
    shufps  $221, %xmm6, %xmm2
    mulps   %xmm4, %xmm3
    shufps  $221, %xmm5, %xmm0
    mulps   %xmm4, %xmm2
    mulps   %xmm4, %xmm1
    mulps   %xmm4, %xmm0
    movaps  %xmm3, %xmm4
    unpckhps    %xmm2, %xmm3
    unpcklps    %xmm2, %xmm4
    movaps  %xmm1, %xmm2
    unpcklps    %xmm0, %xmm2
    unpckhps    %xmm0, %xmm1
    movaps  %xmm4, %xmm0
    unpcklps    %xmm2, %xmm0
    unpckhps    %xmm2, %xmm4
    movaps  %xmm0, summand(%rip)
    movaps  %xmm3, %xmm0
    unpckhps    %xmm1, %xmm3
    unpcklps    %xmm1, %xmm0
    movaps  %xmm4, summand+16(%rip)
    movaps  %xmm0, summand+32(%rip)
    movaps  %xmm3, summand+48(%rip)

So, on gcc 4.4, option 2 seems to be the best.

I'm impressed by how much the results can change for such simple loops doing the exact same thing...

+      float summand[4][4];
+
+      for (j = 0; j < 4; j++)
+         leftrow[j] = m1[i][j];
+
+      for (j = 0; j < 4; j++)
+         summand[0][j] = leftrow[0] * row[0][j];
+      for (j = 0; j < 4; j++)
+         summand[1][j] = leftrow[1] * row[1][j];
+      for (j = 0; j < 4; j++)
+         summand[2][j] = leftrow[2] * row[2][j];
+      for (j = 0; j < 4; j++)
+         summand[3][j] = leftrow[3] * row[3][j];
+
+      for (j = 0; j < 4; j++)
+         r[i][j] =
+            summand[0][j]
+          + summand[1][j]
+          + summand[2][j]
+          + summand[3][j]
+         ;
-   }
    }
 }

 // 2008.03.29 H.Morii - added SSE 3DNOW! 3x3 1x3 matrix multiplication
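
For anyone who wants to verify the rewrite outside the plugin, here is a minimal standalone harness; it is a sketch, not part of this PR, and the file name, function names (mul_naive, mul_vec), test values, and tolerance are all arbitrary choices. It checks the auto-vectorizable shape merged here against the naive product it replaced:

/* check_mulmat.c -- hypothetical standalone check, not part of the PR.
 * Build: gcc -O3 -msse2 check_mulmat.c -o check_mulmat
 */
#include <stdio.h>

/* The naive version this PR removes. */
static void mul_naive(float m1[4][4], float m2[4][4], float r[4][4])
{
    int i, j;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            r[i][j] = m1[i][0] * m2[0][j] +
                      m1[i][1] * m2[1][j] +
                      m1[i][2] * m2[2][j] +
                      m1[i][3] * m2[3][j];
}

/* The auto-vectorizable shape this PR introduces (unused destrow omitted). */
static void mul_vec(float m1[4][4], float m2[4][4], float r[4][4])
{
    float row[4][4];
    unsigned int i, j;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            row[i][j] = m2[i][j];
    for (i = 0; i < 4; i++)
    {
        float leftrow[4], summand[4][4];

        for (j = 0; j < 4; j++)
            leftrow[j] = m1[i][j];
        for (j = 0; j < 4; j++)
            summand[0][j] = leftrow[0] * row[0][j];
        for (j = 0; j < 4; j++)
            summand[1][j] = leftrow[1] * row[1][j];
        for (j = 0; j < 4; j++)
            summand[2][j] = leftrow[2] * row[2][j];
        for (j = 0; j < 4; j++)
            summand[3][j] = leftrow[3] * row[3][j];
        for (j = 0; j < 4; j++)
            r[i][j] = summand[0][j] + summand[1][j]
                    + summand[2][j] + summand[3][j];
    }
}

int main(void)
{
    float a[4][4], b[4][4], r1[4][4], r2[4][4];
    int i, j;

    for (i = 0; i < 4; i++)        /* arbitrary test matrices */
        for (j = 0; j < 4; j++) {
            a[i][j] = (float)(i * 4 + j + 1);
            b[i][j] = (float)(j * 4 - i);
        }
    mul_naive(a, b, r1);
    mul_vec(a, b, r2);
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++) {
            float d = r1[i][j] - r2[i][j];
            if (d < -1e-4f || d > 1e-4f) {
                printf("mismatch at [%d][%d]\n", i, j);
                return 1;
            }
        }
    printf("results match\n");
    return 0;
}

Since both versions perform the same multiplications and sum the four partial products in the same order, the results should agree to well within the tolerance used here.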