Fix broken C version of MulMatricesC #37

Merged: 1 commit, Dec 31, 2014
src/Glide64/3dmath.cpp (29 additions, 8 deletions)

@@ -190,16 +190,37 @@ void InverseTransformVectorC (float *src, float *dst, float mat[4][4])

 void MulMatricesC(float m1[4][4],float m2[4][4],float r[4][4])
 {
-   for (int i=0; i<4; i++)
-   {
-      for (int j=0; j<4; j++)
+   float row[4][4];
+   register unsigned int i, j;
+
+   for (i = 0; i < 4; i++)
+      for (j = 0; j < 4; j++)
+         row[i][j] = m2[i][j];
+   for (i = 0; i < 4; i++)
    {
-      r[i][j] = m1[i][0] * m2[0][j] +
-                m1[i][1] * m2[1][j] +
-                m1[i][2] * m2[2][j] +
-                m1[i][3] * m2[3][j];
+      float leftrow[4], destrow[4];
Narann (Member) commented on this line:
float destrow[4] is unused.

cxd4 replied:

destrow was used in the original SSE algorithm in Glide64, before translation to portable C.
cxd4/mupen64plus-libretro@b08ab92#diff-bd9803dbb35139f5ba7c4e7a77557f76L239

It seems to have been meant as an intermediary for accumulating the shuffled result before eventually storing it to that two-dimensional r[4][4] array. While updating the pattern for readability I decided that this step wasn't needed, and instead used Gonetz's comment about the summands to store each multiplication into an array under that name.

I'd forgotten that I no longer needed their destrow array. Feel free to remove the unused declaration if you like; the makefile I used for the build didn't enable warnings that would have flagged it. The chief reason for my commit was that we previously couldn't get a 32-bit build going, and I needed to change the function pointer here to indicate that we had no reason to assume a dependency on SSE1.

Narann (Member):

Thanks for the explanations cxd4, I appreciate it.

I agree about removing SSE instructions in favor of portable C code. I removed destrow[4].

I just think the "auto-vectorized way" is kind of unnatural to read and not obvious to grasp, so I added a few comments to give a hint to potential future contributors:

// auto-vectorizable algorithm
// vectorized loop style, such that compilers can
// easily create optimized SSE instructions.

Thanks for contributing! :)

cxd4:

That's fine. My point, in particular, is that only the C code (also thought of as the "auto-vectorized" code) has universal meaning. Anything else, like inline assembly or assembly-language intrinsics, is in the eye of the beholder and not naturally readable code to everyone, because its readability is dictated more by the target processor.

For example, suppose you might prefer the intrinsic use:
#include <emmintrin.h>
#include <stdint.h>
#define ALIGNED __attribute__((aligned(16))) /* assumed 16-byte alignment macro */
ALIGNED int32_t a[4], b[4], c[4];
int main(void) {
    *(__m128i *)c = _mm_and_si128(*(__m128i *)a, *(__m128i *)b);
    return 0;
}

However, I find it even more readable, to both the human and the compiler (allowing better optimizations beyond the version of the SIMD instruction set architecture assumed by that intrinsic), to instead say:

int a[4], b[4], c[4];
int main(void) {
    register unsigned int i;
    for (i = 0; i < 4; i++)
        c[i] = a[i] & b[i];
    return 0;
}

Although the name of the _mm_and_si128 intrinsic is self-explanatory as to the operation of the code, the algorithm itself is more readable and more natural as C code (and the compiler is left to generate what should be the best choice of SIMD instruction for and'ing all the elements). This commit probably isn't the best example of that, however; you're welcome to disagree about whether my C loops are as readable as some of the old code. The gist of it was portability, but I also thought it was more readable to divide each parallelized step of the overall algorithm into its own loop. The original C loop code you had was much better than mine in terms of shortness and brevity, though; it's perhaps open to interpretation how clearly the parallel steps come through. So I think it's fine if certain sketchy examples of this practice come up that you want to comment on.
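
(As an aside, the kind of output cxd4 is describing for that four-element AND loop, assuming gcc -O3 -msse2 on x86-64, is roughly the following; this is an illustrative expectation, and the exact listing varies by GCC version:)

    movdqa  a(%rip), %xmm0    # load all four ints of a at once
    pand    b(%rip), %xmm0    # packed AND with all four ints of b
    movdqa  %xmm0, c(%rip)    # store the packed result to c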

Narann (Member):

I agree, there is a lot of SSE in many emulators. It's a shame, as emulators whose code will remain for years often have bad portability. It's also a reason why I tend to prefer shortness and brevity over compiler-friendly writing: emulator code should favor readability over performance, as it makes the lives of potential future contributors easier.

I'm not a compiler expert but if we convert this:

for (j = 0; j < 4; j++)
    summand[0][j] = leftrow[0] * row[0][j];
for (j = 0; j < 4; j++)
    summand[1][j] = leftrow[1] * row[1][j];
for (j = 0; j < 4; j++)
    summand[2][j] = leftrow[2] * row[2][j];
for (j = 0; j < 4; j++)
    summand[3][j] = leftrow[3] * row[3][j];

To this:

for (j = 0; j < 4; j++) {
    summand[0][j] = leftrow[0] * row[0][j];
    summand[1][j] = leftrow[1] * row[1][j];
    summand[2][j] = leftrow[2] * row[2][j];
    summand[3][j] = leftrow[3] * row[3][j];
}

Are we sure the compiler will still detect the vectorizable code and produce the same result as in the first case?

I'm just curious.

cxd4:

A fair question. I will check it now, but I anticipate the answer is that it will give the same result.

And yes, "readability over performance" is a good, slow but steady way to ultimately achieve better performance. I agree about prioritizing that over performance or premature optimization. :) That, and future developers understand the key to each algorithm like you said. Forcing non-portable code or prematurely optimizing for a limited span of targets (e.g. always use SSE intrinsics) generally worked better in the long run for closed-source projects or those that don't fully appreciate some benefits of open-source.

So, converting the 4 separate loops I wrote into the single loop you suggested, GCC yields the exact same output for both methods! So yes, your way of changing my loop produces the exact same output under almost certainly any compiler intelligent enough to vectorize them in the first place. To be more specific, the output it produces is:

movss   xmm0, DWORD PTR leftrow
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row
movaps  XMMWORD PTR summand, xmm0
movss   xmm0, DWORD PTR leftrow+4
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row+16
movaps  XMMWORD PTR summand+16, xmm0
movss   xmm0, DWORD PTR leftrow+8
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row+32
movaps  XMMWORD PTR summand+32, xmm0
movss   xmm0, DWORD PTR leftrow+12
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row+48
movaps  XMMWORD PTR summand+48, xmm0
ret

As to whether I'd prefer my way of writing the loops or your way of unifying them in a single loop, they both look good to me. Yours is probably better at organizing the same "type" of operation within the body of its own loop; mine mostly focuses on representing each individual SIMD instruction in C loop form, with only a theoretically better chance of enforcing the vectorization individually. It's a hard syntactic habit for me to break.

But there is also an option C) here as well, if you are interested. You could also do:

int j, k;
for (k = 0; k < 4; k++)
    for (j = 0; j < 4; j++)
        summand[k][j] = leftrow[k] * row[k][j];

Sorry about formatting/indentation issues. I'm still kind of new to GitHub features.
EDIT by Narann: I reformatted it for you; edit the comment to see how. :)

Narann (Member):

Yeah! I love your C option. Can you confirm it provides the same assembly results? :)

cxd4:

I like it, too. Unfortunately, I just checked, and it did not provide the same assembly language results.

It added dynamic shuffle instruction code due to the new k variable: broadcasting one element across all four single-precision floating-point elements of the 128-bit XMM register is no longer a fixed, compile-time shuffle when a loop variable is used for the array index instead of the literals 0, 1, 2, 3.

This is unfortunate, I think, since we seem to agree that this option C) at the bottom of my reply is more readable. Maybe the limitation will go away in a future version of GCC, and you will be able to use it without any performance loss whatsoever. I sometimes find that the current version of GCC is not good at auto-vectorizing something while the upcoming release has it fixed; you can try a newer one if you want. I am currently compiling on 32-bit Linux with GCC 4.8.2; you might have a version that handles this particular limitation better.

Here is the exact, full C code I am using to debug this btw:

float summand[4][4], leftrow[4], row[4][4];

int main(void)
{
    register unsigned int j, k;

    for (k = 0; k < 4; k++)
        for (j = 0; j < 4; j++)
            summand[k][j] = leftrow[k] * row[k][j];
    return 0;
}

Compiled with:
gcc -S -O3 -msse2 -o /home/no/m64/main.s /home/no/m64/main.c
Omit the -msse2 flag if compiling in GCC for 64-bit x86 code.
Check main.s for the assembly language results of the code.

The output was the same with the first two methods I gave, but the third method, for all its readability and flexibility, seems to weaken the compiler's ability to guarantee the single-instruction broadcast done by the SHUFPS opcode.

Narann (Member):

I just tested and yeah, GCC doesn't like nested loops, even small ones...

I just tested on an old GCC (4.4, 64-bit):

for (j = 0; j < 4; j++)
    summand[0][j] = leftrow[0] * row[0][j];
for (j = 0; j < 4; j++)
    summand[1][j] = leftrow[1] * row[1][j];
for (j = 0; j < 4; j++)
    summand[2][j] = leftrow[2] * row[2][j];
for (j = 0; j < 4; j++)
    summand[3][j] = leftrow[3] * row[3][j];

Results in this:

    movss   leftrow(%rip), %xmm0
    xorl    %eax, %eax
    movss   row(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand(%rip)
    movss   row+4(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+4(%rip)
    movss   row+8(%rip), %xmm1
    mulss   %xmm0, %xmm1
    mulss   row+12(%rip), %xmm0
    movss   %xmm1, summand+8(%rip)
    movss   row+16(%rip), %xmm1
    movss   %xmm0, summand+12(%rip)
    movss   leftrow+4(%rip), %xmm0
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+16(%rip)
    movss   row+20(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+20(%rip)
    movss   row+24(%rip), %xmm1
    mulss   %xmm0, %xmm1
    mulss   row+28(%rip), %xmm0
    movss   %xmm1, summand+24(%rip)
    movss   row+32(%rip), %xmm1
    movss   %xmm0, summand+28(%rip)
    movss   leftrow+8(%rip), %xmm0
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+32(%rip)
    movss   row+36(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+36(%rip)
    movss   row+40(%rip), %xmm1
    mulss   %xmm0, %xmm1
    mulss   row+44(%rip), %xmm0
    movss   %xmm1, summand+40(%rip)
    movss   row+48(%rip), %xmm1
    movss   %xmm0, summand+44(%rip)
    movss   leftrow+12(%rip), %xmm0
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+48(%rip)
    movss   row+52(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+52(%rip)
    movss   row+56(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+56(%rip)
    mulss   row+60(%rip), %xmm0
    movss   %xmm0, summand+60(%rip)

I'm not comfortable with asm, but this result seems "silly". Is it just me?

This:

for (j = 0; j < 4; j++){
    summand[0][j] = leftrow[0] * row[0][j];
    summand[1][j] = leftrow[1] * row[1][j];
    summand[2][j] = leftrow[2] * row[2][j];
    summand[3][j] = leftrow[3] * row[3][j];
}

Results in this (the same as your result):

    movss   leftrow(%rip), %xmm0
    xorl    %eax, %eax
    shufps  $0, %xmm0, %xmm0
    mulps   row(%rip), %xmm0
    movaps  %xmm0, summand(%rip)
    movss   leftrow+4(%rip), %xmm0
    shufps  $0, %xmm0, %xmm0
    mulps   row+16(%rip), %xmm0
    movaps  %xmm0, summand+16(%rip)
    movss   leftrow+8(%rip), %xmm0
    shufps  $0, %xmm0, %xmm0
    mulps   row+32(%rip), %xmm0
    movaps  %xmm0, summand+32(%rip)
    movss   leftrow+12(%rip), %xmm0
    shufps  $0, %xmm0, %xmm0
    mulps   row+48(%rip), %xmm0
    movaps  %xmm0, summand+48(%rip)

And:

    for (k = 0; k < 4; k++)
        for (j = 0; j < 4; j++)
            summand[k][j] = leftrow[k] * row[k][j];

Is the worst:

    movaps  row(%rip), %xmm0
    xorl    %eax, %eax
    movaps  row+16(%rip), %xmm3
    movaps  %xmm0, %xmm2
    movaps  row+32(%rip), %xmm5
    shufps  $221, %xmm3, %xmm0
    shufps  $136, %xmm3, %xmm2
    movaps  row+48(%rip), %xmm1
    movaps  %xmm5, %xmm6
    shufps  $136, %xmm1, %xmm6
    movaps  %xmm2, %xmm3
    shufps  $221, %xmm1, %xmm5
    movaps  %xmm0, %xmm1
    shufps  $136, %xmm6, %xmm3
    movaps  leftrow(%rip), %xmm4
    shufps  $136, %xmm5, %xmm1
    shufps  $221, %xmm6, %xmm2
    mulps   %xmm4, %xmm3
    shufps  $221, %xmm5, %xmm0
    mulps   %xmm4, %xmm2
    mulps   %xmm4, %xmm1
    mulps   %xmm4, %xmm0
    movaps  %xmm3, %xmm4
    unpckhps    %xmm2, %xmm3
    unpcklps    %xmm2, %xmm4
    movaps  %xmm1, %xmm2
    unpcklps    %xmm0, %xmm2
    unpckhps    %xmm0, %xmm1
    movaps  %xmm4, %xmm0
    unpcklps    %xmm2, %xmm0
    unpckhps    %xmm2, %xmm4
    movaps  %xmm0, summand(%rip)
    movaps  %xmm3, %xmm0
    unpckhps    %xmm1, %xmm3
    unpcklps    %xmm1, %xmm0
    movaps  %xmm4, summand+16(%rip)
    movaps  %xmm0, summand+32(%rip)
    movaps  %xmm3, summand+48(%rip)

So, on gcc 4.4, option 2 seems to be the best.

I'm impressed by how much the results can change for such simple loops doing the exact same thing...

+      float summand[4][4];
+
+      for (j = 0; j < 4; j++)
+         leftrow[j] = m1[i][j];
+
+      for (j = 0; j < 4; j++)
+         summand[0][j] = leftrow[0] * row[0][j];
+      for (j = 0; j < 4; j++)
+         summand[1][j] = leftrow[1] * row[1][j];
+      for (j = 0; j < 4; j++)
+         summand[2][j] = leftrow[2] * row[2][j];
+      for (j = 0; j < 4; j++)
+         summand[3][j] = leftrow[3] * row[3][j];
+
+      for (j = 0; j < 4; j++)
+         r[i][j] =
+            summand[0][j]
+          + summand[1][j]
+          + summand[2][j]
+          + summand[3][j]
+         ;
-   }
    }
 }

 // 2008.03.29 H.Morii - added SSE 3DNOW! 3x3 1x3 matrix multiplication
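
For anyone who wants to verify the rewrite outside the plugin, here is a minimal standalone harness; it is a sketch, not part of this PR, and the file name, function names (mul_naive, mul_vec), test values, and tolerance are all arbitrary choices. It checks the auto-vectorizable shape merged here against the naive product it replaced:

/* check_mulmat.c -- hypothetical standalone check, not part of the PR.
 * Build: gcc -O3 -msse2 check_mulmat.c -o check_mulmat
 */
#include <stdio.h>

/* The naive version this PR removes. */
static void mul_naive(float m1[4][4], float m2[4][4], float r[4][4])
{
    int i, j;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            r[i][j] = m1[i][0] * m2[0][j] +
                      m1[i][1] * m2[1][j] +
                      m1[i][2] * m2[2][j] +
                      m1[i][3] * m2[3][j];
}

/* The auto-vectorizable shape this PR introduces (unused destrow omitted). */
static void mul_vec(float m1[4][4], float m2[4][4], float r[4][4])
{
    float row[4][4];
    unsigned int i, j;

    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++)
            row[i][j] = m2[i][j];
    for (i = 0; i < 4; i++)
    {
        float leftrow[4], summand[4][4];

        for (j = 0; j < 4; j++)
            leftrow[j] = m1[i][j];
        for (j = 0; j < 4; j++)
            summand[0][j] = leftrow[0] * row[0][j];
        for (j = 0; j < 4; j++)
            summand[1][j] = leftrow[1] * row[1][j];
        for (j = 0; j < 4; j++)
            summand[2][j] = leftrow[2] * row[2][j];
        for (j = 0; j < 4; j++)
            summand[3][j] = leftrow[3] * row[3][j];
        for (j = 0; j < 4; j++)
            r[i][j] = summand[0][j] + summand[1][j]
                    + summand[2][j] + summand[3][j];
    }
}

int main(void)
{
    float a[4][4], b[4][4], r1[4][4], r2[4][4];
    int i, j;

    for (i = 0; i < 4; i++)        /* arbitrary test matrices */
        for (j = 0; j < 4; j++) {
            a[i][j] = (float)(i * 4 + j + 1);
            b[i][j] = (float)(j * 4 - i);
        }
    mul_naive(a, b, r1);
    mul_vec(a, b, r2);
    for (i = 0; i < 4; i++)
        for (j = 0; j < 4; j++) {
            float d = r1[i][j] - r2[i][j];
            if (d < -1e-4f || d > 1e-4f) {
                printf("mismatch at [%d][%d]\n", i, j);
                return 1;
            }
        }
    printf("results match\n");
    return 0;
}

Since both versions perform the same multiplications and sum the four partial products in the same order, the results should agree to well within the tolerance used here.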