
Fix broken C version of MulMatricesC #37

Merged (1 commit) Dec 31, 2014

Conversation

@inactive123 (Contributor)

cxd4 (Hatcat/mephostophilis) and I stumbled upon this while backporting some changes to the libretro version.

The C version of MulMatricesC appears to be totally broken. I guess this went unnoticed before because regular desktops default to the SSE2 code path.

Read the explanation by cxd4 here:

https://github.com/libretro/mupen64plus-libretro/pull/159

"Repairs were made to the ANSI C code which had broken texture elements in the backgrounds of games like Super Mario 64. In addition, the new ANSI C function is written in vectorized loop style, such that GCC can very easily create the SSE instructions relevant to said loops in a manner perhaps more optimized than the original SSE-fixed code."
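
For reference, here is a minimal sketch of what that "vectorized loop style" looks like, reconstructed from the code fragments quoted later in this thread (leftrow, summand, and the per-row loops); the mapping of the row operand onto m2 and the final accumulation loop are assumptions of this sketch, not the exact committed code:

void MulMatricesC(float m1[4][4], float m2[4][4], float r[4][4])
{
    int i, j;
    for (i = 0; i < 4; i++)
    {
        float leftrow[4];
        float summand[4][4];

        for (j = 0; j < 4; j++)
            leftrow[j] = m1[i][j];

        /* one loop per SIMD multiply: broadcast one element of
         * leftrow across an entire row of m2 */
        for (j = 0; j < 4; j++)
            summand[0][j] = leftrow[0] * m2[0][j];
        for (j = 0; j < 4; j++)
            summand[1][j] = leftrow[1] * m2[1][j];
        for (j = 0; j < 4; j++)
            summand[2][j] = leftrow[2] * m2[2][j];
        for (j = 0; j < 4; j++)
            summand[3][j] = leftrow[3] * m2[3][j];

        /* accumulate the four partial products into the result row */
        for (j = 0; j < 4; j++)
            r[i][j] = summand[0][j] + summand[1][j]
                    + summand[2][j] + summand[3][j];
    }
}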

@littleguy77 (Member)

Maybe I'm missing something, but I don't see any defects in the original code:

void MulMatricesC(float m1[4][4],float m2[4][4],float r[4][4])
{
    for (int i=0; i<4; i++)
    {
        for (int j=0; j<4; j++)
        {
            r[i][j] = m1[i][0] * m2[0][j] +
                      m1[i][1] * m2[1][j] +
                      m1[i][2] * m2[2][j] +
                      m1[i][3] * m2[3][j];
        }
    }
}

@littleguy77 (Member)

yikes... chill @fayvel

Can we agree that this pull request is an optimization, not a fix?

@Narann (Member) commented Dec 30, 2014

@fayvel I'm always open to discussion with people from other N64 projects, especially when they try to contribute to the upstream. The N64 scene is too small to let people rant at each other. :)

@twinaphex thanks for contributing back, but as @littleguy77 says, the code seems to work. I suggest looking at the actual effect of your PR.

I think your Glide64 version is not based on mk2 but on the old Glide64. Your PR would make more sense on that repo.

@Narann Narann closed this Dec 30, 2014
@inactive123 (Contributor, Author)

OK, forget about getting any more upstream patches after this. You can thank @fayvel for that.

@inactive123 (Contributor, Author)

I really don't know what you're talking about, and with your passive-aggressive attitude on display here, I don't think I really want to know anyway.

@inactive123 (Contributor, Author)

You create a hostile environment that I want no part in, and I have better things to do than negotiate with people with anger management issues. It has nothing to do with 'punishing anybody'. The fork is open source anyway, so if they want anything they can take it.

@ghost commented Dec 30, 2014

Can someone please merge the patch from @twinaphex? Thanks.

@inactive123 (Contributor, Author)

This is not about whether the patch gets merged; I don't care if it is merged or not. I was just trying to be helpful, and in any case the author of that patch (@cxd4) provided plenty of reasons for it being an improvement. But this is not about that. This is about your choice of words in the comment you have now removed. I'd appreciate it if you would let it stay up instead of trying to act like you said nothing bad.

@littleguy77 (Member)

lame @fayvel

@ghost commented Dec 30, 2014

@twinaphex I am terribly sorry. What can I do to make all this go away? Can I give you a present or anything else?

I will try not to say anything bad again about the work of other people.

@inactive123 (Contributor, Author)

You can say bad things for sure, but you don't have to be insulting and confrontational about it.

Anyway, I'll pretend this didn't happen this time, and I'll assume good intent from now on.

@Narann (Member) commented Dec 31, 2014

I hesitated to remove the original @fayvel message the moment I saw it, but I didn't, because I don't like that kind of censorship and I prefer discussion.

But I'm tired of all the anger around the N64 scene, the anger that makes every N64 dev want to work away from the other devs, while I deeply think we should work together to give the N64 the emulator it deserves. I spend time trying to involve devs from other N64 projects in the upstream, so any upstream contribution should be appreciated and encouraged.

I will be less diplomatic the next time I see such an attack, and I will remove it instantly.

Sorry for the noise, @twinaphex. Any upstream patch from you or your team is appreciated. If one would require some work, I recommend creating an issue that points to your internal discussion; that would keep your team from spending time on something that may already be fixed.

@littleguy77 (Member)

This PR still represents a potential optimization. Anyone know any good test cases where MulMatricesC is a bottleneck?

@littleguy77 (Member)

Agree with @Narann. There are too few devs to waste on pointless drama. Particularly good devs like @twinaphex -- who has already contributed numerous critical fixes to long-standing bugs back to upstream.

@Narann (Member) commented Dec 31, 2014

I was hung up on the fact that this PR targeted the wrong Glide64 version, and I missed the auto-vectorization benefit it provides.

@Narann Narann reopened this Dec 31, 2014
@Narann (Member) commented Dec 31, 2014

Thanks, @twinaphex! I will merge now and add a comment on why we do things this way (simply something like "this makes the code easy for the compiler to vectorize").

I encourage devs to leave a comment whenever they write code in an unnatural way for a specific optimization purpose.

Narann added a commit that referenced this pull request Dec 31, 2014
Rewrite MulMatricesC in a more auto-vectorizable way.
@Narann Narann merged commit 8b0e9b5 into mupen64plus:master Dec 31, 2014
Inline code review on the diff line:

    float leftrow[4], destrow[4];

Member:

float destrow[4] is unused.

@cxd4

destrow was used in the original SSE algorithm in Glide64, before translation to portable C.
cxd4/mupen64plus-libretro@b08ab92#diff-bd9803dbb35139f5ba7c4e7a77557f76L239

It seems to have been meant as an intermediary token for accumulating the shuffled result before eventually storing it to that two-dimensional r[4][4] array. While updating the pattern for readability I decided that this step wasn't needed, and used Gonetz's comment about the summands to write each multiplication into an array under that name instead.

I'd forgotten that I no longer wanted to use their destrow array. Feel free to remove that unused declaration if you like; the makefile I used for the build didn't have warnings like that set to complain. The chief reason for my commit was that we couldn't get a 32-bit build going before, and I needed to change the function pointer here to indicate that we have no reason to assume a dependency on SSE1.
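For readers unfamiliar with the dispatch mechanism referred to above, here is a hypothetical sketch; the MULMATRIX and MulMatrices names are illustrative only, not necessarily Glide64's actual identifiers:

typedef void (*MULMATRIX)(float m1[4][4], float m2[4][4], float r[4][4]);

/* Portable C default: no SSE1 dependency assumed. A build that has
 * verified SSE support could reassign this pointer at startup. */
static MULMATRIX MulMatrices = MulMatricesC;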

@Narann (Member)

Thanks for the explanations, cxd4; I appreciate it.

I agree about removing SSE instructions in favor of portable C code. I removed destrow[4].

I just think the "auto-vectorized way" is somewhat unnatural to read and not obvious to grasp, so I added a few comments to give a hint to potential future contributors:

// auto-vectorizable algorithm
// vectorized loop style, such that compilers can
// easily create optimized SSE instructions.

Thanks for contributing! :)

@cxd4

That's fine. In particular, my point is that only the C code (also thought of as the "auto-vectorized" code) has universal meaning. Anything else, like inline assembly or assembly-language intrinsics, is in the eye of the beholder and not naturally readable code to everyone, because its readability is dictated more by the target processor.

For example, suppose you might prefer the intrinsic use:
#include <stdint.h>    /* int32_t */
#include <emmintrin.h> /* SSE2: __m128i, _mm_and_si128 */
ALIGNED int32_t a[4], b[4], c[4]; /* ALIGNED: a 16-byte alignment attribute */
int main(void) {
    *(__m128i *)c = _mm_and_si128(*(__m128i *)a, *(__m128i *)b);
    return 0;
}

However, I find it even more readable to both the human and the compiler (allowing better optimization beyond the SIMD instruction-set version assumed by the use of that intrinsic) to instead say:

int a[4], b[4], c[4];
int main(void) {
    register unsigned int i;
    for (i = 0; i < 4; i++)
        c[i] = a[i] & b[i];
    return 0;
}

Although the name of the _mm_and_si128 intrinsic is self-explanatory as to the operation of the code, the algorithm itself is more readable and more natural as C code (and the compiler is left to generate what should be the best choice of SIMD instruction for and'ing all the elements). This commit probably isn't the best example of that, however; you're welcome to disagree about whether my C loops are as readable as some of the old code. The gist of it may have been portability, but I thought it was more readable to divide each parallelized step of the overall algorithm into its own loop. The original C loop code you had was much better than mine, though, in terms of shortness and brevity, even if what goes on in the parallel steps is maybe more open to interpretation. So I think it's fine if certain sketchy examples of this practice come up that you'd like to comment on.

@Narann (Member)

I agree; there is a lot of SSE in many emulators. It's a shame, since emulators whose code will remain in use for years often end up with bad portability. It's also a reason why I tend to prefer shortness and brevity over compiler-pleasing writing: emulator code should favor readability over performance, as that makes the life of potential future contributors easier.

I'm not a compiler expert but if we convert this:

for (j = 0; j < 4; j++)
    summand[0][j] = leftrow[0] * row[0][j];
for (j = 0; j < 4; j++)
    summand[1][j] = leftrow[1] * row[1][j];
for (j = 0; j < 4; j++)
    summand[2][j] = leftrow[2] * row[2][j];
for (j = 0; j < 4; j++)
    summand[3][j] = leftrow[3] * row[3][j];

To this:

for (j = 0; j < 4; j++) {
    summand[0][j] = leftrow[0] * row[0][j];
    summand[1][j] = leftrow[1] * row[1][j];
    summand[2][j] = leftrow[2] * row[2][j];
    summand[3][j] = leftrow[3] * row[3][j];
}

Are we sure the compiler will not detect the vectorizable code and produce the same result as in the first case?

I'm just curious.

@cxd4

A fair question. I will check it now, but I anticipate the answer is that it will give the same result.

And yes, "readability over performance" is a good, slow but steady way to ultimately achieve better performance. I agree about prioritizing that over performance or premature optimization. :) That, and future developers understand the key to each algorithm like you said. Forcing non-portable code or prematurely optimizing for a limited span of targets (e.g. always use SSE intrinsics) generally worked better in the long run for closed-source projects or those that don't fully appreciate some benefits of open-source.

So if we convert those 4 separate loops I wrote into one single loop as you implied, GCC yields: the exact same output for both methods! So yes, your method of changing my loop produces the exact same output within almost certainly any compiler intelligent enough to vectorize them in the first place. To be more specific, the output it produces is:

movss   xmm0, DWORD PTR leftrow
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row
movaps  XMMWORD PTR summand, xmm0
movss   xmm0, DWORD PTR leftrow+4
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row+16
movaps  XMMWORD PTR summand+16, xmm0
movss   xmm0, DWORD PTR leftrow+8
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row+32
movaps  XMMWORD PTR summand+32, xmm0
movss   xmm0, DWORD PTR leftrow+12
shufps  xmm0, xmm0, 0
mulps   xmm0, XMMWORD PTR row+48
movaps  XMMWORD PTR summand+48, xmm0
ret

As to whether I'd prefer my way of writing the loops or your way of unifying them into a single loop, they both look good to me. Yours is probably better for organizing the same "type" of operation within the body of its own loop; mine mostly just focuses on representing each SIMD instruction in C loop form, one loop per instruction, with only a theoretically better chance of enforcing the vectorization individually. It's a hard syntactical habit for me to break.

But there is also an option C) here as well, if you are interested. You could also do:

int j, k;
for (k = 0; k < 4; k++)
    for (j = 0; j < 4; j++)
        summand[k][j] = leftrow[k] * row[k][j];

Sorry about the formatting/indentation issues. I'm still kind of new to GitHub's features.
EDIT by Narann: I reformatted it for you; edit this comment to see how. :)

@Narann (Member)

Yeah! I love your C option. Can you confirm it provides the same assembly results? :)

@cxd4

I like it, too. Unfortunately, I just checked, and it did not provide the same assembly language results.

It added dynamic shuffle instruction code because of the new k variable: shuffling one element across all four single-precision floating-point elements of the 128-bit XMM register is no longer fixed at compile time when a loop variable is used instead of a literal 0, 1, 2, 3 as the array index.

This is unfortunate, I think, since we seem to agree that this option C) at the bottom of my reply is more readable. Maybe it will go away in a future version of GCC, and you will be able to use it without any performance loss whatsoever. I sometimes observe that the current version of GCC is bad at auto-vectorizing something while the upcoming release has such volatility fixed, so you can try that if you want. I am currently compiling on 32-bit Linux with GCC 4.8.2; you might have a version that does better on this particular limitation.

Here is the exact, full C code I am using to debug this btw:

float summand[4][4], leftrow[4], row[4][4];

int main(void)
{
    register unsigned int j, k;

    for (k = 0; k < 4; k++)
        for (j = 0; j < 4; j++)
            summand[k][j] = leftrow[k] * row[k][j];
    return 0;
}

Compiled with:
gcc -S -O3 -msse2 -o /home/no/m64/main.s /home/no/m64/main.c
Omit the -msse2 flag if compiling in GCC for 64-bit x86 code.
Check main.s for the assembly language results of the code.

It was the same with the first two methods I gave, but the third method, for all its readability and flexibility, seems to get in the way of the compiler's ability to guarantee single-move shuffling with the SHUFPS opcode.

@Narann (Member)

I just tested, and yeah, GCC doesn't like nested loops, even small ones...

I just tested on an old GCC (4.4, 64-bit):

for (j = 0; j < 4; j++)
    summand[0][j] = leftrow[0] * row[0][j];
for (j = 0; j < 4; j++)
    summand[1][j] = leftrow[1] * row[1][j];
for (j = 0; j < 4; j++)
    summand[2][j] = leftrow[2] * row[2][j];
for (j = 0; j < 4; j++)
    summand[3][j] = leftrow[3] * row[3][j];

Results in this:

    movss   leftrow(%rip), %xmm0
    xorl    %eax, %eax
    movss   row(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand(%rip)
    movss   row+4(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+4(%rip)
    movss   row+8(%rip), %xmm1
    mulss   %xmm0, %xmm1
    mulss   row+12(%rip), %xmm0
    movss   %xmm1, summand+8(%rip)
    movss   row+16(%rip), %xmm1
    movss   %xmm0, summand+12(%rip)
    movss   leftrow+4(%rip), %xmm0
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+16(%rip)
    movss   row+20(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+20(%rip)
    movss   row+24(%rip), %xmm1
    mulss   %xmm0, %xmm1
    mulss   row+28(%rip), %xmm0
    movss   %xmm1, summand+24(%rip)
    movss   row+32(%rip), %xmm1
    movss   %xmm0, summand+28(%rip)
    movss   leftrow+8(%rip), %xmm0
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+32(%rip)
    movss   row+36(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+36(%rip)
    movss   row+40(%rip), %xmm1
    mulss   %xmm0, %xmm1
    mulss   row+44(%rip), %xmm0
    movss   %xmm1, summand+40(%rip)
    movss   row+48(%rip), %xmm1
    movss   %xmm0, summand+44(%rip)
    movss   leftrow+12(%rip), %xmm0
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+48(%rip)
    movss   row+52(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+52(%rip)
    movss   row+56(%rip), %xmm1
    mulss   %xmm0, %xmm1
    movss   %xmm1, summand+56(%rip)
    mulss   row+60(%rip), %xmm0
    movss   %xmm0, summand+60(%rip)

I'm not comfortable with asm, but this result seems "silly". Is it me?

This:

for (j = 0; j < 4; j++){
    summand[0][j] = leftrow[0] * row[0][j];
    summand[1][j] = leftrow[1] * row[1][j];
    summand[2][j] = leftrow[2] * row[2][j];
    summand[3][j] = leftrow[3] * row[3][j];
}

Results in this (your result):

    movss   leftrow(%rip), %xmm0
    xorl    %eax, %eax
    shufps  $0, %xmm0, %xmm0
    mulps   row(%rip), %xmm0
    movaps  %xmm0, summand(%rip)
    movss   leftrow+4(%rip), %xmm0
    shufps  $0, %xmm0, %xmm0
    mulps   row+16(%rip), %xmm0
    movaps  %xmm0, summand+16(%rip)
    movss   leftrow+8(%rip), %xmm0
    shufps  $0, %xmm0, %xmm0
    mulps   row+32(%rip), %xmm0
    movaps  %xmm0, summand+32(%rip)
    movss   leftrow+12(%rip), %xmm0
    shufps  $0, %xmm0, %xmm0
    mulps   row+48(%rip), %xmm0
    movaps  %xmm0, summand+48(%rip)

And:

    for (k = 0; k < 4; k++)
        for (j = 0; j < 4; j++)
            summand[k][j] = leftrow[k] * row[k][j];

Is the worst:

    movaps  row(%rip), %xmm0
    xorl    %eax, %eax
    movaps  row+16(%rip), %xmm3
    movaps  %xmm0, %xmm2
    movaps  row+32(%rip), %xmm5
    shufps  $221, %xmm3, %xmm0
    shufps  $136, %xmm3, %xmm2
    movaps  row+48(%rip), %xmm1
    movaps  %xmm5, %xmm6
    shufps  $136, %xmm1, %xmm6
    movaps  %xmm2, %xmm3
    shufps  $221, %xmm1, %xmm5
    movaps  %xmm0, %xmm1
    shufps  $136, %xmm6, %xmm3
    movaps  leftrow(%rip), %xmm4
    shufps  $136, %xmm5, %xmm1
    shufps  $221, %xmm6, %xmm2
    mulps   %xmm4, %xmm3
    shufps  $221, %xmm5, %xmm0
    mulps   %xmm4, %xmm2
    mulps   %xmm4, %xmm1
    mulps   %xmm4, %xmm0
    movaps  %xmm3, %xmm4
    unpckhps    %xmm2, %xmm3
    unpcklps    %xmm2, %xmm4
    movaps  %xmm1, %xmm2
    unpcklps    %xmm0, %xmm2
    unpckhps    %xmm0, %xmm1
    movaps  %xmm4, %xmm0
    unpcklps    %xmm2, %xmm0
    unpckhps    %xmm2, %xmm4
    movaps  %xmm0, summand(%rip)
    movaps  %xmm3, %xmm0
    unpckhps    %xmm1, %xmm3
    unpcklps    %xmm1, %xmm0
    movaps  %xmm4, summand+16(%rip)
    movaps  %xmm0, summand+32(%rip)
    movaps  %xmm3, summand+48(%rip)

So, on GCC 4.4, option 2 seems to be the best.

I'm impressed by how much the results can change for a simple loop doing the exact same thing...

Narann added a commit that referenced this pull request Jan 1, 2015
Rewrite MulMatricesC in a more auto-vectorizable way.
@cxd4 commented Jan 2, 2015

Inevitably, they will. Auto-vectorization is still a rather sensitive science. You can usually count on new releases of Microsoft's Visual C compilers to be stubborn and predictable in that regard (also less effective overall), whereas GCC is always ever-improving... and ever-worsening! I had to compile my older RSP code with GCC 4.7.2 instead of GCC 4.8 because the regressions outweighed the improvements when it came to auto-vectorization, so it's always best if we keep it simple.

Your first asm blob, for the first option that I was already doing with those 4 separate loops, seems to be the result of GCC failing to pack the floats for parallel streaming. You keep seeing loads of "movss" and "mulss" throughout that overly large asm result when it really should be "mulps". Rather than multiplying 4 floats per loop in a single instruction, it's doing 4 individual mulss operations per loop.

The second asm blob is the result of the method you suggested earlier, and also of my method from before on GCC 4.8.2. It seems you were right after all to suggest this improvement to my loops; I will send twinaphex another pull request shortly, crediting you.

Now, the third asm result you gave IS succeeding at packing the results into parallel SIMD, but it is failing at the shuffle by your outer loop variable k, which prevents static compile-time prediction when organizing the shufps shuffle immediates. It's supposed to move one 32-bit float into the low 32 bits of the XMM register and then shuffle it across bits 63..32, 95..64, and 127..96 using SHUFPS xmm0, xmm0, 00h, but it is instead trying to bend the 8-bit shuffle immediate to each case offset. So it is half-successful, ironically. Why option 3 is half-successful while my original option 1 didn't pack at all goes beyond me, though.
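
To illustrate the broadcast pattern described above, here is a minimal sketch (not code from this PR; it assumes the standard SSE1 intrinsics from xmmintrin.h). The shuffle immediate must be a compile-time constant, which is exactly what an index derived from the loop variable k cannot provide:

#include <xmmintrin.h> /* SSE1: __m128, _mm_set_ss, _mm_shuffle_ps */

/* Broadcast one float across all four lanes of an XMM register,
 * i.e. SHUFPS xmm0, xmm0, 00h. The third argument of _mm_shuffle_ps
 * must be an immediate constant, so it cannot be computed from a
 * runtime loop variable. */
static __m128 broadcast(float x)
{
    __m128 v = _mm_set_ss(x);          /* x into bits 31..0 */
    return _mm_shuffle_ps(v, v, 0x00); /* copy lane 0 to all lanes */
}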

@cxd4 commented Jan 2, 2015

And thanks for sharing all that, by the way.
Now I'm giving it a go in MSVC 2013's standard 64-bit Release mode optimizations.

MSVC is a bit slower than GCC at improving its vectorization, but it is also steadier and more stubborn about it. MSVC 2010 and earlier had no auto-vectorization support whatsoever from what I could tell, so I have to use 2013 for this. (Not sure who has MSVC 2012?)

option 1 (my original loops) gave:

main PROC
    movss   xmm2, DWORD PTR leftrow
    movss   xmm3, DWORD PTR leftrow+4
    xor     eax, eax
    movaps  xmm0, xmm2
    movaps  xmm1, xmm2
    mulss   xmm0, DWORD PTR row
    mulss   xmm1, DWORD PTR row+4
    movss   DWORD PTR summand, xmm0
    movaps  xmm0, xmm2
    movss   DWORD PTR summand+4, xmm1
    movaps  xmm1, xmm3
    mulss   xmm0, DWORD PTR row+8
    mulss   xmm1, DWORD PTR row+20
    mulss   xmm2, DWORD PTR row+12
    movss   DWORD PTR summand+8, xmm0
    movaps  xmm0, xmm3
    movss   DWORD PTR summand+20, xmm1
    movss   DWORD PTR summand+12, xmm2
    movss   xmm2, DWORD PTR leftrow+8
    mulss   xmm0, DWORD PTR row+16
    movaps  xmm1, xmm2
    mulss   xmm1, DWORD PTR row+36
    movss   DWORD PTR summand+16, xmm0
    movaps  xmm0, xmm3
    movss   DWORD PTR summand+36, xmm1
    mulss   xmm0, DWORD PTR row+24
    mulss   xmm3, DWORD PTR row+28
    movss   DWORD PTR summand+24, xmm0
    movaps  xmm0, xmm2
    movss   DWORD PTR summand+28, xmm3
    movss   xmm3, DWORD PTR leftrow+12
    mulss   xmm0, DWORD PTR row+32
    movaps  xmm1, xmm3
    mulss   xmm1, DWORD PTR row+52
    movss   DWORD PTR summand+32, xmm0
    movaps  xmm0, xmm2
    mulss   xmm0, DWORD PTR row+40
    mulss   xmm2, DWORD PTR row+44
    movss   DWORD PTR summand+52, xmm1
    movss   DWORD PTR summand+40, xmm0
    movaps  xmm0, xmm3
    movss   DWORD PTR summand+44, xmm2
    mulss   xmm0, DWORD PTR row+48
    movss   DWORD PTR summand+48, xmm0
    movaps  xmm0, xmm3
    mulss   xmm3, DWORD PTR row+60
    mulss   xmm0, DWORD PTR row+56
    movss   DWORD PTR summand+60, xmm3
    movss   DWORD PTR summand+56, xmm0
    ret 0

If I'm not mis-counting the op-codes, it's quite the same situation as your GCC 4.4 output for that C.

option 2 (your new loop) gave:

main PROC
    movss   xmm2, DWORD PTR leftrow
    movups  xmm0, XMMWORD PTR row
    movss   xmm3, DWORD PTR leftrow+4
    xor     eax, eax
    shufps  xmm2, xmm2, 0
    shufps  xmm3, xmm3, 0
    mulps   xmm0, xmm2
    movss   xmm2, DWORD PTR leftrow+8
    movups  XMMWORD PTR summand, xmm0
    movups  xmm0, XMMWORD PTR row+16
    shufps  xmm2, xmm2, 0
    mulps   xmm0, xmm3
    movss   xmm3, DWORD PTR leftrow+12
    movups  XMMWORD PTR summand+16, xmm0
    movups  xmm0, XMMWORD PTR row+32
    shufps  xmm3, xmm3, 0
    mulps   xmm0, xmm2
    movups  XMMWORD PTR summand+32, xmm0
    movups  xmm0, XMMWORD PTR row+48
    mulps   xmm0, xmm3
    movups  XMMWORD PTR summand+48, xmm0
    ret

Quite comparable to your output with GCC 4.4, except with 4 additional instructions due to some extra moves. However, it is clearly just as successful.

Now, the third option, with the loop inside a loop, which we both liked:

main PROC
    xor     eax, eax
    lea     rcx, OFFSET FLAT:leftrow
    lea     r8, OFFSET FLAT:__ImageBase
    lea     edx, QWORD PTR [rax+4]
$LL6@main:
    movss   xmm0, DWORD PTR [rcx]
    movups  xmm1, XMMWORD PTR row[rax+r8]
    lea     rax, QWORD PTR [rax+16]
    lea     rcx, QWORD PTR [rcx+4]
    shufps  xmm0, xmm0, 0
    mulps   xmm1, xmm0
    movups  XMMWORD PTR summand[rax+r8-16], xmm1
    dec     rdx
    jne     SHORT $LL6@main
    xor     eax, eax
    ret     0

...actually fails to unroll the loop (GCC 4.4 may have succeeded at this, although it didn't unroll the loop well enough to also reorganize the shuffling optimally), but it succeeds in every other sense. In terms of readable assembly code it is interesting to look at nonetheless.

MSVC also continues to refuse to unroll the loop regardless of what optimization settings I try.

What interests me, however, is that MSVC succeeded where GCC failed: it did shufps by 0, whereas GCC kept trying to change the immediate, inversely impacting the efficiency. Still, while MSVC is sometimes more intelligent, it tends to generate slightly more instructions than it needs.

@Narann (Member) commented Jan 2, 2015

Thanks for all this valuable information. The conclusion still seems to be "write your code to be humanly readable, as compilers will never succeed in optimizing it every time anyway".

I guess that becomes even more true as the number of instruction set architectures increases...

@cxd4 commented Jan 2, 2015

That's what C is best at. As the number of instruction set archs increases, individual compilers become more and more obligated to treat cross-platform, "humanly readable" code as the same exact thing as "optimized" code. :) Of course, it will take many years before those two really do coincide, as I'm sure you've seen. But it would be lovely for them to mean the same exact thing, wouldn't it? We're obviously not there now, but we're much closer than we were in the 90s.

There's a lot of fundamental truth to the philosophy you pose, but besides readability, accuracy is important too. Sometimes, if we refuse to write code for the convenience of the compiler's optimizer and only for human readability, the end code is less accurate. For example, if we don't tempt the compiler into vectorizing readable code, even if that sometimes means making it slightly less readable, we might not get any SIMD emulation of the N64 RCP's SIMD at all. Never mind that this is slower: it is less accurate in concept as well. A vector CPU should be emulated by another vector CPU; besides being faster, I see that as accuracy. Even the use of less readable, non-portable SSE intrinsic code can be justifiable under that logic ("accuracy" vs. "readability" of the algorithm), if we have no alternatives. But thankfully, we do!

But anyway, I'm trying not to pad this conversation's length with so much chatting. I just figured I'd drop a note: thanks to your information, I sent twinaphex another pull request here. Figured I'd let you know in case you think it has any other readability improvements.
https://github.com/libretro/mupen64plus-libretro/pull/170

@Narann (Member) commented Jan 3, 2015

I agree; thanks for the discussion.
If you are interested in GCC vectorization, look for the -ftree-vectorize and -ftree-vectorizer-verbose=2 flags.
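
For example, a typical invocation (a sketch reusing the main.c test file from earlier in this thread; note that -O3 already enables -ftree-vectorize, and newer GCC releases replace -ftree-vectorizer-verbose with the -fopt-info options):

gcc -S -O3 -msse2 -ftree-vectorize -ftree-vectorizer-verbose=2 main.c

The verbose flag makes the vectorizer report which loops it vectorized and why it gave up on the others, which speeds up experiments like the ones above.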

Keep up the good work!

littleguy77 added a commit to littleguy77/mupen64plus-video-glide64mk2 that referenced this pull request Jan 7, 2015
mupen64plus@3e58eb7

* 3e58eb7 fix texture pack read failure on some 64-bit platforms, found by EndoplasmaticReticulum on github, due to png_uint_32 type being unfortunately defined as 'unsigned long int'
*   a18ee5b Merge pull request mupen64plus#29 from littleguy77/master
|\
| * 717b587 Implement optional frameskip feature.
|/
*   4727785 Merge branch 'master' of https://github.com/mupen64plus/mupen64plus-video-glide64mk2
|\
| *   8b0e9b5 Merge pull request mupen64plus#37 from twinaphex/master
| |\
* | \   c5998a5 Merge pull request mupen64plus#37 from twinaphex/master
|\ \ \
| |/ /
|/| /
| |/
| * 154d818 Fix broken C version of MulMatricesC
* |   c435f2a Merge pull request mupen64plus#38 from fayvel/build
|\ \
| * | 3493bd6 add extra includes and preprocessor checks for _WIN32
| * | 4d8c50a use more widely used preprocessor defines for MinGW
| * | fd37e9e use different swprintf call for _GLIBCXX_HAVE_BROKEN_VSWPRINTF
| * | a8ca725 use travis ci with clang
| * | 378c250 use newer boost for C++-11 compatibility
| * | 9593518 use Unix-compatible SDL include for SDL_thread
| * | d610bff use non-strict-ansi mode to build sources
|/ /
* |   2e66b4a Merge pull request mupen64plus#36 from krnlyng/gles
|\ \
| |/
|/|
| * a4f33ad include SDL_opengles2.h instead of SDL_opengles.h
|/
* 9a3f9ff more makefile fixes for OSX with xcode 6.1.1
* 4bbe95d use SDL threads instead of C++11 threads, which are not supported on clang
* 6a46ad1 update makefile for build with clang on XCode 6.1.1
* 339ca50 clang doesnt have the to_wstring method
* 7200662 Merge pull request mupen64plus#28 from littleguy77/master
* 3a44862 Add documentation comments for polygon offset settings.