Fix broken C version of MulMatricesC #37
Conversation
Maybe I'm missing something, but I don't see any defects in the original code:
Yikes... chill, @fayvel. Can we agree that this pull request is an optimization, not a fix?
@fayvel I'm always open to discussion with people from other N64 projects, especially when they try to contribute to the upstream. The N64 scene is too small to let people rant at each other. :) @twinaphex Thanks for contributing back, but as @littleguy77 says, the code seems to work. I suggest looking at the actual result of your PR. I think your Glide64 version is not based on mk2 but on the old Glide64. Your PR would make more sense on that repo.
OK, forget about getting any more upstream patches after this. You can thank @fayvel for that.
I really don't know what you're talking about, and with your passive-aggressive attitude on display here, I don't think I really would want to know anyway. |
You create a hostile environment that I really want no part in, and I have better things to do than to negotiate with people with anger management issues. It has nothing to do with 'punishing anybody'. The fork is open source anyway, so if they want anything they can take it.
Can someone please merge the patch from @twinaphex. Thanks |
This is not about whether the patch has been merged; I don't care if it is merged or not. I was just trying to be helpful, and in any case the author of that patch (@cxd4) provided plenty of reasons for it being an improvement. But this is not about that. This is about your choice of words in that comment you have now removed. I'd appreciate it if you would let it stay up instead of trying to act like you said nothing bad.
lame @fayvel |
@twinaphex I am terribly sorry. What can I do to make all this go away? Can I give you a present or anything else? I will try to not say anything bad again about the work of other people.
You can say bad things for sure, but you don't have to be insulting and confrontational about it. Anyway, I'll pretend this didn't happen for this time and I'll assume good intent from now on. |
I hesitated to remove the original @fayvel message the moment I saw it, but I didn't, because I don't like this kind of censorship and I prefer discussion. But I'm tired of all the anger around the N64 scene. This anger makes every N64 dev want to work away from other devs, while I deeply think we should work together to give N64 the emulator it deserves. I spend time trying to discuss with and involve devs from other N64 projects in the upstream, so any upstream contribution should be appreciated and encouraged. I will be less diplomatic next time I see such an attack and will remove it instantly. Sorry for the noise, @twinaphex. Any upstream patch from you or your team is appreciated. If one would require some work, I would recommend creating an issue pointing to your internal discussion. This would avoid your team spending time on something that is maybe already fixed.
This PR still represents a potential optimization. Anyone know any good test cases where MulMatricesC is a bottleneck? |
Agree with @Narann. There are too few devs to waste on pointless drama. Particularly good devs like @twinaphex -- who has already contributed numerous critical fixes to long-standing bugs back to upstream. |
I was stuck on the fact that this PR targeted the wrong Glide64 version, and I had missed the auto-vectorization benefit it provides.
Thanks! @twinaphex I will merge now and comment on why we do things this way (simply adding "this makes the code easy to vectorize for the compiler"). I encourage devs to add a comment whenever they write code in an unnatural way for a specific optimization purpose.
Rewrite MulMatricesC in a more auto-vectorizable way.
m1[i][1] * m2[1][j] +
m1[i][2] * m2[2][j] +
m1[i][3] * m2[3][j];
float leftrow[4], destrow[4];
float destrow[4] is unused.
destrow was used in the original SSE algorithm in Glide64, before translation to portable C.
cxd4/mupen64plus-libretro@b08ab92#diff-bd9803dbb35139f5ba7c4e7a77557f76L239
It seems to have been meant as an intermediary for accumulating the shuffled result before eventually storing it to that two-dimensional r[4][4] array. While updating the pattern for readability, I decided that this step wasn't needed, and used Gonetz's comment about the summands to prepare each multiplication into an array under that name instead.
I'd forgotten that I no longer wanted to use their destrow array. Feel free to remove that unused declaration if you like; the makefile I used for the build didn't have warnings like that set to complain. The chief reason for the commit I made was that before, we couldn't get a 32-bit build going, and I needed to change the function pointer here to indicate that we had no reason to assume a dependency on SSE1.
Thanks for the explanations, cxd4, I appreciate it.
I agree about removing SSE instructions in favor of portable C code. I removed destrow[4].
I just think the "auto-vectorized way" is kind of unnatural to read and not obvious to grasp, so I added a few comments to provide a hint to potential future contributors:
// auto-vectorizable algorithm
// vectorized loop style, such that compilers can
// easily create optimized SSE instructions.
Thanks for contributing! :)
That's fine. In particular, my point is that only the C code (also thought of as the "auto-vectorized" code) has universal meaning. Anything else, like inline assembly or assembly language intrinsics, is in the eye of the beholder and not naturally readable code to everyone, because its readability is dictated more by the target processor.
For example, suppose you might prefer the intrinsic use:
#include <stdint.h>    /* added so the example compiles: int32_t */
#include <emmintrin.h> /* added: _mm_and_si128 */
#define ALIGNED __attribute__((aligned(16)))
ALIGNED int32_t a[4], b[4], c[4];
int main(void) {
    *(__m128i *)c = _mm_and_si128(*(__m128i *)a, *(__m128i *)b);
    return 0;
}
However, I find it even more readable to both the human and the compiler (for better optimizations outside the version of the SIMD instruction set architecture assumed by the usage of that intrinsic) to instead say:
int a[4], b[4], c[4];
int main(void) {
    register unsigned int i;
    for (i = 0; i < 4; i++)
        c[i] = a[i] & b[i];
    return 0;
}
Although the name of the _mm_and_si128 intrinsic is self-explanatory as to the operation of the code, the algorithm itself is more readable and more natural when using C code (and the compiler is left to generate what should be the best choice of SIMD instruction for AND'ing all elements). This commit probably isn't the best example of that, however; you're welcome to disagree whether my C loops are as readable as some of the old code. The gist of it may have been portability, but I thought it was more readable to divide each parallelized step of the overall algorithm into its own loop. The original C loop code you had was much better than mine, though, in terms of shortness and brevity; what goes on in the parallel steps is maybe open to interpretation. So I think it's fine if certain sketchy examples of this practice come up that you'd like to comment about.
I agree, there is a lot of SSE in many emulators. It's a shame, as emulators, whose code will often remain in use for years, frequently have bad portability. It's also a reason why I tend to prefer shortness and brevity over compiler-pleasing writing: emulator code should focus on readability over performance, as that makes the life of potential future contributors easier.
I'm not a compiler expert but if we convert this:
for (j = 0; j < 4; j++)
summand[0][j] = leftrow[0] * row[0][j];
for (j = 0; j < 4; j++)
summand[1][j] = leftrow[1] * row[1][j];
for (j = 0; j < 4; j++)
summand[2][j] = leftrow[2] * row[2][j];
for (j = 0; j < 4; j++)
summand[3][j] = leftrow[3] * row[3][j];
To this:
for (j = 0; j < 4; j++) {
summand[0][j] = leftrow[0] * row[0][j];
summand[1][j] = leftrow[1] * row[1][j];
summand[2][j] = leftrow[2] * row[2][j];
summand[3][j] = leftrow[3] * row[3][j];
}
Are we sure the compiler won't detect the vectorizable code and produce the same result as in the first case?
I'm just curious.
A fair question. I will check it now, but I anticipate the answer is that it will give the same result.
And yes, "readability over performance" is a good, slow-but-steady way to ultimately achieve better performance. I agree about prioritizing that over raw performance or premature optimization. :) That way, future developers understand the key to each algorithm, like you said. Forcing non-portable code or prematurely optimizing for a limited span of targets (e.g. always using SSE intrinsics) has generally only worked out in the long run for closed-source projects, or those that don't fully appreciate some benefits of open source.
So, converting those 4 separate loops I wrote into the one single loop you implied, GCC yields the exact same output for both methods! So yes, your way of changing my loop produces the exact same output with almost certainly any compiler intelligent enough to vectorize them in the first place. To be more specific, the output it produces is:
movss xmm0, DWORD PTR leftrow
shufps xmm0, xmm0, 0
mulps xmm0, XMMWORD PTR row
movaps XMMWORD PTR summand, xmm0
movss xmm0, DWORD PTR leftrow+4
shufps xmm0, xmm0, 0
mulps xmm0, XMMWORD PTR row+16
movaps XMMWORD PTR summand+16, xmm0
movss xmm0, DWORD PTR leftrow+8
shufps xmm0, xmm0, 0
mulps xmm0, XMMWORD PTR row+32
movaps XMMWORD PTR summand+32, xmm0
movss xmm0, DWORD PTR leftrow+12
shufps xmm0, xmm0, 0
mulps xmm0, XMMWORD PTR row+48
movaps XMMWORD PTR summand+48, xmm0
ret
As to whether I'd prefer my method of writing the loops or your method of unifying them in a single loop, they both look good to me. Yours is probably better for organizing the same "type" of operation within the body of its own loop; mine mostly focuses on representing each SIMD instruction in C loop form individually, with only a theoretically better chance at enforcing the vectorization of each one. It's a hard syntactical habit for me to break.
But there is also an option C) here, if you are interested. You could also do:
int j, k;
for (k = 0; k < 4; k++)
for (j = 0; j < 4; j++)
summand[k][j] = leftrow[k] * row[k][j];
Sorry about formatting/indentation issues. I'm still kind of new to GitHub features.
EDIT by Narann: I reformat it for you, edit to see it. :)
Yeah! I love your C option. Can you confirm it provides the same assembly results? :)
I like it, too. Unfortunately, I just checked, and it did not provide the same assembly language results.
It added dynamic shuffle instruction code due to the new k variable: shuffling one element across all four single-precision floating-point elements of the 128-bit xmm register is no longer as fixed when using a loop variable instead of a literal 0, 1, 2, 3 for the array index.
This is unfortunate, I think, since we seem to agree that this option C) at the bottom of my reply is more readable. Maybe it will go away in a future version of GCC, and you'll be able to use it without any performance loss whatsoever. I sometimes observe that the current version of GCC is not good at auto-vectorizing something but that the upcoming release has such volatility fixed; you can try that if you want. I am currently compiling on 32-bit Linux with GCC 4.8.2; you might have a better version for this particular limitation.
Here is the exact, full C code I am using to debug this, by the way:
float summand[4][4], leftrow[4], row[4][4];
int main(void)
{
register unsigned int j, k;
for (k = 0; k < 4; k++)
for (j = 0; j < 4; j++)
summand[k][j] = leftrow[k] * row[k][j];
return 0;
}
Compiled with:
gcc -S -O3 -msse2 -o /home/no/m64/main.s /home/no/m64/main.c
Omit the -msse2 flag if compiling in GCC for 64-bit x86 code. Check main.s for the assembly language results of the code.
It was the same with the first two methods I gave, but the third method, for all its readability and flexibility, seems to limit the compiler's ability to guarantee single-move shuffling in the SHUFPS op-code.
I just tested, and yeah, GCC doesn't like nested loops, even small ones...
I just tested on an old GCC (4.4, 64-bit):
for (j = 0; j < 4; j++)
summand[0][j] = leftrow[0] * row[0][j];
for (j = 0; j < 4; j++)
summand[1][j] = leftrow[1] * row[1][j];
for (j = 0; j < 4; j++)
summand[2][j] = leftrow[2] * row[2][j];
for (j = 0; j < 4; j++)
summand[3][j] = leftrow[3] * row[3][j];
Results in this:
movss leftrow(%rip), %xmm0
xorl %eax, %eax
movss row(%rip), %xmm1
mulss %xmm0, %xmm1
movss %xmm1, summand(%rip)
movss row+4(%rip), %xmm1
mulss %xmm0, %xmm1
movss %xmm1, summand+4(%rip)
movss row+8(%rip), %xmm1
mulss %xmm0, %xmm1
mulss row+12(%rip), %xmm0
movss %xmm1, summand+8(%rip)
movss row+16(%rip), %xmm1
movss %xmm0, summand+12(%rip)
movss leftrow+4(%rip), %xmm0
mulss %xmm0, %xmm1
movss %xmm1, summand+16(%rip)
movss row+20(%rip), %xmm1
mulss %xmm0, %xmm1
movss %xmm1, summand+20(%rip)
movss row+24(%rip), %xmm1
mulss %xmm0, %xmm1
mulss row+28(%rip), %xmm0
movss %xmm1, summand+24(%rip)
movss row+32(%rip), %xmm1
movss %xmm0, summand+28(%rip)
movss leftrow+8(%rip), %xmm0
mulss %xmm0, %xmm1
movss %xmm1, summand+32(%rip)
movss row+36(%rip), %xmm1
mulss %xmm0, %xmm1
movss %xmm1, summand+36(%rip)
movss row+40(%rip), %xmm1
mulss %xmm0, %xmm1
mulss row+44(%rip), %xmm0
movss %xmm1, summand+40(%rip)
movss row+48(%rip), %xmm1
movss %xmm0, summand+44(%rip)
movss leftrow+12(%rip), %xmm0
mulss %xmm0, %xmm1
movss %xmm1, summand+48(%rip)
movss row+52(%rip), %xmm1
mulss %xmm0, %xmm1
movss %xmm1, summand+52(%rip)
movss row+56(%rip), %xmm1
mulss %xmm0, %xmm1
movss %xmm1, summand+56(%rip)
mulss row+60(%rip), %xmm0
movss %xmm0, summand+60(%rip)
I'm not comfortable with asm, but this result seems "silly". Is it just me?
This:
for (j = 0; j < 4; j++){
summand[0][j] = leftrow[0] * row[0][j];
summand[1][j] = leftrow[1] * row[1][j];
summand[2][j] = leftrow[2] * row[2][j];
summand[3][j] = leftrow[3] * row[3][j];
}
Results in this (your result):
movss leftrow(%rip), %xmm0
xorl %eax, %eax
shufps $0, %xmm0, %xmm0
mulps row(%rip), %xmm0
movaps %xmm0, summand(%rip)
movss leftrow+4(%rip), %xmm0
shufps $0, %xmm0, %xmm0
mulps row+16(%rip), %xmm0
movaps %xmm0, summand+16(%rip)
movss leftrow+8(%rip), %xmm0
shufps $0, %xmm0, %xmm0
mulps row+32(%rip), %xmm0
movaps %xmm0, summand+32(%rip)
movss leftrow+12(%rip), %xmm0
shufps $0, %xmm0, %xmm0
mulps row+48(%rip), %xmm0
movaps %xmm0, summand+48(%rip)
And:
for (k = 0; k < 4; k++)
for (j = 0; j < 4; j++)
summand[k][j] = leftrow[k] * row[k][j];
is the worst:
movaps row(%rip), %xmm0
xorl %eax, %eax
movaps row+16(%rip), %xmm3
movaps %xmm0, %xmm2
movaps row+32(%rip), %xmm5
shufps $221, %xmm3, %xmm0
shufps $136, %xmm3, %xmm2
movaps row+48(%rip), %xmm1
movaps %xmm5, %xmm6
shufps $136, %xmm1, %xmm6
movaps %xmm2, %xmm3
shufps $221, %xmm1, %xmm5
movaps %xmm0, %xmm1
shufps $136, %xmm6, %xmm3
movaps leftrow(%rip), %xmm4
shufps $136, %xmm5, %xmm1
shufps $221, %xmm6, %xmm2
mulps %xmm4, %xmm3
shufps $221, %xmm5, %xmm0
mulps %xmm4, %xmm2
mulps %xmm4, %xmm1
mulps %xmm4, %xmm0
movaps %xmm3, %xmm4
unpckhps %xmm2, %xmm3
unpcklps %xmm2, %xmm4
movaps %xmm1, %xmm2
unpcklps %xmm0, %xmm2
unpckhps %xmm0, %xmm1
movaps %xmm4, %xmm0
unpcklps %xmm2, %xmm0
unpckhps %xmm2, %xmm4
movaps %xmm0, summand(%rip)
movaps %xmm3, %xmm0
unpckhps %xmm1, %xmm3
unpcklps %xmm1, %xmm0
movaps %xmm4, summand+16(%rip)
movaps %xmm0, summand+32(%rip)
movaps %xmm3, summand+48(%rip)
So, on gcc 4.4, option 2 seems to be the best.
I'm impressed by how much the results can change for this simple loop doing the exact same thing...
Inevitably, they will. Auto-vectorization is still a rather sensitive science. You can usually count on new releases of Microsoft Visual C compilers to be stubborn and predictable in that regard (also less effective overall), whereas GCC is ever-improving... and ever-worsening! I had to compile my older RSP code with GCC 4.7.2 instead of GCC 4.8 due to the regressions outweighing the improvements when it came to auto-vectorization, so it's always best if we keep it simple.

Your first asm blob (the first option, which I was already doing with those 4 separate loops) seems to be the result of GCC failing to pack the floats into parallel streaming. You keep seeing loads of "movss" and "mulss" throughout that overly large asm result when really it should be "mulps": rather than multiplying 4 floats per loop, it's doing 4 individual mulss operations per loop.

The second asm blob is still the result of the method you suggested earlier, and matches the result of my method on GCC 4.8.2. It seems you were right after all to suggest this improvement to my loops; I will send twinaphex another pull request shortly, crediting you.

Now, the third asm result you gave IS succeeding at packing the results in parallel SIMD, but failing at shuffling by your outer loop variable.
And thanks for sharing all that, by the way. MSVC is a bit slower with improvements to vectorization than GCC is, but it is also steadier and more stubborn about it. MSVC 2010 and earlier had no auto-vectorization support whatsoever from what I could tell, so I have to use 2013 for this. (Not sure who has MSVC 2012?)

Option 1 (my original loops) gave:
main PROC
movss xmm2, DWORD PTR leftrow
movss xmm3, DWORD PTR leftrow+4
xor eax, eax
movaps xmm0, xmm2
movaps xmm1, xmm2
mulss xmm0, DWORD PTR row
mulss xmm1, DWORD PTR row+4
movss DWORD PTR summand, xmm0
movaps xmm0, xmm2
movss DWORD PTR summand+4, xmm1
movaps xmm1, xmm3
mulss xmm0, DWORD PTR row+8
mulss xmm1, DWORD PTR row+20
mulss xmm2, DWORD PTR row+12
movss DWORD PTR summand+8, xmm0
movaps xmm0, xmm3
movss DWORD PTR summand+20, xmm1
movss DWORD PTR summand+12, xmm2
movss xmm2, DWORD PTR leftrow+8
mulss xmm0, DWORD PTR row+16
movaps xmm1, xmm2
mulss xmm1, DWORD PTR row+36
movss DWORD PTR summand+16, xmm0
movaps xmm0, xmm3
movss DWORD PTR summand+36, xmm1
mulss xmm0, DWORD PTR row+24
mulss xmm3, DWORD PTR row+28
movss DWORD PTR summand+24, xmm0
movaps xmm0, xmm2
movss DWORD PTR summand+28, xmm3
movss xmm3, DWORD PTR leftrow+12
mulss xmm0, DWORD PTR row+32
movaps xmm1, xmm3
mulss xmm1, DWORD PTR row+52
movss DWORD PTR summand+32, xmm0
movaps xmm0, xmm2
mulss xmm0, DWORD PTR row+40
mulss xmm2, DWORD PTR row+44
movss DWORD PTR summand+52, xmm1
movss DWORD PTR summand+40, xmm0
movaps xmm0, xmm3
movss DWORD PTR summand+44, xmm2
mulss xmm0, DWORD PTR row+48
movss DWORD PTR summand+48, xmm0
movaps xmm0, xmm3
mulss xmm3, DWORD PTR row+60
mulss xmm0, DWORD PTR row+56
movss DWORD PTR summand+60, xmm3
movss DWORD PTR summand+56, xmm0
ret 0
If I'm not mis-counting the op-codes, it's quite the same situation as your GCC 4.4 output for that C.

Option 2 (your new loop) gave:
main PROC
movss xmm2, DWORD PTR leftrow
movups xmm0, XMMWORD PTR row
movss xmm3, DWORD PTR leftrow+4
xor eax, eax
shufps xmm2, xmm2, 0
shufps xmm3, xmm3, 0
mulps xmm0, xmm2
movss xmm2, DWORD PTR leftrow+8
movups XMMWORD PTR summand, xmm0
movups xmm0, XMMWORD PTR row+16
shufps xmm2, xmm2, 0
mulps xmm0, xmm3
movss xmm3, DWORD PTR leftrow+12
movups XMMWORD PTR summand+16, xmm0
movups xmm0, XMMWORD PTR row+32
shufps xmm3, xmm3, 0
mulps xmm0, xmm2
movups XMMWORD PTR summand+32, xmm0
movups xmm0, XMMWORD PTR row+48
mulps xmm0, xmm3
movups XMMWORD PTR summand+48, xmm0
ret Quite comparable to your output with GCC 4.4, except with 4 additional instructions due to some extra moves. However it is clearly as successful. Now the third option with the loop inside of a loop we both liked: main PROC
xor eax, eax
lea rcx, OFFSET FLAT:leftrow
lea r8, OFFSET FLAT:__ImageBase
lea edx, QWORD PTR [rax+4]
$LL6@main:
movss xmm0, DWORD PTR [rcx]
movups xmm1, XMMWORD PTR row[rax+r8]
lea rax, QWORD PTR [rax+16]
lea rcx, QWORD PTR [rcx+4]
shufps xmm0, xmm0, 0
mulps xmm1, xmm0
movups XMMWORD PTR summand[rax+r8-16], xmm1
dec rdx
jne SHORT $LL6@main
xor eax, eax
ret 0 ...fails actually in unrolling the loop (although GCC 4.4 may have succeeded at this, it didn't completely unroll the loop in a well-enough way that it also reorganized the shuffling optimally), but succeeds in every other sense. In terms of readable assembly code it is interesting to look at nonetheless. MSVC also continues to refuse to unroll the loop regardless of what optimization settings I try. What interests me however, is that MSVC succeeded where GCC failed: It did shufps by 0, whereas GCC kept trying to change the immediate and inversely impacting the efficiency. Still, while MSVC is sometimes more intelligent, it tends to generate slightly extra instructions than it needs to. |
Thanks for all this valuable information. The conclusion still seems to be: "write your code so that it is humanly readable, as compilers will never succeed at optimizing it every time anyway". I guess it's even more true as the number of instruction set archs tends to increase...
That's what C is best at. As the number of instruction set archs increases, individual compilers become more and more obligated to treat cross-platform, "humanly readable" code as being the same exact thing as "optimized" code. :) Of course, it will take many years before those two really do coincide, as I'm sure you've seen. But it would be lovely for them to mean the same exact thing, wouldn't it? We're obviously not there now, but we're much closer than we were in the 90s.

There's a lot of fundamental truth to the philosophy you pose, but besides readability, accuracy is important also. Sometimes, if we refuse to write code for any convenience to the compiler's ability to optimize, and only for human readability, the end code is less accurate. For example, if we don't tempt the compiler into vectorizing readable code, even if sometimes by making it slightly less readable, we might not get any SIMD emulation of the N64 RCP's SIMD at all. Never mind that this is slower: it is less accurate as well, in concept. A vector CPU should be emulated by another vector CPU; besides being faster, I see it as accuracy. Even the use of less readable, non-portable SSE intrinsic code can be justifiable under that logic ("accuracy" vs. "readability" of the algorithm), if we have no alternatives. But thankfully, we do!

Anyway, I'm trying not to spam this conversation's length with so much chatting. I just figured I'd drop a note: thanks to your information, I sent twinaphex another pull request here. Figured I'd let you know, just in case you thought it had any other improvement to readability.
cxd4 (Hatcat/mephostophilis) and I stumbled upon this while backporting some stuff to the libretro version.
The C version of MulMatricesC appears to be totally broken. I guess this went unnoticed before because it would default to the SSE2 codepath version on regular desktops.
Read the explanation by cxd4 here:
https://github.com/libretro/mupen64plus-libretro/pull/159
'...repairs were made to the ANSI C code which had broken texture elements in the backgrounds of games like Super Mario 64. In addition, the new ANSI C function is written in vectorized loop style, such that GCC can very easily create the SSE instructions relevant to said loops in a manner perhaps more optimized than the original SSE-fixed code.'