Fix broken C version of MulMatricesC #37
Merged
`float destrow[4]` is unused.
`destrow` was used in the original SSE algorithm in Glide64, before translation to portable C: cxd4/mupen64plus-libretro@b08ab92#diff-bd9803dbb35139f5ba7c4e7a77557f76L239

It seems to have been meant as an intermediary for accumulating the shuffled result before eventually storing it to that two-dimensional `r[4][4]` array. While updating the pattern for readability, I decided that this step wasn't needed, and instead used Gonetz's comment about the summands to prepare each multiplication in an array under that name. I'd forgotten that I no longer wanted to use their `destrow` array. Feel free to remove that unused declaration if you like; the makefile I used for the build didn't have such warnings set to complain. The chief reason for my commit was that before it we couldn't get a 32-bit build going, and I needed to change the function pointer here to indicate that we had no reason to assume a dependency on SSE1.
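For context, the portable replacement pattern being discussed looks roughly like the following sketch. This is not the exact code from the commit; the function and variable names (`MulMatricesC`, `m1`, `m2`, `r`, `row`) are illustrative, matching the names mentioned in the thread:

```c
#include <string.h>

/* Sketch of a portable 4x4 matrix multiply in the style discussed:
 * each element of a row of m1 scales a whole row of m2, and the
 * scaled rows are summed into the result row.  The inner statements
 * mirror the lanes of one SIMD register, so a vectorizing compiler
 * can map them back to packed multiplies and adds. */
static void MulMatricesC(float m1[4][4], float m2[4][4], float r[4][4])
{
    int i, j;
    for (i = 0; i < 4; i++) {
        float row[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
        for (j = 0; j < 4; j++) {
            /* m1[i][j] plays the role of the broadcast (shuffled) scalar */
            row[0] += m1[i][j] * m2[j][0];
            row[1] += m1[i][j] * m2[j][1];
            row[2] += m1[i][j] * m2[j][2];
            row[3] += m1[i][j] * m2[j][3];
        }
        memcpy(r[i], row, sizeof(row));
    }
}
```

The local `row` accumulator here serves the same purpose the original `destrow` apparently did, which is why a separate `destrow` declaration becomes dead code.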
Thanks for the explanations, cxd4, I appreciate it.

I agree about removing SSE instructions in favor of portable C code. I removed `destrow[4]`. I just think the "auto-vectorized way" is kind of unnatural to read and not obvious to grasp, so I added a few comments to provide a hint for potential future contributors:

Thanks for contributing! :)
That's fine. In particular, my point is that only the C code (also thought of as the "auto-vectorized" code) has universal meaning. Anything else, like inline assembly or SIMD intrinsics, is in the eye of the beholder and not naturally readable to everyone, because its readability is dictated more by the target processor.

For example, suppose you might prefer the intrinsic use:
```c
#include <stdint.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#define ALIGNED __attribute__((aligned(16)))

ALIGNED int32_t a[4], b[4], c[4];

int main(void) {
    *(__m128i *)c = _mm_and_si128(*(__m128i *)a, *(__m128i *)b);
    return 0;
}
```
However, I find it even more readable, to both the human and the compiler (which is then free to optimize beyond the version of the SIMD instruction set assumed by that intrinsic), to instead say:
```c
int a[4], b[4], c[4];

int main(void) {
    register unsigned int i;
    for (i = 0; i < 4; i++)
        c[i] = a[i] & b[i];
    return 0;
}
```
Although the name of the `_mm_and_si128` intrinsic is self-explanatory as to the operation, the algorithm itself is more readable and more natural as C code (and the compiler is left to generate what should be the best choice of SIMD instruction for AND'ing all the elements). This commit probably isn't the best example of that, however; you're welcome to disagree about whether my C loops are as readable as some of the old code. The gist of it may have been portability, but I thought it was more readable to divide each parallelized step of the overall algorithm into its own loop. The original C loop code you had was much better than mine in terms of brevity, though; it's maybe open to interpretation which parallel steps go on where. So I think it's fine if certain sketchy examples of this practice come up that you'd like to comment on.
I agree, there is a lot of SSE in many emulators. It's a shame, as emulators whose code will remain in use for years often have bad portability. It's also a reason why I tend to prefer brevity over compiler-pleasing writing: emulator code should favor readability over performance, as it makes the lives of potential future contributors easier.

I'm not a compiler expert, but if we convert this:

To this:

Are we sure the compiler will not detect the vectorizable code and produce the same result as in the first case?

I'm just curious.
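The snippets referred to above did not survive extraction, but the two shapes being compared can be illustrated with the integer-AND example from earlier in the thread (names and the two-step operation here are purely illustrative, not the actual matrix code):

```c
#define N 16
int a[N], b[N], c[N], d[N];

/* Variant 1: one loop per parallel step, mirroring individual SIMD ops */
void and_stepwise(void)
{
    int i;
    for (i = 0; i < N; i++)
        c[i] = a[i] & b[i];   /* first packed AND */
    for (i = 0; i < N; i++)
        d[i] = c[i] & b[i];   /* second packed AND */
}

/* Variant 2: the same steps fused into a single loop */
void and_fused(void)
{
    int i;
    for (i = 0; i < N; i++) {
        c[i] = a[i] & b[i];
        d[i] = c[i] & b[i];
    }
}
```

The question, then, is whether a vectorizing compiler emits the same packed instructions for both variants.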
A fair question. I will check it now, but I anticipate that it will give the same result.

And yes, "readability over performance" is a good, slow-but-steady way to ultimately achieve better performance. I agree about prioritizing that over premature optimization. :) That, and future developers then understand the key to each algorithm, like you said. Forcing non-portable code, or prematurely optimizing for a limited span of targets (e.g. always using SSE intrinsics), has generally worked out better only for closed-source projects or those that don't fully appreciate the benefits of open source.

So, if we convert those 4 separate loops I wrote into one single loop as you implied, GCC yields the exact same output for both methods! So yes, your way of changing my loops produces the exact same output on almost certainly any compiler intelligent enough to vectorize them in the first place. To be more specific, the output it produces is:

As to whether I'd prefer my method of writing the loops or your method of unifying them into a single loop, they both look good to me. Yours is probably better for organizing the same "type" of operation within the body of its own loop; mine mostly just focuses on representing each SIMD instruction in C loop form, with only a theoretically better chance of enforcing the vectorization individually. It's a hard syntactical habit for me to break.
But there is also an option C) here as well, if you are interested. You could also do:

Sorry about the formatting/indentation issues. I'm still kind of new to GitHub features.

EDIT by Narann: I reformatted it for you; edit this comment to see the markup. :)
Yeah! I love your C option. Can you confirm it provides the same assembly results? :)
I like it, too. Unfortunately, I just checked, and it did not produce the same assembly language results.

It added dynamic shuffle instruction code due to the new `k` variable: shuffling one element across all four single-precision floating-point elements of the 128-bit XMM register is no longer as fixed when using a loop variable instead of a literal 0, 1, 2, 3 for the array index. This is unfortunate, I think, since we seem to agree that this option C) at the bottom of my reply is the more readable one. Maybe the issue will go away in a future version of GCC, and you'll be able to use it without any performance loss whatsoever. I sometimes observe that something the current GCC is bad at auto-vectorizing gets fixed in the upcoming release; you can try that if you want. I am currently compiling on 32-bit Linux with GCC 4.8.2; you might have a better version for this particular limitation.
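The "option C" snippet itself is not reproduced in the thread, but based on the description (a nested loop whose inner index `k` drives the broadcast), it was presumably something like the following sketch. The function name and layout here are illustrative, not the exact code:

```c
/* Plausible shape of the "option C" nested loop.  The broadcast of
 * m1[i][k] across the four lanes depends on the loop variable k, which
 * is what can push older GCC releases to emit dynamic shuffle code
 * instead of a SHUFPS with a fixed immediate. */
static void MulMatricesNested(float m1[4][4], float m2[4][4], float r[4][4])
{
    int i, j, k;
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++)
            r[i][j] = 0.0f;
        for (k = 0; k < 4; k++)
            for (j = 0; j < 4; j++)          /* inner loop maps to lanes */
                r[i][j] += m1[i][k] * m2[k][j];
    }
}
```

Unrolling the `k` loop by hand (literal indices 0 through 3) is what lets the compiler bake each shuffle's lane selection into an immediate operand.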
Here is the exact, full C code I am using to debug this, btw:

Compiled with:

gcc -S -O3 -msse2 -o /home/no/m64/main.s /home/no/m64/main.c

Omit the `-msse2` flag if compiling with GCC for 64-bit x86 code. Check `main.s` for the assembly language results. The output was the same for the first two methods I gave, but the third method, for all its readability and flexibility, seems less amenable to the compiler guaranteeing single-move shuffling with the SHUFPS opcode.
I just tested, and yeah, GCC doesn't like nested loops, even small ones...

I just tested on an old GCC (4.4, 64-bit):

Results in this:

I'm not comfortable with asm, but this result seems "silly". Is it just me?

This:

Results in this (your result):

And this:

Is the worst:

So, on GCC 4.4, option 2 seems to be the best.

I'm impressed by how much the results can change for such a simple loop doing the exact same thing...