-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Squint -- a peephole optimizer for amacc #87
Conversation
This pull request is a starting point for discussion on how to integrate Squint most effectively with amacc. It may best be done with separate repos that are tied through modules, but I don't know. I added a -p option to amacc.c that must currently be used with the amacc -o option to produce an ELF file that can be optimized with scripts/peep. In order for peep to work, a version of squint must be compiled, e.g. gcc -o squint squint.c . Example: % gcc -o amacc-gcc amacc.c -ldl % ./amacc-gcc -p -o amacc amacc.c % scripts/disasm amacc > amacc.s % gcc -o squint squint.c % scripts/peep amacc executable file amacc was optimized. % scripts/disasm amacc > amacc.s.opt % sdiff -W -d -w 180 amacc.s amacc.s.opt I have more advanced versions of this that has even beat gcc -O1 for fib.c and possibly other things, but I suspect you will not like my coding style, so we will have to discuss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makefile
should be modified to build squint
.
Hi, I do not have a cross compile or QEMU environment, so I likely can't test or modify the make scripts without breaking something. Can you add that? |
I will review all comments tomorrow before making changes. Have a good day. It is 3:30AM here so I need some sleep. |
One more thing -- now that any library call is supported, it is possible to let the user use the -p option in place of the -o option, and then have a system call at the end of main() in amacc.c : if (peep) system("squint "); Also, squint will work on jit() directly in amacc if you put squint.c in amacc.c and remove the squint() manin() function. I doubt you will want to do that, but it works. |
This version of squint versus baseline:
|
No problem at all. You can modify |
I created a new benchmarking test (Sieve of Eratosthenes), and I am considering pulling minimal register renaming from my non-published version of squint, hence the delay. I'll have finished that decision process by tonight, and will commit by tomorrow. It would be nice to be at least as fast as gcc -O0 with a minimal set of peephole changes. |
@lecopzer, Please review this effort on introducing a peephole optimizer to AMaCC. |
After a quick review, Does this peephole algorithm made by yourself or any referenced paper or exist algorithm? It looks like Squint inserts some hint ( for example, in squint.c, how the map was constructed and the usage, what the meaning of ambiguous function naming such as |
Add sieve.c to benchmark peephole optimization Try to clean up some spacing issues The peephole optimization is 150 lines, but results in code slightly faster than gcc -O0. By adding just a few hundred more lines to the peephole optimizer, performance can be. The current code strikes a balance between being small, educational, and performant.
Everything in here was created by me except for popcount, which is a widely published algorithm, so not really attributable to anyone. popcount is "open literature". AMaCC is not documented, not even subsets such as the ELF driver, the stack based AST, the type system, or anything other than the IR. I can write documentation when I get around to it, I just don't think it is fair to require it for this commit. There are comments at the top of every function desribing what that function covers. |
Final comment -- AMaCC itself is extremely compilcated. The fact that amacc itself can be optimized by Squintis a testament to its robustness. Also, you can compare the assembly language of any program produced with the original AMaCC (cf scripts/disasm) compared with the peephole optimized version of that executable as a testament to the quality of the optimizations. |
Eating some crow (idiom) now, I found a problem in the frame-register optimizer I just added since yesterday. If I can't quickly resolve it, I will remove the frame-register optimizer. |
I traced back a problem to a use-def analysis across a basic block boundary. This is not a problem in my private version of Squint which is more complete. I "dubded down" the algorithm, so it is no longer faster than gcc -O0, but on the other hand, it is not much slower considering how much dumber the algorithm is.
OK. I simplified the frame-register optimization and re-commited. Best time of sieve benchmark for 5 runs: I will commit the Makefile tomorrow. |
Also, so a little more syntax cleanup of squint.c
@lecopzer, looking forward to ELF documentation. |
I am curious about the conditions when AMaCC + Squint can be approaching |
"or example, in squint.c, how the map was constructed and the usage, what the meaning of ambiguous function naming such as simplifyBranchX apply_peepholesX X=1~4, what the NOP use for...etc." As I explain in the documentation, the order that functions are applied or how many times they are applied improves optimization. The generic names mean that anything could be placed in the function blocks in the future, and each function just provides a scope that can be put it a different order or repeated multiple times. |
So far, just fib is faster than gcc -01. I have a function called simplify_frame() that gets rid of fp, so that only lr gets pushed with a function call, and the stack related logic gets modified/eliminated. Fib is all data movement on the stack frame via function calls, so it ends up being faster than gcc -O1 with those modifications. That said, there is also a ton of use-def analysis that can be done to cut at least 1/3 of the instructions in many loops out, and to better balance pipelines. For instance, there are two load pipelines and two integer pipelines. Keeping them all going at full tilt can be done by balancing LEA location (explicit integer or CISC addressing) and balancing immediate constants vs grabbing them from pc-relative addresses. Getting performance is not hard. What is hard is balancing it against code size when the philosophy of AMaCC is "small but powerful." |
I'll be honest. I am flying blind here. My environment does not support any of the features required for the AMaCC test environemtn. I just took a wild guess at what was needed, and will likely have trouble correcting any errors in what I've done.
I am thinking of the feasibility to change option |
@jserv To clarify, do you want 'amacc -O -o foo foo.c'? I am worried that people will forget the squint step, and believe that the '-O' is applying the optimization directly on the executable, since that is standard compiler convention. I have found that people do not pay close attention when trying to use tools. Using -Op from the start may add clarity. |
I agree with the |
FYI I just modified squint for fun to run as a shared object library. Here is a jit() run: $ gcc -o amacc amacc.c -Wl,-rpath=/home/pi/amacc -ldl libsquint.so
$ time ./amacc tests/sieve.c
real 0m7.786s
user 0m7.774s
sys 0m0.011s
$ time ./amacc -Op tests/sieve.c
real 0m2.384s
user 0m2.354s
sys 0m0.022s diff --git a/amacc.c b/amacc.c
index ee21887..5e9185c 100644
--- a/amacc.c
+++ b/amacc.c
@@ -22,6 +22,7 @@
#include <sys/stat.h>
#include <fcntl.h>
#include <dlfcn.h>
+#include "squint.h"
/* 64-bit host support */
#if defined(__x86_64__) || defined(__aarch64__)
@@ -1831,6 +1832,8 @@ int jit(int poolsz, int *main, int argc, char **argv)
if (je >= jitmap) die("jitmem too small");
*tje = reloc_bl(jitmap[((int) main - (int) text) >> 2] - (int) tje);
+ if (peephole) peephole_opt((int *)jitmem, je-1);
+
// hack to jump into specific function pointer
__clear_cache(jitmem, je);
int *res = bsearch(&sym, sym, 1, 1, (void *) _start); I can configure the makefile to build as standalone or shared library if you are interested. If |
I like the idea to run Squint as a shared library, which means that AMaCC can compile itself while Squint can be built with optimizing compilers such as gcc and clang. Can you use After you resolve the issues found by @loganchien, I am about to merge this effort. |
There is a lot here to keep track of. It would be nice to address any further issues in a subsequent pull request. A very small Squint follow-up pull requests is planned to improve/finalize squint testing and to add an option to configure squint as a shared library in addition to current standalone mode.
I have resolved all outstanding issues I am aware of. |
Let me be blunt. It has been 14 days since this pull request has been essentially ready to go, and 20 days total in a pull request state. At several points, I asked if Squint was too much for AMaCC to accept, and you replied that you thought it was appropriate. The one bug that @loganchien found is one which will never appear in actual code, whereas AMaCC itself is riddled with bugs when trying to write the simplest of code fragments. I have fixed many such bugs since I began contributing to this project six months ago that are now a part of AMaCC. I find it disrespectful that I have had this pull request open for 20 days. You need to accept this pull request by tomorrow or I close it, and continue with AMaCC in my own repository on Github. While I am upset in this comment, I think all my points are both true and reasonable. I think AMaCC is a great project, and I would like to continue contributing here rather than on my own, but I have to trade that against being held back from being able to do good work in a timely fashion. |
I understand your frustration for the long review time and I feel sorry about it. But I must clarify that I do not intend to be disrespectful. I am happy to see the change and willing to see it to be merged eventually. The prolonged review time is the direct result of two reasons: (1) the magnitude of I think I won't be able to review it at a faster speed. I leave it to jserv@ to decide whether he wants to take this PR as-is and maybe fix any bugs found later. I am only trying to be helpful here. Thank you for your understanding. |
I would feel it as a responsibility to fix any bugs as they arise. Since this is an auxillary program and not a part of AMaCC proper, the risks of impacting AMaCC are very low. Indeed Squint could be easily jettisoned with less than a minute of makefile and repo changes at any time without impact to AMaCC. |
Hi. I have decided to create my own repo, since the turnaround time for review here is too long for my development needs. That said, I will leave Squint as a pull request in case you want to merge it into AMaCC. Feel free to cancel if you don't want it. As I do AMaCC bug fixes on my own repo, I will submit some of them as pull requests to this repo. What you've done with this compiler is really phenomenal, and I chose it vs Github competitors for that reason. Thank you for such an amazing compiler to work with, based mostly on C4 but going further just the right amount. Now it is my turn to split off as you did, and hopefully continue improving this amazing educational compiler. It will be in a new repo called Squint under HPCguy. |
PS @loganchien Thank you for the time you took to review this. I know you sacrificed personal time, and I appreciate your careful analysis to make sure AMaCC would maintain its reputation before allowing the merge. |
I wrote AMaCC in my spare time as one of volunteer projects. I appreciate the contributions from @HPCguy, @loganchien, and @lecopzer. Based on c4, AMaCC is now moving forward to be more practical and usable -- I could not imagine this when I started this project in 2016. Per the consideration of @loganchien, I think we can leave some messages in git log and merge the effort for experimental purpose. For future commits, more enhancements can be landed and targeting stronger Squint. |
And thank you for your incredible work! I have started a new repo here: https://github.com/HPCguy/Squint . I encourage you to Read the README there. My mind does not deal well with outstanding pull requests, and I want to plow ahead rapidly on some new development. I will back-patch most bugfixes from the new repo into AMaCC, along with select features that I think are appropriate for AMaCC. I need the new repo for my mental health, so I can rapidly expand the capabilities of AMaCC without some constraints I have been putting on myself to stay true to the "small but powerful" AMaCC model. I think AMaCC provides a wonderful educational code base, and I don't want to mess that up. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Progress so far:
- apply_peepholes1
- apply_peepholes2 (partial)
- apply_peepholes4
- simplify_branch1
- simplify_branch2
- simplify_branch3
- simplify_branch4 (I couldn't understand)
- simplify_branch5
if ((*scan & 0x0e000000) == 0x0a000000) { | ||
hasLink = *scan & (1<<24); | ||
if ( ((mode == 0) && !hasLink) || // intra-function mode | ||
((mode != 0) && hasLink) ) { // inter-function mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure what do you mean with the term intra-function
and inter-function
.
I guess that by intra-function
you mean that NOP removal will only be applied to basic blocks within a function and the function address won't be changed; OTOH, by inter-function
you mean that NOP removal will be applied across functions (in addition to the basic block).
If my "guess" is correct, shouldn't the condition:
if ( ((mode == 0) && !hasLink) || // intra-function mode
((mode != 0) && hasLink) ) { // inter-function mode
be:
if ( !hasLink || // basic block branch for intra/inter-function mode
((mode != 0) && hasLink) ) { // inter-function mode
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you use scripts/disasm to compare the code before applying Squint and after Squint, you will see large NOP blocks between functions. The inter-procedual call to relocate_nop removes those NOP blocks between functions (after I fix the bug, which is on my list, but at low priority right now). The intra-function call to relocate-NOP takes the NOPS produced by peephole optimization out of a given function. I believe this is what you said, but I just want to clarify it is a two step process, because that is important. First, intra-function is applied to all functions, and second you apply inter-function compression. I'm pretty sure the conditional is correct, but if I am wrong, I will fix it when I get back to this within a few months.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have left all the comments I would like to say for the portion I've reviewed.
However, the mind set differences between the author and me is huge. I don't see a chance to reach an agreement. To break the tie, I decide to step down as a reviewer. Maybe other reviewers can do a better job then me. Best regards.
for (i = 0, j = 0; i < cnst_pool_size; ++i) { | ||
val = cbegin[cnst_pool[i].data_addr/4]; | ||
rotate = 0; | ||
if ((val & 0x0000ff00) && !(val & 0xffff00ff)) rotate = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be generalized by a function that can match all possible modified immediate:
// Returns the ARM modified immediate if x can be encoded as a modified
// immediate; otherwise, return -1.
int encode_arm_modified_immediate(unsigned x) {
int rotate;
for (rotate = 0; rotate < 16; ++rotate) {
if ((x & 0xffffff00) == 0) {
return (rotate << 8) | x;
}
x = (x << 2) | (x >> 30); // Rotate 2 bits from MSB to LSB
}
return -1;
}
(p.s. It is possible to further improve the performance with __builtin_clz
and __builtin_ctz
, but I don't have time to think about it thoroughly.)
And then check the constant value with:
int mov_imm_inst = 0;
int immediate;
if ((immediate = encode_arm_modified_constant(val)) != -1) {
mov_imm_inst = 0xe3a00000; // mov
} else if ((immediate = encoding_arm_modified_constant(~val)) != -1) {
mov_imm_inst = 0xe3e00000; // mvn
}
if (mov_imm_inst) {
inst = cnst_pool[i].inst;
while (inst != 0) {
next = inst->next;
newinst = cbegin + inst->inst_addr/4;
*newinst = mov_imm_inst | (*newinst & RI_Rd) | immediate;
free(inst);
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
x = (x << 2) | (x >> 30); // Rotate 2 bits from MSB to LSB
I believe 0x80000000 >> 30 will smear a 1 through the top 30 bits, corrupting x. I see your point though, and I will try to devise an implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe 0x80000000 >> 30 will smear a 1 through the top 30 bits, corrupting x.
I used unsigned
as the type for x
. x >> 30
should be translated to logical shift right (LSR), which should pad zeros instead of the most significant bit (MSB).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I fix issue #94 in AMaCC, amacc will be able to compile Squint, at least in my repo. I'm hoping that if I add optimizations to Squint that beat gcc -O1 on a wide variety of tests, you guys might accept some Squint changes back, assuming you guys were to accept the current pull request. I'm hoping to beat gcc -O1 on a variety of tests with as little as 500 lines of additional optimizations in Squint.
I don't mind the suggested peephole checks being changed after this pull request, or any other changes anyone may want to add. I just won't be the one to do it, on philisophical grounds. You have the option to discard this pull request with no ill feelings on my part. I suspected this might be too big for AMaCC when I started, which is why I was hesitant in my first post for this pull request. In any event, I will continue to submit small bug fix pull requests to AMaCC that touch 20 or less lines. I will return to fixing my amacc "issue" submissions around March 14, after I have completed float data type support for this code base on my personal repo @ https://github.com/HPCguy/Squint. |
The compiler is now beating gcc -O3 for the Sieve of Eratosthenes problem. Floating point is decently implemented and runs a physics benchmark in about 2/3 the time of Gcc -O0. The https://github.com/HPCguy/Squint/blob/main/README.md page now has tables for executable size, compile time, and runtime comparisons. After a few more performance enhanchements (I want floating point to beat gcc -O1), I will likely start fixing AMaCC bugs, and backporting those changes here. |
The floating point execution performance of my extended version of amacc in the repository https://github.com/HPCguy/Squint is now beating gcc -O3 floating point performance. |
I'm going to close this pull request, since Github HPCguy/Squint supersedes this work. AMaCC is mentioned liberally within that project, and it never would have been completed without the excellent development in the early AMaCC compiler. |
This pull request is a starting point for discussion on how to integrate
Squint most effectively with amacc. It may best be done with separate
repos that are tied through modules, but I don't know.
I added a -p option to amacc.c that must currently be used with the
amacc -o option to produce an ELF file that can be optimized with
scripts/peep. In order for peep to work, a version of squint must
be compiled, e.g. gcc -o squint squint.c .
Example:
% gcc -o amacc-gcc amacc.c -ldl
% ./amacc-gcc -p -o amacc amacc.c
% scripts/disasm amacc > amacc.s
% gcc -o squint squint.c
% scripts/peep amacc
executable file amacc was optimized.
% scripts/disasm amacc > amacc.s.opt
% sdiff -W -d -w 180 amacc.s amacc.s.opt
I have more advanced versions of this that has even beat gcc -O1 for fib.c
and possibly other things, but I suspect you will not like my coding style,
so we will have to discuss.