Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Squint -- a peephole optimizer for amacc #87

Closed
wants to merge 32 commits into from
Closed

Conversation

HPCguy
Copy link
Contributor

@HPCguy HPCguy commented Jan 31, 2022

This pull request is a starting point for discussion on how to integrate
Squint most effectively with amacc. It may best be done with separate
repos that are tied through modules, but I don't know.

I added a -p option to amacc.c that must currently be used with the
amacc -o option to produce an ELF file that can be optimized with
scripts/peep. In order for peep to work, a version of squint must
be compiled, e.g. gcc -o squint squint.c .

Example:

% gcc -o amacc-gcc amacc.c -ldl
% ./amacc-gcc -p -o amacc amacc.c
% scripts/disasm amacc > amacc.s
% gcc -o squint squint.c
% scripts/peep amacc
executable file amacc was optimized.
% scripts/disasm amacc > amacc.s.opt
% sdiff -W -d -w 180 amacc.s amacc.s.opt

I have more advanced versions of this that has even beat gcc -O1 for fib.c
and possibly other things, but I suspect you will not like my coding style,
so we will have to discuss.

This pull request is a starting point for discussion on how to integrate
Squint most effectively with amacc.  It may best be done with separate
repos that are tied through modules, but I don't know.

I added a -p option to amacc.c that must currently be used with the
amacc -o option to produce an ELF file that can be optimized with
scripts/peep.  In order for peep to work, a version of squint must
be compiled, e.g. gcc -o squint squint.c .

Example:

% gcc -o amacc-gcc amacc.c -ldl
% ./amacc-gcc -p -o amacc amacc.c
% scripts/disasm amacc > amacc.s
% gcc -o squint squint.c
% scripts/peep amacc
executable file amacc was optimized.
% scripts/disasm amacc > amacc.s.opt
% sdiff -W -d -w 180 amacc.s amacc.s.opt

I have more advanced versions of this that has even beat gcc -O1 for fib.c
and possibly other things, but I suspect you will not like my coding style,
so we will have to discuss.
squint.c Outdated Show resolved Hide resolved
Copy link
Owner

@jserv jserv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makefile should be modified to build squint.

@HPCguy
Copy link
Contributor Author

HPCguy commented Jan 31, 2022

Hi, I do not have a cross compile or QEMU environment, so I likely can't test or modify the make scripts without breaking something. Can you add that?

@HPCguy
Copy link
Contributor Author

HPCguy commented Jan 31, 2022

I will review all comments tomorrow before making changes. Have a good day. It is 3:30AM here so I need some sleep.

@HPCguy
Copy link
Contributor Author

HPCguy commented Jan 31, 2022

One more thing -- now that any library call is supported, it is possible to let the user use the -p option in place of the -o option, and then have a system call at the end of main() in amacc.c : if (peep) system("squint "); Also, squint will work on jit() directly in amacc if you put squint.c in amacc.c and remove the squint() manin() function. I doubt you will want to do that, but it works.

@HPCguy
Copy link
Contributor Author

HPCguy commented Jan 31, 2022

This version of squint versus baseline:

feature amacc.s amacc.s.opt
ldr 8091 4717
str 6732 3043
reg-reg 9654 5673
sub 1878 693
add 1567 1170
mul 60 16

@jserv
Copy link
Owner

jserv commented Jan 31, 2022

Hi, I do not have a cross compile or QEMU environment, so I likely can't test or modify the make scripts without breaking something. Can you add that?

No problem at all. You can modify Makefile to reflect squint specific changes. Then, once make check includes the necessary checks, the CI can validate itself.

squint.c Outdated Show resolved Hide resolved
@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 1, 2022

I created a new benchmarking test (Sieve of Eratosthenes), and I am considering pulling minimal register renaming from my non-published version of squint, hence the delay. I'll have finished that decision process by tonight, and will commit by tomorrow. It would be nice to be at least as fast as gcc -O0 with a minimal set of peephole changes.

@jserv
Copy link
Owner

jserv commented Feb 1, 2022

@lecopzer, Please review this effort on introducing a peephole optimizer to AMaCC.

@lecopzer
Copy link
Contributor

lecopzer commented Feb 2, 2022

After a quick review,
I'd like to see a doc like https://github.com/jserv/amacc/blob/master/docs/IR.md because it seems a complex algorithm.

Does this peephole algorithm made by yourself or any referenced paper or exist algorithm?

It looks like Squint inserts some hint (nop) into codegen and resolve in squint.c.
But we don't have a brief concept of how the algorithm does and make it hard to review.

for example, in squint.c, how the map was constructed and the usage, what the meaning of ambiguous function naming such as simplifyBranchX apply_peepholesX X=1~4, what the NOP use for...etc.

Add sieve.c to benchmark peephole optimization
Try to clean up some spacing issues

The peephole optimization is 150 lines, but results
in code slightly faster than gcc -O0.  By adding
just a few hundred more lines to the peephole
optimizer, performance can be.  The current
code strikes a balance between being small,
educational, and performant.
@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 2, 2022

Everything in here was created by me except for popcount, which is a widely published algorithm, so not really attributable to anyone. popcount is "open literature".

AMaCC is not documented, not even subsets such as the ELF driver, the stack based AST, the type system, or anything other than the IR. I can write documentation when I get around to it, I just don't think it is fair to require it for this commit.

There are comments at the top of every function desribing what that function covers.

@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 2, 2022

Final comment -- AMaCC itself is extremely compilcated. The fact that amacc itself can be optimized by Squintis a testament to its robustness. Also, you can compare the assembly language of any program produced with the original AMaCC (cf scripts/disasm) compared with the peephole optimized version of that executable as a testament to the quality of the optimizations.

@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 2, 2022

Eating some crow (idiom) now, I found a problem in the frame-register optimizer I just added since yesterday. If I can't quickly resolve it, I will remove the frame-register optimizer.

I traced back a problem to a use-def analysis across a basic block boundary.
This is not a problem in my private version of Squint which is more complete.
I "dubded down" the algorithm, so it is no longer faster than gcc -O0, but
on the other hand, it is not much slower considering how much dumber the
algorithm is.
@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 2, 2022

OK. I simplified the frame-register optimization and re-commited.

Best time of sieve benchmark for 5 runs:
amacc (no squint): 0m7.643s
amacc (squint): 0m2.312s
Gcc -O0: 0m2.191s

I will commit the Makefile tomorrow.

@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 2, 2022

@lecopzer, looking forward to ELF documentation.

amacc.c Outdated Show resolved Hide resolved
@jserv
Copy link
Owner

jserv commented Feb 2, 2022

Best time of sieve benchmark for 5 runs:

compilers execution time
amacc (no squint) 0m7.643s
amacc (squint) 0m2.312s
gcc -O0 0m2.191s

I am curious about the conditions when AMaCC + Squint can be approaching gcc -O1.

@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 2, 2022

"or example, in squint.c, how the map was constructed and the usage, what the meaning of ambiguous function naming such as simplifyBranchX apply_peepholesX X=1~4, what the NOP use for...etc."

As I explain in the documentation, the order that functions are applied or how many times they are applied improves optimization. The generic names mean that anything could be placed in the function blocks in the future, and each function just provides a scope that can be put it a different order or repeated multiple times.

@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 2, 2022

I am curious about the conditions when AMaCC + Squint can be approaching gcc -O1.

So far, just fib is faster than gcc -01. I have a function called simplify_frame() that gets rid of fp, so that only lr gets pushed with a function call, and the stack related logic gets modified/eliminated. Fib is all data movement on the stack frame via function calls, so it ends up being faster than gcc -O1 with those modifications.

That said, there is also a ton of use-def analysis that can be done to cut at least 1/3 of the instructions in many loops out, and to better balance pipelines. For instance, there are two load pipelines and two integer pipelines. Keeping them all going at full tilt can be done by balancing LEA location (explicit integer or CISC addressing) and balancing immediate constants vs grabbing them from pc-relative addresses.

Getting performance is not hard. What is hard is balancing it against code size when the philosophy of AMaCC is "small but powerful."

I'll be honest.  I am flying blind here.  My environment does not
support any of the features required for the AMaCC test environemtn.
I just took a wild guess at what was needed, and will likely have
trouble correcting any errors in what I've done.
@jserv
Copy link
Owner

jserv commented Feb 7, 2022

I am thinking of the feasibility to change option -p to -O which indicates the generation of peephole optimization aware code. It would be a bit similar to Link-Time Optimization. Later, when we add more optimizers, the option -O can be acting the role to classify and pass through.

@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 8, 2022

@jserv To clarify, do you want 'amacc -O -o foo foo.c'? I am worried that people will forget the squint step, and believe that the '-O' is applying the optimization directly on the executable, since that is standard compiler convention. I have found that people do not pay close attention when trying to use tools. Using -Op from the start may add clarity.

@jserv
Copy link
Owner

jserv commented Feb 9, 2022

@jserv To clarify, do you want 'amacc -O -o foo foo.c'? I am worried that people will forget the squint step, and believe that the '-O' is applying the optimization directly on the executable, since that is standard compiler convention. I have found that people do not pay close attention when trying to use tools. Using -Op from the start may add clarity.

I agree with the -Op option. It would be more consistent with peephole optimizer.

squint.c Outdated Show resolved Hide resolved
squint.c Outdated Show resolved Hide resolved
squint.c Outdated Show resolved Hide resolved
@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 14, 2022

FYI I just modified squint for fun to run as a shared object library. Here is a jit() run:

$ gcc -o amacc amacc.c -Wl,-rpath=/home/pi/amacc -ldl libsquint.so

$ time ./amacc tests/sieve.c 
real	0m7.786s
user	0m7.774s
sys	0m0.011s

$ time ./amacc -Op tests/sieve.c 
real	0m2.384s
user	0m2.354s
sys	0m0.022s
diff --git a/amacc.c b/amacc.c
index ee21887..5e9185c 100644
--- a/amacc.c
+++ b/amacc.c
@@ -22,6 +22,7 @@
 #include <sys/stat.h>
 #include <fcntl.h>
 #include <dlfcn.h>
+#include "squint.h"
 
 /* 64-bit host support */
 #if defined(__x86_64__) || defined(__aarch64__)
@@ -1831,6 +1832,8 @@ int jit(int poolsz, int *main, int argc, char **argv)
     if (je >= jitmap) die("jitmem too small");
     *tje = reloc_bl(jitmap[((int) main - (int) text) >> 2] - (int) tje);
 
+    if (peephole) peephole_opt((int *)jitmem, je-1);
+
     // hack to jump into specific function pointer
     __clear_cache(jitmem, je);
     int *res = bsearch(&sym, sym, 1, 1, (void *) _start);

I can configure the makefile to build as standalone or shared library if you are interested. If -Op is enabled I can add libsquint.so to the ELF .so list. (just changed everything to -Op for what it will look like on next commit).

@jserv
Copy link
Owner

jserv commented Feb 17, 2022

FYI I just modified squint for fun to run as a shared object library. Here is a jit() run:
[...]
I can configure the makefile to build as standalone or shared library if you are interested. If -Op is enabled I can add libsquint.so to the ELF .so list. (just changed everything to -Op for what it will look like on next commit).

I like the idea to run Squint as a shared library, which means that AMaCC can compile itself while Squint can be built with optimizing compilers such as gcc and clang. Can you use dlopen and dlsym to load libsquint.so?

After you resolve the issues found by @loganchien, I am about to merge this effort.

HPCguy added 2 commits February 18, 2022 03:42
There is a lot here to keep track of.  It would be nice
to address any further issues in a subsequent pull request.
A very small Squint follow-up pull requests is planned to
improve/finalize squint testing and to add an option to
configure squint as a shared library in addition to current
standalone mode.
@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 19, 2022

I have resolved all outstanding issues I am aware of.

squint.c Show resolved Hide resolved
@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 19, 2022

Let me be blunt. It has been 14 days since this pull request has been essentially ready to go, and 20 days total in a pull request state. At several points, I asked if Squint was too much for AMaCC to accept, and you replied that you thought it was appropriate. The one bug that @loganchien found is one which will never appear in actual code, whereas AMaCC itself is riddled with bugs when trying to write the simplest of code fragments. I have fixed many such bugs since I began contributing to this project six months ago that are now a part of AMaCC. I find it disrespectful that I have had this pull request open for 20 days. You need to accept this pull request by tomorrow or I close it, and continue with AMaCC in my own repository on Github. While I am upset in this comment, I think all my points are both true and reasonable. I think AMaCC is a great project, and I would like to continue contributing here rather than on my own, but I have to trade that against being held back from being able to do good work in a timely fashion.

@loganchien
Copy link
Contributor

loganchien commented Feb 19, 2022

I understand your frustration for the long review time and I feel sorry about it. But I must clarify that I do not intend to be disrespectful. I am happy to see the change and willing to see it to be merged eventually.

The prolonged review time is the direct result of two reasons: (1) the magnitude of squint.cc and (2) the limited spare time I have. To review the code properly, I have to think thoroughly on (1) what invariants are you trying to maintain in the program, (2) are all boundary cases properly handled, (3) what can go wrong, and (4) sometimes I have to check the ARM reference manual to ensure the assumption is correct (even if they are just two line of code like Line 688-689). All of these are time consuming work and unfortunately I only have roughly 8 hours per week for this work (and might be even shorter if some unexpected over-time work or house keeping work is taking away my time).

I think I won't be able to review it at a faster speed. I leave it to jserv@ to decide whether he wants to take this PR as-is and maybe fix any bugs found later. I am only trying to be helpful here. Thank you for your understanding.

@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 19, 2022

I would feel it as a responsibility to fix any bugs as they arise. Since this is an auxillary program and not a part of AMaCC proper, the risks of impacting AMaCC are very low. Indeed Squint could be easily jettisoned with less than a minute of makefile and repo changes at any time without impact to AMaCC.

@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 20, 2022

Hi. I have decided to create my own repo, since the turnaround time for review here is too long for my development needs. That said, I will leave Squint as a pull request in case you want to merge it into AMaCC. Feel free to cancel if you don't want it. As I do AMaCC bug fixes on my own repo, I will submit some of them as pull requests to this repo. What you've done with this compiler is really phenomenal, and I chose it vs Github competitors for that reason. Thank you for such an amazing compiler to work with, based mostly on C4 but going further just the right amount. Now it is my turn to split off as you did, and hopefully continue improving this amazing educational compiler. It will be in a new repo called Squint under HPCguy.

@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 20, 2022

PS @loganchien Thank you for the time you took to review this. I know you sacrificed personal time, and I appreciate your careful analysis to make sure AMaCC would maintain its reputation before allowing the merge.

@jserv
Copy link
Owner

jserv commented Feb 20, 2022

I wrote AMaCC in my spare time as one of volunteer projects. I appreciate the contributions from @HPCguy, @loganchien, and @lecopzer. Based on c4, AMaCC is now moving forward to be more practical and usable -- I could not imagine this when I started this project in 2016. Per the consideration of @loganchien, I think we can leave some messages in git log and merge the effort for experimental purpose. For future commits, more enhancements can be landed and targeting stronger Squint.

@HPCguy
Copy link
Contributor Author

HPCguy commented Feb 20, 2022

And thank you for your incredible work! I have started a new repo here: https://github.com/HPCguy/Squint . I encourage you to Read the README there. My mind does not deal well with outstanding pull requests, and I want to plow ahead rapidly on some new development. I will back-patch most bugfixes from the new repo into AMaCC, along with select features that I think are appropriate for AMaCC. I need the new repo for my mental health, so I can rapidly expand the capabilities of AMaCC without some constraints I have been putting on myself to stay true to the "small but powerful" AMaCC model. I think AMaCC provides a wonderful educational code base, and I don't want to mess that up.

Copy link
Contributor

@loganchien loganchien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Progress so far:

  • apply_peepholes1
  • apply_peepholes2 (partial)
  • apply_peepholes4
  • simplify_branch1
  • simplify_branch2
  • simplify_branch3
  • simplify_branch4 (I couldn't understand)
  • simplify_branch5

squint.c Show resolved Hide resolved
squint.c Show resolved Hide resolved
squint.c Show resolved Hide resolved
squint.c Show resolved Hide resolved
squint.c Show resolved Hide resolved
squint.c Show resolved Hide resolved
squint.c Outdated Show resolved Hide resolved
squint.c Outdated Show resolved Hide resolved
squint.c Outdated Show resolved Hide resolved
squint.c Show resolved Hide resolved
squint.c Outdated Show resolved Hide resolved
squint.c Show resolved Hide resolved
if ((*scan & 0x0e000000) == 0x0a000000) {
hasLink = *scan & (1<<24);
if ( ((mode == 0) && !hasLink) || // intra-function mode
((mode != 0) && hasLink) ) { // inter-function mode
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what do you mean with the term intra-function and inter-function.

I guess that by intra-function you mean that NOP removal will only be applied to basic blocks within a function and the function address won't be changed; OTOH, by inter-function you mean that NOP removal will be applied across functions (in addition to the basic block).

If my "guess" is correct, shouldn't the condition:

            if ( ((mode == 0) && !hasLink) ||   // intra-function mode
                 ((mode != 0) &&  hasLink) ) {  // inter-function mode

be:

            if ( !hasLink ||   // basic block branch for intra/inter-function mode
                 ((mode != 0) &&  hasLink) ) {  // inter-function mode

Copy link
Contributor Author

@HPCguy HPCguy Feb 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use scripts/disasm to compare the code before applying Squint and after Squint, you will see large NOP blocks between functions. The inter-procedual call to relocate_nop removes those NOP blocks between functions (after I fix the bug, which is on my list, but at low priority right now). The intra-function call to relocate-NOP takes the NOPS produced by peephole optimization out of a given function. I believe this is what you said, but I just want to clarify it is a two step process, because that is important. First, intra-function is applied to all functions, and second you apply inter-function compression. I'm pretty sure the conditional is correct, but if I am wrong, I will fix it when I get back to this within a few months.

squint.c Show resolved Hide resolved
Copy link
Contributor

@loganchien loganchien left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have left all the comments I would like to say for the portion I've reviewed.

However, the mind set differences between the author and me is huge. I don't see a chance to reach an agreement. To break the tie, I decide to step down as a reviewer. Maybe other reviewers can do a better job then me. Best regards.

for (i = 0, j = 0; i < cnst_pool_size; ++i) {
val = cbegin[cnst_pool[i].data_addr/4];
rotate = 0;
if ((val & 0x0000ff00) && !(val & 0xffff00ff)) rotate = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be generalized by a function that can match all possible modified immediate:

// Returns the ARM modified immediate if x can be encoded as a modified
// immediate; otherwise, return -1.
int encode_arm_modified_immediate(unsigned x) {
  int rotate;
  for (rotate = 0; rotate < 16; ++rotate) {
    if ((x & 0xffffff00) == 0) {
      return (rotate << 8) | x;
    }
    x = (x << 2) | (x >> 30);  // Rotate 2 bits from MSB to LSB
  }
  return -1;
}

(p.s. It is possible to further improve the performance with __builtin_clz and __builtin_ctz, but I don't have time to think about it thoroughly.)

And then check the constant value with:

int mov_imm_inst = 0;
int immediate;
if ((immediate = encode_arm_modified_constant(val)) != -1) {
  mov_imm_inst = 0xe3a00000;  // mov
} else if ((immediate = encoding_arm_modified_constant(~val)) != -1) {
  mov_imm_inst = 0xe3e00000;  // mvn
}
if (mov_imm_inst) {
  inst = cnst_pool[i].inst;
  while (inst != 0) {
    next = inst->next;
    newinst = cbegin + inst->inst_addr/4;
    *newinst = mov_imm_inst | (*newinst & RI_Rd) | immediate;
    free(inst);
  }
}

Copy link
Contributor Author

@HPCguy HPCguy Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

x = (x << 2) | (x >> 30); // Rotate 2 bits from MSB to LSB

I believe 0x80000000 >> 30 will smear a 1 through the top 30 bits, corrupting x. I see your point though, and I will try to devise an implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe 0x80000000 >> 30 will smear a 1 through the top 30 bits, corrupting x.

I used unsigned as the type for x. x >> 30 should be translated to logical shift right (LSR), which should pad zeros instead of the most significant bit (MSB).

Copy link
Contributor Author

@HPCguy HPCguy Mar 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I fix issue #94 in AMaCC, amacc will be able to compile Squint, at least in my repo. I'm hoping that if I add optimizations to Squint that beat gcc -O1 on a wide variety of tests, you guys might accept some Squint changes back, assuming you guys were to accept the current pull request. I'm hoping to beat gcc -O1 on a variety of tests with as little as 500 lines of additional optimizations in Squint.

@HPCguy
Copy link
Contributor Author

HPCguy commented Mar 10, 2022

I don't mind the suggested peephole checks being changed after this pull request, or any other changes anyone may want to add. I just won't be the one to do it, on philisophical grounds.

You have the option to discard this pull request with no ill feelings on my part. I suspected this might be too big for AMaCC when I started, which is why I was hesitant in my first post for this pull request.

In any event, I will continue to submit small bug fix pull requests to AMaCC that touch 20 or less lines. I will return to fixing my amacc "issue" submissions around March 14, after I have completed float data type support for this code base on my personal repo @ https://github.com/HPCguy/Squint.

@HPCguy
Copy link
Contributor Author

HPCguy commented May 14, 2022

The compiler is now beating gcc -O3 for the Sieve of Eratosthenes problem. Floating point is decently implemented and runs a physics benchmark in about 2/3 the time of Gcc -O0.

The https://github.com/HPCguy/Squint/blob/main/README.md page now has tables for executable size, compile time, and runtime comparisons.

After a few more performance enhanchements (I want floating point to beat gcc -O1), I will likely start fixing AMaCC bugs, and backporting those changes here.

@HPCguy
Copy link
Contributor Author

HPCguy commented Jun 2, 2022

The floating point execution performance of my extended version of amacc in the repository https://github.com/HPCguy/Squint is now beating gcc -O3 floating point performance.

@HPCguy
Copy link
Contributor Author

HPCguy commented Jul 19, 2022

I'm going to close this pull request, since Github HPCguy/Squint supersedes this work. AMaCC is mentioned liberally within that project, and it never would have been completed without the excellent development in the early AMaCC compiler.

@HPCguy HPCguy closed this Jul 19, 2022
@HPCguy HPCguy deleted the squint0 branch July 19, 2022 05:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants