
optimization flags #292

Open
catfact opened this issue Nov 15, 2017 · 3 comments
Comments

@catfact (Collaborator) commented Nov 15, 2017

following the discussion on lines:
https://llllllll.co/t/modern-c-programming-tips-and-tricks/3039/124?u=zebra

we need to free up some more code space in bees, so it's probably time to move away from -O3

from the full list of gcc optimization options, these are the ones i've found to be supported by avr32-gcc:

OPTIMIZATION = -finline-functions \
-funswitch-loops \
-fpredictive-commoning \
-fgcse-after-reload \
-ftree-loop-distribution \
-fvect-cost-model \
-fpeel-loops \
-fipa-cp-clone \
-fthread-jumps \
-falign-functions \
-falign-jumps \
-falign-loops \
-falign-labels \
-fcaller-saves \
-fcrossjumping \
-fcse-follow-jumps \
-fcse-skip-blocks \
-fdelete-null-pointer-checks \
-fexpensive-optimizations \
-fgcse -fgcse-lm \
-finline-small-functions \
-findirect-inlining \
-fipa-cp \
-foptimize-sibling-calls \
-fpeephole2 \
-freorder-blocks-and-partition \
-freorder-functions \
-frerun-cse-after-loop \
-fsched-interblock \
-fsched-spec \
-fschedule-insns \
-fschedule-insns2 \
-fstrict-aliasing \
-ftree-builtin-call-dce \
-ftree-switch-conversion \
-ftree-pre \

currently experimenting with which flags have the greatest impact on code size.

profiling for speed is quite a bit harder, but i'll construct some suitable scenes and start flipping a GPIO in the main event loop.
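roughly what i have in mind, as a minimal sketch (assuming the ASF gpio driver; the pin choice and the surrounding function are placeholders, not actual bees code):

#include "gpio.h"  // ASF gpio driver

#define PROFILE_PIN AVR32_PIN_PA07  // placeholder: any spare pin broken out to a header

// bracket the region of the event loop under test; the pulse width on a
// scope gives the execution time of one pass
void handle_one_event(void) {
  gpio_set_gpio_pin(PROFILE_PIN);   // pin high: start of measured region
  // ... dequeue and process one event here ...
  gpio_clr_gpio_pin(PROFILE_PIN);   // pin low: end of measured region
}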

any advice would be greatly appreciated

@catfact changed the title from "optimization" to "optimization flags" on Nov 15, 2017
@boqs (Contributor) commented Nov 15, 2017

could we modify the build process such that certain large ops (for example op_kria) get compiled with -Os, while the majority of the code gets -O3? I was able to mix & match optimisation settings like this for a desktop linux application - there I was trading off compile speed against execution speed...
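for instance, something along these lines might avoid touching the ASF makefile at all, assuming avr32-gcc is new enough (GCC >= 4.4 added the optimize pragma/attribute) - just a sketch of the idea, not tested on this toolchain, and the function name is made up:

/* at the top of op_kria.c, before any function definitions:
   overrides the command-line -O level for this whole translation unit */
#pragma GCC optimize ("Os")

/* or shrink a single (hypothetical) function and leave the rest at -O3 */
__attribute__((optimize("Os")))
static void op_kria_process_step(void) {
  /* ... */
}

the build-side route (a target-specific CFLAGS override for op_kria.o in the makefile) would accomplish the same thing if the pragma turns out not to be supported.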

@catfact (Collaborator, Author) commented Nov 15, 2017

oof... i suppose it's certainly possible... but it sounds like an awful slog through the ASF makefile and so on.

some outputs for the kria op:

with -O3:
[emb@bat ops]$ avr32-size -Bx op_kria.o
text data bss dec hex filename
0x6b20 0x10 0x39 27497 6b69 op_kria.o

with -O2:
[emb@bat ops]$ avr32-size -Bx op_kria.o
text data bss dec hex filename
0x2e90 0x10 0x39 11993 2ed9 op_kria.o

with -Os:
[emb@bat ops]$ avr32-size -Bx op_kria.o
text data bss dec hex filename
0x2cec 0x10 0x39 11573 2d35 op_kria.o

so it doesn't look like -Os is buying us much compared to -O2 in that case (it mostly just skips code reordering/alignment). but something in -O3 is really putting the hurt on code size

i'm hitting some roadblocks trying to set all the flags manually. some inline functions in headers (in the FatFs lib) are getting stripped out before link... will keep looking at it; i'd like to know whether the size exploder is inlining or unrolling or something else.
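side note on the header-inline breakage, as i understand it: depending on whether the toolchain uses gnu89 or c99 inline semantics, a plain inline (or gnu89 extern inline) definition in a header may emit no out-of-line copy at all, so any call the compiler declines to inline has nothing to link against. static inline sidesteps that by giving each translation unit its own local copy. toy sketch, names are made up (not FatFs code):

/* in a shared header: */

/* risky: with c99 semantics (or gnu89 'extern inline'), this emits no
   standalone definition, so a non-inlined call goes unresolved at link */
inline int wrap_index(int i, int n) { return (i >= n) ? (i - n) : i; }

/* safe either way: every .c that includes the header gets its own private
   copy, whether or not the call is actually inlined */
static inline int wrap_index_safe(int i, int n) { return (i >= n) ? (i - n) : i; }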

but i wouldn't be opposed to just bumping everything down to -O2 and seeing what happens.

@catfact (Collaborator, Author) commented Nov 15, 2017

more data points. been using -O2 with extra flags on top and trying to find which ones are the size-exploders.


with -O2:

.text 0x20138 0x80008008
.rodata 0x618c 0x80028400
.data 0x1c64 0x8

total (.text + .rodata + .data) = 163588 B


with -O3:

.text = 0x2fe6c
total = 228408


adding the -O3 flags i've determined to be most size-costly:

OPTIMIZATION = -O2 \
-fpeel-loops \
-finline-functions \
-finline-small-functions \
-fipa-cp-clone

.text = 0x2f144
total = 225040 B


and the -O3 flags i've determined to be least costly:

OPTIMIZATION = -O2 \
-fpredictive-commoning \
-ftree-loop-distribution \
-fexpensive-optimizations \
-funswitch-loops \
-fgcse-after-reload

.text 0x20408 0x80008008
total = 164308 B


so... seems like the biggest culprits are inlining and loop peeling (makes sense). unfortunately i'd guess they are also the most effective for speed.

notes on some of the other flags:

  • -fpredictive-commoning : reuses memory loads and array accesses computed in earlier loop iterations instead of redoing the indexing every pass. we do a lot of array indexing in loops, so probably a good idea

  • -fgcse-after-reload : runs an extra redundant-load-elimination pass after register allocation to clean up spill code. sounds good.

  • -ftree-loop-distribution : splits big loops into several smaller ones so the pieces can be vectorized / parallelized or cache better. can't see this mattering much.

  • -funswitch-loops : if a loop body contains a loop-invariant conditional, move it outside the loop and duplicate the loop body in each branch (see the toy sketch just after this list). i'm surprised that enabling this doesn't contribute more to code size, but maybe we aren't doing many conditionals in loop bodies (best avoided anyway.)

  • no one seems to know quite what -fexpensive-optimizations does; the gcc docs just say the optimizations are "minor." doesn't seem to use much space.
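a toy illustration of what unswitching does (not bees code), to make the size/speed tradeoff concrete:

/* before: the loop-invariant test runs every iteration */
void scale(int *buf, int n, int use_gain, int gain) {
  int i;
  for (i = 0; i < n; i++) {
    if (use_gain) { buf[i] *= gain; }
    else          { buf[i] = 0; }
  }
}

/* after unswitching: the test is hoisted and the loop body is duplicated --
   faster inner loops, but roughly twice the code for this loop */
void scale_unswitched(int *buf, int n, int use_gain, int gain) {
  int i;
  if (use_gain) {
    for (i = 0; i < n; i++) { buf[i] *= gain; }
  } else {
    for (i = 0; i < n; i++) { buf[i] = 0; }
  }
}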


so i'm going to go ahead and use that last block for now. the profiling session will focus on seeing whether we get significant gains from the more space-costly -O3 flags: -fpeel-loops, -finline-functions, and -finline-small-functions.
