Inference speed [-Ofast] #20
Comments
Wow, this works too. And it stacks with -O3 and -funsafe-math-optimizations. |
Yeah, -Ofast is pretty standard (supported by clang and gcc for a long time). BTW, clang also produces slightly faster inference.
It's likely that 90% of the performance is in matmul when multiplying the weights. One easy and portable way to tackle that, without losing much simplicity and/or portability, would be to use vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). That would allow using SIMD in a portable way with minimal changes,
and then arithmetic operations would operate on N elements in parallel. An additional option would be to use OpenMP for parallelism,
but that would require OpenMP and compiling with -fopenmp to enable it; it would be quite portable, and it's a one-line change
that produces 480 tok/s.
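A rough sketch of what combining both ideas could look like in the matmul hot loop (assuming run.c's row-major W of shape (d, n); the function name, the memcpy-based loads, and the n % 8 restriction are illustrative only, not an actual patch):

```c
#include <stddef.h>
#include <string.h>

typedef float v8f __attribute__((vector_size(32))); // 8 packed floats (GCC/Clang vector extension)

// W (d,n) @ x (n,) -> xout (d,); assumes n is a multiple of 8 for brevity.
void matmul_simd(float* xout, const float* x, const float* w, int n, int d) {
    #pragma omp parallel for               // the "one line change" for thread-level parallelism
    for (int i = 0; i < d; i++) {
        v8f acc = {0.0f};
        for (int j = 0; j < n; j += 8) {
            v8f wv, xv;
            memcpy(&wv, w + (size_t)i * n + j, sizeof wv); // unaligned-safe loads
            memcpy(&xv, x + j, sizeof xv);
            acc += wv * xv;                // 8 multiply-adds per iteration
        }
        float val = 0.0f;
        for (int k = 0; k < 8; k++) val += acc[k]; // horizontal sum
        xout[i] = val;
    }
}
```
|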
I appreciate the writeup, thank you! I will take a look. I'm in favor of adding anything that will keep most of the simplicity and portability. |
I wrote a small script to compile the program with different flags:

```python
import hashlib
import os

flag_sets = [
    "",
    "-O3",
    "-O3 -mtune=native -march=native",
    "-Ofast -mtune=native -march=native",
    "-Ofast -funsafe-math-optimizations -ffast-math -mtune=native -march=native",
    "-Ofast -funsafe-math-optimizations -ffast-math -fopenmp -mtune=native -march=native",
]

for flag_set in flag_sets:
    exec_name = f"out-{hashlib.sha256(flag_set.encode()).hexdigest()[:8]}"
    compiler = "/opt/homebrew/opt/llvm/bin/clang" if "openmp" in flag_set else "clang"
    command = f"{compiler} {flag_set} -o {exec_name} run.c"
    print(command, end=": ", flush=True)
    os.system(command)
    os.system(f"./{exec_name} out/model.bin | grep ^achieved")
```

On my M2 Max laptop the results are pretty impressive (though I'm not sure why the OpenMP version ends up being comparatively so slow).
|
Oh, it's because clock() measures CPU time. After changing it to gettimeofday() the results with OpenMP are much better.
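A minimal sketch of a wall-clock timer based on gettimeofday() (the helper name and usage are illustrative):

```c
#include <stdio.h>
#include <sys/time.h>

// Wall-clock time in milliseconds. Unlike clock(), this does not add up CPU
// time across OpenMP threads, so tok/s comes out right with -fopenmp.
long time_in_ms(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000L + tv.tv_usec / 1000;
}

int main(void) {
    long start = time_in_ms();
    // ... run token generation here ...
    long elapsed = time_in_ms() - start;
    printf("elapsed: %ld ms\n", elapsed);
    return 0;
}
```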
BTW, -Ofast implies -ffast-math, and -ffast-math implies -funsafe-math-optimizations, so -Ofast + -march=native is already the best combination. |
Added #27 for reference ☝️ More experiments can be done with the number of threads, scheduler, etc. - https://www.openmp.org/spec-html/5.0/openmpch6.html#x287-20510006 |
It seems that, at least on my Intel CPU, 4 threads is the sweet spot: OMP_NUM_THREADS=4 ./run out/model.bin // achieved tok/s: 622.871046. That can also be defaulted in the code (see the sketch below),
but that's probably coupling things too much and complicates them a bit more.
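For reference, defaulting the thread count in code could look something like this (a sketch only; omp_set_num_threads is the standard OpenMP call, but the hard-coded 4 and the placement are just illustrative):

```c
#ifdef _OPENMP
#include <omp.h>
#endif

// Call once at startup, before the first parallel region.
static void set_default_threads(void) {
#ifdef _OPENMP
    omp_set_num_threads(4); // hypothetical hard-coded default; tune per machine
#endif
}
```
|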
Actually, with mmap enabled (#30) and gcc (not clang) the performance looks even better (it seems backend bound). It's still without vectorizing matmul.
perf stat
|
With OMP and mmap enabled (#30), I'm consistently achieving >2600 tokens/s on my Ryzen 7950x.
Another optimization would be to replace
|
Ok wow, go OMP! :) I'll merge the PR @krzysztof-jusiak . Every day I speed up the code too much and have to train a bigger model. I published the 44M model yesterday, but even that one is too fast now. So I'm now training a GPT-1 sized model (768 dim 12 layer 12 head 1024 context) 110M params, which will be done training somewhere later today. |
@tringwald that's blazing fast in comparison to the i7-12650H; is that because of the Ryzen 7950x's AVX-512 support? |
@krzysztof-jusiak Well, the comparison is not really fair. The i7-12650H is a mobile segment CPU (with only 6 p-cores) vs. a high-end desktop Ryzen 7950x (with 16 full cores). I also assume my overclocked DDR5 RAM helps quite a lot ;) Edit:
vs.
Using -fopt-info shows that for -march=native many blocks were vectorized with 64-byte vectors instead of the 16-byte vectors used for generic x86-64. |
Thank you @tringwald, makes sense 👍 I also totally agree that GPU would be the fastest option, but that would require quantization or something similar for bigger models to fit in memory, which is most likely out of scope for this project; CPU-only inference (float16) can still handle 7B models at reasonable speed without quantization. |
Clang vectorizes and unrolls the inner reduction loop in |
I ran a quick test on AWS |
In the spirit of the project, adding additional compilation flags seems to complicate things; however, the -Ofast compilation flag seems easy to apply. -Ofast is -O3 plus -ffast-math and some other optimizations (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html). It almost doubles the inference speed, so it might be worth considering.
The results with -O3 and -Ofast are the same, though fast-math doesn't guarantee that.