
Inference speed [-Ofast] #20

Open
krzysztof-jusiak opened this issue Jul 24, 2023 · 15 comments

@krzysztof-jusiak

krzysztof-jusiak commented Jul 24, 2023

In the spirit of the project, adding extra compilation flags seems like it complicates things; however, the -Ofast flag is easy to apply. -Ofast is -O3 plus -ffast-math and some other optimizations (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html).

It almost doubles the inference speed, so it might be worth considering.
The results with -O3 and -Ofast are the same, though -ffast-math doesn't guarantee that.

  - O3:    160t/s
  - Ofast: 307t/s
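
For reference, it's a one-flag change to the compile command (a hedged example; the exact gcc invocation in the README may differ):

gcc -Ofast -o run run.c -lm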
@karpathy
Owner

Wow, this works too. And it stacks with -O3 and -funsafe-math-optimizations.
I'm not as familiar with all of these flags; are they all present in all versions of gcc on all platforms? Are they "safe"?
Amazing.

@krzysztof-jusiak
Author

krzysztof-jusiak commented Jul 24, 2023

Yeah, -Ofast is pretty standard (supported by clang and gcc for a long time).
Since -Ofast adds -ffast-math, it's not 100% safe, as it disregards strict standards compliance; however, the output for the run is the same with -O3 and -Ofast, so it's pretty safe in that regard.

BTW, clang produces slightly faster inference too:
clang -Ofast run.c -o run

clang-15 -Ofast: 321t/s
gcc-12 -Ofast: 307t/s

It's likely that ~90% of the time is spent in matmul when multiplying by the weights. One easy way to tackle that without losing much simplicity and/or portability would be to use vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html).

That would allow using SIMD in a portable way with minimal changes.

typedef float v4sf __attribute__ ((vector_size (16)));

and then arithmetic operations on that type operate on four elements in parallel.
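
A minimal sketch of how the matmul inner loop could use this (hypothetical dot_v4 helper, reusing the v4sf typedef above; assumes n is a multiple of 4, so a real version would also need a scalar tail loop):

// dot product of one weight row with x, 4 floats at a time, using GCC/clang vector extensions
static float dot_v4(const float* w_row, const float* x, int n) {
    v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int j = 0; j < n; j += 4) {
        v4sf wv, xv;
        __builtin_memcpy(&wv, w_row + j, sizeof wv);  // unaligned-safe loads
        __builtin_memcpy(&xv, x + j, sizeof xv);
        acc += wv * xv;                               // 4 multiplies + 4 adds per iteration
    }
    return acc[0] + acc[1] + acc[2] + acc[3];         // horizontal sum
}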

Another option would be to use OpenMP for parallelism:

void matmul(float* xout, float* x, float* w, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}

That would require OpenMP and compiling with -fopenmp to enable it, but it's quite portable and it's a one-line change.

clang -Ofast -fopenmp -o run run.c

produces 480t/s with that one-line change.

@karpathy
Owner

I appreciate the writeup, thank you! I will take a look. I'm in favor of adding anything that will keep most of the simplicity and portability.

@akx

akx commented Jul 24, 2023

I wrote a small script to compile the program with different flags,

import hashlib
import os

flag_sets = [
    "",
    "-O3",
    "-O3 -mtune=native -march=native",
    "-Ofast -mtune=native -march=native",
    "-Ofast -funsafe-math-optimizations -ffast-math -mtune=native -march=native",
    "-Ofast -funsafe-math-optimizations -ffast-math -fopenmp -mtune=native -march=native",
]

for flag_set in flag_sets:
    exec_name = f"out-{hashlib.sha256(flag_set.encode()).hexdigest()[:8]}"
    compiler = "/opt/homebrew/opt/llvm/bin/clang" if "openmp" in flag_set else "clang"
    command = f"{compiler} {flag_set} -o {exec_name} run.c"
    print(command, end=": ", flush=True)
    os.system(command)
    os.system(f"./{exec_name} out/model.bin | grep ^achieved")

and on my M2 Max laptop the results are pretty impressive (though I'm not sure why the OpenMP version ends up being comparatively so slow).

clang  -o out-e3b0c442 run.c: achieved tok/s: 19.673734
clang -O3 -o out-ab3e680b run.c: achieved tok/s: 127.886245
clang -O3 -mtune=native -march=native -o out-46e214a9 run.c: achieved tok/s: 128.765640
clang -Ofast -mtune=native -march=native -o out-ae7e7e2b run.c: achieved tok/s: 677.293472
clang -Ofast -funsafe-math-optimizations -ffast-math -mtune=native -march=native -o out-380cc2e2 run.c: achieved tok/s: 673.834937
/opt/homebrew/opt/llvm/bin/clang -Ofast -funsafe-math-optimizations -ffast-math -fopenmp -mtune=native -march=native -o out-ca4e3427 run.c: achieved tok/s: 96.132073

@krzysztof-jusiak
Author

Oh, it's because clock() measures CPU time. After changing it to gettimeofday the results are much better with OpenMP.

clang -Ofast -march=native  run.c  -lm  -o run // achieved tok/s: 340.878828
clang -Ofast -fopenmp -march=native  run.c  -lm  -o run // achieved tok/s: 524.590164
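
For reference, a minimal wall-clock timer sketch based on gettimeofday (hypothetical time_in_ms helper; the actual change in the code may differ):

#include <sys/time.h>

// wall-clock time in milliseconds; clock() instead reports CPU time summed over all threads
long time_in_ms() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long)tv.tv_sec * 1000 + tv.tv_usec / 1000;
}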

BTW, -Ofast implies -ffast-math, and -ffast-math implies -funsafe-math-optimizations, so -Ofast + -march=native is already the best combination.

@krzysztof-jusiak
Author

Added #27 for reference ☝️

More experiments can be done with the number of threads, scheduler, etc. - https://www.openmp.org/spec-html/5.0/openmpch6.html#x287-20510006

@krzysztof-jusiak
Author

It seems that, at least on my Intel CPU, 4 threads is the sweet spot:

OMP_NUM_THREADS=4 ./run out/model.bin // achieved tok/s: 622.871046

That can also be defaulted in the code with

#pragma omp parallel for num_threads(4)

but that's probably too much coupling and complicates things a bit more.

@krzysztof-jusiak
Author

krzysztof-jusiak commented Jul 24, 2023

Actually, with mmap enabled (#30) and gcc (not clang), the performance looks even better (it seems backend bound). This is still without vectorizing matmul.

OMP_NUM_THREADS=8 taskset -c 1-8 ./run out/model.bin
achieved tok/s: 907.801418

perf stat

 Performance counter stats for './run out/model.bin':

          1,313.02 msec task-clock                       #    3.949 CPUs utilized             
                 7      context-switches                 #    5.331 /sec                      
                 3      cpu-migrations                   #    2.285 /sec                      
             2,253      page-faults                      #    1.716 K/sec                     
     5,458,396,894      cpu_core/cycles/                 #    4.157 G/sec                     
     <not counted>      cpu_atom/cycles/                                                        (0.00%)
     8,039,822,904      cpu_core/instructions/           #    6.123 G/sec                     
     <not counted>      cpu_atom/instructions/                                                  (0.00%)
     1,144,340,855      cpu_core/branches/               #  871.533 M/sec                     
     <not counted>      cpu_atom/branches/                                                      (0.00%)
           567,726      cpu_core/branch-misses/          #  432.382 K/sec                     
     <not counted>      cpu_atom/branch-misses/                                                 (0.00%)
    32,361,450,930      cpu_core/slots/                  #   24.647 G/sec                     
     7,061,847,839      cpu_core/topdown-retiring/       #     21.8% Retiring                 
        31,373,084      cpu_core/topdown-bad-spec/       #      0.1% Bad Speculation          
     1,142,374,971      cpu_core/topdown-fe-bound/       #      3.5% Frontend Bound           
    24,158,077,283      cpu_core/topdown-be-bound/       #     74.6% Backend Bound            
       158,143,683      cpu_core/topdown-heavy-ops/      #      0.5% Heavy Operations          #     21.3% Light Operations         
        31,338,821      cpu_core/topdown-br-mispredict/  #      0.1% Branch Mispredict         #      0.0% Machine Clears           
       189,379,719      cpu_core/topdown-fetch-lat/      #      0.6% Fetch Latency             #      2.9% Fetch Bandwidth          
    12,049,852,779      cpu_core/topdown-mem-bound/      #     37.2% Memory Bound              #     37.4% Core Bound               

@tringwald

With OMP and mmap enabled (#30), I'm consistently achieving >2600 tokens/s on my Ryzen 7950x.

$ gcc -Ofast -fopenmp -mtune=native -march=native run.c -lm -o run
$ OMP_NUM_THREADS=16 ./run out/model.bin 0.9 0 | grep "^achieved"
achieved tok/s: 2694.736842

Another optimization would be to replace exp() in softmax() with expf(). This avoids the conversion to double, but may alter the result slightly. However, it provides another big speedup of ~700 tokens/s.

$ OMP_NUM_THREADS=16 ./run out/model.bin 0.9 0 | grep "^achieved"
achieved tok/s: 3324.675325
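
For reference, a sketch of the change (assuming run.c's softmax(float* x, int size) signature):

#include <math.h>

void softmax(float* x, int size) {
    // find max value (for numerical stability)
    float max_val = x[0];
    for (int i = 1; i < size; i++) {
        if (x[i] > max_val) max_val = x[i];
    }
    // exp and sum; expf stays in single precision, exp() would round-trip through double
    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        x[i] = expf(x[i] - max_val);
        sum += x[i];
    }
    // normalize
    for (int i = 0; i < size; i++) {
        x[i] /= sum;
    }
}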

@karpathy
Owner

karpathy commented Jul 24, 2023

Ok wow, go OMP! :) I'll merge the PR @krzysztof-jusiak. Every day I speed up the code too much and have to train a bigger model. I published the 44M model yesterday, but even that one is too fast now. So I'm now training a GPT-1 sized model (768 dim, 12 layers, 12 heads, 1024 context), 110M params, which will be done training sometime later today.

@krzysztof-jusiak
Author

@tringwald that's blazing fast compared to the i7-12650H; is that because of the Ryzen 7950x's AVX-512 support?
Looking at matmul (https://godbolt.org/z/qndT8Eb1b), the generated code is definitely far from optimal, but it's vectorized at least. Improving matmul would definitely accelerate the inference, but it's hard to do without adding complexity, using external libraries, and/or using a computation graph, so there are trade-offs here.

@tringwald

tringwald commented Jul 24, 2023

@krzysztof-jusiak Well, the comparison is not really fair. The i7-12650H is a mobile segment CPU (with only 6 p-cores) vs. a high-end desktop Ryzen 7950x (with 16 full cores). I also assume my overclocked DDR5 RAM helps quite a lot ;)
The biggest speedup would probably be CUDA, but that's out of scope for this project, I guess.

Edit:
With regard to AVX-512, you are probably right. Compiling and running with -march=x86-64 (which excludes AVX) results in a large performance downgrade.

$ gcc -Ofast -fopenmp -march=x86-64 run.c -lm -o run
$ OMP_NUM_THREADS=16 ./run out/model.bin 0.9 0 | grep "^achieved"
achieved tok/s: 2098.360656

vs.

$ gcc -Ofast -fopenmp -march=native run.c -lm -o run
$ OMP_NUM_THREADS=16 ./run out/model.bin 0.9 0 | grep "^achieved"
achieved tok/s: 2509.803922

Using -fopt-info shows that with -march=native many blocks were vectorized with 64-byte vectors instead of the 16-byte vectors used for generic x86-64.
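
For reference, the vectorization report can be requested with the -fopt-info-vec variant, e.g.:

$ gcc -Ofast -fopenmp -march=native -fopt-info-vec run.c -lm -o run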

@krzysztof-jusiak
Author

Thank you @tringwald, makes sense 👍 Also, I totally agree that GPU would be the fastest option, but bigger models would require quantization or something similar to fit in memory, which is most likely out of scope for this project. CPU-only inference (float16) can still handle 7B with reasonable speed without quantization.

@emilmelnikov

emilmelnikov commented Jul 24, 2023

Looking at matmul (https://godbolt.org/z/qndT8Eb1b), the generated code is definitely far from optimal, but it's vectorized at least. Improving matmul would definitely accelerate the inference, but it's hard to do without adding complexity, using external libraries, and/or using a computation graph, so there are trade-offs here.

Clang vectorizes and unrolls the inner reduction loop in matmul very well just with -Ofast. GCC codegen with __attribute__((optimize("unroll-loops"))) is not that good, but also does the job in the basic block at lines 76-95: https://godbolt.org/z/9e163bjEx.
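
For reference, a sketch of applying that attribute to matmul (GCC-specific; clang ignores the optimize function attribute):

// ask GCC to unroll loops in this function only
__attribute__((optimize("unroll-loops")))
void matmul(float* xout, float* x, float* w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}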

@emilmelnikov

emilmelnikov commented Jul 24, 2023

I ran a quick test on AWS c6i.large with Ubuntu 22.04 and GCC 11.2, and the unroll-loops attribute did not result in a higher tokens-per-second. Compiling with clang on the same machine also produced similar numbers. Interestingly, with -march=native GCC used only AVX2 YMM registers, but clang used AVX-512 ZMM. The generated assembly for matmul looked almost identical to the Godbolt snippet from above, so at this point we are probably bottlenecked by memory.
