
Inference speed [-Ofast] #20

Open
krzysztof-jusiak opened this issue Jul 24, 2023 · 15 comments

@krzysztof-jusiak

krzysztof-jusiak commented Jul 24, 2023

In the spirit of the project, adding extra compilation flags seems like it complicates things; however, the -Ofast flag is easy to apply. -Ofast is -O3 plus -ffast-math and some other optimizations (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html).

It almost doubles the inference speed, so it might be worth considering.
The results with -O3 and -Ofast are the same, though -ffast-math doesn't guarantee that.

  - O3:    160t/s
  - Ofast: 307t/s
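
For reference, it's a one-flag change to the compile command (a hedged example; the exact gcc invocation in the README may differ):

gcc -Ofast -o run run.c -lm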
@karpathy
Owner

Wow, this works too. And it stacks with -O3 and -funsafe-math-optimizations.
I'm not as familiar with all of these flags; are they all present in all versions of gcc on all platforms? Are they "safe"?
Amazing.

@krzysztof-jusiak
Author

krzysztof-jusiak commented Jul 24, 2023

Yeah, -Ofast is pretty standard (supported by clang and gcc for a long time).
Since -Ofast adds -ffast-math, it's not 100% safe, as it disregards strict standards compliance; however, the output for the run is the same with -O3 and -Ofast, so it's pretty safe in that regard.

BTW, clang produces slightly faster inference too:
clang -Ofast run.c -o run

clang-15 -Ofast: 321t/s
gcc-12 -Ofast: 307t/s

It's likely that ~90% of the time is spent in matmul when multiplying by the weights. One easy way to tackle that without losing much simplicity and/or portability would be to use vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html).

That would allow using SIMD in a portable way with minimal changes.

typedef float v4sf __attribute__ ((vector_size (16)));

and then arithmetic operations on that type operate on four elements in parallel.
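
A minimal sketch of how the matmul inner loop could use this (hypothetical dot_v4 helper, reusing the v4sf typedef above; assumes n is a multiple of 4, so a real version would also need a scalar tail loop):

// dot product of one weight row with x, 4 floats at a time, using GCC/clang vector extensions
static float dot_v4(const float* w_row, const float* x, int n) {
    v4sf acc = {0.0f, 0.0f, 0.0f, 0.0f};
    for (int j = 0; j < n; j += 4) {
        v4sf wv, xv;
        __builtin_memcpy(&wv, w_row + j, sizeof wv);  // unaligned-safe loads
        __builtin_memcpy(&xv, x + j, sizeof xv);
        acc += wv * xv;                               // 4 multiplies + 4 adds per iteration
    }
    return acc[0] + acc[1] + acc[2] + acc[3];         // horizontal sum
}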

Another option would be to use OpenMP for parallelism:

void matmul(float* xout, float* x, float* w, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}

That would require OpenMP and compiling with -fopenmp to enable it, but it's quite portable and it's a one-line change.

clang -Ofast -fopenmp -o run run.c

produces 480t/s with that one-line change.

@karpathy
Owner

I appreciate the writeup, thank you! I will take a look. I'm in favor of adding anything that will keep most of the simplicity and portability.

@akx

akx commented Jul 24, 2023

I wrote a small script to compile the program with different flags,

import hashlib
import os

flag_sets = [
    "",
    "-O3",
    "-O3 -mtune=native -march=native",
    "-Ofast -mtune=native -march=native",
    "-Ofast -funsafe-math-optimizations -ffast-math -mtune=native -march=native",
    "-Ofast -funsafe-math-optimizations -ffast-math -fopenmp -mtune=native -march=native",
]

for flag_set in flag_sets:
    exec_name = f"out-{hashlib.sha256(flag_set.encode()).hexdigest()[:8]}"
    compiler = "/opt/homebrew/opt/llvm/bin/clang" if "openmp" in flag_set else "clang"
    command = f"{compiler} {flag_set} -o {exec_name} run.c"
    print(command, end=": ", flush=True)
    os.system(command)
    os.system(f"./{exec_name} out/model.bin | grep ^achieved")

and on my M2 Max laptop the results are pretty impressive (though I'm not sure why the OpenMP version ends up being comparatively so slow).

clang  -o out-e3b0c442 run.c: achieved tok/s: 19.673734
clang -O3 -o out-ab3e680b run.c: achieved tok/s: 127.886245
clang -O3 -mtune=native -march=native -o out-46e214a9 run.c: achieved tok/s: 128.765640
clang -Ofast -mtune=native -march=native -o out-ae7e7e2b run.c: achieved tok/s: 677.293472
clang -Ofast -funsafe-math-optimizations -ffast-math -mtune=native -march=native -o out-380cc2e2 run.c: achieved tok/s: 673.834937
/opt/homebrew/opt/llvm/bin/clang -Ofast -funsafe-math-optimizations -ffast-math -fopenmp -mtune=native -march=native -o out-ca4e3427 run.c: achieved tok/s: 96.132073

@krzysztof-jusiak
Author

Oh, it's because clock() measures CPU time. After changing it to gettimeofday the results are much better with OpenMP.

clang -Ofast -march=native  run.c  -lm  -o run // achieved tok/s: 340.878828
clang -Ofast -fopenmp -march=native  run.c  -lm  -o run // achieved tok/s: 524.590164
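
For reference, a minimal wall-clock timer sketch based on gettimeofday (hypothetical time_in_ms helper; the actual change in the code may differ):

#include <sys/time.h>

// wall-clock time in milliseconds; clock() instead reports CPU time summed over all threads
long time_in_ms() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long)tv.tv_sec * 1000 + tv.tv_usec / 1000;
}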

BTW, -Ofast implies -ffast-math, and -ffast-math implies -funsafe-math-optimizations, so -Ofast + -march=native is already the best combination.

@krzysztof-jusiak
Author

Added #27 for reference ☝️

More experiments can be done with the number of threads, scheduler, etc. - https://www.openmp.org/spec-html/5.0/openmpch6.html#x287-20510006

@krzysztof-jusiak
Author

It seems that, at least on my Intel CPU, 4 threads is the sweet spot:

OMP_NUM_THREADS=4 ./run out/model.bin // achieved tok/s: 622.871046

That can also be defaulted in the code with

#pragma omp parallel for num_threads(4)

but that's probably too much coupling and complicates things a bit more.

@krzysztof-jusiak
Author

krzysztof-jusiak commented Jul 24, 2023

Actually, with mmap enabled (#30) and gcc (not clang), the performance looks even better (it seems backend bound). This is still without vectorizing matmul.

OMP_NUM_THREADS=8 taskset -c 1-8 ./run out/model.bin
achieved tok/s: 907.801418

perf stat

 Performance counter stats for './run out/model.bin':

          1,313.02 msec task-clock                       #    3.949 CPUs utilized             
                 7      context-switches                 #    5.331 /sec                      
                 3      cpu-migrations                   #    2.285 /sec                      
             2,253      page-faults                      #    1.716 K/sec                     
     5,458,396,894      cpu_core/cycles/                 #    4.157 G/sec                     
     <not counted>      cpu_atom/cycles/                                                        (0.00%)
     8,039,822,904      cpu_core/instructions/           #    6.123 G/sec                     
     <not counted>      cpu_atom/instructions/                                                  (0.00%)
     1,144,340,855      cpu_core/branches/               #  871.533 M/sec                     
     <not counted>      cpu_atom/branches/                                                      (0.00%)
           567,726      cpu_core/branch-misses/          #  432.382 K/sec                     
     <not counted>      cpu_atom/branch-misses/                                                 (0.00%)
    32,361,450,930      cpu_core/slots/                  #   24.647 G/sec                     
     7,061,847,839      cpu_core/topdown-retiring/       #     21.8% Retiring                 
        31,373,084      cpu_core/topdown-bad-spec/       #      0.1% Bad Speculation          
     1,142,374,971      cpu_core/topdown-fe-bound/       #      3.5% Frontend Bound           
    24,158,077,283      cpu_core/topdown-be-bound/       #     74.6% Backend Bound            
       158,143,683      cpu_core/topdown-heavy-ops/      #      0.5% Heavy Operations          #     21.3% Light Operations         
        31,338,821      cpu_core/topdown-br-mispredict/  #      0.1% Branch Mispredict         #      0.0% Machine Clears           
       189,379,719      cpu_core/topdown-fetch-lat/      #      0.6% Fetch Latency             #      2.9% Fetch Bandwidth          
    12,049,852,779      cpu_core/topdown-mem-bound/      #     37.2% Memory Bound              #     37.4% Core Bound               

@tringwald

With OMP and mmap enabled (#30), I'm consistently achieving >2600 tokens/s on my Ryzen 7950x.

$ gcc -Ofast -fopenmp -mtune=native -march=native run.c -lm -o run
$ OMP_NUM_THREADS=16 ./run out/model.bin 0.9 0 | grep "^achieved"
achieved tok/s: 2694.736842

Another optimization would be to replace exp() in softmax() with expf(). This avoids the conversion to double, but may alter the result slightly. However, it provides another big speedup of ~700 tokens/s.

$ OMP_NUM_THREADS=16 ./run out/model.bin 0.9 0 | grep "^achieved"
achieved tok/s: 3324.675325
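
For reference, a sketch of the change (assuming run.c's softmax(float* x, int size) signature):

#include <math.h>

void softmax(float* x, int size) {
    // find max value (for numerical stability)
    float max_val = x[0];
    for (int i = 1; i < size; i++) {
        if (x[i] > max_val) max_val = x[i];
    }
    // exp and sum; expf stays in single precision, exp() would round-trip through double
    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        x[i] = expf(x[i] - max_val);
        sum += x[i];
    }
    // normalize
    for (int i = 0; i < size; i++) {
        x[i] /= sum;
    }
}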

@karpathy
Owner

karpathy commented Jul 24, 2023

Ok wow, go OMP! :) I'll merge the PR @krzysztof-jusiak. Every day I speed up the code too much and have to train a bigger model. I published the 44M model yesterday, but even that one is too fast now. So I'm now training a GPT-1 sized model (768 dim, 12 layers, 12 heads, 1024 context), 110M params, which will be done training sometime later today.

@krzysztof-jusiak
Author

@tringwald that's blazing fast compared to the i7-12650H; is that because of the Ryzen 7950x's AVX-512 support?
Looking at matmul (https://godbolt.org/z/qndT8Eb1b), the generated code is definitely far from optimal, but it's vectorized at least. Improving matmul would definitely accelerate the inference, but it's hard to do without adding complexity, using external libraries, and/or using a computation graph, so there are trade-offs here.

@tringwald

tringwald commented Jul 24, 2023

@krzysztof-jusiak Well, the comparison is not really fair. The i7-12650H is a mobile segment CPU (with only 6 p-cores) vs. a high-end desktop Ryzen 7950x (with 16 full cores). I also assume my overclocked DDR5 RAM helps quite a lot ;)
The biggest speedup would probably be CUDA, but that's out of scope for this project, I guess.

Edit:
With regard to AVX-512, you are probably right. Compiling and running with -march=x86-64 (which excludes AVX) results in a large performance downgrade.

$ gcc -Ofast -fopenmp -march=x86-64 run.c -lm -o run
$ OMP_NUM_THREADS=16 ./run out/model.bin 0.9 0 | grep "^achieved"
achieved tok/s: 2098.360656

vs.

$ gcc -Ofast -fopenmp -march=native run.c -lm -o run
$ OMP_NUM_THREADS=16 ./run out/model.bin 0.9 0 | grep "^achieved"
achieved tok/s: 2509.803922

Using -fopt-info shows that with -march=native many blocks were vectorized with 64-byte vectors instead of the 16-byte vectors used for generic x86-64.
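
For reference, the vectorization report can be requested with the -fopt-info-vec variant, e.g.:

$ gcc -Ofast -fopenmp -march=native -fopt-info-vec run.c -lm -o run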

@krzysztof-jusiak
Author

Thank you @tringwald, makes sense 👍 Also, I totally agree that GPU would be the fastest option, but bigger models would require quantization or something similar to fit in memory, which is most likely out of scope for this project. CPU-only inference (float16) can still handle 7B with reasonable speed without quantization.

@emilmelnikov

emilmelnikov commented Jul 24, 2023

Looking at matmul (https://godbolt.org/z/qndT8Eb1b), the generated code is definitely far from optimal, but it's vectorized at least. Improving matmul would definitely accelerate the inference, but it's hard to do without adding complexity, using external libraries, and/or using a computation graph, so there are trade-offs here.

Clang vectorizes and unrolls the inner reduction loop in matmul very well just with -Ofast. GCC codegen with __attribute__((optimize("unroll-loops"))) is not that good, but also does the job in the basic block at lines 76-95: https://godbolt.org/z/9e163bjEx.
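
For reference, a sketch of applying that attribute to matmul (GCC-specific; clang ignores the optimize function attribute):

// ask GCC to unroll loops in this function only
__attribute__((optimize("unroll-loops")))
void matmul(float* xout, float* x, float* w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}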

@emilmelnikov

emilmelnikov commented Jul 24, 2023

I ran a quick test on AWS c6i.large with Ubuntu 22.04 and GCC 11.2, and the unroll-loops attribute did not result in a higher tokens-per-second. Compiling with clang on the same machine also produced similar numbers. Interestingly, with -march=native GCC used only AVX2 YMM registers, but clang used AVX-512 ZMM. The generated assembly for matmul looked almost identical to the Godbolt snippet from above, so at this point we are probably bottlenecked by memory.
