Speed up inference ~4x for 7B model without introducing too much complexity #95

Open
wants to merge 1 commit into base: master
Conversation

krzysztof-jusiak

Problem:

  • inference for 7B model is slow.

Solution:

  • unroll the loop in matmul to perform 4 operations in parallel with SIMD.

Result (with float16):

  • before: 16tok/s
  • after: 71tok/s

Note:

  • unrolling the loop is a bit magical, so I'm not sure about it, but maybe the trade-off is worth it? (A sketch of the change is below.)
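
For reference, a sketch of the change to the matmul inner loop (the real change is in the commit; this assumes n is a multiple of 4, which is the case for the checkpoints here):

// before: one scalar multiply-add per iteration
for (int j = 0; j < n; j++) {
    val += w[i * n + j] * x[j];
}

// after: unrolled by 4, so the compiler can keep four multiply-adds in
// flight and map each group onto SIMD instructions
for (int j = 0; j < n; j += 4) {
    val += w[i * n + j]     * x[j];
    val += w[i * n + j + 1] * x[j + 1];
    val += w[i * n + j + 2] * x[j + 2];
    val += w[i * n + j + 3] * x[j + 3];
}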

@vgoklani

vgoklani commented Jul 26, 2023

@krzysztof-jusiak Hey there - could you please explain how this works:

"unroll the loop in matmul to perform 4 operations in parallel with SIMD."

Presumably each line is done in parallel, but are there more details?

I'm super-curious, thanks!

@Maknee

Maknee commented Jul 26, 2023

I think this patch is quite nice, as it adds a minimal number of lines.

@krzysztof-jusiak Hey there - could you please explain how this works:

"unroll the loop in matmul to perform 4 operations in parallel with SIMD."

Presumably each line is done in parallel, but are there more details?

I'm super-curious, thanks!

The loop is unrolled four times (see the generated assembly comparison).

The loop now iterates in steps of 4 (j += 4), so each group of four elements is multiplied and accumulated together. This hints to the compiler that it can emit SIMD instructions operating on XMM registers, which process four floats in a batch.

This patch is similar to #55 and #94, which use the same technique but unroll 16x.


Below are some of my results comparing this PR with the current main branch (using the smaller models listed in README.md).

One run each; the first number is the original and the second is the updated version.

run

model:    achieved tok/s: 127.680798 -> 129.489125
model44m: achieved tok/s: 40.327662 -> 41.850580

runfast

model:    achieved tok/s: 560.175055 -> 510.978044
model44m: achieved tok/s: 150.234742 -> 152.744630

runomp

model:    achieved tok/s: 4923.076923 -> 3710.144928
model44m: achieved tok/s: 480.300188 -> 483.931947

Here are the results from hyperfine (10 runs):

run

Benchmark 1: ./run model.bin 0.0
  Time (mean ± σ):      2.014 s ±  0.012 s    [User: 2.011 s, System: 0.002 s]
  Range (min … max):    2.005 s …  2.038 s    10 runs
  
Benchmark 2: ./run_qkv model.bin 0.0
  Time (mean ± σ):      1.980 s ±  0.000 s    [User: 1.977 s, System: 0.002 s]
  Range (min … max):    1.979 s …  1.980 s    10 runs
 
Summary
  ./run_qkv model.bin 0.0 ran
    1.02 ± 0.01 times faster than ./run model.bin 0.0

Benchmark 1: ./run model44m.bin 0.0
  Time (mean ± σ):      6.188 s ±  0.120 s    [User: 6.182 s, System: 0.005 s]
  Range (min … max):    6.104 s …  6.421 s    10 runs
  
Benchmark 2: ./run_qkv model44m.bin 0.0
  Time (mean ± σ):      6.148 s ±  0.093 s    [User: 6.140 s, System: 0.007 s]
  Range (min … max):    6.097 s …  6.327 s    10 runs
  
Summary
  ./run_qkv model44m.bin 0.0 ran
    1.01 ± 0.02 times faster than ./run model44m.bin 0.0

runfast

Benchmark 1: ./run model.bin 0.0
  Time (mean ± σ):     461.9 ms ±   2.1 ms    [User: 459.3 ms, System: 2.5 ms]
  Range (min … max):   459.0 ms … 464.9 ms    10 runs
 
Benchmark 2: ./run_qkv model.bin 0.0
  Time (mean ± σ):     508.9 ms ±   6.6 ms    [User: 506.3 ms, System: 2.5 ms]
  Range (min … max):   499.6 ms … 515.9 ms    10 runs
 
Summary
  ./run model.bin 0.0 ran
    1.10 ± 0.02 times faster than ./run_qkv model.bin 0.0

Benchmark 1: ./run model44m.bin 0.0
  Time (mean ± σ):      1.738 s ±  0.067 s    [User: 1.730 s, System: 0.008 s]
  Range (min … max):    1.710 s …  1.928 s    10 runs
 
Benchmark 2: ./run_qkv model44m.bin 0.0
  Time (mean ± σ):      1.685 s ±  0.013 s    [User: 1.677 s, System: 0.007 s]
  Range (min … max):    1.673 s …  1.707 s    10 runs
 
Summary
  ./run_qkv model44m.bin 0.0 ran
    1.03 ± 0.04 times faster than ./run model44m.bin 0.0

runomp

Benchmark 1: ./run model.bin 0.0
  Time (mean ± σ):      44.4 ms ±   0.4 ms    [User: 344.3 ms, System: 1.7 ms]
  Range (min … max):    44.0 ms …  45.2 ms    10 runs
 
Benchmark 2: ./run_qkv model.bin 0.0
  Time (mean ± σ):      72.3 ms ±   4.2 ms    [User: 562.7 ms, System: 5.2 ms]
  Range (min … max):    69.7 ms …  84.0 ms    10 runs
  
Summary
  ./run model.bin 0.0 ran
    1.63 ± 0.10 times faster than ./run_qkv model.bin 0.0

Benchmark 1: ./run model44m.bin 0.0
  Time (mean ± σ):     544.2 ms ±  27.7 ms    [User: 4329.5 ms, System: 11.2 ms]
  Range (min … max):   524.3 ms … 610.7 ms    10 runs
 
Benchmark 2: ./run_qkv model44m.bin 0.0
  Time (mean ± σ):     582.7 ms ± 132.5 ms    [User: 4635.9 ms, System: 11.6 ms]
  Range (min … max):   531.9 ms … 959.2 ms    10 runs
  
Summary
  ./run model44m.bin 0.0 ran
    1.07 ± 0.25 times faster than ./run_qkv model44m.bin 0.0

Here are the specs of the machine/env I'm running on:

AMD Ryzen 7 7800X3D 8-Core Processor
128 GB DDR5-5200 RAM
Linux maknee-gpu 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04.1)

@Foundation42

Foundation42 commented Jul 26, 2023

Very cool stuff. Perhaps we can integrate your loop unrolling with my fused matrix multiplies.

Here is what I am getting with your PR on the same box I was testing with in my PR #94

Fast

f42@formica:~/dev/llama2.c$ ./run out44m/model44m.bin
<s>
 Once upon a time, there was a little peanut. The peanut was very small and lived in a big garden. One day, the peanut met a big, tall tree. The tree was very kind and let the peanut live with it.
One day, it was very cold outside. The peanut started to shiver. The big, tall tree saw the squirrel shivering too. The tree said to the peanut, "Come, sit with me. I will keep you warm." The peanut was polite and said, "Thank you, tree."
They became good friends. The peanut, the tree, and the tree were always together. They played and talked every day. The peanut was happy and warm. The big, tall tree was happy too. And they all lived happily ever after.
<s>
 Once upon a time, there was a little boy named Tim. Tim was very excited because he found a big gear in his toy box. He wanted to show it to his friend, Sue.
At school, Tim met Sue and said, "Look at my big gear!" Sue looked at the gear and said, "Wow!
achieved tok/s: 47.832586

OMP

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run out44m/model44m.bin
<s>
 Once upon a time, there was a little red car. The car had a dream. It wanted to gain something special. One day, the car went on a long trip. It had to leave its friends behind. The car was very happy.
But on the trip, the car saw a big mess. There was a terrible mess everywhere. The car was sad. It thought, "I wanted to gain something special today, but there was no." It did not like the mess.
Then, the car saw a big tree. The tree was full of pretty flowers. The car had a good idea. It started to pick the flowers. The flowers made the terrible mess go away. The car gained something special after all. It gained the pretty flowers. The car was very happy.
<s>
 Once upon a time, there was a loud dog named Max. Max loved to bark all day. He barked at his toys, at the flowers, and even at the people walking by.
One day, Max found a magazine on the ground. It had many fun pictures in it. Max thought it would be fun to bark at the pictures in the magazine, too. So, he barked and barked, and the pictures in
achieved tok/s: 175.222450

These are very impressive numbers for such a small change. It makes me even more confident that combining the strategies could be a huge win.

@Foundation42

You mentioned

I wonder how much speed for larger models (memory bound) could be achieved by quantization and similar techniques which would most likely require custom simd implementation. In case of llama.cpp quantization gives a huge performance boost especially q4, however it's not exactly the same model as the original and I don't think that's the scope of this project.

Your question regarding the potential for performance improvements through techniques such as quantization for larger, memory-bound models is certainly intriguing. As you correctly pointed out, llama.cpp's q4 quantization does lead to significant speed improvements. However, like you said, perhaps this strays away from the vision and could potentially fall outside the scope of this project, which, as we've discussed, strives to strike a balance between simplicity and performance.

Reflecting on @karpathy's work with nanoGPT and minGPT, it's clear that he has already explored the spectrum from baseline models to more sophisticated implementations. In many ways, this project feels like a step up, pushing the envelope while still keeping educational value high.

It's incredibly fun seeing how far we can take things, though. By examining what optimizations can be applied and understanding their impact, we're really pushing what can be achieved with CPU-bound models, all while keeping the complexity at a manageable level (fingers crossed).

Really looking forward to seeing where the project goes next.

@Foundation42

It's quite a bit past my bedtime, but I was finally able to produce a couple of llama2-7b benchmarks.

#95

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run ../llama/llama2_7b.bin 
<s>
 Here you can discover the Ukrainian brides, that can be found for a wedding in Kiev. There are thousands of Ukrainian women who are very dreamy in regards to the probabilities of getting to know the perfect man on the earth.
Ukrainian girls are very frank, so do not be afraid to ask her how a lot she costs for a date. There are many reasons why Ukrainian brides are so fashionable among men from the United States.
</s>

<s>
 SMART GOALS (Set Goals, Make a Plan, Accept Responsibility, Track Progress, and Achieve Success)
Great idea and not so great idea checklist for new habits
Prioritize: Can you imagine not prioritizing? Even if we don’t write it down, we do prioritize in life, setting goals and committing to them. Goal setting isn’t new, but it is powerful. What we write down is powerful and then taking concrete action steps to meet our goals.
Write goals for the future: write both long-term goals (five years) and short-term goals (one year).
When writing goals, the SMART checklist helps:

achieved tok/s: 2.055086

#94

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run ../llama/llama2_7b.bin 
<s>
 Tags: linux, bash, shell, makefile

Question: Bash script to remove specific files that I don’t know their names

I’m trying to make a bash script where I have a list files to remove and at the same time I have a list files that I don’t want removed.

It can be like this:

\begin{code}
filesCommon = /home/user/somelongname.zip /home/user/somelongname.txt
filesNontoRemove = /home/user/genre.json /home/user/tags.csv

filesToRemove ?= $filesCommon
remove = cp $filesNontoRemove $filesCommon
\end{code}

I don’t know how to solve this problem. I would really appreciate your help.

Answer: If the common files are simply named `foo/something` and `bar/something`, then you could do `filesCommon=(`. But you'd better to find a solution that does not depend on names.

The fundamental problem is that a bash variable is the value of a variable, not the variable itself. A string that contains a list of strings can be
achieved tok/s: 2.369427

I'll take a look tomorrow and see if it is possible to merge the loop unrolling and the broader work. It certainly was a good start.

Have a good day and look forward to more soon

@kroggen
Contributor

kroggen commented Jul 26, 2023

Wow, way cleaner than #94

@Foundation42

Wow, way cleaner than #94

Absolutely, individual preferences can sway toward solutions requiring fewer changes, especially in the context of projects like this one where simplicity is a key factor. However, it's important to recognize that to realize substantial performance improvements, certain fundamental alterations may be unavoidable.

In my experience, it's indeed common to attain significant performance gains—up to 100%—with relatively minor adjustments. But when striving for even greater enhancements, one often has to delve deeper and be prepared for more extensive modifications. The introduction of fused matrix multiplication, for instance, isn't something that could be achieved with just a few lines of code; it's intrinsic to its nature.

Consequently, I believe that making these more complex changes earlier, when possible, sets a stronger foundation for future improvements. All while keeping in mind the delicate balance between optimization and maintainability.

In the end, our shared passion for maximizing the potential of this project is what unites us. Despite it being in its early stages (merely two days old), it's awesome to see the diverse range of ideas and approaches being explored. It underscores the importance of evaluating all possibilities to truly optimize what can be achieved. Passion is certainly a feature.

@clebert
Contributor

clebert commented Jul 26, 2023

Nice patch. I suggest adding comments to the code to retain its instructive nature and to provide novice code readers with an indication of what it does.

#pragma omp parallel for
for (int i = 0; i < d; i++) {
    float val = 0.0f;
    const int i_n = i * n;
    // Loop is incremented by 4 to perform four calculations at once. This is known as loop unrolling.
    for (int j = 0; j < n; j+=4) {
        // The four independent accumulations let the compiler use SIMD instructions to speed up the processing.
        val += w[i_n + j] * x[j];
        val += w[i_n + j + 1] * x[j + 1];
        val += w[i_n + j + 2] * x[j + 2];
        val += w[i_n + j + 3] * x[j + 3];
    }
    xout[i] = val;
}

@krzysztof-jusiak
Author

@clebert Thanks, added the comments. I agree that they are very useful in this case, as there is additional complexity to deal with, but hopefully not too much.

@leloykun
Contributor

I'm not very familiar with this, but: why don't we just parallelize the for loop? I.e. add another #pragma omp parallel for?
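
Something like this is what I have in mind (just a sketch, and I may be missing how run.c already splits the work across threads):

#pragma omp parallel for
for (int i = 0; i < d; i++) {
    float val = 0.0f;
    // the "extra" pragma: also parallelize the inner dot product.
    // It needs the reduction to stay correct, and nested OpenMP regions
    // are disabled by default, so it may not actually help in practice.
    #pragma omp parallel for reduction(+:val)
    for (int j = 0; j < n; j++) {
        val += w[i * n + j] * x[j];
    }
    xout[i] = val;
}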

@aegkmq
Contributor

aegkmq commented Jul 26, 2023

Why not use #pragma GCC unroll 8 instead of manually unrolling?
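
i.e. keep the plain loop and let the compiler do the unrolling (a sketch, assuming it sits directly above the inner dot-product loop; the factor is only a hint to GCC):

// ask GCC to unroll the dot-product loop by 8 instead of unrolling by hand
#pragma GCC unroll 8
for (int j = 0; j < n; j++) {
    val += w[i * n + j] * x[j];
}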

@Ea0011

Ea0011 commented Jul 26, 2023

I am curious whether you have considered using the OpenMP SIMD directive. In my case it didn't do much, as I suspect the compiler was already taking care of SIMD automatically.

#pragma omp simd
for (int j = 0; j < n; j++) {
      val += w[i * n + j] * x[j];
}

@krzysztof-jusiak
Author

@aegkmq #pragma GCC unroll 4/8 was my initial thought, though the optimized code wasn't as fast.
@Ea0011 #pragma omp simd - yeah, the code is already vectorized.

There are definitely ways to optimize it much more whilst keeping the simplicity. I just explored a bit after verifying that the matmul is the bottleneck, noticed the improvement, and created an MR as, IMHO, a nice step forward without introducing much complexity, but I wanted to verify whether that's the overall consensus.

godbolt link for the solutions - https://godbolt.org/z/Gb3dbxz6W

perf

@Ea0011

Ea0011 commented Jul 26, 2023

@kris-jusiak Weirdly enough, I get around a 20% speedup by doing this. I specify the number of iterations to be vectorized as 4, instead of letting OMP decide it. I guess this happens because 128-bit instructions have less latency and/or more instructions per cycle on my machine.

#pragma omp simd simdlen(4)
for (int j = 0; j < n; j++) {
      val += w[i * n + j] * x[j];
}

Edit: this also seems to use xmm registers and do 4 iterations at a time because of the specified simdlen(4). Could you have a look if you have time? Thanks.

@krzysztof-jusiak
Author

This is a good discussion. Here are some additional results:

  • master: achieved tok/s: 0.622270
  • pragma omp simdlen(4): achieved tok/s: 0.647372
  • pragma gcc unroll 8: achieved tok/s: 0.642006
  • this MR: achieved tok/s: 2.254533

I think the way to go for simplicity, portability and performance would be to get aligned memory (which @Foundation42 has already worked on) and use Vector Extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). That would simplify the code a bit and would probably perform better too. Something worth exploring, IMHO.
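
Rough sketch of the idea (untested; assumes 16-byte-aligned buffers and n divisible by 4, and uses the matmul argument names from run.c):

typedef float v4f __attribute__((vector_size(16)));

// matmul with GCC vector extensions: each += / * works on 4 floats at once
#pragma omp parallel for
for (int i = 0; i < d; i++) {
    v4f acc = {0.0f, 0.0f, 0.0f, 0.0f};
    const v4f* wv = (const v4f*)(w + i * n);
    const v4f* xv = (const v4f*)x;
    for (int j = 0; j < n / 4; j++) {
        acc += wv[j] * xv[j];
    }
    // horizontal sum of the 4 lanes
    xout[i] = acc[0] + acc[1] + acc[2] + acc[3];
}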

@karpathy
Owner

@krzysztof-jusiak I don't mind complexifying matmul a little bit, because it is the place where 90% of the FLOPS go, and I think it's a good tradeoff. So I'm happy to merge something like this. That said, I'm not able to reproduce this speedup. Both master and this branch run at ~4.5 tok/s on my cloud machine with OMP 48 threads. Can you say a bit more about where you run this, and how it was compiled?

@krzysztof-jusiak
Author

The improvement has been tested with the fp16 model (#93) on a machine with sse3/avx/avx2 support.

python3 export_meta_llama_bin.py 7B llama2_7b.bin float16
gcc -Ofast -march=native  run.c  -lm  -o run -fopenmp -DDTYPE=_Float16
OMP_NUM_THREADS=8 ./run llama2_7b.bin

@Ea0011

Ea0011 commented Jul 26, 2023

@karpathy One question: is your cloud machine a single-CPU machine, or does it have multiple CPUs? In the latter case, your machine might be a non-uniform memory access (NUMA) system, in which case multithreading can cause slowdowns because of data-locality issues. Essentially, data allocated in the memory of one node is harder to access from other nodes, which causes latency.

@karpathy
Owner

@krzysztof-jusiak Oops, I missed #93, will def take a look after work.

@Ea0011 my lscpu is here:

(pytorch2) ubuntu:~/llamac$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          120
On-line CPU(s) list:             0-119
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       120
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7H12 64-Core Processor
Stepping:                        0
CPU MHz:                         2599.998
BogoMIPS:                        5199.99
Virtualization:                  AMD-V
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       7.5 MiB
L1i cache:                       7.5 MiB
L2 cache:                        60 MiB
L3 cache:                        1.9 GiB
NUMA node0 CPU(s):               0-119
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall n
                                 x mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma 
                                 cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_
                                 legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fs
                                 gsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves
                                  clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities

@Ea0011

Ea0011 commented Jul 26, 2023

@karpathy Ok. You have a single node and a single CPU, so you should not worry about any of that :). But using 48 threads can be a bit too much, I think. I wonder if you could achieve a speedup using fewer threads.

@ozabluda
Contributor

godbolt link for the solutions - https://godbolt.org/z/Gb3dbxz6W

If you add -funroll-all-loops to the original vanilla version, it unrolls the loop 8x (on godbolt). Additional compiler flags can probably force even more unrolling. On my box it gives a couple of percent better performance.

https://godbolt.org/z/fM9K8hv5e
