Speed up inference ~4x for 7B model without introducing too much complexity #95

Open
wants to merge 1 commit into base: master
Conversation

krzysztof-jusiak

Problem:

  • inference for 7B model is slow.

Solution:

  • unroll the loop in matmul to perform 4 operations in parallel with SIMD.

Result (with float16):

  • before: 16tok/s
  • after: 71tok/s

Note:

  • unrolling the loop is a bit magical, so I'm not sure about it, but maybe the trade-off is worth it? (A sketch of the change is below.)
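
For reference, a sketch of the change to the matmul inner loop (the real change is in the commit; this assumes n is a multiple of 4, which is the case for the checkpoints here):

// before: one scalar multiply-add per iteration
for (int j = 0; j < n; j++) {
    val += w[i * n + j] * x[j];
}

// after: unrolled by 4, so the compiler can keep four multiply-adds in
// flight and map each group onto SIMD instructions
for (int j = 0; j < n; j += 4) {
    val += w[i * n + j]     * x[j];
    val += w[i * n + j + 1] * x[j + 1];
    val += w[i * n + j + 2] * x[j + 2];
    val += w[i * n + j + 3] * x[j + 3];
}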

@vgoklani

vgoklani commented Jul 26, 2023

@krzysztof-jusiak Hey there - could you please explain how this works:

"unroll the loop in matmul to perform 4 operations in parallel with SIMD."

Presumably each line is done in parallel, but are there more details?

I'm super-curious, thanks!

@Maknee

Maknee commented Jul 26, 2023

I think this patch is quite nice, as it adds a minimal number of lines.

@krzysztof-jusiak Hey there - could you please explain how this works:

"unroll the loop in matmul to perform 4 operations in parallel with SIMD."

Presumably each line is done in parallel, but are there more details?

I'm super-curious, thanks!

The loop is unrolled four times (see the generated assembly comparison).

The loop now iterates in steps of 4 (j += 4), so each group of four elements is multiplied and accumulated together. This hints to the compiler that it can emit SIMD instructions operating on XMM registers, which process four floats in a batch.

This patch is similar to #55 and #94, which use the same technique but unroll 16x.


Below are some of my results comparing this PR with the current main branch (using the smaller models listed in README.md).

One run each; the first number is the original and the second is the updated version.

run

model:    achieved tok/s: 127.680798 -> 129.489125
model44m: achieved tok/s: 40.327662 -> 41.850580

runfast

model:    achieved tok/s: 560.175055 -> 510.978044
model44m: achieved tok/s: 150.234742 -> 152.744630

runomp

model:    achieved tok/s: 4923.076923 -> 3710.144928
model44m: achieved tok/s: 480.300188 -> 483.931947

Here are the results from hyperfine (10 runs):

run

Benchmark 1: ./run model.bin 0.0
  Time (mean ± σ):      2.014 s ±  0.012 s    [User: 2.011 s, System: 0.002 s]
  Range (min … max):    2.005 s …  2.038 s    10 runs
  
Benchmark 2: ./run_qkv model.bin 0.0
  Time (mean ± σ):      1.980 s ±  0.000 s    [User: 1.977 s, System: 0.002 s]
  Range (min … max):    1.979 s …  1.980 s    10 runs
 
Summary
  ./run_qkv model.bin 0.0 ran
    1.02 ± 0.01 times faster than ./run model.bin 0.0

Benchmark 1: ./run model44m.bin 0.0
  Time (mean ± σ):      6.188 s ±  0.120 s    [User: 6.182 s, System: 0.005 s]
  Range (min … max):    6.104 s …  6.421 s    10 runs
  
Benchmark 2: ./run_qkv model44m.bin 0.0
  Time (mean ± σ):      6.148 s ±  0.093 s    [User: 6.140 s, System: 0.007 s]
  Range (min … max):    6.097 s …  6.327 s    10 runs
  
Summary
  ./run_qkv model44m.bin 0.0 ran
    1.01 ± 0.02 times faster than ./run model44m.bin 0.0

runfast

Benchmark 1: ./run model.bin 0.0
  Time (mean ± σ):     461.9 ms ±   2.1 ms    [User: 459.3 ms, System: 2.5 ms]
  Range (min … max):   459.0 ms … 464.9 ms    10 runs
 
Benchmark 2: ./run_qkv model.bin 0.0
  Time (mean ± σ):     508.9 ms ±   6.6 ms    [User: 506.3 ms, System: 2.5 ms]
  Range (min … max):   499.6 ms … 515.9 ms    10 runs
 
Summary
  ./run model.bin 0.0 ran
    1.10 ± 0.02 times faster than ./run_qkv model.bin 0.0

Benchmark 1: ./run model44m.bin 0.0
  Time (mean ± σ):      1.738 s ±  0.067 s    [User: 1.730 s, System: 0.008 s]
  Range (min … max):    1.710 s …  1.928 s    10 runs
 
Benchmark 2: ./run_qkv model44m.bin 0.0
  Time (mean ± σ):      1.685 s ±  0.013 s    [User: 1.677 s, System: 0.007 s]
  Range (min … max):    1.673 s …  1.707 s    10 runs
 
Summary
  ./run_qkv model44m.bin 0.0 ran
    1.03 ± 0.04 times faster than ./run model44m.bin 0.0

runomp

Benchmark 1: ./run model.bin 0.0
  Time (mean ± σ):      44.4 ms ±   0.4 ms    [User: 344.3 ms, System: 1.7 ms]
  Range (min … max):    44.0 ms …  45.2 ms    10 runs
 
Benchmark 2: ./run_qkv model.bin 0.0
  Time (mean ± σ):      72.3 ms ±   4.2 ms    [User: 562.7 ms, System: 5.2 ms]
  Range (min … max):    69.7 ms …  84.0 ms    10 runs
  
Summary
  ./run model.bin 0.0 ran
    1.63 ± 0.10 times faster than ./run_qkv model.bin 0.0

Benchmark 1: ./run model44m.bin 0.0
  Time (mean ± σ):     544.2 ms ±  27.7 ms    [User: 4329.5 ms, System: 11.2 ms]
  Range (min … max):   524.3 ms … 610.7 ms    10 runs
 
Benchmark 2: ./run_qkv model44m.bin 0.0
  Time (mean ± σ):     582.7 ms ± 132.5 ms    [User: 4635.9 ms, System: 11.6 ms]
  Range (min … max):   531.9 ms … 959.2 ms    10 runs
  
Summary
  ./run model44m.bin 0.0 ran
    1.07 ± 0.25 times faster than ./run_qkv model44m.bin 0.0

Here are the specs of the machine/env I'm running on:

AMD Ryzen 7 7800X3D 8-Core Processor
128 GB DDR5-5200 RAM
Linux maknee-gpu 5.19.0-46-generic #47~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Jun 21 15:35:31 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04.1)

@Foundation42

Foundation42 commented Jul 26, 2023

Very cool stuff. Perhaps we can integrate your loop unrolling with my fused matrix multiplies.

Here is what I am getting with your PR on the same box I was testing with in my PR #94

Fast

f42@formica:~/dev/llama2.c$ ./run out44m/model44m.bin
<s>
 Once upon a time, there was a little peanut. The peanut was very small and lived in a big garden. One day, the peanut met a big, tall tree. The tree was very kind and let the peanut live with it.
One day, it was very cold outside. The peanut started to shiver. The big, tall tree saw the squirrel shivering too. The tree said to the peanut, "Come, sit with me. I will keep you warm." The peanut was polite and said, "Thank you, tree."
They became good friends. The peanut, the tree, and the tree were always together. They played and talked every day. The peanut was happy and warm. The big, tall tree was happy too. And they all lived happily ever after.
<s>
 Once upon a time, there was a little boy named Tim. Tim was very excited because he found a big gear in his toy box. He wanted to show it to his friend, Sue.
At school, Tim met Sue and said, "Look at my big gear!" Sue looked at the gear and said, "Wow!
achieved tok/s: 47.832586

OMP

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run out44m/model44m.bin
<s>
 Once upon a time, there was a little red car. The car had a dream. It wanted to gain something special. One day, the car went on a long trip. It had to leave its friends behind. The car was very happy.
But on the trip, the car saw a big mess. There was a terrible mess everywhere. The car was sad. It thought, "I wanted to gain something special today, but there was no." It did not like the mess.
Then, the car saw a big tree. The tree was full of pretty flowers. The car had a good idea. It started to pick the flowers. The flowers made the terrible mess go away. The car gained something special after all. It gained the pretty flowers. The car was very happy.
<s>
 Once upon a time, there was a loud dog named Max. Max loved to bark all day. He barked at his toys, at the flowers, and even at the people walking by.
One day, Max found a magazine on the ground. It had many fun pictures in it. Max thought it would be fun to bark at the pictures in the magazine, too. So, he barked and barked, and the pictures in
achieved tok/s: 175.222450

These are very impressive numbers for such a small change. It makes me even more confident that combining the strategies could be a huge win.

@Foundation42

You mentioned

I wonder how much speed for larger models (memory bound) could be achieved by quantization and similar techniques which would most likely require custom simd implementation. In case of llama.cpp quantization gives a huge performance boost especially q4, however it's not exactly the same model as the original and I don't think that's the scope of this project.

Your question regarding the potential for performance improvements through techniques such as quantization for larger, memory-bound models is certainly intriguing. As you correctly pointed out, llama.cpp's q4 quantization does lead to significant speed improvements. However, like you said, perhaps this strays away from the vision and could potentially fall outside the scope of this project, which, as we've discussed, strives to strike a balance between simplicity and performance.

Reflecting on @karpathy's work with nanoGPT and minGPT, it's clear that he has already explored the spectrum from baseline models to more sophisticated implementations. In many ways, this project feels like a step up, pushing the envelope while still keeping educational value high.

It's incredibly fun seeing how far we can take things, though. By examining what optimizations can be applied and understanding their impact, we're really pushing what can be achieved with CPU-bound models, all while keeping the complexity at a manageable level (fingers crossed).

Really looking forward to seeing where the project goes next.

@Foundation42

It's quite a bit past my bedtime, but I was finally able to produce a couple of llama2-7b benchmarks.

#95

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run ../llama/llama2_7b.bin 
<s>
 Here you can discover the Ukrainian brides, that can be found for a wedding in Kiev. There are thousands of Ukrainian women who are very dreamy in regards to the probabilities of getting to know the perfect man on the earth.
Ukrainian girls are very frank, so do not be afraid to ask her how a lot she costs for a date. There are many reasons why Ukrainian brides are so fashionable among men from the United States.
</s>

<s>
 SMART GOALS (Set Goals, Make a Plan, Accept Responsibility, Track Progress, and Achieve Success)
Great idea and not so great idea checklist for new habits
Prioritize: Can you imagine not prioritizing? Even if we don’t write it down, we do prioritize in life, setting goals and committing to them. Goal setting isn’t new, but it is powerful. What we write down is powerful and then taking concrete action steps to meet our goals.
Write goals for the future: write both long-term goals (five years) and short-term goals (one year).
When writing goals, the SMART checklist helps:

achieved tok/s: 2.055086

#94

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run ../llama/llama2_7b.bin 
<s>
 Tags: linux, bash, shell, makefile

Question: Bash script to remove specific files that I don’t know their names

I’m trying to make a bash script where I have a list files to remove and at the same time I have a list files that I don’t want removed.

It can be like this:

\begin{code}
filesCommon = /home/user/somelongname.zip /home/user/somelongname.txt
filesNontoRemove = /home/user/genre.json /home/user/tags.csv

filesToRemove ?= $filesCommon
remove = cp $filesNontoRemove $filesCommon
\end{code}

I don’t know how to solve this problem. I would really appreciate your help.

Answer: If the common files are simply named `foo/something` and `bar/something`, then you could do `filesCommon=(`. But you'd better to find a solution that does not depend on names.

The fundamental problem is that a bash variable is the value of a variable, not the variable itself. A string that contains a list of strings can be
achieved tok/s: 2.369427

I'll take a look tomorrow and see if it is possible to merge the loop unrolling and the broader work. It certainly was a good start.

Have a good day and look forward to more soon

@kroggen
Contributor

kroggen commented Jul 26, 2023

Wow, way cleaner than #94

@Foundation42

Wow, way cleaner than #94

Absolutely, individual preferences can sway toward solutions requiring fewer changes, especially in the context of projects like this one where simplicity is a key factor. However, it's important to recognize that to realize substantial performance improvements, certain fundamental alterations may be unavoidable.

In my experience, it's indeed common to attain significant performance gains—up to 100%—with relatively minor adjustments. But when striving for even greater enhancements, one often has to delve deeper and be prepared for more extensive modifications. The introduction of fused matrix multiplication, for instance, isn't something that could be achieved with just a few lines of code; it's intrinsic to its nature.

Consequently, I believe that making these more complex changes earlier, when possible, sets a stronger foundation for future improvements. All while keeping in mind the delicate balance between optimization and maintainability.

In the end, our shared passion for maximizing the potential of this project is what unites us. Despite it being in its early stages (merely two days old), it's awesome to see the diverse range of ideas and approaches being explored. It underscores the importance of evaluating all possibilities to truly optimize what can be achieved. Passion is certainly a feature.

@clebert
Contributor

clebert commented Jul 26, 2023

Nice patch. I suggest adding comments to the code to retain its instructive nature and to provide novice code readers with an indication of what it does.

#pragma omp parallel for
for (int i = 0; i < d; i++) {
    float val = 0.0f;
    const int i_n = i * n;
    // Loop is incremented by 4 to perform four calculations at once. This is known as loop unrolling.
    for (int j = 0; j < n; j+=4) {
        // The four independent accumulations let the compiler use SIMD instructions to speed up the processing.
        val += w[i_n + j] * x[j];
        val += w[i_n + j + 1] * x[j + 1];
        val += w[i_n + j + 2] * x[j + 2];
        val += w[i_n + j + 3] * x[j + 3];
    }
    xout[i] = val;
}

@krzysztof-jusiak
Author

@clebert Thanks, added the comments. I agree that they are very useful in this case, as there is additional complexity to deal with, but hopefully not too much.

@leloykun
Contributor

I'm not very familiar with this, but: why don't we just parallelize the for loop? I.e. add another #pragma omp parallel for?
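
Something like this is what I have in mind (just a sketch, and I may be missing how run.c already splits the work across threads):

#pragma omp parallel for
for (int i = 0; i < d; i++) {
    float val = 0.0f;
    // the "extra" pragma: also parallelize the inner dot product.
    // It needs the reduction to stay correct, and nested OpenMP regions
    // are disabled by default, so it may not actually help in practice.
    #pragma omp parallel for reduction(+:val)
    for (int j = 0; j < n; j++) {
        val += w[i * n + j] * x[j];
    }
    xout[i] = val;
}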

@aegkmq
Contributor

aegkmq commented Jul 26, 2023

Why not use #pragma GCC unroll 8 instead of manually unrolling?
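
i.e. keep the plain loop and let the compiler do the unrolling (a sketch, assuming it sits directly above the inner dot-product loop; the factor is only a hint to GCC):

// ask GCC to unroll the dot-product loop by 8 instead of unrolling by hand
#pragma GCC unroll 8
for (int j = 0; j < n; j++) {
    val += w[i * n + j] * x[j];
}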

@Ea0011

Ea0011 commented Jul 26, 2023

I am curious whether you have considered using the OpenMP SIMD directive. In my case it didn't do much, as I suspect the compiler was already taking care of SIMD automatically.

#pragma omp simd
for (int j = 0; j < n; j++) {
      val += w[i * n + j] * x[j];
}

@krzysztof-jusiak
Author

@aegkmq #pragma GCC unroll 4/8 was my initial thought, though the optimized code wasn't as fast.
@Ea0011 #pragma omp simd - yeah, the code is already vectorized.

There are definitely ways to optimize it much more whilst keeping the simplicity. I just explored a bit after verifying that the matmul is the bottleneck, noticed the improvement, and created an MR as, IMHO, a nice step forward without introducing much complexity, but I wanted to verify whether that's the overall consensus.

godbolt link for the solutions - https://godbolt.org/z/Gb3dbxz6W

perf

@Ea0011

Ea0011 commented Jul 26, 2023

@kris-jusiak Weirdly enough, I get around a 20% speedup by doing this. I specify the number of iterations to be vectorized as 4, instead of letting OMP decide it. I guess this happens because 128-bit instructions have less latency and/or more instructions per cycle on my machine.

#pragma omp simd simdlen(4)
for (int j = 0; j < n; j++) {
      val += w[i * n + j] * x[j];
}

Edit: this also seems to use xmm registers and do 4 iterations at a time because of the specified simdlen(4). Could you have a look if you have time? Thanks.

@krzysztof-jusiak
Author

This is a good discussion. Here are some additional results:

  • master: achieved tok/s: 0.622270
  • pragma omp simdlen(4): achieved tok/s: 0.647372
  • pragma gcc unroll 8: achieved tok/s: 0.642006
  • this MR: achieved tok/s: 2.254533

I think the way to go for simplicity, portability and performance would be to get aligned memory (which @Foundation42 has already worked on) and use Vector Extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html). That would simplify the code a bit and would probably perform better too. Something worth exploring, IMHO.
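
Rough sketch of the idea (untested; assumes 16-byte-aligned buffers and n divisible by 4, and uses the matmul argument names from run.c):

typedef float v4f __attribute__((vector_size(16)));

// matmul with GCC vector extensions: each += / * works on 4 floats at once
#pragma omp parallel for
for (int i = 0; i < d; i++) {
    v4f acc = {0.0f, 0.0f, 0.0f, 0.0f};
    const v4f* wv = (const v4f*)(w + i * n);
    const v4f* xv = (const v4f*)x;
    for (int j = 0; j < n / 4; j++) {
        acc += wv[j] * xv[j];
    }
    // horizontal sum of the 4 lanes
    xout[i] = acc[0] + acc[1] + acc[2] + acc[3];
}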

@karpathy
Owner

@krzysztof-jusiak I don't mind complexifying matmul a little bit, because it is the place where 90% of the FLOPS go, and I think it's a good tradeoff. So I'm happy to merge something like this. That said, I'm not able to reproduce this speedup. Both master and this branch run at ~4.5 tok/s on my cloud machine with OMP 48 threads. Can you say a bit more about where you run this, and how it was compiled?

@krzysztof-jusiak
Author

The improvement has been tested with the fp16 model (#93) on a machine with sse3/avx/avx2 support.

python3 export_meta_llama_bin.py 7B llama2_7b.bin float16
gcc -Ofast -march=native  run.c  -lm  -o run -fopenmp -DDTYPE=_Float16
OMP_NUM_THREADS=8 ./run llama2_7b.bin

@Ea0011

Ea0011 commented Jul 26, 2023

@karpathy One question: is your cloud machine a single-CPU machine, or does it have multiple CPUs? In the latter case, your machine might be a non-uniform memory access (NUMA) system, in which case multithreading can cause slowdowns because of data-locality issues. Essentially, data allocated in the memory of one node is harder to access from other nodes, which causes latency.

@karpathy
Owner

@krzysztof-jusiak Oops, I missed #93, will def take a look after work.

@Ea0011 my lscpu is here:

(pytorch2) ubuntu:~/llamac$ lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          120
On-line CPU(s) list:             0-119
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       120
NUMA node(s):                    1
Vendor ID:                       AuthenticAMD
CPU family:                      23
Model:                           49
Model name:                      AMD EPYC 7H12 64-Core Processor
Stepping:                        0
CPU MHz:                         2599.998
BogoMIPS:                        5199.99
Virtualization:                  AMD-V
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       7.5 MiB
L1i cache:                       7.5 MiB
L2 cache:                        60 MiB
L3 cache:                        1.9 GiB
NUMA node0 CPU(s):               0-119
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall n
                                 x mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma 
                                 cx16 sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_
                                 legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core ssbd ibrs ibpb stibp vmmcall fs
                                 gsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves
                                  clzero xsaveerptr wbnoinvd arat npt nrip_save umip rdpid arch_capabilities

@Ea0011

Ea0011 commented Jul 26, 2023

@karpathy Ok. You have a single node and a single CPU, so you should not worry about any of that :). But using 48 threads can be a bit too much, I think. I wonder if you could achieve a speedup using fewer threads.

@ozabluda
Contributor

godbolt link for the solutions - https://godbolt.org/z/Gb3dbxz6W

If you add -funroll-all-loops to the original vanilla version, it unrolls the loop 8x (on godbolt). Additional compiler flags can probably force even more unrolling. On my box it gives a couple of percent better performance.

https://godbolt.org/z/fM9K8hv5e
