
Adds Optional AVX2 Support, Cache Alignment, and Enhances Model Export Speed #94

Open
wants to merge 1 commit into master

Conversation

Foundation42

This pull request introduces several key optimizations and enhancements to the Llama2.c codebase.

Key Changes:

  1. Optional AVX2 Support: Added optional support for AVX2 intrinsics in matmul and rmsnorm functions. This can provide a substantial performance boost on systems that support these instructions.
  2. Fused Matrix Multiplies: Introduced new functions matmul2 and matmul3 for fused matrix multiplications, further improving the efficiency of the code.
  3. Cache-Aligned Allocations: Memory allocations are now properly cache-aligned. This change boosts performance and ensures compatibility with SIMD/Vector intrinsics.
  4. Updated Config Structure: The Config struct has been updated to support cache alignment and a version field has been added (for future use). Please note that due to this change, previously exported models should be re-exported to ensure compatibility.
  5. Enhanced Model Export Speed: The performance of the serialization process in Llama export code has been significantly improved.
  6. Updated Makefile: The Makefile has been updated to support AVX2 and OMP AVX2 builds.

These changes collectively result in a substantial performance boost and make the code more flexible and efficient. I hope these enhancements prove to be beneficial to the project. I look forward to your feedback and am ready to make any necessary adjustments.

Please be aware that due to the new cache alignment requirements, model re-exporting is necessary. This PR introduces a version field to the file header, which simultaneously serves as padding for cache alignment. As such, the pre-existing models will need to be re-exported to ensure proper functionality.
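
To illustrate point 4, here is a rough sketch of what a 32-byte, alignment-friendly Config could look like on the C side; the field order mirrors the export snippet discussed later in this thread, but treat it as a sketch under those assumptions, not the PR's literal code.

// Sketch only: the original Config has seven ints (28 bytes); a leading
// version field pads it to eight ints = 32 bytes, so the weights that
// follow the header start on a 32-byte boundary.
typedef struct {
    int version;     // new field, doubles as alignment padding
    int dim;         // transformer dimension
    int hidden_dim;  // FFN hidden dimension
    int n_layers;    // number of transformer layers
    int n_heads;     // number of query heads
    int n_kv_heads;  // number of key/value heads
    int vocab_size;  // vocabulary size
    int seq_len;     // max sequence length
} Config;

_Static_assert(sizeof(Config) % 32 == 0, "weights after the header must stay 32-byte aligned");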

…ment in Config

This commit includes:
1. Optional AVX2 support for matmul and rmsnorm functions.
2. Fused matrix multiplies with new matmul2 and matmul3 functions.
3. Cache aligned allocations for better performance and compatibility with SIMD/Vector intrinsics.
4. Updated Config struct to support cache alignment. NOTE: Previous models should be re-exported due to this change.
5. Enhanced performance of serialization in Llama export code.
6. Updated Makefile to support AVX2 and OMP AVX2 builds.
@krzysztof-jusiak

krzysztof-jusiak commented Jul 26, 2023

These changes are great, but they add tons of complexity to run.c, which makes the project harder to follow for less experienced developers. Nevertheless, I think most of them can be applied:

  • changes to the export script are awesome on their own; maybe they could be a separate MR, as I don't think there will be any problems with getting them in?
  • aligned alloc and aligned local variables are definitely a positive change, though I think Windows has _aligned_malloc instead of posix_memalign, which would have to be considered
  • regarding AVX/AVX2: that adds tons of complexity, and since there is no backend abstraction in llama2.c the code gets harder to follow, because in the long term it would also require additions for NEON, AVX-512, etc., so the extra 500 LOC wouldn't stay at 500 for long. Ideally the compiler would help here. Using BLAS would probably be more maintainable in the long term? Is this solution faster than BLAS? The latter would also allow an easier switch to GPU, though I know BLAS doesn't support fp16, which can be done easily with AVX. There are trade-offs here for sure.

@Foundation42
Author

Yes, I understand the tradeoffs and totally get where you are coming from.

Thank you for your thoughtful feedback on my PR. I understand the concern regarding added complexity, particularly with the introduction of AVX/AVX2, and I appreciate your perspective on maintaining accessibility for less experienced developers. Here are my thoughts on your points:

Export Script: I agree that the changes to the export script can be separated into a different merge request. I will proceed with that and appreciate your support for this enhancement.

Aligned Alloc and Local Variables: Thank you for recognizing this improvement. I agree that the difference in aligned memory allocation between POSIX and Windows adds a layer of complexity. However, this is a necessary change for significant performance enhancements and compatibility with SIMD/Vector intrinsics. I would be glad to explore options to keep this manageable and clear in the code.
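
As a rough illustration of how that platform difference could be kept contained, here is a minimal sketch of a portable aligned-allocation wrapper; the function names are hypothetical and not taken from this PR.

#include <stdlib.h>
#ifdef _WIN32
#include <malloc.h>   // _aligned_malloc / _aligned_free
#endif

// Allocate `size` bytes aligned to `alignment` (a power of two).
static void* malloc_aligned(size_t alignment, size_t size) {
#ifdef _WIN32
    return _aligned_malloc(size, alignment);   // note the swapped argument order
#else
    void* p = NULL;
    if (posix_memalign(&p, alignment, size) != 0) return NULL;
    return p;
#endif
}

// Matching free, since memory from _aligned_malloc must go through _aligned_free.
static void free_aligned(void* p) {
#ifdef _WIN32
    _aligned_free(p);
#else
    free(p);
#endif
}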

AVX/AVX2: I understand your concerns here. The AVX2 support does indeed add complexity, and adding further support for other SIMD architectures like Neon, AVX512, etc., would increase it further. However, the AVX2 support provides a 30-40% speed increase at least, a significant improvement that might warrant the added complexity for users who need high performance.

While a BLAS-based approach could potentially offer a cleaner solution, it may not provide the same speed benefits, especially given that BLAS doesn't support FP16, which AVX easily can. Also, while BLAS could enable GPU acceleration, that would add another layer of complexity and dependency, moving away from the simplicity of the current "just C" solution.

My intention with this PR was to provide an optional speed enhancement for those who require it, while keeping the base version accessible for less experienced developers or those who don't require the speed-up.

I'm open to suggestions on how we can best balance performance and complexity in this project, and am willing to make necessary changes to ensure the improvements can be integrated effectively.

Adding to my previous points, I also view this project as an excellent piece of educational material for developers who are looking to deepen their understanding of these types of systems. To this end, introducing advanced optimization techniques, such as those enabled by AVX2, can provide an invaluable learning experience. These techniques are widely used in the industry and offering exposure to them in this project can significantly benefit developers in their learning journey.

To balance the goals of education and high performance, we could consider maintaining two versions of the project - a 'vanilla' version for newer developers or those who prefer simplicity, and an 'advanced' version that includes these performance optimizations. This way, users can choose the version that best suits their needs and experience level.

@krzysztof-jusiak

krzysztof-jusiak commented Jul 26, 2023

Thanks for your comments, I totally understand where you're coming from. There are definitely trade-offs to be made.
Simplicity and performance are both important. One option would be to get perf numbers for the improvements to understand the trade-offs a bit better. Also, I believe there are still a lot of gains to be made with just small code changes and compiler help (for example #95 helped a lot, but it's borderline complex for run.c IMHO). I also see your educational point and totally agree with it. I also think that the advanced version already exists in the form of llama.cpp, which has all of it and more, such as SIMD, BLAS, GPU, not even mentioning quantization, but that doesn't mean llama2.c can't have it too, though maybe not in the pure run.c by default? It mainly depends on @karpathy's vision for the project in the long term (my personal prediction is that more complex perf improvements will eventually happen, probably via backends/kernels, in some shape or form as the project gets more mature and more used (everyone wants to run llama2-70B locally), but I also think the goal of having easy-to-understand, easy-to-follow inference is just too powerful to ignore, so both strengths should be maintained <- just my personal opinion)

@Foundation42
Author

Thank you for your thoughtful feedback. I completely agree with your points on the importance of both simplicity and performance. And I understand the need to maintain a balance so that the project remains accessible to developers of all skill levels while still performing optimally.

My primary intention with these changes was to introduce performance optimization techniques and demonstrate how they can coexist within an accessible codebase like llama2.c. I also wanted to ensure that newer developers have the opportunity to see and learn from these techniques, as they are often crucial in real-world applications.

Your suggestion to obtain performance numbers to better understand the trade-offs is a great idea. I am willing to run performance tests and provide more data to help inform the decision-making process. I do believe that the substantial speed gains we're seeing (up to 30-40%) could justify the added complexity, especially since these optimizations can be wrapped in an #ifdef guard to allow users to enable or disable them according to their needs.

Also, I appreciate your comment on having an advanced version like llama.cpp. It makes sense to have a high-performance variant alongside a simpler, more educational variant. However, given that llama.cpp introduces additional complexities such as GPUs, BLAS, quantization, etc., there might be value in considering llama2.c as a middle ground, where we introduce more advanced techniques such as SIMD while keeping the rest of the codebase relatively simple.

Ultimately, the direction of the project lies with @karpathy and the community's vision. But I'm hopeful we can find a way to incorporate these performance enhancements in a way that aligns with that vision. I look forward to further discussions and feedback.

I've created the PR for the exporter changes. Very much hope it works out for you.

@Foundation42
Author

Here are some preliminary benchmarks for your consideration:

Baseline

f42@formica:~/dev/llama2.c$ ./run out44m/model44m.bin
<s>
 Once upon a time, there was a boy named Timmy. Timmy loved to play outside and look for treasure. One day, he found a big chest buried in the ground. He was so excited!
Timmy ran home to show his mom the chest. "Mommy, mommy, look what I found!" he said, holding up the chest.
His mom looked at the chest and said, "Wow, that's a big chest! Let's open it!"
Inside the chest, there was a stuffed bear that was very hairy. Timmy loved the bear and hugged it tight.
But then, Timmy's little sister came in and wanted to play with the bear too. Timmy didn't want to share, so he folded the bear up upside down and said, "No, it's mine!"
His mom reminded him, "Timmy, it's important to share with others. Remember, it's nice to share."
Timmy thought about what his mom said and realized she was right. He cut the bear in half and gave one half to his sister. They both hugged
achieved tok/s: 20.202020

Fast

f42@formica:~/dev/llama2.c$ ./run out44m/model44m.bin
<s>
 One day, a little girl named Sue wanted to help her mom. Her mom was making tasty food in the kitchen. Sue asked, "Mom, can I help you make the food?" Her mom said, "Yes, you can help me cut the celery."
Sue was very happy to help. She took the celery and started to cut it. But then, something unexpected happened. A big, funny dog came into the kitchen. The dog saw the celery and wanted some too.
The dog jumped up and took the celery from Sue. Sue was sad and cried. The dog ran away with the celery and Sue did not get any tasty food that day. The dog ate the celery and Sue was still sad.
<s>
 One day, a little girl named Lily went for a walk. She saw a wide tree with a happy face. The tree was smiling at her. She was very happy to see the tree.
Lily saw a big red ball under the tree. She wanted to play with it. She tried to push the ball, but it was too wide. Lily tried and tried, but she could not move the ball.
Finally, Lily had an
achieved tok/s: 53.906085

Fast AVX2

f42@formica:~/dev/llama2.c$ ./run out44m/model44m.bin
<s>
 Once upon a time, there was a little girl. Her name was Mary. Mary was very small but she was also very brave.
One sunny day, Mary was playing in the garden when she saw a big, hairy bug on the wall.
"Oh, look at the bug!" Mary said.
When Mary realized what the bug was, she was so surprised. She had never seen a hairy bug before.
"What kind of bug is it?" Mary said.
She walked over to it and jumped up and down. The bug didn't move.
Then her mom called out, "Mary, come here, I want to give you a hug!"
So Mary hugged her mom tight and then she looked at the hairy bug.
"That bug is so hairy," Mary said.
"Yes," said her mom, "And how did it get up so high?"
Mary smiled and said, "I just followed it down to the ground and then put my arm under it."
And so Mary they did just that, just like Mary did that day.
<s>
 Once upon a time, there was a little boy named Timmy. Timmy loved to eat fruit, especially app
achieved tok/s: 78.383344

OMP

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run out44m/model44m.bin
<s>
 Once upon a time, there was a little girl named Lily. She loved to watch cartoons on TV. One day, she saw a funny cartoon about a silly dog. She laughed and laughed until her mom came in.
"Mommy, can we wrap this cake for your birthday?" asked Lily.
"Sure, sweetie," replied her mom. "Let's go to the kitchen and get the wrapping paper."
Lily helped her mom wrap the cake and put it in the oven. Later that day, Lily and her friends went on a parade. They saw a parade with lots of funny animals like a clown and a princess.
"Look at those funny clowns!" said Lily.
"They're not very normal," agreed her friend, Sarah.
Lily didn't mind though, she loved spending time with her friends and watching the parade. When they got home, they had a big piece of cake and it was the best birthday ever.
<s>
 Once upon a time, there was a little boy named Timmy. Timmy loved to play outside with his friends. One day, Timmy and
achieved tok/s: 254.220457

OMP/AVX2

f42@formica:~/dev/llama2.c$ OMP_NUM_THREADS=12 ./run out44m/model44m.bin
<s>
 Once upon a time, there was a messy dog named Spot. Spot loved to play outside and make his toys lay all over the place. One day, he found a can of paint and decided to have some fun and spread it all over his toys.
As Spot played with his toys, he made a big mess. The paint dripped down the toys like rain. Spot didn't know he was making a mess, so he kept on playing and getting it on his toys.
When Spot's mom saw the mess, she was sad. She told Spot that he needed to clean up and not make a mess for real long. Spot learned that it was better to have fun and not make a mess than to make things right.
<s>
 Once upon a time, there was a little girl named Lily. She was very scared of ghosts. One night, she saw a ghost in her room. She screamed and ran to her mom.
"Mommy, there's a ghost in my room!" Lily cried.
"It's okay, Lily. Ghosts aren't real. Let's measure how far away you go,"
achieved tok/s: 290.249433

@krzysztof-jusiak

Nice, thank you. I think the trade-offs are clearer now.
It's also definitely faster, though the improvement is a bit smaller with OMP enabled, but still noticeable.
BTW Is the baseline with #95 which helps a bit without much complexity added?
Just as a side note, the benchmark also shows that the compiler is doing a pretty decent job of vectorizing the code, pretty impressive.
I wonder how much speedup for larger models (which are memory bound) could be achieved by quantization and similar techniques, which would most likely require a custom SIMD implementation. In the case of llama.cpp, quantization gives a huge performance boost, especially q4; however, it's not exactly the same model as the original, and I don't think that's within the scope of this project.

@Foundation42
Author

BTW Is the baseline with #95 which helps a bit without much complexity added?

No, yours was a separate PR so I haven't integrated that yet. The loop-unrolling stuff is a good idea.

Will post benchmarks on your PR

@krzysztof-jusiak

krzysztof-jusiak commented Jul 26, 2023

BTW, one possible middle-ground solution, which potentially wouldn't add too much complexity and would be portable, is to use GCC Vector Extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html).

That would allow using SIMD in a portable way with minimal changes via:

typedef float v4sf __attribute__ ((vector_size (16)));

and with your changes to the aligned memory.
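
To make that concrete, here is a minimal sketch of a matmul written with those vector extensions, assuming an 8-wide float vector to mirror AVX2 and 32-byte-aligned buffers; it is an illustration of the approach, not code from this thread.

typedef float v8sf __attribute__((vector_size(32)));  // 8 floats = 256 bits

// W (d,n) @ x (n,) -> o (d,), assuming n is a multiple of 8 and both
// w and x are 32-byte aligned (needed for the pointer casts below).
void matmul_vec(float* o, const float* x, const float* w, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        v8sf sum = {0.0f};
        const v8sf* wv = (const v8sf*)&w[i * n];
        const v8sf* xv = (const v8sf*)x;
        for (int j = 0; j < n / 8; j++) {
            sum += wv[j] * xv[j];       // element-wise multiply-accumulate
        }
        float val = 0.0f;
        for (int k = 0; k < 8; k++) {   // horizontal reduction over the 8 lanes
            val += sum[k];
        }
        o[i] = val;
    }
}

The compiler lowers the vector arithmetic to AVX2, NEON, etc. depending on the target, which is the portability argument above.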

Most of the perf is gained either by reducing memory-bound work or by accelerating matmul, so there is still tons of room to improve. It's just about finding the balance between complexity and performance.


@karpathy
Owner

Really appreciate this work. As both of you mentioned, I'd like to keep this project readable and not go too wild on optimizations. That space will be best served by llama.cpp. So I'm looking for the simplest tweaks that maintain readability and cross-platformness as a solid foundation, and those should go into master. I'm sure there will be some proliferation of forks/other repos that create extensions with slightly different tradeoffs, which is great.

@cgbur
Contributor

cgbur commented Jul 31, 2023

Really awesome PR @Foundation42 thank you for posting this. I've implemented the fused matmul and the aligned allocation based on your PR in my project and seen a 12% lift in token/s. I've added you to credits, thanks for teaching me something new!

@karpathy
Owner

Bleh, I tried to cherry-pick minimal changes from this PR to use AVX2 but I get a Segmentation Fault... Is there no way to just surgically use a single AVX2 matmul:


void matmul(float* o, const float* x, const float* w, int n, int d) {
    // W (d,n) @ x (n,) -> o (d,)

    int nn = n / 8 * 8;  // round n down to the nearest multiple of 8
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        __m256 sum_vec = _mm256_setzero_ps(); // for AVX2, sum of 8 floats
        int i_n = i * n;
        for (int j = 0; j < nn; j += 8) {
            // Load 8 values from w and x
            __m256 w_vec = _mm256_load_ps(&w[i_n + j]);
            __m256 x_vec = _mm256_load_ps(&x[j]);
            // Multiply and accumulate
            __m256 prod_vec = _mm256_mul_ps(w_vec, x_vec);
            sum_vec = _mm256_add_ps(sum_vec, prod_vec);
        }

        // Perform horizontal add
        sum_vec = _mm256_hadd_ps(sum_vec, sum_vec);
        sum_vec = _mm256_hadd_ps(sum_vec, sum_vec);
        float vals[8];
        _mm256_storeu_ps(vals, sum_vec);
        float val = vals[0] + vals[4];

        // handle remainder if n is not a multiple of 8
        for (int j = nn; j < n; j++) {
            val += w[i_n + j] * x[j];
        }
        o[i] = val;
    }
}

in the code with minimal changes? Is it necessary to do the cache alignment part, and the model re-export parts?

@cgbur
Contributor

cgbur commented Aug 10, 2023

Yes, I am pretty sure that to use these intrinsics you need to have the alignment set to 32 bytes. If you want to avoid having to change the allocation patterns, you can try using the unaligned load operations (_mm256_loadu_ps), which will be slower.

#ifdef __AVX2__
#include <immintrin.h> // AVX2

void matmul(float* o, const float* x, const float* w, int n, int d) {
    // W (d,n) @ x (n,) -> o (d,)

    int nn = n / 8 * 8;  // round n down to the nearest multiple of 8
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        __m256 sum_vec = _mm256_setzero_ps(); // for AVX2, sum of 8 floats
        int i_n = i * n;
        for (int j = 0; j < nn; j += 8) {
            // Load 8 values from w and x
            __m256 w_vec = _mm256_loadu_ps(&w[i_n + j]);
            __m256 x_vec = _mm256_loadu_ps(&x[j]);
            // Multiply and accumulate
            __m256 prod_vec = _mm256_mul_ps(w_vec, x_vec);
            sum_vec = _mm256_add_ps(sum_vec, prod_vec);
        }

        // Perform horizontal add
        sum_vec = _mm256_hadd_ps(sum_vec, sum_vec);
        sum_vec = _mm256_hadd_ps(sum_vec, sum_vec);
        float vals[8];
        _mm256_storeu_ps(vals, sum_vec);
        float val = vals[0] + vals[4];

        // handle remainder if n is not a multiple of 8
        for (int j = nn; j < n; j++) {
            val += w[i_n + j] * x[j];
        }
        o[i] = val;
    }
}
#else
void matmul(float* xout, float* x, float* w, int n, int d) {
    // W (d,n) @ x (n,) -> xout (d,)
    // by far the most amount of time is spent inside this little function
    int i;
    #pragma omp parallel for private(i)
    for (i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
#endif

It seems you must pass -march=native or similar for it to take effect. I also noticed a slowdown compared to NOT specifying the AVX2 matmul.

@karpathy
Owner

Thank you @cgbur, that worked!! On my machine this takes the 110M model from 26 tok/s -> 33 tok/s, for around a 27% boost. Very cool. I like AVX2. I will look into memory alignment.

@karpathy
Owner

Random note: I love that via llama2.c I've learned so much that I didn't know previously. I haven't spent as much time "below Python", but it's really fun here.

@karpathy
Owner

karpathy commented Aug 10, 2023

I don't fully understand which part of the export guarantees the 32-byte memory alignment. There are a number of changes there (e.g. the change to HalfFloat) mixed in. What is the minimal diff that would be needed to make the export work with aligned AVX2 intrinsics?

@cgbur
Contributor

cgbur commented Aug 10, 2023

I'm not too familiar with the details of the PR changes, but it looks like by ensuring that the header is aligned properly, the rest of it becomes aligned correctly?

    # header magic version integer added for two reasons
    # 1) so that we can version the header
    # 2) so that the struct maintains strict cache alignment
    #    which is necessary so that the weights that follow the header are also cache aligned
    header_magic_version = 0x42000000
    header = struct.pack('iiiiiiii', header_magic_version, p['dim'], hidden_dim, n_layers, p['n_heads'], 
                                    n_kv_heads, -p['vocab_size'], p['max_seq_len'])
// Config structure needs to be CACHE ALIGNED (Typically 32 Bytes)
// If you change this, it is important that export_meta_llama_bin.py is updated as well

IIRC your current config is 28 bytes, which might throw off the alignment of the rest of the weights. Once again, just a guess. I will have to play with it later to be more certain.

@karpathy
Owner

Ahhh, maybe that makes sense. With the addition of header_magic_version we'd be writing 8 ints, not 7, in the header, so 8*4 = 32 bytes. Then it still has to be the case that in run.c later, the pointer that holds the weights is aligned. The RunState is aligned because of the use of a special calloc function. But the weights don't seem to be aligned, yet they are still checked for alignment. :\

@cgbur
Contributor

cgbur commented Aug 10, 2023

Minor nit: the author says cache aligned, but most cache lines today are 64 bytes. I think it is more appropriate to say that it's vector-width aligned. A value of 32 bytes here means that you are supporting 32*8 = 256-bit instructions.

@cgbur
Contributor

cgbur commented Aug 10, 2023

But the weights don't seem to be aligned, yet they are still checked for alignment. :\

I think they are, relying on the dimensions of the model being sane enough to stumble into being aligned. I assume these models are designed for GPUs, which have much greater alignment requirements, so it's probably a decent bet that the model weights will fit nicely.

Instead of putting a magic value there, you could write a u32 that is the alignment of the model, computed by checking all the dimensions? Just occurred to me. But then I guess it's probably safer to make the config 64 bytes to support AVX-512. I would need to test these things before making more claims :).

Another reason for stumbling so successfully is that mmap aligns the mapping to a page boundary when NULL is passed as the address, which run.c does. I realized this in my implementation because Zig required me to specify page alignment on the return type. This is why ensuring that the weights data has good alignment within the file is enough to ensure that the in-memory mapping is aligned properly.

const data: []align(mem.page_size) u8 = try std.os.mmap(null, file_size, std.os.PROT.READ, std.os.MAP.PRIVATE, mapped_checkpoint.handle, 0);
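
A rough C equivalent of the same observation (a sketch, not run.c verbatim; names are illustrative): because the first argument to mmap is NULL, the mapping starts on a page boundary, so the weights pointer ends up 32-byte aligned whenever the header size is a multiple of 32.

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

// Map the checkpoint and return a pointer to the weights that follow the header.
static float* map_weights(int fd, size_t file_size, size_t header_bytes) {
    void* data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);  // page-aligned start
    if (data == MAP_FAILED) return NULL;
    float* weights = (float*)((char*)data + header_bytes);
    // Holds as long as header_bytes is a multiple of 32, e.g. a 32-byte Config.
    assert(((uintptr_t)weights % 32) == 0);
    return weights;
}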
