[mmap] Memory mapped file for weights #30
Conversation
Problem:
- clock() measures CPU time and doesn't work properly with parallel execution.
- Execution performance is bound by the matmul over the weights.

Solution:
- Use gettimeofday instead.
- Utilize OpenMP to parallelize matmul.

Notes:
- If not compiled with -fopenmp, the #pragma is ignored and single-threaded execution is performed.
- There are additional env variables for OpenMP (optional) to set the number of threads, the scheduler, etc.

Benchmarks:
```
clang -Ofast -march=native run.c -lm -o run // achieved tok/s: 340.878828
clang -Ofast -fopenmp -march=native run.c -lm -o run // achieved tok/s: 524.590164
```
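A minimal sketch of the two changes, assuming run.c's `matmul(float* xout, float* x, float* w, int n, int d)` signature; this is illustrative rather than the exact diff:
```c
#include <sys/time.h>

// wall-clock time in ms; unlike clock(), gettimeofday() measures elapsed
// time rather than CPU time summed across all OpenMP threads
long time_in_ms() {
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec * 1000 + t.tv_usec / 1000;
}

void matmul(float* xout, float* x, float* w, int n, int d) {
    // each output element is an independent dot product, so rows can be
    // computed in parallel; without -fopenmp the pragma is silently ignored
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```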
Problem:
- Allocating memory for the weights and loading them in is often not possible for bigger models.
- There is a lot of boilerplate to allocate and assign the weights.

Solution:
- Use a memory-mapped file instead.
- Clean up the allocation boilerplate with mmap.

Notes:
- Only one option is kept for simplicity.
- The performance is a bit less stable, but I also had runs faster than without mmap; overall it's pretty close ATM. I had runs with ~700 tok/s with mmap enabled.
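A minimal sketch of the idea, assuming the checkpoint layout run.c uses (a small config header followed by the raw float weights); identifiers here are illustrative, not the exact code in the PR:
```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// map the checkpoint file read-only; pages are faulted in on first access
// instead of being copied up front with malloc + fread
float* map_checkpoint(const char* path, ssize_t* file_size) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return NULL; }
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return NULL; }
    *file_size = st.st_size;
    void* data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after the fd is closed
    return data == MAP_FAILED ? NULL : (float*)data;
}
// the per-layer weight pointers then become offsets into the mapping
// instead of separately malloc'd and fread buffers
```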
Okay this looks interesting, thank you. I think it will take me a bit more to digest this PR and understand each piece, which I'd like to do before merging. Just so I understand - the weights are not loaded into RAM and remain on disk? Would you not expect this to decrease the latency? Any potential impacts on portability to consider?
Yeah, the process maps the weights into its virtual address space, which is more I/O bound, but it can also be more efficient, as the OS can optimize access by limiting page faults and/or providing better caching.
@karpathy it moves the responsibility of reading from disk to RAM into the kernel. The weights will get paged in on demand as the data is requested. If there's a surplus of RAM, you can usually rely on the kernel to aggressively allocate RAM for the data and copy it in from disk. It will also transparently handle datasets that are too big to fit in RAM. The tradeoff is you're relinquishing control of the allocation, so performance will be less consistent.
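If that variance matters, the kernel's paging can at least be hinted; a small sketch (not part of this PR) using POSIX madvise on the mapping from the description above:
```c
#include <stddef.h>
#include <sys/mman.h>

// optional hint: ask the kernel to read the mapped weights ahead of demand,
// trading upfront I/O for fewer page faults during generation;
// `data` and `file_size` are the mmap result from the earlier sketch
void prefetch_weights(void* data, size_t file_size) {
    madvise(data, file_size, MADV_WILLNEED); // advisory only; failure is safe to ignore
}
```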
Got it, ty @zackangelo / @krzysztof-jusiak . Definitely seems like something we should get in. I think I'm supposed to go to my real-life work around now, but I'll focus on trying to get this in when I get back in the evening. Thank you!
Hi, mmap is a good idea to save memory on low-memory devices, and I also did it for llama.cpp 4 months ago: ggerganov/llama.cpp@master...Zepan:llama.cpp:master But llama2.c is far more minimal than llama.cpp; it is so small that I even want to run it on MCUs. Maybe using a macro to switch between the different implementations would be better? Last word: I love this project very much, and I love running tiny models on embedded devices.
@Zepan ty for the comment. I think it's just a question of what should be the default implementation. I feel like most people will want to run this in contexts where mmap is available? (think Windows, Linux, Mac, Android, iOS, etc). The (potentially fewer?) people who wish to run it on MCUs will probably be savvy enough to make the change to the current malloc regime? Thoughts?
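For reference, the compile-time switch @Zepan suggests could look roughly like this; `LLAMA2C_NO_MMAP` is an invented flag for illustration, not an option in the repo:
```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

float* load_weights(const char* checkpoint, size_t file_size) {
#ifdef LLAMA2C_NO_MMAP
    // MCU / no-MMU path: keep the plain malloc + fread regime
    float* data = malloc(file_size);
    FILE* f = fopen(checkpoint, "rb");
    if (!data || !f || fread(data, 1, file_size, f) != file_size) return NULL;
    fclose(f);
    return data;
#else
    // default path: let the kernel page the weights in on demand
    int fd = open(checkpoint, O_RDONLY);
    if (fd == -1) return NULL;
    void* data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return data == MAP_FAILED ? NULL : (float*)data;
#endif
}
```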
merged a slightly more commented version here ty 133ad3f |