[mmap] Memory mapped file for weights #30
Conversation
Problem:
- clock() measures CPU time and doesn't work properly with parallel execution.
- Execution performance is bound by the matmul over the weights.

Solution:
- Use gettimeofday instead.
- Utilize OpenMP to parallelize matmul.

Notes:
- If not compiled with -fopenmp, the #pragma is ignored and single-threaded execution is performed.
- There are additional env variables for OpenMP (optional) to set the number of threads, the scheduler, etc.

Benchmarks:
```
clang -Ofast -march=native run.c -lm -o run // achieved tok/s: 340.878828
clang -Ofast -fopenmp -march=native run.c -lm -o run // achieved tok/s: 524.590164
```
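A minimal sketch of the two changes, assuming run.c's `matmul(float* xout, float* x, float* w, int n, int d)` signature; this is illustrative rather than the exact diff:
```c
#include <sys/time.h>

// wall-clock time in ms; unlike clock(), gettimeofday() measures elapsed
// time rather than CPU time summed across all OpenMP threads
long time_in_ms() {
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec * 1000 + t.tv_usec / 1000;
}

void matmul(float* xout, float* x, float* w, int n, int d) {
    // each output element is an independent dot product, so rows can be
    // computed in parallel; without -fopenmp the pragma is silently ignored
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```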
Problem:
- Allocating memory for the weights and loading them in is often not possible for bigger models.
- There is a lot of boilerplate to allocate and assign the weights.

Solution:
- Use a memory-mapped file instead.
- Clean up the allocation boilerplate with mmap.

Notes:
- Only one option is kept for simplicity.
- The performance is a bit less stable, but I also had runs faster than without mmap; overall it's pretty close ATM. I had runs with ~700 tok/s with mmap enabled.
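A minimal sketch of the idea, assuming the checkpoint layout run.c uses (a small config header followed by the raw float weights); identifiers here are illustrative, not the exact code in the PR:
```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// map the checkpoint file read-only; pages are faulted in on first access
// instead of being copied up front with malloc + fread
float* map_checkpoint(const char* path, ssize_t* file_size) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return NULL; }
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return NULL; }
    *file_size = st.st_size;
    void* data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after the fd is closed
    return data == MAP_FAILED ? NULL : (float*)data;
}
// the per-layer weight pointers then become offsets into the mapping
// instead of separately malloc'd and fread buffers
```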
Okay this looks interesting, thank you. I think it will take me a bit more to digest this PR and understand each piece, which I'd like to do before merging. Just so I understand - the weights are not loaded into RAM and remain on disk? Would you not expect this to decrease the latency? Any potential impacts on portability to consider?
Yeah, the process maps the weights into its virtual address space, which is more I/O bound, but it can also be more efficient, as the OS can optimize access by limiting page faults and/or providing better caching.
@karpathy it moves the responsibility of reading from disk to RAM into the kernel. The weights will get paged in on demand as the data is requested. If there's a surplus of RAM, you can usually rely on the kernel to aggressively allocate RAM for the data and copy it in from disk. It will also transparently handle datasets that are too big to fit in RAM. The tradeoff is you're relinquishing control of the allocation, so performance will be less consistent.
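If that variance matters, the kernel's paging can at least be hinted; a small sketch (not part of this PR) using POSIX madvise on the mapping from the description above:
```c
#include <stddef.h>
#include <sys/mman.h>

// optional hint: ask the kernel to read the mapped weights ahead of demand,
// trading upfront I/O for fewer page faults during generation;
// `data` and `file_size` are the mmap result from the earlier sketch
void prefetch_weights(void* data, size_t file_size) {
    madvise(data, file_size, MADV_WILLNEED); // advisory only; failure is safe to ignore
}
```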
Got it, ty @zackangelo / @krzysztof-jusiak . Definitely seems like something we should get in. I think I'm supposed to go to my real-life work around now, but I'll focus on trying to get this in when I get back in the evening. Thank you!
Hi, mmap is a good idea to save memory on low-memory devices, and I also did it for llama.cpp 4 months ago: ggerganov/llama.cpp@master...Zepan:llama.cpp:master But llama2.c is far more minimal than llama.cpp; it is so small that I even want to run it on MCUs. Maybe using a macro to switch between the different implementations would be better? Last word: I love this project very much, and I love running tiny models on embedded devices.
@Zepan ty for the comment. I think it's just a question of what should be the default implementation. I feel like most people will want to run this in contexts where mmap is available? (think Windows, Linux, Mac, Android, iOS, etc). The (potentially fewer?) people who wish to run it on MCUs will probably be savvy enough to make the change to the current malloc regime? Thoughts?
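For reference, the compile-time switch @Zepan suggests could look roughly like this; `LLAMA2C_NO_MMAP` is an invented flag for illustration, not an option in the repo:
```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

float* load_weights(const char* checkpoint, size_t file_size) {
#ifdef LLAMA2C_NO_MMAP
    // MCU / no-MMU path: keep the plain malloc + fread regime
    float* data = malloc(file_size);
    FILE* f = fopen(checkpoint, "rb");
    if (!data || !f || fread(data, 1, file_size, f) != file_size) return NULL;
    fclose(f);
    return data;
#else
    // default path: let the kernel page the weights in on demand
    int fd = open(checkpoint, O_RDONLY);
    if (fd == -1) return NULL;
    void* data = mmap(NULL, file_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    return data == MAP_FAILED ? NULL : (float*)data;
#endif
}
```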
merged a slightly more commented version here ty 133ad3f |