[mmap] Memory mapped file for weights #30

Closed
wants to merge 2 commits

Conversation


@krzysztof-jusiak commented Jul 24, 2023

Problem:

  • Allocating memory for weights and loading them is usually not possible
    for bigger models.
  • There is a lot of boilerplate to allocate and assign weights.

Solution:

  • Use memory mapped file instead.
  • Clean it up with mmap.

Notes:

  • Only one option is kept for simplicity. @karpathy it might be wise to keep both options (mmap and alloc), though that adds complexity and LOC, so I'm not sure. mmap will pretty much be required for bigger models, so I preferred that option for the time being; it doesn't add latency and seems to be faster.
  • The performance is a bit less stable, but I've also had runs faster than
    without mmap. Overall it's pretty close ATM; I've had runs at ~700 tok/s with mmap enabled.

Problem:
- clock() measures CPU time and doesn't work properly with parallel execution.
- performance is dominated by the matmul over the weight matrices.

Solution:
- use gettimeofday instead.
- utilize openmp to parallelize matmul.

Note:
- if not compiled with -fopenmp, the #pragma is ignored and execution is
  single-threaded.
- there are additional environment variables that can optionally be set for
  OpenMP (number of threads, scheduler, etc.).

Benchmarks:
```
clang -Ofast -march=native  run.c  -lm  -o run          // achieved tok/s: 340.878828
clang -Ofast -fopenmp -march=native  run.c  -lm  -o run // achieved tok/s: 524.590164
```
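For reference, a hedged sketch of the two changes, assuming run.c's matmul convention of W (d,n) @ x (n,) -> xout (d,); the helper name time_in_ms is illustrative, not necessarily the exact code in this PR:
```
/* Hedged sketch of the two changes. */
#include <sys/time.h>

/* wall-clock milliseconds via gettimeofday(); clock() sums CPU time
   across threads and therefore misreports parallel runs */
long time_in_ms(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (long)tv.tv_sec * 1000 + tv.tv_usec / 1000;
}

void matmul(float *xout, float *x, float *w, int n, int d) {
    /* W (d,n) @ x (n,) -> xout (d,); without -fopenmp the pragma
       is ignored and the loop runs single-threaded */
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += w[i * n + j] * x[j];
        }
        xout[i] = val;
    }
}
```
With -fopenmp, thread count and scheduling can then be tuned through the usual OpenMP environment variables such as OMP_NUM_THREADS and OMP_SCHEDULE.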
Problem:
- Allocating memory for weights and loading them is usually not possible
  for bigger models.
- There is a lot of boilerplate to allocate and assign weights.

Solution:
- Use memory mapped file instead.
- Clean it up with mmap.

Notes:
- Only one option is kept for simplicity.
- The performance is a bit less stable, but I've also had runs faster than
  without mmap. Overall it's pretty close ATM.
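
For context, a hedged sketch of the mmap path (the checkpoint is assumed to be the usual run.c binary file, i.e. a small config header followed by flat float32 weights; map_checkpoint and the header handling are illustrative names, not the exact code in this PR):
```
/* Hedged sketch: map the checkpoint read-only instead of malloc + fread. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

float *map_checkpoint(const char *path, size_t *out_size) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) { perror("open"); return NULL; }
    struct stat st;
    if (fstat(fd, &st) == -1) { perror("fstat"); close(fd); return NULL; }
    /* pages are faulted in from disk on demand, so there is no up-front
       allocation and no copy of the whole file */
    void *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (data == MAP_FAILED) { perror("mmap"); return NULL; }
    *out_size = (size_t)st.st_size;
    /* weight pointers are then set to offsets into this block, e.g.
       w->token_embedding_table = (float *)data + header_floats; (illustrative) */
    return (float *)data;
}
```
Cleanup becomes a single munmap(data, size) at exit instead of a free() per weight tensor, which is the boilerplate reduction mentioned above.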
@karpathy
Owner

Okay this looks interesting, thank you. I think it will take me a bit more to digest this PR and understand each piece, which I'd like to do before merging. Just so I understand: the weights are not loaded into RAM and remain on disk? Would you not expect this to increase the latency? Any potential impacts on portability to consider?

@krzysztof-jusiak
Author

Yeah, the process will map the weights into its virtual address space, which is more IO bound, but it can also be more efficient since the OS can optimize access by limiting page faults and/or caching better.
Currently it seems to be speeding up inference, and it will eventually allow tackling 7B+ models, where memory is the constraint. Startup will be faster too, as it won't have to allocate and copy. mmap is a pretty portable solution, used for example by llama.cpp on Linux/Mac/Windows.
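
On the portability point: the POSIX mmap call exists on Linux, macOS, Android, and iOS; on Windows the equivalent is CreateFileMapping/MapViewOfFile, which is roughly what llama.cpp wraps. A hedged sketch of that Windows side (not part of this PR; map_file_readonly is an illustrative name):
```
/* Hedged sketch: the Windows counterpart of POSIX mmap, shown only for
   the portability discussion. */
#ifdef _WIN32
#include <windows.h>

void *map_file_readonly(const char *path, size_t *out_size) {
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (file == INVALID_HANDLE_VALUE) return NULL;
    LARGE_INTEGER size;
    if (!GetFileSizeEx(file, &size)) { CloseHandle(file); return NULL; }
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
    CloseHandle(file);     /* the mapping object keeps its own reference */
    if (!mapping) return NULL;
    void *data = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    CloseHandle(mapping);  /* the view keeps the mapping alive */
    if (!data) return NULL;
    *out_size = (size_t)size.QuadPart;
    return data;           /* release later with UnmapViewOfFile(data) */
}
#endif
```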

@zackangelo

@karpathy it moves the responsibility of reading from disk to RAM into the kernel. The weights will get paged in on demand as the data is requested. If there's a surplus of RAM, you can usually rely on the kernel to aggressively allocate RAM for the data and copy it in from disk. It will also transparently handle datasets that are too big to fit in RAM.

The tradeoff is you're relinquishing control of the allocation, so performance will be less consistent.

@karpathy
Owner

Got it, ty @zackangelo / @krzysztof-jusiak . Definitely seems like something we should get in. I think I'm supposed to go to my real-life work around now, but I'll focus on trying to get this in when I get back in the evening. Thank you!

@Zepan

Zepan commented Jul 24, 2023

Hi, mmap is a good idea to save memory on low-memory devices, and I also did it for llama.cpp 4 months ago: ggerganov/llama.cpp@master...Zepan:llama.cpp:master

But llama2.c is far simpler than llama.cpp; it is so small that I even want to run it on MCUs.
It is not suitable to directly replace the normal malloc with mmap, as MCUs may not run a Unix-like OS or even have an MMU, so they don't have mmap APIs.

Maybe using a macro to switch between the two implementations would be better?
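
A hedged sketch of that idea; the USE_MMAP macro and load_weights name are made up here, and the fallback branch reads the whole file into one heap buffer (run.c's actual allocation is per-tensor, so this is only illustrative):
```
/* Hedged sketch of a compile-time switch; USE_MMAP and load_weights are
   hypothetical names. MCUs without an OS/MMU keep the plain read path. */
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_MMAP
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#endif

void *load_weights(const char *path, size_t *out_size) {
#ifdef USE_MMAP
    int fd = open(path, O_RDONLY);
    if (fd == -1) return NULL;
    struct stat st;
    if (fstat(fd, &st) == -1) { close(fd); return NULL; }
    void *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);
    if (data == MAP_FAILED) return NULL;
    *out_size = (size_t)st.st_size;
    return data;                       /* release with munmap(data, size) */
#else
    /* portable fallback: read the whole file into one heap buffer */
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    void *data = malloc((size_t)size);
    if (!data || fread(data, 1, (size_t)size, f) != (size_t)size) {
        free(data);
        data = NULL;
    }
    fclose(f);
    *out_size = (size_t)size;
    return data;                       /* release with free(data) */
#endif
}
```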

Lastly, I love this project very much; I love running tiny models on embedded devices.
Here is my project to run simple CNN models on tiny MCUs, even a 2KB-RAM Arduino: https://github.com/sipeed/TinyMaix
I will try to port your project to a RISC-V MCU with FP16/INT8 formats via RVV instructions.

@karpathy
Owner

@Zepan ty for comment. I think it's just a question of what should be the default implementation. I feel like most people will want to run this in contexts where mmap is available? (think Windows, Linux, Mac, Android, iOS, etc). The (potentially fewer?) people who wish to run it on MCUs will probably be savvy enough to make the change to the current malloc regime? Thoughts?

@karpathy
Owner

merged a slightly more commented version here ty 133ad3f

@karpathy closed this Jul 25, 2023