
Should all models now be chunked? #20

Open
flatsiedatsie opened this issue May 12, 2024 · 3 comments

flatsiedatsie commented May 12, 2024

I tried to load NeuralReyna, a relatively small model, but still got an out-of-memory error.

Should ALL models be chunked, even ones smaller than 2 GB?

[Screenshot attached: out-of-memory error, 2024-05-12 11:44]

Somewhat off-topic, but perhaps useful for others: I tried to do this and chunk NeuralReyna. Interestingly, it didn't want to be split into very small (100 MB) parts:

    ./gguf-split --split-max-size 100M ./neuralreyna-mini-1.8b-v0.3.q5_k_m.gguf neuralreyna-mini-1.8b-v0.3.q5_k_m
    error: one of splits have 0 tensors. Maybe size or tensors limit is too small

Even 200 MB was too small. Luckily, 250 MB worked.
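
For reference, the invocation that worked is presumably the same command with only the --split-max-size value changed:

    ./gguf-split --split-max-size 250M ./neuralreyna-mini-1.8b-v0.3.q5_k_m.gguf neuralreyna-mini-1.8b-v0.3.q5_k_m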

ngxson (Owner) commented May 12, 2024

If you get an OOM error from ggml, that means the browser doesn't want to give you more RAM. Probably you're also loading other models or other instances of wllama at the same time.

A chunked model won't help in this case, since you've already used up all the available RAM.

ngxson (Owner) commented May 12, 2024

Also, the context length n_ctx seems to be quite big; you should decrease it to save RAM.
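
As a rough illustration, here is what a smaller context looks like when loading a model through wllama's loadModelFromUrl. This is a minimal sketch, not the exact setup from this thread: the model URL is a placeholder, CONFIG_PATHS must be filled in per the wllama README, and it assumes n_ctx is accepted among the load options.

    import { Wllama } from '@wllama/wllama';

    // Paths to the wasm builds; fill these in as described in the wllama README.
    const CONFIG_PATHS = { /* ... wasm asset paths ... */ };

    const wllama = new Wllama(CONFIG_PATHS);

    // Placeholder URL. A smaller n_ctx means a smaller KV cache,
    // which is often the largest share of runtime RAM.
    await wllama.loadModelFromUrl(
      'https://example.com/neuralreyna-mini-1.8b-v0.3.q5_k_m.gguf',
      { n_ctx: 1024 },
    );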

felladrin (Contributor) commented

I noticed a significant benefit in splitting the models, mostly due to the cache size constraints of Safari.
Mobile Safari has a cache limit of 300 MB, while Desktop Safari has a limit of less than 1 GB.
If the model size exceeds the limit, the user has to re-download the model after refreshing the page.
Besides that, as mentioned in the Readme, it helps reduce the time required to download the model.
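
Continuing the sketch above, loading a split model would then look like the following. The URL is a placeholder, and the assumption (based on the project README's split-model notes) is that passing the first shard's URL is enough for wllama to fetch the remaining -0000X-of-0000Y shards automatically:

    // Each ~250 MB shard stays under Mobile Safari's ~300 MB cache ceiling,
    // so a page refresh does not force a full re-download.
    await wllama.loadModelFromUrl(
      'https://example.com/neuralreyna-mini-1.8b-v0.3.q5_k_m-00001-of-00008.gguf',
      { n_ctx: 1024 },
    );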
