I tried to load NeuralReyna, a relatively small model, but I still got an out-of-memory error.
Should ALL models be chunked, even ones smaller than 2 GB?
Somewhat off-topic, but perhaps useful for others: I tried to do this and chunk NeuralReyna. Interestingly, it refused to be chunked into very small (100 MB) parts:
./gguf-split --split-max-size 100M ./neuralreyna-mini-1.8b-v0.3.q5_k_m.gguf neuralreyna-mini-1.8b-v0.3.q5_k_m
error: one of splits have 0 tensors. Maybe size or tensors limit is too small
Even 200 MB was too small. Luckily, 250 MB worked.
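For anyone scripting this: gguf-split writes its output shards with zero-padded five-digit indices, `<prefix>-00001-of-0000N.gguf`. A small helper (hypothetical, not part of llama.cpp) that predicts those names, e.g. for building a download URL list:

```python
def shard_names(prefix: str, n_shards: int) -> list[str]:
    """Predict the file names gguf-split produces for a given output prefix.

    Shards are numbered 1..N with zero-padded 5-digit indices, e.g.
    neuralreyna-mini-1.8b-v0.3.q5_k_m-00001-of-00006.gguf
    """
    return [
        f"{prefix}-{i:05d}-of-{n_shards:05d}.gguf"
        for i in range(1, n_shards + 1)
    ]
```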
If you get an OOM error from ggml, it means the browser won't give you more RAM. You're probably also loading other models or instances of wllama at the same time.
A chunked model won't help in this case, since you've already used up all the available RAM.
I noticed a significant benefit from splitting the models, mostly due to Safari's cache size constraints.
Mobile Safari has a cache limit of 300 MB, while Desktop Safari's limit is under 1 GB.
If the model size exceeds the limit, the user has to re-download the model after refreshing the page.
Besides that, as mentioned in the README, splitting helps reduce the time required to download the model.
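A quick back-of-the-envelope check of the numbers above. The 300 MB and sub-1 GB cache limits are quoted from this thread; the ~1.3 GB model size is an assumption for a 1.8B q5_k_m file, not a measured value:

```python
import math

# Safari cache limits quoted in the discussion above (approximate).
MOBILE_SAFARI_CACHE_MB = 300
DESKTOP_SAFARI_CACHE_MB = 1000  # "less than 1 GB"

# Assumed size of a 1.8B q5_k_m GGUF file; adjust to the real file size.
MODEL_SIZE_MB = 1300
SHARD_SIZE_MB = 250  # the smallest --split-max-size that worked above


def n_shards(model_mb: int, shard_mb: int) -> int:
    """Number of shards gguf-split would produce at a given max shard size."""
    return math.ceil(model_mb / shard_mb)


# Each 250 MB shard stays under the Mobile Safari cache limit,
# while the unsplit ~1.3 GB file would exceed even the desktop limit.
assert SHARD_SIZE_MB <= MOBILE_SAFARI_CACHE_MB
assert MODEL_SIZE_MB > DESKTOP_SAFARI_CACHE_MB
print(n_shards(MODEL_SIZE_MB, SHARD_SIZE_MB))  # 6
```

So under these assumptions the model downloads as six cacheable chunks instead of one file that Safari silently evicts.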