Implement configurable context length #1749
Conversation
```cpp
/* TODO(cebtenzzre): after we fix requiredMem, we should change this to happen at
 * load time, not construct time. right now n_ctx is incorrectly hardcoded 2048 in
 * most (all?) places where this is called, causing underestimation of required
 * memory. */
```
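To illustrate why the hardcoded value matters (a back-of-the-envelope sketch, not code from this PR; it assumes an f16 KV cache and illustrative 7B-style dimensions): the KV cache grows linearly with n_ctx, so a memory estimate sized for 2048 can be off by gigabytes at larger contexts.

```python
# Rough KV cache size estimate. n_layer/n_embd are illustrative
# (roughly a 7B LLaMA-style model); the cache is assumed to be f16.
n_layer = 32
n_embd = 4096
bytes_per_elem = 2  # f16

def kv_cache_bytes(n_ctx: int) -> int:
    # Keys + values: one of each per layer, per position.
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem

print(kv_cache_bytes(2048) / 2**20)  # ~1024 MiB at the hardcoded n_ctx
print(kv_cache_bytes(8192) / 2**20)  # ~4096 MiB at a larger context
```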
@apage43 Do you think it would be relatively easy to switch this to a load-time check instead of a construct-time one? It doesn't matter so much right now since it's not working anyway (unresolved fallout from the switch to GGUF).
The reason it's construct-time is so that we do the fallback to CPU transparently: callers of construct passing "auto" just get the CPU implementation if the memory requirement is too high for Metal.
If it's changed to fail at load time, callers will have to handle that fallback themselves, which is likely fine, but it would need to be done in all the bindings.
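For illustration, here is a rough sketch of what that per-binding fallback might look like (hypothetical Python; `load_model`, `OutOfDeviceMemory`, and the device strings are made-up names, not the actual gpt4all binding API):

```python
class OutOfDeviceMemory(Exception):
    """Hypothetical: raised by the loader when the model won't fit on the device."""

def load_model(path: str, device: str):
    # Stand-in for the real loader; here the GPU path always fails so
    # the fallback branch is exercised.
    if device == "gpu":
        raise OutOfDeviceMemory(f"{path} too large for GPU")
    return f"<model {path} on {device}>"

def load_with_fallback(path: str, device: str = "auto"):
    if device != "auto":
        return load_model(path, device)
    try:
        return load_model(path, "gpu")
    except OutOfDeviceMemory:
        # Transparent fallback: "auto" callers still get a working
        # model, just on the CPU implementation.
        return load_model(path, "cpu")

print(load_with_fallback("model.gguf"))  # -> <model model.gguf on cpu>
```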
The chat UI is already doing load-time fallback for Vulkan. And this is really the only way to do it because it's the user code that decides which GPU to use, which is of course initialized after a backend/implementation is available. We should make sure the bindings are capable of this too.
I think it would make sense to only ever dlopen one build of llamamodel-mainline on Apple silicon, as there's nothing we are currently doing that the Metal build isn't capable of.
Force-pushed from 2054338 to 358d619
Tested and working with the Python bindings and the GUI. The other bindings are still hardcoded to 2048, but it shouldn't be hard to expose the context length via their APIs if desired.
For the python bindings, this is the n_ctx parameter of the GPT4All constructor:
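For example (a minimal sketch; the model file name is illustrative):

```python
from gpt4all import GPT4All

# n_ctx sets the context length when the model is loaded.
model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf", n_ctx=4096)
```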
In the UI, this is a per-model parameter.
This doesn't take effect until switching models or restarting; this fact is noted in the tooltip. For now, this is the simplest way to do it, although IMO it would be nice to have a way to reload the model in the future (similar to TGWUI's "Reload" button on the model tab).