Implement configurable context length #1749

Merged
merged 1 commit on Dec 16, 2023
Conversation

cebtenzzre (Member) commented Dec 12, 2023

Tested working with the Python bindings and the GUI. The other bindings are still hardcoded to 2048, but it shouldn't be hard to expose the context length via their APIs if desired.

For the Python bindings, this is the n_ctx parameter of the GPT4All constructor:

class GPT4All:
    # ...

    def __init__(
        self,
        # ...
        n_ctx: int = 2048,
        verbose: bool = False,
    ):
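
For example, here is a minimal usage sketch (the model filename is illustrative; substitute any model you have downloaded locally):

from gpt4all import GPT4All

# Ask for an 8192-token context window instead of the default 2048.
# The filename below is only an example.
model = GPT4All("mistral-7b-instruct-v0.1.Q4_0.gguf", n_ctx=8192)
print(model.generate("Summarize this long transcript ...", max_tokens=200))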

In the UI, this is a per-model parameter:
[Screenshot: per-model context length setting in the model settings dialog]

This doesn't take effect until switching models or restarting. This fact is noted in the tooltip. For now, this is the simplest way to do it, although IMO it would be nice to have a way to reload the model in the future (similar to TGWUI's "Reload" button on the model tab).

Comment on lines +157 to +160
/* TODO(cebtenzzre): after we fix requiredMem, we should change this to happen at
* load time, not construct time. right now n_ctx is incorrectly hardcoded 2048 in
* most (all?) places where this is called, causing underestimation of required
* memory. */
cebtenzzre (Member Author)
@apage43 Do you think it would be relatively easy to switch this to a load-time check instead of a construct-time one? It doesn't matter so much right now since it's not working anyway (unresolved fallout from the switch to GGUF).

apage43 (Member)
The reason it's construct-time is so that we do the fallback to CPU transparently: callers of construct passing "auto" just get the CPU implementation if the memory requirement is too high for Metal.

If it's changed to fail at load time, callers will have to handle that fallback themselves, which is likely fine, but it would need to be done in all of the bindings.
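
To make that concrete, here is a rough sketch of what load-time fallback could look like on the binding side (the helper name and exception type are hypothetical, not the actual binding API):

def load_with_fallback(model_path: str, n_ctx: int, device: str = "gpu"):
    # Hypothetical sketch: try to load on the requested device first; if the
    # load-time memory check rejects it, retry transparently on the CPU.
    try:
        return _load_model(model_path, n_ctx=n_ctx, device=device)
    except InsufficientMemoryError:  # hypothetical load-time failure
        return _load_model(model_path, n_ctx=n_ctx, device="cpu")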

cebtenzzre (Member Author)

The chat UI is already doing load-time fallback for Vulkan. And this is really the only way to do it because it's the user code that decides which GPU to use, which is of course initialized after a backend/implementation is available. We should make sure the bindings are capable of this too.

I think it would make sense to only ever dlopen one build of llamamodel-mainline on Apple silicon, as there's nothing we are currently doing that the Metal build isn't capable of.

@cebtenzzre cebtenzzre marked this pull request as ready for review December 13, 2023 21:21
@cebtenzzre cebtenzzre changed the title WIP: configurable n_ctx Implement configurable context length Dec 13, 2023
@cebtenzzre cebtenzzre added the backend (gpt4all-backend issues), bindings (gpt4all-binding issues), and chat (gpt4all-chat issues) labels Dec 13, 2023
@cebtenzzre cebtenzzre linked an issue Dec 14, 2023 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

Stop hard coding the context size and use the correct size per model