
Oh hell yes #2

Open
flatsiedatsie opened this issue Apr 13, 2024 · 9 comments

Comments

@flatsiedatsie

Thank you for this! I've been using llama-cpp-wasm and the 2GB size restriction was a real stumbling block.

@flatsiedatsie
Author

flatsiedatsie commented Apr 28, 2024

I'm finally attempting to replace llama_cpp_wasm with Wllama.

I was wondering if you have suggestions on how to replace some callbacks:

  • Is there a downloading and/or loading callback? I'd like to keep the user informed about download progress.
  • Is there a chunk callback, where the model returns the latest generated token? The readme doesn't mention any such ability, and a search for 'chunk' in the codebase only gives results referring to breaking the LLMs into chunks.
  • How do I best abort inference?
  • Should I unload a model before switching to a different one?

Unrelated: have you by chance tried running Phi 3 with Wllama? I know the 128K context is not officially supported yet, but there does seem to be some success with getting a 64K context. I'm personally really looking forward to when Phi 3 128K is supported, as I suspect it would be the ultimate small "do it all" model for browser-based use.

@flatsiedatsie
Author

Here's a sneak preview of what I'm integrating it into. Can't wait to release it to the world :-) I'm happy to give you a link if you're curious.

[screenshot: sneak_preview3]

@flatsiedatsie
Author

flatsiedatsie commented Apr 28, 2024

More questions as I'm going along:

  • The documentation doesn't mention what the defaults are for the various configuration options. Perhaps those could be added? It would be nice to know what the default context size and temperature are, for example.
  • Options like cache_type_k seem important. What happens if I don't set them, or set them incorrectly? How should I set them? I'm loading a Q4_K_M model; should I set it to q4_0? Or does this mean that only q4_0 quantization is supported?

@flatsiedatsie
Author

flatsiedatsie commented Apr 28, 2024

Oh darn, the advanced example answers a lot of my questions, apologies: https://github.com/ngxson/wllama/blob/master/examples/advanced/index.html

// Is this a bug in the example? Setting the same property twice:

@ngxson
Owner

ngxson commented Apr 28, 2024

Cool project! Thanks for paying attention to wllama.

Is there a downloading and/or loading callback? I'd like to keep the user informed about download progress.

I planned to add one (and cache control options), but there are still some issues. If you want, you can implement your own download function (with a callback), then pass the final buffer to loadModel() (instead of using loadModelFromUrl):

async loadModel(ggufBuffer: Uint8Array | Uint8Array[], config: LoadModelConfig): Promise<void> {
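For example, a minimal sketch of such a custom download could look like the following. The downloadWithProgress helper and its onProgress callback are illustrative names, not part of wllama; only the loadModel() signature above is taken from the source:

async function downloadWithProgress(
    url: string,
    onProgress: (loaded: number, total: number) => void
): Promise<Uint8Array> {
    const response = await fetch(url);
    if (!response.ok || !response.body) throw new Error(`Failed to fetch ${url}`);
    // total is 0 if the server doesn't send a Content-Length header
    const total = Number(response.headers.get('Content-Length') ?? 0);
    const reader = response.body.getReader();
    const chunks: Uint8Array[] = [];
    let loaded = 0;
    while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        chunks.push(value);
        loaded += value.length;
        onProgress(loaded, total);
    }
    // Concatenate the chunks into the single buffer that loadModel() accepts
    const buffer = new Uint8Array(loaded);
    let offset = 0;
    for (const chunk of chunks) {
        buffer.set(chunk, offset);
        offset += chunk.length;
    }
    return buffer;
}

// Usage sketch:
// const buf = await downloadWithProgress(modelUrl, (loaded, total) => updateProgressBar(loaded, total));
// await wllama.loadModel(buf, { /* LoadModelConfig options */ });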

Is there a chunk callback, where the model returns the latest generated token? The readme doesn't mention any such ability, and a search for 'chunk' in the codebase only gives results referring to breaking the LLM's into chunks.

If you want more control over the response, you can implement your own createCompletion. All the lower-level APIs, like tokenize, decode, ..., are exposed:

async createCompletion(prompt: string, options: {

How do I best abort inference?

By implementing your own createCompletion, you can abort the inference by interrupting the loop to generate new tokens.

Should I unload a model before switching to a different one?

Yes, since models are loaded into RAM, it's better to unload the model before loading a new one to prevent running out of RAM.
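For illustration only, switching models could look roughly like the sketch below. exit() is mentioned later in this thread; whether a fresh instance is needed afterwards, and the CONFIG_PATHS constructor argument, are assumptions:

// Rough sketch of switching models; nextModelUrl and CONFIG_PATHS are placeholders.
await wllama.exit();                              // free the RAM held by the current model
wllama = new Wllama(CONFIG_PATHS);                // assumed: recreate the instance after exit()
await wllama.loadModelFromUrl(nextModelUrl, {});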


The documentation doesn't mention what the defaults are for the various configuration options?

No, because many default options are defined inside llama.cpp (in the C++ code, not at the JavaScript level). I'm planning to copy them into this project in the future. This requires parsing the C++ code and either converting it into ts/js or simply generating markdown documentation. Either way will be quite complicated.

For now, you can see the default values in the llama.h file: https://github.com/ggerganov/llama.cpp/blob/master/llama.h

Options like cache_type_k seem important. What happens if I don't set them, or set them incorrectly? How should I set them? I'm loading a Q4_K_M model, should I set it to q4_0? Or does this mean that only q4_0 quantization is supported?

cache_type_k is controlled by llama.cpp, not at the JavaScript level. For now, llama.cpp uses f16 by default, but it also supports q4_0. Please note that support for a quantized cache is still quite experimental in llama.cpp and may degrade response quality.
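As a rough, hedged sketch of the takeaway (the exact option names should be checked against the wllama typings; only cache_type_k itself comes from this thread):

// Sketch only: leaving cache_type_k unset keeps llama.cpp's default f16 KV cache.
// The KV-cache type is independent of the model file's quantization, so a
// Q4_K_M model does not require q4_0 here.
await wllama.loadModelFromUrl(modelUrl, {
    n_ctx: 4096,             // context size (assumed option name)
    // cache_type_k: 'q4_0', // experimental quantized KV cache; may degrade quality
});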

// Is this a bug in the example? Setting the same property twice:

Yes, it's a typo. Because the index.html file is not TypeScript, I don't get any suggestions from the IDE. One should be top_p and the other should be top_k.
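In other words, the corrected sampling options would have one of each key (values here are illustrative, matching the snippet later in this thread):

const completion = await wllama.createCompletion(prompt, {
    nPredict: 100,
    sampling: {
        temp: 0.7,
        top_k: 40,   // one top_k...
        top_p: 0.9,  // ...and one top_p, instead of the same key twice
    },
});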

@flatsiedatsie
Author

Whoop! I've got an initial implementation working!

Now to get to the details.

I planned to add one (and cache control options) but there're still some issues. If you want, you can implement your own download function (with callback), then pass the final buffer to loadModel() (instead of using loadModelFromUrl)

I went ahead and created a very minimal implementation of a download progress callback in a PR. It should hold me over until your preferred implementation is done, at which point I'll update.

If you want to have more control over the response, you can implement your own createCompletion

By looking at the advanced example, I found the onNewToken: (token, piece, currentText) => { ... } part, which was exactly what I needed.

I'm going to see if I can hack in an abort button next :-)

@flatsiedatsie
Author

It seems I can simply call exit() on the Wllama object when the user wants to interrupt inference. The model will then need to be reloaded, but that's ok.

@flatsiedatsie
Author

I created an extremely minimalist way to interrupt the inference here:
flatsiedatsie@a9fe166

@flatsiedatsie
Author

flatsiedatsie commented May 14, 2024

Wllama now has a built-in interruption ability.

window.interrupt_wllama = false;
let response_so_far = "";

const outputText = await window.llama_cpp_app.createCompletion(total_prompt, {
    nPredict: 500,
    sampling: {
        temp: 0.7,
        top_k: 40,
        top_p: 0.9,
    },
    onNewToken: (token, piece, currentText, { abortSignal }) => {
        if (window.interrupt_wllama) {
            console.log("sending interrupt signal to Wllama");
            abortSignal();
        }
        else {
            //console.log("wllama: onNewToken: token, piece, currentText:", token, piece, currentText);
            // Only pass along the text generated since the previous callback
            let new_chunk = currentText.substr(response_so_far.length);
            window.handle_chunk(my_task, response_so_far, new_chunk);
            response_so_far = currentText;
        }
    },
});
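An abort button then only needs to flip the flag; the element id below is just an example:

document.getElementById('stop_button').addEventListener('click', () => {
    window.interrupt_wllama = true;   // picked up by onNewToken on the next token
});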
