LLaVA support #1487
Conversation
BTW: don't merge it yet! If it should be merged as a built-in extension, then I want to clean up script.py (it's feature-complete, but could use some work), and if it shouldn't be merged as a built-in extension, I need to remove script.py |
I tried it, and it doesn't look like you can talk to the model without an image; you are obliged to give it one to get it going. That's a shame, because I've heard that training the Vicuna model with pictures made it smarter, and I wanted to try it out with regular chat |
will it work with ggml models? |
Probably because download-model.bat automatically named it |
Using
|
Added more virtual memory in Windows' Advanced System Settings, by also using my second hard drive for virtual memory.
|
I think it works except for this error. I can load the model and talk to it, but when I select an image, I get this |
I'm able to use this with 8 GB VRAM (GeForce 3060 Ti) with the following arguments: `python server.py --model llava-13b-4bit-128g --wbits 4 --group 128 --chat --model_type=llama --extensions llava --pre_layer 29`. Reduce "Max prompt size in tokens" to 500 or less, otherwise you'll get OOM errors after the first response. After further testing I found it's best to use a Max Prompt Size of 0, otherwise there's a chance it will respond to multiple images at once. Even with the Max Prompt Size setting at 0, the model still gets around 360 tokens of context each time; I guess this must be hard-coded. Edit:
|
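For convenience, here is the low-VRAM launch described above as a copy-pasteable sketch. The flags are exactly as quoted by the commenter; `--pre_layer 29` is what worked for an 8 GB card and may need tuning for other hardware:

```bash
# pre-layer offloading keeps the 4-bit 13B LLaVA within 8 GB of VRAM
python server.py --model llava-13b-4bit-128g --wbits 4 --group 128 --chat \
    --model_type=llama --extensions llava --pre_layer 29
```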
@BadisG - I added a commit ~30mins before your last message which fixed it, you could've pulled before that |
I can chat with the model without an image, but as soon as I enter an image and prompt it, it crashes: "Could not load library libcudnn_cnn_infer.so.8. Error: libcuda.so: cannot open shared object file: No such file or directory". WSL2 installation, Ubuntu 22.04, RTX 4090, and plenty of VRAM left unused. |
https://discuss.pytorch.org/t/libcudnn-cnn-infer-so-8-library-can-not-found/164661 |
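For anyone hitting the same libcudnn error under WSL2, here is a minimal sketch of the kind of fix discussed in that thread, assuming the usual WSL2 layout where the GPU driver's `libcuda.so` lives in `/usr/lib/wsl/lib` (the path is an assumption about a typical setup, not taken from this thread):

```bash
# make the WSL2 driver libraries visible to the dynamic loader (path is an assumption),
# then relaunch server.py from the same shell
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
```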
That makes sense, I figured the image must take some number of tokens. I did notice that even with context length 0, the model responds to my questions. For example "What is funny about this image?" it will start with "This image is funny because". I wonder how it's ingesting my prompt when I don't leave any space for it in the context. |
Thanks a lot, that fixed it! Should have googled this harder myself... |
Ok, at this point it's cleaned up enough to where I wanted it, so it could maybe get merged. Also, I added the possibility to run CLIP/the projector on the CPU (or at 32-bit in CUDA, which is now the new default).
CLIP doesn't look like it supports run_in_8bit, and I feel like the projector doesn't need it, so there is only 16/32-bit (and 32-bit only for the CPU). @jparmstr - you might be able to squeeze some more tokens with CPU CLIP. As for the prompt, you can add |
`(base) cybertimon@server:~/Repositorys/text-generation-webui$ python3 server.py --model llava-13b-4bit-128g --gpu-memory 12 --wbits 4 --model_type llama --groupsize 128 --listen-host 0.0.0.0 --listen --xformers --extension llava --chat --listen-port 21129`
The server starts, but I get "embedded 0 images in 0.99s". Maybe this is the problem from earlier. Also, it answers only: 88888888.... |
Also, when I change the settings to use the CPU, `{'add_all_images_to_prompt': False, 'clip_device': 'cpu', 'clip_bits': 32, 'projector_device': 'cpu', 'projector_bits': 32}` (the console prints `cpu torch.float32 cpu torch.float32`), I still only get 888888 as the answer. |
@CyberTimon remove settings.json, then restart webui, clear the history, and try with this image: https://github.com/haotian-liu/LLaVA/blob/main/llava/serve/examples/extreme_ironing.jpg, with "What is unusual about this image?" prompt, exactly as in my video. |
Very impressive @Wojtab, I'll try to review and merge it soon. Quick question: I remember reading on the LLaVA README that a custom version of `transformers` is required to run the model; is that not the case here? |
@oobabooga If you load the original LLaVA on standard transformers it works, but instead of loading the entire model, it just loads LLaMA part, so it can be used for text-based inference without any modifications. |
You're a hero! It works perfectly now. I had to delete the settings.json. |
Oh, I found what the issue was: when selecting max_new_tokens over 1600, it generates only garbage.
|
Regarding the tokenizer: there are 4 new tokens, so I don't think the generic one will work:
IMO we can merge it here now, give me like 30 minutes, I'll add a description of the extension. |
A comment: the
But the send_pictures extension does not use a callback. |
@oobabooga ok, I added the docs, also reworded it in |
I have removed this addition because I found it unnecessary (removed lines are marked with `-` in the diff below):

```diff
         yield chat_html_wrapper(shared.history['visible'], state['name1'], state['name2'], state['mode'])
     else:
         # Yield ' ...'
-        last_visible_user = shared.history['visible'][-1][0]
         yield chat_html_wrapper(shared.history['visible'][:-1] + [[shared.history['visible'][-1][0], shared.history['visible'][-1][1] + ' ...']], state['name1'], state['name2'], state['mode'])
         for history in chatbot_wrapper(shared.history['internal'][-1][0], state, _continue=True):
-            shared.history['visible'][-1] = [last_visible_user, history[-1][1]]
             yield chat_html_wrapper(shared.history['visible'], state['name1'], state['name2'], state['mode'])
```

Also made some minor changes and improvements. Thanks for submitting this PR, I would never have come up with the LLaVA adaptation on my own, and the reworked extensions framework is a huge improvement to this project. |
For reference, these are the commands to download and run the model:
VRAM usage peaked at |
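The command block itself did not survive extraction. Here is a plausible sketch based on the launch flags quoted elsewhere in this thread; the Hugging Face repo name `wojtab/llava-13b-v0-4bit-128g` is an assumption and is not taken from this page:

```bash
# download the 4-bit quantized LLaVA weights (repo name is an assumption)
python download-model.py wojtab/llava-13b-v0-4bit-128g
# launch with the llava extension; adjust --model to match the downloaded folder name
python server.py --model llava-13b-4bit-128g --wbits 4 --groupsize 128 \
    --model_type llama --chat --extensions llava
```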
@oobabooga thanks for the review and merge. Now, this addition was necessary: for some reason, continue replaces visible_text with the internal text on the message from the user, so now instead of it being |
@oobabooga actually, instead of reverting it, I'll open a separate PR where both of the representations are the same |
I'll wait for your PR then. I might have used |
I get some errors sometimes, depending on what I choose for parameters.
With characters this would be really interesting, if it could be wrangled into being less of an "AALM". It works if I add "Assistant" to the stopping strings! Not perfectly, of course, but it's a trip... it can describe the items as the character, or just output nonsense. Luck of the draw. |
Hi, can you share the settings.json file? I have the same hardware. |
The Dockerfile has not been updated for LLaVA; we still get this error
The solution is as https://discuss.pytorch.org/t/could-not-load-library-libcudnn-cnn-infer-so-8/175139/1 suggested, but the CUDA install will prompt for something like a country code, which cannot be handled by apt-get's "-y", so we need
The final Dockerfile changes look something like what is shown below.
The result: I am then able to run LLaVA within the local Docker container. |
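The actual Dockerfile changes and error output were lost in extraction. The following is a minimal sketch of the kind of change being described, assuming the cuDNN install suggested in the linked thread; the package name and the use of `DEBIAN_FRONTEND` are assumptions, and inside a Dockerfile these commands would sit in a `RUN` step:

```bash
# suppress the interactive prompt (country/keyboard selection) that "-y" alone does not handle
export DEBIAN_FRONTEND=noninteractive
# install cuDNN so libcudnn_cnn_infer.so.8 can be loaded (exact package name is an assumption)
apt-get update && apt-get install -y nvidia-cudnn
```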
Ok, multimodality is here. To support LLaVA I created an extension; while I could separate it into a different repo with only the extension, I needed text-generation-webui to support overriding the `input_ids`/`input_embeds`. While I was at it, I changed extension handling a bit (there should be no need to update anything in the existing extensions; it's mostly backend changes).

To try it:
`python3 server.py --model llava-13b-4bit-128g --wbits 4 --group 128 --chat --model_type=llama --extensions llava`
and add `"\n###"` to custom stopping strings.

Here's a video of it in action:
2023-04-23.04-56-35.mp4