Max/4bit #131
Conversation
IIRC this was needed for training and loading weights in different precisions.
llm_studio/python_configs/text_causal_language_modeling_config.py
Thank you Max! Will need to test a bit more, but looking great so far. Let's also add some words in the README.md about 4bit training?
Getting errors when training in 4bit, and then doing inference on float16. Might be related to my comment above.
One more note: Would it maybe make sense to force 8bit in the Chat window?
Found this: Here they are also setting I think we might also be fine setting it to float16 instead of bfloat16.
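(A hedged illustration of the dtype point being discussed, assuming the elided setting refers to the compute dtype used for the quantized model; the identifiers below come from the standard `transformers`/`bitsandbytes` APIs, not from this PR's diff, and the model name is just one of those tested in the PR:)

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: the compute dtype used for 4-bit matmuls can be set to
# float16 instead of bfloat16 via BitsAndBytesConfig.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # vs. torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    quantization_config=bnb_config,
    device_map="auto",
)
```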
Should not affect 4bit training:
Not sure whether this was luck or actually has an impact, but I got better inference speed with these settings:
Let's add this also to int8:
And do you think
From here:
I haven't tested this option yet; in theory, it should allow you to finetune even larger models. I can compare inference speed and enable it if it turns out to be similar.
For
Is this issue still present? I started a new experiment in fp16, using previous 4bit weights; that works.
I think it works now, after the changes you made above.
@maxjeblick did you try pushing to HF and reloading?
I tried downloading the model from the UI and subsequently loading it from the unzipped folder, following the model card example. That worked.
Cool, if possible let's also try the HF push.
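(For context, a minimal sketch of the push-and-reload round trip being requested, using the standard Hugging Face `push_to_hub`/`from_pretrained` APIs; the repo id and local folder path below are hypothetical, not taken from this PR:)

```python
from transformers import AutoModelForCausalLM

# Hypothetical repo id and local export path.
repo_id = "my-org/my-finetuned-model"

# Load the model exported from the UI, then push it to the Hub.
model = AutoModelForCausalLM.from_pretrained("./unzipped_model_folder")
model.push_to_hub(repo_id)

# Reload from the Hub to verify the round trip.
reloaded = AutoModelForCausalLM.from_pretrained(repo_id)
```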
Thanks!
Checked HF hub, it also works.
This PR adds 4bit training/inference.
I have tested with
EleutherAI/gpt-neox-20b
where 8 bit yields OOM. I have also compared EleutherAI/gpt-j-6B
4bit/8bit training and its corresponding chat responses. There is an issue with merging LoRA back which can lead to OOM issues: #130 I'll have a look and see if it can be fixed and added to this PR.
I'm also not sure about this line: https://github.com/h2oai/h2o-llmstudio/blob/main/llm_studio/src/utils/modeling_utils.py#L72
It's currently left as is; I haven't found any issues with it so far.
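(As a rough illustration of the feature and the merging issue mentioned above — this is a hedged sketch using the public `transformers`/`peft` APIs, not the PR's actual implementation; the adapter path is hypothetical, and the model name is the one tested in the PR:)

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantized loading (the feature this PR adds); 8-bit would use
# load_in_8bit=True instead and needs roughly twice the weight memory,
# which is why EleutherAI/gpt-neox-20b OOMs in 8 bit but fits in 4 bit.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# Merging LoRA back (issue #130): merging needs dequantized base weights,
# so one workaround is to reload the base model on CPU in fp16 and merge
# there, avoiding GPU OOM. "adapter_dir" is a hypothetical adapter path.
base = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b", torch_dtype=torch.float16, device_map="cpu"
)
merged = PeftModel.from_pretrained(base, "adapter_dir").merge_and_unload()
```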