Max/4bit #131

Merged
maxjeblick merged 19 commits into main from max/4bit on Jun 5, 2023

Conversation

maxjeblick (Contributor)

This PR adds 4-bit training/inference.
I have tested with EleutherAI/gpt-neox-20b, where 8-bit yields OOM. I have also compared 4-bit and 8-bit training of EleutherAI/gpt-j-6B and the corresponding chat responses.

There is an issue with merging LoRA weights back that can lead to OOM errors: #130. I'll have a look and see whether it can be fixed and added to this PR.

I'm also not sure about this line: https://github.com/h2oai/h2o-llmstudio/blob/main/llm_studio/src/utils/modeling_utils.py#L72
It's currently left as-is; I haven't found any issues with it so far.
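
For context, 4-bit loading goes through transformers' BitsAndBytesConfig; a minimal, self-contained sketch (model name and settings chosen for illustration, not the exact code in this PR):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# illustrative settings; the PR wires similar options through cfg/kwargs
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",  # one of the backbones tested above
    quantization_config=quantization_config,
    device_map="auto",
)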

maxjeblick requested a review from psinger on June 1, 2023 12:00
psinger (Collaborator) commented Jun 1, 2023

> I'm also not sure about this line: https://github.com/h2oai/h2o-llmstudio/blob/main/llm_studio/src/utils/modeling_utils.py#L72
> It's currently left as-is; I haven't found any issues with it so far.

IIRC this was needed for training and loading weights in different precisions.

pascal-pfeiffer (Collaborator) commented Jun 2, 2023

Thank you, Max! I will need to test a bit more, but it's looking great so far.

Let's also add a few words about 4-bit training to the README.md?

psinger (Collaborator) commented Jun 2, 2023

Getting errors when training in 4bit and then doing inference in float16, while loading the weights. The score is correct, though.

Might be related to my comment above.

psinger (Collaborator) commented Jun 2, 2023

One more note: inference in 4bit is extremely slow for me.

Would it maybe make sense to force 8bit in the Chat window? Or, even better, make it an option? This might be a separate issue.

psinger (Collaborator) commented Jun 2, 2023

Found this:
https://github.com/huggingface/peft/blob/main/examples/fp4_finetuning/finetune_fp4_opt_bnb_peft.py

Here they are also setting llm_int8_threshold - does it have an impact in 4bit?

I think we might also be fine setting the compute dtype to float16 instead of bfloat16.

maxjeblick (Contributor, Author) commented Jun 2, 2023

> Here they are also setting llm_int8_threshold - does it have an impact in 4bit?

Should not affect 4bit training:
https://github.com/huggingface/transformers/blob/main/src/transformers/utils/bitsandbytes.py#L133

psinger (Collaborator) commented Jun 2, 2023

Not sure whether this was luck or actually has an impact, but I got better inference speed with these settings:

elif cfg.architecture.backbone_dtype == "int4":
    kwargs["device_map"] = {"": cfg.environment._device}
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        llm_int8_has_fp16_weight=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        llm_int8_threshold=0.0,
    )
    # need to force pretrained
    cfg.architecture.pretrained = True
    kwargs["torch_dtype"] = torch.float16

psinger (Collaborator) commented Jun 5, 2023

Let's also add this to int8:

kwargs["torch_dtype"] = torch.float16

And do you think bnb_4bit_use_double_quant=True is useful?
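
A minimal sketch of how the int8 branch could look with that addition, mirroring the int4 snippet above (the exact code in modeling_utils.py may differ; cfg/kwargs names are carried over, the rest is illustrative):

elif cfg.architecture.backbone_dtype == "int8":
    kwargs["device_map"] = {"": cfg.environment._device}
    quantization_config = BitsAndBytesConfig(
        load_in_8bit=True,
    )
    # quantized weights require a pretrained checkpoint
    cfg.architecture.pretrained = True
    # proposed addition: keep the non-quantized modules in float16
    kwargs["torch_dtype"] = torch.float16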

maxjeblick (Contributor, Author)

> And do you think bnb_4bit_use_double_quant=True is useful?

From here:

> Other options include bnb_4bit_use_double_quant which uses a second quantization after the first one to save an additional 0.4 bits per parameter

I haven't tested this option yet; in theory, it should allow you to finetune even larger models. I can compare inference speed and enable it if it turns out to be similar.

maxjeblick (Contributor, Author)

For h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v2 (evaluation only, i.e. training with 0 epochs), bnb_4bit_use_double_quant does not seem to have any noticeable impact, neither time-wise nor max-GPU-memory-wise.
I am not sure how, if at all, it affects training performance. Let's maybe keep it disabled for now?
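
For reference, a quick back-of-the-envelope check (assuming the 0.4 bits per parameter quoted above) shows why the memory effect is hard to notice for a 7B backbone:

# memory saved by bnb_4bit_use_double_quant at ~0.4 bits/parameter
params = 7e9                    # e.g. a 7B backbone such as falcon-7b
saved_bytes = params * 0.4 / 8  # bits -> bytes
print(f"{saved_bytes / 1024**3:.2f} GiB saved")  # ~0.33 GiB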

maxjeblick (Contributor, Author)

> Getting errors when training in 4bit and then doing inference in float16.

Is this issue still present? I started a new experiment in fp16 using the previous 4bit weights; that works.

psinger (Collaborator) commented Jun 5, 2023

> Getting errors when training in 4bit and then doing inference in float16.
>
> Is this issue still present? I started a new experiment in fp16 using the previous 4bit weights; that works.

I think after you made the changes above, it works now.

psinger (Collaborator) commented Jun 5, 2023

@maxjeblick did you try pushing to HF and reloading?

maxjeblick (Contributor, Author)

> @maxjeblick did you try pushing to HF and reloading?

I tried downloading the model from the UI and subsequently loading it from the unzipped folder, following the model card example. That worked.
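
For the record, the loading pattern I mean looks roughly like this (the path is a placeholder; the exact snippet is in the generated model card):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

local_path = "path/to/unzipped_experiment"  # placeholder for the downloaded folder

tokenizer = AutoTokenizer.from_pretrained(local_path)
model = AutoModelForCausalLM.from_pretrained(
    local_path,
    torch_dtype=torch.float16,  # illustrative dtype
    device_map="auto",
)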

psinger (Collaborator) commented Jun 5, 2023

Cool, if possible let's also try the HF push.

psinger (Collaborator) left a review comment

Thanks!

maxjeblick (Contributor, Author)

Checked the HF Hub; it also works.

maxjeblick merged commit 6b140b1 into main on Jun 5, 2023
5 checks passed
maxjeblick deleted the max/4bit branch on June 5, 2023 16:04