
Add configs to run int4 inference #37

Open · wants to merge 5 commits into base: main

Conversation

RezaYazdaniAminabadi

Add some minor config changes to support int4 inference through DeepSpeed-Inference.

The Int4 support will be added to DeepSpeed through this PR.

cc: @stas00

Contributor

@stas00 stas00 left a comment


Amazing work with adding int4 support, Reza!

@@ -191,6 +191,7 @@ def write_checkponts_json():
mp_size=world_size,
base_dir=repo_root,
dtype=getattr(torch, infer_dtype),
+ quantization_bits=8 if args.dtype == 'int8' else 4,
Contributor


what happens with --dtype float16?

probably best to set this in kwargs only if quantization dtype is provided


The quantization_bits argument should not be used when running in half-precision. But I agree, we can set it in the kwargs only for the quantized inference mode.
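A minimal sketch of that conditional-kwargs idea (variable names follow the script above; the exact kwarg name is whatever DeepSpeed ends up accepting, so treat this as illustrative rather than the final API):

```python
# Sketch: only pass quantization settings when a quantized dtype was requested.
# `args`, `world_size`, `repo_root` and `infer_dtype` come from the surrounding script;
# `quantization_bits` mirrors the kwarg added in this PR and may still change.
import torch
import deepspeed

kwargs = dict(
    mp_size=world_size,
    base_dir=repo_root,
    dtype=getattr(torch, infer_dtype),
)
if args.dtype in ("int8", "int4"):
    kwargs["quantization_bits"] = 8 if args.dtype == "int8" else 4

model = deepspeed.init_inference(model, **kwargs)
```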

Contributor

@stas00 stas00 Nov 18, 2022


these demos are already used by many users, so let's make them nice and clean configuration-wise, so it's clear to the reader which bits should be enabled when.

@@ -227,7 +228,7 @@ def write_checkponts_json():
# dynamically extend to support larger bs by repetition
input_sentences *= math.ceil(args.batch_size / len(input_sentences))

- generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=False)
+ generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=True)
Contributor


this is already a very different type of change.

If int4 requires do_sample=True, then again, let's change it only if it's --dtype int4
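A one-line sketch of that conditional change, assuming the existing args.dtype flag (illustrative only):

```python
# Sketch: enable sampling only for the int4 path, keep greedy decoding otherwise.
generate_kwargs = dict(max_new_tokens=num_tokens, do_sample=(args.dtype == "int4"))
```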


Sure, I will double-check do_sample=False again to see if the generated text makes sense. If not, I'll set it to True for int4.


I just checked with do_sample=False and I see the text is produced in the same way as for FP16 and INT8:

in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework for deep learning. It is a Python library, and it is also a framework. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and it is a library. It is a framework, and

So I am going to leave it turned off (do_sample=False) for now.

@stas00
Contributor

stas00 commented Nov 18, 2022

Also, we should probably assert if int4 is attempted to be used w/o deepspeed>=xyz once the DS PR is merged... could tentatively set it to the next deepspeed version? Perhaps with an XXX marker so it can be enabled later, keeping the script usable against ds@master.

I can take care of that.
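A rough sketch of such a guard (the minimum version string below is a placeholder, since the matching DeepSpeed release doesn't exist yet):

```python
# Sketch: refuse to run int4 against a DeepSpeed build that doesn't support it.
from packaging import version
import deepspeed

MIN_DS_VERSION = "0.0.0"  # XXX: placeholder -- bump once the DeepSpeed int4 PR is released

if args.dtype == "int4":
    assert version.parse(deepspeed.__version__) >= version.parse(MIN_DS_VERSION), (
        f"--dtype int4 requires deepspeed>={MIN_DS_VERSION}, "
        f"found {deepspeed.__version__}"
    )
```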

@RezaYazdaniAminabadi
Author

> Also, we should probably assert if int4 is attempted to be used w/o deepspeed>=xyz once the DS PR is merged... could tentatively set it to the next deepspeed version? Perhaps with an XXX marker so it can be enabled later, keeping the script usable against ds@master.
>
> I can take care of that.

Sounds good to me. Thanks @stas00

@@ -100,7 +100,7 @@ def get_checkpoint_files(model_name_or_path):


model_name = args.name
- infer_dtype = args.dtype
+ infer_dtype = args.dtype if args.dtype != 'int4' else 'int8'
Contributor

@stas00 stas00 Nov 18, 2022


would it make for a more user-friendly API to

  1. keep the dtype intact
  2. drop quantization_bits
  3. let deepspeed.init_inference derive the number of bits from dtype?

Not only is the currently suggested override confusing, I also fail to see what purpose is served by carrying the same information twice, in both dtype and quantization_bits.

Contributor


oh, wait, torch.int4 still doesn't exist, does it?

let's find the feature request.

Contributor


still not implemented pytorch/pytorch#74627

so that's why you had to do the odd workarounds, right?

Collaborator


I guess we can drop it once it's implemented, @stas00?
For now, this might be the best way to do it.

Contributor


it's pointless to wait, since they won't have int3 and int12



> would it make for a more user-friendly API to
>
>   1. keep the dtype intact
>   2. drop quantization_bits
>   3. let deepspeed.init_inference derive the number of bits from dtype?
>
> Not only is the currently suggested override confusing, I also fail to see what purpose is served by carrying the same information twice, in both dtype and quantization_bits.

@stas00 and @RezaYazdaniAminabadi - just clarifying that we have introduced a new DeepSpeedInferenceConfig that can be passed to init_inference. We are keeping it backwards compatible, but if we are okay with making changes to this file, I would advocate for writing a config dictionary for DeepSpeed and passing that to init_inference instead of the various kwargs. Please see here for an example: https://gist.github.com/awan-10/6e3d5c756be3a876522e860c6bbf702d#file-bloom-ds-inference-py-L173

Also, see the docs for the new config: https://deepspeed.readthedocs.io/en/latest/inference-init.html
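A rough sketch of the config-dict style being suggested here; the key names below are assumptions based on the linked gist and docs rather than this PR, so double-check them against DeepSpeedInferenceConfig:

```python
# Sketch: collect the inference settings in one dict and hand it to init_inference,
# instead of passing individual kwargs. Key names are illustrative -- see the docs above.
import torch
import deepspeed

ds_config = {
    "dtype": torch.float16,                       # original model dtype
    "tensor_parallel": {"tp_size": world_size},   # tensor-parallel degree (from the script)
    "replace_with_kernel_inject": True,           # use DeepSpeed's fused inference kernels
}

model = deepspeed.init_inference(model, config=ds_config)
```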

Contributor


That definitely works.

@awan-10, may I suggest you make the inference config accept dict_or_path just like zero does? It might be easier for some users to write out a separate file.



@stas00 - thanks for the suggestion. Created an issue so we can track it: microsoft/DeepSpeed#2532. Mike and I will work on it.

Contributor


Thank you very much, @awan-10

@stas00
Contributor

stas00 commented Nov 19, 2022

OK, I think I understand the limitations of pytorch, and it'll only get worse when you try int3, etc., even if int4 is supported.
https://github.com/huggingface/transformers-bloom-inference/pull/37/files#r1026981222

I propose we break the currently proposed API and draw up a better one.

I propose to have only 2 user-configurable args related to how deepspeed-inference operates:

  1. dtype is the dtype of the original model - so only fp32, fp16 or bf16 - never intX (i.e. we drop int8)
  2. quantization_bits: [None, 12, 8, 4, 3]

Now the API is simple, unambiguous and future-proof (as in int12 or int3, or Mixture-of-Precisions support)

For back-compat deepspeed.init_inference can simply set quantization_bits=8 if dtype==torch.int8 is passed. So the API will be unbroken.
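A sketch of what the call site could look like under this proposal; argument names follow the two items above, args.quantization_bits is a hypothetical CLI flag, and the back-compat shim is pseudocode rather than DeepSpeed's actual behaviour:

```python
# Sketch of the proposed two-argument interface (illustrative only).
import torch
import deepspeed

kwargs = dict(
    mp_size=world_size,
    base_dir=repo_root,
    dtype=torch.float16,                 # always the real model dtype: fp32/fp16/bf16
)
if args.quantization_bits is not None:   # e.g. 12, 8, 4 or 3; None means no quantization
    kwargs["quantization_bits"] = args.quantization_bits

model = deepspeed.init_inference(model, **kwargs)

# Back-compat idea (inside deepspeed.init_inference, as pseudocode):
#   if dtype == torch.int8 and quantization_bits is None:
#       quantization_bits = 8
```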

What do you think, Reza?

@mayank31398
Collaborator

Huh?
Int4?
I will surely test this branch and let you know.
Thanks a lot for this :)

@RezaYazdaniAminabadi
Author

> Now the API is simple, unambiguous and future-proof (as in int12 or int3, or Mixture-of-Precisions support)

Hi @stas00,
I agree with what you said, and we are going down the same route, as you can see from my last commit here.
Thanks for the good suggestion :)
Best,
Reza

@RezaYazdaniAminabadi
Author


In this case, we can simply pass the bits to the DeepSpeed-inference config: kwargs['quant']['weight']['num_bits'] = quantization_bits

@stas00
Contributor

stas00 commented Nov 19, 2022

may I suggest that the just-added kwargs['quant']['weight']['num_bits'] isn't the most user-friendly API as far as kwargs go?

why not have a flat structure of simple key=value pairs? Once you've got the info on your side you can re-arrange it to any nesting level you want.
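A minimal sketch of that flat-to-nested idea; build_inference_config is a hypothetical helper, and the nested quant.weight.num_bits path is the one mentioned just above:

```python
# Sketch: the user passes flat key=value pairs; the library rearranges them
# into whatever nested config it needs internally.
def build_inference_config(dtype, quantization_bits=None):
    config = {"dtype": dtype}
    if quantization_bits is not None:
        # nested layout mentioned in this thread: quant -> weight -> num_bits
        config["quant"] = {"weight": {"num_bits": quantization_bits}}
    return config

# User-facing call stays flat and simple:
ds_config = build_inference_config(dtype="fp16", quantization_bits=4)
```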

@RezaYazdaniAminabadi
Author

> may I suggest that the just-added kwargs['quant']['weight']['num_bits'] isn't the most user-friendly API as far as kwargs go?
>
> why not have a flat structure of simple key=value pairs? Once you've got the info on your side you can re-arrange it to any nesting level you want.

I agree, let me work on that and fix it.

@awan-10

awan-10 commented Nov 19, 2022

> may I suggest that the just-added kwargs['quant']['weight']['num_bits'] isn't the most user-friendly API as far as kwargs go?
> why not have a flat structure of simple key=value pairs? Once you've got the info on your side you can re-arrange it to any nesting level you want.
>
> I agree, let me work on that and fix it.

@RezaYazdaniAminabadi -- please see my comment above. #37 (comment)

@RezaYazdaniAminabadi
Author

> may I suggest that the just-added kwargs['quant']['weight']['num_bits'] isn't the most user-friendly API as far as kwargs go?
> why not have a flat structure of simple key=value pairs? Once you've got the info on your side you can re-arrange it to any nesting level you want.
>
> I agree, let me work on that and fix it.
>
> @RezaYazdaniAminabadi -- please see my comment above. #37 (comment)

Thanks @awan-10. Please go ahead and push your changes.
