
Add dynamic quantization config #661

Merged · 27 commits · Apr 22, 2024
Conversation

echarlaix (Collaborator) commented Apr 15, 2024

Add dynamic quantization configuration

from optimum.intel import OVDynamicQuantizationConfig, OVModelForCausalLM

model_id = "gpt2"  # placeholder: any causal LM hub id or local path
quantization_config = OVDynamicQuantizationConfig(bits=8, activations_group_size=32)
int8_model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
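For context, a minimal usage sketch of the resulting model; the tokenizer loading and prompt are illustrative assumptions, not part of this PR:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, my name is", return_tensors="pt")
# OVModelForCausalLM exposes the usual transformers generate() API
outputs = int8_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))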

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines -599 to +598
calibration_dataset = kwargs.get("calibration_dataset", None)

echarlaix (Collaborator, Author):

Removed the calibration_dataset argument, @nikita-savelyevv

echarlaix changed the title from "OV quantizer" to "Add dynamic quantization config" on Apr 16, 2024
echarlaix requested a review from AlexKoff88 on April 16, 2024 12:51
AlexKoff88 (Collaborator) commented:

@echarlaix, it looks like we will have two ways to enable weight-only quantization + dynamic quantization:

  1. As done in this PR.
  2. Quantize the weights and pass an OVConfig with the runtime option that enables DQ.

Am I right here?

Two more things to note: this flow is still under development for GPU, and it is also used along with 8-bit KV-cache quantization, which performs pretty well and helps reduce the memory footprint with almost no accuracy degradation.
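A minimal sketch of what option 2 might look like, assuming the DYNAMIC_QUANTIZATION_GROUP_SIZE property name used in this PR's diff and that from_pretrained() forwards the ov_config dict to the runtime (the model_id is a placeholder):

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

# Option 2: weight-only quantization via the config, with dynamic
# quantization enabled through a runtime option rather than a dedicated class.
model = OVModelForCausalLM.from_pretrained(
    model_id,  # placeholder checkpoint id
    quantization_config=OVWeightQuantizationConfig(bits=8),
    ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"},
)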

nikita-savelyevv (Collaborator) left a comment:

Some minor comments

optimum/intel/openvino/configuration.py (outdated, resolved)
optimum/intel/openvino/configuration.py (resolved)
optimum/intel/openvino/modeling_decoder.py (resolved)
Comment on lines -496 to -501
model = model_cls.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=True, group_size=-1, ratio=0.8),
    calibration_dataset=quantization_dataset,
)
nikita-savelyevv (Collaborator):

With the removal of the calibration_dataset argument from from_pretrained(), a workflow such as this one will no longer be available. I personally am OK with this, but as I understand it, it will limit the capabilities of hybrid quantization in the future. Currently a string-only dataset is enough for it, but from the discussion we had with @l-bat, this is not the only use case.

Possibly, we should rework the hybrid quantization workflow to be called only through OVQuantizer, which accepts a custom calibration dataset.

@AlexKoff88, what do you think?
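A hedged sketch of what that reworked workflow might look like, assuming OVQuantizer.quantize() keeps its calibration_dataset and ov_config parameters (names taken from the existing optimum-intel API, not from this PR; model and quantization_dataset stand in for the objects from the removed snippet):

from optimum.intel import OVConfig, OVQuantizer, OVWeightQuantizationConfig

quantizer = OVQuantizer.from_pretrained(model)  # an already-loaded OV model
quantizer.quantize(
    ov_config=OVConfig(quantization_config=OVWeightQuantizationConfig(bits=4, sym=True, group_size=-1, ratio=0.8)),
    calibration_dataset=quantization_dataset,  # custom dataset, as in the removed snippet
    save_directory="quantized_model",  # hypothetical output directory
)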

echarlaix (Collaborator, Author) commented Apr 19, 2024:

> @echarlaix, it looks like we will have two ways to enable weight-only quantization + dynamic quantization:
>
> 1. As done in this PR.
> 2. Quantize the weights and pass an OVConfig with the runtime option that enables DQ.
>
> Am I right here?
>
> Two more things to note: this flow is still under development for GPU, and it is also used along with 8-bit KV-cache quantization, which performs pretty well and helps reduce the memory footprint with almost no accuracy degradation.

Yes, but I would be in favor of using 1. in our examples / documentation, so that we have the same API across all quantization strategies.


q_config = self._openvino_config.quantization_config if self._openvino_config else None
if isinstance(q_config, OVDynamicQuantizationConfig):
    self.ov_config["DYNAMIC_QUANTIZATION_GROUP_SIZE"] = str(q_config.activations_group_size)
AlexKoff88 (Collaborator) commented Apr 19, 2024:

@echarlaix, shall we turn on 8-bit KV-cache quantization as well? It is essentially per-token INT8 quantization, and it should be safe in terms of accuracy degradation.
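If that were adopted, a hypothetical extension of the snippet above could request the KV-cache precision through a runtime property as well; "KV_CACHE_PRECISION" is an OpenVINO property name assumed here for illustration and is not part of this PR:

q_config = self._openvino_config.quantization_config if self._openvino_config else None
if isinstance(q_config, OVDynamicQuantizationConfig):
    self.ov_config["DYNAMIC_QUANTIZATION_GROUP_SIZE"] = str(q_config.activations_group_size)
    # assumption: also request a per-token INT8 KV cache via a runtime property
    self.ov_config["KV_CACHE_PRECISION"] = "u8"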

echarlaix marked this pull request as ready for review April 22, 2024 12:29
echarlaix merged commit a06522c into main Apr 22, 2024
12 checks passed
echarlaix deleted the ov-quantizer branch April 22, 2024 12:31