Add dynamic quantization config #661
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
calibration_dataset = kwargs.get("calibration_dataset", None)
Removed the calibration_dataset argument @nikita-savelyevv
@echarlaix, it looks like we will have two ways to enable weight-only quantization + dynamic quantization:
Am I right here? Two more things to note: this flow is still under development for GPU, and it is also used along with 8-bit KV-cache quantization, which performs well and helps reduce the memory footprint with almost no accuracy degradation.
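For illustration, here is a minimal sketch of what the two entry points could look like. The OVDynamicQuantizationConfig class and the DYNAMIC_QUANTIZATION_GROUP_SIZE property come from this PR's diff, but the model id and the exact constructor arguments below are assumptions, not taken from this PR:

# Sketch only: two possible ways to combine weight-only and dynamic quantization.
from optimum.intel import OVModelForCausalLM, OVDynamicQuantizationConfig

model_id = "example-org/example-llm"  # hypothetical model id

# 1. Through the quantization config passed to from_pretrained()
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVDynamicQuantizationConfig(bits=8, activations_group_size=32),
)

# 2. Through the raw OpenVINO runtime property in ov_config
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    load_in_8bit=True,  # weight-only part; assumed entry point here
    ov_config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"},
)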
Some minor comments
model = model_cls.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4, sym=True, group_size=-1, ratio=0.8),
    calibration_dataset=quantization_dataset,
)
With the removal of the calibration_dataset argument from from_pretrained(), workflows such as the one quoted above will no longer be available. I am personally OK with this, but as I understand it, this will limit the capabilities of hybrid quantization in the future. Currently, a string-only dataset is enough for it, but from the discussion we had with @l-bat, this is not the only use case.
Possibly, we should rework the hybrid quantization workflow so that it is invoked only through OVQuantizer, which accepts a custom calibration dataset.
@AlexKoff88, what do you think?
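As a rough sketch of that direction, assuming the existing OVQuantizer.quantize() interface with its calibration_dataset and ov_config arguments (the model id, dataset choice, and output directory below are hypothetical, and whether hybrid quantization would go exactly through this path is the open question):

from transformers import AutoTokenizer
from optimum.intel import OVConfig, OVModelForCausalLM, OVQuantizer, OVWeightQuantizationConfig

model_id = "example-org/example-llm"  # hypothetical model id
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Keep weights unquantized until the quantizer runs
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_8bit=False)
quantizer = OVQuantizer.from_pretrained(model)

# A custom calibration dataset built by the user instead of a string dataset name
calibration_dataset = quantizer.get_calibration_dataset(
    "wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    num_samples=64,
    dataset_split="train",
    preprocess_function=lambda example: tokenizer(example["text"]),
)

quantizer.quantize(
    ov_config=OVConfig(
        quantization_config=OVWeightQuantizationConfig(bits=4, sym=True, group_size=-1, ratio=0.8)
    ),
    calibration_dataset=calibration_dataset,
    save_directory="quantized_model",  # hypothetical output directory
)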
Yes, but I would be in favor of using 1. in our examples / documentation so that we have the same API across all quantization strategies.
q_config = self._openvino_config.quantization_config if self._openvino_config else None
if isinstance(q_config, OVDynamicQuantizationConfig):
    self.ov_config["DYNAMIC_QUANTIZATION_GROUP_SIZE"] = str(q_config.activations_group_size)
@echarlaix, shall we turn on 8-bit KV-cache quantization as well? It is essentially per-token INT8 quantization, and it is safe in terms of accuracy degradation.
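If we do, one possible way to expose it is through the same ov_config path, assuming OpenVINO's KV_CACHE_PRECISION runtime property; this is only an illustration and is not part of this PR:

from optimum.intel import OVModelForCausalLM

# Illustrative only: combine dynamic activation quantization with an 8-bit
# KV cache via runtime properties. KV_CACHE_PRECISION is assumed to be the
# relevant OpenVINO property; it is not touched by this PR.
model = OVModelForCausalLM.from_pretrained(
    "example-org/example-llm",  # hypothetical model id
    export=True,
    ov_config={
        "DYNAMIC_QUANTIZATION_GROUP_SIZE": "32",
        "KV_CACHE_PRECISION": "u8",
    },
)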