Improved performance of decoders #354

Merged: 12 commits merged into main from ak/decoder_performance on Jun 21, 2023
Conversation

AlexKoff88 (Collaborator):

Improved performance of decoders: GPT-like, Bloom, etc., and Seq2seq models.
Observed a significant speedup on long sequences, e.g. +30% for Dolly-3B generating 500 tokens on a client CPU.
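
For reference, a minimal sketch (not part of the PR) of how such a speedup could be measured. The model id, prompt, and timing harness are illustrative assumptions; OVModelForCausalLM.from_pretrained(..., export=True) is the standard optimum-intel entry point:

```python
# Hypothetical benchmark sketch: time 500-token generation with an
# OpenVINO-exported decoder on CPU, mirroring the Dolly-3B numbers above.
import time
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_id = "databricks/dolly-v2-3b"  # assumed checkpoint for "Dolly-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("OpenVINO is", return_tensors="pt")
start = time.perf_counter()
# min_new_tokens forces the full 500-token run even if EOS appears early.
model.generate(**inputs, max_new_tokens=500, min_new_tokens=500)
print(f"Generated 500 tokens in {time.perf_counter() - start:.1f} s")
```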

HuggingFaceDocBuilderDev commented on Jun 16, 2023

The documentation is not available anymore as the PR was closed or merged.

```diff
-for i, data in enumerate(calibration_dataloader):
-    self.model.generate(**data, max_new_tokens=10)
+for _, data in enumerate(calibration_dataloader):
+    self.model.generate(**data, max_new_tokens=100)
```
Collaborator:

Is this modification added to reduce accuracy degradation resulting from quantization? If yes, what did you observe when varying this parameter?

AlexKoff88 (Author):

Still in progress; I will update a bit later.

```python
def __getattr__(self, attr):
    if attr in self.__dict__:
        return getattr(self, attr)
    return getattr(self.request, attr)
```

```diff
+self.model.request = InferRequestWrapper(self.model.request)
-for i, data in enumerate(calibration_dataloader):
-    self.model.generate(**data, max_new_tokens=10)
+for _, data in enumerate(calibration_dataloader):
+    self.model.generate(**data, max_new_tokens=100)
```
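
For context, a minimal sketch of how such a delegating wrapper could collect calibration inputs during generation. Only the __getattr__ body is taken from the diff above; the data_cache list and the infer() override are assumptions for illustration, not necessarily the PR's exact code:

```python
# Illustrative sketch: proxy an OpenVINO InferRequest and record every input
# fed to it, so the recorded data can later serve as a calibration dataset.
class InferRequestWrapper:
    def __init__(self, request):
        self.request = request  # the wrapped OpenVINO InferRequest
        self.data_cache = []    # inputs collected during generation (assumed)

    def infer(self, inputs):
        # Record the inputs, then delegate the actual inference call.
        self.data_cache.append(inputs)
        return self.request.infer(inputs)

    def __getattr__(self, attr):
        # Only invoked when normal attribute lookup fails, so anything not
        # defined on the wrapper falls through to the wrapped request.
        if attr in self.__dict__:
            return getattr(self, attr)
        return getattr(self.request, attr)
```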
Collaborator:

Not related to the PR, but what do you think about unifying how quantization is applied to causal language models depending on whether the user provides a torch.nn.Module or an OVBaseDecoderModel (the number of generation steps is currently not the same)? We could also instantiate an OVModel in the from_pretrained method when the given model is a PreTrainedModel (a sketch of this idea follows this exchange).

AlexKoff88 (Author):

It is hard to accomplish this with the current NNCF PTQ API implementation we have for PyTorch. I think we should deprecate PTQ for PyTorch at some point because it also introduces ambiguity for the user about what workflow to use for quantization.
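
For illustration, a hypothetical sketch of the unification suggested above; the helper name and the isinstance dispatch are assumptions, not existing optimum-intel code:

```python
# Hypothetical sketch: if the quantizer receives a transformers
# PreTrainedModel, convert it to an OVModel first so a single OpenVINO
# calibration path (same number of generation steps) serves both input types.
from transformers import PreTrainedModel
from optimum.intel import OVModelForCausalLM

def as_openvino_model(model):
    if isinstance(model, PreTrainedModel):
        # Assumed step: re-export the PyTorch checkpoint via the OpenVINO
        # exporter; config.name_or_path points back to the source checkpoint.
        return OVModelForCausalLM.from_pretrained(
            model.config.name_or_path, export=True
        )
    return model  # already an OVBaseDecoderModel
```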

AlexKoff88 (Author):

I think it is ready for merge.

helena-intel (Collaborator) left a comment:

🚀 🔥

echarlaix (Collaborator) left a comment:

LGTM, thanks a lot @AlexKoff88

echarlaix merged commit c56d3b4 into main on Jun 21, 2023
12 checks passed
echarlaix deleted the ak/decoder_performance branch on Jun 21, 2023 at 08:11
echarlaix added a commit that referenced this pull request Jun 30, 2023
* Improved performance of decoders

* Improved performance of Seq2seq models

* Style

* Adjusted quantization logic

* Style

* Temporal changes

* Temporal

* Make it working

* Some improvements

* Style

* Update optimum/intel/openvino/modeling_decoder.py

Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>

---------

Co-authored-by: Ella Charlaix <80481427+echarlaix@users.noreply.github.com>