[Good First Issue][NNCF]: Add INT8 weight compression conformance test for Tinyllama-1.1b PyTorch model #2527
Comments
Hi, is it possible to take this one?
Hello @RedShift51, the task is assigned to you. Thank you for looking into this issue! Please let us know if you have any questions or require any help.
Hey, what metric value is okay for tinyllama/tinyllama-1.1b-step-50k-105b?
Hey, here are the similarity results between the float16 and INT8 weight-compressed tinyllama-1.1b-step-50k-105b models on whowhatbench. Code to reproduce:
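A minimal sketch of how such a comparison can be reproduced, assuming whowhatbench's `Evaluator` API and reusing the model ID and tokenization parameters from this thread:

```python
# Sketch (assumed whowhatbench API): compare a float16 baseline against
# its INT8 weight-compressed copy and print the similarity metric.
import nncf
import torch
import transformers
import whowhatbench

MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
base_model = transformers.AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cpu"
)

# Baseline answers are collected from the float16 model.
evaluator = whowhatbench.Evaluator(base_model=base_model, tokenizer=tokenizer)

# Compress a second copy of the model to INT8 with NNCF.
compressed_model = transformers.AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cpu"
)
text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
tokens = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
inputs = {"input_ids": tokens["input_ids"], "attention_mask": tokens["attention_mask"]}
compressed_model = nncf.compress_weights(compressed_model, dataset=nncf.Dataset([inputs]))

# Score the compressed model against the baseline.
_, all_metrics = evaluator.score(compressed_model)
print(all_metrics["similarity"][0])
```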
The main idea of whowhatbench is to compare the outputs of the baseline model and the compressed model.
@RedShift51, are you going to continue working on this issue? Do you have any updates?
Removed assignment due to inactivity.
.take
Thank you for looking into this issue! Please let us know if you have any questions or require any help.
.take
Thank you for looking into this issue! Please let us know if you have any questions or require any help.
@alexsu52 @AlexanderDokuchaev If I add the following code to the `LMWeightCompression(BaseTestPipeline)` class:

```python
class LMWeightCompression(BaseTestPipeline):
    ...

    def compress(self) -> None:
        if self.backend == BackendType.FP32:
            return
        elif self.backend == BackendType.TORCH:
            start_time = time.perf_counter()
            MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"
            tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
            self.model = transformers.AutoModelForCausalLM.from_pretrained(
                MODEL_ID, torch_dtype=torch.float16, device_map="cpu"
            )
            text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
            token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
            inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}
            # memory_usage expects the callable and its arguments as a tuple;
            # calling self._compress_torch(inputs) directly would run it
            # before profiling and pass only its return value.
            self.run_info.compression_memory_usage = memory_usage(
                (self._compress_torch, (inputs,)), max_usage=True
            )
            self.run_info.time_compression = time.perf_counter() - start_time
            return

        print("Weight compression...")
        start_time = time.perf_counter()
        self.run_info.compression_memory_usage = memory_usage(self._compress, max_usage=True)
        self.run_info.time_compression = time.perf_counter() - start_time

    def _compress_torch(self, inputs):
        self.compressed_model = nncf.compress_weights(self.model, dataset=nncf.Dataset([inputs]))

    ...
```
@alexsu52 @AlexanderDokuchaev following up on the above ^
Hi @AdiKsOnDev. An example of a `_validate` function: https://github.com/openvinotoolkit/nncf/blob/develop/tests/post_training/pipelines/image_classification_timm.py#L127. Metrics should be stored as done in `tests/post_training/pipelines/image_classification_timm.py`, lines 170 to 171 (commit 0b407de).
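A minimal sketch of what such a `_validate` implementation could look like for this pipeline, assuming whowhatbench's `Evaluator` API; `self.preprocessor` and the `run_info` field names are assumptions based on the suite's conventions:

```python
def _validate(self) -> None:
    # Score the INT8-compressed model against the float16 baseline with
    # whowhatbench and store the similarity as the pipeline metric.
    evaluator = whowhatbench.Evaluator(
        base_model=self.model, tokenizer=self.preprocessor
    )
    _, all_metrics = evaluator.score(self.compressed_model)
    similarity = all_metrics["similarity"][0]
    self.run_info.metric_name = "Similarity"
    self.run_info.metric_value = round(similarity, 5)
```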
OK, thanks for the directions
@AlexanderDokuchaev Git Blame
@AlexanderDokuchaev I added the following code for `compress()`:

```python
def compress(self) -> None:
    if self.backend == BackendType.FP32:
        return
    elif self.backend == BackendType.TORCH:
        start_time = time.perf_counter()
        tokenizer = transformers.AutoTokenizer.from_pretrained(self.model_id)
        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            self.model_id, torch_dtype=torch.float16, device_map="cpu"
        )
        text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
        token = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
        inputs = {"input_ids": token["input_ids"], "attention_mask": token["attention_mask"]}
        # Pass the method and its arguments as a tuple so memory_usage
        # profiles the call itself.
        self.run_info.compression_memory_usage = memory_usage(
            (self._compress_torch, (inputs,)), max_usage=True
        )
        self.run_info.time_compression = time.perf_counter() - start_time
        return

    print("Weight compression...")
    start_time = time.perf_counter()
    self.run_info.compression_memory_usage = memory_usage(self._compress, max_usage=True)
    self.run_info.time_compression = time.perf_counter() - start_time

def _compress_torch(self, inputs):
    self.compressed_model = nncf.compress_weights(self.model, dataset=nncf.Dataset([inputs]))
```
Add INT8 weight compression conformance test for Tinyllama-1.1b PyTorch model (#2636)

### Changes
- Added the `INT8` compression **test suite** to the `model_scope`
- Added `TORCH` backend support in the `LMWeightCompression` class
- For `INT8` compression, the _dataset_ and some other parameters (see [model_scope](https://github.com/openvinotoolkit/nncf/blob/f0081037f28af2a829043d4ddaf4902d91864724/tests/post_training/model_scope.py#L329C1-L340C7)) are set to `None`
- [metric_value](https://github.com/openvinotoolkit/nncf/blob/f0081037f28af2a829043d4ddaf4902d91864724/tests/post_training/data/wc_reference_data.yaml#L17C1-L20C15) has been set to **0.95944**
- Mainly use `save_pretrained()` for `TORCH` models
- Omitted a few method calls that are not supported for `TORCH` models (check the commits for details)

### Reason for changes
Requested to benchmark changes via `whowhatbench` in issue #2527

### Related tickets
ref: 130788
Closes #2527

### Tests
- Added `INT8` _weight compression_ **conformance** test for the `Tinyllama-1.1b` **PyTorch** model

Co-authored-by: Aleksander <aleksu52@noreply.github.com>
Co-authored-by: Alexander Suslov <alexander.suslov@intel.com>
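For reference, the corresponding entry in `wc_reference_data.yaml` presumably looks something like the following (the key name is an assumption; the metric value is taken from the PR description above):

```yaml
# Hypothetical reference-data entry; see tests/post_training/data/wc_reference_data.yaml.
tinyllama_int8_data_free_backend_TORCH:
  metric_value: 0.95944
```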
Context
This issue proposes adding a test to the post-training compression conformance suite to verify that the weights of the Tinyllama-1.1b PyTorch model can be compressed to INT8 within a given time budget while preserving an acceptable level of model accuracy on whowhatbench.
INT8 weight compression is a popular approach to reducing LLM size: quantizing the weights from their original floating-point precision to INT8 yields a smaller model footprint and potentially faster inference on target devices without a significant accuracy drop.
Here is a code snippet to illustrate how to compress the weights of the Tinyllama-1.1b PyTorch model using NNCF:
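A minimal sketch, assembled from the snippets in this thread (the calibration prompt and tokenization parameters are illustrative):

```python
# Sketch: INT8 weight compression of Tinyllama-1.1b with NNCF.
# compress_weights() defaults to INT8 weight compression.
import nncf
import torch
import transformers

MODEL_ID = "tinyllama/tinyllama-1.1b-step-50k-105b"

tokenizer = transformers.AutoTokenizer.from_pretrained(MODEL_ID)
model = transformers.AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cpu"
)

# A single tokenized prompt serves as the calibration dataset.
text = "The TinyLlama project aims to pretrain a 1.1B Llama model on 3 trillion tokens."
tokens = tokenizer(text, max_length=500, return_tensors="pt", truncation=True)
inputs = {"input_ids": tokens["input_ids"], "attention_mask": tokens["attention_mask"]}

compressed_model = nncf.compress_weights(model, dataset=nncf.Dataset([inputs]))
```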
What needs to be done?
Add an INT8 weight compression test for the Tinyllama-1.1b PyTorch model to the post-training compression conformance suite, so that the test can be run with the suite's standard pytest command.
The task steps:
- Add `TORCH` backend support to the `LMWeightCompression` class and register the model in `model_scope` (see the sketch after this list).
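A hypothetical sketch of such a `model_scope` entry, inferred from the merged PR description (names and values are assumptions):

```python
# Hypothetical model_scope entry; compare with tests/post_training/model_scope.py.
{
    "reported_name": "tinyllama_int8_data_free",
    "model_id": "tinyllama/tinyllama-1.1b-step-50k-105b",
    "pipeline_cls": LMWeightCompression,
    "compression_params": {
        "group_size": None,  # INT8 mode: per-channel, no grouping
        "ratio": None,       # INT8 mode: compress all weights
        "mode": CompressWeightsMode.INT8_ASYM,
    },
    "backends": [BackendType.TORCH],
},
```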
Example Pull Requests
#2425
Resources
Contact points
@AlexanderDokuchaev, @alexsu52
Ticket
ref: 130788