Add support for generic data sets to SliceGPT pass #1145

Merged: 1 commit into main on May 9, 2024
Conversation

@shaahji (Contributor) commented May 9, 2024


The SliceGPT implementation supported only a handful of specific datasets. This change widens support to any generic dataset via the data_config configuration.

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass.
  • Update documents if necessary.
  • Lint and apply fixes to your code by running `lintrunner -a`
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.
  • Is this PR including examples changes? If yes, please remember to update example documentation in a follow-up PR.

(Optional) Issue link

The first review thread is on this part of the diff:

```python
dataloader = data_config.to_data_container().create_dataloader(data_root)
dataset = [
    {
        "input_ids": data[0]["input_ids"].squeeze(),
```
Contributor:

is the squeeze to remove the batch dimension so that you can apply the config.calibration_batch_size later? Does this mean the data config must use batch size 1?

@shaahji (Author):

Yes, the batch size has to be 1. However, even with batch size of 1, the generated output needs to be squeezed to drop the extra dimension.
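
For illustration, here is a minimal standalone sketch (plain PyTorch, not Olive code) of the extra dimension being discussed: with batch_size=1, the default collate still stacks samples along a new leading batch dimension, which squeeze() then drops.

```python
import torch
from torch.utils.data import DataLoader

# Two toy samples; the default collate stacks them along a new batch dim.
samples = [{"input_ids": torch.arange(8)} for _ in range(2)]
loader = DataLoader(samples, batch_size=1)

batch = next(iter(loader))
print(batch["input_ids"].shape)            # torch.Size([1, 8]) -- extra leading dim
print(batch["input_ids"].squeeze().shape)  # torch.Size([8])    -- after squeeze
```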

Contributor:

thanks! that makes sense. the default_dataloader from data config always inserts a batch dimension even if it is 1.

You could try this dataloader:

```python
def no_auto_batch_dataloader(dataset, **kwargs):
```

if you don't want the dataloader to batch the data.
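
For reference, one way such a dataloader could be written; this is a hedged sketch, not necessarily how no_auto_batch_dataloader is implemented in the repo:

```python
from torch.utils.data import DataLoader

# Sketch only: batch_size=None disables PyTorch's automatic batching,
# so each sample is yielded exactly as stored, with no leading batch dim.
def no_auto_batch_dataloader(dataset, **kwargs):
    kwargs.pop("batch_size", None)  # drop any requested batch size
    return DataLoader(dataset, batch_size=None, **kwargs)
```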

The second review thread is on this part of the diff:

```python
]

torch.manual_seed(config.seed)
sampler = SubsetRandomSampler(torch.randperm(len(dataset))[: config.calibration_nsamples])
```
@jambayk (Contributor) commented May 9, 2024:

This looks okay for now, but in the future we can probably add an option in the dataloader section of the data config for random sampling, so that this extra work of rewrapping the data with a dataloader is not needed.

That would also remove the potential confusion between batch_size/max_samples in the data config and calibration_batch_size/calibration_nsamples here.
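
For context, the rewrapping presumably amounts to something like the following (a hypothetical sketch using the names from the excerpt above, not necessarily the exact code in the pass):

```python
from torch.utils.data import DataLoader

# Hypothetical sketch of the rewrapping step: the squeezed samples are
# re-batched at the calibration batch size, drawing only the randomly
# selected calibration subset via the sampler built above.
calibration_loader = DataLoader(
    dataset,
    batch_size=config.calibration_batch_size,
    sampler=sampler,
)
```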

@shaahji shaahji merged commit 2b361b5 into main May 9, 2024
35 checks passed
@shaahji shaahji deleted the shaahji/slicegpt branch May 9, 2024 22:27