update custom data by JustinLin610 · Pull Request #68 · intel/auto-round

JustinLin610 · 2024-04-01T14:53:43Z

This PR adds custom dataset support. The custom data is a JSONL file in which each line is a dict with the key 'text'. In the shell script, use --dataset custom --data_path path_to_jsonl.
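As a rough illustration of the file format described above, the following sketch writes and reads a JSONL file with one `{"text": ...}` object per line. The helper names `write_jsonl` and `read_jsonl_texts` and the file name `calib.jsonl` are hypothetical and are not part of auto-round; the PR only specifies the one-dict-per-line layout and the 'text' key.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(path, texts):
    """Write one JSON object per line, each holding a single 'text' key."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # json.dumps escapes embedded newlines, so each record stays on one line.
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_jsonl_texts(path):
    """Return the 'text' field of every non-empty line in a JSONL file."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            if "text" not in record:
                raise ValueError(f"line {line_no}: missing 'text' key")
            texts.append(record["text"])
    return texts


if __name__ == "__main__":
    # Hypothetical calibration samples, used only to exercise the helpers.
    samples = ["First calibration sample.", "Second sample\nwith a line break."]
    path = Path(tempfile.mkdtemp()) / "calib.jsonl"
    write_jsonl(path, samples)
    print(read_jsonl_texts(path))
```

A file produced this way would then be passed through the `--data_path` flag mentioned above.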


wenhuach21 · 2024-04-02T02:22:16Z


    autoround = round(model, tokenizer, args.bits, args.group_size, sym=args.sym, batch_size=args.train_bs,
-                      seqlen=seqlen, n_blocks=args.n_blocks, iters=args.iters, lr=args.lr,
+                      dataset=args.dataset, data_path=args.data_path, seqlen=seqlen, n_blocks=args.n_blocks, iters=args.iters, lr=args.lr,


Hi Justin, thanks for the PR.
Since the dataset already takes many arguments, I think it would be better to merge dataset and data_path into a single argument. This could be implemented as follows:

1. Register a local-file calibration dataset:

    @register_dataset("local")
    def get_local_dataloader(
        tokenizer, seqlen, dataset_name="./tmp.json", split=["train", "validation", "test"], seed=42, bs=4
    ):

2. Then change the code in autoround.py (https://github.com/intel/auto-round/blob/main/auto_round/autoround.py#L638):

    from .calib_dataset import CALIB_DATASETS
    if is_local_file(self.dataset_name):
        get_dataloader = CALIB_DATASETS.get("local")
    else:
        get_dataloader = CALIB_DATASETS.get(self.dataset_name, CALIB_DATASETS["NeelNanda/pile-10k"])
    self.dataloader = get_dataloader(
        self.tokenizer,
        self.seqlen,
        seed=self.seed,
        bs=self.train_bs,
        split=self.dataset_split,
        dataset_name=self.dataset_name,
    )

If this works for you, I can implement it on top of your PR.
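The dispatch proposed in this comment can be sketched in isolation as below. This is only a minimal stand-in under stated assumptions: `CALIB_DATASETS` and `register_dataset` mimic a decorator-based registry, `is_local_file` is assumed to be a plain existence check (the comment names it but does not define it), `select_dataloader` is a hypothetical wrapper around the if/else, and the dataloader functions return placeholder tuples instead of real dataloaders.

```python
import os
import tempfile

# Hypothetical stand-in for the registry imported from .calib_dataset.
CALIB_DATASETS = {}


def register_dataset(name):
    """Decorator that stores a dataloader factory under `name`."""
    def decorator(fn):
        CALIB_DATASETS[name] = fn
        return fn
    return decorator


def is_local_file(dataset_name):
    """Assumed behavior: the name is local if it points to an existing file."""
    return os.path.isfile(dataset_name)


@register_dataset("NeelNanda/pile-10k")
def get_pile_dataloader(tokenizer, seqlen, dataset_name="NeelNanda/pile-10k", **kwargs):
    # Placeholder: a real implementation would load and tokenize the dataset.
    return ("hub", dataset_name)


@register_dataset("local")
def get_local_dataloader(tokenizer, seqlen, dataset_name="./tmp.json", **kwargs):
    # Placeholder: a real implementation would read and tokenize the local file.
    return ("local", dataset_name)


def select_dataloader(dataset_name):
    """Pick the local loader for existing files, else fall back to the default."""
    if is_local_file(dataset_name):
        return CALIB_DATASETS["local"]
    return CALIB_DATASETS.get(dataset_name, CALIB_DATASETS["NeelNanda/pile-10k"])


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False) as tmp:
        local_path = tmp.name
    print(select_dataloader(local_path).__name__)
    print(select_dataloader("some/unregistered-dataset").__name__)
    os.remove(local_path)
```

With this shape, a single dataset argument covers both cases: a path to an existing file selects the "local" entry, and any other value is looked up by name with the pile-10k entry as the fallback.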

JustinLin610 · Apr 2, 2024

Sure, feel free to make changes. It all looks good to me.

JustinLin610 and others added 2 commits April 1, 2024 22:51

update custom data

2b8d576

[pre-commit.ci] auto fixes from pre-commit.com hooks

07f4a9c

for more information, see https://pre-commit.ci

wenhuach21 reviewed Apr 2, 2024


wenhuach21 changed the base branch from main to local_customized_data April 3, 2024 03:28

wenhuach21 merged commit a963590 into intel:local_customized_data Apr 3, 2024

wenhuach21 merged 2 commits into intel:local_customized_data from JustinLin610:fix_custom_data
