update custom data #68
  autoround = round(model, tokenizer, args.bits, args.group_size, sym=args.sym, batch_size=args.train_bs,
-                   seqlen=seqlen, n_blocks=args.n_blocks, iters=args.iters, lr=args.lr,
+                   dataset=args.dataset, data_path=args.data_path, seqlen=seqlen, n_blocks=args.n_blocks, iters=args.iters, lr=args.lr,
Hi Justin, thanks for the PR.
As there are now many dataset-related args, I think a better approach is to merge dataset and data_path. This could be implemented as follows:
1. Register a local-file calibration dataset:
    @register_dataset("local")
    def get_local_dataloader(
        tokenizer, seqlen, dataset_name="./tmp.json", split=["train", "validation", "test"], seed=42, bs=4
    ):
        ...  # body not given in this comment
2. Then change the code in autoround.py (https://github.com/intel/auto-round/blob/main/auto_round/autoround.py#L638):
    from .calib_dataset import CALIB_DATASETS

    if is_local_file(self.dataset_name):
        get_dataloader = CALIB_DATASETS.get("local")
    else:
        get_dataloader = CALIB_DATASETS.get(self.dataset_name, CALIB_DATASETS["NeelNanda/pile-10k"])
    self.dataloader = get_dataloader(
        self.tokenizer,
        self.seqlen,
        seed=self.seed,
        bs=self.train_bs,
        split=self.dataset_split,
        dataset_name=self.dataset_name,
    )
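To make the dispatch idea above concrete, here is a minimal, self-contained sketch of a name-to-loader registry with a local-file fallback. It is not the auto-round implementation: `CALIB_DATASETS` is modeled as a plain dict, `register_dataset` as a simple decorator, and `is_local_file` is assumed to mean "the name is a path to an existing file". The loaders are stubs that return a label instead of a real dataloader, and `select_dataloader` is a hypothetical helper name.

```python
import os

# Hypothetical stand-in for the registry in calib_dataset; a plain dict here.
CALIB_DATASETS = {}


def register_dataset(name):
    """Return a decorator that stores the decorated loader under `name`."""
    def decorator(fn):
        CALIB_DATASETS[name] = fn
        return fn
    return decorator


def is_local_file(dataset_name):
    # Assumption: a dataset name is "local" if it points to an existing file.
    return os.path.isfile(dataset_name)


@register_dataset("NeelNanda/pile-10k")
def get_pile_dataloader(dataset_name):
    # Stub: a real loader would build a dataloader from the hub dataset.
    return f"hub:{dataset_name}"


@register_dataset("local")
def get_local_dataloader(dataset_name):
    # Stub: a real loader would read and tokenize the local file.
    return f"local:{dataset_name}"


def select_dataloader(dataset_name):
    """Pick the local loader for file paths, else look up by name with a default."""
    if is_local_file(dataset_name):
        return CALIB_DATASETS["local"]
    return CALIB_DATASETS.get(dataset_name, CALIB_DATASETS["NeelNanda/pile-10k"])
```

With this shape, a single `dataset` argument can carry either a registered dataset name or a file path, so a separate `data_path` argument is no longer needed.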
If this works for you, I can implement it based on your PR.
Sure, feel free to make changes. It all looks good to me.
This PR adds custom dataset support. The custom data is a JSONL file where each line is a dict with the key 'text'. In the shell script, use
--dataset custom --data_path path_to_jsonl.
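For reference, a small sketch of the file format described above: one JSON object per line, each with a 'text' key. The helper names `write_jsonl_texts` and `read_jsonl_texts` are illustrative only and are not part of this PR; the actual loader may also tokenize and batch the text.

```python
import json


def write_jsonl_texts(path, texts):
    """Write one {"text": ...} JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}) + "\n")


def read_jsonl_texts(path):
    """Collect the 'text' field from every non-empty line of a JSONL file."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines, e.g. a trailing newline
            record = json.loads(line)
            texts.append(record["text"])
    return texts
```

A file written this way would contain lines such as `{"text": "first calibration sample"}`, and its path is what would be passed to --data_path.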