update custom data #68

Merged
wenhuach21 merged 2 commits into intel:local_customized_data from JustinLin610:fix_custom_data
Apr 3, 2024

Conversation

@JustinLin610
Contributor

This PR adds custom dataset support. The custom data is a JSONL file in which each line is a dict with the key 'text'. In the shell script, use --dataset custom --data_path path_to_jsonl.
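As a minimal sketch of the file format described above, the snippet below writes and reads back a JSONL file in which every line is a JSON object with a 'text' key. The helper names `write_jsonl` and `read_jsonl_texts` and the file name `calib.jsonl` are hypothetical and are not part of this PR.

```python
import json
import os
import tempfile


def write_jsonl(path, texts):
    """Write one JSON object per line, each with a single 'text' key."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            f.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")


def read_jsonl_texts(path):
    """Read the file back, skipping blank lines and rejecting records without 'text'."""
    texts = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if not isinstance(record, dict) or "text" not in record:
                raise ValueError(f"line {line_no}: expected a dict with key 'text'")
            texts.append(record["text"])
    return texts


if __name__ == "__main__":
    # Hypothetical sample data written to a temporary directory.
    path = os.path.join(tempfile.mkdtemp(), "calib.jsonl")
    write_jsonl(path, ["first calibration sample", "second calibration sample"])
    print(read_jsonl_texts(path))
```

A file produced this way could then be passed as the path_to_jsonl argument mentioned above.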


  autoround = round(model, tokenizer, args.bits, args.group_size, sym=args.sym, batch_size=args.train_bs,
-                   seqlen=seqlen, n_blocks=args.n_blocks, iters=args.iters, lr=args.lr,
+                   dataset=args.dataset, data_path=args.data_path, seqlen=seqlen, n_blocks=args.n_blocks, iters=args.iters, lr=args.lr,
Contributor

@wenhuach21 Apr 2, 2024


Hi Justin, thanks for the PR.
As there are many dataset args now, I think a better way is to merge dataset and data_path. This could be implemented as follows:

1. Register a local-file calibration dataset:

    @register_dataset("local")
    def get_local_dataloader(
        tokenizer, seqlen, dataset_name="./tmp.json", split=["train", "validation", "test"], seed=42, bs=4
    ):

2. Then change the code at https://github.com/intel/auto-round/blob/main/auto_round/autoround.py#L638 to:

    from .calib_dataset import CALIB_DATASETS

    if is_local_file(self.dataset_name):
        get_dataloader = CALIB_DATASETS.get("local")
    else:
        get_dataloader = CALIB_DATASETS.get(self.dataset_name, CALIB_DATASETS["NeelNanda/pile-10k"])
    self.dataloader = get_dataloader(
        self.tokenizer,
        self.seqlen,
        seed=self.seed,
        bs=self.train_bs,
        split=self.dataset_split,
        dataset_name=self.dataset_name,
    )

If this works for you, I can implement it based on your PR.
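The dispatch proposed in the two steps above can be sketched in isolation as follows. This is an illustration under stated assumptions, not the auto-round implementation: `CALIB_DATASETS` and `register_dataset` here are stand-ins written from scratch, `is_local_file` is assumed to be a plain file-existence check, `select_dataloader` is a hypothetical helper, and both loaders are placeholders that only report which branch was chosen.

```python
import os
import tempfile

# Stand-in registry; the real one is assumed to live in auto_round's calib_dataset module.
CALIB_DATASETS = {}


def register_dataset(name):
    """Decorator that stores a dataloader factory in the registry under `name`."""
    def decorator(fn):
        CALIB_DATASETS[name] = fn
        return fn
    return decorator


def is_local_file(dataset_name):
    """Assumption: a dataset name counts as local if it is a path to an existing file."""
    return os.path.isfile(dataset_name)


@register_dataset("NeelNanda/pile-10k")
def get_default_dataloader(tokenizer, seqlen, dataset_name="NeelNanda/pile-10k",
                           split=None, seed=42, bs=4):
    # Placeholder body: a real loader would download and tokenize the dataset.
    return ("default", dataset_name)


@register_dataset("local")
def get_local_dataloader(tokenizer, seqlen, dataset_name="./tmp.json",
                         split=None, seed=42, bs=4):
    # Placeholder body: a real loader would read and tokenize the local file.
    return ("local", dataset_name)


def select_dataloader(dataset_name):
    """Pick the local loader for existing files, else look up the name with a default fallback."""
    if is_local_file(dataset_name):
        return CALIB_DATASETS.get("local")
    return CALIB_DATASETS.get(dataset_name, CALIB_DATASETS["NeelNanda/pile-10k"])


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False) as tmp:
        local_path = tmp.name
    print(select_dataloader(local_path).__name__)
    print(select_dataloader("some/unregistered-dataset").__name__)
    os.remove(local_path)
```

With this shape, a single dataset argument covers both registered dataset names and local file paths, which is the reason given above for merging dataset and data_path.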

Contributor Author


Sure, feel free to make changes. It is all good to me.

@wenhuach21 changed the base branch from main to local_customized_data April 3, 2024 03:28
@wenhuach21 merged commit a963590 into intel:local_customized_data Apr 3, 2024
