Hi @cheulyop, I think most of your questions are answered on this page of the documentation: https://huggingface.co/docs/datasets/about_map_batch

Also, it's faster to use the following rather than the piece of code that you shared:

original_column_names = dataset.column_names
dataset = dataset.map(
    prepare_dataset_wrapper(processor, by_utterances),
    batched=True if by_utterances else False,
    batch_size=1,
    num_proc=16
)
dataset = dataset.remove_columns(original_column_names)

More information at https://huggingface.co/docs/datasets/process#map
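
To make the batched behaviour concrete, here is a minimal sketch of a batched map function that returns more rows than it receives. The column names "speech" and "utterance", the max_words parameter, and the helper names split_into_utterances and augment are hypothetical and chosen only for illustration; this is not the prepare_dataset_wrapper from the thread. It assumes the function returns only the new column, so the original columns are dropped through map's remove_columns argument to avoid a length mismatch between old and new columns.

```python
def split_into_utterances(batch, max_words=3):
    """Batched map function: receives a dict of lists, returns a dict of lists.

    The returned lists may be longer than the input lists, which is how a
    batched map call can change the number of rows.
    """
    utterances = []
    for speech in batch["speech"]:  # "speech" is a hypothetical column name
        words = speech.split()
        # Cut each speech into chunks of at most `max_words` words.
        for start in range(0, len(words), max_words):
            utterances.append(" ".join(words[start:start + max_words]))
    return {"utterance": utterances}


def augment(dataset):
    # `dataset` is assumed to be a datasets.Dataset with a "speech" column.
    # batch_size=1 means each call sees one original row (as a one-item batch).
    return dataset.map(
        split_into_utterances,
        batched=True,
        batch_size=1,
        remove_columns=dataset.column_names,
    )


# The function itself is plain Python and can be checked on a toy batch.
toy_batch = {"speech": ["a b c d e"]}
print(split_into_utterances(toy_batch))  # {'utterance': ['a b c', 'd e']}
```

With batched=True and batch_size=1, the function still processes one original example per call, but because both input and output are dicts of lists, each call is free to emit several rows.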
Can I make dataset.map return a batch of examples (multiple rows) instead of an example (single row) while batched is set to False?

I'm augmenting my dataset by splitting long speeches into groups of shorter utterances, processing one example at a time.

It seems I must set batched=True to have map return a batch of examples. Otherwise, even though my function returns a dictionary of lists with multiple items, the resulting dataset has the same number of rows as the original dataset.

This doesn't seem like an intuitive result. Does the batched flag only tell the map function whether the inputs are batched, and not the outputs?

Maybe I just need to read the documentation more carefully, but if anyone else has had this problem, how did you get around it?

FYI, here's my code: