Can dataset.map return batch with batched=False? #2664

cheulyop · 2021-07-16T23:26:15Z


Can I make dataset.map return a batch of examples (multiple rows) instead of an example (single row) while batched is set to False?

I'm augmenting my dataset by splitting long speeches into groups of shorter utterances by processing one example at a time.

It seems I must set batched=True to have map return a batch of examples. Otherwise, even though my function returns a dictionary of lists with multiple items, the resulting dataset has the same number of rows as the original dataset.

This doesn't seem intuitive to me: does setting batched=False only tell map that the inputs are not batched, without saying anything about the outputs?

Maybe I just need to read the documentation more carefully, but if anyone else also had this problem, how did you get around this issue?

FYI, here's my code:

from itertools import chain


def prepare_dataset_wrapper(processor, by_utterances=False):
    def prepare_dataset(example):
        # check that all files have the correct sampling rate
        # (pick the value first: `==` binds tighter than the conditional expression)
        sampling_rate = example['sampling_rate'][0] if by_utterances else example['sampling_rate']
        assert sampling_rate == 16_000, (
            f'Make sure all inputs have the same sampling rate of {processor.feature_extractor.sampling_rate}.'
        )

        utterances = example['utterances'][0] if by_utterances else example['utterances']
        speech = example['speech'][0] if by_utterances else example['speech']
        if by_utterances:
            utterances = [utt for utt in utterances if len(utt['text']) >= 20]
        
        # for each utterance, slice the full speech by the start and end of an utterance
        utt_arrays = [speech[int(utt['start']):int(utt['end'])] for utt in utterances]
        example['input_values'] = processor(utt_arrays, sampling_rate=16_000).input_values
    
        if by_utterances:
            batch = {
                'input_values': example['input_values'],
                'id': example['id'] * len(utterances),
                'age': example['age'] * len(utterances),
                'sex': example['sex'] * len(utterances),
                'group': example['group'] * len(utterances),
                'control': example['control'] * len(utterances),
                'education': example['education'] * len(utterances),
                'sampling_rate': example['sampling_rate'] * len(utterances),
            }
            with processor.as_target_processor():
                batch['labels'] = processor([utt['text'] for utt in utterances]).input_ids
            return batch

        else:
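            # flatten the per-utterance arrays into one sequence
            # (chain(x) on a single iterable would leave it unchanged)
            example['input_values'] = list(chain.from_iterable(example['input_values']))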
            with processor.as_target_processor():
                example['labels'] = processor(' '.join([utt['text'] for utt in utterances])).input_ids
            return example
    
    return prepare_dataset


dataset = dataset.map(
    prepare_dataset_wrapper(processor, by_utterances),
    batched=by_utterances,
    batch_size=1,
    num_proc=16,
    remove_columns=dataset.column_names
)

alvarobartt (Maintainer) · 2022-05-31T15:15:42Z

Hi @cheulyop, I think most of your questions are answered on this page of the documentation: https://huggingface.co/docs/datasets/about_map_batch
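
As a rough mental model of the behaviour described in the question, here is a toy, pure-Python emulation of how the row count behaves in each mode. `toy_map`, `split_words`, and `split_words_batched` are hypothetical names made up for this sketch, and the toy keeps only the columns the function returns. It is an assumption-laden illustration, not how `datasets` implements `map`.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def toy_map(rows: List[Row], fn: Callable, batched: bool = False,
            batch_size: int = 1000) -> List[Row]:
    """Toy model of row-count behaviour only (hypothetical, not the library code)."""
    out: List[Row] = []
    if not batched:
        # One row in, one row out: a returned list is stored in a single cell.
        for row in rows:
            out.append(fn(row))
        return out
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Batched input is a dict of lists, one list per column.
        batch = {key: [r[key] for r in chunk] for key in chunk[0]}
        result = fn(batch)
        lengths = {len(values) for values in result.values()}
        if len(lengths) != 1:
            raise ValueError("all returned columns must have the same length")
        # Each position in the returned lists becomes its own row,
        # so the output may have more or fewer rows than the input.
        for i in range(lengths.pop()):
            out.append({key: values[i] for key, values in result.items()})
    return out


def split_words(row):
    return {"word": row["text"].split()}


def split_words_batched(batch):
    return {"word": [w for text in batch["text"] for w in text.split()]}


rows = [{"text": "a b c"}, {"text": "d e"}]
per_row = toy_map(rows, split_words)                       # still 2 rows
per_batch = toy_map(rows, split_words_batched, batched=True, batch_size=1)  # 5 rows
```

In this sketch, `batched` describes both sides of the function: without it, the function receives one example and its return value is treated as one example, which is why the row count stays the same.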

Also, it's faster to use the following rather than the piece of code that you shared:

original_column_names = dataset.column_names
dataset = dataset.map(
    prepare_dataset_wrapper(processor, by_utterances),
    batched=by_utterances,
    batch_size=1,
    num_proc=16
)
dataset = dataset.remove_columns(original_column_names)

More information at https://huggingface.co/docs/datasets/process#map
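
For the splitting use case itself, a minimal sketch of a batched function that turns each long example into several utterance rows might look like this. The column names (`id`, `speech`, `utterances` with `start`, `end`, `text`) are taken from the question; the function name `split_into_utterances`, the `min_chars` parameter, and the sample data are hypothetical, and the processor calls are left out so the sketch runs on plain Python data.

```python
def split_into_utterances(batch, min_chars=20):
    """Expand each example in a dict-of-lists batch into one row per utterance."""
    out = {"id": [], "text": [], "speech": []}
    for ex_id, speech, utterances in zip(
        batch["id"], batch["speech"], batch["utterances"]
    ):
        for utt in utterances:
            # Skip very short utterances, as in the original code.
            if len(utt["text"]) < min_chars:
                continue
            out["id"].append(ex_id)  # repeat per-example metadata for each new row
            out["text"].append(utt["text"])
            out["speech"].append(speech[int(utt["start"]):int(utt["end"])])
    return out


# Hypothetical sample batch with a single long example.
sample_batch = {
    "id": ["spk1"],
    "speech": [list(range(10))],
    "utterances": [[
        {"start": 0, "end": 4, "text": "a" * 20},
        {"start": 4, "end": 10, "text": "short"},
        {"start": 6, "end": 10, "text": "b" * 25},
    ]],
}
expanded = split_into_utterances(sample_batch)
```

With `datasets`, this would be passed as `dataset.map(split_into_utterances, batched=True, remove_columns=dataset.column_names)`. If I read the batch-mapping documentation correctly, a function that changes the number of rows needs the old columns dropped in that same `map` call, since columns of different lengths cannot be combined into one table.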
