Which Feature type should I use for a data column that can have many types? #5723
-
When creating a dataset, I saw the following `tokenize_captions` function in an example training script:

```python
def tokenize_captions(examples, is_train=True):
    captions = []
    for caption in examples[caption_column]:
        if random.random() < args.proportion_empty_prompts:
            captions.append("")
        elif isinstance(caption, str):
            captions.append(caption)
        elif isinstance(caption, (list, np.ndarray)):
            # take a random caption if there are multiple
            captions.append(random.choice(caption) if is_train else caption[0])
        else:
            raise ValueError(
                f"Caption column `{caption_column}` should contain either strings or lists of strings."
            )
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    return inputs.input_ids
```

The caption column can hold either a single string or a list of strings — which Feature type should I use for it?
Replies: 1 comment
-
This would require support for the PyArrow Union type. In the meantime, you can store captions as a `Sequence(Value("string"))` and slightly modify the transform as follows:

```python
def tokenize_captions(examples, is_train=True):
    captions = []
    for caption in examples[caption_column]:
        if random.random() < args.proportion_empty_prompts:
            captions.append("")
        elif isinstance(caption, (list, np.ndarray)):
            # take a random caption if there are multiple
            if len(caption) == 1:
                captions.append(caption[0])
            else:
                captions.append(random.choice(caption) if is_train else caption[0])
        else:
            raise ValueError(
                f"Caption column `{caption_column}` should contain lists of strings."
            )
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    return inputs.input_ids
```