Which Feature type should I use for a data column that can have many types? #5723

Closed Answered by mariosasko
offchan42 asked this question in Q&A

A single column that holds values of several types would require support for the PyArrow Union type.

In the meantime, you can store captions as a Sequence(Value("string")) and slightly modify the transform as follows:

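import random

import numpy as np

# `args` and `caption_column` are defined by the surrounding training script.
def tokenize_captions(examples, is_train=True):
    captions = []
    for caption in examples[caption_column]:
        if random.random() < args.proportion_empty_prompts:
            # replace the caption with an empty prompt at the configured rate
            captions.append("")
        elif isinstance(caption, (list, np.ndarray)):
            # take a random caption if there are multiple
            if len(caption) == 1:
                captions.append(caption[0])
            else:
                captions.append(random.choice(caption) if is_train else caption[0])
        else:
            raise ValueError(
                f"Caption column `{caption_column}` should contain lists of strings."
            )
    # ... the remainder of the transform is unchanged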
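
For the transform above to work, every row of the caption column has to be a list, including rows that only have one caption. The following is a minimal sketch of one way to get there, not part of the original answer: the helper names `normalize_caption` and `build_caption_dataset` and the column name `"text"` are hypothetical, and it assumes the raw captions are available as a plain Python list before the dataset is built.

```python
def normalize_caption(caption):
    """Return the caption as a list of strings, wrapping a lone string."""
    if isinstance(caption, str):
        return [caption]
    return list(caption)


def build_caption_dataset(raw_captions):
    """Build a dataset whose "text" column is typed as a list of strings."""
    # Imported here so normalize_caption can be used without `datasets` installed.
    from datasets import Dataset, Features, Sequence, Value

    # "text" is a hypothetical column name; use whatever caption_column points to.
    features = Features({"text": Sequence(Value("string"))})
    captions = [normalize_caption(c) for c in raw_captions]
    return Dataset.from_dict({"text": captions}, features=features)
```

Calling `build_caption_dataset(["a photo of a cat", ["a dog on grass", "a puppy outdoors"]])` would give a dataset where both rows are lists, so the `isinstance(caption, (list, np.ndarray))` branch handles every example and the `ValueError` only fires on genuinely malformed data.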

Answer selected by offchan42