Which Feature type should I use for a data column that can have many types? #5723
-
When creating a dataset, I saw the following `tokenize_captions` function in an example training script:

```python
def tokenize_captions(examples, is_train=True):
    captions = []
    for caption in examples[caption_column]:
        if random.random() < args.proportion_empty_prompts:
            captions.append("")
        elif isinstance(caption, str):
            captions.append(caption)
        elif isinstance(caption, (list, np.ndarray)):
            # take a random caption if there are multiple
            captions.append(random.choice(caption) if is_train else caption[0])
        else:
            raise ValueError(
                f"Caption column `{caption_column}` should contain either strings or lists of strings."
            )
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    return inputs.input_ids
```

The caption column can hold either a single string or a list of strings — which Feature type should I use for it?
Replies: 1 comment
-
This would require support for the PyArrow Union type. In the meantime, you can store captions as a `Sequence(Value("string"))` and slightly modify the transform as follows:

```python
def tokenize_captions(examples, is_train=True):
    captions = []
    for caption in examples[caption_column]:
        if random.random() < args.proportion_empty_prompts:
            captions.append("")
        elif isinstance(caption, (list, np.ndarray)):
            # take a random caption if there are multiple
            if len(caption) == 1:
                captions.append(caption[0])
            else:
                captions.append(random.choice(caption) if is_train else caption[0])
        else:
            raise ValueError(
                f"Caption column `{caption_column}` should contain lists of strings."
            )
    inputs = tokenizer(
        captions, max_length=tokenizer.model_max_length, padding="max_length", truncation=True, return_tensors="pt"
    )
    return inputs.input_ids
```