I would like to use a Hugging Face tokenizer as a seqio.Vocabulary in t5x.
I subclassed seqio.Vocabulary to create my own BBPEVocabulary. However, the 'inputs' and 'targets' values are not accessible as text inside tf.data.Dataset.map: a Hugging Face tokenizer expects a Python string, but tf.data.Dataset passes a symbolic tensor such as Tensor("args_0:0", shape=(), dtype=string).
Since seqio's SentencePiece vocabulary delegates to the compiled .so ops in tf_text's SentencePiece module, I don't see how to handle this case there.
Could you advise on how to obtain and process a tf.Tensor as text, so that a Hugging Face tokenizer can be used inside tf.data.Dataset.map?
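For context on why the string is not available: Dataset.map traces the mapped function into a TensorFlow graph rather than running it eagerly, so its arguments are symbolic tensors with no concrete value. A small probe (a sketch using only TensorFlow, not seqio) makes this visible:

```python
import tensorflow as tf

traced_eagerly = []

def probe(text):
    # Record whether we are executing eagerly at the point where
    # Dataset.map calls this function. During tracing this is False,
    # so `text` is a symbolic tensor with no .numpy() value.
    traced_eagerly.append(tf.executing_eagerly())
    return text

ds = tf.data.Dataset.from_tensor_slices(["inputs"]).map(probe)
_ = next(iter(ds))
# Every recorded value is False: the function body ran in graph mode.
```

This is why `encode_tf` receives `Tensor("args_0:0", shape=(), dtype=string)` instead of a Python str.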
def tokenize(dataset: tf.data.Dataset,
             output_features: OutputFeaturesType,
             copy_pretokenized: bool = True,
             with_eos: bool = False) -> tf.data.Dataset:
  tokenize_fn = functools.partial(
      tokenize_impl,
      output_features=output_features,
      copy_pretokenized=copy_pretokenized,
      with_eos=with_eos)
  return utils.map_over_dataset(fn=tokenize_fn)(dataset)


def tokenize_impl(features: Mapping[str, tf.Tensor],
                  output_features: OutputFeaturesType,
                  copy_pretokenized: bool = True,
                  with_eos: bool = False) -> Mapping[str, tf.Tensor]:
  ret = {}
  for k, v in features.items():
    if k in output_features:
      if copy_pretokenized:
        ret[f'{k}_pretokenized'] = v
      vocab = output_features[k].vocabulary
      v = vocab.encode_tf(v)  # On this line `v` is a symbolic tf.Tensor, and I can't obtain the text of `v`
      ...[omitted]...
    ret[k] = v
  print(f'tokenize_impl | complete | return : {ret}')
  return ret
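One workaround (a sketch, not part of seqio's API) is to implement the custom vocabulary's `encode_tf` with `tf.py_function`, which executes a Python function eagerly inside the graph so the string can be materialized via `.numpy()`. The `hf_encode` helper below is a hypothetical stand-in for a real Hugging Face tokenizer's `encode` method:

```python
import tensorflow as tf

def hf_encode(s: str) -> list:
    # Hypothetical stand-in for a Hugging Face tokenizer, e.g.
    # transformers.AutoTokenizer.from_pretrained(...).encode(s).
    return [ord(c) for c in s]

def py_encode(text: tf.Tensor) -> tf.Tensor:
    # Inside tf.py_function this body runs eagerly, so `text` is an
    # EagerTensor and .numpy() yields the raw bytes of the string.
    s = text.numpy().decode("utf-8")
    return tf.constant(hf_encode(s), dtype=tf.int32)

def encode_tf(text: tf.Tensor) -> tf.Tensor:
    ids = tf.py_function(func=py_encode, inp=[text], Tout=tf.int32)
    ids.set_shape([None])  # py_function drops static shape; restore the rank
    return ids

ds = tf.data.Dataset.from_tensor_slices(["hi"]).map(encode_tf)
```

Note that a `tf.py_function` body runs in the Python interpreter, which limits parallelism and makes the resulting graph non-serializable (e.g. it cannot be exported in a SavedModel), so this is a pragmatic bridge rather than a drop-in replacement for the tf_text SentencePiece ops.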
I am attaching the code I used, taken from:
seqio/custom_task.py
seqio/preprocessors.py
Thank you :)