How to apply the huggingface tokenizer in seqio.vocabulary #406

Closed
nawnoes opened this issue Dec 26, 2022 · 0 comments
nawnoes commented Dec 26, 2022

Hello.

I would like to use a Hugging Face tokenizer as the seqio.Vocabulary in t5x.

I subclassed seqio.Vocabulary to create my own BBPEVocabulary. However, the 'inputs' and 'targets' values cannot be accessed as text inside tf.data.Dataset.map: the Hugging Face tokenizer expects a Python string, but tf.data.Dataset passes a symbolic tensor such as Tensor("args_0:0", shape=(), dtype=string).

Since seqio's SentencePiece vocabulary delegates to the tf_text SentencePiece op (loaded from a .so file), I can't tell from it how this case is handled internally.

I would like to ask how to obtain the contents of the tf.Tensor as text so that a Hugging Face tokenizer can be used inside tf.data.Dataset.map.
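
One direction I am considering is wrapping the tokenizer call with tf.py_function, which hands an eager tensor to Python so that .numpy().decode() returns a real string. Below is only a rough sketch of that pattern; the model name and helper names are placeholders, not part of my code.

import tensorflow as tf
from transformers import AutoTokenizer  # placeholder: any Hugging Face tokenizer

hf_tokenizer = AutoTokenizer.from_pretrained('gpt2')  # placeholder model name

def _encode_py(text_tensor):
  # Inside tf.py_function the tensor is eager, so .numpy() gives the raw bytes.
  text = text_tensor.numpy().decode('utf-8')
  return tf.constant(hf_tokenizer.encode(text), dtype=tf.int32)

def encode_fn(text_tensor):
  ids = tf.py_function(_encode_py, inp=[text_tensor], Tout=tf.int32)
  ids.set_shape([None])  # py_function loses the static shape; restore rank 1
  return ids

ds = tf.data.Dataset.from_tensor_slices(['hello world'])
ds = ds.map(encode_fn)  # yields rank-1 int32 token-id tensors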

I am attaching the code I used below.

Thank you:)

seqio/custom_task.py

# Imports assumed for this snippet; they were omitted above.
# `preprocessors` (providing rekey, tokenize, span_corruption) and
# BBPE_OUTPUT_FEATURES are imported/defined elsewhere in my code.
import functools
import os

import seqio
import tensorflow as tf

from src.vocabularies import BBPEVocabulary

bbpe_vocab = BBPEVocabulary('custom_path')

seqio.TaskRegistry.add(
    "my_span_corruption_task",
    source=seqio.TFExampleDataSource(
        split_to_filepattern={"train": os.path.join('[MY_TF_RECORD_PATH]', "*train.tfrecord*")},
        feature_description={"text": tf.io.FixedLenFeature([], tf.string)}
    ),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,
                "targets": "text"
            }),
        preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,

    ],
    output_features=BBPE_OUTPUT_FEATURES,
    metric_fns=[])
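
If the tf.py_function pattern above is reasonable, my BBPEVocabulary would look roughly like the sketch below. This is not the actual implementation in src.vocabularies; it assumes seqio.Vocabulary's abstract interface (_encode, _decode, _encode_tf, _decode_tf, and the eos_id, unk_id, _base_vocab_size properties) and a Hugging Face tokenizer loaded from a local path.

import numpy as np
import seqio
import tensorflow as tf
from transformers import AutoTokenizer  # assumption: an HF tokenizer backs the vocabulary


class BBPEVocabulary(seqio.Vocabulary):
  """Rough sketch of a byte-level BPE vocabulary backed by a Hugging Face tokenizer."""

  def __init__(self, path: str, extra_ids: int = 0):
    self._tokenizer = AutoTokenizer.from_pretrained(path)
    super().__init__(extra_ids=extra_ids)

  @property
  def eos_id(self):
    return self._tokenizer.eos_token_id

  @property
  def unk_id(self):
    return self._tokenizer.unk_token_id

  @property
  def _base_vocab_size(self):
    return self._tokenizer.vocab_size

  def _encode(self, s):
    return self._tokenizer.encode(s)

  def _decode(self, ids):
    return self._tokenizer.decode(ids)

  def _encode_tf(self, s):
    def encode_py(text):
      # tf.numpy_function passes the string tensor to Python as bytes.
      return np.asarray(self._tokenizer.encode(text.decode('utf-8')), dtype=np.int32)

    ids = tf.numpy_function(encode_py, [s], tf.int32)
    ids.set_shape([None])
    return ids

  def _decode_tf(self, ids):
    def decode_py(id_array):
      return self._tokenizer.decode(id_array.tolist()).encode('utf-8')

    return tf.numpy_function(decode_py, [ids], tf.string)

One caveat is that tf.numpy_function runs the tokenizer outside the TensorFlow graph, so the preprocessing cannot be serialized into a pure TF graph and may be slower than an in-graph op such as tf_text's SentencePiece.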

seqio/preprocessors.py

# Imports for this excerpt from seqio/preprocessors.py (omitted above).
import functools
from typing import Mapping

import seqio
import tensorflow as tf
from seqio import utils

OutputFeaturesType = Mapping[str, seqio.Feature]


def tokenize(dataset: tf.data.Dataset,
             output_features: OutputFeaturesType,
             copy_pretokenized: bool = True,
             with_eos: bool = False) -> tf.data.Dataset:
  tokenize_fn = functools.partial(
      tokenize_impl,
      output_features=output_features,
      copy_pretokenized=copy_pretokenized,
      with_eos=with_eos)
  return utils.map_over_dataset(fn=tokenize_fn)(dataset)

def tokenize_impl(features: Mapping[str, tf.Tensor],
                  output_features: OutputFeaturesType,
                  copy_pretokenized: bool = True,
                  with_eos: bool = False) -> Mapping[str, tf.Tensor]:
  ret = {}
  for k, v in features.items():
    if k in output_features:
      if copy_pretokenized:
        ret[f'{k}_pretokenized'] = v
      vocab = output_features[k].vocabulary
      v = vocab.encode_tf(v)  # At this point `v` is a tf.Tensor of dtype string, and I can't obtain its text
      ...[omitted]...

    ret[k] = v
  print(f'tokenize_impl | complete | return : {ret}')
  return ret
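
For what it's worth, this is how I would sanity-check the vocabulary sketch above on a toy dataset, outside the full task (again assuming the hypothetical BBPEVocabulary shown earlier):

import tensorflow as tf

vocab = BBPEVocabulary('custom_path')
ds = tf.data.Dataset.from_tensor_slices({'text': ['a small test sentence']})
ds = ds.map(lambda ex: {'targets': vocab.encode_tf(ex['text'])})
for ex in ds:
  print(ex['targets'])  # expected: a rank-1 int32 tensor of token ids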
nawnoes closed this as completed on Mar 1, 2023