How to apply the huggingface tokenizer in seqio.vocabulary #406

Closed
nawnoes opened this issue Dec 26, 2022 · 0 comments
nawnoes commented Dec 26, 2022

Hello.

I would like to use a Hugging Face tokenizer as the seqio.Vocabulary in t5x.

I subclassed seqio.Vocabulary to create my own BBPEVocabulary. However, the 'inputs' and 'targets' values cannot be accessed as text inside tf.data.Dataset.map: the Hugging Face tokenizer expects a Python string, but tf.data.Dataset passes a symbolic tensor such as Tensor("args_0:0", shape=(), dtype=string).

Since seqio's SentencePiece vocabulary delegates to the tf_text SentencePiece op (loaded from a .so file), I can't tell from it how this case is handled internally.

I would like to ask how to obtain the contents of the tf.Tensor as text so that a Hugging Face tokenizer can be used inside tf.data.Dataset.map.
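
One direction I am considering is wrapping the tokenizer call with tf.py_function, which hands an eager tensor to Python so that .numpy().decode() returns a real string. Below is only a rough sketch of that pattern; the model name and helper names are placeholders, not part of my code.

import tensorflow as tf
from transformers import AutoTokenizer  # placeholder: any Hugging Face tokenizer

hf_tokenizer = AutoTokenizer.from_pretrained('gpt2')  # placeholder model name

def _encode_py(text_tensor):
  # Inside tf.py_function the tensor is eager, so .numpy() gives the raw bytes.
  text = text_tensor.numpy().decode('utf-8')
  return tf.constant(hf_tokenizer.encode(text), dtype=tf.int32)

def encode_fn(text_tensor):
  ids = tf.py_function(_encode_py, inp=[text_tensor], Tout=tf.int32)
  ids.set_shape([None])  # py_function loses the static shape; restore rank 1
  return ids

ds = tf.data.Dataset.from_tensor_slices(['hello world'])
ds = ds.map(encode_fn)  # yields rank-1 int32 token-id tensors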

I am attaching the code I used below.

Thank you:)

seqio/custom_task.py

# Imports assumed for this snippet; they were omitted above.
# `preprocessors` (providing rekey, tokenize, span_corruption) and
# BBPE_OUTPUT_FEATURES are imported/defined elsewhere in my code.
import functools
import os

import seqio
import tensorflow as tf

from src.vocabularies import BBPEVocabulary

bbpe_vocab = BBPEVocabulary('custom_path')

seqio.TaskRegistry.add(
    "my_span_corruption_task",
    source=seqio.TFExampleDataSource(
        split_to_filepattern={"train": os.path.join('[MY_TF_RECORD_PATH]', "*train.tfrecord*")},
        feature_description={"text": tf.io.FixedLenFeature([], tf.string)}
    ),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                "inputs": None,
                "targets": "text"
            }),
        preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,

    ],
    output_features=BBPE_OUTPUT_FEATURES,
    metric_fns=[])
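
If the tf.py_function pattern above is reasonable, my BBPEVocabulary would look roughly like the sketch below. This is not the actual implementation in src.vocabularies; it assumes seqio.Vocabulary's abstract interface (_encode, _decode, _encode_tf, _decode_tf, and the eos_id, unk_id, _base_vocab_size properties) and a Hugging Face tokenizer loaded from a local path.

import numpy as np
import seqio
import tensorflow as tf
from transformers import AutoTokenizer  # assumption: an HF tokenizer backs the vocabulary


class BBPEVocabulary(seqio.Vocabulary):
  """Rough sketch of a byte-level BPE vocabulary backed by a Hugging Face tokenizer."""

  def __init__(self, path: str, extra_ids: int = 0):
    self._tokenizer = AutoTokenizer.from_pretrained(path)
    super().__init__(extra_ids=extra_ids)

  @property
  def eos_id(self):
    return self._tokenizer.eos_token_id

  @property
  def unk_id(self):
    return self._tokenizer.unk_token_id

  @property
  def _base_vocab_size(self):
    return self._tokenizer.vocab_size

  def _encode(self, s):
    return self._tokenizer.encode(s)

  def _decode(self, ids):
    return self._tokenizer.decode(ids)

  def _encode_tf(self, s):
    def encode_py(text):
      # tf.numpy_function passes the string tensor to Python as bytes.
      return np.asarray(self._tokenizer.encode(text.decode('utf-8')), dtype=np.int32)

    ids = tf.numpy_function(encode_py, [s], tf.int32)
    ids.set_shape([None])
    return ids

  def _decode_tf(self, ids):
    def decode_py(id_array):
      return self._tokenizer.decode(id_array.tolist()).encode('utf-8')

    return tf.numpy_function(decode_py, [ids], tf.string)

One caveat is that tf.numpy_function runs the tokenizer outside the TensorFlow graph, so the preprocessing cannot be serialized into a pure TF graph and may be slower than an in-graph op such as tf_text's SentencePiece.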

seqio/preprocessors.py

# Imports for this excerpt from seqio/preprocessors.py (omitted above).
import functools
from typing import Mapping

import seqio
import tensorflow as tf
from seqio import utils

OutputFeaturesType = Mapping[str, seqio.Feature]


def tokenize(dataset: tf.data.Dataset,
             output_features: OutputFeaturesType,
             copy_pretokenized: bool = True,
             with_eos: bool = False) -> tf.data.Dataset:
  tokenize_fn = functools.partial(
      tokenize_impl,
      output_features=output_features,
      copy_pretokenized=copy_pretokenized,
      with_eos=with_eos)
  return utils.map_over_dataset(fn=tokenize_fn)(dataset)

def tokenize_impl(features: Mapping[str, tf.Tensor],
                  output_features: OutputFeaturesType,
                  copy_pretokenized: bool = True,
                  with_eos: bool = False) -> Mapping[str, tf.Tensor]:
  ret = {}
  for k, v in features.items():
    if k in output_features:
      if copy_pretokenized:
        ret[f'{k}_pretokenized'] = v
      vocab = output_features[k].vocabulary
      v = vocab.encode_tf(v)  # At this point `v` is a tf.Tensor of dtype string, and I can't obtain its text
      ...[omitted]...

    ret[k] = v
  print(f'tokenize_impl | complete | return : {ret}')
  return ret
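
For what it's worth, this is how I would sanity-check the vocabulary sketch above on a toy dataset, outside the full task (again assuming the hypothetical BBPEVocabulary shown earlier):

import tensorflow as tf

vocab = BBPEVocabulary('custom_path')
ds = tf.data.Dataset.from_tensor_slices({'text': ['a small test sentence']})
ds = ds.map(lambda ex: {'targets': vocab.encode_tf(ex['text'])})
for ex in ds:
  print(ex['targets'])  # expected: a rank-1 int32 tensor of token ids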
nawnoes closed this as completed on Mar 1, 2023