TPU VM Training Error - tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'SentencepieceOp' in binary running #965

Open
TambourineMan42 opened this issue Jan 27, 2022 · 14 comments

Describe the bug
When I run any of the fine-tuning scripts with my own training TSV file on a TPU VM (using v3-8 and v2-alpha-pod), training ends prematurely (it fails to even start).
To Reproduce
Steps to reproduce the behavior:

  1. On the VM, I've tried everything from a plain pip install t5[gcp] to installing both mesh and text-to-text-transfer-transformer from source.

Expected behavior
Here is the entire stack trace:

INFO:tensorflow:training_loop marked as finished
I0127 00:07:37.772910 139764176627712 error_handling.py:115] training_loop marked as finished
WARNING:tensorflow:Reraising captured error
W0127 00:07:37.773025 139764176627712 error_handling.py:149] Reraising captured error
Traceback (most recent call last):
  File "/home/xx/miniconda3/envs/proj/bin/t5_mesh_transformer", line 33, in <module>
    sys.exit(load_entry_point('t5', 'console_scripts', 't5_mesh_transformer')())
  File "/home/xx/text-to-text-transfer-transformer/t5/models/mesh_transformer_main.py", line 280, in console_entry_point
    app.run(main)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/xx/text-to-text-transfer-transformer/t5/models/mesh_transformer_main.py", line 274, in main
    model_dir=FLAGS.model_dir)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/xx/mesh/mesh_tensorflow/transformer/utils.py", line 2447, in run
    skip_seen_data=skip_seen_data)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/gin/config.py", line 1605, in gin_wrapper
    utils.augment_exception_message_and_reraise(e, err_str)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
    raise proxy.with_traceback(exception.__traceback__) from None
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/home/xx/mesh/mesh_tensorflow/transformer/utils.py", line 1677, in train_model
    estimator.train(input_fn=input_fn, max_steps=train_steps)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3110, in train
    rendezvous.raise_errors()
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 150, in raise_errors
    six.reraise(typ, value, traceback)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/six.py", line 703, in reraise
    raise value
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3105, in train
    saving_listeners=saving_listeners)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1208, in _train_model_default
    saving_listeners)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1510, in _train_with_estimator_spec
    save_graph_def=self._config.checkpoint_save_graph_def) as mon_sess:
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 605, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1039, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 750, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1232, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 1237, in _create_session
    return self._sess_creator.create_session()
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/training/monitored_session.py", line 910, in create_session
    hook.after_create_session(self.tf_sess, self.coord)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/util.py", line 94, in after_create_session
    session.run(self._initializers)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 968, in run
    run_metadata_ptr)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1191, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1369, in _do_run
    run_metadata)
  File "/home/xx/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1394, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'SentencepieceOp' in binary running on t1v-n-e076a4c2-w-0. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
         [[node ParallelMapDatasetV2_1 (defined at /mesh/mesh_tensorflow/transformer/dataset.py:276) ]]

Errors may have originated from an input operation.
Input Source operations connected to node ParallelMapDatasetV2_1:
 ParallelMapDatasetV2 (defined at /mesh/mesh_tensorflow/transformer/dataset.py:274)

Original stack trace for 'ParallelMapDatasetV2_1':
  File "/miniconda3/envs/proj/bin/t5_mesh_transformer", line 33, in <module>
    sys.exit(load_entry_point('t5', 'console_scripts', 't5_mesh_transformer')())
  File "/text-to-text-transfer-transformer/t5/models/mesh_transformer_main.py", line 280, in console_entry_point
    app.run(main)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/text-to-text-transfer-transformer/t5/models/mesh_transformer_main.py", line 274, in main
    model_dir=FLAGS.model_dir)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/mesh/mesh_tensorflow/transformer/utils.py", line 2447, in run
    skip_seen_data=skip_seen_data)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/mesh/mesh_tensorflow/transformer/utils.py", line 1677, in train_model
    estimator.train(input_fn=input_fn, max_steps=train_steps)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3105, in train
    saving_listeners=saving_listeners)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 349, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1175, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1204, in _train_model_default
    self.config)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2942, in _call_model_fn
    config)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1163, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3222, in _model_fn
    input_holders.generate_infeed_enqueue_ops_and_dequeue_fn())
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1484, in generate_infeed_enqueue_ops_and_dequeue_fn
    self._invoke_input_fn_and_record_structure())
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1539, in _invoke_input_fn_and_record_structure
    num_hosts))
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 1143, in generate_broadcast_enqueue_ops_fn
    inputs = _Inputs.from_input_fn(input_fn(user_context))
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3076, in _input_fn
    return input_fn(**kwargs)
  File "/mesh/mesh_tensorflow/transformer/utils.py", line 1662, in input_fn
    dataset_split=dataset_split)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/text-to-text-transfer-transformer/t5/models/mesh_transformer.py", line 219, in tsv_dataset_fn
    eos_id=1)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/gin/config.py", line 1582, in gin_wrapper
    return fn(*new_args, **new_kwargs)
  File "/mesh/mesh_tensorflow/transformer/dataset.py", line 276, in packed_parallel_tsv_dataset
    _encode_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 2770, in map
    preserve_cardinality=False))
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/data/ops/dataset_ops.py", line 4544, in __init__
    **self._flat_structure)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 5550, in parallel_map_dataset_v2
    name=name)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 750, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3565, in _create_op_internal
    op_def=op_def)
  File "/miniconda3/envs/proj/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2045, in __init__
    self._traceback = tf_stack.extract_stack_for_node(self._c_op)

  In call to configurable 'train_model' (<function train_model at 0x7f1a14e977a0>)
  In call to configurable 'run' (<function run at 0x7f1a14e9be60>)
^C[1]+  Exit 1                  
@TambourineMan42
Author

TambourineMan42 commented Jan 27, 2022

For context, the same thing runs perfectly on a standard gcloud VM, starting TPUs the old way (TensorFlow 2.7). However, I strongly prefer TPU VMs and would love to get it running there.

@stefan-it

stefan-it commented Feb 23, 2022

Hi @TambourineMan42, I'm now seeing the same error message on a TPU VM.

Were you able to solve the problem?

I was also able to run the pre-training on a "normal" TPU in combination with a VM, and I didn't get that strange error message there.

@stefan-it

@craffel do you have any idea what is going wrong here? I'm also using TF 2.8 on the TPU VM (the exact version that I've been using on the normal VM), and my (custom) SentencePiece model is stored in a GCP bucket.

@ronakice

@adarob I'm running into the same error message too.

@adarob
Collaborator

adarob commented Feb 23, 2022

@broken can you PTAL?

@broken

broken commented Feb 23, 2022

The error indicates that tensorflow-text is not installed on the TPU VMs, which the documentation appears to confirm. It is installed on the standard gcloud VMs.

I was actually discussing releases with that team recently, specifically about having the tf text package available. I'll reach out again to get more current info.

In the immediate term, I think you will need to create your own VM image with the tensorflow-text package installed.

edit: or can you just pip install tensorflow-text on the VM? That would be easiest.
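Before reinstalling anything, it may help to confirm whether the package is visible to the interpreter that runs training. The following is only a minimal sketch: the helper name is_installed is made up for illustration, and it relies solely on the standard library. As far as I understand, the custom ops (including SentencepieceOp) only get registered once tensorflow_text is actually imported, so a module that is found can still fail to load if it was built against a different TensorFlow.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it.

    Hypothetical helper for illustration; not part of t5 or mesh.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Run this with the same Python environment that launches t5_mesh_transformer.
    for name in ("tensorflow", "tensorflow_text"):
        status = "found" if is_installed(name) else "NOT found"
        print(f"{name}: {status}")
```

If tensorflow_text is reported as not found, the missing op is most likely just a missing package in that environment.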

@broken

broken commented Feb 23, 2022

It's odd, since tensorflow-text should already be installed by pip when t5 is. Can you check the versions of TF & TF-Text on the VM? Are they the same major & minor (i.e. tf 2.8.x & tf-text 2.8.x)?
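The version check can be scripted. This is a rough sketch, assuming Python 3.8+ (for importlib.metadata) and assuming the distributions are named tensorflow, tf-nightly and tensorflow-text on the VM; the helper names installed_version and same_major_minor are invented for this example.

```python
from importlib import metadata  # available in the standard library from Python 3.8
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a pip distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def same_major_minor(a: str, b: str) -> bool:
    """Compare only the first two dot-separated components of two versions."""
    return a.split(".")[:2] == b.split(".")[:2]


if __name__ == "__main__":
    # Assumption: TensorFlow is installed either as "tensorflow" or "tf-nightly".
    tf_version = installed_version("tensorflow") or installed_version("tf-nightly")
    text_version = installed_version("tensorflow-text")
    print("TF:", tf_version, "| TF-Text:", text_version)
    if tf_version and text_version:
        print("major.minor match:", same_major_minor(tf_version, text_version))
```

A matching major.minor is only a necessary condition here; it says nothing about whether both packages came from compatible builds.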

@stefan-it

stefan-it commented Feb 23, 2022

Hi @broken, unfortunately there are some caveats. I can't just install tensorflow_text, because that would require updating the already installed TensorFlow version (which is tf-nightly==2.7.0). Even when I specify tensorflow_text==2.7.0, pip would install e.g. tensorflow-2.7.1!

Updating TensorFlow results in an error: TensorFlow (then at 2.8) is no longer able to find my local TPU. I could verify it via this script: before the update, the script detects my TPU; after updating TF to 2.8, which tensorflow_text demands, it no longer does.

I tried to build tensorflow_text from source against the working TF 2.7 installation, but it fails with a strange error when compiling the whitespace tokenizer ops: macro "REGISTER_TF_OP_SHIM" requires 2 arguments, but only 1 given.

I'm currently using a v4-8 TPU VM with the v2-alpha-tpuv4 version.

@broken

broken commented Feb 23, 2022

Can you try pip install --no-deps tensorflow-text==2.7.3? This should install without reinstalling tensorflow.

@stefan-it

stefan-it commented Feb 24, 2022

I've tried it on a fresh instance:

stefan@t1v-n--w-0:~$ pip show tf-nightly
Name: tf-nightly
Version: 2.7.0
Summary: TensorFlow is an open source machine learning framework for everyone.
Home-page: https://www.tensorflow.org/
Author: Google Inc.
Author-email: packages@tensorflow.org
License: Apache 2.0
Location: /usr/local/lib/python3.8/dist-packages
Requires: absl-py, astunparse, flatbuffers, gast, google-pasta, grpcio, h5py, keras-nightly, keras-preprocessing, libclang, numpy, opt-einsum, protobuf, six, tb-nightly, termcolor, tf-estimator-nightly, typing-extensions, wheel, wrapt
Required-by: 
stefan@t1v-n--w-0:~$ pip install --no-deps tensorflow-text==2.7.3
Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow-text==2.7.3
  Downloading tensorflow_text-2.7.3-cp38-cp38-manylinux2010_x86_64.whl (4.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.9/4.9 MB 62.9 MB/s eta 0:00:00
Installing collected packages: tensorflow-text
Successfully installed tensorflow-text-2.7.3
stefan@t1v-n--w-0:~$ 
stefan@t1v-n--w-0:~$ 
stefan@t1v-n--w-0:~$ 
stefan@t1v-n--w-0:~$ python3
Python 3.8.10 (default, Jun  2 2021, 10:49:15) 
[GCC 9.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow_text
2022-02-24 00:14:28.722217: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stefan/.local/lib/python3.8/site-packages/tensorflow_text/__init__.py", line 20, in <module>
    from tensorflow_text.core.pybinds import tflite_registrar
ImportError: /home/stefan/.local/lib/python3.8/site-packages/tensorflow_text/core/pybinds/tflite_registrar.so: undefined symbol: _ZN4absl12lts_2021032420raw_logging_internal21internal_log_functionE

@broken

broken commented Feb 24, 2022

The undefined symbol errors are a result of using tensorflow-text against a version of tf that it wasn't built against. In this case, it was built against the stable version of tf, not nightly. The different build environments could create the symbol tables differently.
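To make the missing symbol easier to read, here is a small sketch that decodes it by hand. It handles only the simplest Itanium-style nested name (the _ZN...E form with length-prefixed components), and the function name demangle_nested_name is made up for this example; a real tool such as c++filt covers the full mangling scheme.

```python
def demangle_nested_name(symbol: str) -> str:
    """Decode a simple '_ZN<len><name>...E' mangled symbol into 'a::b::c'.

    Illustrative only: templates, arguments and qualifiers are not supported.
    """
    prefix = "_ZN"
    if not symbol.startswith(prefix):
        raise ValueError("not a simple nested name")
    i = len(prefix)
    parts = []
    while i < len(symbol) and symbol[i] != "E":
        # Read the decimal length prefix of the next component.
        j = i
        while j < len(symbol) and symbol[j].isdigit():
            j += 1
        if j == i:
            raise ValueError("expected a length prefix at position %d" % i)
        length = int(symbol[i:j])
        parts.append(symbol[j:j + length])
        i = j + length
    return "::".join(parts)


if __name__ == "__main__":
    missing = "_ZN4absl12lts_2021032420raw_logging_internal21internal_log_functionE"
    print(demangle_nested_name(missing))
```

The decoded name lives in Abseil's versioned namespace lts_20210324, which suggests the wheel expects a binary that exports that particular Abseil build, and the installed TF apparently does not.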

What I find odd is that your tf-nightly version is 2.7.0, but tf-nightly versions are generally of the form <version>.dev<date> (e.g. 2.8.0.dev20211222).
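As a quick illustration of that naming pattern, the sketch below tests whether a version string has the <version>.dev<date> shape. The regular expression is an assumption derived only from the example above (three numeric components followed by .dev and an eight-digit date), and looks_like_nightly is a hypothetical helper name.

```python
import re

# Assumed shape, based solely on the example "2.8.0.dev20211222".
_NIGHTLY_RE = re.compile(r"^\d+\.\d+\.\d+\.dev\d{8}$")


def looks_like_nightly(version: str) -> bool:
    """Return True if the version string matches <version>.dev<date>."""
    return _NIGHTLY_RE.match(version) is not None


if __name__ == "__main__":
    for v in ("2.8.0.dev20211222", "2.7.0"):
        print(v, "->", looks_like_nightly(v))
```

A tf-nightly package reporting plain 2.7.0 fails this check, which is what makes the installed build look unusual.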

Can you tell me what you get for tf.__git_version__?

import tensorflow as tf
print(tf.__git_version__)

@stefan-it

Unfortunately, it outputs unknown for the git version 🤔

>>> import tensorflow as tf
2022-02-24 08:35:58.067887: I tensorflow/core/tpu/tpu_api_dlsym_initializer.cc:116] Libtpu path is: libtpu.so
>>> print(tf.__git_version__)
unknown

@broken

broken commented Feb 28, 2022

Apparently the TF-2.8 TPU VM has tensorflow_text already installed. Can you use it or does it need to be the older image?

@fangia

fangia commented May 20, 2023

I don't know if this will be useful, but I had a similar problem using Kaggle's TPU VM. Iterating over a Keras dataset was throwing these errors (no errors on CPU). The dataset pipeline contained a map to a numpy function using tf.py_function. I fixed the errors by removing the @tf.function decorator from a sub-function of this function.

6 participants