
TensorFlow 2.0 with Colab TPU does not work. #1

Closed
huan opened this issue Sep 5, 2019 · 23 comments · Fixed by #4 or snowkylin/tensorflow-handbook#48
Labels
bug Something isn't working

Comments

@huan
Owner

huan commented Sep 5, 2019

TensorFlow 2.0 RC on Colab does not work. @yuefengz

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_host(resolver.master())
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
INFO:tensorflow:Initializing the TPU system: 10.127.143.138:8470
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-5-ccaaa18be0df> in <module>()
      9 resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
     10 tf.config.experimental_connect_to_host(resolver.master())
---> 11 tf.tpu.experimental.initialize_tpu_system(resolver)
     12 strategy = tf.distribute.experimental.TPUStrategy(resolver)
     13 

8 frames
/tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/eager/context.py in add_function(self, fn)
    987     """
    988     self.ensure_initialized()
--> 989     pywrap_tensorflow.TFE_ContextAddFunction(self._handle, fn)
    990 
    991   def add_function_def(self, fdef):

InvalidArgumentError: Unable to find a context_id matching the specified one (-6878417938495808013). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information:
{"created":"@1567666179.109428572","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find a context_id matching the specified one (-6878417938495808013). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
@huan huan added the bug Something isn't working label Sep 5, 2019
@rxsang

rxsang commented Sep 6, 2019

Hi Huan,

Could you try experimental_connect_to_cluster instead of experimental_connect_to_host? Basically

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

And what is the value of os.environ['COLAB_TPU_ADDR'] in your case?

@huan
Owner Author

huan commented Sep 6, 2019

Hi @rxsang,

Thank you for the suggestion!

I followed your suggestion and got another error:

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# tf.config.experimental_connect_to_host(resolver.master())
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
INFO:tensorflow:Initializing the TPU system: 10.88.139.146:8470
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-10-343b36a70d10> in <module>()
     10 tf.config.experimental_connect_to_cluster(resolver)
     11 # tf.config.experimental_connect_to_host(resolver.master())
---> 12 tf.tpu.experimental.initialize_tpu_system(resolver)
     13 strategy = tf.distribute.experimental.TPUStrategy(resolver)
     14 

8 frames
/tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/eager/context.py in add_function(self, fn)
    987     """
    988     self.ensure_initialized()
--> 989     pywrap_tensorflow.TFE_ContextAddFunction(self._handle, fn)
    990 
    991   def add_function_def(self, fdef):

InvalidArgumentError: Unable to find a context_id matching the specified one (-5340281180391244817). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information:
{"created":"@1567762110.807927262","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find a context_id matching the specified one (-5340281180391244817). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}

You can reproduce it by opening my Colab notebook here.

@JahJajaka

Hi guys, I have the same issue. I tried the experimental_connect_to_cluster suggestion and got the following error:

tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_address)
tf.config.experimental_connect_to_cluster(resolver.master())
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver) 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-9a78436af88c> in <module>()
      1 tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
      2 resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_address)
----> 3 tf.config.experimental_connect_to_cluster(resolver.master())
      4 tf.tpu.experimental.initialize_tpu_system(resolver)
      5 strategy = tf.distribute.experimental.TPUStrategy(resolver)

/tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/eager/remote.py in connect_to_cluster(cluster_spec_or_resolver, job_name, task_index, protocol)
    103   else:
    104     raise ValueError(
--> 105         "`cluster_spec_or_resolver` must be a `ClusterSpec` or a "
    106         "`ClusterResolver`.")
    107 

ValueError: `cluster_spec_or_resolver` must be a `ClusterSpec` or a `ClusterResolver`.

@rxsang

rxsang commented Sep 26, 2019

Hi,

Does the issue still exist? If so, could you provide instructions for how you set up the environment and test? What I suspect is that the tpu_address may not point to the actual TPU worker correctly.

@rxsang

rxsang commented Sep 27, 2019

Yuefeng pointed me to the example https://colab.sandbox.google.com/github/huan/tensorflow-handbook-tpu/blob/master/tensorflow-handbook-tpu-example.ipynb#scrollTo=03EV61RS5jyR, and I can reproduce this issue now.

It seems to be a version mismatch between the Colab client and the TPU worker. The client is built with 2.0.0-rc2, but I'm not sure about the TPU worker version. I'll ask around about how Colab releases their TPU workers.

@huan
Owner Author

huan commented Sep 27, 2019

It's great to know that you have reproduced this issue. I'm looking forward to using TensorFlow 2.0 with Colab TPUs, can't wait for that! :)

@huan
Owner Author

huan commented Nov 12, 2019

Update: the error message changed after 2.0 was released.

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
<ipython-input-7-79d308ea228d> in <module>()
      4   steps_per_epoch=60,
      5   validation_data=(x_test.astype(np.float32), y_test.astype(np.float32)),
----> 6   validation_freq=5
      7 )
      8 

13 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run AutoShardDataset: Unable to parse tensor proto
Additional GRPC error information:
{"created":"@1573550304.529691735","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to parse tensor proto","grpc_status":3} [Op:AutoShardDataset]

Related to:

Reproducible Colab: https://colab.research.google.com/github/huan/tensorflow-handbook-tpu/blob/master/tensorflow-handbook-tpu-example.ipynb

@huan huan changed the title InvalidArgumentError: Unable to find a context_id matching the specified one TensorFlow 2.0(2.1) with Colab TPU does not work. Nov 12, 2019
@huan huan changed the title TensorFlow 2.0(2.1) with Colab TPU does not work. TensorFlow 2.0 with Colab TPU does not work. Nov 12, 2019
@rxsang

rxsang commented Nov 12, 2019

Hi Huan,

Although TF 2.0 has been released, the Colab version hasn't been updated yet. We are working on the Colab update and will make sure it is ready with the 2.1 release (should be about 2 weeks from now).

@huan
Owner Author

huan commented Nov 13, 2019

@rxsang It's great to know that Colab with TPU will work with TF 2.1 in two weeks!

Thank you for the reply, and I'm looking forward to using them, Cheers!

@kpe

kpe commented Nov 23, 2019

It seems like tf.disable_eager_execution() helps work around the problem above (at least in my case) - see tensorflow/tensorflow#34391

@Duan-JM
Contributor

Duan-JM commented Nov 24, 2019

I have confirmed @kpe's solution to this issue: disabling eager execution in TF 2.0 by adding tf.compat.v1.disable_eager_execution() works in my case.
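
For reference, a minimal sketch of this workaround, assuming a Colab TPU runtime with TF 2.0; the TPU setup mirrors the snippets earlier in this thread:

import os
import tensorflow as tf

# Workaround: fall back to graph mode before any TPU initialization
# (this disables TF 2.x eager execution for the whole program).
tf.compat.v1.disable_eager_execution()

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)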

@huan
Owner Author

huan commented Mar 30, 2020

Today I finally fixed this bug, with Colab and TensorFlow 2.2.0-rc1.

The most important part is how to construct the tf.distribute strategy; there are lots of tricks, and many developers struggle with that.

In the end, I learned from Martin Gorner, who has a great Colab notebook that uses the TPU without any problem: 07_Keras_Flowers_TPU_xception_fine_tuned_best.ipynb

So I copied the tf.distribute strategy code into my notebook, ran it, and everything works like a charm!

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)

print("REPLICAS: ", strategy.num_replicas_in_sync)

@huan
Owner Author

huan commented Mar 31, 2020

I cannot believe that I missed PR #4 from @swghosh, who had already made Colab work with TF2 and TPU on Jan 3!

Thank you very much, Swarup Ghosh; I appreciate you sharing it!

@huan huan closed this as completed in #4 Mar 31, 2020
@Selimonder

Selimonder commented Apr 12, 2020

The issue persists with TF 2.2 (gcloud, v3-8 TPU, 8 cores, 30 GB RAM). I suspected resource limits, but it looks like the error is also related to tf.data.

Update 1: yes, it seems like the issue is tf.data.Dataset, though I don't know exactly why. Everything works well if you feed your data without using the tf.data class; a minimal sketch of that follows below.
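
For illustration, a minimal sketch of that workaround, feeding NumPy arrays to model.fit directly instead of a tf.data.Dataset; the toy model and synthetic data are made up for this example, and strategy is assumed to be a TPUStrategy created as in the earlier comments:

import numpy as np
import tensorflow as tf

# Synthetic data purely for illustration.
x_train = np.random.rand(1024, 32).astype(np.float32)
y_train = np.random.randint(0, 2, size=(1024, 1)).astype(np.float32)

with strategy.scope():  # strategy built as in the TPU snippets above
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

# Passing the arrays directly avoids building an explicit tf.data pipeline,
# which is what this comment reports as the trigger for the failure.
model.fit(x_train, y_train, batch_size=128, epochs=1)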

@sourcecode369

Faced this issue on Kaggle:

InvalidArgumentError: Unable to find a context_id matching the specified one (5888744478411942292). Perhaps the worker was restarted, or the context was GC'd?

@mobassir94

@sourcecode369 I am also having the same issue on Kaggle today. One day ago I trained a model using the TPU and everything worked fine. I changed nothing: the model is exactly the same and everything is as before, except that I switched the dataset. The dataset I used for training previously was large, and the one I am training on today is smaller. Apart from changing the dataset I changed nothing, yet one day ago it worked on the large train set and today it is not working. Please let me know if you find a solution for this issue.

@rxsang

rxsang commented Apr 29, 2020

Unable to find a context_id matching is a derived error; the root cause could be any of several things. Would you mind sharing more details and code for your job? One possible reason is that if you are using preemptible TPUs, your job may fail when the TPU gets preempted.

@mobassir94

@rxsang here is my code:

# TPU config

import os
import numpy as np
import tensorflow as tf
import transformers
from transformers import AutoTokenizer
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

AUTO = tf.data.experimental.AUTOTUNE

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)
print(strategy.num_replicas_in_sync)
#GCS_DS_PATH = KaggleDatasets().get_gcs_path('jigsaw-multilingual-toxic-comment-classification')
BATCH_SIZE = 16 * strategy.num_replicas_in_sync

MODEL = 'jplu/tf-xlm-roberta-large'
# First load the real tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL)

save_path = '/kaggle/working/xlmr_large/'
if not os.path.exists(save_path):
    os.makedirs(save_path)
tokenizer.save_pretrained(save_path)

mypath = '../input/data-creation-for-jigsaw/'
mypath2 = '../input/data-creation-test-and-valid-224/' 
import numpy as np


x_valid = np.load(mypath2 + 'x_valid.npy')
x_test = np.load(mypath2 + 'x_test.npy')
y_valid = np.load(mypath2 + 'y_valid.npy')

x_train1 =  np.load(mypath + 'x_train1.npy')
x_train2 =  np.load(mypath + 'x_train2.npy')
y_train1 = np.load(mypath + 'y_train1.npy')
y_train2 = np.load(mypath + 'y_train2.npy')

x_train1.shape # output : (793611, 224)

x_train = x_train1
y_train = y_train1

#train,val and test dataset

%%time
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)

def build_model(transformer, loss='binary_crossentropy', max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    x = tf.keras.layers.Dropout(0.05)(cls_token)
    out = Dense(1, activation='sigmoid')(x)
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=1e-4), loss=loss, metrics=[tf.keras.metrics.AUC()])
    
    return model

maxlen = 224

%%time
with strategy.scope():
    transformer_layer = transformers.TFXLMRobertaModel.from_pretrained(MODEL)
    model = build_model(transformer_layer,loss='binary_crossentropy', max_len=maxlen)

# training

%%time
N_STEPS = x_train.shape[0] // BATCH_SIZE
EPOCHS = 1
train_history = model.fit(
    train_dataset,
    steps_per_epoch=N_STEPS,
    validation_data=valid_dataset,
    callbacks=callback_list,  # callback_list is defined elsewhere in the original notebook (not shown)
    epochs=EPOCHS
)

While executing the code block above I get the error: InvalidArgumentError: Unable to find a context_id matching the specified one (1512067339570197782). Perhaps the worker was restarted, or the context was GC'd?

@jianse

jianse commented May 1, 2020

@mobassir94 When you use a larger dataset, the dataset.cache() method consumes all available memory on Kaggle. It might help to remove this line of code; see the sketch below.
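
For example, the validation pipeline from the snippet above with the .cache() call dropped (a sketch; x_valid, y_valid, BATCH_SIZE and AUTO are as defined there):

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    # .cache() removed: caching materializes the whole dataset in host memory,
    # which can exhaust RAM on a Kaggle kernel for larger datasets
    .prefetch(AUTO)
)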

@mobassir94

@jianse thank you for your kind reply. I reduced the train set size; it's now (408520, 250). In my code you can see I used .cache() on the validation set because the validation set is very small. But after reading your comment I deleted that line of code and ran the kernel, and unfortunately I am getting the same error again: InvalidArgumentError: Unable to find a context_id matching the specified one (4466620674281380355). Perhaps the worker was restarted, or the context was GC'd?

@zarif98sjs

@mobassir94 did you manage to solve it later?

@mobassir94

mobassir94 commented Aug 18, 2021

@zarif98sjs IIRC I had a problem in my dataset (I had both int and float in the labels):
Unable to find a context_id matching the specified one (1512067339570197782). Perhaps the worker was restarted, or the context was GC'd?

I can't remember this issue clearly since I faced it almost 2 years ago, but nowadays, using the latest TF version and TPUs on Kaggle, I don't face issues like this anymore!
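
For anyone hitting a similar mixed-dtype label problem, a minimal sketch of the kind of fix that would apply (the array names follow the earlier Kaggle snippet; treating this as the root cause is an assumption, not something verified in this thread):

import numpy as np

# Cast the labels to a single, consistent dtype before building the datasets,
# so tf.data.Dataset.from_tensor_slices sees one type for every element.
y_train = np.asarray(y_train, dtype=np.float32)
y_valid = np.asarray(y_valid, dtype=np.float32)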

@zarif98sjs

@mobassir94 solved it later! Removing the cache worked for me.
