
TensorFlow 2.0 with Colab TPU does not work. #1

Closed
huan opened this issue Sep 5, 2019 · 23 comments · Fixed by #4 or snowkylin/tensorflow-handbook#48
Labels
bug Something isn't working

Comments

@huan
Owner

huan commented Sep 5, 2019

TensorFlow 2.0 RC on Colab does not work. @yuefengz

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_host(resolver.master())
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
INFO:tensorflow:Initializing the TPU system: 10.127.143.138:8470
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-5-ccaaa18be0df> in <module>()
      9 resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
     10 tf.config.experimental_connect_to_host(resolver.master())
---> 11 tf.tpu.experimental.initialize_tpu_system(resolver)
     12 strategy = tf.distribute.experimental.TPUStrategy(resolver)
     13 

8 frames
/tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/eager/context.py in add_function(self, fn)
    987     """
    988     self.ensure_initialized()
--> 989     pywrap_tensorflow.TFE_ContextAddFunction(self._handle, fn)
    990 
    991   def add_function_def(self, fdef):

InvalidArgumentError: Unable to find a context_id matching the specified one (-6878417938495808013). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information:
{"created":"@1567666179.109428572","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find a context_id matching the specified one (-6878417938495808013). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}
@huan huan added the bug Something isn't working label Sep 5, 2019
@rxsang

rxsang commented Sep 6, 2019

Hi Huan,

Could you try experimental_connect_to_cluster instead of experimental_connect_to_host? Basically

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)

And what is the value of os.environ['COLAB_TPU_ADDR'] in your case?

@huan
Owner Author

huan commented Sep 6, 2019

Hi @rxsang,

Thank you for the suggestion!

I followed your suggestion and got another error:

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# tf.config.experimental_connect_to_host(resolver.master())
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)
INFO:tensorflow:Initializing the TPU system: 10.88.139.146:8470
---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-10-343b36a70d10> in <module>()
     10 tf.config.experimental_connect_to_cluster(resolver)
     11 # tf.config.experimental_connect_to_host(resolver.master())
---> 12 tf.tpu.experimental.initialize_tpu_system(resolver)
     13 strategy = tf.distribute.experimental.TPUStrategy(resolver)
     14 

8 frames
/tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/eager/context.py in add_function(self, fn)
    987     """
    988     self.ensure_initialized()
--> 989     pywrap_tensorflow.TFE_ContextAddFunction(self._handle, fn)
    990 
    991   def add_function_def(self, fdef):

InvalidArgumentError: Unable to find a context_id matching the specified one (-5340281180391244817). Perhaps the worker was restarted, or the context was GC'd?
Additional GRPC error information:
{"created":"@1567762110.807927262","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to find a context_id matching the specified one (-5340281180391244817). Perhaps the worker was restarted, or the context was GC'd?","grpc_status":3}

You can reproduce it by opening my Colab notebook here.

@JahJajaka

Hi guys, I have the same issue. I tried the experimental_connect_to_cluster suggestion and got the following error:

tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_address)
tf.config.experimental_connect_to_cluster(resolver.master())
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver) 
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-9a78436af88c> in <module>()
      1 tpu_address = 'grpc://' + os.environ['COLAB_TPU_ADDR']
      2 resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_address)
----> 3 tf.config.experimental_connect_to_cluster(resolver.master())
      4 tf.tpu.experimental.initialize_tpu_system(resolver)
      5 strategy = tf.distribute.experimental.TPUStrategy(resolver)

/tensorflow-2.0.0-rc0/python3.6/tensorflow_core/python/eager/remote.py in connect_to_cluster(cluster_spec_or_resolver, job_name, task_index, protocol)
    103   else:
    104     raise ValueError(
--> 105         "`cluster_spec_or_resolver` must be a `ClusterSpec` or a "
    106         "`ClusterResolver`.")
    107 

ValueError: `cluster_spec_or_resolver` must be a `ClusterSpec` or a `ClusterResolver`.

@rxsang

rxsang commented Sep 26, 2019

Hi,

Does the issue still exist? If so, could you provide instructions for how you set up the environment and test? What I suspect is that the tpu_address may not point to the actual TPU worker correctly.

@rxsang

rxsang commented Sep 27, 2019

Yuefeng pointed me to the example https://colab.sandbox.google.com/github/huan/tensorflow-handbook-tpu/blob/master/tensorflow-handbook-tpu-example.ipynb#scrollTo=03EV61RS5jyR, and I can reproduce this issue now.

It seems to be a version mismatch between the Colab client and the TPU worker. The client is built with 2.0.0-rc2, but I'm not sure about the TPU worker version. I'll ask around about how Colab releases their TPU workers.

@huan
Owner Author

huan commented Sep 27, 2019

It's great to know that you have reproduced this issue. I'm looking forward to using TensorFlow 2.0 with Colab TPUs, can't wait for that! :)

@huan
Owner Author

huan commented Nov 12, 2019

Update: the error message changed after 2.0 was released.

---------------------------------------------------------------------------
InternalError                             Traceback (most recent call last)
<ipython-input-7-79d308ea228d> in <module>()
      4   steps_per_epoch=60,
      5   validation_data=(x_test.astype(np.float32), y_test.astype(np.float32)),
----> 6   validation_freq=5
      7 )
      8 

13 frames
/usr/local/lib/python3.6/dist-packages/six.py in raise_from(value, from_value)

InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:worker/replica:0/task:0/device:CPU:0 in order to run AutoShardDataset: Unable to parse tensor proto
Additional GRPC error information:
{"created":"@1573550304.529691735","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Unable to parse tensor proto","grpc_status":3} [Op:AutoShardDataset]

Related to:

Reproducible Colab: https://colab.research.google.com/github/huan/tensorflow-handbook-tpu/blob/master/tensorflow-handbook-tpu-example.ipynb

@huan huan changed the title InvalidArgumentError: Unable to find a context_id matching the specified one TensorFlow 2.0(2.1) with Colab TPU does not work. Nov 12, 2019
@huan huan changed the title TensorFlow 2.0(2.1) with Colab TPU does not work. TensorFlow 2.0 with Colab TPU does not work. Nov 12, 2019
@rxsang

rxsang commented Nov 12, 2019

Hi Huan,

Although TF 2.0 has been released, the Colab version hasn't been updated yet. We are working on the Colab update and will make sure it is ready with the 2.1 release (should be about 2 weeks from now).

@huan
Owner Author

huan commented Nov 13, 2019

@rxsang It's great to know that Colab with TPU will work with TF 2.1 in two weeks!

Thank you for the reply, and I'm looking forward to using them, Cheers!

@kpe

kpe commented Nov 23, 2019

It seems like tf.disable_eager_execution() helps work around the problem above (at least in my case) - see tensorflow/tensorflow#34391

@Duan-JM
Contributor

Duan-JM commented Nov 24, 2019

I have confirmed @kpe's solution to this issue: disabling eager execution in TF 2.0 by adding tf.compat.v1.disable_eager_execution() works in my case.
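
For reference, a minimal sketch of this workaround, assuming a Colab TPU runtime with TF 2.0; the TPU setup mirrors the snippets earlier in this thread:

import os
import tensorflow as tf

# Workaround: fall back to graph mode before any TPU initialization
# (this disables TF 2.x eager execution for the whole program).
tf.compat.v1.disable_eager_execution()

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(
    tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.experimental.TPUStrategy(resolver)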

@huan
Owner Author

huan commented Mar 30, 2020

Today I finally fixed this bug, with Colab and TensorFlow 2.2.0-rc1.

The most important part is how to construct the tf.distribute strategy; there are lots of tricks, and many developers struggle with that.

In the end, I learned from Martin Gorner, who has a great Colab notebook that uses the TPU without any problem: 07_Keras_Flowers_TPU_xception_fine_tuned_best.ipynb

So I copied the tf.distribute strategy code into my notebook, ran it, and everything works like a charm!

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()  # TPU detection
print('Running on TPU ', tpu.cluster_spec().as_dict()['worker'])

tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)

print("REPLICAS: ", strategy.num_replicas_in_sync)

@huan
Owner Author

huan commented Mar 31, 2020

I cannot believe that I missed PR #4 from @swghosh, who had already made Colab work with TF2 and TPU on Jan 3!

Thank you very much, Swarup Ghosh; I appreciate you sharing it!

@huan huan closed this as completed in #4 Mar 31, 2020
@Selimonder

Selimonder commented Apr 12, 2020

The issue persists with TF 2.2 (gcloud, v3-8 TPU, 8 cores, 30 GB RAM). I suspected resource limits, but it looks like the error is also related to tf.data.

Update 1: yes, it seems like the issue is tf.data.Dataset, though I don't know exactly why. Everything works well if you feed your data without using the tf.data class; a minimal sketch of that follows below.
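
For illustration, a minimal sketch of that workaround, feeding NumPy arrays to model.fit directly instead of a tf.data.Dataset; the toy model and synthetic data are made up for this example, and strategy is assumed to be a TPUStrategy created as in the earlier comments:

import numpy as np
import tensorflow as tf

# Synthetic data purely for illustration.
x_train = np.random.rand(1024, 32).astype(np.float32)
y_train = np.random.randint(0, 2, size=(1024, 1)).astype(np.float32)

with strategy.scope():  # strategy built as in the TPU snippets above
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_shape=(32,)),
        tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer='adam', loss='binary_crossentropy')

# Passing the arrays directly avoids building an explicit tf.data pipeline,
# which is what this comment reports as the trigger for the failure.
model.fit(x_train, y_train, batch_size=128, epochs=1)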

@sourcecode369

Faced this issue on Kaggle:

InvalidArgumentError: Unable to find a context_id matching the specified one (5888744478411942292). Perhaps the worker was restarted, or the context was GC'd?

@mobassir94

@sourcecode369 I am also having the same issue on Kaggle today. One day ago I trained a model using the TPU and everything worked fine. I changed nothing: the model is exactly the same and everything is as before, except that I switched the dataset. The dataset I used for training previously was large, and the one I am training on today is smaller. Apart from changing the dataset I changed nothing, yet one day ago it worked on the large train set and today it is not working. Please let me know if you find a solution for this issue.

@rxsang

rxsang commented Apr 29, 2020

Unable to find a context_id matching is a derived error; the root cause could be any of several things. Would you mind sharing more details and code for your job? One possible reason is that if you are using preemptible TPUs, your job may fail when the TPU gets preempted.

@mobassir94

@rxsang here is my code:

# TPU config

import os
import numpy as np
import tensorflow as tf
import transformers
from transformers import AutoTokenizer
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

AUTO = tf.data.experimental.AUTOTUNE

tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(tpu)
tf.tpu.experimental.initialize_tpu_system(tpu)
strategy = tf.distribute.experimental.TPUStrategy(tpu)
print(strategy.num_replicas_in_sync)
#GCS_DS_PATH = KaggleDatasets().get_gcs_path('jigsaw-multilingual-toxic-comment-classification')
BATCH_SIZE = 16 * strategy.num_replicas_in_sync

MODEL = 'jplu/tf-xlm-roberta-large'
# First load the real tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL)

save_path = '/kaggle/working/xlmr_large/'
if not os.path.exists(save_path):
    os.makedirs(save_path)
tokenizer.save_pretrained(save_path)

mypath = '../input/data-creation-for-jigsaw/'
mypath2 = '../input/data-creation-test-and-valid-224/' 
import numpy as np


x_valid = np.load(mypath2 + 'x_valid.npy')
x_test = np.load(mypath2 + 'x_test.npy')
y_valid = np.load(mypath2 + 'y_valid.npy')

x_train1 =  np.load(mypath + 'x_train1.npy')
x_train2 =  np.load(mypath + 'x_train2.npy')
y_train1 = np.load(mypath + 'y_train1.npy')
y_train2 = np.load(mypath + 'y_train2.npy')

x_train1.shape # output : (793611, 224)

x_train = x_train1
y_train = y_train1

#train,val and test dataset

%%time
train_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_train, y_train))
    .repeat()
    .shuffle(2048)
    .batch(BATCH_SIZE)
    .prefetch(AUTO)
)

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    .cache()
    .prefetch(AUTO)
)

test_dataset = (
    tf.data.Dataset
    .from_tensor_slices(x_test)
    .batch(BATCH_SIZE)
)

def build_model(transformer, loss='binary_crossentropy', max_len=512):
    input_word_ids = Input(shape=(max_len,), dtype=tf.int32, name="input_word_ids")
    sequence_output = transformer(input_word_ids)[0]
    cls_token = sequence_output[:, 0, :]
    x = tf.keras.layers.Dropout(0.05)(cls_token)
    out = Dense(1, activation='sigmoid')(x)
    
    model = Model(inputs=input_word_ids, outputs=out)
    model.compile(Adam(lr=1e-4), loss=loss, metrics=[tf.keras.metrics.AUC()])
    
    return model

maxlen = 224

%%time
with strategy.scope():
    transformer_layer = transformers.TFXLMRobertaModel.from_pretrained(MODEL)
    model = build_model(transformer_layer,loss='binary_crossentropy', max_len=maxlen)

# training

%%time
N_STEPS = x_train.shape[0] // BATCH_SIZE
EPOCHS = 1
train_history = model.fit(
    train_dataset,
    steps_per_epoch=N_STEPS,
    validation_data=valid_dataset,
    callbacks=callback_list,  # callback_list is defined elsewhere in the original notebook (not shown)
    epochs=EPOCHS
)

While executing the code block above I get the error: InvalidArgumentError: Unable to find a context_id matching the specified one (1512067339570197782). Perhaps the worker was restarted, or the context was GC'd?

@jianse

jianse commented May 1, 2020

@mobassir94 When you use a larger dataset, the dataset.cache() method consumes all available memory on Kaggle. It might help to remove this line of code; see the sketch below.
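
For example, the validation pipeline from the snippet above with the .cache() call dropped (a sketch; x_valid, y_valid, BATCH_SIZE and AUTO are as defined there):

valid_dataset = (
    tf.data.Dataset
    .from_tensor_slices((x_valid, y_valid))
    .batch(BATCH_SIZE)
    # .cache() removed: caching materializes the whole dataset in host memory,
    # which can exhaust RAM on a Kaggle kernel for larger datasets
    .prefetch(AUTO)
)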

@mobassir94

@jianse thank you for your kind reply. I reduced the train set size; it's now (408520, 250). In my code you can see I used .cache() on the validation set because the validation set is very small. But after reading your comment I deleted that line of code and ran the kernel, and unfortunately I am getting the same error again: InvalidArgumentError: Unable to find a context_id matching the specified one (4466620674281380355). Perhaps the worker was restarted, or the context was GC'd?

@zarif98sjs

@mobassir94 did you manage to solve it later?

@mobassir94

mobassir94 commented Aug 18, 2021

@zarif98sjs IIRC I had a problem in my dataset (I had both int and float in the labels):
Unable to find a context_id matching the specified one (1512067339570197782). Perhaps the worker was restarted, or the context was GC'd?

I can't remember this issue clearly since I faced it almost 2 years ago, but nowadays, using the latest TF version and TPUs on Kaggle, I don't face issues like this anymore!
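
For anyone hitting a similar mixed-dtype label problem, a minimal sketch of the kind of fix that would apply (the array names follow the earlier Kaggle snippet; treating this as the root cause is an assumption, not something verified in this thread):

import numpy as np

# Cast the labels to a single, consistent dtype before building the datasets,
# so tf.data.Dataset.from_tensor_slices sees one type for every element.
y_train = np.asarray(y_train, dtype=np.float32)
y_valid = np.asarray(y_valid, dtype=np.float32)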

@zarif98sjs

@mobassir94 solved it later! Removing the cache worked for me.
