ERROR:tensorflow: Failed to close session after error.Other threads may hang. #102

Closed
etetteh opened this issue Nov 13, 2020 · 5 comments

etetteh commented Nov 13, 2020

I am trying to pretrain my ELECTRA base model, but I keep getting this output:

Running training
================================================================================
2020-11-13 08:00:18.044763: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
Model is built!
2020-11-13 08:00:48.956655: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from infeed: From /job:worker/replica:0/task:0:
{{function_node __inference_tf_data_experimental_map_and_batch_<lambda>_69}} Key: segment_ids.  Can't parse serialized Example.
	 [[{{node ParseSingleExample/ParseSingleExample}}]]
	 [[input_pipeline_task0/while/IteratorGetNext]]
ERROR:tensorflow:Closing session due to error From /job:worker/replica:0/task:0:
{{function_node __inference_tf_data_experimental_map_and_batch_<lambda>_69}} Key: segment_ids.  Can't parse serialized Example.
	 [[{{node ParseSingleExample/ParseSingleExample}}]]
	 [[input_pipeline_task0/while/IteratorGetNext]]
2020-11-13 08:01:08.642776: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1605254468.642525410","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
2020-11-13 08:01:08.642779: W tensorflow/core/distributed_runtime/rpc/grpc_remote_master.cc:157] RPC failed with status = "Unavailable: Socket closed" and grpc_error_string = "{"created":"@1605254468.642549072","description":"Error received from peer","file":"external/grpc/src/core/lib/surface/call.cc","file_line":1039,"grpc_message":"Socket closed","grpc_status":14}", maybe retrying the RPC
ERROR:tensorflow:Error recorded from outfeed: Step was cancelled by an explicit call to `Session::Close()`.
ERROR:tensorflow:


Failed to close session after error.Other threads may hang.



2020-11-13 08:01:50.857700: W tensorflow/core/distributed_runtime/rpc/grpc_session.cc:370] GrpcSession::ListDevices will initialize the session with an empty graph and other defaults because the session has not yet been created.
ERROR:tensorflow:Error recorded from infeed: From /job:worker/replica:0/task:0:
{{function_node __inference_tf_data_experimental_map_and_batch_<lambda>_69}} Key: segment_ids.  Can't parse serialized Example.
	 [[{{node ParseSingleExample/ParseSingleExample}}]]
	 [[input_pipeline_task0/while/IteratorGetNext]]

briverse17 commented

I keep getting these errors, too.

I have tried installing and using other Python versions. I also changed the TensorFlow version to 1.15.dev20190909.
Neither of these solved the problem.

Waiting for a possible solution.


etetteh commented Nov 16, 2020

I have tried every solution I could find, and none of them works. I'm wondering how the authors trained their model, as I really need this to complete a time-bound project.


briverse17 commented Nov 17, 2020

Hi, @etetteh

I think I figured out the problem: the max_seq_length in the pretraining configuration must not exceed the max_seq_length used when building the tfrecords.

I built my tfrecords with max_seq_length = 128 (the default), so I cannot pretrain with max_seq_length = 256 or 512.

I tried setting max_seq_length = 128 and trained a small model. Things went smoothly!
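
To make the mismatch concrete, here is a minimal sketch in plain Python (no TensorFlow needed). It assumes the input pipeline expects every feature in a record to hold exactly max_seq_length values, which would explain the "Can't parse serialized Example" error for segment_ids. The helper name find_length_mismatches and the sample record are hypothetical, and the feature names other than segment_ids are my assumption. In practice the lists would come from a record read out of your own tfrecords.

```python
from typing import Dict, List


def find_length_mismatches(
    record: Dict[str, List[int]], max_seq_length: int
) -> Dict[str, int]:
    """Return {feature_name: actual_length} for every feature whose
    length differs from the configured max_seq_length."""
    return {
        name: len(values)
        for name, values in record.items()
        if len(values) != max_seq_length
    }


# Hypothetical record, shaped like one built with max_seq_length = 128.
# Only "segment_ids" appears in the log above; the other keys are assumed.
record = {
    "input_ids": [0] * 128,
    "input_mask": [0] * 128,
    "segment_ids": [0] * 128,
}

# Matching configuration: nothing to report.
print(find_length_mismatches(record, 128))

# Pretraining configured with 256: every feature is reported as too short.
print(find_length_mismatches(record, 256))
```

An empty result means the configured length agrees with the data. Anything else lists the features that a fixed-length parse would likely reject.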

Regards,


etetteh commented Nov 17, 2020

Great. I was about to comment that I fixed mine too. I had to change the same setting, plus fix some environment issues.

etetteh closed this as completed Nov 17, 2020
etetteh reopened this Nov 20, 2020

etetteh commented Nov 20, 2020

@briverse17 pretraining was successful, but I am getting this error during fine-tuning. Did you have a similar problem?
