Training fails after step 900 #112

AngusG · 2017-03-26T16:35:10Z

I'm running the nmt_small.yml model on a cluster with 3 Titan Black GPUs and TF 1.0.1, and training consistently fails around step 900. The failure appears to be tied to the evaluation that runs between steps 900 and 1000.

INFO:tensorflow:Starting evaluation at 2017-03-26-16:21:57
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:03:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX TITAN Black, pci bus id: 0000:83:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX TITAN Black, pci bus id: 0000:84:00.0)
W tensorflow/core/framework/op_kernel.cc:993] Out of range: Reached limit of 1
	 [[Node: parallel_read/filenames/limit_epochs/CountUpTo = CountUpTo[T=DT_INT64, _class=["loc:@parallel_read/filenames/limit_epochs/epochs"], limit=1, _device="/job:localhost/replica:0/task:0/cpu:0"](parallel_read/filenames/limit_epochs/epochs)]]
W tensorflow/core/framework/op_kernel.cc:993] Out of range: Reached limit of 1
	 [[Node: parallel_read_1/filenames/limit_epochs/CountUpTo = CountUpTo[T=DT_INT64, _class=["loc:@parallel_read_1/filenames/limit_epochs/epochs"], limit=1, _device="/job:localhost/replica:0/task:0/cpu:0"](parallel_read_1/filenames/limit_epochs/epochs)]]
W tensorflow/core/framework/op_kernel.cc:993] Invalid argument: Tried to read from index 46 but array size is: 46

I am debugging but wanted to report in case it's an easy fix for someone more familiar with the code. I have attached a more complete error log in case it's helpful.
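
For anyone reading the last warning: an array of size 46 has valid indices 0 through 45, so a read at index 46 is one past the end. The snippet below is only a plain-Python sketch of that bounds check, not the TensorFlow code path; `read_at` and `outputs` are hypothetical names used for illustration, and it says nothing about why the evaluation run requests that index.

```python
def read_at(array, index):
    """Bounds-checked read that raises with a message shaped like the log line."""
    if not 0 <= index < len(array):
        raise IndexError(
            "Tried to read from index %d but array size is: %d"
            % (index, len(array))
        )
    return array[index]


# Hypothetical stand-in for an array holding 46 elements.
outputs = list(range(46))

# Index 45 is the last valid slot.
last = read_at(outputs, len(outputs) - 1)

# Index 46 equals the array size, so the read is rejected.
try:
    read_at(outputs, len(outputs))
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

A read at `len(array)` rather than `len(array) - 1` usually points to a loop bound or length value that is off by one, which may be a useful place to start looking.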

AngusG · 2017-03-26T19:13:08Z

Duplicate of #103

AngusG closed this as completed Mar 26, 2017
