This repository has been archived by the owner on Dec 29, 2022. It is now read-only.

Training fails after step 900 #112

Closed
AngusG opened this issue Mar 26, 2017 · 1 comment

AngusG commented Mar 26, 2017

train_stops_step_901.txt

I'm training the nmt_small.yml model on a cluster with 3 GeForce GTX Titan Black GPUs and TensorFlow 1.0.1, and training consistently fails around step 900. The failure appears to be triggered by the evaluation that runs between steps 900 and 1000.

INFO:tensorflow:Starting evaluation at 2017-03-26-16:21:57
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:03:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: GeForce GTX TITAN Black, pci bus id: 0000:83:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:2) -> (device: 2, name: GeForce GTX TITAN Black, pci bus id: 0000:84:00.0)
W tensorflow/core/framework/op_kernel.cc:993] Out of range: Reached limit of 1
	 [[Node: parallel_read/filenames/limit_epochs/CountUpTo = CountUpTo[T=DT_INT64, _class=["loc:@parallel_read/filenames/limit_epochs/epochs"], limit=1, _device="/job:localhost/replica:0/task:0/cpu:0"](parallel_read/filenames/limit_epochs/epochs)]]
W tensorflow/core/framework/op_kernel.cc:993] Out of range: Reached limit of 1
	 [[Node: parallel_read_1/filenames/limit_epochs/CountUpTo = CountUpTo[T=DT_INT64, _class=["loc:@parallel_read_1/filenames/limit_epochs/epochs"], limit=1, _device="/job:localhost/replica:0/task:0/cpu:0"](parallel_read_1/filenames/limit_epochs/epochs)]]
W tensorflow/core/framework/op_kernel.cc:993] Invalid argument: Tried to read from index 46 but array size is: 46

I am still debugging, but wanted to report this in case it's an easy fix for someone more familiar with the code. A more complete error log is attached (train_stops_step_901.txt) in case it's helpful.
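
For anyone skimming the log, the last warning is a read one position past the end of the array: with a size of 46, the valid indices are 0 through 45. Below is a minimal sketch that pulls those two numbers out of the message and reports the overshoot. The helper name `parse_read_error` is hypothetical (it is not part of seq2seq or TensorFlow), and the sketch only restates what the log says; it does not explain why the read went out of bounds.

```python
import re

# Hypothetical helper, not part of seq2seq or TensorFlow: it only parses
# the "Tried to read from index I but array size is: N" log message.
PATTERN = re.compile(r"Tried to read from index (\d+) but array size is: (\d+)")


def parse_read_error(line):
    """Return (index, size, overshoot) for a matching log line, else None.

    overshoot is how far the requested index lies past the last valid
    index (size - 1), so a value of 1 means "one past the end".
    """
    match = PATTERN.search(line)
    if match is None:
        return None
    index, size = int(match.group(1)), int(match.group(2))
    return index, size, index - (size - 1)


# The warning line copied from the log above.
line = ("W tensorflow/core/framework/op_kernel.cc:993] Invalid argument: "
        "Tried to read from index 46 but array size is: 46")

index, size, overshoot = parse_read_error(line)
print("requested index:", index)
print("last valid index:", size - 1)
print("overshoot:", overshoot)
```

Running the same helper over the full attached log would show whether every failing read overshoots by the same amount.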

AngusG commented Mar 26, 2017

Duplicate of #103

AngusG closed this as completed Mar 26, 2017