Traceback (most recent call last):
  File "~/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 181, in run
    new_deps = self._run_get_new_deps()
  File "~/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 119, in _run_get_new_deps
    task_gen = self.task.run()
  File "segway_workflow.py", line 141, in run
    segway_run.main(segway_cmd)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3787, in main
    return runner()
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3552, in __call__
    self.run(*args, **kwargs)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3519, in run
    self.run_train()
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 2947, in run_train
    instance_params = run_train_func(self.num_segs_range)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3030, in run_train_multithread
    to find the winning instance anyway.""" % thread.instance_index)
AttributeError: Training instance 0 failed. See previously printed error for reason.
Final params file will not be written. Rerun the instance or use segway-winner
to find the winning instance anyway.
EM training error
The source of the error seems to be the job: emt0.19.1233.train.637ed75e7b0f11e8975fbd311cda90ee
The job's output file is empty instead of containing the expected line: ____ PROGRAM ENDED SUCCESSFULLY WITH STATUS___
Content of the job's error output:
Traceback (most recent call last):
  File "~/anaconda2/bin/segway-task", line 6, in <module>
    sys.exit(segway.task.main())
  File "~/anaconda2/lib/python2.7/site-packages/segway/task.py", line 592, in main
    return task(*args)
  File "~/anaconda2/lib/python2.7/site-packages/segway/task.py", line 582, in task
    outfilename, *args)
  File "~/anaconda2/lib/python2.7/site-packages/segway/task.py", line 476, in run_train
    supervision_data=supervision_cells)
  File "~/anaconda2/lib/python2.7/site-packages/segway/observations.py", line 479, in _save_window
    int_data.tofile(int_filename)
IOError: [Errno 116] Stale file handle
This job shows two unexpected behaviors:
Jobs that raise this error are usually resubmitted. This one was not.
The job is not indexed in jobs.tab.
This most likely stems from a configuration issue on the cluster rather than from Segway's code, and there is probably nothing we can do about it by changing Segway.
Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).
I believe there may be a race condition here in the way observation files are managed in minibatch mode. Currently, observation filenames are not instance-specific. It seems possible, for example, that two instances could simultaneously be attempting to delete or open the same observation file for reading/writing.
Section A.10 of the NFS FAQ also mentions ESTALE errors being reported when a reference is made to an item that may have been deleted. The tofile call is a convenience wrapper for an open followed by a write, so the file could have been deleted by another instance, and its handle gone stale, by the time the write happens. This could perhaps also happen on a non-NFS filesystem, but it seems far less likely there than on NFS.
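For reference, a minimal sketch of how a stale handle could be told apart from other I/O errors in Python, by comparing the exception's errno against errno.ESTALE (reported as 116 in the traceback above). The helper name is_stale_handle is hypothetical and not part of Segway.

```python
import errno


def is_stale_handle(exc):
    """Return True if exc is an I/O error whose errno is ESTALE.

    Hypothetical helper for illustration; not part of Segway.
    """
    return isinstance(exc, (IOError, OSError)) and exc.errno == errno.ESTALE


# Simulate the failure from the job's error output and classify it.
try:
    raise IOError(errno.ESTALE, "Stale file handle")
except IOError as exc:
    simulated_is_stale = is_stale_handle(exc)
```

A check like this only identifies the symptom. It does nothing about the underlying race between instances.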
Don't set TMPDIR to a networked file system. This will be documented as a requirement. Since we only run on Linux, segway-task could emit a warning if stat --file-system --format=%T "$TMPDIR" reports nfs. But the user would only see that warning if they looked at the job error files. It might not be such a bad idea for the console to print any errors that occur as it marks a job complete, though this is all a bit complex.
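A rough sketch of what such a warning could look like in Python, assuming GNU stat is available on the node. The function names and the run and stream parameters are made up for illustration (they make the check easy to exercise without an NFS mount), and this is not how segway-task currently behaves.

```python
import subprocess
import sys
import tempfile


def filesystem_type(path, run=subprocess.check_output):
    """Return the filesystem type name printed by GNU stat, or None on failure."""
    try:
        output = run(["stat", "--file-system", "--format=%T", path])
    except (OSError, subprocess.CalledProcessError):
        # stat is missing, or it rejected the path or the options
        return None
    return output.decode().strip()


def warn_if_tmpdir_is_nfs(path=None, run=subprocess.check_output,
                          stream=sys.stderr):
    """Write a warning to stream and return True if the temp directory is on NFS."""
    if path is None:
        # tempfile.gettempdir() honors the TMPDIR environment variable
        path = tempfile.gettempdir()
    if filesystem_type(path, run) == "nfs":
        stream.write("warning: temporary directory %s is on NFS; "
                     "set TMPDIR to a local file system\n" % path)
        return True
    return False
```

Calling warn_if_tmpdir_is_nfs() with no arguments checks the directory that tempfile would use. As noted above, the message would still land in the job's error file unless the console also reports it.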
Eric will introduce a fix that avoids this problem even when TMPDIR is on a networked file system, by ensuring observation filenames are unique.
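To illustrate the idea only (this is not the actual fix), instance-specific observation filenames might be built along these lines. The function name, the filename layout, and the "int" extension are assumptions made for the example.

```python
import os
import uuid


def instance_observation_path(obs_dir, instance_index, window_index, ext="int"):
    """Build an observation filename that no other instance will reuse.

    Hypothetical naming scheme: the instance index separates training
    instances, and a random UUID separates repeated runs of the same instance.
    """
    unique = uuid.uuid4().hex
    name = "inst%d.window%d.%s.%s" % (instance_index, window_index, unique, ext)
    return os.path.join(obs_dir, name)
```

With a scheme like this, two instances working on the same window never open or delete the same path, so one instance cannot invalidate a handle held by another.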
Original report (BitBucket issue) by Mickaël Mendez (Bitbucket: Mickael Mendez).
While running Segway 2.0.2 in reverse mode I ran into a Stale file handle error. Below are the logs and the command of the job that failed.