Stale file handle error #128

Open
EricR86 opened this issue Jun 29, 2018 · 3 comments
Labels
bug (Something isn't working), major

Comments


EricR86 commented Jun 29, 2018

Original report (BitBucket issue) by Mickaël Mendez (Bitbucket: Mickael Mendez).


While running Segway 2.0.2 in reverse mode, I ran into a Stale file handle error. Below are the logs and the command of the job that failed.

segway command

segway \
    --num-labels=10 \
    --resolution=2 \
    --ruler-scale=2 \
    --num-instances=10 \
    --reverse-world=1 \
    --max-train-rounds=50 \
    --seg-table=seg_table.tab \
    --minibatch-fraction=0.01 \
    --tracks-from=tracks.csv \
    --mem-usage=2,4,8,16 \
    train ...

Segway output

Traceback (most recent call last):
  File "~/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 181, in run
    new_deps = self._run_get_new_deps()
  File "~/anaconda2/lib/python2.7/site-packages/luigi/worker.py", line 119, in _run_get_new_deps
    task_gen = self.task.run()
  File "segway_workflow.py", line 141, in run
    segway_run.main(segway_cmd)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3787, in main
    return runner()
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3552, in __call__
    self.run(*args, **kwargs)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3519, in run
    self.run_train()
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 2947, in run_train
    instance_params = run_train_func(self.num_segs_range)
  File "~/anaconda2/lib/python2.7/site-packages/segway/run.py", line 3030, in run_train_multithread
    to find the winning instance anyway.""" % thread.instance_index)
AttributeError: Training instance 0 failed. See previously printed error for reason.
Final params file will not be written. Rerun the instance or use segway-winner
to find the winning instance anyway.

EM training error

The source of the error seems to come from the job: emt0.19.1233.train.637ed75e7b0f11e8975fbd311cda90ee

  • The job's output file is empty instead of containing the expected: ____ PROGRAM ENDED SUCCESSFULLY WITH STATUS___
  • Content of the job's error output:
Traceback (most recent call last):
  File "~/anaconda2/bin/segway-task", line 6, in <module>
    sys.exit(segway.task.main())
  File "~/anaconda2/lib/python2.7/site-packages/segway/task.py", line 592, in main
    return task(*args)
  File "~/anaconda2/lib/python2.7/site-packages/segway/task.py", line 582, in task
    outfilename, *args)
  File "~/anaconda2/lib/python2.7/site-packages/segway/task.py", line 476, in run_train
    supervision_data=supervision_cells)
  File "~/anaconda2/lib/python2.7/site-packages/segway/observations.py", line 479, in _save_window
    int_data.tofile(int_filename)
IOError: [Errno 116] Stale file handle

This job has two unexpected behaviors:

  • Usually, jobs raising this error are resubmitted; this one was not
  • The job is not indexed in jobs.tab

EricR86 commented Jul 1, 2018

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


This is likely to have to do with configuration issues on the cluster and not anything to do with Segway's programming. Nor is there likely anything we can really do about this by changing Segway.


EricR86 commented Jul 3, 2018

Original comment by Eric Roberts (Bitbucket: ericr86, GitHub: ericr86).


I believe there may be a race condition here in the way observation files are managed in minibatch mode. Currently, observation filenames are not instance-specific. It seems possible, for example, that two instances could simultaneously be attempting to delete or open the same observation file for reading/writing.

Section A.10 of the NFS FAQ also mentions ESTALE errors being reported when referring to items that may have been deleted. The tofile call is a convenience wrapper for an open/write operation, and I'd imagine it's possible for the file to have been deleted by another instance and gone stale by the time the write happens. This could conceivably happen on a non-NFS filesystem as well, but it seems far more likely on NFS.
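
A minimal sketch of the mitigation described above, assuming hypothetical helper names (make_window_filename, save_window) rather than Segway's actual observations code: each window filename includes the training instance index, so two instances can never operate on the same path. Writing through a temporary file and an atomic rename is an extra precaution, not something proposed in this thread, to keep a concurrent delete from catching a half-written file.

# Hypothetical sketch only; names are illustrative, not Segway's API.
import os
from tempfile import NamedTemporaryFile

import numpy as np


def make_window_filename(dirname, window_index, instance_index):
    # Including the instance index keeps concurrent training instances
    # from colliding on the same observation file.
    return os.path.join(dirname, "window.%d.instance.%d.int"
                        % (window_index, instance_index))


def save_window(int_data, dirname, window_index, instance_index):
    filename = make_window_filename(dirname, window_index, instance_index)

    # Write to a temporary file in the same directory, then rename it into
    # place. os.rename() is atomic on POSIX, so no reader or deleter ever
    # sees a partially written file.
    with NamedTemporaryFile(dir=dirname, delete=False) as tmpfile:
        np.asarray(int_data).tofile(tmpfile)
        tmpname = tmpfile.name
    os.rename(tmpname, filename)

    return filename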


EricR86 commented Jul 7, 2018

Original comment by Michael Hoffman (Bitbucket: hoffman, GitHub: michaelmhoffman).


  1. Don't set TMPDIR to a networked file system. This will be documented as a requirement. Since we only run on Linux, we could have segway-task spit out a warning if stat --file-system --format=%T "$TMPDIR" is nfs (see the sketch after this list). But the user would only see that if they looked at the error files. Maybe it wouldn't be such a horrible idea for the console to print any errors that occur as it marks a job complete. This is all a bit complex, though.
  2. Eric will introduce a fix to this problem even if that happens, by ensuring filenames are unique.
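
As an illustration of the warning floated in item 1, here is a hypothetical check (not something segway-task currently does; the function name is made up) that shells out to GNU coreutils stat to look up the filesystem type of $TMPDIR and warns on standard error when it is nfs:

# Hypothetical sketch; segway-task does not currently contain this check.
from __future__ import print_function

import os
import subprocess
import sys


def warn_if_tmpdir_on_nfs():
    tmpdir = os.environ.get("TMPDIR", "/tmp")

    # GNU coreutils stat prints the filesystem type in human-readable form,
    # e.g. "nfs", "tmpfs", "ext2/ext3". Linux-only, as noted above.
    fs_type = subprocess.check_output(
        ["stat", "--file-system", "--format=%T", tmpdir]).strip()

    if fs_type.decode() == "nfs":
        print("segway-task: warning: TMPDIR (%s) is on NFS; "
              "stale file handle errors are possible" % tmpdir,
              file=sys.stderr)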

EricR86 added the bug (Something isn't working) and major labels Apr 6, 2020