Seemingly non-deterministic "No such file or dir" for async log file #586

Closed
emanuel-metzenthin opened this issue May 28, 2021 · 14 comments

Comments

@emanuel-metzenthin

Hi!

I have some issues with running and logging experiments.

Sometimes (I couldn't figure out why) I get the following error when executing experiments. When it appears, I don't get charts on the Neptune dashboard because the async thread gets killed.

Unexpected error occurred. Killing Neptune asynchronous thread. All data is safe on disk.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/neptune/new/internal/threading/daemon.py", line 46, in run
    self.work()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 129, in work
    self.process_batch(batch, version)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/neptune/new/internal/operation_processors/async_operation_processor.py", line 136, in process_batch
    self._processor._queue.ack(version)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/neptune/new/internal/containers/disk_queue.py", line 152, in ack
    os.remove(self._get_log_file(log_versions[i]))
FileNotFoundError: [Errno 2] No such file or directory: '.neptune/async/0dbcf6c3-7248-455c-8056-845d97cf4e73/exec-0-2021-05-28_01.30.23.421819/data-1.log'

The file does exist though.

Running:
Python 3.8.5
neptune-client 0.9.7 (using neptune.new)

Thanks for any suggestions!

@aniezurawski
Contributor

Hi @emanuel-metzenthin

What filesystem are you using?

@emanuel-metzenthin
Author

Hi @aniezurawski ,

I'm on Linux with an NFS4 file system.

@aniezurawski
Contributor

First of all, don't worry about your data. Whatever error triggers this message (Unexpected error occurred. Killing Neptune asynchronous thread. All data is safe on disk.), no data is lost. Everything is still logged to your local disk, and you can later sync it with the Neptune servers using the neptune sync command. See https://docs.neptune.ai/api-reference/command-line-interface#neptune-sync

Now back to the issue. Does it occur often? Is it easily reproducible or is it random?

@emanuel-metzenthin
Author

I'd say it occurs about every second or third time I start an experiment - randomly.

@aniezurawski
Contributor

aniezurawski commented Jul 7, 2021

I failed to reproduce the issue on NFS4. Does it work for you on non-network filesystems like ext4? Could you share your code? Does it still occur on the newest version of neptune-client (0.10.0)?

EDIT: Could you share the contents of the .neptune/async/0dbcf6c3-7248-455c-8056-845d97cf4e73/exec-0-2021-05-28_01.30.23.421819/ directory?

@emanuel-metzenthin
Author

I think I know now what's happening. The underlying framework that I'm using in my project changes the working directory away from the project directory. This likely is a race condition, where the CWD is changed between creation and access of the neptune log file.
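
The failure mode described above can be reproduced without Neptune at all. The following is a minimal sketch, assuming only that a file is created under a relative path and that something else changes the working directory before the file is removed. The function name and the temporary directories are hypothetical and chosen for illustration; this is not neptune-client code.

```python
import os
import tempfile


def demo_relative_path_breaks():
    """Return (exists_before, exists_after, exists_via_absolute, removal_failed)."""
    original_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as project_dir, \
            tempfile.TemporaryDirectory() as other_dir:
        try:
            # Start in the "project" directory and create a log file
            # under a relative path, similar to '.neptune/async/.../data-1.log'.
            os.chdir(project_dir)
            rel_path = os.path.join(".neptune", "async", "data-1.log")
            os.makedirs(os.path.dirname(rel_path))
            with open(rel_path, "w") as f:
                f.write("op\n")
            abs_path = os.path.abspath(rel_path)
            exists_before = os.path.exists(rel_path)

            # Simulate a framework changing the working directory.
            os.chdir(other_dir)
            exists_after = os.path.exists(rel_path)
            exists_via_absolute = os.path.exists(abs_path)

            # The relative path now points somewhere else, so removal fails
            # even though the file is still on disk.
            try:
                os.remove(rel_path)
                removal_failed = False
            except FileNotFoundError:
                removal_failed = True
            return exists_before, exists_after, exists_via_absolute, removal_failed
        finally:
            # Restore the CWD before the temporary directories are cleaned up.
            os.chdir(original_cwd)


if __name__ == "__main__":
    print(demo_relative_path_breaks())
```

This matches the observation that the file "does exist": it exists at its absolute location, but the relative path is resolved against whatever the working directory is at the time of the call.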

I've found the run directory declared as a constant in

NEPTUNE_RUNS_DIRECTORY = '.neptune'
Is there a way to supply an absolute path for it? If not, this would be a great extension.

@aniezurawski
Contributor

Unfortunately, it is not possible for now. Thanks for your suggestion.
@Herudaio what do you think? It looks like a good idea to support such an environment variable.

@aniezurawski
Contributor

@emanuel-metzenthin could you say more? What framework was it?

@emanuel-metzenthin
Author

It is RLlib by Ray. They also have a log directory and, unfortunately, change the working directory to it. There also seems to be no way to switch off that behavior.

@aniezurawski
Contributor

Hi @emanuel-metzenthin

Try version 0.10.2 of neptune-client.
It is still not possible to configure a custom log directory, but we started to use absolute paths internally, so the issue should not occur anymore.
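
As a rough illustration of what "absolute paths internally" can mean, here is a minimal sketch, assuming the directory is resolved once when the object is created and reused for every later file operation. The `LogDir` class and `run_demo` function are hypothetical and are not the actual neptune-client implementation.

```python
import os
import tempfile
from pathlib import Path


class LogDir:
    """Hypothetical helper that pins its directory to an absolute path."""

    def __init__(self, relative_dir):
        # Resolve once, against the working directory at construction time.
        self._dir = Path(relative_dir).resolve()
        self._dir.mkdir(parents=True, exist_ok=True)

    def log_file(self, version):
        return self._dir / f"data-{version}.log"

    def write(self, version, text):
        self.log_file(version).write_text(text)

    def ack(self, version):
        # Remove the file through its absolute path, so a later chdir
        # elsewhere in the process does not affect the lookup.
        self.log_file(version).unlink()


def run_demo():
    """Return True if the log file was removed after the CWD changed."""
    original_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as project_dir, \
            tempfile.TemporaryDirectory() as other_dir:
        try:
            os.chdir(project_dir)
            logs = LogDir(".neptune/async")
            logs.write(1, "op\n")
            os.chdir(other_dir)  # simulate the framework changing the CWD
            logs.ack(1)          # still succeeds
            return not logs.log_file(1).exists()
        finally:
            os.chdir(original_cwd)


if __name__ == "__main__":
    print(run_demo())
```

The design choice is to capture the absolute location as early as possible; any path that stays relative remains sensitive to working-directory changes made by other code in the same process.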

@emanuel-metzenthin
Author

Thank you very much! This seems to work.

@KaleabTessera

I had the same issue when using a distributed system. I tried 0.10.2 and 0.10.7, but they didn't work; 0.10.0 did. @aniezurawski Are there any plans to fix this in new versions?

@aniezurawski
Contributor

Hi @KaleabTessera

We still do not know exactly what the root cause of this issue is, but we have implemented a workaround and it's already merged to master. It will be released soon.
Thanks for all the information. I'm going to take a closer look at what changed between versions 0.10.0 and 0.10.2.

@aniezurawski
Contributor

Hi @KaleabTessera

Please try version 0.10.8. Since it's only a workaround for the underlying issue, let me know if any further problems occur.
