Bug Report
Issue name
dvc exp run runs but does not store metrics.
Description
I'm running my training script on /dev/mapper/system-home, and it writes its outputs (model checkpoints, metrics) to /data/.cache, which is located on another partition (/dev/sdb1). /dev/sdb1 is a deliberately large partition where we are supposed to store large files. Running dvc exp run works fine, but after completion dvc exp show does not show any metrics (and neither does dvc metrics show).
When outputting metrics to a folder on the same partition as the training script (/dev/mapper/system-home), dvc exp show works perfectly and shows metrics.
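To confirm that two locations really sit on different devices, one can compare the `st_dev` field reported by `os.stat`. The following is a minimal sketch; the helper name `same_device` is hypothetical, and the paths are the ones from this report.

```python
import os


def same_device(path_a, path_b):
    """Return True if both existing paths are on the same device (same st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


if __name__ == "__main__":
    # Paths taken from this report; adjust them to your own layout.
    workspace, metrics_dir = ".", "/data/metrics"
    if os.path.exists(metrics_dir):
        print("same device:", same_device(workspace, metrics_dir))
    else:
        print(metrics_dir, "does not exist on this machine")
```

If this prints `same device: False`, hardlinks and reflinks between the two locations cannot work, which would be consistent with the errors below.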
When using verbose mode, I get the following errors:
2022-06-08 17:49:08,234 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 18] Invalid cross-device link
------------------------------------------------------------
Traceback (most recent call last):
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
func(from_path, to_path)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/base.py", line 263, in reflink
return self.fs.reflink(from_info, to_info)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/local.py", line 156, in reflink
return System.reflink(path1, path2)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/system.py", line 112, in reflink
System._reflink_linux(source, link_name)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/system.py", line 96, in _reflink_linux
fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
OSError: [Errno 18] Invalid cross-device link
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
return _link(link, from_fs, from_path, to_fs, to_path)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
raise OSError(
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
_try_links([link], from_fs, from_file, to_fs, to_file)
File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
raise OSError(
OSError: [Errno 95] no more link types left to try out
The full traceback can be found here:
trace.Log
The "Invalid cross-device link" part suggests that dvc cannot handle cross-device operations.
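The chained errors in the log (an errno 18 wrapped in an errno 95) can be reproduced in isolation. The sketch below is not DVC's actual code; it only mirrors the shape of the traceback: each link type is tried in turn, and when all of them fail, a final "not supported" error is raised with the last failure as its cause. The names `try_links` and `describe_errno` are hypothetical.

```python
import errno
import os


def describe_errno(code):
    """Return a readable description such as 'EXDEV (18): ...' for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


def try_links(link_funcs, src, dst):
    """Try each (name, func) pair in order and return the name of the first that works.

    Raises OSError(ENOTSUP) chained to the last failure when every link type fails.
    """
    last_error = None
    for name, func in link_funcs:
        try:
            func(src, dst)
            return name
        except OSError as exc:
            last_error = exc
    raise OSError(errno.ENOTSUP, "no more link types left to try out") from last_error


if __name__ == "__main__":
    # Errno 18 from the log; on Linux this is EXDEV.
    print(describe_errno(errno.EXDEV))
```

Passing a list such as `[("hardlink", os.link), ("symlink", os.symlink)]` exercises the same fallback order on real paths.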
Reproduce
- Create a default project on partition /sda1/foo1 that trains and evaluates a model, writing metrics to another device, /sdb1/foo2
# train.py on /sda1/foo1
from dvclive import Live

live = Live("/data/metrics")  # /data mounted on /sdb1/foo2
for epoch in epochs:
    metrics = ...
    for metric_name, value in metrics.items():
        live.log(metric_name, value)
    live.next_step()

Example of /data/metrics.json:
{
    "step": 1,
    "loss": 0.7107148170471191,
    "directed_f1_weighed": 0.0,
    "undirected_f1_weighed": 0.0,
    "oriented_acc": 0.8346456692913385,
    "officical_f1_macro": 0.0
}
Example of /data/metrics/scalar/loss.tsv:
timestamp step loss
1654703111346 0 0.8031530231237411
1654703334339 1 0.7107148170471191
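To verify that the metrics are actually written to disk (so that the problem lies in how dvc picks them up, not in dvclive), the two files can be cross-checked. This is a rough sketch assuming the scalar file is tab-separated, as its .tsv extension suggests; the helper names are hypothetical.

```python
import csv
import io
import json


def parse_scalar_tsv(text):
    """Parse a scalar TSV (timestamp, step, <metric>) into a list of dict rows."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


def latest_value(rows, metric):
    """Return the metric value from the row with the highest step."""
    last = max(rows, key=lambda row: int(row["step"]))
    return float(last[metric])


def summary_matches(summary_json, tsv_text, metric):
    """Check that the summary JSON agrees with the last row of the scalar TSV."""
    summary = json.loads(summary_json)
    return summary[metric] == latest_value(parse_scalar_tsv(tsv_text), metric)
```

With the two examples above, the summary's loss value and the step 1 row of loss.tsv agree, so the files themselves look consistent.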
Expected
dvc metrics show (and dvc exp show) should display the metrics columns.
Environment information
Python 3.8.13
Description: Ubuntu 20.04.3 LTS
Release: 20.04
dvclive 0.8.2
Output of dvc doctor:
$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.8.13 on Linux-5.4.0-91-generic-x86_64-with-glibc2.17
Supports:
hdfs (fsspec = 2022.5.0, pyarrow = 3.0.0),
webhdfs (fsspec = 2022.5.0),
http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/system-home
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/mapper/system-home
Repo: dvc, git
Additional Information (if any):
I think the error comes from missing support for cross-device copying (see https://stackoverflow.com/questions/42392600/oserror-errno-18-invalid-cross-device-link). Do you have any ideas? Thanks for this nice piece of software 👍
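For illustration, the usual way to cope with errno 18 is to fall back to a plain copy when a link cannot cross devices. The sketch below shows that pattern in general terms; it is not a patch for dvc, and the function name `link_or_copy` and its `link` parameter are hypothetical.

```python
import errno
import os
import shutil


def link_or_copy(src, dst, link=os.link):
    """Try to link src to dst; on a cross-device error (EXDEV), copy instead.

    Returns "hardlink" or "copy" to indicate which strategy succeeded.
    Any other OSError is re-raised unchanged.
    """
    try:
        link(src, dst)
        return "hardlink"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
    # Copying works across devices, at the cost of duplicating the data.
    shutil.copy2(src, dst)
    return "copy"
```

The `link` argument is injectable only so the cross-device case can be simulated on a single partition.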

