dvc exp run: experiment metrics are not reported when metric files are on another device than training code #7863

@AlexandreRozier

Bug Report

Issue name

dvc exp run runs but does not store metrics.

Description

My training script runs on /dev/mapper/system-home and writes its outputs (model checkpoints, metrics) to /data/.cache, which is located on another partition (/dev/sdb1). /dev/sdb1 is a deliberately large partition where we are supposed to store large files. dvc exp run completes without problems, but afterwards neither dvc exp show nor dvc metrics show displays any metrics.

When outputting metrics to a folder on the same partition as the training script (/dev/mapper/system-home), dvc exp show works perfectly and shows metrics.
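To confirm that two directories really sit on different devices, one can compare the st_dev field returned by os.stat. This is only a minimal diagnostic sketch; same_device is a hypothetical helper name and is not part of dvc or dvclive.

```python
import os


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both paths are on the same device (identical st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Example with the paths from this report (adjust to your own layout):
# same_device(".", "/data/metrics")
```

If this returns False for the workspace and the metrics directory, hardlinks and reflinks between the two locations cannot work, since both require source and target to be on the same filesystem.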

When using verbose mode, I get the following errors:

2022-06-08 17:49:08,234 DEBUG: [Errno 95] no more link types left to try out: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>: [Errno 18] Invalid cross-device link
------------------------------------------------------------
Traceback (most recent call last):
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 28, in _link
    func(from_path, to_path)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/base.py", line 263, in reflink
    return self.fs.reflink(from_info, to_info)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/local.py", line 156, in reflink
    return System.reflink(path1, path2)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/system.py", line 112, in reflink
    System._reflink_linux(source, link_name)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/system.py", line 96, in _reflink_linux
    fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
OSError: [Errno 18] Invalid cross-device link

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 69, in _try_links
    return _link(link, from_fs, from_path, to_fs, to_path)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 32, in _link
    raise OSError(
OSError: [Errno 95] 'reflink' is not supported by <class 'dvc.fs.local.LocalFileSystem'>

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 124, in _test_link
    _try_links([link], from_fs, from_file, to_fs, to_file)
  File "/home/hx/miniconda3/envs/origami/lib/python3.8/site-packages/dvc/fs/utils.py", line 77, in _try_links
    raise OSError(
OSError: [Errno 95] no more link types left to try out

The full traceback can be found here:
trace.Log

The "Invalid cross-device link" part suggests that dvc cannot handle cross-device operations.
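For illustration, here is how [Errno 18] (EXDEV) typically arises and how a program can fall back to copying. This is a generic sketch and not dvc's implementation: link_or_copy and the set of errno values treated as "fall back to copy" are assumptions made for this example.

```python
import errno
import os
import shutil

# errno values treated as "a hardlink is not possible here" (an assumption of this sketch)
_FALLBACK_ERRNOS = {errno.EXDEV, errno.EPERM, errno.EOPNOTSUPP}


def link_or_copy(src: str, dst: str) -> str:
    """Try to hardlink src to dst; copy instead if linking is not possible.

    Returns "hardlink" or "copy" depending on which method succeeded.
    """
    try:
        os.link(src, dst)  # raises OSError(EXDEV) when src and dst are on different devices
        return "hardlink"
    except OSError as exc:
        if exc.errno not in _FALLBACK_ERRNOS:
            raise  # unrelated failure (missing file, existing target, ...)
    shutil.copy2(src, dst)  # a plain copy works across devices
    return "copy"
```

Called with a source on /dev/mapper/system-home and a target on /dev/sdb1, a helper like this would be expected to return "copy", because hardlinks cannot span filesystems.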

Reproduce

  1. Create a default project on partition /sda1/foo1 that trains and evaluates a model and writes its metrics to another device, /sdb1/foo2:
# train.py on /sda1/foo1
from dvclive import Live

live = Live("/data/metrics")  # /data is mounted on /sdb1/foo2
for epoch in epochs:  # epochs: the training epochs (definition elided in this report)
    metrics = ...  # dict of metric name -> value for this epoch (elided)
    for metric_name, value in metrics.items():
        live.log(metric_name, value)
    live.next_step()

Example of /data/metrics.json:

{
    "step": 1,
    "loss": 0.7107148170471191,
    "directed_f1_weighed": 0.0,
    "undirected_f1_weighed": 0.0,
    "oriented_acc": 0.8346456692913385,
    "officical_f1_macro": 0.0
}

Example of /data/metrics/scalar/loss.tsv:

timestamp	step	loss
1654703111346	0	0.8031530231237411
1654703334339	1	0.7107148170471191
  2. dvc exp show doesn't show any metrics columns.

Expected

dvc exp show and dvc metrics show display the metrics columns, even when the metric files are on a different device than the training code.

Environment information

Python 3.8.13

Description: Ubuntu 20.04.3 LTS
Release: 20.04
dvclive 0.8.2

Output of dvc doctor:

$ dvc doctor
DVC version: 2.10.2 (pip)
---------------------------------
Platform: Python 3.8.13 on Linux-5.4.0-91-generic-x86_64-with-glibc2.17
Supports:
        hdfs (fsspec = 2022.5.0, pyarrow = 3.0.0),
        webhdfs (fsspec = 2022.5.0),
        http (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.4.6),
        s3 (s3fs = 2022.5.0, boto3 = 1.21.21)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/mapper/system-home
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/mapper/system-home
Repo: dvc, git

Additional Information (if any):

I think the error comes from missing support for cross-device copying (see https://stackoverflow.com/questions/42392600/oserror-errno-18-invalid-cross-device-link). Do you have any ideas? Thanks for this nice piece of software 👍

Labels: A: experiments, bug, p3-nice-to-have