import: still creates .dvc file even when operation fails #9785

Closed
dberenbaum opened this issue Jul 31, 2023 · 25 comments · Fixed by iterative/dvc-data#412, #9791 or #9800
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something? p0-critical Critical issue. Needs to be fixed ASAP. regression Ohh, we broke something :-(

Comments

@dberenbaum
Collaborator

Bug Report

Description

Even when dvc import fails, it still creates the .dvc file.

Reproduce

Without AWS sandbox credentials set, try dvc import -v git@github.com:dberenbaum/dataset.git cats-dogs. A cats-dogs.dvc file is generated.

$ dvc import -v git@github.com:dberenbaum/dataset.git cats-dogs
2023-07-31 09:11:29,074 DEBUG: v3.10.2.dev2+g5c2fd67e6, CPython 3.11.4 on macOS-13.4.1-arm64-arm-64bit
2023-07-31 09:11:29,074 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc import -v git@github.com:dberenbaum/dataset.git cats-dogs
2023-07-31 09:11:29,287 DEBUG: Removing output 'cats-dogs' of stage: 'cats-dogs.dvc'.
2023-07-31 09:11:29,287 DEBUG: Removing '/Users/dave/repo/cats-dogs'
Importing 'cats-dogs (git@github.com:dberenbaum/dataset.git)' -> 'cats-dogs'
2023-07-31 09:11:29,288 DEBUG: Computed stage: 'cats-dogs.dvc' md5: '9ce0eabbbaa6ff8015aef3312fd8d213'
2023-07-31 09:11:29,288 DEBUG: 'md5' of stage: 'cats-dogs.dvc' changed.
2023-07-31 09:11:29,288 DEBUG: Creating external repo git@github.com:dberenbaum/dataset.git@None
2023-07-31 09:11:29,289 DEBUG: erepo: git clone 'git@github.com:dberenbaum/dataset.git' to a temporary dir
2023-07-31 09:11:33,155 ERROR: failed to load ('cats-dogs',) - Unable to locate credentials
Traceback (most recent call last):
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 541, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 477, in _load_from_object_storage
    obj = Tree.load(
          ^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/tree.py", line 191, in load
    with obj.fs.open(obj.path, "r") as fobj:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 222, in open
    return self.fs.open(path, mode=mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1229, in open
    self.open(
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1241, in open
    f = self._open(
        ^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 659, in _open
    return S3File(
           ^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 2066, in __init__
    super().__init__(
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1597, in __init__
    self.size = self.details["size"]
                ^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/spec.py", line 1610, in details
    self._details = self.fs.info(self.path)
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/asyn.py", line 121, in wrapper
    return sync(self.loop, func, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/asyn.py", line 106, in sync
    raise return_result
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/fsspec/asyn.py", line 61, in _runner
    result[0] = await coro
                ^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 1271, in _info
    out = await self._call_s3(
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 342, in _call_s3
    s3 = await self.get_s3(kwargs.get("Bucket"))
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 336, in get_s3
    return await self._s3creator.get_bucket_client(bucket)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/utils.py", line 39, in get_bucket_client
    response = await general_client.head_bucket(Bucket=bucket_name)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/client.py", line 354, in _make_api_call
    http, parsed_response = await self._make_request(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/client.py", line 379, in _make_request
    return await self._endpoint.make_request(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/endpoint.py", line 96, in _send_request
    request = await self.create_request(request_dict, operation_model)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/endpoint.py", line 84, in create_request
    await self._event_emitter.emit(
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/hooks.py", line 66, in _emit
    response = await resolve_awaitable(handler(**kwargs))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/_helpers.py", line 15, in resolve_awaitable
    return await obj
           ^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/signers.py", line 24, in handler
    return await self.sign(operation_name, request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/signers.py", line 82, in sign
    auth.add_auth(request)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/botocore/auth.py", line 418, in add_auth
    raise NoCredentialsError()
botocore.exceptions.NoCredentialsError: Unable to locate credentials

2023-07-31 09:11:33,163 WARNING: 'cats-dogs' is empty.
2023-07-31 09:11:33,163 DEBUG: Added '/Users/dave/repo/cats-dogs' to gitignore file.
2023-07-31 09:11:33,165 DEBUG: built tree 'object d751713988987e9331980363e24189ce.dir'
2023-07-31 09:11:33,165 DEBUG: Computed stage: 'cats-dogs.dvc' md5: 'd1c1ff68c486e7407aa917addbde4316'
2023-07-31 09:11:33,166 DEBUG: built tree 'object d751713988987e9331980363e24189ce.dir'
2023-07-31 09:11:33,167 DEBUG: Preparing to transfer data from 'memory://dvc-staging-md5/e491ff18e115f87446131d48702971fec1cf49467aabcca1cbce150e09fb0192' to '/Users/dave/repo/.dvc/cache/files/md5'
2023-07-31 09:11:33,167 DEBUG: Preparing to collect status from '/Users/dave/repo/.dvc/cache/files/md5'
2023-07-31 09:11:33,167 DEBUG: Collecting status from '/Users/dave/repo/.dvc/cache/files/md5'
2023-07-31 09:11:33,168 DEBUG: Preparing to collect status from 'memory://dvc-staging-md5/e491ff18e115f87446131d48702971fec1cf49467aabcca1cbce150e09fb0192'
2023-07-31 09:11:33,168 DEBUG: transfer dir: md5: d751713988987e9331980363e24189ce.dir with 0 files
2023-07-31 09:11:33,169 DEBUG: built tree 'object d751713988987e9331980363e24189ce.dir'
2023-07-31 09:11:33,170 DEBUG: Removing '/Users/dave/repo/.2XQmKjhqscixAiPLGzL6ve.tmp'
2023-07-31 09:11:33,170 DEBUG: Removing '/Users/dave/repo/.2XQmKjhqscixAiPLGzL6ve.tmp'
2023-07-31 09:11:33,170 DEBUG: Removing '/Users/dave/repo/.dvc/cache/files/md5/.LHZ5pSSNtj4Nty4zXX7cSe.tmp'
2023-07-31 09:11:33,172 DEBUG: Saving information to 'cats-dogs.dvc'.
2023-07-31 09:11:33,173 DEBUG: Staging files: {'.gitignore', 'cats-dogs.dvc'}
2023-07-31 09:11:33,176 DEBUG: Analytics is disabled.

Expected

No cats-dogs.dvc file should be created.

@dberenbaum dberenbaum added bug Did we break something? p1-important Important, aka current backlog of things to do A: data-sync Related to dvc get/fetch/import/pull/push labels Jul 31, 2023
@dberenbaum dberenbaum added this to DVC Jul 31, 2023
@dberenbaum dberenbaum moved this from Backlog to Todo in DVC Jul 31, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Jul 31, 2023
@dberenbaum
Collaborator Author

@farhanhubble

I think we still want to create a .dvc file, but we do not want it updated with incorrect metadata if the download fails.

Creating the .dvc file covers the case when --no-download is specified and, later, dvc update is performed.

Right now the .dvc file gets updated with:

  size: 0
  nfiles: 0

on failed downloads.
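
For anyone who wants to check a repo for this symptom, here is a minimal sketch (not part of DVC) that flags .dvc files recording both `size: 0` and `nfiles: 0`. The function name `find_empty_dir_outputs` is hypothetical, and it does a plain text scan rather than real YAML parsing, so treat hits as candidates to inspect. A directory that is legitimately empty would be flagged too.

```python
from pathlib import Path


def find_empty_dir_outputs(root):
    """Return .dvc files under root that record `size: 0` and `nfiles: 0`."""
    suspects = []
    for path in sorted(Path(root).rglob("*.dvc")):
        if not path.is_file():  # skips the .dvc/ directory itself
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        lines = {line.strip() for line in text.splitlines()}
        # Exact line match, so `size: 1024` or `nfiles: 10` is not flagged
        if "size: 0" in lines and "nfiles: 0" in lines:
            suspects.append(path)
    return suspects
```

Calling `find_empty_dir_outputs(".")` from the repo root returns a list of paths to look at by hand.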

@dberenbaum
Collaborator Author

Related to #9482

@dberenbaum dberenbaum added the regression Ohh, we broke something :-( label Aug 1, 2023
@dberenbaum
Collaborator Author

Looks like this was introduced in 3aa9f51. Before that, it would fail like this:

$ dvc import -v git@github.com:dberenbaum/dataset2.git cats-dogs
2023-08-01 15:02:24,109 DEBUG: v2.46.1.dev19+g1cfdc2254, CPython 3.11.4 on macOS-13.4.1-arm64-arm-64bit
2023-08-01 15:02:24,109 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc import -v git@github.com:dberenbaum/dataset2.git cats-dogs
2023-08-01 15:02:24,394 DEBUG: Removing output 'cats-dogs' of stage: 'cats-dogs.dvc'.
2023-08-01 15:02:24,394 DEBUG: Removing '/Users/dave/repo/cats-dogs'
Importing 'cats-dogs (git@github.com:dberenbaum/dataset2.git)' -> 'cats-dogs'
2023-08-01 15:02:24,395 DEBUG: Computed stage: 'cats-dogs.dvc' md5: '61f6a7670db2364cdfeb72b5d7b72fcc'
2023-08-01 15:02:24,395 DEBUG: 'md5' of stage: 'cats-dogs.dvc' changed.
2023-08-01 15:02:24,395 DEBUG: Creating external repo git@github.com:dberenbaum/dataset2.git@None
2023-08-01 15:02:24,395 DEBUG: erepo: git clone 'git@github.com:dberenbaum/dataset2.git' to a temporary dir
2023-08-01 15:02:26,022 DEBUG: Checking if stage '/cats-dogs' is in 'dvc.yaml'
2023-08-01 15:02:26,172 DEBUG: Preparing to transfer data from 'dave-sandbox/cache/dataset' to '/Users/dave/repo/.dvc/cache'
2023-08-01 15:02:26,172 DEBUG: Preparing to collect status from '/Users/dave/repo/.dvc/cache'
2023-08-01 15:02:26,172 DEBUG: Collecting status from '/Users/dave/repo/.dvc/cache'
2023-08-01 15:02:26,173 DEBUG: Preparing to collect status from 'dave-sandbox/cache/dataset'
2023-08-01 15:02:26,173 DEBUG: Collecting status from 'dave-sandbox/cache/dataset'
2023-08-01 15:02:26,173 DEBUG: Querying 1 oids via object_exists
2023-08-01 15:02:26,688 ERROR: unexpected error - Forbidden: An error occurred (403) when calling the HeadObject operation: Forbidden
Traceback (most recent call last):
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 113, in _error_wrapper
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/aiobotocore/client.py", line 371, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/dave/Code/dvc/dvc/cli/__init__.py", line 210, in main
    ret = cmd.do_run()
          ^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/cli/command.py", line 26, in do_run
    return self.run()
           ^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/commands/imp.py", line 17, in run
    self.repo.imp(
  File "/Users/dave/Code/dvc/dvc/repo/imp.py", line 6, in imp
    return self.imp_url(path, out=out, fname=fname, erepo=erepo, frozen=True, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 67, in wrapper
    return f(repo, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/scm_context.py", line 151, in run
    return method(repo, *args, **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/imp_url.py", line 91, in imp_url
    stage.run(jobs=jobs, no_download=no_download)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
    return call()
           ^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 577, in run
    self._sync_import(dry, force, kwargs.get("jobs", None), no_download)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 47, in wrapper
    return deco(call, *dargs, **dkwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/decorators.py", line 43, in rwlocked
    return call()
           ^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/funcy/decorators.py", line 68, in __call__
    return self._func(*self._args, **self._kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 611, in _sync_import
    sync_import(self, dry, force, jobs, no_download)
  File "/Users/dave/Code/dvc/dvc/stage/imports.py", line 61, in sync_import
    stage.deps[0].download(
  File "/Users/dave/Code/dvc/dvc/dependency/repo.py", line 75, in download
    for odb, objs in self.get_used_objs().items():
                     ^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/dependency/repo.py", line 102, in get_used_objs
    used, _, _ = self._get_used_and_obj(**kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/dependency/repo.py", line 124, in _get_used_and_obj
    for odb, obj_ids in repo.used_objs(
                        ^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/__init__.py", line 524, in used_objs
    for odb, objs in self.index.used_objs(
                     ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/repo/index.py", line 449, in used_objs
    for odb, objs in stage.get_used_objs(
                     ^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/stage/__init__.py", line 722, in get_used_objs
    for odb, objs in out.get_used_objs(*args, **kwargs).items():
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/output.py", line 1105, in get_used_objs
    obj = self._collect_used_dir_cache(**kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/output.py", line 1038, in _collect_used_dir_cache
    self.get_dir_cache(jobs=jobs, remote=remote)
  File "/Users/dave/Code/dvc/dvc/output.py", line 1020, in get_dir_cache
    self.repo.cloud.pull([obj.hash_info], **kwargs)
  File "/Users/dave/Code/dvc/dvc/data_cloud.py", line 181, in pull
    return self.transfer(
           ^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc/dvc/data_cloud.py", line 135, in transfer
    return transfer(src_odb, dest_odb, objs, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/transfer.py", line 203, in transfer
    status = compare_status(
             ^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/status.py", line 189, in compare_status
    src_exists, src_missing = status(
                              ^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/status.py", line 134, in status
    exists = hashes.intersection(
             ^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/status.py", line 55, in _indexed_dir_hashes
    dir_exists.update(
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/db.py", line 359, in list_oids_exists
    in_remote = self.fs.exists(paths, batch_size=jobs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 365, in exists
    return fut.result()
           ^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/executors.py", line 134, in batch_coros
    result = fut.result()
             ^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 1004, in _exists
    await self._info(path, bucket, key, version_id=version_id)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 1271, in _info
    out = await self._call_s3(
          ^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 348, in _call_s3
    return await _error_wrapper(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/s3fs/core.py", line 140, in _error_wrapper
    raise err
PermissionError: Forbidden

2023-08-01 15:02:26,715 DEBUG: Version info for developers:
DVC version: 2.46.1.dev19+g1cfdc2254
------------------------------------
Platform: Python 3.11.4 on macOS-13.4.1-arm64-arm-64bit
Subprojects:
        dvc_data = 0.42.3
        dvc_objects = 0.24.1
        dvc_render = 0.1.2
        dvc_task = 0.3.0
        scmrepo = 0.2.1
Supports:
        azure (adlfs = 2023.4.0, knack = 0.10.1, azure-identity = 1.13.0),
        gdrive (pydrive2 = 1.16.1),
        gs (gcsfs = 2023.6.0),
        hdfs (fsspec = 2023.6.0, pyarrow = 12.0.1),
        http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
        oss (ossfs = 2021.8.0),
        s3 (s3fs = 2023.6.0, boto3 = 1.26.161),
        ssh (sshfs = 2023.4.1),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8),
        webhdfs (fsspec = 2023.6.0)
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: local
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2023-08-01 15:02:26,716 DEBUG: Analytics is disabled.

@farhanhubble We can create the .dvc file when using --no-download but still not create it when there is a failure (this was the previous behavior).

@dberenbaum
Collaborator Author

@efiop So we need to suppress the errors for Studio, but it seems like we are also suppressing them for the dvc CLI in cases where we shouldn't?

@farhanhubble

@dberenbaum so if I do a two step import:

dvc import --no-download # -> Creates a .dvc file with no outs
dvc update    # -> Deletes the DVC on failed download ?

@dberenbaum
Collaborator Author

No, sorry for the confusion @farhanhubble. Currently, it's working like this:

dvc import --no-download # -> Creates a .dvc with empty outs
dvc update    # -> Corrupts the .dvc outs on failed download

The previous expected behavior was:

dvc import --no-download # -> Creates a .dvc with empty outs
dvc update    # -> Leaves .dvc untouched on failed download
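
One generic way to get the "leaves .dvc untouched" behavior is to write the metadata only after the download has succeeded, and to swap the file in atomically. The sketch below illustrates that pattern only; it is not how DVC implements it. `write_if_download_succeeds`, `download`, and `render` are hypothetical names.

```python
import os
import tempfile


def write_if_download_succeeds(dvc_path, download, render):
    """Run download(); replace dvc_path only if it succeeds.

    download and render are hypothetical callables standing in for the
    fetch step and the metadata serialization. They are not DVC APIs.
    """
    result = download()  # any exception propagates; dvc_path is not touched
    text = render(result)
    directory = os.path.dirname(os.path.abspath(dvc_path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fobj:
            fobj.write(text)
        os.replace(tmp, dvc_path)  # atomic swap on the same filesystem
    except BaseException:
        # Clean up the temporary file if writing or swapping failed
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

With this ordering, a `PermissionError` raised by `download()` leaves the existing file byte-for-byte unchanged.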

@dberenbaum
Collaborator Author

dberenbaum commented Aug 2, 2023

Edit: You can skip this entire comment and go to the next one.

It's coming from here:

dvc/dvc/fs/__init__.py

Lines 60 to 64 in b235487

from_infos = [
    path
    for path in fs.find(fs_path)
    if not path.endswith(fs.path.flavour.sep)
]

That throws an error but it gets suppressed (I can't tell how or why). We should be raising that error.

Edit: I found that it's coming from here:

with suppress(FileNotFoundError, NotADirectoryError):

Edit 2: Even if we remove that suppression, we still catch the exception here and fail to raise it 🤦 :

dvc/dvc/fs/dvc.py

Lines 367 to 369 in b235487

except (FileNotFoundError, NotADirectoryError):
    if not dvc_info:
        raise
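
To illustrate why this kind of suppression is risky, here is a small standalone sketch using only the standard library (local directories instead of dvcfs, and hypothetical function names). With `suppress`, a missing directory and an empty directory produce the same result, so the caller cannot tell a failure from "nothing to import".

```python
import os
from contextlib import suppress


def list_suppressed(path):
    """List a directory, silently treating a missing path as empty."""
    names = []
    with suppress(FileNotFoundError, NotADirectoryError):
        names = sorted(os.listdir(path))
    return names


def list_strict(path):
    """List a directory and let FileNotFoundError reach the caller."""
    return sorted(os.listdir(path))
```

`list_suppressed` returns `[]` for both an empty directory and a path that does not exist, while `list_strict` raises for the latter.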

@dberenbaum
Collaborator Author

Looks like it's ultimately coming from iterative/dvc-data#409. Was this for studio @efiop? Can we handle these exceptions more carefully and raise them?

@efiop
Contributor

efiop commented Aug 2, 2023

@dberenbaum Yeah, that's what we've discussed today partially. I'll introduce a mechanism for that shortly.

@shcheklein
Member

Let's add a test for this please.

@shcheklein shcheklein reopened this Aug 2, 2023
@github-project-automation github-project-automation bot moved this from Done to Todo in DVC Aug 2, 2023
@efiop
Contributor

efiop commented Aug 2, 2023

I've reverted for now so we don't have to wait, but will introduce a proper onerror mechanism for that later.

@shcheklein
Member

The way I see it, the revert is also a fix for this issue (how specifically we made the change is a detail), but we want a safeguard to avoid this in the future, especially when we implement a proper onerror mechanism for that.

@efiop
Contributor

efiop commented Aug 2, 2023

@shcheklein I wasn't replying to you, we just posted at the same time, sorry for the confusion.

The test should likely be in datafs. I'll reopen this when handling iterative/dvc-data#413 if need be.

@efiop efiop closed this as completed Aug 2, 2023
@dberenbaum
Collaborator Author

dberenbaum commented Aug 3, 2023

Update: it's a bigger problem that's not really about error handling at all. We are corrupting the cache, which is probably at the root of a lot of the recent issues.

$ dvc import git@github.com:dberenbaum/dataset2.git cats-dogs
Importing 'cats-dogs (git@github.com:dberenbaum/dataset2.git)' -> 'cats-dogs'
WARNING: 'cats-dogs' is empty.

$ cat cats-dogs.dvc
md5: 80b00298143632f1990d8979ea98a437
frozen: true
deps:
- path: cats-dogs
  repo:
    url: git@github.com:dberenbaum/dataset2.git
    rev_lock: 00db5587e8b4ee2f78421523c7db7434727e4774
outs:
- md5: d751713988987e9331980363e24189ce.dir
  size: 0
  nfiles: 0
  hash: md5
  path: cats-dogs

$ cat .dvc/cache/files/md5/d7/51713988987e9331980363e24189ce.dir
[]%

Edit: so we aren't getting any s3 errors here because dvc thinks it's an empty directory.

@dberenbaum
Collaborator Author

> not really about error handling at all

I guess it's still related since the workflow is:

  1. Collect info from .dir file
  2. .dir file not found but no error raised because we catch FileNotFoundError
  3. Assume the dir is empty since no paths are found
  4. Cache gets corrupted with empty .dir file
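
The four steps above can be sketched with a toy loader. This is an assumption-laden illustration, not dvc-data's code: `DirIndex`, `fetch`, and `cache_failures` are hypothetical names. With `cache_failures=True` a failed load is remembered as an empty directory and stays that way even after the source becomes reachable; with `cache_failures=False` nothing is cached on failure and the error reaches the caller.

```python
class DirIndex:
    """Toy stand-in for an index that lazily loads directory listings."""

    def __init__(self, fetch, cache_failures):
        self._fetch = fetch  # hypothetical: returns a list of paths or raises
        self._cache_failures = cache_failures
        self.loaded = {}  # directory key -> cached listing

    def ls(self, key):
        if key in self.loaded:
            return self.loaded[key]
        try:
            entries = self._fetch(key)  # step 1: collect info for the dir
        except FileNotFoundError:
            if self._cache_failures:
                # Steps 2-4: error swallowed, dir assumed empty, result cached
                self.loaded[key] = []
                return []
            raise  # safer path: nothing is cached, caller sees the error
        self.loaded[key] = entries
        return entries
```

In the `cache_failures=True` case a later call keeps returning `[]` until the cached entry is cleared, which matches the "cache gets corrupted" step.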

@efiop
Contributor

efiop commented Aug 3, 2023

@dberenbaum Great catch! That indeed answers it. The error should mitigate that. I'm looking into solving it properly.

@efiop efiop added p0-critical Critical issue. Needs to be fixed ASAP. and removed p1-important Important, aka current backlog of things to do labels Aug 3, 2023
@efiop efiop self-assigned this Aug 3, 2023
@daavoo
Contributor

daavoo commented Aug 3, 2023

> not really about error handling at all
>
> I guess it's still related since the workflow is:
>
>   1. Collect info from .dir file
>   2. .dir file not found but no error raised because we catch FileNotFoundError
>   3. Assume the dir is empty since no paths are found
>   4. Cache gets corrupted with empty .dir file

This is actually the same underlying issue for #9651 and #9786

@daavoo
Contributor

daavoo commented Aug 3, 2023

> This is actually the same underlying issue for #9651 and #9786

We should probably close and have a single issue

@dberenbaum
Collaborator Author

@daavoo Let's keep open for now to track the different use cases for testing and to double check that each scenario works as expected with the fix.

One question I have so far is how we can handle caches that have already been corrupted in this way?

efiop added a commit to efiop/dvc-data that referenced this issue Aug 3, 2023
Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories
that we've failed to load as loaded in cache, preventing cache corruption.

Kudos @dberenbaum for investigation.

Note that behaviour in iterative/dvc#9785 is not
fixed by this, because it is more related to how we treat this error in datafs,
dvcfs and also how fsspec's `fs.walk()` ignores broken directories. So we will
look into maybe raising `DataIndexDirError` instead of treating it as a broken
directory and ignoring in `datafs/dvcfs.ls`.

Fixes iterative#413
Related iterative/dvc#9785
efiop added a commit to iterative/dvc-data that referenced this issue Aug 4, 2023
Also fixes legacy TreeError/DataIndexError mess and doesn't mark directories
that we've failed to load as loaded in cache, preventing cache corruption.

Kudos @dberenbaum for investigation.

Note that behaviour in iterative/dvc#9785 is not
fixed by this, because it is more related to how we treat this error in datafs,
dvcfs and also how fsspec's `fs.walk()` ignores broken directories. So we will
look into maybe raising `DataIndexDirError` instead of treating it as a broken
directory and ignoring in `datafs/dvcfs.ls`.

Fixes #413
Related iterative/dvc#9785
@pmrowla
Contributor

pmrowla commented Aug 4, 2023

> Update: it's a bigger problem that's not really about error handling at all. We are corrupting the cache, which is probably at the root of a lot of the recent issues.

This isn't really corrupting cache. d751713988987e9331980363e24189ce.dir is the correct hash for an empty directory. The issue is that right now the underlying dvcfs.find() for the import source returns an empty directory instead of erroring out. So DVC is generating the correct cache entry for what it thinks is supposed to be an empty directory.

(DVC is not generating a .dir file that is supposed to contain files and then skipping the files in that .dir)
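
As a quick sanity check of the empty-directory claim, the cache file shown earlier in the thread contains just `[]`, so hashing those two bytes with MD5 should give the object name. This sketch assumes the .dir object is named after the MD5 of its exact contents; `md5_dir_name` is a hypothetical helper, not a DVC function.

```python
import hashlib


def md5_dir_name(content: bytes) -> str:
    """Return the MD5 hex digest of content with a `.dir` suffix."""
    return hashlib.md5(content).hexdigest() + ".dir"


# The bytes printed by `cat` on the cached .dir file in this thread
empty_listing = b"[]"
print(md5_dir_name(empty_listing))
```

If the assumption holds, the printed name is the one recorded in cats-dogs.dvc above.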

> Edit: so we aren't getting any s3 errors here because dvc thinks it's an empty directory.

This is backwards. The issue is that we are getting s3 errors (in dvc-data), but those errors were being suppressed and dvc-data was just returning that the directory was empty (instead of re-raising the errors to DVC).


The error handling has already been fixed in dvc-data by @efiop, and once the corresponding changes are made in DVC, all of the import-related symptoms of this underlying issue should be resolved without the user needing to do anything. After the error handling is fixed, fs.find() will return actual files (assuming the user has fixed their credentials), so dvc status will show that the import has changed (and running dvc update would generate a new .dvc file that contains the actual hash for the imported dir).

@daavoo
Contributor

daavoo commented Aug 4, 2023

> This isn't really corrupting cache. d751713988987e9331980363e24189ce.dir is the correct hash for an empty directory.

We were corrupting the "index cache" in site_cache_dir

6 participants