extract dvc-data from dvc #23

skshetry · 2022-05-19T14:24:37Z

This was extracted out from DVC from commit: 7bc98e3fb3.

Initial simple separation to get the process of extracting dvc-objects and dvc-data packages out of dvc. dvc-objects is for: * defining base Object (aka HashFile) and ObjectDB * providing tools to work with those objects (check/transfer/etc) * etc dvc-data is for: * defining higher level objects (e.g. Tree) * bulding objects from the workspace (aka staging) * projecting objects back into the workspace (aka checkout) * comparing Tree objects (aka diff) * defining misc object databases (e.g LocalObjectDB) * higher-level transfer (e.g. with Tree indexes) * etc For example, the objects we use in run-cache are based on dvc-objects and not on dvc-data, as they have absolutely nothing to do with data management. There are still some mixups in the code that will be addressed in the followups.

With our staging process automatically capturing file metadata, we no longer need to set isexec ourselves. In followup PRs we will also use meta during checkout, so that we won't need to use Output.set_exec by hand.

Introducing simple repo tree that combines all workspaces into one structure, and allows us to detach data representation from pipelines. Pre-requisite for dvc-data extraction and using relpaths in dvcfs/repofs.

the `jobs` kwarg was not being passed down properly, resulting in it being ignored in some cases (push/pull/gc)

This removes callback from internal APIs: `_list_paths`, `list_hashes`, and `_hashes_with_limit`. These are all iterators and hence can just be replaced by wrapping a progressbar over those iterators in the caller side. We still have progressbars in `hashes_exist`, `_estimate_remote_size`, `_list_hashes_traverse` and `list_hashes_exists`, which should gradually lifted up and replaced.

On our way to making dvcfs fsspec-compliant.

Since all of our filesystems are fsspec-compliant, there's no need for compat wrapper.

This PR gets rid of Tqdm progress bars inside `dvc.objects`. Those are replaced with multi-phase callback in `hashes_exist`, and simple callable that takes `completed` value in `odb._estimate_remote_size()`. The `odb.list_hashes_exists()` and `odb.all()` now return iterables, hence the caller has the responsibility of handling progress. This is now handled in `dvc.data` by using a custom `Tqdm` progressbar. This is ugly and is temporary, but my motivation is to remove tqdm from dvc.objects first. We can gradually get rid of these from `dvc.data` too, but that requires more knowledge about `dvc.data` and involves more design work. All in time, I guess. :)

This reverts commit 43cc10381a02dd92caffd9b1dd41e0deb5a88ce6.

1. This PR simplifies `fs.upload`/`fs.download`/`fs.download_file`. 2. Removes unused kwargs. 3. `download` uses `dvc.utils.threadpool.ThreadPoolExecutor` which supports canceling futures. 4. `download` for directories supports nested/branched callbacks and progress bars.

This PR gets rid of Tqdm progress bars inside `dvc.objects`. Those are replaced with multi-phase callback in `hashes_exist`, and simple callable that takes `completed` value in `odb._estimate_remote_size()`. The `odb.list_hashes_exists()` and `odb.all()` now return iterables, hence the caller has the responsibility of handling progress. This is now handled in `dvc.data` by using a custom `Tqdm` progressbar. This is ugly and is temporary, but my motivation is to remove tqdm from dvc.objects first. We can gradually get rid of these from `dvc.data` too, but that requires more knowledge about `dvc.data` and involves more design work. All in time, I guess. :)

There is no support for chunk_size kwarg in fs.open. Also, `upload_fobj` is now able to figure out proper chunk size from fileobj, so I don't think we need this on _upload_file.

The PR also makes odb.add operations atomic, but the checkout is still not atomic.

This reverts commit 1fb4a3a741558c1d7d54e42d54b3b91a38f3c836.

…allbacks

In the spirit of fsspec.generic, it is for cross-filesystem operations. Trying to avoid `fs.utils` as much as possible, but we have some functions that are really utilities coming, so I am making a place for them. :)

Also moves umask to dvc.fs.system

dvc.objects: extend diskcache to handle pickle errors

When staging a directory, always save a "raw dir object" to the odb. If the corresponding ".dir" object has not been added to the odb, `stage()` calls can load the tree from the raw dir object instead of rebuilding it by walking the directory. This can lead to significant speed improvements when calling `dvc status` for modified directories. Fixes #7390

This commit moves `HashedStreamReader`/`get_hash_file`/`file_md5` into `dvc.objects.hash`. The `get_hash_file` has been renamed to `hash_file`. This also copies `relpath` and `is_exec` to the `dvc.fs.utils`.

It turns out that diskcache.Disk saves passed arguments to the database and uses that all the time. So this means that we are not forward compatible as it breaks moving from newer version of DVC to an older one. We don't promise this, but this breaks our benchmarks which are probably not isolated enough. Regarding the use of cache, it is a bit iffy to extend it this way. We only need this type for error messages. We could consider removing this, but since the error message is documented and it's not too hard to extend this :), I decided to preserve the message.

Some of the APIs/FileSystems are reimported inside dvc.fs module.

…ng fobj_md5 Now HashedStreamReader will be used to calculate md5 in fobj_md5/hash_file. Also: * adds in `bytes_md5` that may be useful for testing. * adds a hasher-registry with hashlib * adds a `text=` kwarg to be able to specify to use dos2unix or not (useful for testing) * makes `localfs` the default filesystem when `fs` arg is not passed (makes it easier to test in the REPL)

This commit: * defines a `Ignore` protocol in `dvc.objects`, that expects a `find` method. This will be used in dvc-data and dvc-objects instead of depending on dvc.ignore. dvcignore is expected to respect this protocol, which it does. * renames `dvcignore` kwarg to `ignore` in `dvc-data` and `dvc-objects`. * teaches dvc.cli that `IgnoreInCollectedError` is a known exception and is logged correctly and adds a test for the behaviour. I am not happy that we need to hardcode `.dvcignore` in dvc-data but I don't see any easy way, since we always try to avoid staging if the tree contains .dvcignore file, even if .dvcignore is not passed.

transfer raises internal TransferError, merge raises internal MergeError and local ODB raises ODBError on unprotect. transfer's error is reraised to FileTransferError back in DataCloud.

This moves get_fs_config/get_fs_cloud and ODBManager to DVC as they require DVC related utilities, and we don't need them strictly in dvc-objects/dvc-data. This also adds support to ReferenceHashFile to load from any filesystem, using it's module name and filename. Now we can get rid of DVCFileSystem dependency in ReferenceHashFile. This is a temporary solution, we really need a registry. (I am not quite sure if this works in pyinstaller mode, but I think it should, it's how fsspec does it if I understand correctly.)

Now dvc.objects and dvc.data uses the implementation from dvc.objects for Tqdm progress bars. This is temporary. :)

Also introduces a `validate_status` kwarg in `data.transfer`, which is used in dvc to log for missing version info.

skshetry marked this pull request as draft May 19, 2022 14:24

efiop and others added 29 commits May 20, 2022 14:12

odb: move gc into a separate function

6bec294

utils: add decorator to wrap diskcache pickle errors

a75e4f5

output: move isexec to meta

f9fa305

With our staging process automatically capturing file metadata, we no longer need to set isexec ourselves. In followup PRs we will also use meta during checkout, so that we won't need to use Output.set_exec by hand.

dvcfs: detach from pipeline outputs

e97ad14

Introducing simple repo tree that combines all workspaces into one structure, and allows us to detach data representation from pipelines. Pre-requisite for dvc-data extraction and using relpaths in dvcfs/repofs.

fix passing of jobs kwarg

b5909c1

the `jobs` kwarg was not being passed down properly, resulting in it being ignored in some cases (push/pull/gc)

status: cleanup invocation of hashes_exist

c3f88f9

dvcfs: implement ls

2b51695

On our way to making dvcfs fsspec-compliant.

dvcfs: base on fsspec

e726c3e

repofs: use repo paths

01fdc09

repofs: migrate to fsspec

2f1b12a

fs: unify fs.base.FileSystem with FSSpecWrapper

753cb9c

Since all of our filesystems are fsspec-compliant, there's no need for compat wrapper.

Revert "dvc.objects: add callbacks for listing and estimating hashes"

62f4023

This reverts commit 43cc10381a02dd92caffd9b1dd41e0deb5a88ce6.

fs: rename getsize to size

f77e400

build: try fixing building package

fb01fd7

fs: add cat/cat_file/cat_ranges/pipe/pipe_file/touch methods

0dc4317

fs: remove unused CHUNK_SIZE var

7b409b9

There is no support for chunk_size kwarg in fs.open. Also, `upload_fobj` is now able to figure out proper chunk size from fileobj, so I don't think we need this on _upload_file.

fs.utils: add upload that can be conditionally atomic

ce32eb8

The PR also makes odb.add operations atomic, but the checkout is still not atomic.

Revert "fs.utils: add upload that can be conditionally atomic"

fbcdcc2

This reverts commit 1fb4a3a741558c1d7d54e42d54b3b91a38f3c836.

fs: remove upload function

b4314e3

fs.utils.transfer: remove tqdm progress bars

2ddca96

fs.callbacks: simplify, ensure None does not break them, lazy richc…

8c866f0

…allbacks

dvc.objects: move hash_info, state and meta into dvc.objects

bb01428

fs: rename RepoFileSystem to DvcFileSystem

14fcb47

dvc.fs: rename fs.utils to fs.generic

619364d

In the spirit of fsspec.generic, it is for cross-filesystem operations. Trying to avoid `fs.utils` as much as possible, but we have some functions that are really utilities coming, so I am making a place for them. :)

skshetry and others added 21 commits May 20, 2022 14:12

add typings to istextfile and system

981161d

Also moves umask to dvc.fs.system

dvc.objects: extend diskcache to handle pickle errors (#7734)

c63821b

dvc.objects: extend diskcache to handle pickle errors

dvc.objects: move hashing functions to dvc.objects

dd26b73

This commit moves `HashedStreamReader`/`get_hash_file`/`file_md5` into `dvc.objects.hash`. The `get_hash_file` has been renamed to `hash_file`. This also copies `relpath` and `is_exec` to the `dvc.fs.utils`.

fs.utils: copy dvc.utils.fs to dvc.fs.utils

5ba0be8

dvc.fs: move fs to dvc.objects.fs

488c9b4

Some of the APIs/FileSystems are reimported inside dvc.fs module.

fs: move filesystems to dvc.objects.fs.implementations

4f7cbf9

objects.hash: rename HashedStreamReader to HashStreamFile

6e9bcd1

objects.fs: copy executors to dvc.objects

3afb6e5

objects.fs: move schemes to objects.fs

166773b

dvc.objects: fix typing imports

d70f82b

dvc.objects: use own internal error instead of DvcException

343e02f

transfer raises internal TransferError, merge raises internal MergeError and local ODB raises ODBError on unprotect. transfer's error is reraised to FileTransferError back in DataCloud.

data.checkout: remove dependency on DVC; prompts and DvcException

d38cfd6

objects: use separate Tqdm implementations

bab3397

Now dvc.objects and dvc.data uses the implementation from dvc.objects for Tqdm progress bars. This is temporary. :)

data/objects: reduce log levels in dvc-objects and dvc-data

f824de3

Also introduces a `validate_status` kwarg in `data.transfer`, which is used in dvc to log for missing version info.

dvc: remove dvc.objects, fix imports

80d4a7a

fix formatting

d06d1f6

skshetry force-pushed the dvc-extract branch from ea15547 to d06d1f6 Compare May 20, 2022 08:29

skshetry marked this pull request as ready for review May 20, 2022 08:41

skshetry merged commit 89bf06b into main May 20, 2022

skshetry deleted the dvc-extract branch May 20, 2022 08:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

extract dvc-data from dvc #23

extract dvc-data from dvc #23

skshetry commented May 19, 2022 •

edited

extract dvc-data from dvc #23

extract dvc-data from dvc #23

Conversation

skshetry commented May 19, 2022 • edited

skshetry commented May 19, 2022 •

edited