extract dvc-data from dvc #23
Merged
Conversation
Initial simple separation to get the process of extracting the dvc-objects and dvc-data packages out of dvc started.

dvc-objects is for:
* defining the base Object (aka HashFile) and ObjectDB
* providing tools to work with those objects (check/transfer/etc)
* etc

dvc-data is for:
* defining higher-level objects (e.g. Tree)
* building objects from the workspace (aka staging)
* projecting objects back into the workspace (aka checkout)
* comparing Tree objects (aka diff)
* defining misc object databases (e.g. LocalObjectDB)
* higher-level transfer (e.g. with Tree indexes)
* etc

For example, the objects we use in the run-cache are based on dvc-objects and not on dvc-data, as they have absolutely nothing to do with data management. There are still some mixups in the code that will be addressed in the followups.
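A minimal sketch of the layering described above: a base `HashFile` object (dvc-objects level) and a `Tree` built on top of it (dvc-data level). The class shapes here are illustrative, not the actual dvc-objects/dvc-data APIs.

```python
from dataclasses import dataclass, field

@dataclass
class HashFile:
    # dvc-objects level: the base Object, a path plus its hash
    path: str
    hash_value: str

@dataclass
class Tree(HashFile):
    # dvc-data level: a higher-level object composed of base objects
    entries: dict = field(default_factory=dict)

    def add(self, key: str, obj: HashFile) -> None:
        self.entries[key] = obj

tree = Tree("data", "11aa.dir")
tree.add("foo", HashFile("data/foo", "22bb"))
```

Run-cache-style objects would build only on the `HashFile` layer, never on `Tree`.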
With our staging process automatically capturing file metadata, we no longer need to set isexec ourselves. In followup PRs we will also use meta during checkout, so that we won't need to use Output.set_exec by hand.
Introducing a simple repo tree that combines all workspaces into one structure and allows us to detach the data representation from pipelines. A prerequisite for dvc-data extraction and for using relpaths in dvcfs/repofs.
The `jobs` kwarg was not being passed down properly, resulting in it being ignored in some cases (push/pull/gc).
This removes the callback from the internal APIs `_list_paths`, `list_hashes`, and `_hashes_with_limit`. These are all iterators, so the callback can simply be replaced by wrapping a progress bar around the iterator on the caller side. We still have progress bars in `hashes_exist`, `_estimate_remote_size`, `_list_hashes_traverse`, and `list_hashes_exists`, which should gradually be lifted up and replaced.
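The caller-side pattern described above can be sketched like this. `list_hashes` is a stand-in for the real iterator API, and the counter stands in for the `Tqdm` wrapper dvc actually uses:

```python
from typing import Iterable, Iterator

def list_hashes() -> Iterator[str]:
    # stand-in for an internal iterator API that no longer takes a callback
    yield from ("abc123", "def456", "0a1b2c")

def with_progress(it: Iterable[str]) -> Iterator[str]:
    # in dvc this would be Tqdm(it, ...); a plain counter keeps the sketch
    # dependency-free
    for n, item in enumerate(it, start=1):
        print(f"processed {n}")
        yield item

hashes = list(with_progress(list_hashes()))
```

Progress reporting becomes the caller's concern, and the iterator APIs stay free of callback plumbing.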
On our way to making dvcfs fsspec-compliant.
Since all of our filesystems are fsspec-compliant, there's no need for the compat wrapper.
This PR gets rid of Tqdm progress bars inside `dvc.objects`. They are replaced with a multi-phase callback in `hashes_exist` and a simple callable that takes a `completed` value in `odb._estimate_remote_size()`. `odb.list_hashes_exists()` and `odb.all()` now return iterables, so the caller has the responsibility of handling progress. This is now handled in `dvc.data` by using a custom `Tqdm` progress bar. This is ugly and temporary, but my motivation is to remove tqdm from dvc.objects first. We can gradually get rid of these from `dvc.data` too, but that requires more knowledge about `dvc.data` and involves more design work. All in time, I guess. :)
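A sketch of the "simple callable that takes a `completed` value" shape mentioned above. The `_estimate_remote_size` body here is hypothetical, modeled only on the description:

```python
def _estimate_remote_size(hashes, update=None):
    # report progress through a plain callable instead of owning a Tqdm bar
    completed = 0
    for _ in hashes:
        completed += 1
        if update:
            update(completed)
    return completed

seen = []
total = _estimate_remote_size(["a", "b", "c"], update=seen.append)
```

The caller decides what `update` does (feed a Tqdm bar, log, or nothing), keeping `dvc.objects` free of any progress-bar dependency.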
This reverts commit 43cc10381a02dd92caffd9b1dd41e0deb5a88ce6.
This PR:
1. Simplifies `fs.upload`/`fs.download`/`fs.download_file`.
2. Removes unused kwargs.
3. Makes `download` use `dvc.utils.threadpool.ThreadPoolExecutor`, which supports canceling futures.
4. Makes `download` for directories support nested/branched callbacks and progress bars.
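The cancelable-futures idea can be sketched with the stdlib executor, which `dvc.utils.threadpool` wraps. `download_file` here is a hypothetical stand-in for a per-file transfer:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_file(path: str) -> str:
    # stand-in for a real transfer; raises to simulate a failed download
    if path == "bad":
        raise OSError(f"failed: {path}")
    return path

paths = ["a", "b", "c"]
results = []
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = {pool.submit(download_file, p): p for p in paths}
    try:
        for fut in as_completed(futures):
            results.append(fut.result())
    except OSError:
        # on the first failure, cancel anything not yet started
        for fut in futures:
            fut.cancel()
```

Canceling pending futures on error stops a large directory download early instead of draining the whole queue.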
There is no support for the `chunk_size` kwarg in `fs.open`. Also, `upload_fobj` is now able to figure out the proper chunk size from the fileobj, so I don't think we need this on `_upload_file`.
The PR also makes odb.add operations atomic, but the checkout is still not atomic.
This reverts commit 1fb4a3a741558c1d7d54e42d54b3b91a38f3c836.
In the spirit of fsspec.generic, this module is for cross-filesystem operations. I am trying to avoid `fs.utils` as much as possible, but we have some functions coming that are really utilities, so I am making a place for them. :)
Also moves umask to dvc.fs.system
dvc.objects: extend diskcache to handle pickle errors
When staging a directory, always save a "raw dir object" to the odb. If the corresponding ".dir" object has not been added to the odb, `stage()` calls can load the tree from the raw dir object instead of rebuilding it by walking the directory. This can lead to significant speed improvements when calling `dvc status` for modified directories. Fixes #7390
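The raw-dir-object shortcut above amounts to caching the built tree keyed by the directory's raw object hash, so a later `stage()` can load it instead of re-walking the directory. Names below are illustrative, not the dvc API:

```python
# hypothetical cache of "raw dir objects" keyed by their hash
_raw_dir_cache = {}

def stage(dir_hash: str, build_tree):
    # load the previously saved tree if we have it; otherwise build it
    # by walking the directory (the expensive path)
    tree = _raw_dir_cache.get(dir_hash)
    if tree is None:
        tree = build_tree()
        _raw_dir_cache[dir_hash] = tree
    return tree

calls = []
def build():
    calls.append(1)
    return {"file": "md5"}

first = stage("raw123", build)
second = stage("raw123", build)  # served from cache; build() not re-run
```

This is why `dvc status` on a modified directory can skip the full workspace walk.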
This commit moves `HashedStreamReader`/`get_hash_file`/`file_md5` into `dvc.objects.hash`. The `get_hash_file` has been renamed to `hash_file`. This also copies `relpath` and `is_exec` to the `dvc.fs.utils`.
It turns out that diskcache.Disk saves the passed arguments to the database and uses them from then on. This means we are not forward compatible: it breaks moving from a newer version of DVC to an older one. We don't promise this, but it breaks our benchmarks, which are probably not isolated enough. Regarding the use of cache, it is a bit iffy to extend it this way. We only need this type for error messages. We could consider removing it, but since the error message is documented and it's not too hard to extend this :), I decided to preserve the message.
Some of the APIs/filesystems are re-imported inside the dvc.fs module.
…ng fobj_md5

Now HashedStreamReader will be used to calculate the md5 in fobj_md5/hash_file.

Also:
* adds `bytes_md5`, which may be useful for testing
* adds a hasher registry backed by hashlib
* adds a `text=` kwarg to be able to specify whether to use dos2unix or not (useful for testing)
* makes `localfs` the default filesystem when the `fs` arg is not passed (makes it easier to test in the REPL)
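The `text=` behaviour can be sketched like this: with text hashing, CRLF line endings are normalized (dos2unix) before computing the md5. The `bytes_md5`/`dos2unix` bodies below are illustrative, modeled only on the names mentioned above:

```python
import hashlib

def dos2unix(data: bytes) -> bytes:
    # normalize Windows line endings before hashing
    return data.replace(b"\r\n", b"\n")

def bytes_md5(data: bytes, text: bool = False) -> str:
    if text:
        data = dos2unix(data)
    return hashlib.md5(data).hexdigest()

# with text=True, CRLF and LF content hash identically
same = bytes_md5(b"a\r\nb", text=True) == bytes_md5(b"a\nb", text=True)
```

In binary mode the two inputs keep distinct hashes, which is why the kwarg is useful for testing both paths.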
This commit:
* defines an `Ignore` protocol in `dvc.objects` that expects a `find` method. This will be used in dvc-data and dvc-objects instead of depending on dvc.ignore. dvcignore is expected to respect this protocol, which it does.
* renames the `dvcignore` kwarg to `ignore` in `dvc-data` and `dvc-objects`.
* teaches dvc.cli that `IgnoreInCollectedError` is a known exception so it is logged correctly, and adds a test for the behaviour.

I am not happy that we need to hardcode `.dvcignore` in dvc-data, but I don't see any easy way around it, since we always try to avoid staging if the tree contains a .dvcignore file, even if .dvcignore is not passed.
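A protocol expecting only a `find` method can be sketched with `typing.Protocol`; dvcignore would then satisfy it structurally, with no import of dvc.ignore. The `NullIgnore` and `collect` names below are hypothetical:

```python
from typing import Iterator, Protocol

class Ignore(Protocol):
    # the only method dvc-data/dvc-objects would rely on
    def find(self, fs, path: str) -> Iterator[str]: ...

class NullIgnore:
    # trivial implementation: nothing is ignored
    def find(self, fs, path: str) -> Iterator[str]:
        yield from fs

def collect(fs, path: str, ignore: Ignore) -> list:
    # a consumer that only knows about the protocol, not dvcignore
    return list(ignore.find(fs, path))

files = collect(["a.txt", "b.txt"], "/", NullIgnore())
```

Structural typing keeps the dependency direction clean: dvc provides the dvcignore implementation, the extracted packages only know the shape.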
transfer raises an internal TransferError, merge raises an internal MergeError, and the local ODB raises ODBError on unprotect. transfer's error is re-raised as FileTransferError back in DataCloud.
This moves get_fs_config/get_fs_cloud and ODBManager to DVC, as they require DVC-related utilities and we don't strictly need them in dvc-objects/dvc-data. This also adds support for ReferenceHashFile to load from any filesystem, using its module name and filename. Now we can get rid of the DVCFileSystem dependency in ReferenceHashFile. This is a temporary solution; we really need a registry. (I am not quite sure if this works in pyinstaller mode, but I think it should; it's how fsspec does it, if I understand correctly.)
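The registry-free loading described above amounts to persisting a class's module and name, then re-importing it when loading the reference. This sketch mirrors the general technique, not ReferenceHashFile's actual fields (it uses a stdlib class purely as a demonstration):

```python
import importlib
import pathlib

def dump_fs(fs_cls) -> dict:
    # persist enough to find the class again later
    return {"module": fs_cls.__module__, "name": fs_cls.__qualname__}

def load_fs(info: dict):
    # re-import the module and look the class back up by name
    module = importlib.import_module(info["module"])
    return getattr(module, info["name"])

info = dump_fs(pathlib.PurePosixPath)
cls = load_fs(info)
```

This is essentially what fsspec does for its protocol classes; a proper registry would replace the dynamic import with an explicit mapping.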
Now dvc.objects and dvc.data use the implementation from dvc.objects for Tqdm progress bars. This is temporary. :)
Also introduces a `validate_status` kwarg in `data.transfer`, which is used in dvc to log missing version info.
This was extracted from DVC at commit 7bc98e3fb3.