Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extract dvc-data from dvc #23

Merged
merged 50 commits into from May 20, 2022
Merged

extract dvc-data from dvc #23

merged 50 commits into from May 20, 2022

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented May 19, 2022

This was extracted out from DVC from commit: 7bc98e3fb3.

@skshetry skshetry marked this pull request as draft May 19, 2022 14:24
efiop and others added 29 commits May 20, 2022 14:12
Initial simple separation to get the process of extracting dvc-objects
and dvc-data packages out of dvc.

dvc-objects is for:
 * defining base Object (aka HashFile) and ObjectDB
 * providing tools to work with those objects (check/transfer/etc)
 * etc

dvc-data is for:
 * defining higher level objects (e.g. Tree)
 * bulding objects from the workspace (aka staging)
 * projecting objects back into the workspace (aka checkout)
 * comparing Tree objects (aka diff)
 * defining misc object databases (e.g LocalObjectDB)
 * higher-level transfer (e.g. with Tree indexes)
 * etc

For example, the objects we use in run-cache are based on dvc-objects
and not on dvc-data, as they have absolutely nothing to do with data
management.

There are still some mixups in the code that will be addressed in the
followups.
With our staging process automatically capturing file metadata, we no longer
need to set isexec ourselves.

In followup PRs we will also use meta during checkout, so that we won't need
to use Output.set_exec by hand.
Introducing simple repo tree that combines all workspaces into one structure,
and allows us to detach data representation from pipelines.

Pre-requisite for dvc-data extraction and using relpaths in dvcfs/repofs.
the `jobs` kwarg was not being passed down properly,
resulting in it being ignored in some cases (push/pull/gc)
This removes callback from internal APIs: `_list_paths`, `list_hashes`,
and `_hashes_with_limit`. These are all iterators and hence can just be
replaced by wrapping a progressbar over those iterators in the caller
side.

We still have progressbars in `hashes_exist`, `_estimate_remote_size`,
`_list_hashes_traverse` and `list_hashes_exists`, which should gradually
lifted up and replaced.
On our way to making dvcfs fsspec-compliant.
Since all of our filesystems are fsspec-compliant, there's no need for compat wrapper.
This PR gets rid of Tqdm progress bars inside `dvc.objects`.
Those are replaced with multi-phase callback in `hashes_exist`, and
simple callable that takes `completed` value in
`odb._estimate_remote_size()`.

The `odb.list_hashes_exists()` and `odb.all()` now return iterables,
hence the caller has the responsibility of handling progress.

This is now handled in `dvc.data` by using a custom `Tqdm` progressbar.
This is ugly and is temporary, but my motivation is to remove tqdm from dvc.objects
first. We can gradually get rid of these from `dvc.data` too, but
that requires more knowledge about `dvc.data` and involves more
design work. All in time, I guess. :)
This reverts commit 43cc10381a02dd92caffd9b1dd41e0deb5a88ce6.
1. This PR simplifies `fs.upload`/`fs.download`/`fs.download_file`.
2. Removes unused kwargs.
3. `download` uses `dvc.utils.threadpool.ThreadPoolExecutor` which
   supports canceling futures.
4. `download` for directories supports nested/branched callbacks
   and progress bars.
This PR gets rid of Tqdm progress bars inside `dvc.objects`.
Those are replaced with multi-phase callback in `hashes_exist`, and
simple callable that takes `completed` value in
`odb._estimate_remote_size()`.

The `odb.list_hashes_exists()` and `odb.all()` now return iterables,
hence the caller has the responsibility of handling progress.

This is now handled in `dvc.data` by using a custom `Tqdm` progressbar.
This is ugly and is temporary, but my motivation is to remove tqdm from dvc.objects
first. We can gradually get rid of these from `dvc.data` too, but
that requires more knowledge about `dvc.data` and involves more
design work. All in time, I guess. :)
There is no support for chunk_size kwarg in fs.open.
Also, `upload_fobj` is now able to figure out proper
chunk size from fileobj, so I don't think we need this on _upload_file.
The PR also makes odb.add operations atomic, but the checkout is still
not atomic.
This reverts commit 1fb4a3a741558c1d7d54e42d54b3b91a38f3c836.
In the spirit of fsspec.generic, it is for cross-filesystem operations.
Trying to avoid `fs.utils` as much as possible, but we have some
functions that are really utilities coming, so I am making a place
for them. :)
skshetry and others added 21 commits May 20, 2022 14:12
Also moves umask to dvc.fs.system
dvc.objects: extend diskcache to handle pickle errors
When staging a directory, always save a "raw dir object" to the odb.
If the corresponding ".dir" object has not been added to the odb,
`stage()` calls can load the tree from the raw dir object instead of
rebuilding it by walking the directory.

This can lead to significant speed improvements when calling
`dvc status` for modified directories.

Fixes #7390
This commit moves `HashedStreamReader`/`get_hash_file`/`file_md5`
into `dvc.objects.hash`.

The `get_hash_file` has been renamed to `hash_file`.
This also copies `relpath` and `is_exec` to the `dvc.fs.utils`.
It turns out that diskcache.Disk saves passed arguments to the database
and uses that all the time. So this means that we are not forward
compatible as it breaks moving from newer version of DVC to an older
one. We don't promise this, but this breaks our benchmarks which
are probably not isolated enough.

Regarding the use of cache, it is a bit iffy to extend it this way.
We only need this type for error messages. We could consider removing
this, but since the error message is documented and it's not too hard
to extend this :), I decided to preserve the message.
Some of the APIs/FileSystems are reimported inside dvc.fs module.
…ng fobj_md5

Now HashedStreamReader will be used to calculate md5 in fobj_md5/hash_file.

Also:
* adds in `bytes_md5` that may be useful for testing.
* adds a hasher-registry with hashlib
* adds a `text=` kwarg to be able to specify to use dos2unix or not
  (useful for testing)
* makes `localfs` the default filesystem when `fs` arg is not passed
  (makes it easier to test in the REPL)
This commit:
* defines a `Ignore` protocol in `dvc.objects`, that expects a `find` method.
  This will be used in dvc-data and dvc-objects instead of depending on dvc.ignore.
  dvcignore is expected to respect this protocol, which it does.
* renames `dvcignore` kwarg to `ignore` in `dvc-data` and `dvc-objects`.
* teaches dvc.cli that `IgnoreInCollectedError` is a known exception
   and is logged correctly and adds a test for the behaviour.

I am not happy that we need to hardcode `.dvcignore` in dvc-data
but I don't see any easy way, since we always try to avoid
staging if the tree contains .dvcignore file, even if .dvcignore
is not passed.
transfer raises internal TransferError, merge raises internal MergeError
and local ODB raises ODBError on unprotect.
transfer's error is reraised to FileTransferError back in DataCloud.
This moves get_fs_config/get_fs_cloud and ODBManager to DVC
as they require DVC related utilities, and we don't need them
strictly in dvc-objects/dvc-data.
This also adds support to ReferenceHashFile to load from
any filesystem, using it's module name and filename.
Now we can get rid of DVCFileSystem dependency in
ReferenceHashFile.
This is a temporary solution, we really need a registry.
(I am not quite sure if this works in pyinstaller mode,
but I think it should, it's how fsspec does it if I understand
correctly.)
Now dvc.objects and dvc.data uses the implementation from dvc.objects
for Tqdm progress bars. This is temporary. :)
Also introduces a `validate_status` kwarg in `data.transfer`,
which is used in dvc to log for missing version info.
@skshetry skshetry marked this pull request as ready for review May 20, 2022 08:41
@skshetry skshetry merged commit 89bf06b into main May 20, 2022
@skshetry skshetry deleted the dvc-extract branch May 20, 2022 08:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants