objects: migrate remote push/pull to objects.transfer #6308

pmrowla · 2021-07-13T06:53:03Z

- adds objects.transfer for transferring objects between two ODBs
  - old remote._process functionality has been migrated into objects.transfer
  - all transfers now use objects.save as the underlying file transfer method (rather than directly calling fs.upload or fs.download)
- adds objects.status for comparing object status between two ODBs
  - old remote.status functionality has been migrated to objects.status.compare_status
- old remote.index functionality has been migrated into objects.db.index
  - ODB index sits in its own layer separate from the actual ODB
  - index code has been migrated to diskcache instead of direct sqlite usage
  - legacy (now unused) sqlite code has been removed from dvc.state
- push/pull/fetch now use unified objects.transfer
- import now uses unified objects.transfer
- dvc.remote is now removed
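
As a rough illustration of the unified flow described above, the sketch below compares two object stores and copies only what the destination is missing. `ToyODB`, `compare_status`, and `transfer` are hypothetical stand-ins written for this example; they are not DVC's actual classes or signatures.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class ToyODB:
    """Hypothetical in-memory object database keyed by content hash."""

    objects: Dict[str, bytes] = field(default_factory=dict)

    def add(self, data: bytes) -> str:
        # Content-addressed: the object ID is derived from the data itself.
        oid = hashlib.sha256(data).hexdigest()
        self.objects[oid] = data
        return oid

    def exists(self, oid: str) -> bool:
        return oid in self.objects


def compare_status(src: ToyODB, dest: ToyODB, oids: Iterable[str]) -> Set[str]:
    """Return the IDs that exist in src but are missing from dest."""
    return {oid for oid in oids if src.exists(oid) and not dest.exists(oid)}


def transfer(src: ToyODB, dest: ToyODB, oids: Iterable[str]) -> int:
    """Copy missing objects from src to dest; return how many were copied."""
    missing = compare_status(src, dest, oids)
    for oid in missing:
        dest.add(src.objects[oid])
    return len(missing)
```

In this toy model, push and pull are the same call with the source and destination swapped, which is the point of unifying them behind one transfer function.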

pmrowla · 2021-07-16T09:01:25Z

Still need to run dvc-bench tests

pmrowla · 2021-07-16T10:44:10Z

Had a discussion with @efiop regarding potential follow-ups to this PR:

- objects.transfer should operate on object IDs rather than objects (since they are already essentially treated as naive object IDs anyway)
  - for handling trees, the tree contents can be lazily loaded during transfer as needed instead of passing loaded tree objects into transfer
- eventually objects.save usage should be replaced by transferring from staging into the destination ODB
  - all staged files/dirs should have actual entries in the staging ODB, as opposed to the current behavior where we only "stage" trees
  - staging/staged objects should potentially work like a reference filesystem
  - loading the object ID from the staging ODB would then open/read the file data from the referenced location
  - i.e. git imports could then be loaded (or transferred) from the staging ODB like any other object, and the actual file data would be read from the git/scm fs when necessary
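
One possible reading of the "reference filesystem" idea above is sketched here: the staging store records only where each object's data lives and reads it on demand. `ReferenceStagingODB` and its methods are hypothetical names made up for this illustration, not part of DVC, and the sketch ignores the case where a referenced file changes after staging.

```python
import hashlib
from typing import Dict


class ReferenceStagingODB:
    """Hypothetical staging store that maps object IDs to source paths."""

    def __init__(self) -> None:
        self._refs: Dict[str, str] = {}

    def stage(self, path: str) -> str:
        # Hash the file once to obtain its object ID, but keep only a
        # reference to the original location rather than a copy of the data.
        with open(path, "rb") as fobj:
            oid = hashlib.sha256(fobj.read()).hexdigest()
        self._refs[oid] = path
        return oid

    def read(self, oid: str) -> bytes:
        # Data is read from the referenced location only when requested.
        with open(self._refs[oid], "rb") as fobj:
            return fobj.read()
```

A transfer could then call `read(oid)` on the staging store exactly as it would on any other source, without caring where the bytes actually come from.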

efiop · 2021-07-16T12:39:22Z

dvc/objects/db/base.py

+
+        if verify is None:
+            verify = self.verify
        try:
-            self.check(hash_info, check_hash=self.verify)
+            self.check(hash_info, check_hash=verify)
            return
        except (ObjectFormatError, FileNotFoundError):
            pass


Btw, discussed with @isidentical that for some filesystems (like hdfs and future ssh) upload_fobj down below will no longer be atomic, so we might need to upload to a temporary path here and then just rename into place. (There is an option to wrap fs calls to make them atomic, but that is error-prone and ugly.)

Seems to me like this might still need to be handled at the fs level. Uploading to a temp path and renaming at the ODB level won't work for all of our filesystems (HTTP doesn't support move/rename).

@pmrowla That atomicity is not something that fs should care about when uploading/downloading, this is an odb-level behavior.

> HTTP doesn't support move/rename

Are operations already atomic there? Or does it just not support rename at all?

In the HTTP case, it's atomic: if the full POST/PUT request isn't completed, the server should drop whatever was partially uploaded. And yeah, we don't support rename/move/copy at all, since there's no HTTP method for that operation (unless you're using an extension built on top of HTTP, like WebDAV).

It seems to me that both _upload and _upload_fobj should work the same way, and should both guarantee atomicity at the fs level - like how in localfs we do the explicit upload to tempfile and rename for both _upload and _upload_fobj
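
For reference, the "upload to a tempfile and rename" pattern mentioned for localfs might look roughly like the sketch below. `atomic_upload_fobj` is a hypothetical helper written for this illustration (not DVC's `_upload_fobj`), and it assumes the temporary file and the destination are on the same local filesystem so that `os.replace` can swap the file in as a single step.

```python
import os
import tempfile
from typing import BinaryIO


def atomic_upload_fobj(
    fobj: BinaryIO, to_path: str, chunk_size: int = 1024 * 1024
) -> None:
    """Copy fobj to to_path so readers never observe a partial file."""
    dest_dir = os.path.dirname(os.path.abspath(to_path))
    os.makedirs(dest_dir, exist_ok=True)
    # Create the temp file next to the destination so the final rename
    # stays within one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            while True:
                chunk = fobj.read(chunk_size)
                if not chunk:
                    break
                tmp.write(chunk)
        # Move the finished file into place in one operation.
        os.replace(tmp_path, to_path)
    except BaseException:
        # Clean up the partial temp file if the copy or rename failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the read or write fails midway, the destination path is left untouched and only the temporary file is removed.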

@pmrowla Thanks for clarifying!

_upload_fobj is temporary until fsspec migration is complete and we can use put/get[_file] directly.

fs atomicity is unlikely to be guaranteed by all filesystems and might actually be undesirable in some use cases outside dvc (e.g. you might want to upload as much of a file as you can, or you might not care about atomicity so you might not want to waste an API call for rename), so it seems like it could be more robust if we do that in our odb layer (or fs wrapper after all?) for now.

Clearly, it seems like it would be useful to know whether or not particular fs operations are atomic so that we could waste as few API calls as possible, so maybe our fsspec_wrapper is indeed a pretty good place for it for now. Similar to how, IIRC, in C libraries you have atomic_* functions, we could have something like put_file and atomic_put_file, or atomic=True, or something. Maybe this could be useful for fsspec in general as well, not quite sure right now 🤔
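
A minimal sketch of the `atomic=True` idea, assuming each filesystem can declare whether its plain upload is already atomic. Everything here (`ToyFS`, `ToyAtomicFS`, the `atomic_put` attribute, and this `put_file` wrapper) is hypothetical and only meant to show how the extra rename call could be skipped when it is not needed; it is not fsspec's or DVC's API.

```python
from typing import Dict


class ToyFS:
    """Hypothetical filesystem that counts API calls."""

    atomic_put = False  # plain put_file may leave a partial file behind

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}
        self.calls = 0

    def put_file(self, data: bytes, path: str) -> None:
        self.calls += 1
        self.files[path] = data

    def mv(self, src: str, dest: str) -> None:
        self.calls += 1
        self.files[dest] = self.files.pop(src)


class ToyAtomicFS(ToyFS):
    """Hypothetical filesystem whose uploads are atomic on their own."""

    atomic_put = True


def put_file(fs: ToyFS, data: bytes, path: str, atomic: bool = False) -> None:
    """Upload data, paying for temp-path + rename only when required."""
    if not atomic or fs.atomic_put:
        # Either the caller does not care, or the fs already guarantees it.
        fs.put_file(data, path)
        return
    tmp_path = f"{path}.tmp"
    fs.put_file(data, tmp_path)
    fs.mv(tmp_path, path)
```

With this shape, a caller that asks for `atomic=True` pays two calls on `ToyFS` but only one on `ToyAtomicFS`.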

Doesn't look like this PR is changing the old behaviour, so probably not worth blocking it because of it, but we'll def need to keep this in mind for the followups.

dvc/objects/db/index.py

pmrowla · 2021-07-20T08:38:24Z

dvc/objects/status.py

+    objs: Iterable["HashFile"],
+    name: Optional[str] = None,
+    index: Optional["ObjectDBIndexBase"] = None,
+    cache_odb: Optional["ObjectDB"] = None,


odb passed into status will not always be where we want to load dir trees from (we want to load from local cache and not stream from the remote when possible)

For now this is just a performance optimization to keep parity with existing behavior, but the need for this in general will be more obvious once the full oid migration is done
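
The fallback described above could be sketched as follows: try the local cache first and only read from the ODB being checked when the cache does not have the tree. `ToyTreeStore` and this `load_tree` helper are hypothetical illustrations, not the code added in this PR.

```python
from typing import Dict, List, Optional


class ToyTreeStore:
    """Hypothetical store of directory trees keyed by object ID."""

    def __init__(self, trees: Optional[Dict[str, List[str]]] = None) -> None:
        self.trees: Dict[str, List[str]] = dict(trees or {})
        self.reads = 0  # counts lookups, to show which store was hit

    def load_tree(self, oid: str) -> List[str]:
        self.reads += 1
        try:
            return self.trees[oid]
        except KeyError:
            raise FileNotFoundError(oid) from None


def load_tree(
    oid: str, odb: ToyTreeStore, cache_odb: Optional[ToyTreeStore] = None
) -> List[str]:
    """Prefer the local cache; fall back to the (possibly remote) odb."""
    if cache_odb is not None:
        try:
            return cache_odb.load_tree(oid)
        except FileNotFoundError:
            pass  # tree is not cached locally, fall through to odb
    return odb.load_tree(oid)
```

When the tree is present in the cache, the remote store is never touched, which is the performance behavior the comment is trying to preserve.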

pmrowla · 2021-07-20T08:40:30Z

On my machine, the dvc-bench results for the current PR are comparable to master. As noted in #6308 (comment), there will be a follow-up PR that changes our object collection and status/transfer to use object IDs (hash infos) instead of objects.

pmrowla added the refactoring label Jul 13, 2021

pmrowla self-assigned this Jul 13, 2021

pmrowla added this to In progress in DVC 13 July - 26 July 2021 via automation Jul 13, 2021

pmrowla added this to In progress in DVC 29 June - 12 July 2021 via automation Jul 13, 2021

pmrowla moved this from In progress to Done in DVC 29 June - 12 July 2021 Jul 13, 2021

pmrowla force-pushed the odb-transfer branch 2 times, most recently from 5a69b58 to f8c0799 Compare July 16, 2021 09:00

pmrowla marked this pull request as ready for review July 16, 2021 09:01

pmrowla requested a review from a team as a code owner July 16, 2021 09:01

pmrowla requested a review from isidentical July 16, 2021 09:01

pmrowla force-pushed the odb-transfer branch from 2f06eef to 2e499d2 Compare July 16, 2021 09:27

efiop reviewed Jul 16, 2021

efiop mentioned this pull request Jul 18, 2021

push: prevent crash while using sqlite db on windows network share #6191

Closed

2 tasks

pmrowla added 15 commits July 19, 2021 14:46

objects: migrate remote status to objects.status

60a055e

data_cloud: use objects.status

9383fd2

objects: migrate remote._process to objects.transfer

87c8947

data_cloud: use objects.transfer

9ac7e97

objects: handle memfs staging objects in transfer

885dad7

fetch/import: use unified transfer()

de8dae4

update get_remote and transfer exception usage

13e9345

update tests

be9e9fb

objects.transfer: use src ODB verify rules after xfer

214a9e7

odb: move index from remote into odb

4bbd6af

objects.transfer: skip src status check when possible

b6b655b

update tests

2713584

remove dvc.remote

00743d9

odb.index: migrate to diskcache

1479b8a

objects: fix status/transfer optimization

5f8b895

pmrowla added 4 commits July 19, 2021 14:47

update index usage

84bc217

state: remove dead sqlite related code

9a8686d

update cloud tests

aef3b83

drop unnecessary remove in tests

be33b6b

pmrowla force-pushed the odb-transfer branch from f8ad8b3 to be33b6b Compare July 19, 2021 05:47

efiop reviewed Jul 19, 2021

pmrowla added 2 commits July 20, 2021 14:21

use abstract base class in odb index

857fe5b

status: load trees from local cache instead of remote odb when possible

905a580

pmrowla commented Jul 20, 2021

pmrowla moved this from In progress to Review in progress in DVC 13 July - 26 July 2021 Jul 20, 2021

pmrowla changed the title ~~[WIP] objects: migrate remote push/pull to objects.transfer~~ objects: migrate remote push/pull to objects.transfer Jul 20, 2021

handle status when dir cache is explicitly removed

99b1654

DVC 13 July - 26 July 2021 automation moved this from Review in progress to Reviewer approved Jul 20, 2021

efiop approved these changes Jul 20, 2021

efiop merged commit 3d84df9 into iterative:master Jul 20, 2021

DVC 13 July - 26 July 2021 automation moved this from Reviewer approved to Done Jul 20, 2021

pmrowla deleted the odb-transfer branch July 20, 2021 10:41

efiop mentioned this pull request Jul 22, 2021

output: cache loaded tree objects #6301

Merged

2 tasks
