objects: migrate remote push/pull to objects.transfer #6308
Conversation
Still need to run dvc-bench tests.

Had a discussion with @efiop regarding potential follow-ups to this PR.
```diff
 if verify is None:
     verify = self.verify
 try:
-    self.check(hash_info, check_hash=self.verify)
+    self.check(hash_info, check_hash=verify)
     return
 except (ObjectFormatError, FileNotFoundError):
     pass
```
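The diff fixes `check` to use the resolved `verify` argument rather than the ODB-wide default. A minimal standalone sketch of this fallback pattern (the `ODB` class, `checked_with` attribute, and method bodies here are hypothetical illustrations, not DVC's actual implementation):

```python
class ObjectFormatError(Exception):
    pass

class ODB:
    """Minimal stand-in for the ODB class the diff above touches."""

    def __init__(self, verify=True):
        self.verify = verify
        self.checked_with = None  # records the last check_hash value

    def check(self, hash_info, check_hash=True):
        # a real implementation would verify the object on disk;
        # here we only record what we were asked to do
        self.checked_with = check_hash

    def add(self, hash_info, verify=None):
        # fall back to the ODB-wide default only when the caller
        # did not pass an explicit value
        if verify is None:
            verify = self.verify
        try:
            # the fix in the diff: pass the resolved `verify`, not
            # `self.verify`, so an explicit verify=False is honored
            self.check(hash_info, check_hash=verify)
            return
        except (ObjectFormatError, FileNotFoundError):
            pass
```

Before the fix, a caller passing `verify=False` would still have its object hash-checked, because `self.verify` was consulted instead of the resolved local value.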
Btw, discussed with @isidentical that for some filesystems (like hdfs and the future ssh) `upload_fobj` below will no longer be atomic, so we might need to upload to a temporary path here and then rename it into place. (There is an option to wrap fs calls to make them atomic, but that is error-prone and ugly.)
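The temp-path-and-rename approach described above can be sketched for a local filesystem as follows (a minimal illustration only; `atomic_upload_fobj` is a hypothetical helper name, not part of DVC's API):

```python
import os
import shutil
import uuid

def atomic_upload_fobj(fobj, to_path):
    """Upload a file object to to_path atomically: write to a
    temporary path first, then rename into place, so readers never
    observe a partially written file."""
    tmp_path = to_path + "." + uuid.uuid4().hex + ".tmp"
    try:
        with open(tmp_path, "wb") as out:
            shutil.copyfileobj(fobj, out)
        # os.replace is an atomic rename on POSIX (and on Windows
        # for same-volume destinations)
        os.replace(tmp_path, to_path)
    except BaseException:
        # clean up the partial temp file on failure
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

On remote filesystems, the same idea requires the backend to support a server-side rename/move, which is exactly the constraint discussed below.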
Seems to me like this might still need to be handled at the fs level; uploading to a temp path and renaming at the ODB level won't work for all of our filesystems (HTTP doesn't support move/rename).
@pmrowla That atomicity is not something the fs should care about when uploading/downloading; this is odb-level behavior.

> HTTP doesn't support move/rename

Are operations already atomic there, or does it just not support rename at all?
In the HTTP case, it's atomic: if the full POST/PUT request doesn't complete, the server should drop whatever was partially uploaded. And yeah, we don't support rename/move/copy at all, since there's no HTTP method for that operation (unless you're using an extension built on top of HTTP, like WebDAV).

It seems to me that both `_upload` and `_upload_fobj` should work the same way, and should both guarantee atomicity at the fs level, like how in localfs we do the explicit upload to a tempfile and rename for both `_upload` and `_upload_fobj`.
@pmrowla Thanks for clarifying!

`_upload_fobj` is temporary until the fsspec migration is complete and we can use `put/get[_file]` directly.

fs atomicity is unlikely to be guaranteed by all filesystems and might actually be undesirable in some use cases outside dvc (e.g. you might want to upload as much of a file as you can, or you might not care about atomicity and not want to waste an API call on a rename), so it seems like it could be more robust to do that in our odb layer (or fs wrapper after all?) for now.

Clearly, it would be useful to know whether or not particular fs operations are atomic, so that we waste as few API calls as possible. So maybe our `fsspec_wrapper` is indeed a pretty good place for it for now: similar to how, IIRC, C libraries have `atomic_*` functions, we could have something like `put_file` and `atomic_put_file`, or an `atomic=True` flag, or something. Maybe this could be useful for `fsspec` in general as well, not quite sure right now 🤔
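The `atomic=True` idea floated above could look roughly like this wrapper sketch (entirely hypothetical: `AtomicFSWrapper`, `supports_rename`, and the wrapped `put_file`/`move` method names are illustrative assumptions, not DVC's or fsspec's actual API):

```python
import uuid

class AtomicFSWrapper:
    """Hypothetical wrapper illustrating an opt-in atomic=True flag.

    Assumes the wrapped fs exposes put_file(from_path, to_path) and
    move(from_path, to_path); neither name is confirmed by this PR.
    """

    def __init__(self, fs, supports_rename=True):
        self.fs = fs
        self.supports_rename = supports_rename

    def put_file(self, from_path, to_path, atomic=False):
        if not atomic or not self.supports_rename:
            # e.g. HTTP: the request either completes or the server
            # drops the partial upload, so there is no rename to do
            # (and no extra API call wasted)
            return self.fs.put_file(from_path, to_path)
        # upload to a unique temp path, then rename into place
        tmp_path = f"{to_path}.{uuid.uuid4().hex}.tmp"
        self.fs.put_file(from_path, tmp_path)
        self.fs.move(tmp_path, to_path)
```

Callers that don't care about atomicity keep the single-call fast path, while callers that do pay for exactly one extra rename on filesystems that support it.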
Doesn't look like this PR changes the old behaviour, so it's probably not worth blocking it over this, but we'll definitely need to keep it in mind for the follow-ups.
```python
objs: Iterable["HashFile"],
name: Optional[str] = None,
index: Optional["ObjectDBIndexBase"] = None,
cache_odb: Optional["ObjectDB"] = None,
```
The `odb` passed into `status` will not always be where we want to load dir trees from (we want to load from the local cache rather than stream from the remote when possible).

For now this is just a performance optimization to keep parity with existing behavior, but the need for it in general will be more obvious once the full oid migration is done.
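The cache-first lookup motivating the `cache_odb` parameter can be sketched like this (a conceptual illustration only; `load_tree` and the `get` method are hypothetical names, not DVC's actual API):

```python
def load_tree(hash_info, odb, cache_odb=None):
    """Prefer loading a directory tree from the local cache ODB,
    falling back to streaming it from the (possibly remote) odb."""
    if cache_odb is not None:
        try:
            # local read: no network round-trip needed
            return cache_odb.get(hash_info)
        except FileNotFoundError:
            pass
    # fall back to the ODB that was passed into status
    return odb.get(hash_info)
```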
On my machine, the dvc-bench results for the current PR are comparable to master. As noted in #6308 (comment), there will be a follow-up PR that changes our object collection and status/transfer to use object IDs (hash infos) instead of objects.
❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏
- `objects.transfer` for transferring objects between two ODBs
  - `remote._process` functionality has been migrated into `objects.transfer`
  - `objects.save` is used as the underlying file transfer method (rather than directly calling `fs.upload` or `fs.download`)
- `objects.status` for comparing object status between two ODBs
  - `remote.status` functionality has been migrated to `objects.status.compare_status`
- `remote.index` functionality has been migrated into `objects.db.index`
- `dvc.state`
- `objects.transfer`
- `dvc.remote` is now removed
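A status comparison between two ODBs conceptually partitions the wanted object IDs by where they exist. The sketch below illustrates that idea only; the function shape, field names, and operating on plain OID sets are assumptions for illustration, not the signature of DVC's `objects.status.compare_status`:

```python
from typing import Iterable, NamedTuple, Set

class CompareStatus(NamedTuple):
    ok: Set[str]       # present in both ODBs
    missing: Set[str]  # present in neither ODB
    new: Set[str]      # only in the source ODB (would need a push)
    deleted: Set[str]  # only in the destination ODB

def compare_status(
    src_oids: Set[str], dest_oids: Set[str], wanted: Iterable[str]
) -> CompareStatus:
    wanted_set = set(wanted)
    in_src = wanted_set & src_oids
    in_dest = wanted_set & dest_oids
    return CompareStatus(
        ok=in_src & in_dest,
        missing=wanted_set - src_oids - dest_oids,
        new=in_src - in_dest,
        deleted=in_dest - in_src,
    )
```

A transfer then only needs to act on the `new` partition, which is why status and transfer pair naturally as `objects`-level operations.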