objects: migrate remote push/pull to objects.transfer #6308

Merged
merged 22 commits into from Jul 20, 2021

Conversation

pmrowla
Contributor

@pmrowla pmrowla commented Jul 13, 2021

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

  • adds objects.transfer for transferring objects between two ODBs
    • old remote._process functionality has been migrated into objects.transfer
    • all transfers now use objects.save as the underlying file transfer method (rather than directly calling fs.upload or fs.download)
  • adds objects.status for comparing object status between two ODBs
    • old remote.status functionality has been migrated to objects.status.compare_status
  • old remote.index functionality has been migrated into objects.db.index
    • ODB index sits in its own layer separate from the actual ODB
    • index code has been migrated to diskcache instead of direct sqlite usage
    • legacy (now unused) sqlite code has been removed from dvc.state
  • push/pull/fetch now use unified objects.transfer
  • import now uses unified objects.transfer
  • dvc.remote is now removed
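
The status-then-transfer flow described in the list above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyODB`, `compare_status`, and `transfer` below are hypothetical in-memory stand-ins written for illustration, not DVC's actual `objects.status` or `objects.transfer` code or signatures.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class ToyODB:
    """Hypothetical in-memory object database: object ID -> raw bytes."""

    objects: Dict[str, bytes] = field(default_factory=dict)

    def exists(self, oid: str) -> bool:
        return oid in self.objects

    def save(self, oid: str, data: bytes) -> None:
        # Single write path used by every transfer in this sketch.
        self.objects[oid] = data


def compare_status(
    src: ToyODB, dest: ToyODB, oids: Iterable[str]
) -> Dict[str, Set[str]]:
    """Classify each object ID by which of the two ODBs contains it."""
    status: Dict[str, Set[str]] = {
        "ok": set(), "new": set(), "deleted": set(), "missing": set()
    }
    for oid in oids:
        in_src, in_dest = src.exists(oid), dest.exists(oid)
        if in_src and in_dest:
            status["ok"].add(oid)
        elif in_src:
            status["new"].add(oid)       # only in src: needs transfer
        elif in_dest:
            status["deleted"].add(oid)   # only in dest
        else:
            status["missing"].add(oid)   # in neither ODB
    return status


def transfer(src: ToyODB, dest: ToyODB, oids: Iterable[str]) -> int:
    """Copy objects that exist in src but not in dest; return the count."""
    to_copy = compare_status(src, dest, list(oids))["new"]
    for oid in sorted(to_copy):
        dest.save(oid, src.objects[oid])
    return len(to_copy)
```

In this model, a push is `transfer(local, remote, oids)` and a pull or fetch is the same call with the two ODBs swapped, which is the sense in which one transfer routine can serve all of those commands.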

@pmrowla pmrowla added the refactoring Factoring and re-factoring label Jul 13, 2021
@pmrowla pmrowla self-assigned this Jul 13, 2021
@pmrowla pmrowla added this to In progress in DVC 13 July - 26 July 2021 via automation Jul 13, 2021
@pmrowla pmrowla added this to In progress in DVC 29 June - 12 July 2021 via automation Jul 13, 2021
@pmrowla pmrowla moved this from In progress to Done in DVC 29 June - 12 July 2021 Jul 13, 2021
@pmrowla pmrowla force-pushed the odb-transfer branch 2 times, most recently from 5a69b58 to f8c0799 Compare July 16, 2021 09:00
@pmrowla pmrowla marked this pull request as ready for review July 16, 2021 09:01
@pmrowla pmrowla requested a review from a team as a code owner July 16, 2021 09:01
@pmrowla pmrowla requested a review from isidentical July 16, 2021 09:01
@pmrowla
Contributor Author

pmrowla commented Jul 16, 2021

Still need to run dvc-bench tests

@pmrowla
Contributor Author

pmrowla commented Jul 16, 2021

Had discussion with @efiop regarding potential follow ups to this PR:

  • objects.transfer should operate on object IDs rather than objects (since they are already essentially treated as naive object IDs anyway)
    • for handling trees, the tree contents can be lazily loaded during transfer as needed instead of passing loaded tree objects into transfer
  • eventually objects.save usage should be replaced by transferring from staging into the destination ODB
    • all staged files/dirs should have actual entries in the staging ODB, as opposed to the current behavior where we only "stage" trees
    • staging/staged objects should potentially work like a reference filesystem
    • loading the object ID from the staging ODB would then open/read the file data from referenced location
    • i.e. git imports could then be loaded (or transferred) from the staging ODB like any other object, and the actual file data would be read from the git/scm fs when necessary
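
A rough sketch of the first follow-up idea, transferring by object ID and loading tree contents lazily, might look like the following. Everything here is an assumption made for illustration: the ODBs are plain dicts, a tree is stored as a JSON list of child object IDs, and a `.dir` suffix marks a tree ID. None of this is the design the follow-up PR actually adopted.

```python
import json
from typing import Dict, Iterable, Iterator, List

DIR_SUFFIX = ".dir"  # assumption: tree object IDs carry this suffix


def iter_oids(src: Dict[str, bytes], oids: Iterable[str]) -> Iterator[str]:
    """Expand tree IDs into their children, loading each tree on demand."""
    for oid in oids:
        if oid.endswith(DIR_SUFFIX):
            # The tree is read from src only when the transfer reaches it,
            # instead of being loaded up front by the caller.
            for child in json.loads(src[oid]):
                yield child
        # Yield the tree itself after its children, so the destination
        # never holds a tree whose contents have not been copied yet.
        yield oid


def transfer_oids(
    src: Dict[str, bytes], dest: Dict[str, bytes], oids: Iterable[str]
) -> List[str]:
    """Copy every object ID (and tree child) missing from dest."""
    copied = []
    for oid in iter_oids(src, oids):
        if oid not in dest:
            dest[oid] = src[oid]
            copied.append(oid)
    return copied
```

The caller passes only IDs such as `["t.dir"]`; the loaded tree object never has to be constructed outside the transfer routine.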

Comment on lines +89 to 96

      if verify is None:
          verify = self.verify
      try:
  -       self.check(hash_info, check_hash=self.verify)
  +       self.check(hash_info, check_hash=verify)
          return
      except (ObjectFormatError, FileNotFoundError):
          pass
Member

Btw, discussed with @isidentical that for some filesystems (like hdfs and the future ssh) upload_fobj down below will no longer be atomic, so we might need to use a temporary path here and then just rename into place. (There is an option to wrap fs calls to make them atomic, but that is error-prone and ugly.)

Contributor Author

Seems to me like this might still need to be handled at the fs level. Uploading to a temp path and renaming at the ODB level won't work for all of our filesystems (HTTP doesn't support move/rename).

Member

@pmrowla That atomicity is not something that fs should care about when uploading/downloading, this is an odb-level behavior.

> HTTP doesn't support move/rename

Are operations already atomic there? Or does it just not support rename at all?

Contributor Author

In the HTTP case, it's atomic since the full POST/PUT request wouldn't be completed so the server should drop whatever was partially uploaded. And yeah, we don't support rename/move/copy at all since there's no HTTP method for that operation (unless you're using an extension built on top of HTTP like webdav)

It seems to me that both _upload and _upload_fobj should work the same way, and should both guarantee atomicity at the fs level - like how in localfs we do the explicit upload to tempfile and rename for both _upload and _upload_fobj
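
For the local filesystem case mentioned above, the upload-to-tempfile-then-rename pattern can be sketched with the standard library alone. This is a minimal illustration, not DVC's localfs code; `atomic_upload_fobj` is a hypothetical name.

```python
import os
import shutil
import tempfile
from typing import BinaryIO


def atomic_upload_fobj(fobj: BinaryIO, dest_path: str) -> None:
    """Write fobj to dest_path so readers never observe a partial file."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    os.makedirs(dest_dir, exist_ok=True)
    # Create the temp file in the destination directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            shutil.copyfileobj(fobj, tmp)
        # os.replace overwrites any existing file at dest_path; on POSIX
        # the rename is atomic.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # On any failure, remove the partial temp file and re-raise.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the copy fails midway, the destination path is left untouched and only the temporary file is cleaned up, which is the guarantee the discussion is asking for.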

Member

@efiop efiop Jul 19, 2021

@pmrowla Thanks for clarifying!

_upload_fobj is temporary until fsspec migration is complete and we can use put/get[_file] directly.

fs atomicity is unlikely to be guaranteed by all filesystems and might actually be undesirable in some use cases outside dvc (e.g. you might want to upload as much of a file as you can, or you might not care about atomicity so you might not want to waste an API call for rename), so it seems like it could be more robust if we do that in our odb layer (or fs wrapper after all?) for now.

Clearly, it would be useful to know whether or not particular fs operations are atomic, so that we waste as few API calls as possible. So maybe our fsspec_wrapper is indeed a pretty good place for it for now: similar to how, IIRC, C libraries have atomic_* functions, we could have something like put_file and atomic_put_file, or atomic=True, or something. Maybe this could be useful for fsspec in general as well, not quite sure right now 🤔
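
The `atomic=True` idea floated here could be sketched roughly as follows. This is purely hypothetical: `AtomicPutWrapper` and its `natively_atomic` flag are invented for illustration, and `fs` is assumed to be any object exposing `put_file(lpath, rpath)` and `mv(path1, path2)` methods (names chosen to mirror fsspec's, though nothing below depends on fsspec).

```python
import uuid


class AtomicPutWrapper:
    """Hypothetical wrapper that adds an `atomic` option to put_file.

    `natively_atomic` records whether the wrapped filesystem's uploads are
    already atomic, so that no extra rename call is spent on them.
    """

    def __init__(self, fs, natively_atomic: bool):
        self.fs = fs
        self.natively_atomic = natively_atomic

    def put_file(self, lpath: str, rpath: str, atomic: bool = False) -> None:
        if not atomic or self.natively_atomic:
            # Either the caller does not care about atomicity, or the
            # filesystem already guarantees it: one API call is enough.
            self.fs.put_file(lpath, rpath)
            return
        # Otherwise upload to a unique temporary path, then rename.
        tmp_path = f"{rpath}.{uuid.uuid4().hex}.tmp"
        self.fs.put_file(lpath, tmp_path)
        self.fs.mv(tmp_path, rpath)
```

A filesystem without any rename support that is also not natively atomic would need separate handling; this sketch does not cover that case.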

Member

Doesn't look like this PR is changing the old behaviour, so probably not worth blocking it because of it, but we'll def need to keep this in mind for the followups.

dvc/objects/db/index.py Outdated (resolved)
objs: Iterable["HashFile"],
name: Optional[str] = None,
index: Optional["ObjectDBIndexBase"] = None,
cache_odb: Optional["ObjectDB"] = None,
Contributor Author

The odb passed into status will not always be where we want to load dir trees from (we want to load from the local cache and not stream from the remote when possible).

For now this is just a performance optimization to keep parity with existing behavior, but the need for this in general will be more obvious once the full oid migration is done
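
The preference described in this comment can be sketched as below. It is an illustration only: `load_tree` is a hypothetical helper, the ODBs are modeled as plain mappings from object ID to bytes, and a tree is assumed to be stored as a JSON list.

```python
import json
from typing import List, Mapping, Optional


def load_tree(
    oid: str,
    odb: Mapping[str, bytes],
    cache_odb: Optional[Mapping[str, bytes]] = None,
) -> List[str]:
    """Load a dir tree, preferring the local cache over the given ODB."""
    if cache_odb is not None and oid in cache_odb:
        # Tree is available locally: avoid streaming it from the remote.
        raw = cache_odb[oid]
    else:
        raw = odb[oid]
    return json.loads(raw)
```

The remote ODB is read only when the tree is absent from the cache, which is the parity-preserving optimization the comment describes.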

@pmrowla pmrowla moved this from In progress to Review in progress in DVC 13 July - 26 July 2021 Jul 20, 2021
@pmrowla
Contributor Author

pmrowla commented Jul 20, 2021

On my machine, the dvc-bench results for the current PR are comparable to master. As noted in #6308 (comment), there will be a follow-up PR that changes our object collection and status/transfer to use object IDs (hash infos) instead of objects.

@pmrowla pmrowla changed the title [WIP] objects: migrate remote push/pull to objects.transfer objects: migrate remote push/pull to objects.transfer Jul 20, 2021
DVC 13 July - 26 July 2021 automation moved this from Review in progress to Reviewer approved Jul 20, 2021
@efiop efiop merged commit 3d84df9 into iterative:master Jul 20, 2021
DVC 13 July - 26 July 2021 automation moved this from Reviewer approved to Done Jul 20, 2021
@pmrowla pmrowla deleted the odb-transfer branch July 20, 2021 10:41
@efiop efiop mentioned this pull request Jul 22, 2021
2 tasks
Labels
refactoring Factoring and re-factoring
2 participants