Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

objects: use object IDs and references instead of naive objs in status/transfer #6360

Merged
merged 38 commits into from Aug 4, 2021

Conversation

pmrowla
Copy link
Contributor

@pmrowla pmrowla commented Jul 23, 2021

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

  • Adds ReferenceHashFile object type and ReferenceObjectDB ODB type
    • Ref ODB stores references to files outside the ODB fs, so the object at hash abc123 is a serialized reference to some filesystem and path where the actual source file w/hash abc123 exists
  • objects.stage now stages objects as references in a memfs based ref ODB
  • objects.save is removed in favor of stage + transfer from staging to ODB pattern
  • objects.status and objects.transfer now accept a collection of object IDs (hashinfos) instead of naive objects
    • repo/stage/output object collection now collects object IDs instead of naive objects as well

@pmrowla pmrowla added the refactoring Factoring and re-factoring label Jul 23, 2021
@pmrowla pmrowla self-assigned this Jul 23, 2021
@pmrowla pmrowla added this to In progress in DVC 13 July - 26 July 2021 via automation Jul 23, 2021
@pmrowla pmrowla added this to In progress in DVC 27 Jul - 10 Aug via automation Jul 27, 2021
@pmrowla pmrowla moved this from In progress to Done in DVC 13 July - 26 July 2021 Jul 27, 2021
@efiop efiop moved this from In progress to Review in progress in DVC 27 Jul - 10 Aug Jul 27, 2021
def get(self, hash_info: "HashInfo"):
if hash_info.isdir:
return super().get(hash_info)
path_info = self.hash_to_path(hash_info.value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: as we've discussed with @pmrowla before, it might make sense to either store these ref objects in a some subdir in odb (e.g. refs/ or smth) or maybe use a file extention (similar to .dir, e.g. .ref) to distinguish these objects from actual real objects.

This might get useful in cases where you need to store actual objects as well as ref objects and is kinda like git's object header, but encoded into a path or filename, which allows us to very easilly distinguish them on sight and through API calls like list_objects.

@efiop
Copy link
Member

efiop commented Jul 27, 2021

Btw, even though refodb is not able to make dvc provide 100% git-like staging experience (we can't save actual files since they are too big), we could maybe improve --no-commit by just showing that in dvc status. We won't be able to make your data roll back to the staged state, but at least we won't need to store md5 of an uncommited (to .dvc/cache) file/dir, which is less hazardous and so uncommited and missing cache cases could be more easilly distinguished in functionality that cares (e.g. dvc status, dvc diff, etc). This will, of course, be possible, if staging refodb becomes permanently saved to disk, which is a future possible step. But just wanted to mention this, since it seems to allow us for a better ui or better functionality in general in the future.

@pmrowla
Copy link
Contributor Author

pmrowla commented Jul 30, 2021

For the record:

had discussion with @efiop regarding potential future use cases for refodb

  • replacing get with the ability to "checkout" directly from a remote odb seems promising
  • checkout itself would be generalized into a "transfer/link from source odb into checkout refodb" where the refodb contains references to the output paths

@pared pared self-requested a review July 30, 2021 14:03
@efiop
Copy link
Member

efiop commented Aug 2, 2021

@Mergifyio rebase

@mergify
Copy link
Contributor

mergify bot commented Aug 2, 2021

Command rebase: success

Branch has been successfully rebased

return None

return HashInfo("md5", value[2], size=int(size))
return HashInfo("md5", value[2], size=size)
Copy link
Member

@efiop efiop Aug 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This breaks backward compatibility for this db πŸ™

We could change the layout, but just need to make sure we do corresponding version adjustments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll adjust it so that we go back to writing size as a string in the state DB.

The main thing here was that I don't think the utils function should have been returning both as strings (the returned mtime does have to be a string since for dirs it's actually the hash of all the file mtimes, but that doesn't apply to size)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pmrowla Agreed, that's part of an old code that used to be directly in old state class πŸ™

rev = repo.get_rev()
if locked and self.def_repo.get(self.PARAM_REV_LOCK) is None:
self.def_repo[self.PARAM_REV_LOCK] = rev

path_info = PathInfo(repo.root_dir) / str(self.def_path)
if not obj_only:
try:
for odb, objs in repo.used_objs(
for odb, obj_ids in repo.used_objs(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we go obj_ids -> oids and coin this name? Sure, it is a bit more cryptic, but short and sweet and makes sense if you know what you are doing. Just feels like we should get the terminology straight right away so it is settled, as it will be used more and more around the code.

dvc/hash_info.py Show resolved Hide resolved
@pmrowla pmrowla marked this pull request as ready for review August 3, 2021 11:31
@pmrowla pmrowla requested a review from a team as a code owner August 3, 2021 11:31
@pmrowla
Copy link
Contributor Author

pmrowla commented Aug 4, 2021

Thinking about staging some more (with regard to the partial add issue), we will need to always write staged trees to disk somewhere prior to the formal save/transfer (whether it's in the actual ODB or somewhere else like .dvc/tmp).

With the current behavior for this PR where it's all done in memory, we wouldn't be able to recover from a partial/failed add since we would lose the dir cache/tree for the original "complete" directory

@pmrowla pmrowla changed the title [WIP] objects: use object IDs and references instead of naive objs in status/transfer objects: use object IDs and references instead of naive objs in status/transfer Aug 4, 2021
@efiop efiop merged commit 114a07e into iterative:master Aug 4, 2021
DVC 27 Jul - 10 Aug automation moved this from Review in progress to Done Aug 4, 2021
@efiop
Copy link
Member

efiop commented Aug 4, 2021

@pmrowla Thank you! πŸ™

@efiop
Copy link
Member

efiop commented Aug 8, 2021

For the record: some followups that we've discussed before:

  1. obj type is part of oid
  2. base odb should be aware of obj type (will likely result in removal of ReferenceODB)
  3. use protobuf? (not needed right now, as we don't need to serialize anything to an actual disk, but still)
  4. hash type might be part of oid but we could keep plain 12/34 structure for now in refs/
  5. ref objects could be a compatibility layer between different hash types.
  6. should be possible to transfer all objects from one odb to another (e.g. run cache too). Aka cloning

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
refactoring Factoring and re-factoring
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

2 participants