Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Support for alternative hashes (#1676) #2022

Closed
wants to merge 20 commits into from

Conversation

shcheklein
Copy link
Member

  • [x ] Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


Fixes #1676

Based on a fork by @vyloy .

Copy link
Member Author

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1920 - address comments from the previous PR

checksum_types, modchecksum.LOCAL_SUPPORTED_CHECKSUM_TYPES
):
return True
return False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not simply:

return bool(modchecksum.checksum_types_from_str(...))

?

info = self.info.get(t)
if info:
return info
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be:

return next((self.info[t] for t in self.remote.checksum_types() if t in self.info), None)
# or
return next(filter(bool, map(self.info.get, self.remote.checksum_types())), None) 

:)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or

return self.info[self.checksum]


def checksum_type(self):
return self._checksum_types[0]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make this properties as above?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also is that an error if we have more than one here? If yes than I suggest adding an assert.

@@ -275,13 +299,55 @@ def get_checksum(self, path_info):
checksum = self.get_file_checksum(path_info)

if checksum:
self.state.save(path_info, checksum)
self.state.save(path_info, checksum, checksum_type)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a decorator to abstract away caching?

types = set(checksum_info.keys())
supported_types = set(self.checksum_types())
intersection = supported_types.intersection(types)
if not supported_types.intersection(types):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to call .keys() just types = set(checksum_info). Actually Python has optimized code path to make sets from dicts. intersection is computed twice.

addition_check, path_info, checksum_info
)
)
return True
Copy link
Contributor

@Suor Suor May 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this addition_check only purpose to pass there self.changed_cache? If so won't it be more clear to simply use bool parameter?

AZURE_SUPPORTED_CHECKSUM_TYPES = [CHECKSUM_ETAG]
GS_SUPPORTED_CHECKSUM_TYPES = [CHECKSUM_MD5]
HDFS_SUPPORTED_CHECKSUM_TYPES = [CHECKSUM_CHECKSUM]
S3_SUPPORTED_CHECKSUM_TYPES = [CHECKSUM_ETAG]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this go to appropriate modules and just stay contained there? This will eliminate one level of indirection - these constants.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They will go to the appropriate modules for sure 👍

@@ -207,7 +203,7 @@ def walk(self, path_info):
return os.walk(path_info.path)

def get_file_checksum(self, path_info):
return file_md5(path_info.path)[0]
return file_checksum(path_info.path, self._checksum_types[0])[0]
Copy link
Contributor

@Suor Suor May 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like file_md5/file_checksum second result is never used. Maybe now is a good time to drop it and return only hex.

Optional(PARAM_CMD): Or(str, None),
Optional(PARAM_WDIR): Or(str, None),
Optional(PARAM_DEPS): Or(And(list, Schema([dependency.SCHEMA])), None),
Optional(PARAM_OUTS): Or(And(list, Schema([output.SCHEMA])), None),
Optional(PARAM_LOCKED): bool,
Optional(PARAM_META): object,
}
}.items():
SCHEMA[k] = v
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just:

SCHEMA.update({
    Optional(PARAM_CMD): Or(str, None),
    ...
})

self._changed_deps(),
self._changed_outs(),
self._changed_checksum(),
]
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using ors instead of any() here will simplify code and provide short-circuit behavior.

v = d.get(k)
if v:
checksum[k] = v

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checksum = {k: d[k] for k in modchecksum.CHECKSUM_MAP if d.get(k)}

# Or if you don't expect falsy values in `d`:
checksum = {k: d[k] for k in modchecksum.CHECKSUM_MAP if k in d}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or using funcy:

from funcy import project
checksum = project(d, modchecksum.CHECKSUM_MAP)

Funcy is my lib :)

r = {}
for key, value in self.checksum.items():
if value:
r[key] = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r = {k: v for k, v in self.checksum.items() if v} 

Or using funcy:

from funcy import compact
r = compact(self.checksum)

if value:
r[key] = value

return r
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole thing is merging of two dicts with a filter:

from  itertools import chain

d1 = self.checksum
d2 = {...}
return {k: v for k, v in chain(d1.items(), d2.items()) if v}

# or with funcy
from funcy import merge, compact

d1 = self.checksum
d2 = {...}
return compact(merge(d1, d2))

return None
for h in checksum_types:
if h not in supported_types:
return None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be

if not set(checksum_types) <= set(supported_types):
    return None

ret = []
for e in d:
ret.append(dict_filter(e, exclude))
return ret
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good place to use list comprehension.

self.assertEqual(
d["sha256"],
"6bc1683385ab4218fed1f53195c41d95d186c085a9e929cd77f012d4f541dd34",
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These look like functional tests not unit ones. Also could be easily written in new pytest style using dvc fixture:

def test_compatibility(dvc):
    ret = main(...)
    assert ret == 0

    # ....

@shcheklein
Copy link
Member Author

@Suor thanks for reviewing this. It was started by an external contributor and am trying to get it done, I don't expect it to be fast though. I'll ping you again when it's in a better shape and some fundamental issues are solved. I'll address your comments along the way, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alterative hashings for data
3 participants