Skip to content
This repository has been archived by the owner on Sep 9, 2020. It is now read-only.

WIP: HashFromNode and VerifyDepTree #959

Merged
merged 36 commits into from Aug 17, 2017
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
Show all changes
36 commits
Select commit Hold shift + click to select a range
84b3567
HashFromNode and VerifyDepTree
karrick Jul 15, 2017
ae2c30e
added package declarations at top of all test go files
karrick Aug 5, 2017
1a76ff6
refactor based on feedback from @sdboyer
karrick Aug 7, 2017
66fe4eb
writes NULL byte after writing to hash
karrick Aug 7, 2017
5c97180
added copywrite notification for dummy files
karrick Aug 7, 2017
546d65f
test scaffolding dirs and files have meaningful names
karrick Aug 7, 2017
f1d5d85
use filepath function to locate testdata_digest directory
karrick Aug 7, 2017
b6d16e1
convert CRLF to LF when hashing file contents
karrick Aug 8, 2017
9a7209d
added copywrite notification for digest{,_test}.go
karrick Aug 8, 2017
581e51a
fixes `gosimple $PKGS` lint suggestion
karrick Aug 8, 2017
a298630
lineEndingReader is more simplified than lineEndingWriterTo
karrick Aug 8, 2017
f0b532d
bug fixes and better testing for lineEndingReader
karrick Aug 8, 2017
911fda4
TestScaffolding prints testdata_digest directory
karrick Aug 8, 2017
ecbbf03
pathnames returned as keys to VerifyDepTree normalized with filepath.…
karrick Aug 8, 2017
0f7b419
continue printing VerifyDepTree response
karrick Aug 8, 2017
443fcd3
comments and some logic changes
karrick Aug 8, 2017
8575040
normalize pathnames as keys for expected digests map
karrick Aug 9, 2017
541b3f8
no longer dumps status output during testing
karrick Aug 9, 2017
0a01713
no longer dumps status output during testing
karrick Aug 9, 2017
8a0c365
two loop solution
karrick Aug 12, 2017
cba6ebe
O(n) loop for CRLF -> LF translation in lineEndingReader
karrick Aug 12, 2017
59a1046
typos
karrick Aug 12, 2017
c5be3d8
NotInLock loop more easy to read
karrick Aug 13, 2017
87959d0
DigestFromPathname -> DigestFromDirectory
karrick Aug 13, 2017
29b1709
sets element in backing array to nil after popping off queue
karrick Aug 13, 2017
17e7c9d
comments updated
karrick Aug 13, 2017
0df696c
code cleanup and documentation updates
karrick Aug 13, 2017
f02d479
VerifyDepTree reports all NotInLock file system nodes
karrick Aug 14, 2017
e9ff6b2
additional test cases
karrick Aug 15, 2017
fbe5d64
VerifyDepTree test displays actual hash values when got status does n…
karrick Aug 15, 2017
48cd664
DirWalk
karrick Aug 15, 2017
845c95e
repeatedly uses io.CopyBuffer with pre-allocated buffer
karrick Aug 15, 2017
9a883e8
DigestFromDirectory uses DirWalk
karrick Aug 15, 2017
a252c36
public type has properly formatted documentation
karrick Aug 15, 2017
83ad7c6
DigestFromDirectory handles Lstat error from DirWalk
karrick Aug 15, 2017
a9bb8cd
removed constant no longer being used
karrick Aug 15, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
145 changes: 145 additions & 0 deletions internal/fs/hash.go
@@ -0,0 +1,145 @@
package fs

import (
"crypto/sha256"
"fmt"
"io"
"os"
"path/filepath"
"sort"
"strconv"

"github.com/pkg/errors"
)

const (
pathSeparator = string(filepath.Separator)
skipModes = os.ModeDevice | os.ModeNamedPipe | os.ModeSocket | os.ModeCharDevice
)

// HashFromNode returns a deterministic hash of the specified file system node,
// performing a breadth-first traversal of directories. While the specified
// prefix is joined with the pathname to walk the file system, the prefix string
// is eliminated from the pathname of the nodes encounted when hashing the
// pathnames.
//
// This function ignores any file system node named `vendor`, `.bzr`, `.git`,
// `.hg`, and `.svn`, as these are typically used as Version Control System
// (VCS) directories.
//
// Other than the `vendor` and VCS directories mentioned above, the calculated
// hash includes the pathname to every discovered file system node, whether it
// is an empty directory, a non-empty directory, empty file, non-empty file, or
// symbolic link. If a symbolic link, the referent name is included. If a
// non-empty file, the file's contents are incuded. If a non-empty directory,
// the contents of the directory are included.
//
// While filepath.Walk could have been used, that standard library function
// skips symbolic links, and for now, we want to hash the referent string of
// symbolic links.
func HashFromNode(prefix, pathname string) (hash string, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put this cool cat in internal/gps/pkgtree, rather than internal/fs. While we break the rule a bit with the fs functions, this is a key part of the domain solution gps provides, and it should be grouped with it. pkgtree makes sense as a a package to put it in, as this is another operation operating on trees of packages, and it shares some of the same nuanced semantics about what nodes are and aren't visited with pkgtree.ListPackages().

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, let's return []byte instead of string.

and, style nit: please use named returns only when necessary for disambiguation of return values (which we don't need here).

Copy link
Contributor Author

@karrick karrick Aug 7, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I buy everything you suggest here, but after working with the code, and keeping in mind the fact that Gopkg.lock has string versions of digest sums in them, I have a hard time seeing this as []byte rather than a string.

Please provide additional information that I might not be aware of to help me understand this choice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no strong argument for it - it's really just a consistency thing. i wasn't explicit about it above, but the big reason to move this into gps is because we try to treat gps as a library that has no idea that dep exists. the only exceptions to that right now are some calls into internal/fs, and then some tests that slipped into gps that we need to back out.

the only other place in gps with hashing is when we hash inputs (what shows up in Gopkg.lock as inputs-digest). that returns a []byte; it does so because that's what the sha256 writer returns, and changing it would be expressing an opinion about use - and gps has no opinion about use. maybe it goes to a file, where it must be encoded - or maybe it's being written directly to a database as a bytestream. gps will, at some point in the not-so-distant-future, be moved out of internal, so this isn't just our concern alone.

again, i realize this is totally bikesheddable territory, and there's no definitive, clear reason. just, trying to maintain consistency in the exposed API.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually do not regularly use named return values. A quick perusal of everything I have on github reveals that it's not my style. I thought I saw something else in this library that used it, or perhaps some other library, and I wanted to maintain style discipline. Thanks for the feedback!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, the named return values was an issue i learned about myself in the course of working on all this. you probably saw some of my detritus 😢

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I completely understand your logic about []byte and string. In fact, I had a follow up question for you were you to actually answer []byte, but you answered my follow up question with a link to other use. The question would have been, "Do we return a byte slice of the hexidecimal output, or the actual raw bytes returned by the digest?" I have made the required changes to the source code to return the raw bytes returned by the digest just as the other link does.

// Create a single hash instance for the entire operation, rather than a new
// hash for each node we encounter.
h := sha256.New()

// "../../../vendor", "github.com/account/library"
prefix = filepath.Clean(prefix) + pathSeparator
prefixLength := len(prefix)
if prefixLength > 0 {
prefixLength += len(pathSeparator) // if not empty string, include len of path separator
}
joined := filepath.Join(prefix, pathname)

// Initialize a work queue with the os-agnostic cleaned up pathname. Note
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this to be os-agnostic, i think we need filepath.ToSlash(), no? that'll ensure that path separators are always encoded as /.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely correct; thanks for the nudge!

// that we use `filepath.Clean` rather than `filepath.Abs`, because we don't
// want the hash to be based on the absolute pathnames of the specified
// directory and contents.
pathnameQueue := []string{joined}

for len(pathnameQueue) > 0 {
// NOTE: unshift a pathname from the queue
pathname, pathnameQueue = pathnameQueue[0], pathnameQueue[1:]

fi, er := os.Lstat(pathname)
if er != nil {
err = errors.Wrap(er, "cannot Lstat")
return
}

mode := fi.Mode()

// Skip special files
if mode&skipModes != 0 {
continue
}

// NOTE: Write pathname to hash, because hash ought to be as much a
// function of the names of the files and directories as their
// contents. Added benefit is that even empty directories and symbolic
// links will effect final hash value.
//
// NOTE: Throughout this function, we ignore return values from writing
// to the hash, because hash write always returns nil error.
_, _ = h.Write([]byte(pathname)[prefixLength:])

if mode&os.ModeSymlink != 0 {
referent, er := os.Readlink(pathname)
if er != nil {
err = errors.Wrap(er, "cannot Readlink")
return
}
// Write the referent to the hash and proceed to the next pathname
// in the queue.
_, _ = h.Write([]byte(referent))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to differentiate between a normal file containing only the path the symlink points to, and an actual symlink. Maybe that's happening elsewhere...?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please restate your comment. For all non-skipped file system nodes, we always hash the pathname to the node. Then if it's a symlink, we also hash the referent. Else if it's a directory, we enqueue the sorted list of children. Else if it's a regular file, we hash the file size and the file contents.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah, if we're hashing the file size, that sorta incidentally achieves my stated goal, though maybe not entirely.

an example to illustrate - say that we have two nodes, one a file, and one a symlink. the content of the file is /foo, and the referent of the symlink is 4/foo.

is it true that, if the path to either of these nodes is filename, then the data that will be fed to the hasher is filename4/foo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sdboyer, okay after reading your example about a dozen times, I think I understand your scenario. Basically we ought to protect ourselves against scenario of accidental hash collision when the stored symbolic link referent points to node, and when a regular file's contents might match the symbolic link referent. Crazy case here, but valid. I think if we write an addition NULL byte to the hash after every entry we write to the hash. Because NULL bytes are not legal file system characters for Windows or Linux, it provides a entry terminator of sorts, kind of like a new line character would.

This ought to prevent accidental hash collisions due to symlink referents matching the contents of a regular file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, latest push writes NULL bytes after writing pathname information to the hash.

continue
}

fh, er := os.Open(pathname)
if er != nil {
err = errors.Wrap(er, "cannot Open")
return
}

if fi.IsDir() {
childrenNames, er := fh.Readdirnames(0) // 0: read names of all children
if er != nil {
err = errors.Wrap(er, "cannot Readdirnames")
// NOTE: Even if there was an error reading the names of the
// directory entries, we still must close file handle for the
// open directory before we return. In this case, we simply skip
// sorting and adding entry names to the work queue beforehand.
childrenNames = nil
}

// NOTE: Sort children names to ensure deterministic ordering of
// contents of each directory, ensuring hash remains same even if
// operating system returns same values in a different order on
// subsequent invocation.
sort.Strings(childrenNames)

for _, childName := range childrenNames {
switch childName {
case ".", "..", "vendor", ".bzr", ".git", ".hg", ".svn":
// skip
default:
pathnameQueue = append(pathnameQueue, pathname+pathSeparator+childName)
}
}
} else {
_, _ = h.Write([]byte(strconv.FormatInt(fi.Size(), 10))) // format file size as base 10 integer
_, er = io.Copy(h, fh)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfortunately, we have to do a little more than just a straight copy here - at minimum, we have to normalize CRLF to LF so that windows machines are working on the same literal data as the rest. i know that git itself does something like this, though i gave up looking after finding my way to here.

this would certainly explain why the windows appveyor tests are failing with different digests than both the travis failures and the wanted values.

i'm pretty sure this will be a correct operation across all vcs, but that belief needs verification.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind, at least, and correct me if you think differently, the priority is that the hash be consistent across any version of any of the operating systems supported by Go, and using any of the dep-supported VCS. Please tell me if my view is not consistent with your goals in mind.

Assuming that is agreed, your point about CRLF vs LF translation might reveal quite the complex problem for us to tackle.

We would need to perform CRLF translation on all text files we encounter. Obviously we don't need to mangle the bytes on disk, but just mangle the bytes as we stream them into the hash. (Also, I think we could even get away with in-memory translating every file, including binary ones, with impunity here, but this seems like a bizarre solution.)

Your thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that is agreed, your point about CRLF vs LF translation might reveal quite the complex problem for us to tackle.

yep! i've been worried about this one for a while 😄

(Also, I think we could even get away with in-memory translating every file, including binary ones, with impunity here, but this seems like a bizarre solution.)

after poking at this and thinking about it for a while, i think this would actually be fine - a simple scanner that just reads the bytes into memory and mangles them there, effectively treating \r\n and \n as equivalent. while i wasn't able to find the actual logic in git that does this work, i did at least verify that git hash-object on a file produces the same hash when i replace a \r\n with a \n, so that appears to be the strategy they employ.

the (a?) tricky bit is that we could no longer rely on the file size to be an accurate byte count; we could only know afterwards, after having performed the mangling. i believe that would mean we'd either have to write the file length as a suffix, or keep the entire file's contents in a temporary buffer and then pass it to the hasher after we've reached the end and gotten an accurate count.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my mind, at least, and correct me if you think differently, the priority is that the hash be consistent across any version of any of the operating systems supported by Go, and using any of the dep-supported VCS.

oh, and 💯 yes. it's worse than useless if we don't have this property.

err = errors.Wrap(er, "cannot Copy") // errors.Wrap only wraps non-nil, so elide checking here
}

// NOTE: Close the file handle to the open directory or file.
if er = fh.Close(); err == nil {
err = errors.Wrap(er, "cannot Close")
}
if err != nil {
return // early termination iff error
}
}

hash = fmt.Sprintf("%x", h.Sum(nil))
return
}
40 changes: 40 additions & 0 deletions internal/fs/hash_test.go
@@ -0,0 +1,40 @@
package fs

import (
"os"
"path/filepath"
"testing"
)

func TestHashFromNodeWithFile(t *testing.T) {
actual, err := HashFromNode("", "./testdata/blob")
if err != nil {
t.Fatal(err)
}
expected := "9ccd71eec554488c99eb9205b6707cb70379e1aa637faad58ac875278786f2ff"
if actual != expected {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we try to follow the (general) convention used within golang/go itself, which is "want" and "got" instead of "expected" and "actual". that applies to both var names and printed output.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also go a little further in gps by using a convention of:

(GOT): <data>
(WNT): <data>

which helps a bit with scannability. not that it matters with hash digests, ofc - just, again, consistency.

t.Errorf("Actual:\n\t%#q\nExpected:\n\t%#q", actual, expected)
}
}

func TestHashFromNodeWithDirectory(t *testing.T) {
actual, err := HashFromNode("../fs", "testdata/recursive")
if err != nil {
t.Fatal(err)
}
expected := "432949ff3f1687e7e09b31fd81ecab3b7d9b68da5f543aad18e1930b5bd22e30"
if actual != expected {
t.Errorf("Actual:\n\t%#q\nExpected:\n\t%#q", actual, expected)
}
}

var goSource = filepath.Join(os.Getenv("GOPATH"), "src")

func BenchmarkHashFromNode(b *testing.B) {
for i := 0; i < b.N; i++ {
_, err := HashFromNode("", goSource)
if err != nil {
b.Fatal(err)
}
}
}
7 changes: 7 additions & 0 deletions internal/fs/testdata/blob
@@ -0,0 +1,7 @@
fjdkal;fdjskc
xzc
axc
fdsf
adsf
das
fd
7 changes: 7 additions & 0 deletions internal/fs/testdata/recursive/blob
@@ -0,0 +1,7 @@
fjdkal;fdjskc
xzc
axc
fdsf
adsf
das
fd
Empty file.
3 changes: 3 additions & 0 deletions internal/fs/testdata/recursive/foo/bar
@@ -0,0 +1,3 @@
fjdsakl;fd
vcafcds
vca
1 change: 1 addition & 0 deletions internal/fs/testdata/recursive/vendor/skip1
@@ -0,0 +1 @@
this file ought to be skipped
Empty file.
1 change: 1 addition & 0 deletions internal/gps/_testdata/src/cycle/a.go
Expand Up @@ -6,6 +6,7 @@ package cycle

import (
"cycle/one"

"github.com/golang/dep/internal/gps"
)

Expand Down
1 change: 1 addition & 0 deletions internal/gps/_testdata/src/cycle/one/a.go
Expand Up @@ -6,6 +6,7 @@ package one

import (
"cycle/two"

"github.com/golang/dep/internal/gps"
)

Expand Down
1 change: 1 addition & 0 deletions internal/gps/_testdata/src/cycle/two/a.go
Expand Up @@ -6,6 +6,7 @@ package two

import (
"cycle"

"github.com/golang/dep/internal/gps"
)

Expand Down
3 changes: 2 additions & 1 deletion internal/gps/_testdata/src/missing/a.go
Expand Up @@ -7,8 +7,9 @@ package simple
import (
"sort"

"github.com/golang/dep/internal/gps"
"missing/missing"

"github.com/golang/dep/internal/gps"
)

var (
Expand Down