Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

proposal: io/fs,path/filepath: add iterator form of WalkDir #64341

Open
rogpeppe opened this issue Nov 22, 2023 · 11 comments
Open

proposal: io/fs,path/filepath: add iterator form of WalkDir #64341

rogpeppe opened this issue Nov 22, 2023 · 11 comments
Labels
Milestone

Comments

@rogpeppe
Copy link
Contributor

rogpeppe commented Nov 22, 2023

It's often more convenient to use a for loop rather than a callback-based API, and fs.WalkDir and filepath.Walk are both callback-based APIs. I've been using github.com/kr/fs for years to work around this.
The new iterate-over-func functionality (#61405) now makes it straightforward to adapt the stdlib functions into iterators, so perhaps we could provide iterator-based APIs as part of the standard library now.

Here's a straw-man implementation:

package fs

type WalkDirEntry struct {
	Path    string
	Entry   fs.DirEntry
	Err     error
	skipDir *bool
}

// SkipDir causes the iteration to skip the contents
// of the entry. This will have no effect if called outside the iteration
// for this particular entry.
func (entry WalkDirEntry) SkipDir() {
	*entry.skipDir = true
}

// WalkDirIter returns an iterator that can be used to iterate over the contents
// of a directory. It uses WalkDir under the hood. To skip a directory,
// call the SkipDir method on an entry.
func WalkDirIter(fsys fs.FS, root string) func(func(WalkDirEntry) bool) {
	return func(yield func(WalkDirEntry) bool) {
		fs.WalkDir(fsys, root, func(path string, d fs.DirEntry, err error) error {
			skipDir := false
			info := WalkDirEntry{
				Path:    path,
				Entry:   d,
				Err:     err,
				skipDir: &skipDir,
			}
			if !yield(info) {
				return fs.SkipAll
			}
			if skipDir {
				return fs.SkipDir
			}
			return nil
		})
	}
}

The main point of contention from my point of view is the semantics of the SkipDir method. Calling a method on an iterated item to determine the subsequent course of iteration seems somewhat unusual, but I don't see any other decent alternative.

[Edited to remove incorrect remarks about GC behavior].

@gopherbot gopherbot added this to the Proposal milestone Nov 22, 2023
@mateusz834
Copy link
Member

Any reason not to return the error as a second value? like this:

type WalkDirEntry struct {
	Path    string
	Entry   fs.DirEntry
	skipDir *bool
}

func WalkDirIter(fsys fs.FS, root string) func(func(WalkDirEntry, error) bool) {

Seems like the conclusion of #61405 is to return errors this way. #61405 (comment)

@rogpeppe
Copy link
Contributor Author

rogpeppe commented Nov 22, 2023

Any reason not to return the error as a second value? like this:

That was @Merovius's suggestion too. I wasn't sure, because the Path and Entry fields can be valid even when the error isn't, making the docs a bit awkward to write, which made me think it might not be the correct approach.

Nonetheless, here's a stab at it. Maybe it's not so awkward after all.

// WalkDirIter returns an iterator that can be used to iterate over the contents
// of a directory. The implementation uses [WalkDir].
//
// Each iteration produces an entry and an error that will be non-nil
// for unreadable directories. The [WalkDirEntry.Path] field is always valid;
// the [WalkDirEntry.Entry] field is always valid except for when the
// initial Stat on the root directory fails.
//
// If a directory's ReadDir method fails, the iteration will produce two
// entries, both with the directory's path: first with a nil error,
// giving the body of the loop the chance to stop or call [WalkDirEntry.SkipDir],
// and secondly with the error from ReadDir.
//
// To skip a directory, call the SkipDir method on an entry.
func WalkDirIter(fsys fs.FS, root string) func(func(WalkDirEntry, error) bool) {
	var skipDir bool // outside the loop so we'll only allocate once.
	return func(yield func(WalkDirEntry, error) bool) {
		fs.WalkDir(fsys, root, func(path string, d fs.DirEntry, err error) error {
			info := WalkDirEntry{
				Path:    path,
				Entry:   d,
				skipDir: &skipDir,
			}
			skipDir = false
			if !yield(info, err) {
				return fs.SkipAll
			}
			if skipDir {
				return fs.SkipDir
			}
			return nil
		})
	}
}

@mateusz834
Copy link
Member

I am also not a big fan of the SkipDir method, it looks weird to skip something (change the state of the iterator) using the result of the iterator, but on the other hand I don't know how this could be done differently with range-over-func. Was this discussed somewhere in #61405?

@Merovius
Copy link
Contributor

@gazerro
Copy link
Contributor

gazerro commented Nov 22, 2023

I've observed (https://sourcegraph.com/search?q=context%3Aglobal+lang%3Ago+fs.WalkDir&patternType=standard&sm=1&groupBy=repo) that most fs.WalkDirFunc functions first check for the existence of an error. This makes me inclined to sympathize with @mateusz834's suggestion. However, what I observe most is that when the error is not checked first, access to the DirEntry argument occurs without verifying that it is not nil. (It is nil if the stat on the root directory has failed.)

This lack of checking obviously leads to a panic. There are also two missing checks in two tests in the Go repository.

I suggest not passing the error as a parameter, that is, leaving only one parameter, but ensuring that WalkDirIter returns an error in case the stat on the root directory fails:

files, err := fs.WalkDirIter(fsys, root)
if err != nil {
      // stat on the root directory failed
      ...
}
for file := range files {
	_ = file.Entry.Name() // this cannot panic because file.Entry is always guaranteed to be non-nil
}

@rogpeppe
Copy link
Contributor Author

@gazerro Yes, I tend to agree, although it's a bit unfortunate that that makes it considerably harder (or perhaps no possible?) to implement WalkDirIter in terms of WalkDir AFAICS. Perhaps it might be preferable to reverse that, because I think it should be easier to implement WalkDir in terms of WalkDirIter with those semantics.

@gazerro
Copy link
Contributor

gazerro commented Nov 22, 2023

Perhaps it might be preferable to reverse that, because I think it should be easier to implement WalkDir in terms of WalkDirIter with those semantics.

Yes, I agree. It would be more appropriate to implement WalkDir in terms of WalkDirIter and possibly deprecate WalkDir.

@gazerro
Copy link
Contributor

gazerro commented Nov 22, 2023

I also suggest naming the new function DirWalker (or perhaps simply Entries? Though, this name might be overly simplistic). While WalkDir appropriately denotes an action, the new function simply returns an iterator (or a sequence, as one prefers to say). So it would be used as:

entries, err := fs.DirWalker(fsys, root)
if err != nil {
      // stat on the root directory failed
      ...
}
for entry := range entries {
	// ...
}

@earthboundkid
Copy link
Contributor

earthboundkid commented Nov 23, 2023

Individual subdirectories and files can fail, so I don’t know what it saves you to return an error for the initial stat.

@gazerro
Copy link
Contributor

gazerro commented Nov 23, 2023

@carlmjohnson if the first stat fails, the DirEntry.Entry field is nil. Therefore, before accessing DirEntry.Entry, it is necessary to check that it is not nil. This should be done both within an if err != nil { ... } statement and in cases where the error is not handled because we are not interested in handling it.

I have noticed many cases (https://sourcegraph.com/search?q=context%3Aglobal+lang%3Ago+fs.WalkDir&patternType=standard&sm=1&groupBy=repo) where this check is not done.

In addition, this is my opinion, I find it incorrect for the function to return an iterator on a root directory that does not exist.

@myaaaaaaaaa
Copy link

myaaaaaaaaa commented Aug 21, 2024

I think it would be better to focus solely on the use case of reading file contents, since the design of Go's range-over-func itself is biased towards iterators that are simple rather than featureful:

package fs

type Data interface {
	Stat() FileInfo
	ReadAll() []byte
}

func Files(fsys FS, err *error) iter.Seq2[string, Data]

Usage would be something like:

func main() {
	var err error
	for pathName, data := range fs.Files(os.DirFS("."), &err) {
		if !strings.HasSuffix(pathName, ".json") {
			continue
		}
		// returns []byte{} - not nil - on error
		// errors are automatically collected at the end
		json := data.ReadAll()
		fmt.Println(string(json))
	}
	// loop automatically terminates early if any error occurs
	if err != nil {
		fmt.Println(err)
	}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Incoming
Development

No branches or pull requests

7 participants