proposal: os: new Readdirentries method to read directory entries and efficiently expose file metadata #40352

@israel-lugo

API

(Update: edited to turn into a proposal, add suggested API)

func (f *File) Readdirentries(n int) ([]DirEntry, error)

// The file type could not be determined
const ModeUnknown FileMode = xxx

type DirEntry struct {
  // Name is the base name of the file
  Name string
  // ... unexported fields ...
}

// Id returns a number which uniquely identifies the file within the filesystem that contains it.
// This is known as the inode number in Unix-like systems, or file ID in Windows. Under Windows,
// this will require a system call on first call, but not on Unix.
//
// Note this is not guaranteed to be unique in the ReFS file system, introduced with Windows
// Server 2012, since that uses 128-bit identifiers.
func (d *DirEntry) Id() (uint64, error) { ... }

// Type returns the file's type. Depending on the underlying filesystem, this may require an Lstat,
// which will be done internally and cached after first use. If lightweight is true, the Lstat will not
// be done; in that case, if the file type is not immediately known, ModeUnknown will be
// returned. This may be useful, e.g., if the caller will be opening the file anyway and would prefer
// to do a Stat of the open file to avoid filename races.
func (d *DirEntry) Type(lightweight bool) (FileMode, error) { ... }

// Lstat behaves like the normal Lstat. Its result will be cached after the first use, which may have
// occurred from calling Type, or even Id under Windows.
func (d *DirEntry) Lstat() (FileInfo, error) { ... }

This is analogous to, e.g., Python's os.scandir, as pointed out by @qingyunha below.
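The lazy, cached behaviour described in the comments on Type and Lstat could be modelled roughly as follows. This is only a sketch under stated assumptions, not the proposed implementation: `lazyEntry`, `modeUnknown` and `sampleEntry` are hypothetical names, `os.ModeIrregular` merely stands in for the unspecified value of ModeUnknown, and the entry is built by hand with an unknown type rather than from a real directory read.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// modeUnknown is a hypothetical stand-in for the proposal's ModeUnknown,
// whose actual value the proposal leaves open.
const modeUnknown = os.ModeIrregular

// lazyEntry is a hypothetical model of the proposed DirEntry.
type lazyEntry struct {
	Name      string
	dir       string      // parent directory, used to build the Lstat path
	typ       os.FileMode // type bits, valid only when typeKnown is true
	typeKnown bool        // false models a dirent whose d_type is DT_UNKNOWN
	info      os.FileInfo // cached Lstat result; nil until first needed
}

// Lstat stats the entry on first use and caches the result.
func (e *lazyEntry) Lstat() (os.FileInfo, error) {
	if e.info != nil {
		return e.info, nil
	}
	fi, err := os.Lstat(filepath.Join(e.dir, e.Name))
	if err != nil {
		return nil, err
	}
	e.info = fi
	e.typ, e.typeKnown = fi.Mode().Type(), true
	return fi, nil
}

// Type returns the cached type if known. Otherwise it returns modeUnknown
// when lightweight is true, or falls back to Lstat when it is false.
func (e *lazyEntry) Type(lightweight bool) (os.FileMode, error) {
	if e.typeKnown {
		return e.typ, nil
	}
	if lightweight {
		return modeUnknown, nil
	}
	if _, err := e.Lstat(); err != nil {
		return 0, err
	}
	return e.typ, nil
}

// sampleEntry creates a temporary directory containing a subdirectory "sub"
// and returns an entry for it whose type is deliberately left unknown.
func sampleEntry() (*lazyEntry, error) {
	dir, err := os.MkdirTemp("", "direntry-demo")
	if err != nil {
		return nil, err
	}
	if err := os.Mkdir(filepath.Join(dir, "sub"), 0o755); err != nil {
		return nil, err
	}
	return &lazyEntry{Name: "sub", dir: dir}, nil
}

func main() {
	e, err := sampleEntry()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(e.dir)

	m, _ := e.Type(true) // no Lstat is performed here
	fmt.Println("lightweight, type known:", m != modeUnknown)

	m, err = e.Type(false) // triggers one Lstat and caches it
	fmt.Println("full, is dir:", m.IsDir(), "err:", err)
}
```

After the first non-lightweight call, later calls to Type (lightweight or not) and to Lstat reuse the cached result, which is the property the proposal relies on to keep repeated queries cheap.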

Context

Could we please have a new File-level API to list a directory's entries, which exposes the d_type field (syscall.Dirent.Type) and, ideally, also d_ino (syscall.Dirent.Ino)?

Under Linux, certain filesystems (such as Btrfs, ext4) store the file type information in the direntry itself. This is available via the d_type field, which may be DT_UNKNOWN if the file type could not be determined for some reason (e.g. no filesystem support, or weird quirks such as "." or ".."). According to man readdir(3), some BSDs also support this.

Currently, we have os.(*File).Readdir, which does an lstat on every file and does not make use of the type information, even if it's there. This makes sense given the method's signature, since it needs to find out the file's size, mode, etc.

We also have os.(*File).Readdirnames, which reads the dirent but only returns the name portion.
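To make the contrast between the two existing methods concrete, here is a small sketch that lists the same directory both ways. The helper names `listNames`, `listInfos` and `sampleDir` are made up for illustration; only `Readdirnames` and `Readdir` come from the discussion above.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// listNames uses Readdirnames: cheap, but yields names only.
func listNames(dir string) ([]string, error) {
	f, err := os.Open(dir)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return f.Readdirnames(-1)
}

// listInfos uses Readdir: yields a full FileInfo (size, mode, ...) per entry.
func listInfos(dir string) ([]os.FileInfo, error) {
	f, err := os.Open(dir)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	return f.Readdir(-1)
}

// sampleDir builds a temporary directory with one subdirectory and one file.
func sampleDir() (string, error) {
	dir, err := os.MkdirTemp("", "readdir-demo")
	if err != nil {
		return "", err
	}
	if err := os.Mkdir(filepath.Join(dir, "sub"), 0o755); err != nil {
		return "", err
	}
	if err := os.WriteFile(filepath.Join(dir, "a.txt"), []byte("hello"), 0o644); err != nil {
		return "", err
	}
	return dir, nil
}

func main() {
	dir, err := sampleDir()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)

	names, err := listNames(dir)
	fmt.Println("Readdirnames entries:", len(names), "err:", err)

	infos, err := listInfos(dir)
	if err != nil {
		fmt.Println("Readdir failed:", err)
		return
	}
	for _, fi := range infos {
		fmt.Printf("Readdir: %s dir=%v size=%d\n", fi.Name(), fi.IsDir(), fi.Size())
	}
}
```

With only the names in hand, a caller that wants to know which entries are directories has to stat each one itself, which is the gap the proposed method aims to close.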

It would be very useful to have an intermediate method between these two, one that returns not only the name but also the file type (which may of course be DT_UNKNOWN), and ideally anything else it can learn from the dirent, such as the file's inode number (d_ino or syscall.Dirent.Ino).

This would make it much easier to implement a fast, scalable file crawler (e.g. for backup software). Given a directory with 100,000 entries, being able to cheaply separate subdirectories from other files while listing the directory itself lets the crawler, for example, batch up regular files for further processing, or choose a crawling strategy depending on whether there are 2 subdirectories or 75,000. Especially for the backup case, having the inode number outright would also be useful, as it helps identify hardlinks (so the same data need not be read twice) without the cost of the lstat.
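As a rough illustration of the crawler use case, the sketch below splits a directory's entries into subdirectories and everything else using only the type bits attached to each entry. It assumes Go 1.16 or later, where `os.ReadDir` and `DirEntry.Type` are available; that API is used here as a stand-in and is not the method proposed in this issue. `partition` and `sampleTree` are hypothetical helper names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// partition splits the entries of dir into subdirectories and other files,
// looking only at each entry's type bits (no FileInfo is requested).
func partition(dir string) (subdirs, others []string, err error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, nil, err
	}
	for _, e := range entries {
		if e.Type().IsDir() {
			subdirs = append(subdirs, e.Name())
		} else {
			others = append(others, e.Name())
		}
	}
	return subdirs, others, nil
}

// sampleTree builds a temporary directory with two subdirectories and
// three regular files.
func sampleTree() (string, error) {
	dir, err := os.MkdirTemp("", "crawl-demo")
	if err != nil {
		return "", err
	}
	for _, d := range []string{"d1", "d2"} {
		if err := os.Mkdir(filepath.Join(dir, d), 0o755); err != nil {
			return "", err
		}
	}
	for _, f := range []string{"a", "b", "c"} {
		if err := os.WriteFile(filepath.Join(dir, f), nil, 0o644); err != nil {
			return "", err
		}
	}
	return dir, nil
}

func main() {
	dir, err := sampleTree()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)

	subdirs, others, err := partition(dir)
	if err != nil {
		fmt.Println("partition failed:", err)
		return
	}
	fmt.Println("subdirectories to descend into:", subdirs)
	fmt.Println("files to batch up:", others)
}
```

A crawler could then pick its strategy from `len(subdirs)` before touching any of the entries themselves.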

See e.g. this topic in golang-nuts for some speed comparisons. This can make a very big difference.

Labels: NeedsInvestigation, Proposal, Proposal-Hold
