archive/tar: add support for writing tar containing sparse files #13548

Open
grubernaut opened this Issue Dec 9, 2015 · 43 comments

@grubernaut

I've created a GitHub repo with all the steps needed to reproduce this on Ubuntu 12.04 using Go 1.5.1. I've also verified that Go 1.5.2 exhibits the same error.

Run vagrant create then vagrant provision from the repository root.

vagrant create
vagrant provision

Expected Output:

$ vagrant provision
==> default: Running provisioner: shell...
    default: Running: inline script
==> default: stdin: is not a tty
==> default: go version go1.5.2 linux/amd64
==> default: Creating Sparse file
==> default: Proving file is truly sparse
==> default: 0 -rw-r--r-- 1 root root 512M Dec  9 15:26 sparse.img
==> default: Compressing in Go without sparse
==> default: Compressing in Go with sparse
==> default: FileInfo File Size: 536870912
==> default: Proving non-sparse in Go gained size on disk
==> default: 512M -rw-r--r-- 1 root root 512M Dec  9 15:26 non_sparse/sparse.img
==> default: Proving sparse in Go DID keep file size on disk
==> default: 0 -rw-r--r-- 1 root root 0 Dec  9 15:26 sparse/sparse.img
==> default: Compressing via tar w/ Sparse Flag set
==> default: Proving sparse via tar DID keep file size on disk
==> default: 0 -rw-r--r-- 1 root root 512M Dec  9 15:26 tar/sparse.img

Actual Output:

$ vagrant provision
==> default: Running provisioner: shell...
    default: Running: inline script
==> default: stdin: is not a tty
==> default: go version go1.5.2 linux/amd64
==> default: Creating Sparse file
==> default: Proving file is truly sparse
==> default: 0 -rw-r--r-- 1 root root 512M Dec  9 15:35 sparse.img
==> default: Compressing in Go without sparse
==> default: Compressing in Go with sparse
==> default: Proving non-sparse in Go gained size on disk
==> default: 513M -rw-r--r-- 1 root root 512M Dec  9 15:35 non_sparse/sparse.img
==> default: Proving sparse in Go DID NOT keep file size on disk
==> default: 512M -rw-r--r-- 1 root root 512M Dec  9 15:35 sparse/sparse.img
==> default: Compressing via tar w/ Sparse Flag set
==> default: Proving sparse via tar DID keep file size on disk
==> default: 0 -rw-r--r-- 1 root root 512M Dec  9 15:35 tar/sparse.img

The Vagrantfile supplied in the repository runs the following shell steps:

  • Installs Go
  • Creates a sparse file via truncate -s 512M sparse.img
  • Proves that the file is sparse via ls -lash sparse.img
  • Runs compress.go via go run compress.go
  • Untars the archives created by compress.go via tar -xf
  • Verifies that the extracted files did not remain sparse, both with and without the sparse type set in the tar file's header: ls -lash sparse.img
  • Uses GNU tar to compress the sparse file with the sparse flag set: tar -Scf sparse.tar sparse.img
  • Extracts the archive created by GNU tar: tar -xf sparse.tar
  • Proves that GNU tar maintained the file's sparseness: ls -lash sparse.img

This is somewhat related to #12594.

I could also be creating the archive incorrectly; I've tried a few different methods of creating the tar archive, but none of them kept the sparse file intact upon extraction. This also cannot be reproduced on OS X, as HFS+ has no concept of sparse files and instantly destroys any sparseness, hence the need to run and test the reproduction case in a Vagrant VM.

Any thoughts or hints into this would be greatly appreciated, thanks!

@bradfitz (Member) commented Dec 9, 2015

/cc @dsnet who's been going crazy on the archive/tar package in the Go 1.6 tree ("master" branch)

@dsnet (Member) commented Dec 9, 2015

This isn't a bug per se, but more of a feature request. Sparse file support is only provided for tar.Reader, not tar.Writer. Currently it's a bit asymmetrical, but supporting sparse files in tar.Writer requires an API change, which may take some time to think through.

Also, this is mostly unrelated to #12594, though that bug should definitely be fixed before any attempt at this is made. For the time being, I recommend putting this in the "unplanned" milestone; I'll revisit this issue once the other tar bugs are fixed.

@grubernaut commented Dec 9, 2015

@dsnet should I keep this here as a feature request, or is there another preferred format for those?

@dsnet (Member) commented Dec 9, 2015

The issue tracker is perfect for that. So this is just fine.

@rsc rsc changed the title from archive/tar: Writing a tarfile does not maintain sparse files to archive/tar: no support for writing tar containing sparse files Dec 28, 2015

@rsc rsc added this to the Unplanned milestone Dec 28, 2015

@rsc rsc changed the title from archive/tar: no support for writing tar containing sparse files to archive/tar: add support for writing tar containing sparse files Dec 28, 2015

@dsnet (Member) commented Feb 26, 2016

This is my proposed addition to the tar API to support sparse writing.

First, we modify tar.Header to have an extra field:

type Header struct {
    ...

    // SparseHoles represents a sequence of holes in a sparse file.
    //
    // The regions must be sorted in ascending order, not overlap with
    // each other, and not extend past the specified Size.
    // If len(SparseHoles) > 0 or Typeflag is TypeGNUSparse, then the file is
    // sparse. It is optional for Typeflag to be set to TypeGNUSparse.
    SparseHoles []SparseEntry
}

// SparseEntry represents a Length-sized fragment at Offset in the file.
type SparseEntry struct {
    Offset int64
    Length int64
}

On the reader side, nothing much changes. We already support sparse files. All that's being done is that we're now exporting information about the sparse file through the SparseHoles field.

On the writer side, the user must set the SparseHoles field if they intend to write a sparse file. It is optional for them to set Typeflag to TypeGNUSparse (there are multiple formats to represent sparse files so this is not important). The user then proceeds to write all the data for the file. For sparse holes, they will be required to write Length zeros for that given hole. It is a little inefficient writing zeros for the holes, but I decided on this approach because:

  • It is symmetrical with how tar.Reader already operates (which transparently expands a sparse file).
  • It is more representative of what the "end result" really looks like. For example, it allows a user to write a sparse file by just doing io.Copy(tarFile, sparseFile) and not worry about where the holes are (assuming they already populated the SparseHoles field).

I should note that the tar format represents sparse files by indicating which regions have data, and treating everything else as a hole. The API exposed here does the opposite; it represents sparse files by indicating which regions are holes, and treating everything else as data. The reason for this inversion is that it fits the Go philosophy that the zero value of a type be meaningful. The zero value of SparseHoles indicates that there are no holes in the file, and thus it is a normal file; i.e., the default makes sense. If we were to use SparseDatas instead, its zero value would indicate that there is no data in the file, which is rather odd.

It is a little inefficient to require that users write zeros, and the bottleneck will be memory bandwidth when transferring potentially large runs of zeros. Though not necessary, the following methods may be worth adding as well:

// Discard skips the next n bytes, returning the number of bytes discarded.
// This is useful when dealing with sparse files to efficiently skip holes.
func (tr *Reader) Discard(n int64) (int64, error) {}

// FillZeros writes the next n bytes by filling them in with zeros.
// It returns the number of bytes written, and an error if any.
// This is useful when dealing with sparse files to efficiently skip holes.
func (tw *Writer) FillZeros(n int64) (int64, error) {}

Potential example usage: https://play.golang.org/p/Vy63LrOToO


@ianlancetaylor (Contributor) commented Feb 26, 2016

If Reader and Writer support sparse files transparently, why export SparseHoles? Is the issue that when writing you don't want to introduce a sparse hole that the caller did not explicitly request?

@dsnet (Member) commented Feb 26, 2016
The Reader expands sparse files transparently. The Writer is "transparent" in the sense that a user can just do io.Copy(tw, sparseFile), and so long as the user has already specified where the sparse holes are, it will avoid writing the long runs of zeros.

Purely transparent sparse-file support for Writer cannot easily be done, since the tar.Header is written before the file data; the Writer cannot know what sparse map to encode in the header prior to seeing the data itself. Thus, Writer.WriteHeader needs to be told where the sparse holes are.

I don't think tar should automatically create sparse files (for backwards compatibility). As a data point, the tar utilities do not automatically generate sparse files unless the -S flag is passed in. However, it would be nice if the user didn't need to come up with the SparseHoles themselves. Unfortunately, I don't see an easy solution to this.


There are three main ways that sparse files may be written:

  1. In the case of writing a file from the filesystem (the use case that spawned this issue), I'm not aware of any platform-independent way to easily query for all the sparse holes. There is a way to do this on Linux and Solaris with SEEK_DATA and SEEK_HOLE (see my test in CL/17692), but I'm not aware of equivalents on other OSes like Windows or Darwin.
  2. In the case of a round-trip read-write, a tar.Header read from Reader.Next and written to Writer.WriteHeader will work just fine as expected since tar.Header will have the SparseHoles field populated.
  3. In the case of writing a file from memory, the user will need to write their own zero-detection scheme (assuming they don't already know where the holes are).

I looked at the source for GNU and BSD tar to see what they do:

  • (Source) BSD tar attempts to use FIEMAP first, then SEEK_DATA/SEEK_HOLE, then (it seems) it avoids sparse files altogether.
  • (Source) GNU tar attempts to use SEEK_DATA/SEEK_HOLE, then falls back on brute-force zero block detection.

I'm not too fond of the OS-specific things they do to detect holes (granted, archive/tar already has many OS-specific things in it). I think it would be nice if tar.Writer provided a way to write sparse files, but I think we should delegate detection of sparse holes to the user for now. If possible, we can try to get sparse info during FileInfoHeader, but I'm not sure that os.FileInfo has the necessary information to do the queries that are needed.


@AkihiroSuda commented Nov 29, 2016

@dsnet Design SGTM (non-binding), do you plan to implement that feature?

@dsnet (Member) commented Dec 1, 2016

I'll try to get this into the Go 1.9 cycle. However, a major refactoring of the tar.Writer implementation needs to happen first.

@dsnet dsnet modified the milestones: Go1.9Maybe, Unplanned Dec 1, 2016

@dsnet (Member) commented Dec 7, 2016

That being said, for all those interested in this feature, can you mention what your use case is?

For example, are you only interested in being able to write a sparse file where you have to specify explicitly where the holes in the file are? Or do you expect to pass an os.FileInfo and have the tar package figure it out (I'm not sure this is possible)?

@willglynn commented Dec 8, 2016

My use case is go_ami_tools/aws_bundle, a library which makes machine images for Amazon EC2. The inside of the Amazon bundle format is a sparse tar, which is a big advantage for machine images, since there are usually lots of zeroes. go_ami_tools currently writes all the zeroes and lets them get compressed away, but a sparse tar would be better.

I'd like to leave zero specification up to the user of my library. ec2-bundle-and-upload-image – my example tool – would read zeroes straight from the host filesystem, but someone could just as easily plug the go_ami_tools library into a VMDK or QCOW reader, in which case the zeroes would be caller-specified.

@AkihiroSuda commented Dec 8, 2016

My use case is solving a Docker issue, moby/moby#5419 (comment), which causes docker build to fail with ENOSPC when the container image contains a sparse file.

@grubernaut commented Dec 8, 2016

We (HashiCorp) run Packer builds for customers on our public SaaS, Atlas. We offer an artifact store so Atlas customers can store their created Vagrant boxes, VirtualBox (ISO, VMX), QEMU, or other builds inside our infrastructure. If the customer specifies the Atlas post-processor during a Packer build, we first create an archive of the resulting artifact, and then POST it to Atlas.

Many of the resulting QEMU, VirtualBox, and VMware builds can be fairly large (10-20GB), and we've had a few customers sparsify the resulting disk image, which can lower the resulting artifact's size to ~500-1024MB. This, of course, allows for faster downloads, less bandwidth usage, and a better experience overall.

We first create the archive from the Atlas post-processor in Packer (https://github.com/mitchellh/packer/blob/master/post-processor/atlas/post-processor.go#L154). We then archive the resulting artifact directory by walking it, writing the file headers, and performing an io.Copy (https://github.com/hashicorp/atlas-go/blob/master/archive/archive.go#L381).

In this case, we wouldn't know explicitly where the holes in the file are, and would have to rely on os.FileInfo or something similar to generate the sparse map of the file, although I'm not entirely sure that this is possible.

@vbatts (Contributor) commented Apr 24, 2017

@dsnet the use case is largely around container images. So the Reader design you proposed SGTM, though it would be nice if the tar reader also provided an io.Seeker to accommodate the SparseHoles; that's not a terrible issue, just less than ideal.
For the Writer: either passing the FileInfo, or some quick detection, and perhaps an io.Writer wrapper with a type assertion?
Both sides would be useful, though. Thanks for your work on this.

@dsnet dsnet modified the milestones: Go1.10, Go1.9Maybe May 22, 2017

memory added a commit to memory/docker.github.io that referenced this issue May 26, 2017

Suggest passing --no-log-init to adduser
Running `useradd` without `--no-log-init` risks triggering a resource exhaustion issue:

    moby/moby#15585
    moby/moby#5419
    golang/go#13548

@memory memory referenced this issue in docker/docker.github.io May 26, 2017

Merged

Suggest passing --no-log-init to adduser #3413

mistyhacks added a commit to docker/docker.github.io that referenced this issue Jun 2, 2017

Suggest passing --no-log-init to adduser (#3413)
Running `useradd` without `--no-log-init` risks triggering a resource exhaustion issue:

    moby/moby#15585
    moby/moby#5419
    golang/go#13548
@dsnet (Member) commented Aug 18, 2017

Sorry this got dropped in Go 1.9; I have a working solution out for review for Go 1.10.

@gopherbot commented Aug 18, 2017

Change https://golang.org/cl/56771 mentions this issue: archive/tar: refactor Reader support for sparse files

gopherbot pushed a commit that referenced this issue Aug 19, 2017

archive/tar: refactor Reader support for sparse files
This CL is the first step (of two) for adding sparse file support
to the Writer. This CL only refactors the logic of sparse-file handling
in the Reader so that common logic can be easily shared by the Writer.

As a result of this CL, there are some new publicly visible API changes:
	type SparseEntry struct { Offset, Length int64 }
	type Header struct { ...; SparseHoles []SparseEntry }

A new type is defined to represent a sparse fragment and a new field
Header.SparseHoles is added to represent the sparse holes in a file.
The API intentionally represents sparse files using hole fragments,
rather than data fragments so that the zero value of SparseHoles
naturally represents a normal file (i.e., a file without any holes).
The Reader now populates SparseHoles for sparse files.

It is necessary to export the sparse hole information, otherwise it would
be impossible for the Writer to specify that it is trying to encode
a sparse file, and what it looks like.

Some unexported helper functions were added to common.go:
	func validateSparseEntries(sp []SparseEntry, size int64) bool
	func alignSparseEntries(src []SparseEntry, size int64) []SparseEntry
	func invertSparseEntries(src []SparseEntry, size int64) []SparseEntry

The validation logic that used to be in newSparseFileReader is now moved
to validateSparseEntries so that the Writer can use it in the future.
alignSparseEntries is currently unused by the Reader, but will be used
by the Writer in the future. Since TAR represents sparse files by
only recording the data fragments, we add the invertSparseEntries
function to convert a list of data fragments to a normalized list
of hole fragments (and vice-versa).

Some other high-level changes:
* skipUnread is deleted; most of its logic is moved to the
Discard methods on regFileReader and sparseFileReader.
* readGNUSparsePAXHeaders was rewritten to be simpler.
* regFileReader and sparseFileReader were completely rewritten
with simpler and easier-to-understand logic.
* A bug was fixed in sparseFileReader.Read where it failed to
report an error if the logical size of the file ends before
consuming all of the underlying data.
* The tests for sparse-file support were completely rewritten.

Updates #13548

Change-Id: Ic1233ae5daf3b3f4278fe1115d34a90c4aeaf0c2
Reviewed-on: https://go-review.googlesource.com/56771
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@gopherbot commented Aug 19, 2017

Change https://golang.org/cl/57212 mentions this issue: archive/tar: implement Writer support for sparse files

@rasky (Contributor) commented Aug 20, 2017
I think the proposed API is suboptimal because it leaves users of the library with the daunting task of correctly doing hole detection if they want to properly handle sparse files without disk explosions. I have a proposal for a different API.

Reader

Change it so that sparse files can be transparently extracted by using io.Copy to disk.

  • Add a new exported method tar.Reader.WriteTo(). This method implements a standard extraction unless the file being read is a sparse file and the writer also implements io.Seeker: in this case, holes defined in the tar header are skipped using Seek().
  • With the above, existing code that extracts a tar file using io.Copy() is silently upgraded to generate sparse files (assuming it extracts to disk and the filesystem supports extents; otherwise, nothing changes).
  • I think this is a welcome improvement for people who actually do care about sparse files, because tar.Reader would stop expanding them and would provide a way to actually preserve them as sparse files while being streamed to disk.

Writer

Change it so that sparse files can be transparently packed by using io.Copy from disk, with best-effort hole detection.

  • The client will signal a sparse file by setting Typeflag to TypeGNUSparse. In this case, WriteHeader() caches a copy of the provided header but does not immediately write it to disk, because the sparse map is not available without inspecting the file.
  • tar.Writer.Write() is not able to do hole detection. If this method is called for a sparse file, the cached header is immediately flushed to disk with an empty sparse map.
  • Add a new exported method tar.Writer.ReadFrom(). When this method is called for a sparse file, and the provided reader implements io.Seeker, hole detection is performed using SEEK_HOLE / SEEK_DATA. After hole detection, the cached header can be completed with the sparse map and written to disk, followed by the actual sparse data.
    • In the future, further hole-detection mechanisms can be added here for different operating systems.
  • If no holes are found (or hole detection is not supported), the typeflag is automatically changed back to TypeReg. This basically allows users to opt in to automatic sparse-file detection by always using TypeGNUSparse for regular files (obtaining a result similar to gtar -S).

Notes

  • I think this API is superior because most users don't have sparse maps available. They don't even know how to compute them. What they have is sparse files on disks (eg: generated by a VM) and they want to efficiently pack them with tar. Or they might find tar with sparse files inside and they just want to extract them to disk. We just tell them to use io.Copy() to/from os.File and everything will work as expected. I expect that both Packer and Docker would transparently be fixed.
  • Since this bug was opened, nobody ever raised a concrete use case in which somebody has a sparse data structure in memory and wants to pack it as a tar file. Assuming this use case ever arises, it is still possible to implement it by calling tar.Writer.ReadFrom and passing a special ReadSeeker that is able to go through the sparse data structure in memory like it was a sparse file, and also supports SEEK_HOLE / SEEK_DATA.
Contributor

rasky commented Aug 20, 2017

I think the proposed API is suboptimal because it leaves users of the library with the daunting task of correctly doing hole detection if they want to properly handle sparse files without disk explosions. I have a proposal for a different API.

Reader

Change it so that sparse files can be transparently extracted by using io.Copy to disk.

  • Add a new exported function tar.Reader.WriteTo(). This function implements a standard extraction unless the file being read is a sparse file and the writer also implements io.Seeker: in this case, holes defined in the tar header are skipped using Seek().
  • With the above, existing code that extracts a tar file using io.Copy() is silently upgraded to generate sparse files (assuming it extracts to disk, and the filesystem supports extents; otherwise, nothing changes).
  • I think this is a welcome improvement for people that actually do care about sparse files, because tar.Reader would stop expanding them and provides a way to actually preserve them as sparse files while being streamed to disk.

Writer

Change it so that sparse files can be transparently packed by using io.Copy from disk, with best-effort hole detection.

  • The client will signal a sparse file by setting the typeflag to TypeGNUSparseFile. In this case, WriteHeader() caches a copy of the provided header but does not write it immediately, because the sparse map is not available yet without inspecting the file.
  • tar.Writer.Write() is not able to do hole detection. If this function is called for a sparse file, the cached header is immediately flushed to disk with an empty sparse map.
  • Add a new exported function tar.Writer.ReadFrom(). When this function is called for a sparse file, and the provided reader implements io.Seeker, hole detection is performed using SEEK_HOLE / SEEK_DATA. After hole detection, the cached header can be completed with the sparse map and written out, followed by the actual sparse data.
    • In future, further hole detection mechanisms can be added here for different operating systems.
  • If no holes are found (or hole detection is not supported), the typeflag is automatically changed back to TypeReg. This basically allows users to opt in to automatic sparse file detection by always using TypeGNUSparseFile for regular files (obtaining a result similar to gtar -S).

Notes

  • I think this API is superior because most users don't have sparse maps available; they don't even know how to compute them. What they have is sparse files on disk (e.g. generated by a VM) that they want to pack efficiently with tar, or a tar archive with sparse files inside that they just want to extract to disk. We just tell them to use io.Copy() to/from os.File and everything will work as expected. I expect that both Packer and Docker would transparently be fixed.
  • Since this bug was opened, nobody ever raised a concrete use case in which somebody has a sparse data structure in memory and wants to pack it as a tar file. Assuming this use case ever arises, it is still possible to implement it by calling tar.Writer.ReadFrom and passing a special ReadSeeker that is able to go through the sparse data structure in memory like it was a sparse file, and also supports SEEK_HOLE / SEEK_DATA.

Member

dsnet commented Aug 20, 2017

The current API may be sub-optimal in performance, but it is complete in functionality. The suggestions you have are reasonable approaches in addition to what's currently sent out for review.


Your suggestion to add Reader.WriteTo seems reasonable.

However, an implementation of Writer.ReadFrom is not so easy. There are several problems:

  • (As you mentioned), hole detection is hard. SEEK_HOLE and SEEK_DATA are very Linux-specific and make many assumptions even on Linux platforms; it assumes that the io.Reader is also an io.ReadSeeker, and more specifically, that it is actually an *os.File. Whether it does sparse-file expansion correctly will be opaque to the user, and when it fails to do so it will be even more frustrating to the user. I did research what it would take to do sparse-hole detection on other OSes, and it seems that in other situations you would need either the filesystem inode or the file descriptor, neither of which is easily obtained.
  • WriteHeader is called before any write operations on the file. However, your solution requires the WriteHeader to cache the header, and silently change it behind the user's back upon the first write operation. If Write is called first, it must assume non-sparse (unless SparseHoles was populated), and write the header as is. Otherwise, if ReadFrom is called and it satisfies all of the assumptions above, then it magically writes a sparse file.

If sparse-file detection were more prevalent across all OSes, reliable, and easy to access, then I would support Writer.ReadFrom, but it's currently too magical in how it works.


In terms of performance, the current API can be augmented by Reader.Discard and Writer.FillZeros, which allow you to skip through the holes very quickly. While it is a disadvantage that users must skip over the holes themselves using Header.SparseHoles, the approach is much more explicit and clear in how it works.


Member

dsnet commented Aug 20, 2017

As a compromise, here's a possibility that has the advantages of Reader.WriteTo and Writer.ReadFrom for performance and (more) explicit handling of sparse files.

We can do the following:

  • We keep the current API of exporting Header.SparseHoles, which gives the user the ability to manipulate that field manually if they want to.
  • We modify FileInfoHeader to populate Header.SparseHoles. (I looked into this in the past and it's not easy).
  • We add Reader.WriteTo, which will operate as you suggested above.
  • We add Writer.ReadFrom which will only check if the input io.Reader is also an io.ReadSeeker. If so, it will use that to seek past large holes, implicitly calling the Writer.fillZeros method (which we will keep unexported).

The above has the advantage that Writer.ReadFrom only needs to check for io.ReadSeeker and doesn't need to assume SEEK_HOLE and SEEK_DATA support. It avoids any magic in Writer.WriteHeader where it would cache the header, possibly change it again, and write it on first write operation. Population of Header.SparseHoles is the responsibility of FileInfoHeader, which is already an OS-specific function given that it takes in an os.FileInfo.


Contributor

rasky commented Aug 20, 2017

I like your suggestion because it manages to avoid the implicit header caching, and moves hole detection into header creation, where it belongs. But I don't see how it can be implemented. Main question: how can FileInfoHeader populate Header.SparseHoles? It only receives an os.FileInfo as input, and there's no way to open a file from a FileInfo (there's no full path information in it).

Keeping SparseHoles exported also raises some consistency questions:

  • What happens if a user sets SparseHoles in the header but then writes non-zero bytes in the holes using Writer.Write?
  • What happens if a user sets SparseHoles but not Typeflag to TypeGNUSparseFile?

but I guess this can be fixed with documentation.


Member

dsnet commented Aug 20, 2017

But I don't see how it can be implemented. Main question: how can FileInfoHeader populate Header.SparseHoles?

Agreed. I tried implementing it and it's not possible. I don't see a way around this other than a new constructor function func FileHeader(f *os.File) (*Header, error). I would still like to see confirmation that that signature is sufficient to detect sparse holes on OSX and Windows.

What happens if a user sets SparseHoles in the header but then write non-zero bytes in the holes using Writer.Write?

The documentation for Writer.Write in CL/57212 says it must be written with NUL-bytes.

What happens if a user sets SparseHoles but not Typeflag to TypeGNUSparseFile?

That's fine. TypeGNUSparseFile implies that the format will be GNU, otherwise it will be PAX. Both are valid. I'll document it more when the format is actually exposed to the user in #18710.


Contributor

rasky commented Aug 21, 2017

I don't see a way around this other than a new constructor function func FileHeader(f *os.File) (*Header, error).

That is really unfortunate, as it would not even be a superset of FileInfoHeader() (FileInfoHeader() works with any os.FileInfo, not only those that come from an os.File; I have used it many times to generate a header from in-memory structures that exposed an os.FileInfo as a way to fake a filesystem node). So we would end up with two similar functions, neither of which is able to handle all required cases, and the user would be forced to use one or the other depending on the context.

So it looks like there are currently two options on the table:

  • Use my original proposal (implicit header caching during writing, no SparseHoles API)
  • Use your later proposal, but with a new FileHeader constructor which must be used in addition to FileInfoHeader, complicating user code.

Any other idea? Do you have a final call on this?

I would still like to see a solution for whether that signature is sufficient to detect sparse holes on OSX and Windows.

In Windows, you can use os.File.Fd() to access the underlying HANDLE, with which you can call DeviceIOControl with control code FSCTL_QUERY_ALLOCATED_RANGES to access the hole list (see this example).

Currently released versions of macOS (or rather, HFS+) do not support sparse files. The new APFS filesystem supports them, but the documentation is rather sparse at the moment, given that macOS with APFS is still in beta (this is the only APFS-related API list I found, and it touches several features but not sparse files).

I did some quick tests on both beta and non-beta versions of macOS, and it looks like APFS allows creating sparse files just as Linux does, by simply seeking; for instance, I ran dd if=/dev/zero of=file.img bs=1 count=0 seek=512000000 to create a file with an apparent size of 512 MB that occupies zero bytes (verified with du file.img). Also, the man page of lseek includes SEEK_HOLE and SEEK_DATA; I haven't tested them directly, but they're described as working exactly as they do in Linux and Solaris. So it looks like macOS support will be achieved with the same code used on Linux.


Member

dsnet commented Aug 21, 2017

(transferring over discussion from CL/57212)

There are 3 distinct tasks with regard to sparse files:

  • A. Support for encoding/decoding sparse files in the TAR Format (reader support already exists, CL/57212 adds support for encoding)
  • B. Support for easily and efficiently writing/reading sparse files to/from disk.
  • C. Support for detecting sparse holes in a file on disk

While they are obviously related, they are independent problems, and I believe conflating them together is a mistake.

As it currently stands, the API for Writer is split into two parts: WriteHeader and Write (of which zero or more calls are made to populate the data for the previously written header). This API exactly reflects how TAR files are serialized.

Any solution for sparse files must have the information for sparse holes available at the time that WriteHeader is called (which implies that information about sparse holes is held within the Header as either exported or unexported information). I am a proponent of having that information exported since there are other ways through which I want to create sparse files other than just pulling them straight from disk. While I understand that this information is "lower-level" than what users may want, it is a literal representation of what the sparse file looks like and is sufficient for representing sparse files in both GNU and PAX format. Users that want to use higher-level APIs to populate this field do not need to care about it. In the same way, if you use FileInfoHeader, you never need to care about setting the Header.Mode yourself, but the fact that Mode is available is still very useful when crafting the Header manually.

That being said, we can separate out task C as a helper method or function that takes in an *os.File and populates a Header.SparseHoles field. It seems that we can't use FileInfoHeader(os.FileInfo) (*Header, error) because of lack of information, and there are disadvantages to FileHeader(*os.File) (*Header, error). We could make it even more surgical and only generate the sparse holes: (*Header) SetSparseHolesFrom(*os.File) error.

API aside, the implementation itself is actually hard because support for sparse-hole detection varies widely across operating systems. (Anyone who's looked at the code for GNU or BSD tar will see a host of #ifdef special-casing logic for different platforms, yuck). The fact that detection relies on OS-specific details is all the more reason why we should not conflate C into B or A; that is, Writer or Reader should not change behavior depending on OS specific details (It's fine for OS-specific information to affect the creation of Header, but not directly Reader or Writer).

In regards to B, how to efficiently and easily write/read a file to/from disk is a separate problem from how it is represented in the TAR format (which is task A and addressed by CL/57212). The suggestions given above regarding how to resolve B are both compatible with the approach taken for A. For example, Reader.Discard and Writer.FillZeros are actually implemented (but unexported) in CL/57212; the unit tests actually use them to efficiently write a sparse file with a very large logical size. Also, Reader.WriteTo and Writer.ReadFrom can be added that special-case inputs that are also an io.WriteSeeker or io.ReadSeeker. WriteTo/ReadFrom can be internally implemented in terms of Discard/FillZeros, and the use of io.Seeker does not need to depend on OS-specific details like SEEK_HOLE and SEEK_DATA, but only on io.SeekCurrent to skip past holes. Again, neither of these extensions conflicts with A.

In regards to A, I don't think there's any controversy here. Support for sparse files clearly requires logic to encode a valid TAR file representing the sparse holes.


It seems to me, that the biggest unknown is how to accomplish C. Your research seems to confirm that *os.File is sufficient to detect holes on all major platforms. Whether we add a new function or a new method to Header, and what it looks like is still up for debate, but I am fairly convinced that this is the right direction to be headed.

That being said, the fact that C is not fully thought through should not prevent A and B from happening. In fact, doing A and B first gives us a testing ground to prototype what C should look like. It's okay if only support for A (and B if time permits) lands in Go1.10, and users still need to write their own logic for C. We can merge those into Go1.11, based on experience reports.


Contributor

rasky commented Aug 22, 2017

I understand your line of reasoning. I have a couple of comments:

  • You seem not to care much about the fact that existing code will have to be modified to fully support sparse files. With my original proposal, most tar extractors/packers would be automatically upgraded to correctly support sparse files. With your proposal (e.g. (*Header) SetSparseHolesFrom(*os.File) error), existing code will have to be amended to call this function during packing, while extraction would potentially work as-is. Can you confirm that this is your preference?
  • You seem to want to avoid OS-specific code in Reader / Writer. I'm afraid that's not fully possible because on Windows you need to create holes through a specific API; seeking by itself does not create holes, just zeros. So Reader.WriteTo will have to call OS-specific code, when Windows support is added.

I'm still very worried about releasing 1.10 with only A and B, as we might make mistakes on the API that are hard to revert afterwards. But that's your call.

@dsnet

dsnet Aug 22, 2017

Member

I'm still very worried about releasing 1.10 with only A and B, as we might make mistakes on the API that are hard to revert afterwards.

I understand.

The only exposed API in A is Header.SparseHoles. I know you have reservations about exposing that information, but I believe users should be able to craft a sparse-file manually. If we only rely on what results from C to produce sparse files, then manual crafting would not be possible.

To a degree, I do share your concern regarding B. I don't feel rushed to expose this for Go1.10, but it would be nice.


@rasky

rasky Aug 23, 2017

Contributor

OK so the roadmap is clear now. Do you want me to send CLs about some specific parts?


@dsnet

dsnet Aug 23, 2017

Member

After I submit the CL for A, feel free to send out a CL to add WriteTo/ReadFrom to address B. We can continue discussing here for a design for C. Do you have any proposals for C?


gopherbot pushed a commit that referenced this issue Aug 23, 2017

archive/tar: implement Writer support for sparse files
This CL is the second step (of two; part1 is CL/56771) for adding
sparse file support to the Writer.

There are no new identifiers exported in this CL, but this does make
use of Header.SparseHoles added in part1. If the Typeflag is set to
TypeGNUSparse or len(SparseHoles) > 0, then the Writer will emit an
sparse file, where the holes must be written by the user as zeros.

If TypeGNUSparse is set, then the output file must use the GNU format.
Otherwise, it must use the PAX format (with GNU-defined PAX keys).

A future CL may export Reader.Discard and Writer.FillZeros,
but those methods are currently unexported, and only used by the
tests for efficiency reasons.
Calling Discard or FillZeros on a hole 10GiB in size does take
time, even if it is essentially a memcopy.

Updates #13548

Change-Id: Id586d9178c227c0577f796f731ae2cbb72355601
Reviewed-on: https://go-review.googlesource.com/57212
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@rasky

rasky Aug 24, 2017

Contributor

While implementing Reader.WriteTo, I realized that the io.WriteSeeker abstraction alone is also insufficient on Linux/macOS for creating sparse files, because it cannot create a hole at the end of the file; you need to call os.File.Truncate() for that to actually happen. If you just seek past EOF and close the file, the file size is not changed. This is in addition to io.WriteSeeker being insufficient on Windows.

We have a few possibilities:

  • Special-case Reader.WriteTo for *os.File rather than io.WriteSeeker
  • Special-case Reader.WriteTo for io.WriteSeeker, plus tell the users that they need to call Truncate themselves.
  • Special-case Reader.WriteTo for io.WriteSeeker and *os.File (or a non-idiomatic Truncater interface).

Any comment?


@dsnet

dsnet Aug 24, 2017

Member

Another possibility is to have the io.Seeker seek to 1 byte before the end of the last fragment and write a single byte.

My evaluation of the approaches:


Seek to 1-byte before the last hole and write a single zero byte.

For consistency, Writer.ReadFrom can also use the same 1-byte-before-EOF technique to ensure the file really is that long (since you can Seek to arbitrary offsets, and most io.Seekers won't tell you that you are past EOF).

I don't think byte-for-byte reproduction (in terms of where the hole regions are) of sparse files is necessary. So I'm okay if this implicitly causes a single block to be allocated at the end of the file. The reality is that the sparse file generated is still at the whim of the underlying filesystem, which may not be able to exactly respect the hole regions from the original tar file (the source FS may have 4KiB blocks, and the target FS may have a different block size and can't represent holes at offsets from the original FS).

Special-case Reader.WriteTo for os.File rather than io.WriteSeeker

For consistency, WriteTo/ReadFrom should both use os.File then.
The upside to this approach is that it is more clearly optimized for os.File, which has stronger guarantees about the behavior of seeking past EOF.
The downside is that you can't use a wrapper around os.File that does the hole punching yourself (as in the Windows case).

Special-case Reader.WriteTo for io.WriteSeeker, plus tell the users that they need to call Truncate themselves.

The downside is that this is a very subtle requirement for the user.

Special-case Reader.WriteTo for io.WriteSeeker and os.File (or a non-idiomatic Truncater interface).

The downside is more special-casing. There is value in having as few special-cases as possible.


My first vote goes to "seek 1-byte before" technique. My second vote is special-casing for "os.File" only.


@mwhooker mwhooker referenced this issue in hashicorp/packer Aug 24, 2017

Closed

support for sparse archives in atlas #5282

@dsnet

dsnet Aug 29, 2017

Member

This bug seems relevant to what we're trying to do here: #21681


@dsnet dsnet removed their assignment Aug 31, 2017

@dsnet

dsnet Sep 1, 2017

Member

@rasky, have you started working on B yet? I have a working version of it using the "seek 1-byte before" technique.


@gopherbot

gopherbot Sep 1, 2017

Change https://golang.org/cl/60871 mentions this issue: archive/tar: add Header.DetectSparseHoles

@gopherbot

gopherbot Sep 1, 2017

Change https://golang.org/cl/60872 mentions this issue: archive/tar: add Reader.WriteTo and Writer.ReadFrom

@djs55 djs55 referenced this issue in docker/for-mac Sep 8, 2017

Closed

Docker consumes all disk space #2038

gopherbot pushed a commit that referenced this issue Sep 18, 2017

archive/tar: add Reader.WriteTo and Writer.ReadFrom
To support the efficient packing and extracting of sparse files,
add two new methods:
	func Reader.WriteTo(io.Writer) (int64, error)
	func Writer.ReadFrom(io.Reader) (int64, error)

If the current archive entry is sparse and the provided io.{Reader,Writer}
is also an io.Seeker, then use Seek to skip past the holes.
If the last region in a file entry is a hole, then we seek to 1 byte
before the EOF:
	* for Reader.WriteTo to write a single byte
	to ensure that the resulting filesize is correct.
	* for Writer.ReadFrom to read a single byte
	to verify that the input filesize is correct.

The downside of this approach is when the last region in the sparse file
is a hole. In the case of Reader.WriteTo, the 1-byte write will cause
the last fragment to have a single chunk allocated.
However, the goal of ReadFrom/WriteTo is *not* the ability to
exactly reproduce sparse files (in terms of the location of sparse holes),
but rather to provide an efficient way to create them.

File systems already impose their own restrictions on how the sparse file
will be created. Some filesystems (e.g., HFS+) don't support sparseness and
seeking forward simply causes the FS to write zeros. Other filesystems
have different chunk sizes, which will cause chunk allocations at boundaries
different from what was in the original sparse file. In either case,
it should not be a normal expectation of users that the location of holes
in sparse files exactly matches the source.

For users that really desire to have exact reproduction of sparse holes,
they can wrap os.File with their own io.WriteSeeker that discards the
final 1-byte write and uses File.Truncate to resize the file to the
correct size.

Other reasons we chose this approach over special-casing *os.File:
	* The Reader already has special-case logic for io.Seeker
	* As much as possible, we want to decouple OS-specific logic from
	Reader and Writer.
	* This allows other abstractions over *os.File to also benefit from
	the "skip past holes" logic.
	* It is easier to test, since it is harder to mock an *os.File.

Updates #13548

Change-Id: I0a4f293bd53d13d154a946bc4a2ade28a6646f6a
Reviewed-on: https://go-review.googlesource.com/60872
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>

@gopherbot gopherbot closed this in 1eacf78 Sep 20, 2017

@vbatts


vbatts Sep 20, 2017

Contributor

@rasky

rasky Sep 20, 2017

Contributor

Now that support has been added, it would be great if people interested in this feature would provide feedback at least on the API, before it gets shipped and can’t be changed anymore.

Have a look at https://tip.golang.org/pkg/archive/tar/


@grubernaut

grubernaut Sep 22, 2017

cc: @mwhooker, as the original case for this issue came from an end user requiring sparse support inside Atlas-Go after creating a sparse image via Packer. More detail, and the function to be patched, are linked here: #13548 (comment)

@AstromechZA

AstromechZA Sep 23, 2017

Came across this issue while looking for sparse-file support in Go. The API looks good to me and certainly fits my use case :). Is there no sysSparsePunch needed for Unix?

@dsnet

dsnet Sep 23, 2017

Member

On Unix OSes that support sparse files, seeking past EOF and writing or resizing the file to be larger automatically produces a sparse file.


@AstromechZA

AstromechZA Sep 23, 2017

Cool, so it detects that you've skipped past a block without writing anything to it, and automatically assumes it's sparse? Nice 👍

@gopherbot

gopherbot Nov 16, 2017

Change https://golang.org/cl/78030 mentions this issue: archive/tar: partially revert sparse file support

gopherbot pushed a commit that referenced this issue Nov 16, 2017

archive/tar: partially revert sparse file support
This CL removes the following APIs:
	type SparseEntry struct{ ... }
	type Header struct{ SparseHoles []SparseEntry; ... }
	func (*Header) DetectSparseHoles(f *os.File) error
	func (*Header) PunchSparseHoles(f *os.File) error
	func (*Reader) WriteTo(io.Writer) (int64, error)
	func (*Writer) ReadFrom(io.Reader) (int64, error)

This API was added during the Go1.10 dev cycle, and is safe to remove.

The rationale for reverting is that Header.DetectSparseHoles and
Header.PunchSparseHoles provide functionality that probably belongs in
the os package itself.

The other APIs, like Header.SparseHoles, Reader.WriteTo, and Writer.ReadFrom,
perform no OS-specific logic and only perform the actual business logic of
reading and writing sparse archives. Since we do not know what the API added to
package os may look like, we preemptively revert these non-OS-specific changes
as well by simply commenting them out.

Updates #13548
Updates #22735

Change-Id: I77842acd39a43de63e5c754bfa1c26cc24687b70
Reviewed-on: https://go-review.googlesource.com/78030
Reviewed-by: Russ Cox <rsc@golang.org>
@rasky

rasky Nov 17, 2017

Contributor

Unfortunately, the code had to be reverted and will not be part of 1.10 anymore. This bug should probably be reopened.


@bradfitz bradfitz reopened this Nov 17, 2017

@bradfitz bradfitz modified the milestones: Go1.10, Unplanned Nov 17, 2017

@bradfitz bradfitz added the NeedsFix label Nov 17, 2017

@stgraber stgraber referenced this issue in lxc/lxd Feb 10, 2018

Closed

lxc publish expands sparse files #4239
