
Proposal: Rewrite the entire package. Design doc discussion #90

Closed
mholt opened this issue Sep 20, 2018 · 4 comments · Fixed by #99
mholt (Owner) commented Sep 20, 2018

UPDATE: Done in pull request #99

Having been using this package for a few years now, I've encountered a number of issues that lead me to want to redesign this package entirely: burn it down and start over, copying only the fundamental parts of the code, and not worrying about backwards compatibility.

Some specific issues I've experienced:

  • Too much magic. Recently I spent a day debugging a problem where a .zip file could not be reliably extracted with archiver: sometimes it would work, sometimes it wouldn't. I eventually discovered that archiver determines which extractor to use based on the extension and the file header while iterating through a map of formats (which is not ordered). If the Zip format came first, it matched by extension but failed to extract; if the TarGz format came first, it matched by file header (because the file was actually a tar.gz), and extraction succeeded.

  • Weak API. Apparently I was able to accidentally create a .tar.gz file with the Zip archiver, because the filename I built was not tied to the archiver format I was using. I can do archiver.TarGz.Make("file.zip") without errors, which is bad. Here's the code that led to my bug in the first place (notice the missing . in "zip"):

var a archiver.Archiver = archiver.TarGz
if filepath.Ext(outputFile) == "zip" {
	a = archiver.Zip
}

^ Bad package design.

  • Not enough customizability. Namely: compression level; whether to include a top-level named folder (vs. just its contents), similar to how rsync behaves based on the presence of a trailing slash; and whether to overwrite existing files when outputting.

  • Lack of native streaming capabilities. From a library perspective, I should be able to stream in a zip file and spit out individual files, or stream in individual files (or a list of filenames?) and spit out a single zip file.

  • There is no true cross-platform native solution to zip-slip (yet). I had to disable the "security feature" that prevented me from extracting a perfectly safe archive. Even "SecureJoin" solutions don't cut it (read the linked thread, and its linked threads). For now, these "mitigations" only get in the way.

  • Not enough power to inspect archives or cherry-pick files. It would be helpful to be able to work with archives' contents without performing an extraction, such as getting listings, or filtering which files are extracted, etc.

General solutions:

  • When possible (almost always), match only by file header and ignore the file extension. If the file contents are not (yet) available, then use the extension, but only after a warning or explicit opt-in. Or (maybe also) require that the file extension, where present, matches the format when creating an archive.

  • Be verbose in the error messages; if doing any magic, report it or make the magic explicitly opt-in either with a configuration parameter or a special high-level API that is documented as being magical, which wraps the underlying, concrete, explicit functions.

  • Couple the file extension to the archiver. For example: don't allow the Zip archiver to make a .tar.gz file. For example, the buggy code above could have been avoided with something more like this: archiver.Make(outputFile, files...) which uses the extension of outputFile to force a format that matches.

  • Expand the API so that an archiver is created for a specific format before being used, rather than having hard-coded globals like archiver.Zip like we do now. This will allow more customization too. Imagine zipArch := archiver.Zip{CompressionLevel: 10} or something similar.

  • Be explicit about our threat model, which is being adjusted, to state that the files are expected to be trusted, i.e. don't download files you don't trust. Maybe it is possible to inspect a file before extracting it to know whether it could be malicious (e.g. look for zip-slip patterns in file names), but I am not sure about that yet.

  • Moar interfaces. We have one, Archiver, but we might need more, to accommodate an expanded design with more features. Small interfaces are the best.

  • Rename the package to archive. (Decided to keep it the same)

This issue is to track the discussion about the new design; work will hopefully begin soon, as I can find the time.

johnarok (Contributor) commented Oct 14, 2018

Thinking we should do the following to focus our efforts on the redesign and channel the community's help towards it:

  1. Update README on this enhancement and create a new project to track related issues that might be better handled in this re-design.
  2. Make the current version maintenance (bug fixes) only and update the README.
  3. Close all PRs older than 6 months; mostly they are around fixing zip-slip and other conditions.

thoughts?

mholt (Owner, Author) commented Oct 14, 2018

@johnarok Yes, that all sounds good -- let's draw attention to it and triage old issues that will be handled by the redesign. As soon as I get a chance to get https://relicabackup.com off the ground, I want to turn my attention (at least on weekends) to this, since we use this package extensively when doing releases.

mholt (Owner, Author) commented Nov 5, 2018

I spent some hours on a detailed API design today that addresses all my concerns, with lots of room for growth as well. Everything here is a very rough draft, subject to change and input, and I'm probably overlooking many implementation/technical difficulties that could cause this design draft to change...

I started by poring over the open and closed issues and PRs and assembled a wishlist of functionality:

  • Make an archive FILE (on disk) by giving it a list of source files
    • Archiver interface
  • Extract an entire archive FILE (on disk) to a given directory on disk
    • Unarchiver interface
  • Make an archive STREAM (io.Writer) by adding files one at a time (io.Reader + file info)
    • Writer interface
  • Extract an archive STREAM (io.Reader) by reading files one at a time (io.Writer + file info)
    • Reader interface
  • Walk the contents of an archive and get info about the files
    • Walker interface
  • Extract a single file from an archive, possibly during a walk
    • Extractor or Walker+File interfaces
  • Get info about a single file from an archive without extracting it
    • Walker interface (caller of Walk can skip undesired files)
  • Compress a single file, or byte stream
    • Compressor and FileCompressor interfaces
  • Decompress a single file, or byte stream
    • Decompressor and FileDecompressor interfaces

I believe all these requirements are met by the following type definitions:

type Archiver interface {
	Archive(sources []string, destination string) error
}

type Unarchiver interface {
	Unarchive(source, destination string) error
}

type Writer interface {
	Create(out io.Writer) error
	Write(f File) error
	Close() error
}

type Reader interface {
	Open(in io.Reader) error
	Read() (File, error)
	Close() error
}

type Extractor interface {
	Extract(targets []string, source, destination string) error
}

type Walker interface {
	Walk(archive string, walkFn WalkFunc) error
}

type WalkFunc func(f File, err error) error

type File struct {
	// The exact fields we choose could be whatever;
	// these inspired by os.FileInfo
	Name    string
	Size    int64
	Mode    os.FileMode
	ModTime time.Time
	IsDir   bool

	// The original header info; depends on
	// type of archive -- could be nil, too.
	Header interface{}

	// Allow the file contents to be read as an io.Reader.
	io.Reader
}

type FileCompressor interface {
	CompressFile(source, destination string) error
}

type FileDecompressor interface {
	DecompressFile(source, destination string) error
}

type Compressor interface {
	Compress(in io.Reader, out io.Writer) error
}

type Decompressor interface {
	Decompress(in io.Reader, out io.Writer) error
}

Then a Zip archive implementation would start with a struct definition like this:

type Zip struct {
	CompressionLevel     int
	OverwriteExisting    bool
	SelectiveCompression bool
	FollowSymlinks       bool
	ExtractToFolder      bool

	w io.Writer // explained later
	r io.Reader // explained later
}

It could then implement the methods that this archive format is reasonably capable of. Not all formats need to implement all interfaces, that's the beauty of this design. And for the formats that support it, this design also has full support for dealing with streams efficiently.

Reader and Writer allow you to treat the archive kind of like a stream, but it's a little awkward since it's more like a bufio type: you read or write a file (technically just its header, under the hood) in a discrete step, and then its contents can be read or written in the usual io.Reader/io.Writer style. The only other question is how to keep the state for reading/writing. Imagine this (some of this is kind of made up; just roll with it):

z := &archive.Zip{
	CompressionLevel: 10,
	SelectiveCompression: true,
}

err := z.Create(w)
if err != nil {
	return err
}
defer z.Close()

for _, f := range openFiles {
	err := z.Write(archive.File{
		Name: f.Name(),
		// ...
		Reader: f,
	})
	if err != nil {
		return err
	}
}

We should only allow a Zip instance to either Open or Create (read or write), since otherwise Close is ambiguous. Unless we want to allow closing both. I dunno. There are no performance benefits to reusing a Zip (or other format), so maybe we should prevent reuse to prevent confusion.

Other than that, I'm quite happy with this design so far -- assuming it works -- and will get started implementing it soon.

Feedback welcome

mholt mentioned this issue Nov 6, 2018
mholt added this to the 3.0 milestone Nov 6, 2018
mholt (Owner, Author) commented Nov 6, 2018

I've begun work in the rewrite branch. See PR #99 to track my progress. ZIP is already implemented. Other formats on the way. Would love some help testing it!
