Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

v4 rewrite: All new design, core types, and stream-oriented interfaces #302

Merged
merged 11 commits into from Jan 4, 2022

Conversation

mholt
Copy link
Owner

@mholt mholt commented Jan 2, 2022

I spent the holiday rewriting this package from scratch with a completely new approach to handling archives and compression formats. This will become v4.

The new core APIs are completely stream-oriented and file-agnostic. The abstractions for files, directories, and file systems are virtualized thanks to the recently-added io/fs package in the Go standard library. I expect these design changes will close most open issues and PRs because it either fixes the problems or makes them irrelevant.

A significant number of issues relate directly to interactions with specific file systems (files on disk) and this new API does not deal with that directly anymore, except for a couple specific functions that read the disk in order to create the abstraction. Nothing in the core API writes to disk or deals with that. (Yay!) The stream and FS abstractions are highly flexible to build upon.

Another nice thing about the new design is that there's no more need for explicit composite types (like TarGz and TarBz2), because we have a new CompressedArchive type that composes an archive and a compression format. It's mainly used with Tar only, since Zip and Rar do their own thing, but another nice feature is the Identify() function that automatically gets you the right type:

// opening a file on disk for this example, but can be any ReadSeeker
unknownFile, err := os.Open("filename.tar.gz")
if err != nil {
	return err
}
defer unknownFile.Close()

// if you don't have a filename, leave it blank: identification uses both/either filenames and/or streams
format, err := archiver.Identify("filename.tar.gz", unknownFile)
if err != nil {
	return err
}

// we can now work with the file, for example, extract a file out of it
if ex, ok := format.(archiver.Extractor); ok {
	ex.Extract(context.Background(), unknownFile, "target.txt", func(_ context.Context, f File) error {
		// do something with the file ...
		return nil
	})
}

// or maybe it's just a compressed log file
if decom, ok := format.(archiver.Decompressor); ok {
	rc, err := decom.OpenReader(unknownFile)
	if err != nil {
		return err
	}
	defer rc.Close()
	// read from it ... all reads are decompressed now
}

I think the new APIs are pretty slick, and you should try them out and let me know what you think.

One of my favorite new features is the FileSystem() function. Give it a path on disk, and it will return a fs.ReadDirFS. Basically, it lets you read from real directories, regular files, archive files, and compressed archive files, ALL THE SAME WAY. This is pretty cool I think. You don't have to worry about whether the given file is just a regular file, a directory, or an archive file (which acts like a directory because it contains other files!) -- you can traverse it all the same way. And the archive format doesn't matter either, it's automatically identified for you! You literally won't even know things are being decompressed as you read from them:

fsys, err := archiver.FileSystem("/path/to/folder/or/file")
if err != nil {
	return err
}

// traverse everything except the ".git" folder...
err = fs.WalkDir(fsys, ".", func(path string, d fs.DirEntry, err error) error {
	if err != nil {
		return err
	}
	if path == ".git" {
		return fs.SkipDir
	}
	fmt.Println(path, d.IsDir())
	return nil
})
if err != nil {
	return err
}

// ...or just open one file
file, err := fsys.Open("example.txt")
if err != nil {
	return err
}
defer file.Close()

To make a new archive from files on disk, you might do this:

files, err := archiver.FilesFromDisk(map[string]string{
	"/path/on/disk/file1.txt": "file1.txt",
	"/path/on/disk/file2.txt": "subfolder/file2.txt",
	"/path/on/disk/folder":    "",
})
if err != nil {
	return err
}

out, err := os.Create("example.tar.gz")
if err != nil {
	return err
}
defer out.Close()

caf := archiver.CompressedArchive{
	Compression: archiver.Gz{},
	Archival:    archiver.Tar{},
}

err = caf.Archive(context.Background(), out, files)
if err != nil {
	return err
}

Notice how you have the flexibility of mapping each file (or folder) to a different path in the archive. You can also leave the mapped path blank to assume the base filename for convenience. Folders are added recursively.

Oh yeah, and I added basic context support.

The arc command has not been ported over yet, and as that has to deal directly with the file system, it will require some work before it is ready. This would be the only part of the repository that would write to disk, as the core library doesn't write to disk anymore.

Looks like I was able to delete about 70% of the code, however that count includes test files, the command, and the README, which have not been restored yet. Still, I could feel a significant code reduction in this new design when I wrote it.

Should make irrelevant / close / fix #118, #128, #131, #141, #146, #150, #227, #194, #204, #216, #239, #255, #262, #278, #282

@mholt mholt requested a review from coolaj86 January 2, 2022 00:38
@mholt mholt self-assigned this Jan 2, 2022
@mholt
Copy link
Owner Author

mholt commented Jan 2, 2022

I might rename this package to archives. It was suggested before but I didn't get around to it. Maybe now is a good time.

@coolaj86
Copy link
Collaborator

coolaj86 commented Jan 2, 2022

I might rename this package to archives. It was suggested before but I didn't get around to it. Maybe now is a good time.

I believe in the Go philosophy that if you're making a breaking change, that's a version bump, but if you're making a philosophical change, that's a new package.

@coolaj86
Copy link
Collaborator

coolaj86 commented Jan 2, 2022

This isn't a PR that one can simply review. This is a new project from the ground up.

Make a new repo (without the history baggage), get some tests in, mark this as deprecated, and link to the new repo.

How do you want to handle tests in the new project?

@mholt
Copy link
Owner Author

mholt commented Jan 2, 2022

Ha, sorry @coolaj86 , I tagged you as a reviewer not with the expectation that you'd actually give a full review -- I probably should have just @-mentioned you instead, since all I wanted to do was let you know that this is inbound. :)

There's no philosophical changes here... just semantics with the name. Meh, I dunno.

As for tests, I haven't thought too much about those yet. I think having a test corpus is a good idea, but as we learned from experience, some things should be generated (like symlinks) whereas other things can be committed... that'll probably come later.

@mholt mholt changed the title Initial commit of v4 rewrite; core types, methods v4 rewrite: All new design, core types, and stream-oriented interfaces Jan 2, 2022
@mholt mholt removed the request for review from coolaj86 January 2, 2022 09:55
@mholt
Copy link
Owner Author

mholt commented Jan 4, 2022

After quite a few more trials, this seems to be working decently well at least, so I am going to merge it into master but I won't tag it yet (except pre-releases).

Some more tests would be nice, as well as an actual command, but for now I want to have some people start using this, and I need to move my focus onto some other projects here pretty soon anyway.

The new README shows off some of the new version's elegance and features, so I hope you'll enjoy!

@jeff-mccoy
Copy link

@mholt this is a substantial update and a lot of really great additions. I'm very interested in the stream options now as we have a new use for them for our air gap delivery tool that uses your library. Thanks for the work on this, I see some notes about needing testing/validation, but also notice the tests seem to be less robust in this new cut. I don't think we'll be able to jump to v4 without some more automated tests, but will play with this new version for sure. I also saw you have other priorities soon, do you have an idea what the plan will be for automated tests with v4 going forward?

On another note, I spent the holiday break doing a "mega PR" myself for our project that caused some excitement 🤣: defenseunicorns/zarf#237.

@mholt
Copy link
Owner Author

mholt commented Feb 16, 2022

@jeff-mccoy That's great to hear! I agree v4 could benefit from more tests.

I also saw you have other priorities soon, do you have an idea what the plan will be for automated tests with v4 going forward?

Yeah, so that's true... I'm really busy with Caddy/CertMagic right now, and what I have written for archiver v4 works for me and my needs. However, I do accept sponsorships, and at the right tiers I can absolutely prioritize writing some automated tests for you. 👍

Otherwise, I can try to get around to them when I have a chance, or I'm happy to accept contributions (PRs) as well! Let me know what suits you best and we can do that.

(PS. Wowzers, that's a monster PR you've got there!)

@jeff-mccoy
Copy link

Awesome. I think we'll do some combination of those options. We deeply value open source projects and know it can be a thankless endeavor sometimes. We'll keep in touch for sure, thanks again for the quick reply and the great library.

@mholt
Copy link
Owner Author

mholt commented Feb 17, 2022

@jeff-mccoy Sounds good -- would look forward to working with you on this if that can happen!

@mholt
Copy link
Owner Author

mholt commented Jun 20, 2022

@jeff-mccoy Just checking in, how is this lib working for you? (It's been through several releases now that enhance its functionality and fix bugs.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add ByHeader support to all formats
3 participants