Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support recursion across all archive types #47

Merged
merged 4 commits into from
Dec 21, 2021

Conversation

glynternet
Copy link
Contributor

@glynternet glynternet commented Dec 21, 2021

Recursion supported across all archive types by making crawl.Log4jIdentifier type no longer holds any concepts of specific archives or archive types and then reusing pretty much the same approach that was used before.

Instead, the following fields have been added to it:

ParseArchiveFormat func(string) (archive.FormatType, bool)
ArchiveWalkers     func(archive.FormatType) (archive.WalkerProvider, bool)

When a file or path is encountered, the ParseArchiveFormat function is used to determine if the file is an archive, and the archive type is passed along to the ArchiveWalkers function to retrieve an archive.WalkerProvider.

A new archive.WalkerProvider should support creating a WalkFn from either a file or a reader. This is handy because during the first walk, we have access to files on disk and so do not have to load anything into memory. When we are recursing into archives we do not have access to disk as everything is currently in memory, so a WalkerProvider must also provide a FromReader method to create one from an io.Reader. Reading into memory is only required by the archive.ZipWalkers provider currently, which still has a configurable memory limit, as it did before this PR (this situation will be improved in #40).

@changelog-app
Copy link

changelog-app bot commented Dec 21, 2021

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Support nested archive recursion across all supported archive types.

Check the box to generate changelog(s)

  • Generate changelog entry

Copy link
Contributor

@nmiyake nmiyake left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor feedback and PR needs to be deconflicted, but on the whole the structure is much nicer and looks to make sense.

In the abstract may have a bit of feedback on the WalkerProvider construction, but nothing that's fundamental/major enough to derail the current approach.

@@ -96,3 +105,21 @@ func mustZipFileReader(t *testing.T, path string) (*zip.Reader, func(*testing.T)
assert.NoError(t, file.Close())
}
}
func mustTarGZReader(t *testing.T, path string) (*tar.Reader, func(*testing.T)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add newline between end of previous declaration and start of this one.


// FileWalkFn is called by a WalkFn on each file contained in an archive.
type FileWalkFn func(ctx context.Context, path string, size int64, contents io.Reader) (proceed bool, err error)

// WalkZipFiles walks the files in the provided *zip.Reader, calling walkFn on each header.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function can now be private (or even inlined into the one calling location) -- previously this was exposed since it was referenced directly, but no more need to do that with this refactor.

return err
}
// WalkTarFiles walks the files in the provided *tar.Reader, calling walkFn on each header.
func WalkTarFiles(ctx context.Context, r *tar.Reader, walkFn FileWalkFn) (err error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for zip (can either be unexported or possibly even inlined to one calling location)

@nmiyake
Copy link
Contributor

nmiyake commented Dec 21, 2021

Needs changelog and conflict resolution

@bulldozer-bot bulldozer-bot bot merged commit 9067da6 into develop Dec 21, 2021
@bulldozer-bot bulldozer-bot bot deleted the gh/generic-archive-recursion branch December 21, 2021 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants