
Transparent extraction of archives #14

Open
pombredanne opened this issue Jul 5, 2015 · 10 comments

@pombredanne

As noted in #3, we do not extract and scan at the same time.

A better way would be to handle an archive internally as if it were a special type of directory (both contain files, after all). When a single archive scan is requested (or when archives are found in a larger scan), we could extract these to a temporary directory, scan the extracted content and return the results. This would require a bit more thinking to get right.
At a high level, a tree with archives would be treated the same as a tree with directories: archives would become just a special type of directory-like container for more files.

We could expose an os.walk-like function that would transparently extract archives to a temp directory and yield both the real path and the temp location of each file.

@pombredanne commented Nov 25, 2016

I need some input on how transparent extraction should behave: how to report paths to files inside extracted archives, and whether to extract outside of the scanned tree or in place within it.

Currently, extractcode needs to be launched separately prior to a scan, and each archive is extracted inside the scanned tree, in a new directory created side by side with its archive and named after the archive file name with an -extract suffix.
The scan then reports both the archive path and, separately, the paths to the -extract directory and its contained files.

With transparent archive extraction, we could either extract in place or not:

  1. keep the same approach, where extraction is simply integrated in the scan and each archive is physically extracted in place, as when running extractcode

  2. during the scan, extract each archive to a temporary location instead (outside of the scanned tree).
    The extracted files would not exist in the scanned tree; they would live out of tree in temporary files that are no longer available once the scan completes, which may be either a good thing or a confusing one.

With transparent archive extraction, we could report either a real or a "virtual" path to an extracted file in the scan results:

  1. Report path/to/archive.zip-extract/some/dir/and/file and also report path/to/archive.zip
  2. Report path/to/archive.zip/some/dir/and/file without an -extract suffix, as if the archive were a directory-like structure, and also report path/to/archive.zip. Here the path would be "virtual"

And if extraction is not done in place, then the paths would always be somewhat virtual, since they are absent from the scanned tree.

So this is about whether to extract in place or not, and which paths to report in the scan results (see the path-mapping sketch below).

@akintayo @chinyeungli @MaJuRG @nakami @rakeshbalusa @sschuberth @yahalom5776 Each of you has been involved with issues related to archive extraction. What would be your take and preference? What could be alternative ways?

Thanks for your input!

@sschuberth

I believe extracting out-of-tree is the better / safer approach, simply because you don't have to worry about whether the directory you plan to extract to already exists. Also, it gives a cleaner separation between the "primary" source code / files, and files coming from the archives.

Consequently, regarding the reporting I do like path/to/archive.zip/some/dir/and/file better, because with out-of-tree extraction there is simply no need to append an -extract suffix anymore.

But speaking of in-place vs. out-of-tree extraction I wonder whether creating files is necessary at all. Why not simply stream the files from the ZIP and directly pass their contents to the scanning engine? That would probably increase performance by reducing file I/O, and also get rid of the need to delete the temporarily extracted files afterwards.

@pombredanne

@sschuberth your idea to stream-read archives is intriguing. It can actually be done not only on zips but also on most archives handled by libarchive or in Python code. It would not work, though, for those handled by 7zip, I think. I will need to weigh the benefits against the code simplicity.

@akaihola

Some notes from the point of view of our use case:

  • our packages are on a network drive
  • currently extractcode can only extract next to the archives
  • it doesn't make sense to extract onto the network drive for obvious reasons
    • maybe add an option for the destination root path?
  • would prefer to not read and extract hundreds of packages from a network drive every time before scanning
  • optimally, results would be cached
    • no extraction nor scanning would be done if results matching the package name+version and scancode version are already in the cache

What I'm using currently is our own wrapper script which extracts and runs scancode for each package separately, and stores results in per-package JSON files on a local drive. It also skips package files for which a JSON file already exists.

@pombredanne

@akaihola Thank you for the input. These are all valid points!
I would be interested to see your script if you can share it.

@ashutoshsaboo

Hi @pombredanne . I have started working on this. Kindly look at this PR - #544 . Thanks!

@pombredanne

@ashutoshsaboo I have some trouble understanding where you are going with #544. It would make sense to first lay out your approach in prose here.

The principle is overall simple:

  1. a scan's main input is a resource iterator that yields files
  2. we need another iterator positioned after that one, which will:
    2.1 receive an iterator of resources (not create a new one)
    2.2 if a resource is extractible, extract it to a temp dir, then iterate over and yield the extracted files

There are other details of course, such as dealing with paths: the iterator would need to return both the real path we want to report and the "internal" extracted location where the file lives for scanners to process it. Eventually, creating a small File object may be a clean design (see the sketch below). And maybe some extra file info-level data is needed to indicate that a file is extracted and not a plain file.
This also needs a lot of tests.
Check also pyfilesystem for a related filesystem abstraction.

@ashutoshsaboo

@pombredanne Hi, I have replied to this on my PR thread, in the last comment - #544 . It would be nice to have your input on it. 😄

@pombredanne pombredanne mentioned this issue Mar 10, 2017
@pombredanne pombredanne removed this from the v2.1 milestone Oct 4, 2017
pombredanne added a commit that referenced this issue Jan 17, 2018
 * Add Codebase abstraction as an in-memory tree of Resource objects
 * Codebase and Resources can be walked, queried, added and removed as
   needed, top-down and bottom-up, with sorted children.
 * Root Resource can now have/has scans and info for #543
 * Codebase Resources have correct counts of children for #607 and #598
 * Files can also have children (this is in preparation for transparent
   archive extraction/walking for #14)
 * Initial inventory collection is based on walking the file system once.
   All other accesses are through the Codebase object.
 * Resources hold a scans mapping and have file info directly attached as
   attributes.
 * To support simple serialization of Resources, these do not hold
   references to their parent and children: instead they hold numeric
   ids, including a Codebase id that can be accessed through a global
   cache, which is a poor man's weak references implementation.
 * Remove and fold caching into resource.py at the Resource level. Each
   resource can put_scans and get_scans, either using the on-disk cache
   or just attached in memory to the resource object.
 * Add minimal resource cache tests

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne

Since #885 this will be possible through a pre-scan plugin that dives into archives.

@pombredanne

So after a long time, I think that what we really want is not exactly transparent extraction of archives, but rather smart and selective extraction where relevant, in the context of specific archives and scans.
For instance, for a JAR seen during a package scan, we need to selectively process a few manifests, not the whole JAR at all times, unless this is a source JAR. So blanket extraction of everything is likely not the best option.

@pombredanne pombredanne removed this from the v3.3 milestone Sep 24, 2021
pombredanne pushed a commit that referenced this issue Jan 12, 2022
Merge changes from develop to main
AyanSinhaMahapatra pushed a commit that referenced this issue Apr 4, 2023
Add support for gems and improve RPM support