
Transparent extraction of archives #14

Open
pombredanne opened this issue Jul 5, 2015 · 10 comments

@pombredanne

As noted in #3, we do not extract and scan at the same time.

A better way would be to handle an archive internally as if it were a special type of directory (both contain files, after all). When a single archive scan is requested (or when archives are found in a larger scan), we could extract these to a temporary directory, scan the extracted content and return the results. This would require a bit more thinking to get right.
At a high level, a tree with archives would be treated the same as a tree with directories: archives would become just a special type of directory-like container for more files.

We could expose an os.walk-like function that would transparently extract archives to a temp directory and yield both the real path and the temp location of each file.

@pombredanne commented Nov 25, 2016

I need some input on how transparent extraction should behave: how to report paths to files inside extracted archives, and whether to extract outside of the scanned tree or in place within it.

Currently, extractcode needs to be launched separately prior to a scan, and each archive is extracted inside the scanned tree, in a new directory created side by side with its archive and named after the archive file name with an -extract suffix.
The scan then reports both the archive path and, separately, the paths to the -extract directory and its contained files.

With transparent archive extraction, we could either extract in place or not:

  1. keep the same approach, where extraction is simply integrated in the scan and each archive is physically extracted in place, as when running extractcode

  2. during the scan, extract each archive to a temporary location instead (outside of the scanned tree).
    The extracted files would not exist in the scanned tree; they would live out of tree in temporary files that are no longer available once the scan completes, which may be either a good thing or a confusing one.

With transparent archive extraction, we could report either a real or a "virtual" path to an extracted file in the scan results:

  1. Report path/to/archive.zip-extract/some/dir/and/file and also report path/to/archive.zip
  2. Report path/to/archive.zip/some/dir/and/file without an -extract suffix, as if the archive were a directory-like structure, and also report path/to/archive.zip. Here the path would be "virtual"

And if extraction is not done in place, then the paths would always be somewhat virtual, since they are absent from the scanned tree.

So this is about whether to extract in place or not, and which paths to report in the scan results (see the path-mapping sketch below).

@akintayo @chinyeungli @MaJuRG @nakami @rakeshbalusa @sschuberth @yahalom5776 Each of you has been involved with issues related to archive extraction. What would be your take and preference? What could be alternative ways?

Thanks for your input!

@sschuberth

I believe extracting out-of-tree is the better / safer approach, simply because you don't have to worry about whether the directory you plan to extract to already exists. Also, it gives a cleaner separation between the "primary" source code / files, and files coming from the archives.

Consequently, regarding the reporting I do like path/to/archive.zip/some/dir/and/file better, because with out-of-tree extraction there is simply no need to append an -extract suffix anymore.

But speaking of in-place vs. out-of-tree extraction I wonder whether creating files is necessary at all. Why not simply stream the files from the ZIP and directly pass their contents to the scanning engine? That would probably increase performance by reducing file I/O, and also get rid of the need to delete the temporarily extracted files afterwards.

@pombredanne

@sschuberth your idea to stream-read archives is intriguing. It can actually be done not only on zips but also on most archives handled by libarchive or in Python code. It would not work, though, for those handled by 7zip, I think. I will need to weigh the benefits against the code simplicity.

@akaihola

Some notes from the point of view of our use case:

  • our packages are on a network drive
  • currently extractcode can only extract next to the archives
  • it doesn't make sense to extract onto the network drive for obvious reasons
    • maybe add an option for the destination root path?
  • would prefer to not read and extract hundreds of packages from a network drive every time before scanning
  • optimally, results would be cached
    • no extraction nor scanning would be done if results matching the package name+version and scancode version are already in the cache

What I'm using currently is our own wrapper script which extracts and runs scancode for each package separately, and stores results in per-package JSON files on a local drive. It also skips package files for which a JSON file already exists.

@pombredanne

@akaihola Thank you for the input. These are all valid points!
I would be interested to see your script if you can share it.

@ashutoshsaboo

Hi @pombredanne . I have started working on this. Kindly look at this PR - #544 . Thanks!

@pombredanne

@ashutoshsaboo I have some trouble understanding where you are going with #544. It would make sense to first lay out your approach in prose here.

The principle is overall simple:

  1. a scan's main input is a resource iterator that yields files
  2. we need another iterator positioned after that one, which will:
    2.1 receive an iterator of resources (not create a new one)
    2.2 if a resource is extractible, extract it to a temp dir, then iterate over and yield the extracted files

There are other details of course, such as dealing with paths: the iterator would need to return both the real path we want to report and the "internal" extracted location where the file lives for scanners to process it. Eventually, creating a small File object may be a clean design (see the sketch below). And maybe some extra file info-level data is needed to indicate that a file is extracted and not a plain file.
This also needs a lot of tests.
Check also pyfilesystem for a related filesystem abstraction.

@ashutoshsaboo

@pombredanne Hi, I have replied to this on my PR thread, in the last comment - #544 . It would be nice to have your input on it. 😄

@pombredanne pombredanne mentioned this issue Mar 10, 2017
@pombredanne pombredanne removed this from the v2.1 milestone Oct 4, 2017
pombredanne added a commit that referenced this issue Jan 17, 2018
 * Add Codebase abstraction as an in-memory tree of Resource objects
 * Codebase and Resources can be walked, queried, added and removed as
   needed, top-down and bottom-up, with sorted children.
 * Root Resource can now have/has scans and info for #543
 * Codebase Resources have correct counts of children for #607 and #598
 * Files can also have children (this is in preparation for transparent
   archive extraction/walking for #14)
 * Initial inventory collection is based on walking the file system once.
   All other accesses are through the Codebase object.
 * Resources hold a scans mapping and have file info directly attached as
   attributes.
 * To support simple serialization of Resources, these do not hold
   references to their parent and children: instead they hold numeric
   ids, including a Codebase id that can be accessed through a global
   cache, which is a poor man's weak references implementation.
 * Remove and fold caching into resource.py at the Resource level. Each
   resource can put_scans and get_scans, either using the on-disk cache
   or just attached in memory to the resource object.
 * Add minimal resource cache tests

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne

Since #885 this will be possible through a pre-scan plugin that dives into archives.

@pombredanne

So after a long time, I think that what we really want is not exactly transparent extraction of archives, but rather smart and selective extraction where relevant, in the context of specific archives and scans.
For instance, for a JAR seen during a package scan, we need to selectively process a few manifests, not the whole JAR at all times, unless this is a source JAR. So blanket extraction of everything is likely not the best option.

@pombredanne pombredanne removed this from the v3.3 milestone Sep 24, 2021
pombredanne pushed a commit that referenced this issue Jan 12, 2022
Merge changes from develop to main
AyanSinhaMahapatra pushed a commit that referenced this issue Apr 4, 2023
Add support for gems and improve RPM support