provide method for extracting specific files from an archive #58

michaelfecher · 2020-01-27T10:57:49Z

Is your feature request related to a problem? Please describe.
For a serverless batch process, I have large 7z files floating around (2.5 GB+), which I don't want to extract completely (30 GB+).
Otherwise the serverless approach would be very expensive.

Describe the solution you'd like
Instead I want to selectively extract files.
The method's argument could be a list of strings that is matched against the archive's file list, so that only the matches are extracted.
This would enable a leaner solution and focus on the files I am interested in.

Describe alternatives you've considered
There are currently no alternatives: I already tried provisioning the 7z binary by copying the binaries and libraries. :)
That always ends in a permission denied error, regardless of what I do.
The support also couldn't assist here.

miurahr · 2020-01-27T14:37:33Z

py7zr was designed to support this in the future, but the current implementation does not have the function.

The main extraction function is built from five code blocks.

1. Set up the lists of directories, files and symlinks from the header data, together with the output path.

py7zr/py7zr/py7zr.py

Line 718 in 6ba709c

    for f in self.files:

py7zr/py7zr/py7zr.py

Lines 753 to 754 in 6ba709c

    self.worker.register_filelike(f.id, outfilename)
    target_files.append((outfilename, f.file_properties()))

2. Make the directories.

py7zr/py7zr/py7zr.py

Lines 755 to 757 in 6ba709c

    for target_dir in sorted(target_dirs):
        try:
            target_dir.mkdir()

3. Decompress the archive file, walking through the lists. If a path is None, nothing is written for it.

py7zr/py7zr/py7zr.py

Line 766 in 6ba709c

    self.worker.extract(self.fp, multithread=multi_thread)

4. Create the symbolic links.

py7zr/py7zr/py7zr.py

Line 777 in 6ba709c

    sym_dst.symlink_to(sym_src)

5. Set the metadata of files and directories, such as creation time, permissions etc.

py7zr/py7zr/py7zr.py

Lines 791 to 792 in 6ba709c

    for o, p in target_files:
        self._set_file_property(o, p)

Currently step 1 processes all items. It calls 'self.worker.register_filelike()' for each target item. When it is called with 'None' as the target path, the decompress function skips that item.
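
The skip mechanism described here can be illustrated with a small standalone sketch. It is not py7zr's code: `NullSink` and `route_output` are hypothetical names, and in-memory buffers stand in for output files. Entries whose names were never registered fall through to a sink that simply discards the data.

```python
import io


class NullSink:
    """Accepts writes and throws the data away (stand-in for a skipped file)."""

    def write(self, data: bytes) -> int:
        return len(data)


def route_output(entries, wanted):
    """entries: list of (name, payload); wanted: set of names to keep."""
    # Register a real buffer only for the names the caller asked for.
    sinks = {name: io.BytesIO() for name, _ in entries if name in wanted}
    for name, payload in entries:
        sink = sinks.get(name, NullSink())  # unregistered names are discarded
        sink.write(payload)
    return {name: buf.getvalue() for name, buf in sinks.items()}


entries = [("setup.py", b"print('hi')"), ("scripts/py7zr", b"#!/usr/bin/env python")]
kept = route_output(entries, {"scripts/py7zr"})
```

Every entry is still visited, which is why this kind of filtering saves writes but not the work of walking the whole stream.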

miurahr · 2020-01-27T14:59:01Z

A branch, https://github.com/miurahr/py7zr/commits/topic-extraction-filter, tries to implement it with a dirty hack.

Please see the test case to learn how to specify files.

py7zr/tests/test_basic.py

Lines 372 to 383 in c3f580e

    @pytest.mark.api
    def test_py7zr_extract_specified_file(tmp_path):
        archive = py7zr.SevenZipFile(open(os.path.join(testdata_path, 'test_1.7z'), 'rb'))
        expected = [{'filename': 'scripts/py7zr', 'mode': 33261, 'mtime': 1552522208,
                     'digest': 'b0385e71d6a07eb692f5fb9798e9d33aaf87be7dfff936fd2473eab2a593d4fd'}
                    ]
        archive.extract(path=tmp_path, targets=['scripts', 'scripts/py7zr'])
        assert tmp_path.joinpath('scripts').is_dir()
        assert tmp_path.joinpath('scripts/py7zr').exists()
        assert not tmp_path.joinpath('setup.cfg').exists()
        assert not tmp_path.joinpath('setup.py').exists()
        check_output(expected, tmp_path)

This test case extracts two items: the 'scripts' directory and the 'scripts/py7zr' file. The other files, 'setup.py' and 'setup.cfg', are skipped.

Any feedback?

michaelfecher · 2020-01-27T15:20:26Z

I had nearly implemented a similar solution,
because I wasn't aware that you would respond that fast 😄

# specific_file_list: List[str] as an argument to the function
file_list_pattern = '|'.join('(?:{0})'.format(x) for x in specific_file_list)
file_pattern = re.compile(file_list_pattern)
# in for loop, straight after iteration
if file_pattern.match(f.filename) is None:
    continue

I just wasn't sure whether it was the intended solution, because I had only taken a brief look at the master code.
I was also misled by the code, because I thought all the worker stuff would need to be adapted as well.

Not sure why you find your solution "dirty"?
I can only assume that you called it "dirty" because you packed everything together in one function...

miurahr · 2020-01-27T22:06:16Z

Your code does almost the same as what I tried.
The checks if re.match('file1 | file2 | ..', f.filename) and if f.filename in file_list produce the same result.

py7zr/py7zr/py7zr.py

Lines 737 to 739 in c3f580e

    if targets is not None and f.filename not in targets:
        self.worker.register_filelike(f.id, None)
        continue
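
One caveat may be worth checking before relying on the two filters being interchangeable: `re.match` only anchors at the start of the string, and an unescaped `.` matches any character, so a joined pattern can accept more names than a plain membership test. The sketch below uses illustrative file names, and escaping plus `re.fullmatch` is a suggestion rather than what either branch does.

```python
import re

names = ["FUBAR/65155/feedback.txt", "FUBAR/65155/feedback.txt.1", "setup.py"]
wanted = ["FUBAR/65155/feedback.txt"]

# Unescaped alternation with re.match: prefix match, '.' is a wildcard.
loose = re.compile("|".join("(?:{0})".format(w) for w in wanted))
loose_hits = [n for n in names if loose.match(n)]

# Escaped alternation with fullmatch: behaves like exact-name membership.
strict = re.compile("|".join(re.escape(w) for w in wanted))
strict_hits = [n for n in names if strict.fullmatch(n)]

# Plain membership test, as in the branch.
member_hits = [n for n in names if n in wanted]
```

Here `loose_hits` also contains `feedback.txt.1`, while `strict_hits` and `member_hits` contain only the requested name.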

There are several design considerations for the API.

1. py7zr has a method getnames() which returns all archived files as a list. It would be better if the value of getnames() could be used as an argument.
2. It is necessary to split the extract function into several internal functions for better maintenance.
3. When the user specifies files under directories, py7zr should create these directories before extracting the files. If the user does not specify the parent directory, the extract() call fails.
4. My hack does not recognize the difference between path separator characters ('/' or '\').
5. Is it better to accept a regular expression as the argument?
6. The extraction core currently runs through all of the archive data. When skipping files, it just ignores the decompressed data (fileish becomes NullHandler). This can reduce I/O but cannot reduce CPU time.

py7zr/py7zr/compression.py

Lines 263 to 274 in c3f580e

    def extract_single(self, fp: BinaryIO, files, src_start: int) -> None:
        """Single thread extractor that takes file lists in single 7zip folder."""
        fp.seek(src_start)
        for f in files:
            fileish = self.target_filepath.get(f.id, NullHandler())  # type: Handler
            fileish.open()
            # Skip empty file read
            if f.emptystream:
                fileish.write(b'')
            else:
                self.decompress(fp, f.folder, fileish, f.uncompressed[-1], f.compressed)
            fileish.close()
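
The last consideration (less I/O, same CPU time) can be illustrated with Python's standard `lzma` module rather than py7zr itself. This is only a sketch under simplifying assumptions: three made-up "files" are concatenated into one compressed stream, and `extract_from_solid` is a hypothetical helper. Even when only the last member is wanted, the whole stream has to be decoded.

```python
import lzma

# Three illustrative "files" packed back to back into one solid LZMA stream.
members = {"a.txt": b"A" * 1000, "b.txt": b"B" * 2000, "c.txt": b"C" * 500}
solid = lzma.compress(b"".join(members.values()))


def extract_from_solid(blob, sizes, wanted):
    """Decode the stream in order and keep only the wanted members."""
    data = lzma.LZMADecompressor().decompress(blob)  # the whole stream is decoded
    out, offset = {}, 0
    for name, size in sizes:
        chunk = data[offset:offset + size]
        offset += size
        if name in wanted:
            out[name] = chunk  # only this part would be written to disk
    return out, len(data)


sizes = [(name, len(payload)) for name, payload in members.items()]
result, decoded_bytes = extract_from_solid(solid, sizes, {"c.txt"})
```

`decoded_bytes` covers all three members even though `result` holds only `c.txt`.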

miurahr · 2020-01-27T22:41:35Z

Here is my idea of how to utilize getnames():

py7zr/tests/test_basic.py

Lines 391 to 397 in e51772a

    allfiles = archive.getnames()
    filter_pattern = re.compile(r'scripts.*')
    targets = []
    for f in allfiles:
        if filter_pattern.match(f):
            targets.append(f)
    archive.extract(path=tmp_path, targets=targets)

If it is convenient, I'd like to add a new method such as extract_re(path=<outdir>, filter=<regex>).
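
As a rough sketch of what such a regex-based filter might boil down to, the selection step can be factored into a small function. `select_targets` is a hypothetical name, not part of py7zr, and the name list below is a stand-in for what getnames() would return.

```python
import re


def select_targets(names, pattern):
    """Return the member names that the regex matches from the start."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.match(name)]


# Stand-in for archive.getnames()
names = ["scripts", "scripts/py7zr", "setup.cfg", "setup.py"]
targets = select_targets(names, r"scripts.*")
# With a real archive: archive.extract(path=outdir, targets=targets)
```

Keeping the selection separate from extraction leaves extract() with a plain list of names, so a list built any other way works just as well.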

michaelfecher · 2020-01-28T12:17:12Z

When trying to run with your changes, I get an error:

  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/py7zr.py", line 772, in extract
    self.worker.extract(self.fp, multithread=multi_thread)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 261, in extract
    self.extract_single(fp, self.files, self.src_start)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 268, in extract_single
    fileish.open()
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 111, in open
    self.fp = self.target.open(mode=mode)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1203, in open
    opener=self._opener)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1058, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'FUBAR/65155/feedback.txt'

I called the extract method like this:

zip_location = '/home/mf/code/py7zr/BIG_ARCHIVE_FILE.7z'
archive = py7zr.SevenZipFile(zip_location, mode='r')
# targets are in the archive!
targets = ['FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']
archive.extract(targets=targets)

An error also occurs if I provide an output path:

  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/py7zr.py", line 772, in extract
    self.worker.extract(self.fp, multithread=multi_thread)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 261, in extract
    self.extract_single(fp, self.files, self.src_start)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 268, in extract_single
    fileish.open()
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 111, in open
    self.fp = self.target.open(mode=mode)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1203, in open
    opener=self._opener)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1058, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/logs/FUBAR/65155/feedback.txt'

/tmp/logs/ was provided as the path argument.

miurahr · 2020-01-29T12:31:09Z

That is what I mentioned in point 3:

3. When the user specifies files under directories, py7zr should create these directories before extracting the files. If the user does not specify the parent directory, the extract() call fails.

michaelfecher · 2020-01-29T15:26:05Z

To ask very specifically, because the second sentence confuses me:
would the workaround be to create the dirs for targets and path in the client code before calling extract()?

miurahr · 2020-01-29T21:56:46Z

It means you should call it with the parent directories included in targets:

- targets = ['FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']
+ targets = ['FUBAR', 'FUBAR/65155', 'FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']

miurahr · 2020-01-29T22:03:14Z

Please see

py7zr/tests/test_basic.py

Lines 374 to 379 in e51772a

    def test_py7zr_extract_specified_file(tmp_path):
        archive = py7zr.SevenZipFile(open(os.path.join(testdata_path, 'test_1.7z'), 'rb'))
        expected = [{'filename': 'scripts/py7zr', 'mode': 33261, 'mtime': 1552522208,
                     'digest': 'b0385e71d6a07eb692f5fb9798e9d33aaf87be7dfff936fd2473eab2a593d4fd'}
                    ]
        archive.extract(path=tmp_path, targets=['scripts', 'scripts/py7zr'])

That is, the call is not archive.extract(path=tmp_path, targets=['scripts/py7zr']) but archive.extract(path=tmp_path, targets=['scripts', 'scripts/py7zr']).
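
Adding the parent directories by hand is easy to get wrong, so client code could derive them. The following is a minimal sketch, assuming archive names use '/' as the separator; `with_parent_dirs` is a hypothetical helper, not a py7zr function.

```python
from pathlib import PurePosixPath


def with_parent_dirs(files):
    """Prepend every ancestor directory of each file, shallowest first."""
    dirs = set()
    for name in files:
        for parent in PurePosixPath(name).parents:
            if str(parent) != ".":  # skip the implicit current directory
                dirs.add(str(parent))
    # Sort by depth so a directory always precedes its children.
    ordered_dirs = sorted(dirs, key=lambda d: (d.count("/"), d))
    return ordered_dirs + [f for f in files if f not in dirs]


targets = with_parent_dirs(["FUBAR/65155/feedback.txt.1", "FUBAR/65155/feedback.txt"])
```

The resulting list starts with 'FUBAR' and 'FUBAR/65155', followed by the two files in their original order.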

michaelfecher · 2020-01-30T13:02:47Z

Thanks for the hint!
Adapted my code accordingly.
Unfortunately, I'm hitting another issue now :/

I open the 7z file BEFORE a loop.
Inside the loop, I call the extract function to extract the corresponding files.
The first iteration is fine; everything behaves as it should.
Unfortunately, in the 2nd iteration an error occurs in the extract method:

  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/py7zr.py", line 772, in extract
    self.worker.extract(self.fp, multithread=multi_thread)
  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/compression.py", line 261, in extract
    self.extract_single(fp, self.files, self.src_start)
  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/compression.py", line 273, in extract_single
    self.decompress(fp, f.folder, fileish, f.uncompressed[-1], f.compressed)
  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/compression.py", line 301, in decompress
    assert out_remaining == 0
AssertionError

My code for the extraction looks like this:

from pathlib import Path

import py7zr

# zip_location and fubar are defined elsewhere
archive = py7zr.SevenZipFile(open(zip_location, 'rb'))
for not_relevant, files_to_extract_list in fubar.items():
    # collect every parent directory of the requested files
    unique_dirs = {str(p) for b in files_to_extract_list
                   for p in Path(b).parents
                   if str(p) != '.'}
    sorted_unique_dirs = sorted(unique_dirs, key=len)
    all_dirs_and_files = [*sorted_unique_dirs, *files_to_extract_list]
    archive.extract(path='/tmp/logs',
                    targets=all_dirs_and_files)

Value of all_dirs_and_files per iteration:

1st iteration:
['FUBAR', 'FUBAR/65155', 'FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']

2nd iteration:
['FUBAR', 'FUBAR/65268', 'FUBAR/65268/feedback.txt.5', 'FUBAR/65268/feedback.txt.4', 'FUBAR/65268/feedback.txt.3', 'FUBAR/65268/feedback.txt.2', 'FUBAR/65268/feedback.txt.1', 'FUBAR/65268/feedback.txt']

michaelfecher · 2020-01-30T13:07:41Z

strange...
when I move

archive = py7zr.SevenZipFile(open(zip_location, 'rb'))

into the loop, it works.
Is that intended?
I'm asking because I'm used to opening the file once, doing the work, and then closing the file or relying on auto-closing à la with open(...).
I don't know the details of the implementation, but won't there be an issue with the number of file handles?
Still, I'm super happy that it works now 👍
Big thanks for your support so far!!
I will check how it behaves with the big 7z files ;)

miurahr · 2020-02-03T22:41:23Z

The 7-zip format basically uses a 'solid' archive, in which all files are compressed into a single stream.
When extracting data from the stream, the decompressor has to read the data from the beginning, even if the target data is placed at the end of the stream.

Both extract() and extractall() have to read and process all of the archive data, even if it is several gigabytes.
extract() reads all the data, discards what was not requested, and writes only the specified chunks to files.
extractall() reads all the data and writes every chunk to its target file.

After you call extract(), the internal file pointer is positioned at the end of the data.
We could seek the file pointer back to the start of the data on each iteration, but that is quite inefficient.

You want to process a large archive (30 GB) in a loop: with two iterations you read 30 GB x 2 = 60 GB from disk. With ten iterations you read 300 GB from disk!

By its nature, the solid 7-zip format does not support random access; it is optimized for compression ratio.

Users are recommended to construct the list of files to extract first (you can use a loop there) and then call extract() only once.
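
A minimal sketch of that recommendation: collect the per-group target lists into one de-duplicated list and pass it to a single extract() call. `merge_targets` and the `groups` data are hypothetical and only mirror the shape of the lists shown earlier in this thread.

```python
def merge_targets(groups):
    """Flatten per-group target lists into one list, dropping repeats but keeping order."""
    seen = set()
    merged = []
    for targets in groups.values():
        for name in targets:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


groups = {
    "65155": ["FUBAR", "FUBAR/65155", "FUBAR/65155/feedback.txt"],
    "65268": ["FUBAR", "FUBAR/65268", "FUBAR/65268/feedback.txt"],
}
all_targets = merge_targets(groups)
# Single pass over the archive:
# archive.extract(path='/tmp/logs', targets=all_targets)
```

With this shape the archive is read once, no matter how many groups contribute names.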

miurahr · 2020-02-04T01:44:20Z

Thanks @michaelfecher for testing.
PR #64 now provides extraction of specific files and supports iterating.
See

py7zr/tests/test_basic.py

Lines 405 to 412 in ae9e76a

    @pytest.mark.api
    def test_py7zr_extract_and_reset_iteration(tmp_path):
        archive = py7zr.SevenZipFile(open(os.path.join(testdata_path, 'test_1.7z'), 'rb'))
        iterations = archive.getnames()
        for target in iterations:
            archive.extract(path=tmp_path, targets=[target])
            archive.reset()
        archive.close()

miurahr added the labels enhancement (New feature or request), for extraction (Issue on extraction, decompression or decryption) and help wanted (Extra attention is needed) on Jan 27, 2020

miurahr removed the help wanted (Extra attention is needed) label on Jan 27, 2020

miurahr mentioned this issue Feb 4, 2020

Support filtering a target of extracted files from archive #64

Merged

miurahr closed this as completed in #64 Feb 4, 2020
