provide method for extracting specific files from an archive #58

Closed
michaelfecher opened this issue Jan 27, 2020 · 16 comments · Fixed by #64
Labels
enhancement (New feature or request), for extraction (Issue on extraction, decompression or decryption)

Comments

@michaelfecher

Is your feature request related to a problem? Please describe.
For a serverless batch process, I have big 7z files floating around (2.5 GB+), which I don't want to extract completely (30 GB+).
Otherwise the serverless approach would be very expensive.

Describe the solution you'd like
Instead I want to selectively extract files.
The method could take a list of strings as its argument; these would be matched against the archive's file list, and only the matching files would be extracted.
This would enable a leaner solution and focus on the files I am interested in.

Describe alternatives you've considered
There are currently no alternatives: I already tried provisioning the 7z binary by copying the binaries and libraries. :)
That always ends in a permission denied error, regardless of what I do.
The support also couldn't assist here.

@miurahr miurahr added enhancement New feature or request for extraction Issue on extraction, decompression or decryption help wanted Extra attention is needed labels Jan 27, 2020
@miurahr
Owner

miurahr commented Jan 27, 2020

py7zr was designed to support this in the future, but the current implementation does not have the function.

The main extraction function is built from five code blocks.

  1. Set up the lists of directories, files and symlinks from the header data, combined with the output path.

    for f in self.files:

    py7zr/py7zr/py7zr.py, lines 753 to 754 in 6ba709c:

    self.worker.register_filelike(f.id, outfilename)
    target_files.append((outfilename, f.file_properties()))

  2. Make the directories.

    py7zr/py7zr/py7zr.py, lines 755 to 757 in 6ba709c:

    for target_dir in sorted(target_dirs):
        try:
            target_dir.mkdir()

  3. Decompress the archive file, walking through the lists. If a path is None, nothing is written.

    self.worker.extract(self.fp, multithread=multi_thread)

  4. Create the symbolic links.

    sym_dst.symlink_to(sym_src)

  5. Set the metadata of files and directories, such as creation time and permissions.

    py7zr/py7zr/py7zr.py, lines 791 to 792 in 6ba709c:

    for o, p in target_files:
        self._set_file_property(o, p)

Currently step 1 processes every item: it calls self.worker.register_filelike() for each target item. When it is called with None as the target path, the decompress function skips that item.
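
The skip mechanism described above can be illustrated with a toy model. This is only a sketch and not py7zr's actual code: ToyWorker, its extract() signature and the in-memory "archive" are hypothetical stand-ins. Only the idea of registering None for unwanted entries is taken from the description above.

```python
from typing import Dict, List, Optional, Tuple


class ToyWorker:
    """Hypothetical stand-in for the worker: maps entry ids to output names."""

    def __init__(self) -> None:
        self.targets: Dict[int, Optional[str]] = {}

    def register_filelike(self, file_id: int, outname: Optional[str]) -> None:
        # None means "visit this entry but do not write it anywhere".
        self.targets[file_id] = outname

    def extract(self, entries: List[Tuple[int, bytes]]) -> Dict[str, bytes]:
        written: Dict[str, bytes] = {}
        for file_id, data in entries:  # every entry is still visited in order
            outname = self.targets.get(file_id)
            if outname is None:
                continue  # skipped: nothing is written for this entry
            written[outname] = data
        return written


def extract_selected(names: List[str], payloads: List[bytes],
                     wanted: Optional[List[str]] = None) -> Dict[str, bytes]:
    """Register every entry, passing None for those not in `wanted`."""
    worker = ToyWorker()
    for file_id, name in enumerate(names):
        keep = wanted is None or name in wanted
        worker.register_filelike(file_id, name if keep else None)
    return worker.extract(list(enumerate(payloads)))
```

Calling extract_selected(['a.txt', 'b.txt'], [b'A', b'B'], wanted=['b.txt']) in this model writes only 'b.txt', while both entries are still walked.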

@miurahr
Owner

miurahr commented Jan 27, 2020

The branch https://github.com/miurahr/py7zr/commits/topic-extraction-filter tries to implement this with a dirty hack.

Please see the test case to learn how to specify the files.

py7zr/tests/test_basic.py

Lines 372 to 383 in c3f580e

@pytest.mark.api
def test_py7zr_extract_specified_file(tmp_path):
    archive = py7zr.SevenZipFile(open(os.path.join(testdata_path, 'test_1.7z'), 'rb'))
    expected = [{'filename': 'scripts/py7zr', 'mode': 33261, 'mtime': 1552522208,
                 'digest': 'b0385e71d6a07eb692f5fb9798e9d33aaf87be7dfff936fd2473eab2a593d4fd'}
                ]
    archive.extract(path=tmp_path, targets=['scripts', 'scripts/py7zr'])
    assert tmp_path.joinpath('scripts').is_dir()
    assert tmp_path.joinpath('scripts/py7zr').exists()
    assert not tmp_path.joinpath('setup.cfg').exists()
    assert not tmp_path.joinpath('setup.py').exists()
    check_output(expected, tmp_path)

There are two items to extract in this test case: the 'scripts' directory and the 'scripts/py7zr' file. The other files, 'setup.py' and 'setup.cfg', are skipped.

Any feedback?

@miurahr miurahr removed the help wanted Extra attention is needed label Jan 27, 2020
@michaelfecher
Author

michaelfecher commented Jan 27, 2020

I had nearly implemented a similar solution myself,
because I didn't expect you to respond that fast 😄

# specific_file_list: List[str] as an argument to the function
file_list_pattern = '|'.join('(?:{0})'.format(x) for x in specific_file_list)
file_pattern = re.compile(file_list_pattern)
# in for loop, straight after iteration
if file_pattern.match(f.filename) is None:
    continue

I just wasn't sure whether it's the intended solution, because I only had a brief look at the master code.
And the code misled me, because I thought all the worker code would need to be adapted as well.

Not sure why you find your solution "dirty"?
I can only assume that you called it "dirty" because you packed everything together in one function...

@miurahr
Owner

miurahr commented Jan 27, 2020

Your code does almost the same as what I tried.
For exact file names, if re.match('file1|file2|..', f.filename) and if f.filename in file_list produce the same result.

py7zr/py7zr/py7zr.py

Lines 737 to 739 in c3f580e

if targets is not None and f.filename not in targets:
    self.worker.register_filelike(f.id, None)
    continue
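
One caveat when comparing the two checks: re.match anchors only at the start of the string, so an alternation of file names also accepts any entry that merely begins with one of them, and an unescaped dot matches any character. The following is a minimal sketch with made-up names. Using re.escape together with fullmatch is one possible way to make the regex behave like the membership test, not something taken from the branch.

```python
import re

wanted = ['logs/feedback.txt']
archive_names = ['logs/feedback.txt', 'logs/feedback.txt.1', 'logs/other.txt']

# Pattern built from raw names (no escaping) and applied with match().
loose = re.compile('|'.join('(?:{0})'.format(x) for x in wanted))
loose_hits = [n for n in archive_names if loose.match(n)]

# Escaped names matched against the whole string behave like "n in wanted".
strict = re.compile('|'.join(re.escape(x) for x in wanted))
strict_hits = [n for n in archive_names if strict.fullmatch(n)]

member_hits = [n for n in archive_names if n in wanted]

print(loose_hits)   # ['logs/feedback.txt', 'logs/feedback.txt.1']
print(strict_hits)  # ['logs/feedback.txt']
```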

There are several design considerations for the API.

  1. py7zr has a method getnames() which returns all archived file names as a list. It would be better if the values returned by getnames() could be used as the argument.

  2. The extract function needs to be split into several internal functions for better maintainability.

  3. When the user specifies files under directories, py7zr should create those directories before extracting the files. Currently, if the user does not also specify the parent directories, the extract() call fails.

  4. My hack does not handle the difference between path separator characters ('/' or '\').

  5. Would it be better to accept a regular expression as the argument?

  6. The extraction core currently runs through all of the archive data. Skipped files are still decompressed, and the decompressed data is simply discarded (fileish becomes a NullHandler). This reduces I/O but cannot reduce CPU time.

py7zr/py7zr/compression.py

Lines 263 to 274 in c3f580e

def extract_single(self, fp: BinaryIO, files, src_start: int) -> None:
    """Single thread extractor that takes file lists in single 7zip folder."""
    fp.seek(src_start)
    for f in files:
        fileish = self.target_filepath.get(f.id, NullHandler())  # type: Handler
        fileish.open()
        # Skip empty file read
        if f.emptystream:
            fileish.write(b'')
        else:
            self.decompress(fp, f.folder, fileish, f.uncompressed[-1], f.compressed)
        fileish.close()
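
Considerations 3 and 4 could be addressed on the caller side by normalising the target list before passing it to extract(). The helper below is a hypothetical sketch, not part of py7zr. It assumes that archive entry names are relative and use '/' as the separator.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def normalize_targets(targets: Iterable[str]) -> List[str]:
    """Return targets with '/' separators, each preceded by its parent directories."""
    result: List[str] = []
    seen = set()
    for raw in targets:
        # Treat a backslash as a separator (assumption: names are relative paths).
        path = PurePosixPath(raw.replace('\\', '/'))
        # parents yields nearest-first and ends with '.'; reverse for top-down order.
        parents = [p for p in reversed(path.parents) if str(p) != '.']
        for entry in [*parents, path]:
            name = str(entry)
            if name not in seen:
                seen.add(name)
                result.append(name)
    return result
```

For example, normalize_targets(['FUBAR/65155/feedback.txt']) yields 'FUBAR', then 'FUBAR/65155', then the file itself, so directories appear before the files they contain.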

@miurahr
Owner

miurahr commented Jan 27, 2020

Here is my idea of how to make use of getnames():

py7zr/tests/test_basic.py

Lines 391 to 397 in e51772a

allfiles = archive.getnames()
filter_pattern = re.compile(r'scripts.*')
targets = []
for f in allfiles:
    if filter_pattern.match(f):
        targets.append(f)
archive.extract(path=tmp_path, targets=targets)

If that would be convenient, I'd like to add a new method such as extract_re(path=<outdir>, filter=<regex>).
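
Until such a method exists, the same behaviour can be wrapped in a small helper on the caller side. This is a sketch only: extract_matching is a hypothetical name, and it relies solely on getnames() and extract(path=..., targets=...) as they are used in the test case above.

```python
import re
from typing import Any, List


def extract_matching(archive: Any, path: Any, pattern: str) -> List[str]:
    """Extract only entries whose name matches `pattern` from its start.

    `archive` is any object offering getnames() and extract(path=, targets=).
    Returns the list of names that were passed as targets.
    """
    regex = re.compile(pattern)
    targets = [name for name in archive.getnames() if regex.match(name)]
    archive.extract(path=path, targets=targets)
    return targets
```

Because re.match is used, the pattern r'scripts.*' selects both the 'scripts' directory and everything below it, which also covers the parent-directory requirement for that subtree.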

@michaelfecher
Author

michaelfecher commented Jan 28, 2020

When trying to run with your changes, I get an error:

  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/py7zr.py", line 772, in extract
    self.worker.extract(self.fp, multithread=multi_thread)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 261, in extract
    self.extract_single(fp, self.files, self.src_start)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 268, in extract_single
    fileish.open()
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 111, in open
    self.fp = self.target.open(mode=mode)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1203, in open
    opener=self._opener)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1058, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'FUBAR/65155/feedback.txt'

I called the extract method like this:

zip_location = '/home/mf/code/py7zr/BIG_ARCHIVE_FILE.7z'
archive = py7zr.SevenZipFile(zip_location, mode='r')
# targets are in the archive!
targets = ['FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']
archive.extract(targets=targets)

The same error also occurs if I provide an output path:

  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/py7zr.py", line 772, in extract
    self.worker.extract(self.fp, multithread=multi_thread)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 261, in extract
    self.extract_single(fp, self.files, self.src_start)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 268, in extract_single
    fileish.open()
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/site-packages/py7zr/compression.py", line 111, in open
    self.fp = self.target.open(mode=mode)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1203, in open
    opener=self._opener)
  File "/home/mf/miniconda3/envs/xds/lib/python3.7/pathlib.py", line 1058, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/logs/FUBAR/65155/feedback.txt'

/tmp/logs/ was provided as the path argument.
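
Both tracebacks end in pathlib's open(), which suggests that the parent directory of the output file did not exist when the file was opened for writing. The following minimal sketch reproduces that failure mode independently of py7zr, using a temporary directory and the relative path from the report:

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / 'FUBAR' / '65155' / 'feedback.txt'

    try:
        target.open('wb')
        failed = False
    except FileNotFoundError:
        failed = True  # the parent directories are missing

    # Creating the parents first makes the same open() succeed.
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open('wb') as fp:
        fp.write(b'ok')
    created = target.exists()

print(failed, created)
```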

@miurahr
Owner

miurahr commented Jan 29, 2020

This is what I mentioned in item 3:

3. When the user specifies files under directories, py7zr should create those directories before extracting the files. If the user does not also specify the parent directories, the extract() call fails.

@michaelfecher
Author

michaelfecher commented Jan 29, 2020

To ask very specifically, because the second sentence confuses me:
would the workaround be to create the directories for the targets and the path in the client code before calling extract()?

@miurahr
Owner

miurahr commented Jan 29, 2020

It means you should call it with the parent directories included in targets:

- targets = ['FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']
+ targets = ['FUBAR', 'FUBAR/65155', 'FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']

@miurahr
Owner

miurahr commented Jan 29, 2020

Please see

py7zr/tests/test_basic.py

Lines 374 to 379 in e51772a

def test_py7zr_extract_specified_file(tmp_path):
    archive = py7zr.SevenZipFile(open(os.path.join(testdata_path, 'test_1.7z'), 'rb'))
    expected = [{'filename': 'scripts/py7zr', 'mode': 33261, 'mtime': 1552522208,
                 'digest': 'b0385e71d6a07eb692f5fb9798e9d33aaf87be7dfff936fd2473eab2a593d4fd'}
                ]
    archive.extract(path=tmp_path, targets=['scripts', 'scripts/py7zr'])

Note that the call is not archive.extract(path=tmp_path, targets=['scripts/py7zr']) but archive.extract(path=tmp_path, targets=['scripts', 'scripts/py7zr']).

@michaelfecher
Author

michaelfecher commented Jan 30, 2020

Thanks for the hint!
I adapted my code accordingly.
Unfortunately, I'm hitting another issue now :/

I open the 7z file BEFORE a loop.
Inside the loop, I call the extract function to extract the corresponding files.
The first iteration is fine; everything behaves as it should.
Unfortunately, in the 2nd iteration an error occurs in the extract method:

  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/py7zr.py", line 772, in extract
    self.worker.extract(self.fp, multithread=multi_thread)
  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/compression.py", line 261, in extract
    self.extract_single(fp, self.files, self.src_start)
  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/compression.py", line 273, in extract_single
    self.decompress(fp, f.folder, fileish, f.uncompressed[-1], f.compressed)
  File "/home/mf/miniconda3/envs/xds/lib/python3.6/site-packages/py7zr/compression.py", line 301, in decompress
    assert out_remaining == 0
AssertionError

My code for the extraction looks like this:

from pathlib import Path

import py7zr

archive = py7zr.SevenZipFile(open(zip_location, 'rb'))
for not_relevant, files_to_extract_list in fubar.items():
    # collect every parent directory of the requested files
    unique_dirs = {str(p) for b in files_to_extract_list
                   for p in Path(b).parents
                   if str(p) != '.'}  # compare by value; "is not" tests identity
    # shortest paths first, so parents come before their children
    sorted_unique_dirs = sorted(unique_dirs, key=len)
    all_dirs_and_files = [*sorted_unique_dirs, *files_to_extract_list]
    archive.extract(path='/tmp/logs',
                    targets=all_dirs_and_files)

Values of all_dirs_and_files per iteration:

1st iteration:
['FUBAR', 'FUBAR/65155', 'FUBAR/65155/feedback.txt.6', 'FUBAR/65155/feedback.txt.5', 'FUBAR/65155/feedback.txt.4', 'FUBAR/65155/feedback.txt.3', 'FUBAR/65155/feedback.txt.2', 'FUBAR/65155/feedback.txt.1', 'FUBAR/65155/feedback.txt']

2nd iteration:
['FUBAR', 'FUBAR/65268', 'FUBAR/65268/feedback.txt.5', 'FUBAR/65268/feedback.txt.4', 'FUBAR/65268/feedback.txt.3', 'FUBAR/65268/feedback.txt.2', 'FUBAR/65268/feedback.txt.1', 'FUBAR/65268/feedback.txt']

@michaelfecher
Author

michaelfecher commented Jan 30, 2020

Strange...
When I move

archive = py7zr.SevenZipFile(open(zip_location, 'rb'))

into the loop, it works.
Is that intended?
I'm asking because I'm used to opening the file once, doing the work, and then closing the file, or relying on auto-closing à la with open(...).
I don't know the details of the implementation, but won't there be an issue with the number of open file handles?
Still, I'm super happy that it works now 👍
Big thanks for your support so far!!
I will check how it behaves with the big 7z files ;)

@miurahr
Owner

miurahr commented Feb 3, 2020

The 7-zip format basically uses 'solid' archives, in which all files are compressed into a single archive stream.
When extracting data from the stream, the decompressor has to read the data from the beginning, even if the target data is placed at the end of the stream.

Both extract() and extractall() have to read and process all of the archive data, even if that is several gigabytes.
extract() reads all the data, discards the chunks that were not requested, and writes only the specified chunks to files.
extractall() reads all the data and writes every chunk to its target file.

After you call extract(), the internal file pointer is positioned at the end of the data.
We could seek the file pointer back to the start of the data on each iteration, but that is quite inefficient.
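
The effect of an exhausted file pointer can be shown with the standard library alone. This is only an analogy, not py7zr's code: an in-memory LZMA stream stands in for the archive file, and read_all is a hypothetical helper.

```python
import io
import lzma

payload = b'example data ' * 100
fp = io.BytesIO(lzma.compress(payload))  # stands in for the archive file


def read_all(stream: io.BytesIO) -> bytes:
    """Decompress everything from the stream's current position."""
    compressed = stream.read()
    return lzma.decompress(compressed) if compressed else b''


first = read_all(fp)    # consumes the stream; the pointer is now at the end
second = read_all(fp)   # nothing is left to read
fp.seek(0)              # rewind to the start of the data
third = read_all(fp)    # the whole stream is decoded again

print(len(first), len(second), len(third))  # 1300 0 1300
```

The third call works only because of the rewind, and it pays the full decoding cost a second time.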

You want to process a large archive (30 GB) in a loop. If the loop runs twice, you read 30 GB x 2 = 60 GB from disk. If it runs ten times, you read 300 GB from disk!

By its nature, the solid 7-zip format does not support random access; it is optimized for compression ratio instead.

It is recommended to construct the full list of files to extract first (you can use a loop for that) and then call extract() only once.
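
A minimal sketch of that recommendation, under the assumption that each loop iteration already produces its own target list (directories included) as in the report above. merge_targets and data_read_gb are hypothetical helpers, not py7zr functions.

```python
from typing import Dict, Iterable, List


def merge_targets(groups: Dict[str, Iterable[str]]) -> List[str]:
    """Merge per-iteration target lists into one list, keeping first-seen order."""
    merged: List[str] = []
    seen = set()
    for names in groups.values():
        for name in names:
            if name not in seen:
                seen.add(name)
                merged.append(name)
    return merged


def data_read_gb(stream_size_gb: float, passes: int) -> float:
    """Data processed when the whole solid stream is read on every pass."""
    return stream_size_gb * passes


groups = {
    '65155': ['FUBAR', 'FUBAR/65155', 'FUBAR/65155/feedback.txt'],
    '65268': ['FUBAR', 'FUBAR/65268', 'FUBAR/65268/feedback.txt'],
}
targets = merge_targets(groups)
# A single call would then process the archive once (assumed usage):
# archive.extract(path='/tmp/logs', targets=targets)
```

With these helpers, data_read_gb(30, 10) gives 300 for ten separate passes, against 30 for one pass over the merged list.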

@miurahr
Owner

miurahr commented Feb 4, 2020

Thanks @michaelfecher for testing.
PR #64 now provides extraction of specific files and supports iterating.
See:

py7zr/tests/test_basic.py

Lines 405 to 412 in ae9e76a

@pytest.mark.api
def test_py7zr_extract_and_reset_iteration(tmp_path):
    archive = py7zr.SevenZipFile(open(os.path.join(testdata_path, 'test_1.7z'), 'rb'))
    iterations = archive.getnames()
    for target in iterations:
        archive.extract(path=tmp_path, targets=[target])
        archive.reset()
    archive.close()
