Extract archive into buffer (dict of io.BytesIO()) #111

Zoynels · 2020-04-09T12:10:11Z

My issue #110
Extract archive into dict of io.BytesIO(), where keys are filepath and values are io.BytesIO()
Not work with symlinks

Extract archive into buffer (dict of io.BytesIO()) Add several test for this

codecov · 2020-04-09T12:31:17Z

Codecov Report

Merging #111 into master will increase coverage by 0.12%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #111      +/-   ##
==========================================
+ Coverage   84.41%   84.53%   +0.12%     
==========================================
  Files          10       10              
  Lines        2258     2270      +12     
  Branches      378      381       +3     
==========================================
+ Hits         1906     1919      +13     
  Misses        230      230              
+ Partials      122      121       -1

Flag	Coverage Δ
#linux	`83.43% <100.00%> (+0.16%)`	⬆️
#mac	`83.21% <100.00%> (+0.13%)`	⬆️
#py35	`80.00% <100.00%> (+0.15%)`	⬆️
#py36	`84.28% <100.00%> (+0.12%)`	⬆️
#py37	`84.28% <100.00%> (+0.12%)`	⬆️
#py38	`84.11% <100.00%> (+1.19%)`	⬆️
#windows	`83.25% <100.00%> (+0.13%)`	⬆️

Impacted Files	Coverage Δ
py7zr/compression.py	`95.72% <100.00%> (-0.72%)`	⬇️
py7zr/py7zr.py	`85.50% <100.00%> (+0.57%)`	⬆️

miurahr · 2020-04-12T06:43:44Z

Strange.. Travis-CI test result is not appeared in check in github but it was failed. Please fix the failures.

see https://travis-ci.org/github/miurahr/py7zr/builds/673297726

Errors are as follows:

mypy runtests: commands[0] | mypy py7zr
py7zr/py7zr.py:730: error: No return value expected
py7zr/compression.py:51: error: Need type annotation for '_dict'

and

check runtests: commands[1] | flake8 py7zr tests setup.py
py7zr/compression.py:73:45: E127 continuation line over-indented for visual indent
py7zr/py7zr.py:728:126: E501 line too long (126 > 125 characters)

You can run tox -e check, mypy on your desktop.

miurahr · 2020-04-12T06:51:30Z

Now update github integration of travis-ci. The result will show from next push.

miurahr · 2020-04-12T07:00:35Z

@Zoynels could you merge master HEAD into this PR branch and push again?

tests/test_extract.py

py7zr/compression.py

miurahr · 2020-04-12T12:40:26Z

Looks good and a failure is not related to here.

miurahr · 2020-04-13T03:34:37Z

I have several questions related to API design, @Zoynels

This is an alternative interface against extract() and extractall() so it should be same functionality as same as file system. How to treat meta data, such as filename, timestamp, permissions.
7zip is an archive format, that may have directories, symbolic link and other special files. How to provide these to API users?
extract() and extractall() are considered opposite function against write() and writeall(). From a view point of API design, it is better that a return value from extractall() is usable for writeall() input.
Why value is stored in dict, how key become, how to handle archive order of files?
extract() has a functionality to filter contents with file spec. when the filter is specified, how output dictionary become?
Current py7zr are carefully designed not to keep many memory, only use a limited size(8MB) of buffer , to avoid memory leak. I'm afraid that it use python memory and variable self._dict cannot freed even close() is called. Because memory management of content buffer is responsible by user application, it may be better for API to accept buffer variable from user application not in py7zr class.
7zip file format allows multiple files with same file path in archive. How become it?

Zoynels · 2020-04-13T05:51:31Z

have several questions related to API design, @Zoynels

This is an alternative interface against extract() and extractall() so it should be same functionality as same as file system. How to treat meta data, such as filename, timestamp, permissions.

Now it stores only filename and data of extracted file. So if it should store such metadata as timestamp, permissions we need more complex dict storage, e.i.:
_dict = {}
_dict[filename1] = {}
_dict[filename1]["iobytes"] = iobytes
_dict[filename1]["timestamp"] = timestamp
_dict[filename1]["permissions"] = permissions
_dict[filename1]["is_symlink"] = is_symlink
and etc.
Maybe it will be good. Now I don't have such need, but or future for someone it will need. For how lets store only basic information in such format.
Or create class instead of dict.

7zip is an archive format, that may have directories, symbolic link and other special files. How to provide these to API users?

Now _dict stores only files, not directories. If we need to save all structure, then I see several ways:

plain dict where keys of _dict store string (or it could be pathlib.Path()) filepath of files and directories, and by metainformation it define what it store.
dict of dict, where could be subdirectories.
create class to store information instead of dict way 1 or 2

extract() and extractall() are considered opposite function against write() and writeall(). From a view point of API design, it is better that a return value from extractall() is usable for writeall() input.

I think for extract(), extractall(), write(), writeall() when extract into real files it should return True/False if write/extraction finished with success or failure.

Why value is stored in dict, how key become, how to handle archive order of files?
_dict is simple way where extraction of files have same sort like in archive.

Is it could store information in class and it could be more good. What do you think?

extract() has a functionality to filter contents with file spec. when the filter is specified, how output dictionary become?

It uses standart way of extraction, only when saves data into io it saves it into io.iobytes() instead of real file. All other functionality is same. But now it don't restore junction, symlinks as I don't fully understand how filenames should stores. I write about it in #112

Current py7zr are carefully designed not to keep many memory, only use a limited size(8MB) of buffer , to avoid memory leak. I'm afraid that it use python memory and variable self._dict cannot freed even close() is called. Because memory management of content buffer is responsible by user application, it may be better for API to accept buffer variable from user application not in py7zr class.

_dict is filled only if extract into dict and if extract into filesystem it will be empty. If you think that instead of return_dict: bool = False we could place None or _dict which will be filled with extracted information. In this way only dict could be placed and class of file couldn't be created. I thin it is also good option.
Several open questions will be eliminated.

7zip file format allows multiple files with same file path in archive. How become it?
It now uses standart way which tested:

filename
filename_0

miurahr · 2020-04-13T11:57:32Z

This is an alternative interface against extract() and extractall() so it should be same functionality as same as file system. How to treat meta data, such as filename, timestamp, permissions.

Now it stores only filename and data of extracted file. So if it should store such metadata as timestamp, permissions we need more complex dict storage, e.i.:
_dict = {}
_dict[filename1] = {}
_dict[filename1]["iobytes"] = iobytes
_dict[filename1]["timestamp"] = timestamp
_dict[filename1]["permissions"] = permissions
_dict[filename1]["is_symlink"] = is_symlink
and etc.
Maybe it will be good. Now I don't have such need, but or future for someone it will need. For how lets store only basic information in such format.
Or create class instead of dict.

We have an interface SevenZipFile.list() that return a List[FileInfo].

That include these meta data. If user combine this interface with its contents, it may be ok.

miurahr · 2020-04-13T12:14:08Z

When looking into python standard ZipFile API, which has a same purpose API named read()

https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.read

ZipFile.read(name, pwd=None)
Return the bytes of the file name in the archive. name is the name of the file in the archive, or a ZipInfo object.

It seems to be better SevenZipFile and ZipFile share the same API. Unfortunately TarFile class does not have an interface for same purpose.

Zoynels · 2020-04-13T17:25:00Z

When looking into python standard ZipFile API, which has a same purpose API named read()

https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.read

ZipFile.read(name, pwd=None)
Return the bytes of the file name in the archive. name is the name of the file in the archive, or a ZipInfo object.

It seems to be better SevenZipFile and ZipFile share the same API. Unfortunately TarFile class does not have an interface for same purpose.

Yes this is the best choice. You are right that it hasn't read() intarface, but:

https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.extractfile

TarFile.extractfile(member)
Extract a member from the archive as a file object. member may be a filename or a TarInfo object. If member is a regular file or a link, an io.BufferedReader object is returned. Otherwise, None is returned.

Changed in version 3.3: Return an io.BufferedReader object.
It returns io.BufferedReader, which has read() interface.
I don't know your library well, but I couldn't find any links to tarfile class in library.
If you could implement such functionality, then there will be no need to export into BytesIO

miurahr · 2020-04-13T23:53:01Z

TarFile.extractfile(member)
Extract a member from the archive as a file object. member may be a filename or a TarInfo object. If member is a regular file or a link, an io.BufferedReader object is returned. Otherwise, None is returned.

Changed in version 3.3: Return an io.BufferedReader object.
It returns io.BufferedReader, which has read() interface.

I don't know your library well, but I couldn't find any links to tarfile class in library.
If you could implement such functionality, then there will be no need to export into BytesIO

Tar file is a simple concatenate of files, so it can return opened archive file, with seeking file pointer to head of an inside file in archive file.

And as you know ,7-zip is an archive format with solid-compression support, which is unable to extract in random access but good compression ratio.

We cannot provide an interface which argument is a member or filename, which requires random access mode, or read entire archive file for extraction of each portion of archive.
I have provided a function with an argument of filespec/file filter, so it can be return io.BytesIO(), or we can provide interface to register call-back function which is called when py7zr is ready to provide data.

miurahr · 2020-04-13T23:58:43Z

Here is my proposal for an interface design.

extract(filespec) extractall()
extract files with filter, or extract all files onto filesystem

read(filespec), readall()
extract files with filter, or extract all files into io.ByteIO then return dictionary
archive(), archiveall()
make archive with compressed files on a file system. currently implement as methods write() and writeall()
write(buf, file property)
compress file from data into an archive

miurahr

I'd like to introduce methods read() and readall() for dictionary return.
It help keep method signature for extract() and extractall(), and help understand read() as similar as ZipFile class.

miurahr · 2020-04-15T01:51:10Z

py7zr/py7zr.py

@@ -644,15 +644,16 @@ def test(self) -> bool:
        """Test archive using CRC digests."""
        return self._test_digests()

-    def extractall(self, path: Optional[Any] = None) -> None:
+    def extractall(self, path: Optional[Any] = None, return_dict: bool = False) -> Optional[Dict[str, IO[Any]]]:


Shall we keep extractall(self, path)-> None as is and introduce readall(self, path) -> Dict[str, IO[Any]]?

miurahr · 2020-04-15T01:51:50Z

py7zr/py7zr.py


-    def extract(self, path: Optional[Any] = None, targets: Optional[List[str]] = None) -> None:
+    def extract(self, path: Optional[Any] = None, targets: Optional[List[str]] = None,


keep extract(self, path)->None method signature as is and introduce read(self, path) -> Dict[str, IO[Any]]Introduce internal fucntion_extract(self, path, targets, return_dict) -> Optional[Dict]]` that is your proposal.

That keeps method signature and introduce read()

miurahr · 2020-04-15T01:56:02Z

tests/test_extract.py

-    assert tmp_path.joinpath('test/test2.txt').open('rb').read() == bytes('This file is located in a folder.', 'ascii')
-    assert tmp_path.joinpath('test1.txt').open('rb').read() == bytes('This file is located in the root.', 'ascii')
+    if not return_dict:
+        archive.extractall(path=tmp_path, return_dict=return_dict)


can change to archive.exttractall(path=tmp_path)

miurahr · 2020-04-15T01:56:32Z

tests/test_extract.py

+        assert tmp_path.joinpath('test/test2.txt').open('rb').read() == bytes('This file is located in a folder.', 'ascii')
+        assert tmp_path.joinpath('test1.txt').open('rb').read() == bytes('This file is located in the root.', 'ascii')
+    else:
+        _dict = archive.extractall(return_dict=return_dict)


It become _dict = archive.readall()

miurahr · 2020-05-01T05:45:29Z

Here is no activity by original proposer @Zoynels in recent two weeks. We have good discussion about API and hope merge it with modification.
I'd like to take over the topic in my local.

Extract archive into buffer (dict of io.BytesIO()) (Update of #111)

Zoynels added 2 commits April 9, 2020 18:04

Extract archive into buffer (dict of io.BytesIO())

68b5d40

Extract archive into buffer (dict of io.BytesIO()) Add several test for this

Update test

720923e

Zoynels added 9 commits April 10, 2020 10:24

Add additional test for empty archive

da585ce

Remove None file

cb737db

Update same delimiter = "/" for filename in Windows and other

72d9023

Update test: return_dict in same function as base extractall

ccdaa0a

Update test_extract.py

f14dabe

Update test_extract.py

96f071b

Update test_extract.py

d50de2d

Update test_extract.py

49c758e

Update test_extract.py

59923e5

Zoynels changed the title ~~Extract archive into buffer (dict of io.BytesIO()) #110~~ Extract archive into buffer (dict of io.BytesIO()) Apr 10, 2020

miurahr requested changes Apr 12, 2020

View reviewed changes

Zoynels added 13 commits April 12, 2020 13:38

Merge remote-tracking branch 'upstream/master'

f66778b

fix check

8233f84

fix flake8

94fc8d4

@pytest.mark.parametrize('return_dict', [False, True])

eac4007

@pytest.mark.parametrize('return_dict', [False, True])

f935e23

Update test_extract.py

80dfd01

Update test_extract.py

337507f

Update test_extract.py

869be59

type definition

4911dc8

Fix type definition

27769b2

Unite extract_single

5a1f491

Fix test

3558170

Update extract_single

a046d66

Zoynels requested a review from miurahr April 12, 2020 12:31

Merge branch 'master' into master

4727442

miurahr requested changes Apr 15, 2020

View reviewed changes

This comment has been minimized.

Sign in to view

miurahr closed this May 1, 2020

miurahr added a commit that referenced this pull request May 6, 2020

Merge pull request #119 from miurahr/topic-memory-api

e78c1b9

Extract archive into buffer (dict of io.BytesIO()) (Update of #111)

miurahr mentioned this pull request Apr 2, 2024

read/readall dumps the decompressed files to memory, instead of streaming them #579

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extract archive into buffer (dict of io.BytesIO()) #111

Extract archive into buffer (dict of io.BytesIO()) #111

Zoynels commented Apr 9, 2020 •

edited

codecov bot commented Apr 9, 2020 •

edited

miurahr commented Apr 12, 2020 •

edited

miurahr commented Apr 12, 2020

miurahr commented Apr 12, 2020

miurahr commented Apr 12, 2020

miurahr commented Apr 13, 2020 •

edited

Zoynels commented Apr 13, 2020

miurahr commented Apr 13, 2020

miurahr commented Apr 13, 2020

Zoynels commented Apr 13, 2020 •

edited

miurahr commented Apr 13, 2020

miurahr commented Apr 13, 2020

miurahr left a comment

miurahr Apr 15, 2020

miurahr Apr 15, 2020

miurahr Apr 15, 2020

miurahr Apr 15, 2020

This comment has been minimized.

miurahr commented May 1, 2020 •

edited


		def extract(self, path: Optional[Any] = None, targets: Optional[List[str]] = None) -> None:
		def extract(self, path: Optional[Any] = None, targets: Optional[List[str]] = None,

Extract archive into buffer (dict of io.BytesIO()) #111

Extract archive into buffer (dict of io.BytesIO()) #111

Conversation

Zoynels commented Apr 9, 2020 • edited

codecov bot commented Apr 9, 2020 • edited

Codecov Report

miurahr commented Apr 12, 2020 • edited

miurahr commented Apr 12, 2020

miurahr commented Apr 12, 2020

miurahr commented Apr 12, 2020

miurahr commented Apr 13, 2020 • edited

Zoynels commented Apr 13, 2020

miurahr commented Apr 13, 2020

miurahr commented Apr 13, 2020

Zoynels commented Apr 13, 2020 • edited

miurahr commented Apr 13, 2020

miurahr commented Apr 13, 2020

miurahr left a comment

Choose a reason for hiding this comment

miurahr Apr 15, 2020

Choose a reason for hiding this comment

miurahr Apr 15, 2020

Choose a reason for hiding this comment

miurahr Apr 15, 2020

Choose a reason for hiding this comment

miurahr Apr 15, 2020

Choose a reason for hiding this comment

This comment has been minimized.

miurahr commented May 1, 2020 • edited

Zoynels commented Apr 9, 2020 •

edited

codecov bot commented Apr 9, 2020 •

edited

miurahr commented Apr 12, 2020 •

edited

miurahr commented Apr 13, 2020 •

edited

Zoynels commented Apr 13, 2020 •

edited

miurahr commented May 1, 2020 •

edited