Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract archive into buffer (dict of io.BytesIO()) #111

Closed
wants to merge 25 commits into from

Conversation

Zoynels
Copy link
Contributor

@Zoynels Zoynels commented Apr 9, 2020

My issue #110
Extract archive into dict of io.BytesIO(), where keys are filepath and values are io.BytesIO()
Not work with symlinks

Extract archive into buffer (dict of io.BytesIO())
Add several test for this
@codecov
Copy link

codecov bot commented Apr 9, 2020

Codecov Report

Merging #111 into master will increase coverage by 0.12%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #111      +/-   ##
==========================================
+ Coverage   84.41%   84.53%   +0.12%     
==========================================
  Files          10       10              
  Lines        2258     2270      +12     
  Branches      378      381       +3     
==========================================
+ Hits         1906     1919      +13     
  Misses        230      230              
+ Partials      122      121       -1     
Flag Coverage Δ
#linux 83.43% <100.00%> (+0.16%) ⬆️
#mac 83.21% <100.00%> (+0.13%) ⬆️
#py35 80.00% <100.00%> (+0.15%) ⬆️
#py36 84.28% <100.00%> (+0.12%) ⬆️
#py37 84.28% <100.00%> (+0.12%) ⬆️
#py38 84.11% <100.00%> (+1.19%) ⬆️
#windows 83.25% <100.00%> (+0.13%) ⬆️
Impacted Files Coverage Δ
py7zr/compression.py 95.72% <100.00%> (-0.72%) ⬇️
py7zr/py7zr.py 85.50% <100.00%> (+0.57%) ⬆️

@Zoynels Zoynels changed the title Extract archive into buffer (dict of io.BytesIO()) #110 Extract archive into buffer (dict of io.BytesIO()) Apr 10, 2020
@miurahr
Copy link
Owner

miurahr commented Apr 12, 2020

Strange.. Travis-CI test result is not appeared in check in github but it was failed. Please fix the failures.

see https://travis-ci.org/github/miurahr/py7zr/builds/673297726

Errors are as follows:

mypy runtests: commands[0] | mypy py7zr
py7zr/py7zr.py:730: error: No return value expected
py7zr/compression.py:51: error: Need type annotation for '_dict'

and

check runtests: commands[1] | flake8 py7zr tests setup.py
py7zr/compression.py:73:45: E127 continuation line over-indented for visual indent
py7zr/py7zr.py:728:126: E501 line too long (126 > 125 characters)

You can run tox -e check, mypy on your desktop.

@miurahr
Copy link
Owner

miurahr commented Apr 12, 2020

Now update github integration of travis-ci. The result will show from next push.

@miurahr
Copy link
Owner

miurahr commented Apr 12, 2020

@Zoynels could you merge master HEAD into this PR branch and push again?

tests/test_extract.py Outdated Show resolved Hide resolved
tests/test_extract.py Outdated Show resolved Hide resolved
tests/test_extract.py Outdated Show resolved Hide resolved
tests/test_extract.py Outdated Show resolved Hide resolved
tests/test_extract.py Outdated Show resolved Hide resolved
tests/test_extract.py Outdated Show resolved Hide resolved
tests/test_extract.py Outdated Show resolved Hide resolved
tests/test_extract.py Outdated Show resolved Hide resolved
py7zr/compression.py Outdated Show resolved Hide resolved
py7zr/compression.py Outdated Show resolved Hide resolved
@Zoynels Zoynels requested a review from miurahr April 12, 2020 12:31
@miurahr
Copy link
Owner

miurahr commented Apr 12, 2020

Looks good and a failure is not related to here.

@miurahr
Copy link
Owner

miurahr commented Apr 13, 2020

I have several questions related to API design, @Zoynels

  1. This is an alternative interface against extract() and extractall() so it should be same functionality as same as file system. How to treat meta data, such as filename, timestamp, permissions.

  2. 7zip is an archive format, that may have directories, symbolic link and other special files. How to provide these to API users?

  3. extract() and extractall() are considered opposite function against write() and writeall(). From a view point of API design, it is better that a return value from extractall() is usable for writeall() input.

  4. Why value is stored in dict, how key become, how to handle archive order of files?

  5. extract() has a functionality to filter contents with file spec. when the filter is specified, how output dictionary become?

  6. Current py7zr are carefully designed not to keep many memory, only use a limited size(8MB) of buffer , to avoid memory leak. I'm afraid that it use python memory and variable self._dict cannot freed even close() is called. Because memory management of content buffer is responsible by user application, it may be better for API to accept buffer variable from user application not in py7zr class.

  7. 7zip file format allows multiple files with same file path in archive. How become it?

@Zoynels
Copy link
Contributor Author

Zoynels commented Apr 13, 2020

have several questions related to API design, @Zoynels

  1. This is an alternative interface against extract() and extractall() so it should be same functionality as same as file system. How to treat meta data, such as filename, timestamp, permissions.

Now it stores only filename and data of extracted file. So if it should store such metadata as timestamp, permissions we need more complex dict storage, e.i.:
_dict = {}
_dict[filename1] = {}
_dict[filename1]["iobytes"] = iobytes
_dict[filename1]["timestamp"] = timestamp
_dict[filename1]["permissions"] = permissions
_dict[filename1]["is_symlink"] = is_symlink
and etc.
Maybe it will be good. Now I don't have such need, but or future for someone it will need. For how lets store only basic information in such format.
Or create class instead of dict.

  1. 7zip is an archive format, that may have directories, symbolic link and other special files. How to provide these to API users?

Now _dict stores only files, not directories. If we need to save all structure, then I see several ways:

  1. plain dict where keys of _dict store string (or it could be pathlib.Path()) filepath of files and directories, and by metainformation it define what it store.
  2. dict of dict, where could be subdirectories.
  3. create class to store information instead of dict way 1 or 2
  1. extract() and extractall() are considered opposite function against write() and writeall(). From a view point of API design, it is better that a return value from extractall() is usable for writeall() input.

I think for extract(), extractall(), write(), writeall() when extract into real files it should return True/False if write/extraction finished with success or failure.

  1. Why value is stored in dict, how key become, how to handle archive order of files?
    _dict is simple way where extraction of files have same sort like in archive.

Is it could store information in class and it could be more good. What do you think?

  1. extract() has a functionality to filter contents with file spec. when the filter is specified, how output dictionary become?

It uses standart way of extraction, only when saves data into io it saves it into io.iobytes() instead of real file. All other functionality is same. But now it don't restore junction, symlinks as I don't fully understand how filenames should stores. I write about it in #112

  1. Current py7zr are carefully designed not to keep many memory, only use a limited size(8MB) of buffer , to avoid memory leak. I'm afraid that it use python memory and variable self._dict cannot freed even close() is called. Because memory management of content buffer is responsible by user application, it may be better for API to accept buffer variable from user application not in py7zr class.

_dict is filled only if extract into dict and if extract into filesystem it will be empty. If you think that instead of return_dict: bool = False we could place None or _dict which will be filled with extracted information. In this way only dict could be placed and class of file couldn't be created. I thin it is also good option.
Several open questions will be eliminated.

  1. 7zip file format allows multiple files with same file path in archive. How become it?
    It now uses standart way which tested:
  • filename
  • filename_0

@miurahr
Copy link
Owner

miurahr commented Apr 13, 2020

This is an alternative interface against extract() and extractall() so it should be same functionality as same as file system. How to treat meta data, such as filename, timestamp, permissions.

Now it stores only filename and data of extracted file. So if it should store such metadata as timestamp, permissions we need more complex dict storage, e.i.:
_dict = {}
_dict[filename1] = {}
_dict[filename1]["iobytes"] = iobytes
_dict[filename1]["timestamp"] = timestamp
_dict[filename1]["permissions"] = permissions
_dict[filename1]["is_symlink"] = is_symlink
and etc.
Maybe it will be good. Now I don't have such need, but or future for someone it will need. For how lets store only basic information in such format.
Or create class instead of dict.

We have an interface SevenZipFile.list() that return a List[FileInfo].

That include these meta data. If user combine this interface with its contents, it may be ok.

@miurahr
Copy link
Owner

miurahr commented Apr 13, 2020

When looking into python standard ZipFile API, which has a same purpose API named read()

https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.read

ZipFile.read(name, pwd=None)
Return the bytes of the file name in the archive. name is the name of the file in the archive, or a ZipInfo object.

It seems to be better SevenZipFile and ZipFile share the same API. Unfortunately TarFile class does not have an interface for same purpose.

@Zoynels
Copy link
Contributor Author

Zoynels commented Apr 13, 2020

When looking into python standard ZipFile API, which has a same purpose API named read()

https://docs.python.org/3/library/zipfile.html#zipfile.ZipFile.read

ZipFile.read(name, pwd=None)
Return the bytes of the file name in the archive. name is the name of the file in the archive, or a ZipInfo object.

It seems to be better SevenZipFile and ZipFile share the same API. Unfortunately TarFile class does not have an interface for same purpose.

Yes this is the best choice. You are right that it hasn't read() intarface, but:

https://docs.python.org/3/library/tarfile.html#tarfile.TarFile.extractfile

TarFile.extractfile(member)
Extract a member from the archive as a file object. member may be a filename or a TarInfo object. If member is a regular file or a link, an io.BufferedReader object is returned. Otherwise, None is returned.

Changed in version 3.3: Return an io.BufferedReader object.
It returns io.BufferedReader, which has read() interface.
I don't know your library well, but I couldn't find any links to tarfile class in library.
If you could implement such functionality, then there will be no need to export into BytesIO

@miurahr
Copy link
Owner

miurahr commented Apr 13, 2020

TarFile.extractfile(member)
Extract a member from the archive as a file object. member may be a filename or a TarInfo object. If member is a regular file or a link, an io.BufferedReader object is returned. Otherwise, None is returned.

Changed in version 3.3: Return an io.BufferedReader object.
It returns io.BufferedReader, which has read() interface.

I don't know your library well, but I couldn't find any links to tarfile class in library.
If you could implement such functionality, then there will be no need to export into BytesIO

Tar file is a simple concatenate of files, so it can return opened archive file, with seeking file pointer to head of an inside file in archive file.

And as you know ,7-zip is an archive format with solid-compression support, which is unable to extract in random access but good compression ratio.

We cannot provide an interface which argument is a member or filename, which requires random access mode, or read entire archive file for extraction of each portion of archive.
I have provided a function with an argument of filespec/file filter, so it can be return io.BytesIO(), or we can provide interface to register call-back function which is called when py7zr is ready to provide data.

@miurahr
Copy link
Owner

miurahr commented Apr 13, 2020

Here is my proposal for an interface design.

  • extract(filespec) extractall()
    extract files with filter, or extract all files onto filesystem
  • read(filespec), readall()
    extract files with filter, or extract all files into io.ByteIO then return dictionary

  • archive(), archiveall()
    make archive with compressed files on a file system. currently implement as methods write() and writeall()

  • write(buf, file property)
    compress file from data into an archive

Copy link
Owner

@miurahr miurahr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to introduce methods read() and readall() for dictionary return.
It help keep method signature for extract() and extractall(), and help understand read() as similar as ZipFile class.

@@ -644,15 +644,16 @@ def test(self) -> bool:
"""Test archive using CRC digests."""
return self._test_digests()

def extractall(self, path: Optional[Any] = None) -> None:
def extractall(self, path: Optional[Any] = None, return_dict: bool = False) -> Optional[Dict[str, IO[Any]]]:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we keep extractall(self, path)-> None as is and introduce readall(self, path) -> Dict[str, IO[Any]]?


def extract(self, path: Optional[Any] = None, targets: Optional[List[str]] = None) -> None:
def extract(self, path: Optional[Any] = None, targets: Optional[List[str]] = None,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep extract(self, path)->None method signature as is and introduce read(self, path) -> Dict[str, IO[Any]]Introduce internal fucntion_extract(self, path, targets, return_dict) -> Optional[Dict]]` that is your proposal.

That keeps method signature and introduce read()

assert tmp_path.joinpath('test/test2.txt').open('rb').read() == bytes('This file is located in a folder.', 'ascii')
assert tmp_path.joinpath('test1.txt').open('rb').read() == bytes('This file is located in the root.', 'ascii')
if not return_dict:
archive.extractall(path=tmp_path, return_dict=return_dict)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can change to archive.exttractall(path=tmp_path)

assert tmp_path.joinpath('test/test2.txt').open('rb').read() == bytes('This file is located in a folder.', 'ascii')
assert tmp_path.joinpath('test1.txt').open('rb').read() == bytes('This file is located in the root.', 'ascii')
else:
_dict = archive.extractall(return_dict=return_dict)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It become _dict = archive.readall()

@andrebrait

This comment has been minimized.

@miurahr
Copy link
Owner

miurahr commented May 1, 2020

Here is no activity by original proposer @Zoynels in recent two weeks. We have good discussion about API and hope merge it with modification.
I'd like to take over the topic in my local.

@miurahr miurahr closed this May 1, 2020
miurahr added a commit that referenced this pull request May 6, 2020
Extract archive into buffer (dict of io.BytesIO()) (Update of #111)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants