Add seekable and readable into fileobject #34

belldandyxtq · 2019-07-24T05:48:45Z

This commit adds seekable and readable function into fileobject, which was missing.
These two functions are often used by other libraries for switching behavior for different underlying file system.
Missing forwarding of those functions can cause some libraries not work properly with ChainerIO.
For example, zipfile binds its seekable attribute to the underlying filesystem and it checks the zip file by seeking in init, missing the seekable function can cause nested zip cannot be correctly opened, as the nested zip fails to be seekable, which requires by the zipfile init module

This commit adds seekable field into fileobject, which was missing. Without seekable, the nested zip cannot be correctly opened.

kuenishi

I don't think using the word 'nested' is clear enough to understand the content structure, especially in test code. How about using term 'internal' carefully?

kuenishi · 2019-07-30T07:32:06Z

chainerio/containers/zip.py

+            warnings.warn('In the current Python, Chainerio has to read '
+                          'the whole file content from the zip '
+                          'on open, which might cause performance or '
+                          'memory issues. '


I think the message is not clear, and it's hard to estimate the memory usage. How about this?
"In Python 3.6 or older, to open an internal zip file included in a zip file, ChainerIO reads the whole content of the internal zip. If it is large, it may cause memory issue."

Since all the files will be read on open, it is not limited to internal zip file case.
How about this:
"In Python 3.6 or older, to ensure the seekable and readable attribute are correctly set, ChainerIO reads its whole content on file open. If the file is large, it may cause performance or memory issue."

The scope of "its whole content" is still not clear and hard to estimate the amount of memory needed.

"In Python 3.6 or older, ChainerIO reads the whole content of the file to open from zip. It may cause performance or memory issue. For more details, read chainerio/containers/zip.py:24"

It might be too long to include everything

So for example I don't understand opening a file named C (100MB) included in B.zip (1GB) included in A.zip (10GB). How much memory is needed when that message is printed? Which of A, B, C "a file" means?

That is necessary info to use this library confortably.

The nested zip case is a little bit confusing as "open_as_container" is an "open" in ChainerIO, so in that case,
opening C in B.zip in A.zip involves two opens
=> open B.zip
=> open C
According to reads the whole content of the file to open from zip., for me, it means read entire B.zip and then C. Since C is in B.zip and MIGHT already loaded when reading B.zip, so it needs B.zip (1 GB).

For more simple cases, reading C(100MB) in B.zip(1GB), the sentence "reads the whole content of the file to open from zip." means C.

"In Python 3.6 or older, ChainerIO reads the whole content of the file to open from zip. Especially, when using open_as_container to open another container in zip, ChainerIO reads that container as well, Such behavior may cause performance or memory issue."

To cover the internal container case

kuenishi · 2019-07-30T08:41:22Z

Thanks for explanation in #34 (comment) and finally I get it.
In the latter case just opening a file in a zip (non-nested case) ChainerIO should not read the whole content and users may or may not need the file object seekable. I don't like that behaviour because I just don't have to do that.

How about limiting all-content-reading into just opening a zip container from container object?

belldandyxtq · 2019-07-30T08:46:28Z

I have thought about that case. Since other libraries may also need seekable or readable, If we do that, then we might need to include all these libraries, which is not a good idea. Or we need to give user an option to open with seekable, or readable.

kuenishi · 2019-07-30T10:05:14Z

Which library actually needs file objects being seekable? Forcing users to buy unnecessary memory is not a good pay-off too. I also don't think giving a flag is a good idea. As ChainerIO knows whether it's nested or not so it can be hidden under the water, like when open_as_container is called, if the base_handler is descendant of zip container then re-wrap and replace the base_file_object with io.BytesIO wrap.

belldandyxtq · 2019-08-06T10:05:53Z

Closes in favor of #38

belldandyxtq added the cat:bug Bug report or fix. label Jul 24, 2019

belldandyxtq force-pushed the add_seekable branch from 39742b0 to 4c0ce13 Compare July 24, 2019 07:44

belldandyxtq added 4 commits July 24, 2019 21:06

Add seekable field into fileobject

6889b51

This commit adds seekable field into fileobject, which was missing. Without seekable, the nested zip cannot be correctly opened.

add tests for nested zip

288d5cf

add support for Python < 3.7

0fd2b40

add readable function as well

e73a8e7

belldandyxtq force-pushed the add_seekable branch from 51e484b to e73a8e7 Compare July 24, 2019 12:07

kuenishi self-requested a review July 25, 2019 06:25

belldandyxtq added 2 commits July 25, 2019 16:33

add warning message

bc916d2

Update the warning message

dcf5f7d

belldandyxtq changed the title ~~Add seekable field into fileobject~~ Add seekable and readable into fileobject Jul 30, 2019

kuenishi requested changes Jul 30, 2019

View reviewed changes

belldandyxtq mentioned this pull request Aug 6, 2019

New design of open wrapper #37

Closed

3 tasks

belldandyxtq closed this Aug 6, 2019

belldandyxtq deleted the add_seekable branch October 1, 2019 07:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add seekable and readable into fileobject #34

Add seekable and readable into fileobject #34

belldandyxtq commented Jul 24, 2019 •

edited

kuenishi left a comment

kuenishi Jul 30, 2019

belldandyxtq Jul 30, 2019 •

edited

kuenishi Jul 30, 2019

belldandyxtq Jul 30, 2019 •

edited

kuenishi Jul 30, 2019

kuenishi Jul 30, 2019

belldandyxtq Jul 30, 2019

belldandyxtq Jul 30, 2019

kuenishi commented Jul 30, 2019

belldandyxtq commented Jul 30, 2019 •

edited

kuenishi commented Jul 30, 2019

belldandyxtq commented Aug 6, 2019

Add seekable and readable into fileobject #34

Add seekable and readable into fileobject #34

Conversation

belldandyxtq commented Jul 24, 2019 • edited

kuenishi left a comment

Choose a reason for hiding this comment

kuenishi Jul 30, 2019

Choose a reason for hiding this comment

belldandyxtq Jul 30, 2019 • edited

Choose a reason for hiding this comment

kuenishi Jul 30, 2019

Choose a reason for hiding this comment

belldandyxtq Jul 30, 2019 • edited

Choose a reason for hiding this comment

kuenishi Jul 30, 2019

Choose a reason for hiding this comment

kuenishi Jul 30, 2019

Choose a reason for hiding this comment

belldandyxtq Jul 30, 2019

Choose a reason for hiding this comment

belldandyxtq Jul 30, 2019

Choose a reason for hiding this comment

kuenishi commented Jul 30, 2019

belldandyxtq commented Jul 30, 2019 • edited

kuenishi commented Jul 30, 2019

belldandyxtq commented Aug 6, 2019

belldandyxtq commented Jul 24, 2019 •

edited

belldandyxtq Jul 30, 2019 •

edited

belldandyxtq Jul 30, 2019 •

edited

belldandyxtq commented Jul 30, 2019 •

edited