Add seekable and readable into fileobject #34
This commit adds seekable field into fileobject, which was missing. Without seekable, the nested zip cannot be correctly opened.
I don't think the word 'nested' is clear enough to convey the content structure, especially in test code. How about using the term 'internal' instead, carefully?
warnings.warn('In the current Python, Chainerio has to read '
              'the whole file content from the zip '
              'on open, which might cause performance or '
              'memory issues. '
I think the message is not clear, and it's hard to estimate the memory usage from it. How about this?
"In Python 3.6 or older, to open an internal zip file included in a zip file, ChainerIO reads the whole content of the internal zip. If it is large, it may cause memory issues."
Since every file is read in full on open, this is not limited to the internal zip file case.
How about this:
"In Python 3.6 or older, to ensure the seekable and readable attributes are correctly set, ChainerIO reads the whole content of a file when opening it. If the file is large, it may cause performance or memory issues."
The scope of "its whole content" is still not clear, and it is hard to estimate the amount of memory needed.
"In Python 3.6 or older, ChainerIO reads the whole content of the file to open from zip. It may cause performance or memory issue. For more details, read chainerio/containers/zip.py:24"
It might be too long to include everything
For example, I don't understand the case of opening a file named C (100MB) included in B.zip (1GB), which is itself included in A.zip (10GB). How much memory is needed when that message is printed? Which of A, B, and C does "a file" mean?
That is necessary info to use this library comfortably.
The nested zip case is a little confusing because "open_as_container" is itself an "open" in ChainerIO. In that case, opening C in B.zip in A.zip involves two opens:
=> open B.zip
=> open C
Going by "reads the whole content of the file to open from zip", to me this means reading the entire B.zip and then C. Since C is inside B.zip and MIGHT already be loaded when B.zip is read, it needs the size of B.zip (1GB).
In the simpler case of reading C (100MB) in B.zip (1GB), the sentence "reads the whole content of the file to open from zip" refers to C.
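The two opens described above can be sketched with the standard zipfile module alone. This is a minimal illustration under stated assumptions, not ChainerIO's implementation: make_zip is a hypothetical helper, the archives are tiny in-memory stand-ins for A.zip, B.zip and C, and buffering into io.BytesIO stands in for "reads the whole content".

```python
import io
import zipfile


def make_zip(members):
    """Hypothetical helper: build an in-memory zip from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


# Tiny stand-ins for C (100MB), B.zip (1GB) and A.zip (10GB).
c_bytes = b"payload of C"
b_zip = make_zip({"C": c_bytes})
a_zip = make_zip({"B.zip": b_zip})

# A.zip would normally live on disk; BytesIO keeps the sketch self-contained.
with zipfile.ZipFile(io.BytesIO(a_zip)) as outer:
    # First open: B.zip is read in full so that it can be treated as seekable.
    with outer.open("B.zip") as member:
        buffered_b = io.BytesIO(member.read())
    # Second open: C is read from the buffered copy of B.zip.
    with zipfile.ZipFile(buffered_b) as inner:
        c_read = inner.read("C")

# The buffer holds all of B.zip, so the memory cost follows B.zip, not A.zip.
buffered_size = len(buffered_b.getvalue())
```

In this sketch the buffered copy is exactly as large as B.zip, which matches the 1GB estimate given above for the nested case.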
"In Python 3.6 or older, ChainerIO reads the whole content of the file to open from zip. In particular, when using open_as_container to open another container in zip, ChainerIO reads that container as well. Such behavior may cause performance or memory issues."
To cover the internal container case
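As a rough sketch of how such a message could be emitted only on the affected interpreters: the function name warn_if_full_read, the constant name, and the RuntimeWarning category are assumptions for illustration, not taken from the PR. The version cutoff follows the "Python 3.6 or older" wording above.

```python
import sys
import warnings

# Wording taken from the proposal above; the constant name is hypothetical.
FULL_READ_MESSAGE = (
    "In Python 3.6 or older, ChainerIO reads the whole content of the file "
    "to open from zip. In particular, when using open_as_container to open "
    "another container in zip, ChainerIO reads that container as well. "
    "Such behavior may cause performance or memory issues."
)


def warn_if_full_read(version_info=None):
    """Hypothetical helper: warn when the interpreter is 3.6 or older.

    Returns True if a warning was issued, False otherwise.
    """
    if version_info is None:
        version_info = sys.version_info
    if tuple(version_info[:2]) < (3, 7):
        # RuntimeWarning is an assumed category; the PR does not name one.
        warnings.warn(FULL_READ_MESSAGE, RuntimeWarning, stacklevel=2)
        return True
    return False
```

Passing the version explicitly, for example warn_if_full_read((3, 6)), makes the behavior easy to exercise in tests without changing the interpreter.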
Thanks for the explanation in #34 (comment); I finally get it. How about limiting the all-content reading to just the case of opening a zip container from a container object?
I have thought about that case. Other libraries may also need seekable or readable, so if we do that, we might need to include all of those libraries, which is not a good idea. Otherwise we need to give the user an option to open with seekable or readable.
Which library actually needs file objects to be seekable? Forcing users to pay for unnecessary memory is not a good trade-off either. I also don't think giving a flag is a good idea. ChainerIO knows whether it's nested or not, so this can be hidden under the hood: for example, when open_as_container is called, if the base_handler is a descendant of the zip container, then re-wrap and replace the base_file_object with
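The idea of buffering only when a nested container is opened might look roughly like the following, written against the standard zipfile module rather than ChainerIO's classes. open_nested_zip and make_zip are hypothetical names, and the sketch assumes that zip members report seekable() as True on Python 3.7 and later, so the in-memory copy is only made on older interpreters.

```python
import io
import zipfile


def open_nested_zip(outer, name):
    """Hypothetical helper: open zip member `name` of `outer` as a ZipFile."""
    member = outer.open(name)
    if member.seekable():
        # Member supports seek(): no extra copy is made. The member must
        # stay open for as long as the returned ZipFile is in use.
        return zipfile.ZipFile(member)
    # Member is not seekable: buffer only this nested container in memory.
    try:
        return zipfile.ZipFile(io.BytesIO(member.read()))
    finally:
        member.close()


def make_zip(members, compression=zipfile.ZIP_DEFLATED):
    """Hypothetical helper: build an in-memory zip from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=compression) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


b_zip = make_zip({"C": b"payload of C"})
a_zip = make_zip({"B.zip": b_zip})

with zipfile.ZipFile(io.BytesIO(a_zip)) as outer:
    with open_nested_zip(outer, "B.zip") as inner:
        nested_payload = inner.read("C")
```

Ordinary files opened from a zip are untouched in this sketch; only the open-as-container path pays for buffering, and only when the member cannot seek.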
Closing in favor of #38.
This commit adds the seekable and readable functions to fileobject, which were missing.
Other libraries often use these two functions to switch behavior depending on the underlying file system.
Without forwarding them, some libraries may not work properly with ChainerIO.
For example, zipfile binds its seekable attribute to the underlying file object and checks the zip file by seeking in __init__. Without the seekable function, a nested zip cannot be opened correctly, because the nested zip fails to be seekable, which zipfile's __init__ requires.
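A minimal sketch of the forwarding described above, assuming a plain wrapper class; ForwardingFileObject is a hypothetical name and not ChainerIO's actual fileobject. It delegates seekable and readable, along with the read, seek and tell calls that zipfile makes, to the wrapped object.

```python
import io
import zipfile


class ForwardingFileObject:
    """Hypothetical wrapper that forwards file methods to a base object."""

    def __init__(self, base):
        self._base = base

    def read(self, size=-1):
        return self._base.read(size)

    def seek(self, offset, whence=io.SEEK_SET):
        return self._base.seek(offset, whence)

    def tell(self):
        return self._base.tell()

    def readable(self):
        # Forwarded so callers can check capabilities before reading.
        return self._base.readable()

    def seekable(self):
        # Forwarded so libraries such as zipfile can query seek support.
        return self._base.seekable()

    def close(self):
        self._base.close()


# Build a small zip in memory, then read it back through the wrapper.
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w") as zf:
    zf.writestr("hello.txt", b"hello")
raw.seek(0)

wrapped = ForwardingFileObject(raw)
with zipfile.ZipFile(wrapped) as zf:
    content = zf.read("hello.txt")
```

Because the wrapper answers seekable() and readable() from the base object, zipfile sees the same capabilities it would see on the unwrapped file.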