Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add seekable and readable into fileobject #34

Closed
wants to merge 6 commits into from
Closed

Conversation

belldandyxtq
Copy link
Member

@belldandyxtq belldandyxtq commented Jul 24, 2019

This commit adds seekable and readable function into fileobject, which was missing.
These two functions are often used by other libraries for switching behavior for different underlying file system.
Missing forwarding of those functions can cause some libraries not work properly with ChainerIO.
For example, zipfile binds its seekable attribute to the underlying filesystem and it checks the zip file by seeking in init, missing the seekable function can cause nested zip cannot be correctly opened, as the nested zip fails to be seekable, which requires by the zipfile init module

This commit adds seekable field into fileobject, which was missing.
Without seekable, the nested zip cannot be correctly opened.
@belldandyxtq belldandyxtq changed the title Add seekable field into fileobject Add seekable and readable into fileobject Jul 30, 2019
Copy link
Member

@kuenishi kuenishi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think using the word 'nested' is clear enough to understand the content structure, especially in test code. How about using term 'internal' carefully?

warnings.warn('In the current Python, Chainerio has to read '
'the whole file content from the zip '
'on open, which might cause performance or '
'memory issues. '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the message is not clear, and it's hard to estimate the memory usage. How about this?
"In Python 3.6 or older, to open an internal zip file included in a zip file, ChainerIO reads the whole content of the internal zip. If it is large, it may cause memory issue."

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all the files will be read on open, it is not limited to internal zip file case.
How about this:
"In Python 3.6 or older, to ensure the seekable and readable attribute are correctly set, ChainerIO reads its whole content on file open. If the file is large, it may cause performance or memory issue."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scope of "its whole content" is still not clear and hard to estimate the amount of memory needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"In Python 3.6 or older, ChainerIO reads the whole content of the file to open from zip. It may cause performance or memory issue. For more details, read chainerio/containers/zip.py:24"

It might be too long to include everything

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So for example I don't understand opening a file named C (100MB) included in B.zip (1GB) included in A.zip (10GB). How much memory is needed when that message is printed? Which of A, B, C "a file" means?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is necessary info to use this library confortably.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The nested zip case is a little bit confusing as "open_as_container" is an "open" in ChainerIO, so in that case,
opening C in B.zip in A.zip involves two opens
=> open B.zip
=> open C
According to reads the whole content of the file to open from zip., for me, it means read entire B.zip and then C. Since C is in B.zip and MIGHT already loaded when reading B.zip, so it needs B.zip (1 GB).

For more simple cases, reading C(100MB) in B.zip(1GB), the sentence "reads the whole content of the file to open from zip." means C.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"In Python 3.6 or older, ChainerIO reads the whole content of the file to open from zip. Especially, when using open_as_container to open another container in zip, ChainerIO reads that container as well, Such behavior may cause performance or memory issue."

To cover the internal container case

@kuenishi
Copy link
Member

Thanks for explanation in #34 (comment) and finally I get it.
In the latter case just opening a file in a zip (non-nested case) ChainerIO should not read the whole content and users may or may not need the file object seekable. I don't like that behaviour because I just don't have to do that.

How about limiting all-content-reading into just opening a zip container from container object?

@belldandyxtq
Copy link
Member Author

belldandyxtq commented Jul 30, 2019

I have thought about that case. Since other libraries may also need seekable or readable, If we do that, then we might need to include all these libraries, which is not a good idea. Or we need to give user an option to open with seekable, or readable.

@kuenishi
Copy link
Member

Which library actually needs file objects being seekable? Forcing users to buy unnecessary memory is not a good pay-off too. I also don't think giving a flag is a good idea. As ChainerIO knows whether it's nested or not so it can be hidden under the water, like when open_as_container is called, if the base_handler is descendant of zip container then re-wrap and replace the base_file_object with io.BytesIO wrap.

@belldandyxtq belldandyxtq mentioned this pull request Aug 6, 2019
3 tasks
@belldandyxtq
Copy link
Member Author

Closes in favor of #38

@belldandyxtq belldandyxtq deleted the add_seekable branch October 1, 2019 07:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cat:bug Bug report or fix.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants