Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF: move get_filepath_buffer into get_handle #37639

Merged
merged 2 commits into from Nov 13, 2020
Merged

Conversation

twoertwein
Copy link
Member

@twoertwein twoertwein commented Nov 5, 2020

This PR makes get_filepath_buffer a private function and is called inside get_handle - one function to rule all of IO ;)

This PR will make it easier for future PRs to make the IO-related interface of read/to_* more consistent as most of them should support compression/memory mapping (for reading)/(binary) file handles/storage options.

Notes to keep track of future follow-up PRs:

  • context manager for get_handle
  • storage_options for to_excel

@twoertwein twoertwein marked this pull request as ready for review November 5, 2020 08:31
@twoertwein twoertwein marked this pull request as draft November 5, 2020 19:41
@jreback jreback added IO Data IO issues that don't fit into a more specific label Refactor Internal refactoring of code labels Nov 6, 2020
@jreback jreback merged commit 6d1541e into pandas-dev:master Nov 13, 2020
@jreback
Copy link
Contributor

jreback commented Nov 13, 2020

thanks @twoertwein very nice!

@arw2019
Copy link
Member

arw2019 commented Dec 10, 2020

@twoertwein I have a question about the code you added in this PR. In #31817 I'm adding a new csv reader engine (pyarrow) to which I have to provide BytesIO (StringIO is not accepted). I currently have a wrapper to handle this conversion:

class BytesIOWrapper:
    """
    Allows the pyarrow engine for read_csv() to read from string buffers
    """

    def __init__(self, string_buffer: StringIO, encoding: str = "utf-8"):
        self.string_buffer = string_buffer
        self.encoding = encoding

    def __getattr__(self, attr: str):
        return getattr(self.string_buffer, attr)

    def read(self, size: int = -1):
        content = self.string_buffer.read(size)
        return content.encode(self.encoding)

but I'm wondering if this can be done using pandas.io.common.get_handle? I have tried but couldn't quite get it to work. Is it possible with the code as is? If it's not is it something worth adding to get_handle?

Thanks so much!

@twoertwein
Copy link
Member Author

Yes, get_handle would be a good place for this and you probably wouldn't even need any new arguments (is_text and mode might be sufficient).

Towards the end of get_handle there is code for the opposite (wrap binary handles in a TextIOWapper). Adding an elif-block with your wrapper after the TextIOWrapper would be a good place. On the other side, it could also be beneficial to add your wrapper for any text handle(?) before applying compression as we currently do not support compression for text handles.

I assume that the following code might get you quite far

if not (is_text or _is_binary_mode(handle, mode)):
    handle = BytesIOWrapper(handle, encoding) 
    mode = mode.replace("b", "")

@twoertwein
Copy link
Member Author

@arw2019 I think that the BytesIOWrapper is on its own already really useful: I think many doc-strings state that the function accepts any file handle but in fact it is in most cases either text or a binary handle.

About adding the BytesIOWrapper to get_handle: you probbly also need to make sure that the wrapper is itself not added to created_handles, otherwise close will be called on the wrapper which then will call close on the underlying buffer. If the buffer is provided by a user, we shouldn't close it. If the buffer is created by us, it is already in created_handle and will be closed. Any easier solution might be to make close a no-op, then you (or a later part in get_handle) can add it to created_handles.

@arw2019
Copy link
Member

arw2019 commented Dec 11, 2020

@twoertwein Thank you so much for these responses!!!

I'm thinking I'll do a separate PR to add the BytesIOWrapper class and integrate it into get_handle (as #31817 is already quite bloated). I'll cc you on that and would love your feedback if you have time to look!

@arw2019 arw2019 mentioned this pull request Dec 31, 2020
5 tasks
akx added a commit to akx/pandas that referenced this pull request Oct 10, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 11, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 14, 2022
akx added a commit to akx/pandas that referenced this pull request Oct 15, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label Refactor Internal refactoring of code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG:Cannot write as xlsx to GCS
5 participants