BUG/ENH: compression for google cloud storage in to_csv #35681

Merged
merged 3 commits into from
Sep 3, 2020

Member

@twoertwein twoertwein commented Aug 12, 2020

Inferring the compression before converting the path to a file object makes df.to_csv("gs://mybucket/test2.csv.gz", compression="infer", mode="wb") work. Wrapping fsspec file objects in a TextIOWrapper makes df.to_csv("gs://mybucket/test2.csv", mode="wb") work as well.

Path-like objects that are internally converted to file-like objects (in get_filepath_or_buffer) are now always opened in binary mode unless text mode is explicitly requested, and the potentially changed mode is returned, so there is no need to specify mode="wb" for Google Cloud files. As long as the Google file is opened in binary mode (which is now always the case), the requested encoding is honored as well.
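
The two mechanisms described above (resolving "infer" while the target is still a path, and wrapping a binary handle in a TextIOWrapper so the requested encoding is applied) can be sketched with the standard library alone. This is an illustration under assumptions, not the pandas implementation: infer_compression, open_text_writer, and the extension table are hypothetical names, only gzip is handled, and an in-memory io.BytesIO stands in for the binary handle that fsspec would return for a gs:// URL.

```python
import gzip
import io
from typing import BinaryIO, Optional

# Hypothetical extension table; only gzip is handled in this sketch.
_EXTENSIONS = {".gz": "gzip"}


def infer_compression(path: str, compression: Optional[str]) -> Optional[str]:
    """Resolve "infer" from the path suffix while the path is still a string."""
    if compression != "infer":
        return compression
    for suffix, method in _EXTENSIONS.items():
        if path.lower().endswith(suffix):
            return method
    return None


def open_text_writer(
    binary_handle: BinaryIO, compression: Optional[str], encoding: str = "utf-8"
) -> io.TextIOWrapper:
    """Layer optional gzip compression and text encoding over a binary handle."""
    if compression == "gzip":
        compressed = gzip.GzipFile(fileobj=binary_handle, mode="wb")
        return io.TextIOWrapper(compressed, encoding=encoding, newline="")
    return io.TextIOWrapper(binary_handle, encoding=encoding, newline="")


# Usage: the path is inspected first, then the already-open binary handle is wrapped.
buffer = io.BytesIO()  # stand-in for a remote file opened in binary mode
method = infer_compression("gs://mybucket/test2.csv.gz", "infer")
with open_text_writer(buffer, method) as handle:
    handle.write("a,b\n1,2\n")
```

Closing the wrapper here closes the GzipFile but not the underlying buffer, so the compressed bytes remain available through buffer.getvalue().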

This PR also fixes Zip compression for file objects not having a name.
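
For the zip case, the underlying issue is that zipfile.ZipFile needs a non-empty member name, while a file object such as io.BytesIO has no name to derive one from. A minimal sketch of a fallback, assuming a hypothetical helper write_zip_member and a placeholder member name "zip" (both are illustrative choices, not confirmed details of this PR):

```python
import io
import zipfile
from typing import IO, Optional


def write_zip_member(
    target: IO[bytes], data: bytes, archive_name: Optional[str] = None
) -> None:
    """Write `data` as the single member of a zip archive opened on `target`."""
    with zipfile.ZipFile(target, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        # archive.filename is None when `target` is a file object without a name,
        # so fall back to a placeholder (an assumption made for this sketch).
        member = archive_name or archive.filename or "zip"
        archive.writestr(member, data)


# Usage: a nameless in-memory buffer still yields a valid archive.
nameless = io.BytesIO()
write_zip_member(nameless, b"a,b\n1,2\n")
```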

@twoertwein twoertwein force-pushed the google_storage branch 2 times, most recently from c73915e to 1fe2c39 Compare August 12, 2020 13:39
@twoertwein twoertwein force-pushed the google_storage branch 4 times, most recently from 4b5272d to ca1800b Compare August 13, 2020 13:32
pandas/_libs/parsers.pyx (outdated, resolved)
@twoertwein twoertwein marked this pull request as ready for review August 13, 2020 22:03
@twoertwein twoertwein force-pushed the google_storage branch 4 times, most recently from 78925fd to a3d012c Compare August 13, 2020 23:56
@jreback jreback added Compat pandas objects compatability with Numpy or Python functions IO Google labels Aug 14, 2020
@jreback jreback added this to the 1.2 milestone Aug 14, 2020
@twoertwein
Member Author

twoertwein commented Aug 14, 2020

it seems that only the windows py37 machine on azure actually tests the google cloud interface?!

EDIT: and one linux machine on travis

pandas/io/orc.py (outdated, resolved)
@twoertwein twoertwein force-pushed the google_storage branch 2 times, most recently from e7248e4 to 7b03c2a Compare August 26, 2020 02:52
@@ -162,13 +165,13 @@ def is_fsspec_url(url: FilePathOrBuffer) -> bool:
)


-def get_filepath_or_buffer(
+def get_filepath_or_buffer(  # type: ignore[assignment]
Member Author

my local mypy needs the ignore on lines 170 and 172, but the CI mypy apparently needs it on this line (TypeVars cannot have default values; this could be fixed with @overload)
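
The @overload idea mentioned above can be sketched as follows. A TypeVar-typed parameter cannot carry a default, so one overload omits the parameter entirely and the other takes it as required; the implementation then uses a plain Optional with a default. The function name resolve_mode and the TypeVar name ModeVar are hypothetical stand-ins chosen for this sketch, not the signature of get_filepath_or_buffer.

```python
from typing import Optional, Tuple, TypeVar, overload

# Constrained TypeVar: the argument is either a str or None, and the
# return type mirrors whichever one was passed in.
ModeVar = TypeVar("ModeVar", str, None)


@overload
def resolve_mode(path: str) -> Tuple[str, None]: ...


@overload
def resolve_mode(path: str, mode: ModeVar) -> Tuple[str, ModeVar]: ...


def resolve_mode(path: str, mode: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Hypothetical stand-in: return the path and the mode unchanged."""
    return path, mode
```

With this shape, no default is ever assigned to a TypeVar-typed parameter, which is the situation the type: ignore[assignment] comment works around.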

@@ -583,12 +638,15 @@ def __init__(
self.archive_name = archive_name
kwargs_zip: Dict[str, Any] = {"compression": zipfile.ZIP_DEFLATED}
kwargs_zip.update(kwargs)
-super().__init__(file, mode, **kwargs_zip)
+super().__init__(file, mode, **kwargs_zip)  # type: ignore[arg-type]
Member Author

mypy complains about file being IOBase, but we cannot add an assert not isinstance(file, IOBase) since io.StringIO inherits from IOBase

pandas/io/json/_json.py (outdated, resolved)
fname, mode="wb", compression=compression, storage_options=storage_options,
)
f, _ = get_handle(path_or_buf, "wb", compression=compression, is_text=False)
return f, True, compression
f, _ = get_handle(
Member Author

I assume the auxiliary file handles should be closed as well? I think that only happens for compressed files. But that would require yet another interface change.

@twoertwein
Member Author

Typing of filepath_or_buffer is a mess. I also added IOBase to FilePathOrBuffer (io.BytesIO and io.StringIO do not seem to be covered by isinstance(..., typing.IO)).
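
The isinstance behaviour described here is easy to check directly. The snippet below only uses the standard library; the dictionary name checks is an illustrative choice.

```python
import io
import typing

# For each in-memory buffer, record (is a typing.IO, is an io.IOBase).
checks = {
    type(buffer).__name__: (
        isinstance(buffer, typing.IO),
        isinstance(buffer, io.IOBase),
    )
    for buffer in (io.BytesIO(), io.StringIO())
}

for name, (is_typing_io, is_iobase) in checks.items():
    print(f"{name}: typing.IO={is_typing_io}, io.IOBase={is_iobase}")
# BytesIO: typing.IO=False, io.IOBase=True
# StringIO: typing.IO=False, io.IOBase=True
```

Both buffers derive from io.IOBase at runtime but are not instances of typing.IO, which is why IOBase had to be added to the alias, and also why an assert not isinstance(file, IOBase) would wrongly reject io.StringIO.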

I ended up using a dataclass for IOargs, NamedTuple does not support TypeVars in it. I use TypeVars for encoding (stays None/str) and mode (stays None/str or it changes from None to str).

If IOargs is retrieved inside an if-branch, I 'unpack' it (using the tuple names) into the previous variables; otherwise I use the ioargs. prefix instead of the previous variable names.
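
A generic dataclass of the kind described above might look like the following sketch. The class name IOArgsSketch, the TypeVar names, and the exact set of fields are assumptions made for illustration; only the existence of a compression attribute is visible in this thread.

```python
from dataclasses import dataclass
from typing import Any, Generic, Optional, TypeVar

# Each TypeVar is constrained to "stays a str" or "stays None".
EncodingVar = TypeVar("EncodingVar", str, None)
ModeVar = TypeVar("ModeVar", str, None)


@dataclass
class IOArgsSketch(Generic[ModeVar, EncodingVar]):
    """Hypothetical container for the values an I/O helper hands back."""

    filepath_or_buffer: Any
    encoding: EncodingVar
    mode: ModeVar
    compression: Optional[str] = None


ioargs = IOArgsSketch(
    filepath_or_buffer="gs://mybucket/test2.csv.gz",
    encoding=None,
    mode="wb",
    compression="gzip",
)

# Style 1: unpack into the previously used local variable names.
path, mode = ioargs.filepath_or_buffer, ioargs.mode

# Style 2: keep the container and read attributes where they are needed.
compression = ioargs.compression
```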

Contributor

@jreback jreback left a comment

looks really good. a couple of comments.

pandas/io/common.py (outdated, resolved)
pandas/io/common.py (resolved)
pandas/io/feather_format.py (outdated, resolved)
pandas/io/json/_json.py (resolved)
pandas/io/json/_json.py (outdated, resolved)
compression=ioargs.compression,
is_text=False,
)
return f, True, ioargs.compression
Contributor

might need a slight refactoring but can you use IOargs here?

Member Author

it returns similar information but in this case f is always BinaryIO (according to the current type annotations). If we were to use IOargs, we would give it a broader type.

Member Author

I think the following options would make sense for this function:

  1. inline it (it is used only one time and it is a private function)
  2. simplify it (the first if-block is imho not necessary, can be handled by the elif-block)
  3. simplification from (2) and move the function to io/common and use it more than once: the pattern to convert the compression string/dict to a dict, calling get_filepath_or_buffer, and then calling get_handle is present in many to_* (and read_*) functions.

I'm tempted to go for the second option in this PR and leave the third option for a future PR.
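
The repeated pattern named in option 3 (normalize the compression argument to a dict, resolve the target, then obtain a handle) could be factored roughly as below. This is a standard-library sketch under assumptions: compression_to_dict and open_binary_writer are hypothetical names, only gzip is supported, and an already-open binary target stands in for what get_filepath_or_buffer and get_handle would produce.

```python
import gzip
import io
from typing import IO, Any, Dict, Optional, Union

CompressionArg = Optional[Union[str, Dict[str, Any]]]


def compression_to_dict(compression: CompressionArg) -> Dict[str, Any]:
    """Normalize a str/None/dict compression argument to a dict with a "method" key."""
    if isinstance(compression, dict):
        options = dict(compression)  # copy so the caller's dict is not mutated
        options.setdefault("method", None)
        return options
    return {"method": compression}


def open_binary_writer(target: IO[bytes], compression: CompressionArg) -> IO[bytes]:
    """Return a binary handle on `target`, compressing if requested."""
    options = compression_to_dict(compression)
    method = options.pop("method")
    if method is None:
        return target
    if method == "gzip":
        # Remaining options (e.g. compresslevel) are forwarded to GzipFile.
        return gzip.GzipFile(fileobj=target, mode="wb", **options)
    raise ValueError(f"unsupported compression in this sketch: {method!r}")


# Usage: string and dict forms go through the same code path.
sink = io.BytesIO()
with open_binary_writer(sink, {"method": "gzip", "compresslevel": 1}) as handle:
    handle.write(b"a,b\n1,2\n")
```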

Member Author

I will not touch that for now: I don't understand the difference between self._output_file and self._file in StataWriter. Option (2) would affect these two variables. Having had only a brief look at StataWriter, it seems that this class distinguishes between writing to a compressed file and writing to a buffer (get_handle should take care of that, after that you should be able to treat them the same?).

Contributor

yeah this can all be addressed later. this writer might need some restructuring.

pandas/tests/io/test_common.py (resolved)
-def test_read_csv_gcs(monkeypatch):
+@pytest.fixture
+def gcs_buffer(monkeypatch):
+    """Emulate GCS using a binary buffer."""
+    from fsspec import AbstractFileSystem, registry
+
+    registry.target.clear()  # noqa  # remove state

Contributor

looks fine, does this make it easier / cleaner?

Member Author

@alimcmaster1 commented on some duplication across the GCS tests. I'm not 100% sure how the monkeypatch part works, but the CI seems to be happy about it. Double/triple checking that I didn't invalidate these tests would be good.
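
One detail any buffer-backed emulation has to deal with is that writers usually close their handle, and a closed io.BytesIO can no longer be inspected. A minimal sketch, assuming the emulation hands out a buffer whose close is a no-op (the class name NonClosingBytesIO is hypothetical, and the fixture in this PR may handle this differently):

```python
import io


class NonClosingBytesIO(io.BytesIO):
    """In-memory buffer that stays readable after a writer 'closes' it."""

    def close(self) -> None:
        # Deliberately do not release the buffer; only flush pending state.
        self.flush()


# Usage: the writer closes its handle, yet the test can still read the bytes.
fake_remote = NonClosingBytesIO()
with io.TextIOWrapper(fake_remote, encoding="utf-8", newline="") as handle:
    handle.write("a,b\n1,2\n")
written = fake_remote.getvalue()
```

With a plain io.BytesIO, the same getvalue() call after the with block would raise ValueError because the wrapper closes the underlying buffer.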

@jreback
Contributor

jreback commented Aug 27, 2020

@WillAyd @TomAugspurger @simonjayhawkins if any comments; also using Generic protocols now so if you guys can review types (not to get bogged down of course.....)

@twoertwein twoertwein force-pushed the google_storage branch 8 times, most recently from 1063e3f to 90c4e01 Compare August 30, 2020 14:47
get_handle: fsspec file objects need to be wrapped

get_filepath_or_buffer: path-like objects that are internally converted to file-like objects are opened in binary mode; named tuple

_BytesZipFile: work with filename-less objects
…gnore statements (mypy will compile about filepath_or_buffer)
@twoertwein twoertwein force-pushed the google_storage branch 2 times, most recently from 1357c4a to 475e8e8 Compare August 31, 2020 23:18
Contributor

@jreback jreback left a comment

lgtm @twoertwein

thanks for the patch!

@jreback jreback merged commit 361166f into pandas-dev:master Sep 3, 2020
@twoertwein twoertwein deleted the google_storage branch September 3, 2020 15:56
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Labels
Compat pandas objects compatability with Numpy or Python functions
6 participants