BUG/ENH: compression for google cloud storage in to_csv #35681
it seems that only the windows py37 machine on azure actually tests the google cloud interface?! EDIT: and one linux machine on travis
@@ -162,13 +165,13 @@ def is_fsspec_url(url: FilePathOrBuffer) -> bool:
     )


-def get_filepath_or_buffer(
+def get_filepath_or_buffer(  # type: ignore[assignment]
my local mypy needs that for lines 170 and 172, but the CI mypy apparently needs it at that line (TypeVars cannot have default values; could be fixed with @overload)
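A minimal sketch of the @overload idea mentioned above, assuming a simplified, hypothetical function `resolve_mode` (it is not the signature of get_filepath_or_buffer). Instead of a TypeVar-typed parameter with a default of None, each overload spells out one accepted type, so the checker can relate the argument type to the return type without an ignore comment.

```python
from typing import Optional, Tuple, overload


# Hypothetical stand-in, not the pandas function: one overload per "mode" type.
@overload
def resolve_mode(path: str, mode: None = None) -> Tuple[str, None]: ...
@overload
def resolve_mode(path: str, mode: str) -> Tuple[str, str]: ...


def resolve_mode(path: str, mode: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return the path together with the mode, keeping None if no mode was given."""
    return path, mode


without_mode = resolve_mode("test.csv")
with_mode = resolve_mode("test.csv", "wb")
```

Whether this is worth the extra signatures in the real function is a judgment call; the sketch only shows the typing pattern.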
@@ -583,12 +638,15 @@ def __init__(
         self.archive_name = archive_name
         kwargs_zip: Dict[str, Any] = {"compression": zipfile.ZIP_DEFLATED}
         kwargs_zip.update(kwargs)
-        super().__init__(file, mode, **kwargs_zip)
+        super().__init__(file, mode, **kwargs_zip)  # type: ignore[arg-type]
complains about file being IOBase, but we cannot have an assert not isinstance(file, IOBase) since io.StringIO inherits from IOBase
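A quick check of the inheritance claim, using only the standard library: both text and binary in-memory buffers are IOBase instances, so an assert on IOBase would reject them, while TextIOBase separates the two.

```python
import io

text_buffer = io.StringIO()
binary_buffer = io.BytesIO()

# Both buffer kinds derive from io.IOBase, so an IOBase check cannot tell them apart.
text_is_iobase = isinstance(text_buffer, io.IOBase)
binary_is_iobase = isinstance(binary_buffer, io.IOBase)

# io.TextIOBase is narrower: it matches StringIO but not BytesIO.
text_is_textio = isinstance(text_buffer, io.TextIOBase)
binary_is_textio = isinstance(binary_buffer, io.TextIOBase)
```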
             fname, mode="wb", compression=compression, storage_options=storage_options,
         )
-        f, _ = get_handle(path_or_buf, "wb", compression=compression, is_text=False)
-        return f, True, compression
+        f, _ = get_handle(
I assume the auxiliary file handles should be closed as well? I think that only happens for compressed files. But that would require yet another interface change.
Typing of [...] I ended up using a dataclass for IOargs. If IOargs was retrieved inside an if-branch, I 'unpack' it (using the tuple names) into the previous variables, otherwise I use [...]
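For readers unfamiliar with the pattern, here is a rough sketch of what such a dataclass could look like. The class name `IOArgsSketch` and all of its fields are assumptions made for illustration; the actual definition in the PR may differ.

```python
from dataclasses import dataclass
from typing import IO, Any, Dict, Optional, Union


@dataclass
class IOArgsSketch:
    """Hypothetical container for what a path/buffer resolution step returns."""

    filepath_or_buffer: Union[str, IO[Any]]  # resolved path or opened file object
    encoding: Optional[str]                  # encoding to use when wrapping in text mode
    compression: Dict[str, Any]              # e.g. {"method": "gzip"}
    should_close: bool = False               # whether the caller must close the object
    mode: Optional[str] = None               # possibly changed mode, e.g. "wb"


args = IOArgsSketch("test2.csv.gz", "utf-8", {"method": "gzip"}, mode="wb")
```

Compared with a plain tuple, named fields let callers write `args.compression` instead of relying on positional unpacking.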
looks really good. a couple of comments.
            compression=ioargs.compression,
            is_text=False,
        )
        return f, True, ioargs.compression
might need a slight refactoring but can you use IOargs here?
it returns similar information but in this case f is always BinaryIO (according to the current type annotations). If we were to use IOargs, we would give it a broader type.
I think the following options would make sense for this function:
1. inline it (it is used only one time and it is a private function)
2. simplify it (the first if-block is imho not necessary, can be handled by the elif-block)
3. simplification from (2) and move the function to io/common and use it more than once: the pattern to convert the compression string/dict to a dict, calling get_filepath_or_buffer, and then calling get_handle is present in many to_* (and read_*) functions.
I'm tempted to go for the second option in this PR and leave the third option for a future PR.
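The first step of the pattern in option (3), turning a compression string or dict into a uniform form, can be sketched as follows. `split_compression` is a hypothetical helper written for this illustration and is not claimed to match any pandas function.

```python
from typing import Any, Dict, Optional, Tuple, Union

CompressionArg = Union[str, Dict[str, Any], None]


def split_compression(compression: CompressionArg) -> Tuple[Optional[str], Dict[str, Any]]:
    """Split a compression argument into (method, extra keyword arguments)."""
    if isinstance(compression, dict):
        extra = dict(compression)          # copy so the caller's dict is not mutated
        method = extra.pop("method", None)
        return method, extra
    # A plain string (or None) carries no extra arguments.
    return compression, {}


from_string = split_compression("gzip")
from_dict = split_compression({"method": "gzip", "compresslevel": 1})
from_none = split_compression(None)
```

A shared helper would then pass the method and extra arguments on to the path-resolution and handle-opening steps, which is the part repeated across writers.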
I will not touch that for now: I don't understand the difference between self._output_file and self._file in StataWriter. Option (2) would affect these two variables. Having had only a brief look at StataWriter, it seems that this class distinguishes between writing to a compressed file and writing to a buffer (get_handle should take care of that, after that you should be able to treat them the same?).
yeah this can all be addressed later. this writer might need some restructuring.
-def test_read_csv_gcs(monkeypatch):
+@pytest.fixture
+def gcs_buffer(monkeypatch):
+    """Emulate GCS using a binary buffer."""
+    from fsspec import AbstractFileSystem, registry
+
+    registry.target.clear()  # noqa  # remove state
+
looks fine, does this make it easier / cleaner?
@alimcmaster1 commented on some duplication across the GCS tests. I'm not 100% sure how the monkeypatch part works, but the CI seems to be happy about it. Double/triple checking that I didn't invalidate these tests would be good.
@WillAyd @TomAugspurger @simonjayhawkins if any comments; also using Generic protocols now so if you guys can review types (not to get bogged down of course.....)
get_handle: fsspec file objects need to be wrapped
get_filepath_or_buffer: path-like objects that are internally converted to file-like objects are opened in binary mode; named tuple
_BytesZipFile: work with filename-less objects
…gnore statements (mypy will compile about filepath_or_buffer)
…es; refine type for filepath_or_buffer
lgtm @twoertwein
thanks for the patch!
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
By inferring the compression before converting the path to a file object,
df.to_csv("gs://mybucket/test2.csv.gz", compression="infer", mode="wb")
works. By wrapping fsspec file objects in a TextIOWrapper,
df.to_csv("gs://mybucket/test2.csv", mode="wb")
works as well. Path-like objects that are internally converted to file-like objects (in get_filepath_or_buffer) are now always opened in binary mode (unless text mode is explicitly requested) and the potentially changed mode is returned (no need to specify mode="wb" for google cloud files). As long as the google file is opened in binary mode (which is now always the case), we also honor the requested encoding.
This PR also fixes Zip compression for file objects not having a name.
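The two ideas in the description (infer the compression from the path before it becomes a file object, then wrap the binary file object in a TextIOWrapper so the requested encoding is honored) can be illustrated with the standard library alone. This is a simplified sketch, not the pandas implementation: `infer_from_path` and `write_text` are hypothetical helpers, an io.BytesIO stands in for the remote file object, and only gzip is handled.

```python
import gzip
import io
from typing import Optional

# Assumed extension table for this sketch only.
_EXTENSIONS = {".gz": "gzip", ".bz2": "bz2", ".zip": "zip", ".xz": "xz"}


def infer_from_path(path: str) -> Optional[str]:
    """Infer the compression method from the path while it is still a string."""
    for extension, method in _EXTENSIONS.items():
        if path.lower().endswith(extension):
            return method
    return None


def write_text(binary: io.BytesIO, path: str, text: str, encoding: str = "utf-8") -> None:
    """Write text to a binary buffer, compressing and encoding as requested."""
    method = infer_from_path(path)
    if method not in (None, "gzip"):
        raise NotImplementedError(f"sketch handles only gzip, got {method}")
    raw = gzip.GzipFile(fileobj=binary, mode="wb") if method == "gzip" else binary
    wrapper = io.TextIOWrapper(raw, encoding=encoding, newline="")
    wrapper.write(text)
    wrapper.detach()  # flush and release the wrapper without closing the stream
    if raw is not binary:
        raw.close()   # writes the gzip trailer; the underlying buffer stays open


compressed = io.BytesIO()
write_text(compressed, "gs://mybucket/test2.csv.gz", "a,b\n1,2\n")

plain = io.BytesIO()
write_text(plain, "gs://mybucket/test2.csv", "ä\n", encoding="latin-1")
```

Because the buffer is always binary, the text layer is added in one place, which is why the encoding can be respected regardless of where the bytes end up.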