Compression not handled for GCS files #514

Closed
gelioz opened this issue Jun 27, 2020 · 5 comments

Comments

@gelioz
Contributor

gelioz commented Jun 27, 2020

Problem description

When reading a gzip-compressed blob from GCS, the file object returns compressed binary data instead of decompressed text.

Steps/code to reproduce the problem

In [30]: smart_open.open('gs://test/file.json.gz').read()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-30-4e7f4d55e81d> in <module>
----> 1 smart_open.open('gs://test/file.json.gz').read()

/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py in read(self, size, chars, firstline)
    502                 break
    503             try:
--> 504                 newchars, decodedbytes = self.decode(data, self.errors)
    505             except UnicodeDecodeError as exc:
    506                 if firstline:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I was also able to find the root cause: the gcs.Reader class does not implement a .name attribute, which causes the compression.compression_wrapper function to skip decompression (because os.path.splitext(file_obj.name) returns ('unknown', ''), so no .gz extension is detected).
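The mechanism described above can be illustrated with a minimal, self-contained sketch. It does not use smart_open's actual code: `wrap_by_extension`, `NamelessReader`, and `NamedReader` are hypothetical names, and the fallback to 'unknown' is an assumption based on the tuple quoted in the report. It only shows why a reader without a usable name passes gzip bytes through untouched.

```python
import gzip
import io
import os.path


class NamelessReader(io.BytesIO):
    """Hypothetical stand-in for a reader that exposes no .name attribute."""


class NamedReader(io.BytesIO):
    """Hypothetical stand-in for a reader that reports the blob name."""

    def __init__(self, data, name):
        super().__init__(data)
        self.name = name


def wrap_by_extension(file_obj):
    """Illustrative only: choose a decompressor from the file extension."""
    # Assumption: a missing name falls back to 'unknown', as in the report.
    name = getattr(file_obj, 'name', 'unknown')
    _, ext = os.path.splitext(name)
    if ext == '.gz':
        return gzip.GzipFile(fileobj=file_obj, mode='rb')
    return file_obj  # no recognized extension: bytes pass through as-is


payload = gzip.compress(b'{"ok": true}')
raw = wrap_by_extension(NamelessReader(payload)).read()
text = wrap_by_extension(NamedReader(payload, 'file.json.gz')).read()
```

In this sketch, `raw` still starts with the gzip magic bytes 0x1f 0x8b, which matches the 0x8b at position 1 in the traceback, while `text` holds the decompressed JSON.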

Versions

Darwin-19.3.0-x86_64-i386-64bit
Python 3.7.7 (default, Mar 10 2020, 15:43:33)
[Clang 11.0.0 (clang-1100.0.33.17)]
smart_open 2.0.0

@mpenkov
Collaborator

mpenkov commented Jun 28, 2020

Hi @gelioz, thank you for reporting this.

We have already fixed the issue in the develop branch (see #506). Are you able to confirm?

@gelioz
Contributor Author

gelioz commented Jun 30, 2020

Thank you, @mpenkov.

Yes, I can confirm it works for me on the develop branch. Is there any estimate of when 2.0.1 will be released?

@bradmurray

Does this also fix the same issue for writing? I can currently read from a .gz file on GCS with no problem, but when I write, the output is uncompressed even though the filename ends in .gz.

@gelioz
Contributor Author

gelioz commented Jun 30, 2020

Yes, it does - no issues with file compression on writing.
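The write-side behavior discussed here can be sketched the same way. This is a rough illustration under stated assumptions, not smart_open's implementation: `open_for_write` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for the GCS blob.

```python
import gzip
import io
import os.path


def open_for_write(name, sink):
    """Illustrative only: compress on write when the name ends in .gz."""
    _, ext = os.path.splitext(name)
    if ext == '.gz':
        return gzip.GzipFile(fileobj=sink, mode='wb')
    return sink  # no recognized extension: write bytes unchanged


sink = io.BytesIO()  # hypothetical stand-in for the remote blob
writer = open_for_write('file.json.gz', sink)
writer.write(b'{"ok": true}')
writer.close()  # writes the gzip trailer; the underlying sink stays open

stored = sink.getvalue()
```

A quick check that compression happened on write is to confirm that `stored` begins with the gzip magic bytes and that `gzip.decompress(stored)` returns the original payload.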

@mpenkov
Collaborator

mpenkov commented Jul 1, 2020

https://pypi.org/project/smart-open/2.1.0/

@mpenkov mpenkov closed this as completed Jul 1, 2020