Compression not handled for GCS files #514

gelioz · 2020-06-27T23:49:55Z

Problem description

When reading a gzip-encoded blob from GCS, the file object returns compressed binary data instead of decompressed text.

Steps/code to reproduce the problem

In [30]: smart_open.open('gs://test/file.json.gz').read()
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-30-4e7f4d55e81d> in <module>
----> 1 smart_open.open('gs://test/file.json.gz').read()

/usr/local/Cellar/python/3.7.7/Frameworks/Python.framework/Versions/3.7/lib/python3.7/codecs.py in read(self, size, chars, firstline)
    502                 break
    503             try:
--> 504                 newchars, decodedbytes = self.decode(data, self.errors)
    505             except UnicodeDecodeError as exc:
    506                 if firstline:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Also, I was able to find the root cause: the gcs.Reader class does not implement a .name attribute, which causes the compression.compression_wrapper function to skip decompression (because os.path.splitext(file_obj.name) returns ('unknown', '')).
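To illustrate the mechanism, here is a minimal sketch of extension-based decompression. It is not smart_open's actual code: `NamelessReader`, `NamedReader`, and `wrap_by_extension` are hypothetical names made up for this example, and the fallback to 'unknown' is an assumption based on the splitext result quoted above.

```python
import gzip
import io
import os.path


class NamelessReader(io.BytesIO):
    """Hypothetical stand-in for a reader that exposes no .name attribute."""


class NamedReader(io.BytesIO):
    """Hypothetical stand-in for a reader that exposes the blob name."""

    def __init__(self, data, name):
        super().__init__(data)
        self.name = name


def wrap_by_extension(file_obj):
    """Wrap file_obj in a gzip reader only if its name ends in .gz."""
    # Assumption: a missing name falls back to 'unknown', which has no extension.
    name = getattr(file_obj, 'name', 'unknown')
    _, ext = os.path.splitext(name)
    if ext == '.gz':
        return gzip.GzipFile(fileobj=file_obj, mode='rb')
    # No recognised extension: the raw stream is returned unchanged.
    return file_obj


payload = gzip.compress(b'{"key": "value"}')

# Without a name, the gzip bytes pass through untouched.
raw = wrap_by_extension(NamelessReader(payload)).read()

# With a name ending in .gz, the stream is decompressed.
text = wrap_by_extension(NamedReader(payload, 'file.json.gz')).read()
```

In this sketch `raw` still starts with the gzip magic bytes 0x1f 0x8b, which matches the byte 0x8b at position 1 that the UTF-8 decoder rejects in the traceback above.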

Versions

Darwin-19.3.0-x86_64-i386-64bit
Python 3.7.7 (default, Mar 10 2020, 15:43:33)
[Clang 11.0.0 (clang-1100.0.33.17)]
smart_open 2.0.0


mpenkov · 2020-06-28T00:16:44Z

Hi @gelioz, thank you for reporting this.

We have already fixed the issue in the develop branch (see #506). Are you able to confirm?

gelioz · 2020-06-30T11:45:36Z

Thank you, @mpenkov.

Yes, I can confirm it works for me on the develop branch. Is there any estimate of when 2.0.1 will be released?

bradmurray · 2020-06-30T17:13:49Z

Does this also fix the same issue for writing? I can currently read from a .gz file on GCS with no problem, but when I write, the output is uncompressed even though the filename ends in .gz.

gelioz · 2020-06-30T18:13:48Z

Yes, it does - no issues with file compression on writing.

mpenkov · 2020-07-01T01:16:40Z

https://pypi.org/project/smart-open/2.1.0/

mpenkov closed this as completed Jul 1, 2020
