Refactoring smart_open to share compression and encoding functionality #185

Merged
merged 14 commits on Apr 15, 2018

Conversation

@mpenkov (Collaborator) commented Apr 7, 2018:

All of our transport methods (file, S3, HDFS, WebHDFS) now read and
write bytes. A shared compression layer sits over that and performs
compression and decompression transparently. A shared encoding layer
sits over that, and performs encoding and decoding transparently.

The benefit of this approach is that it decouples actual I/O,
compression and encoding into independent layers. The I/O layers now
have to worry only about binary I/O. The compression all happens in one
place, so adding new codecs is simple. Finally, encoding also happens
in one place, with the same benefits.

Other things I did:

  • ripped out S3 text I/O; we do not need it anymore
  • rewrote HDFS as an IOBase-based separate module
  • split the http module
  • rewrote the WebHDFS subsystem based on io.IOBase
  • got rid of some unused imports
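
To illustrate the layering, here is a minimal sketch (the layered_open helper and its wiring are hypothetical, not the actual smart_open API):

import gzip
import io

def layered_open(binary_stream, compression=None, encoding=None):
    # transport layer: always reads and writes bytes
    stream = binary_stream
    if compression == 'gzip':
        # compression layer: wraps the byte stream, still yields bytes
        stream = gzip.GzipFile(fileobj=stream, mode='rb')
    if encoding is not None:
        # encoding layer: decodes bytes into text
        stream = io.TextIOWrapper(stream, encoding=encoding)
    return stream

# works over any byte source; a BytesIO stands in for S3/HDFS/HTTP here
buf = io.BytesIO(gzip.compress(u'привет мир\n'.encode('utf-8')))
print(layered_open(buf, compression='gzip', encoding='utf-8').read())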

@mpenkov mpenkov requested a review from menshikh-iv April 7, 2018 11:53
This needs a separate implementation of seek to work under Py2.
gzip with HDFS/HTTP wasn't supported at all prior to the refactoring, so
it is a separate issue.
By default, the system encoding is used when opening a file.
The tests expect the encoding to be UTF-8, so if the system encoding
happens to be anything else, the tests will fail.

Some Py2 installations use ascii as the default encoding.
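
A small illustration of that pitfall (a sketch; io.open behaves the same on Py2 and Py3):

import io

# write a file with a known encoding
with io.open('example.txt', 'w', encoding='utf-8') as fout:
    fout.write(u'В начале июля')

# relying on the platform default (e.g. ascii on some Py2 installs) can raise
# UnicodeDecodeError here; passing encoding explicitly keeps tests deterministic
with io.open('example.txt', encoding='utf-8') as fin:
    assert fin.read() == u'В начале июля'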
@menshikh-iv (Contributor) left a comment:

LGTM

but please check that the changes for HDFS/HTTP work as expected (at least manually); later we need to resolve #151

self.assertTrue(text.startswith('В начале июля, в чрезвычайно'.encode('utf-8')))
self.assertTrue(text.endswith('улизнуть, чтобы никто не видал.\n'.encode('utf-8')))

@unittest.skipIf(six.PY2, 'gzip support does not work on Py2')
@piskvorky (Owner) commented:

Is this true? I'm pretty sure I've been opening .gz files with smart_open in Python 2.

@mpenkov (Collaborator, Author) commented Apr 14, 2018:

Yes, it's true. The master branch of smart_open currently has limited support for gzip: it works for local files and S3 only, regardless of which Python version you have installed. To the best of my understanding, on-the-fly gzip decompression never worked for HTTP, WebHDFS and HDFS. You can confirm this by running these same integration tests against master. You'll get an error similar to the following:

======================================================================
ERROR: test_read_gzip_text (__main__.ReadTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "integration-tests/test_http_copy.py", line 47, in test_read_gzip_text
    text = fin.read()
  File "/Users/misha/envs/smartopen2/lib/python2.7/codecs.py", line 486, in read
    newdata = self.stream.read()
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 261, in read
    self._read(readsize)
  File "/usr/local/Cellar/python@2/2.7.14_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/gzip.py", line 295, in _read
    pos = self.fileobj.tell()   # Save current position
UnsupportedOperation: seek

Basically, Py2.7 gzip expects a .seek() operation to be implemented on the file object. Until someone explicitly implements seeking for HTTP, we won't be able to use Py2.7 gzip.
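
A self-contained sketch of that constraint (runnable on Py3; on Py2, gzip.GzipFile calls .tell()/.seek() on the underlying file object and fails as in the traceback above):

import gzip
import io

class NonSeekableReader(io.RawIOBase):
    # behaves like an HTTP response body: readable, but refuses to seek
    def __init__(self, data):
        self._buf = io.BytesIO(data)
    def readable(self):
        return True
    def seekable(self):
        return False
    def readinto(self, b):
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)

payload = gzip.compress(b'hello world')
# Py3's gzip streams without seeking, so this succeeds; Py2's gzip raises
# "UnsupportedOperation: seek" on the same input
fin = gzip.GzipFile(fileobj=io.BufferedReader(NonSeekableReader(payload)))
print(fin.read())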

@menshikh-iv Can you please double-check and correct me if I'm wrong?

@mpenkov (Collaborator, Author) commented:

There's some code here (https://github.com/RaRe-Technologies/smart_open/blob/master/smart_open/smart_open_lib.py#L756) to address the seek issue, but it doesn't seem to be helping, because the integration test above is failing.

@menshikh-iv (Contributor) commented:

@mpenkov I checked; it works fine with Py2 and smart_open==1.5.7, using this code:

import subprocess
import time

import smart_open

port = 8008

# serve the repository directory over HTTP so smart_open can fetch the file
command = ['python', '-m', 'SimpleHTTPServer', str(port)]
s = subprocess.Popen(command)
time.sleep(1)  # give the server a moment to start

url = 'http://localhost:%d/smart_open/tests/test_data/crlf_at_1k_boundary.warc.gz' % port
with smart_open.smart_open(url, encoding='utf-8') as fin:
    text = fin.read()

print(text)
s.terminate()

@mpenkov (Collaborator, Author) commented:

@piskvorky @menshikh-iv Thanks for checking! I can confirm your code works. I will investigate and fix.

@mpenkov (Collaborator, Author) commented:

I had a closer look at why gzip was working in Py2 despite the lack of seek. Unfortunately, it works at the expense of streaming: this line reads the entire file into memory before gzip-decompressing it. We could reimplement the same thing in the refactored branch, but is it worth it? We would be silently surrendering the benefit of streaming, which could cause out-of-memory situations on the user's side if the file is sufficiently large.
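
To make the trade-off concrete, here is a sketch of both approaches (hypothetical helpers, not smart_open code):

import gzip
import io
import zlib

def gunzip_buffered(fileobj):
    # what master effectively does on Py2: slurp the whole payload into
    # memory so gzip gets a seekable object; memory use grows with file size
    return gzip.GzipFile(fileobj=io.BytesIO(fileobj.read())).read()

def gunzip_streaming(fileobj, chunk_size=16384):
    # bounded-memory alternative: feed zlib chunk by chunk;
    # wbits=47 (32 + 15) auto-detects the gzip header
    decompressor = zlib.decompressobj(47)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield decompressor.decompress(chunk)
    yield decompressor.flush()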

@piskvorky @menshikh-iv How do you think we should proceed?

@piskvorky (Owner) commented Apr 14, 2018:

No -- a lack of streaming is definitely a bug. Can you open an issue for it?

Thanks for investigating @mpenkov! It's a pleasure to work with such knowledgeable and dedicated people.

@mpenkov (Collaborator, Author) commented:

@piskvorky Thank you! I've opened #189.

Sorry for misleading you earlier; my first investigation overlooked this buffering detail.
