New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add generic HTTP and HTTPS streaming support. #107
Conversation
This patch adds support for custom S3 servers in the connection string. It also adds explicit support for setting the server port, and whether or not to use SSL, both as paramaters to the smart_open function as well as within the connection string. These changes are neccessary to be able to connect to s3proxy and other custom s3 servers which don't run on the default port, or neccessarily use SSL.
Adds support for opening vanilla HTTP and HTTPS addresses. Supports efficient streaming, gzip and bz2 compression, as well as Kerberos and username/password (basic) http authentication.
smart_open/smart_open_lib.py
Outdated
@@ -573,23 +588,6 @@ def __enter__(self): | |||
def __exit__(self, type, value, traceback): | |||
pass | |||
|
|||
|
|||
def make_closing(base, **attrs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did we drop support for Python 2.6? Why this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, sorry, didn't think those comments were still active since python 2.7 has been around so long and there's no reason I know of for people to stick with 2.6. I can revert this part if you need 2.6 support -- let me know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still support 2.6
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, reverted in latest.
Oh wow, that's really cool! Interesting feature too. What is your use-case for this, how do you use HTTP streaming yourself, if you don't mind me asking? |
I don't have a specific use-case for http streaming specifically (although I can imagine some) -- it's more that I like the idea of being able to use smart_open as a unified go-to tool for any type of uri. I definitely do have use-cases where I want to pull a web page, or a compressed file off of an http server. |
smart_open/smart_open_lib.py
Outdated
|
||
If none of those are set, will connect unauthenticated. | ||
""" | ||
PY2 = sys.version_info[0] == 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The python 2 check is now used in two places. Better refactor to a global.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored in latest.
smart_open/smart_open_lib.py
Outdated
|
||
response = HttpReadStream(url, **kwargs) | ||
|
||
if url.endswith('.gz'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the gz and bz2 handling code is identical to file_smart_open
. Please refactor to avoid duplication.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good Point. This will also make it easier to add support for other compression types later.
Done in latest version.
smart_open/tests/test_smart_open.py
Outdated
with smart_open.HttpOpenRead(smart_open.ParseUri('https://www.google.com'), 'r') as smart_open_object: | ||
content = smart_open_object.readline().decode('utf-8') | ||
expected_str = '<!doctype html>' | ||
self.assertEqual(content[:len(expected_str)], expected_str) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add tests for authentication using requests_mock
. It would also be great if the tests worked without access to internet and google.com
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, fixed. I added a test for Basic http auth, and made the other tests not actually go to google.com. I didn't add a Kerbeos test because I'm not sure that's easily doable without running within a Kerborized environment.
Thanks for the new feature! Just minor refactoring and better tests with mocking are needed before merging. |
We still want to maintain Python 2.6 compatibility, so don't rely on contextlib.closing.
Now they don't require internet access, and will test for Basic authentication in the HTTP header.
http => https, and remove old versions of the tests.
Thanks. I've addressed all the comments in the latest version of the PR. Please take a look. Thanks, |
smart_open/tests/test_smart_open.py
Outdated
@@ -168,6 +168,31 @@ def test_webhdfs_read(self): | |||
smart_open_object = smart_open.WebHdfsOpenRead(smart_open.ParseUri("webhdfs://127.0.0.1:8440/path/file")) | |||
self.assertEqual(smart_open_object.read().decode("utf-8"), "line1\nline2") | |||
|
|||
@responses.activate | |||
def test_http_read(self): | |||
"""Does webhdfs read method work correctly""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a typo in the docstring ""Does http read method work correctly"""
smart_open/tests/test_smart_open.py
Outdated
|
||
@responses.activate | ||
def test_https_readline(self): | ||
"""Does webhdfs read method work correctly""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in docstring as well
Thanks a lot for the code changes! |
smart_open/tests/test_smart_open.py
Outdated
"""Does http authentication work correctly""" | ||
responses.add(responses.GET, "http://127.0.0.1/index.html", body='line1\nline2') | ||
smart_open_object = smart_open.HttpOpenRead(smart_open.ParseUri("http://127.0.0.1/index.html"), | ||
user='me', password='pass') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: we use hanging (not vertical) indent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, fixed
All comments addressed in the latest version. |
Great! Thanks for the new feature! |
Adds support for opening vanilla HTTP and HTTPS addresses.
Supports efficient streaming, gzip and bz2 compression,
as well as Kerberos and username/password (basic) http
authentication.