Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add generic HTTP and HTTPS streaming support. #107

Merged
merged 20 commits into from Mar 14, 2017
Merged

Conversation

robottwo
Copy link
Contributor

Adds support for opening vanilla HTTP and HTTPS addresses.
Supports efficient streaming, gzip and bz2 compression,
as well as Kerberos and username/password (basic) http
authentication.

Daniel Eisner and others added 14 commits November 29, 2016 11:51
This patch adds support for custom S3 servers in the connection string.
It also adds explicit support for setting the server port, and whether
or not to use SSL, both as paramaters to the smart_open function as
well as within the connection string.

These changes are neccessary to be able to connect to s3proxy and
other custom s3 servers which don't run on the default port,
or neccessarily use SSL.
 into RaRe-Technologies-master

Conflicts:
	smart_open/tests/test_smart_open.py
Adds support for opening vanilla HTTP and HTTPS addresses.
Supports efficient streaming, gzip and bz2 compression,
as well as Kerberos and username/password (basic) http
authentication.
@@ -573,23 +588,6 @@ def __enter__(self):
def __exit__(self, type, value, traceback):
pass


def make_closing(base, **attrs):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we drop support for Python 2.6? Why this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, sorry, didn't think those comments were still active since python 2.7 has been around so long and there's no reason I know of for people to stick with 2.6. I can revert this part if you need 2.6 support -- let me know.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still support 2.6

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, reverted in latest.

@piskvorky
Copy link
Owner

Oh wow, that's really cool! Interesting feature too.

What is your use-case for this, how do you use HTTP streaming yourself, if you don't mind me asking?

@robottwo
Copy link
Contributor Author

robottwo commented Mar 1, 2017

I don't have a specific use-case for http streaming specifically (although I can imagine some) -- it's more that I like the idea of being able to use smart_open as a unified go-to tool for any type of uri. I definitely do have use-cases where I want to pull a web page, or a compressed file off of an http server.


If none of those are set, will connect unauthenticated.
"""
PY2 = sys.version_info[0] == 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python 2 check is now used in two places. Better refactor to a global.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored in latest.


response = HttpReadStream(url, **kwargs)

if url.endswith('.gz'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the gz and bz2 handling code is identical to file_smart_open. Please refactor to avoid duplication.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Point. This will also make it easier to add support for other compression types later.

Done in latest version.

with smart_open.HttpOpenRead(smart_open.ParseUri('https://www.google.com'), 'r') as smart_open_object:
content = smart_open_object.readline().decode('utf-8')
expected_str = '<!doctype html>'
self.assertEqual(content[:len(expected_str)], expected_str)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add tests for authentication using requests_mock. It would also be great if the tests worked without access to internet and google.com.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, fixed. I added a test for Basic http auth, and made the other tests not actually go to google.com. I didn't add a Kerbeos test because I'm not sure that's easily doable without running within a Kerborized environment.

@tmylk
Copy link
Contributor

tmylk commented Mar 5, 2017

Thanks for the new feature! Just minor refactoring and better tests with mocking are needed before merging.

We still want to maintain Python 2.6 compatibility,
so don't rely on contextlib.closing.
Now they don't require internet access, and will test for
Basic authentication in the HTTP header.
http => https, and remove old versions of the tests.
@robottwo
Copy link
Contributor Author

Thanks. I've addressed all the comments in the latest version of the PR. Please take a look.

Thanks,
-Dan

@@ -168,6 +168,31 @@ def test_webhdfs_read(self):
smart_open_object = smart_open.WebHdfsOpenRead(smart_open.ParseUri("webhdfs://127.0.0.1:8440/path/file"))
self.assertEqual(smart_open_object.read().decode("utf-8"), "line1\nline2")

@responses.activate
def test_http_read(self):
"""Does webhdfs read method work correctly"""
Copy link
Contributor

@tmylk tmylk Mar 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a typo in the docstring ""Does http read method work correctly"""


@responses.activate
def test_https_readline(self):
"""Does webhdfs read method work correctly"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in docstring as well

@tmylk
Copy link
Contributor

tmylk commented Mar 13, 2017

Thanks a lot for the code changes!
May I ask for some cosmetic changes - an example of the new functionality in the README and a note in the changelog?
Then will merge and release this week.

"""Does http authentication work correctly"""
responses.add(responses.GET, "http://127.0.0.1/index.html", body='line1\nline2')
smart_open_object = smart_open.HttpOpenRead(smart_open.ParseUri("http://127.0.0.1/index.html"),
user='me', password='pass')
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: we use hanging (not vertical) indent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, fixed

@robottwo
Copy link
Contributor Author

All comments addressed in the latest version.

@tmylk tmylk merged commit a1c2c03 into piskvorky:master Mar 14, 2017
@tmylk
Copy link
Contributor

tmylk commented Mar 14, 2017

Great! Thanks for the new feature!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants