document passthrough use case #333

Merged: 6 commits, Jul 1, 2019
112 changes: 81 additions & 31 deletions README.rst

smart_open — utils for streaming large files in Python
=======================================================

What?
=====

``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to storage such as S3, HDFS, WebHDFS, HTTP, HTTPS, or the local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.

``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.
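
The fallback behaviour can be pictured with a minimal sketch (``open_any`` is a hypothetical stand-in for illustration, not smart_open's actual implementation): dispatch on the URI scheme, and hand plain local paths straight to the built-in ``open()``.

.. code-block:: python

    import builtins
    import os
    import tempfile
    from urllib.parse import urlparse

    def open_any(uri, mode="r", **kwargs):
        """Dispatch on the URI scheme; plain local paths fall back to the
        built-in open() -- the "100% compatible" part of the claim."""
        scheme = urlparse(uri).scheme
        if scheme == "":  # plain local path -> native open()
            return builtins.open(uri, mode, **kwargs)
        # a real implementation would route s3://, hdfs://, http:// etc.
        # to the appropriate transport here
        raise NotImplementedError("no handler for scheme %r" % scheme)

    # Local paths behave exactly like the built-in open():
    path = os.path.join(tempfile.mkdtemp(), "hello.txt")
    with open_any(path, "w") as fout:
        fout.write("hello\n")
    with open_any(path) as fin:
        print(fin.read())  # prints: hello

The point of the sketch is only the shape of the API: one function, one URI, and the transport chosen behind the scenes.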


Why?
====

Working with large remote files, for example using Amazon's `boto <http://docs.pythonboto.org/en/latest/>`_ and `boto3 <https://boto3.readthedocs.io/en/latest/>`_ Python libraries, is a pain.
``boto``'s ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files, because they load the entire file into RAM, with no streaming.
There are nasty hidden gotchas when using ``boto``'s multipart upload functionality, which is needed for large files, and there is a lot of boilerplate.

``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean unified Pythonic API**. The result is less code for you to write and fewer bugs to make.


How?
=====

``smart_open`` is well-tested, well-documented, and has a simple Pythonic API:


.. _doctools_before_examples:

.. code-block:: python

    >>> from smart_open import open
    >>>
    >>> # stream lines from an http server
    >>> for line in open('http://example.com/index.html'):
    ...     print(repr(line))
    ...     break
    '<!doctype html>\n'

Other examples of URIs that ``smart_open`` accepts::

s3://my_bucket/my_key
s3://my_key:my_secret@my_bucket/my_key

.. _doctools_after_examples:


Documentation
=============

Built-in help
-------------

For detailed API info, see the online help:

.. code-block:: python

    >>> help('smart_open')

or click `here <https://github.com/RaRe-Technologies/smart_open/blob/master/help.txt>`__ to view the help in your browser.

More examples
-------------

.. code-block:: python

    >>> import boto3
    >>>
    >>> # stream content *into* S3, passing transport-specific parameters
    >>> # (here, a custom boto3 session) via the transport_params dict:
    >>> transport_params = {'session': boto3.Session()}
    >>> with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
    ...     fout.write(b'here we stand')


Supported Compression Formats
-----------------------------

For 2.7, use `backports.lzma`_.

.. _backports.lzma: https://pypi.org/project/backports.lzma/


Transport-specific Options
--------------------------

Since going over all (or select) keys in an S3 bucket is a very common operation, ``smart_open`` also includes an extra function, ``smart_open.s3_iter_bucket()``, that does this efficiently, processing the bucket keys in parallel. Sample output::

    annual/monthly_rain/2012.monthly_rain.nc 13


File-like Binary Streams
------------------------

The ``open`` function also accepts file-like objects.
This is useful when you already have a `binary file <https://docs.python.org/3/glossary.html#term-binary-file>`_ open, and would like to wrap it with transparent decompression:


.. code-block:: python

>>> import io, gzip
>>>
>>> # Prepare some gzipped binary data in memory, as an example.
>>> # Note that any binary file will do; we're using BytesIO here for simplicity.
>>> buf = io.BytesIO()
>>> with gzip.GzipFile(fileobj=buf, mode='w') as fout:
... fout.write(b'this is a bytestring')
>>> buf.seek(0)
>>>
>>> # Use case starts here.
>>> buf.name = 'file.gz' # add a .name attribute so smart_open knows what compressor to use
>>> import smart_open
>>> smart_open.open(buf, 'rb').read() # will gzip-decompress transparently!
b'this is a bytestring'


In this case, ``smart_open`` relied on the ``.name`` attribute of our `binary I/O stream <https://docs.python.org/3/library/io.html#binary-i-o>`_ ``buf`` object to determine which decompressor to use.
If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
Otherwise, the transparent decompression will not occur.
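
The extension-based dispatch can be illustrated with a small self-contained sketch of the mechanism (this mirrors the spirit of ``register_compressor``, but it is a toy registry, not smart_open's actual code):

.. code-block:: python

    import bz2
    import gzip
    import io
    import os

    # registry mapping file extensions to wrapper callables
    _COMPRESSORS = {}

    def register_compressor(ext, wrapper):
        _COMPRESSORS[ext.lower()] = wrapper

    register_compressor('.gz', lambda fileobj, mode: gzip.GzipFile(fileobj=fileobj, mode=mode))
    register_compressor('.bz2', lambda fileobj, mode: bz2.BZ2File(fileobj, mode))

    def wrap_for_decompression(fileobj, mode='rb'):
        """Pick a (de)compressor based on the .name attribute, if any;
        unknown or missing extensions pass the stream through untouched."""
        ext = os.path.splitext(getattr(fileobj, 'name', ''))[1].lower()
        wrapper = _COMPRESSORS.get(ext)
        return wrapper(fileobj, mode) if wrapper else fileobj

    # Usage: gzip some bytes in memory, then read them back transparently.
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode='w') as fout:
        fout.write(b'this is a bytestring')
    buf.seek(0)
    buf.name = 'file.gz'
    print(wrap_for_decompression(buf).read())  # b'this is a bytestring'

Note how a stream without a recognised ``.name`` extension is returned unchanged, which is exactly why the transparent decompression above depends on that attribute.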


Installation
============
::

pip install smart_open

Or, if you prefer to install from the `source tar.gz <http://pypi.python.org/pypi/smart_open>`_::

python setup.py test # run unit tests
python setup.py install

To run the unit tests (optional), you'll also need to install `mock <https://pypi.python.org/pypi/mock>`_ , `moto <https://github.com/spulec/moto>`_ and `responses <https://github.com/getsentry/responses>`_ (``pip install mock moto responses``).
The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.


Migrating to the new ``open`` function
--------------------------------------

Before:

.. code-block:: python

>>> import smart_open
>>> smart_open.smart_open('s3://commoncrawl/robots.txt').read(32) # 'rb' used to be default
b'User-Agent: *\nDisallow: /'

After:

.. code-block:: python

>>> import smart_open
>>> smart_open.open('s3://commoncrawl/robots.txt', 'rb').read(32)
b'User-Agent: *\nDisallow: /'

The ``ignore_extension`` keyword parameter is now called ``ignore_ext``.
It behaves identically otherwise.
The old ``smart_open`` function accepted parameters for the transport layer (e.g. HTTP, S3) directly as keyword arguments:

.. code-block:: python

>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> session = boto3.Session(profile_name='smart_open')
>>> smart_open.smart_open(url, 'r', session=session).read(32)
'first line\nsecond line\nthird lin'

The new function accepts a ``transport_params`` keyword argument. It's a dict.
Removed parameters:

- ``profile_name``

**The profile_name parameter has been removed.**
Pass an entire ``boto3.Session`` object instead.

Before:

.. code-block:: python

>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> smart_open(url, 'r', profile_name='smart_open').read(32)
>>> smart_open.smart_open(url, 'r', profile_name='smart_open').read(32)
'first line\nsecond line\nthird lin'

After:
.. code-block:: python

    >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
    >>> params = {'session': boto3.Session(profile_name='smart_open')}
    >>> smart_open.open(url, 'r', transport_params=params).read(32)
    'first line\nsecond line\nthird lin'

If you pass an invalid parameter name, the ``smart_open.open`` function will warn you about it.
Keep an eye on your logs for WARNING messages from ``smart_open``.

Comments, bug reports
---------------------
=====================

``smart_open`` lives on `Github <https://github.com/RaRe-Technologies/smart_open>`_. You can file
issues or pull requests there. Suggestions, pull requests and improvements welcome!