document passthrough use case (#333)
====================================
* document passthrough use case

* Update README.rst

* Reorganize README

  - add headings + use consistent heading levels
  - shuffle sections around for a more natural flow

* Clarify file-object inputs

* Add link to Python's "binary file" glossary

* add link to Python's "binary I/O" documentation
mpenkov committed Jul 1, 2019
1 parent aea05d0 commit e27f86c
Showing 1 changed file with 81 additions and 31 deletions.

=======================================================
smart_open — utils for streaming large files in Python
=======================================================
What?
=====

``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to storage such as S3, HDFS, WebHDFS, HTTP, HTTPS, or the local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of formats.

``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.


Why?
====

Working with large remote files, for example using Amazon's `boto <http://docs.pythonboto.org/en/latest/>`_ and `boto3 <https://boto3.readthedocs.io/en/latest/>`_ Python libraries, is a pain.
``boto``'s ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files, because they load the entire contents into RAM with no streaming.
The multipart upload functionality that ``boto`` requires for large files hides nasty gotchas and demands a lot of boilerplate.

``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean unified Pythonic API**. The result is less code for you to write and fewer bugs to make.


How?
====

``smart_open`` is well-tested, well-documented, and has a simple Pythonic API:


.. _doctools_before_examples:
.. code-block:: python

    >>> from smart_open import open
    >>>
    >>> for line in open('http://example.com/index.html'):
    ...     print(repr(line))
    ...     break
    '<!doctype html>\n'

Other examples of URIs that ``smart_open`` accepts::

    s3://my_bucket/my_key
    s3://my_key:my_secret@my_bucket/my_key

.. _doctools_after_examples:


Documentation
=============

Built-in help
-------------

For detailed API info, see the online help:

.. code-block:: python

    >>> import smart_open
    >>> help(smart_open)

or click `here <https://github.com/RaRe-Technologies/smart_open/blob/master/help.txt>`__ to view the help in your browser.

More examples
-------------

.. code-block:: python

    with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
        fout.write(b'here we stand')



Supported Compression Formats
-----------------------------

For 2.7, use `backports.lzma`_.

.. _backports.lzma: https://pypi.org/project/backports.lzma/
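To make the convenience concrete, here is a stdlib-only sketch of the round trips that transparent (de)compression performs for you behind the scenes. The modules below are Python's own, not ``smart_open`` internals; they are what you would otherwise call by hand:

```python
# Stdlib-only sketch: the compression round trips that smart_open's
# transparent layer would otherwise leave you to write by hand.
import bz2
import gzip
import lzma  # Python 3.3+; on 2.7, the backports.lzma package fills in

payload = b'streaming is caring\n' * 100

# Each module exposes the same one-shot compress/decompress pair.
for module in (gzip, bz2, lzma):
    compressed = module.compress(payload)
    assert len(compressed) < len(payload)       # it really did shrink
    assert module.decompress(compressed) == payload
```

With ``smart_open``, the same round trip is a side effect of opening a path whose extension names a known compressor, e.g. ``.gz`` or ``.bz2``.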


Transport-specific Options
--------------------------

Since going over all (or select) keys in an S3 bucket is a very common operation, ``smart_open`` also provides the ``s3_iter_bucket()`` helper, which does this efficiently, processing the bucket keys in parallel.
File-like Binary Streams
------------------------

The ``open`` function also accepts file-like objects.
This is useful when you already have a `binary file <https://docs.python.org/3/glossary.html#term-binary-file>`_ open, and would like to wrap it with transparent decompression:


.. code-block:: python

    >>> import io, gzip
    >>>
    >>> # Prepare some gzipped binary data in memory, as an example.
    >>> # Note that any binary file will do; we're using BytesIO here for simplicity.
    >>> buf = io.BytesIO()
    >>> with gzip.GzipFile(fileobj=buf, mode='w') as fout:
    ...     fout.write(b'this is a bytestring')
    >>> buf.seek(0)
    >>>
    >>> # Use case starts here.
    >>> buf.name = 'file.gz'  # add a .name attribute so smart_open knows which compressor to use
    >>> import smart_open
    >>> smart_open.open(buf, 'rb').read()  # will gzip-decompress transparently!
    b'this is a bytestring'

In this case, ``smart_open`` relied on the ``.name`` attribute of our `binary I/O stream <https://docs.python.org/3/library/io.html#binary-i-o>`_ ``buf`` object to determine which decompressor to use.
If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
Otherwise, the transparent decompression will not occur.
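Conceptually, the extension-based lookup described above can be sketched with the stdlib alone. The registry and function names below are illustrative stand-ins, not ``smart_open``'s actual internals:

```python
import bz2
import gzip
import os

# Illustrative registry mapping a file extension to a decompressing wrapper.
# smart_open maintains a similar, extensible mapping internally;
# register_compressor adds entries to it.
_DECOMPRESSORS = {
    '.gz': lambda fileobj: gzip.GzipFile(fileobj=fileobj, mode='rb'),
    '.bz2': lambda fileobj: bz2.BZ2File(fileobj, 'rb'),
}

def wrap_stream(fileobj):
    """Wrap fileobj with a decompressor chosen from its .name extension."""
    name = getattr(fileobj, 'name', '')
    _, ext = os.path.splitext(name)
    wrapper = _DECOMPRESSORS.get(ext)
    # Unknown or missing extension: pass the stream through untouched.
    return wrapper(fileobj) if wrapper is not None else fileobj
```

Applied to the ``buf`` object from the example above, ``wrap_stream(buf).read()`` returns the decompressed bytes; a stream without a recognized ``.name`` passes through untouched, which mirrors the passthrough behavior described here.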


Installation
============
::

pip install smart_open

Or, if you prefer to install from the `source tar.gz <http://pypi.python.org/pypi/smart_open>`_::

python setup.py test # run unit tests
python setup.py install

To run the unit tests (optional), you'll also need to install `mock <https://pypi.python.org/pypi/mock>`_, `moto <https://github.com/spulec/moto>`_ and `responses <https://github.com/getsentry/responses>`_ (``pip install mock moto responses``).
The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.


Migrating to the new ``open`` function
--------------------------------------

Before:

.. code-block:: python

    >>> import smart_open
    >>> smart_open.smart_open('s3://commoncrawl/robots.txt').read(32)  # 'rb' used to be the default
    b'User-Agent: *\nDisallow: /'

After:

.. code-block:: python

    >>> import smart_open
    >>> smart_open.open('s3://commoncrawl/robots.txt', 'rb').read(32)
    b'User-Agent: *\nDisallow: /'

The ``ignore_extension`` keyword parameter is now called ``ignore_ext``.
It behaves identically otherwise.
transport layer, e.g. HTTP, S3, etc. The old function accepted these directly:

.. code-block:: python

    >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
    >>> session = boto3.Session(profile_name='smart_open')
    >>> smart_open.smart_open(url, 'r', session=session).read(32)
    'first line\nsecond line\nthird lin'

The new function accepts a ``transport_params`` keyword argument. It's a dict.
Removed parameters:
- ``profile_name``

**The profile_name parameter has been removed.**
Pass an entire ``boto3.Session`` object instead.

Before:

.. code-block:: python

    >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
    >>> smart_open.smart_open(url, 'r', profile_name='smart_open').read(32)
    'first line\nsecond line\nthird lin'

After:

.. code-block:: python

    >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
    >>> transport_params = {'session': boto3.Session(profile_name='smart_open')}
    >>> smart_open.open(url, 'r', transport_params=transport_params).read(32)
    'first line\nsecond line\nthird lin'

If you pass an invalid parameter name, the ``smart_open.open`` function will warn you about it.
Keep an eye on your logs for WARNING messages from ``smart_open``.

Comments, bug reports
=====================

``smart_open`` lives on `Github <https://github.com/RaRe-Technologies/smart_open>`_. You can file
issues or pull requests there. Suggestions, pull requests and improvements welcome!