diff --git a/README.rst b/README.rst
index fe6006c9..7d382afc 100644
--- a/README.rst
+++ b/README.rst
@@ -12,11 +12,25 @@ smart_open — utils for streaming large files in Python
 What?
 =====

-``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to S3, HDFS, WebHDFS, HTTP, or local storage. It supports transparent, on-the-fly (de-)compression for a variety of different formats.
+``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to storage such as S3, HDFS, WebHDFS, HTTP, HTTPS, or the local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of formats.

 ``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.

-``smart_open`` is well-tested, well-documented, and has a simple, Pythonic API:
+
+Why?
+====
+
+Working with large remote files, for example using Amazon's `boto <https://github.com/boto/boto>`_ and `boto3 <https://github.com/boto/boto3>`_ Python libraries, is a pain.
+``boto``'s ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files, because they load the entire file into RAM, with no streaming.
+There are nasty hidden gotchas when using ``boto``'s multipart upload functionality, which is needed for large files, and a lot of boilerplate.
+
+``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean, unified, Pythonic API**. The result is less code for you to write and fewer bugs to make.
+
+
+How?
+====
+
+``smart_open`` is well-tested, well-documented, and has a simple Pythonic API:

 .. _doctools_before_examples:
@@ -61,7 +75,7 @@ What?
     ...         break
     '\n'

-Other examples of URLs that ``smart_open`` accepts::
+Other examples of URIs that ``smart_open`` accepts::

     s3://my_bucket/my_key
     s3://my_key:my_secret@my_bucket/my_key
@@ -80,6 +94,13 @@ Other examples of URLs that ``smart_open`` accepts::

 .. _doctools_after_examples:

+
+Documentation
+=============
+
+Built-in help
+-------------
+
 For detailed API info, see the online help:

 .. code-block:: python
@@ -88,7 +109,8 @@ For detailed API info, see the online help:

 or click `here <https://raw.githubusercontent.com/RaRe-Technologies/smart_open/master/help.txt>`__ to view the help in your browser.

-More examples:
+More examples
+-------------

 .. code-block:: python
@@ -134,29 +156,6 @@ More examples:
     with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
         fout.write(b'here we stand')

-Why?
-----
-
-Working with large S3 files using Amazon's default Python library, `boto <https://github.com/boto/boto>`_ and `boto3 <https://github.com/boto/boto3>`_, is a pain.
-Its ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files (loaded in RAM, no streaming).
-There are nasty hidden gotchas when using ``boto``'s multipart upload functionality that is needed for large files, and a lot of boilerplate.
-
-``smart_open`` shields you from that. It builds on boto3 but offers a cleaner, Pythonic API. The result is less code for you to write and fewer bugs to make.
-
-Installation
-------------
-::
-
-    pip install smart_open
-
-Or, if you prefer to install from the `source tar.gz <https://pypi.org/project/smart_open/>`_::
-
-    python setup.py test  # run unit tests
-    python setup.py install
-
-To run the unit tests (optional), you'll also need to install `mock <https://pypi.org/project/mock/>`_, `moto <https://pypi.org/project/moto/>`_ and `responses <https://pypi.org/project/responses/>`_ (``pip install mock moto responses``).
-The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.
-
 Supported Compression Formats
 -----------------------------
@@ -185,6 +184,7 @@ For 2.7, use `backports.lzma`_.

 .. _backports.lzma: https://pypi.org/project/backports.lzma/

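+Compression is selected by file extension, so writing to a path that ends in
+``.gz`` compresses on the fly, and reading it back decompresses transparently.
+A minimal sketch (``archive.gz`` is just an illustrative local path; any URI
+that ``smart_open`` supports works the same way):
+
+.. code-block:: python
+
+    >>> from smart_open import open
+    >>>
+    >>> # The '.gz' extension triggers on-the-fly gzip compression on write...
+    >>> with open('archive.gz', 'wt') as fout:
+    ...     chars = fout.write('hello compressed world')
+    >>>
+    >>> # ...and transparent gzip decompression on read.
+    >>> with open('archive.gz', 'rt') as fin:
+    ...     print(fin.read())
+    hello compressed world
+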
+
 Transport-specific Options
 --------------------------
@@ -260,6 +260,52 @@ Since going over all (or select) keys in an S3 bucket is a very common operation
     annual/monthly_rain/2012.monthly_rain.nc 13

+File-like Binary Streams
+------------------------
+
+The ``open`` function also accepts file-like objects.
+This is useful when you already have a `binary file <https://docs.python.org/3/glossary.html#term-binary-file>`_ open, and would like to wrap it with transparent decompression:
+
+
+.. code-block:: python
+
+    >>> import io, gzip
+    >>>
+    >>> # Prepare some gzipped binary data in memory, as an example.
+    >>> # Note that any binary file will do; we're using BytesIO here for simplicity.
+    >>> buf = io.BytesIO()
+    >>> with gzip.GzipFile(fileobj=buf, mode='w') as fout:
+    ...     _ = fout.write(b'this is a bytestring')
+    >>> _ = buf.seek(0)
+    >>>
+    >>> # Use case starts here.
+    >>> buf.name = 'file.gz'  # add a .name attribute so smart_open knows which compressor to use
+    >>> import smart_open
+    >>> smart_open.open(buf, 'rb').read()  # will gzip-decompress transparently!
+    b'this is a bytestring'
+
+
+In this case, ``smart_open`` relied on the ``.name`` attribute of our `binary I/O stream <https://docs.python.org/3/library/io.html#binary-i-o>`_ ``buf`` object to determine which decompressor to use.
+If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
+Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
+Otherwise, the transparent decompression will not occur.
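+
+For instance, here is a sketch of how a new compressor could be hooked in with
+``register_compressor`` (the ``.xz`` handler below is illustrative; the
+callback receives a binary file object and a mode, and returns a wrapped file
+object):
+
+.. code-block:: python
+
+    >>> import lzma
+    >>>
+    >>> def _handle_xz(file_obj, mode):
+    ...     # Wrap the raw binary stream in an LZMAFile for .xz (de)compression.
+    ...     return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)
+    >>>
+    >>> import smart_open
+    >>> smart_open.register_compressor('.xz', _handle_xz)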
+
+
+Installation
+============
+::
+
+    pip install smart_open
+
+Or, if you prefer to install from the `source tar.gz <https://pypi.org/project/smart_open/>`_::
+
+    python setup.py test  # run unit tests
+    python setup.py install
+
+To run the unit tests (optional), you'll also need to install `mock <https://pypi.org/project/mock/>`_, `moto <https://pypi.org/project/moto/>`_ and `responses <https://pypi.org/project/responses/>`_ (``pip install mock moto responses``).
+The tests are also run automatically with `Travis CI <https://travis-ci.org/RaRe-Technologies/smart_open>`_ on every commit push & pull request.
+
+
 Migrating to the new ``open`` function
 --------------------------------------
@@ -294,13 +340,17 @@ Before:

 .. code-block:: python

+    >>> import smart_open
     >>> smart_open.smart_open('s3://commoncrawl/robots.txt').read(32)  # 'rb' used to be default
+    b'User-Agent: *\nDisallow: /'

 After:

 .. code-block:: python

+    >>> import smart_open
     >>> smart_open.open('s3://commoncrawl/robots.txt', 'rb').read(32)
+    b'User-Agent: *\nDisallow: /'

 The ``ignore_extension`` keyword parameter is now called ``ignore_ext``. It behaves identically otherwise.

@@ -312,7 +362,7 @@ transport layer, e.g. HTTP, S3, etc. The old function accepted these directly:

     >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
     >>> session = boto3.Session(profile_name='smart_open')
-    >>> smart_open(url, 'r', session=session).read(32)
+    >>> smart_open.smart_open(url, 'r', session=session).read(32)
     'first line\nsecond line\nthird lin'

 The new function accepts a ``transport_params`` keyword argument. It's a dict.

@@ -335,14 +385,14 @@ Removed parameters:
 - ``profile_name``

 **The profile_name parameter has been removed.**
-Pass an entire boto3.Session object instead.
+Pass an entire ``boto3.Session`` object instead.

 Before:

 .. code-block:: python

     >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
-    >>> smart_open(url, 'r', profile_name='smart_open').read(32)
+    >>> smart_open.smart_open(url, 'r', profile_name='smart_open').read(32)
     'first line\nsecond line\nthird lin'

 After:

@@ -361,7 +411,7 @@ If you pass an invalid parameter name, the ``smart_open.open`` function will warn you about it.
 Keep an eye on your logs for WARNING messages from ``smart_open``.

 Comments, bug reports
----------------------
+=====================

 ``smart_open`` lives on `Github <https://github.com/RaRe-Technologies/smart_open>`_. You can file issues or pull requests there. Suggestions, pull requests and improvements welcome!