From 17b0f326fa0dae214dbf38ec6c377e649b622d44 Mon Sep 17 00:00:00 2001
From: Michael Penkov
Date: Mon, 1 Jul 2019 16:25:17 +0900
Subject: [PATCH 1/6] document passthrough use case

---
 README.rst | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/README.rst b/README.rst
index fe6006c9..3a49d54a 100644
--- a/README.rst
+++ b/README.rst
@@ -134,6 +134,24 @@ More examples:
     with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
         fout.write(b'here we stand')
 
+The ``open`` function also accepts file-like objects.
+This is useful when you already have an open file, and would like to transparently decompress it.
+
+.. code-block:: python
+
+    >>> import io
+    >>> filepath = 'smart_open/tests/test_data/1984.txt.gz'
+    >>> with io.open(filepath, 'rb') as open_file:
+    ...     fin.name = filepath
+    ...     with open(open_file, 'rb') as fin:
+    ...         print(repr(fin.readline()))
+    b'It was a bright cold day in April, and the clocks were striking thirteen.\n'
+
+In this case, ``smart_open`` relied on the ``.name`` attribute of our file object to determine which decompressor to use.
+If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
+Furthermore, that value has to end with **known** file extension (see the ``register_compressor`` function).
+Otherwise, the transparent decompression will **not occur**.
+
 Why?
 ----
 
@@ -294,13 +312,17 @@ Before:
 
 .. code-block:: python
 
+    >>> import smart_open
     >>> smart_open.smart_open('s3://commoncrawl/robots.txt').read(32)  # 'rb' used to be default
+    b'User-Agent: *\nDisallow: /'
 
 After:
 
 .. code-block:: python
 
+    >>> import smart_open
     >>> smart_open.open('s3://commoncrawl/robots.txt', 'rb').read(32)
+    b'User-Agent: *\nDisallow: /'
 
 The ``ignore_extension`` keyword parameter is now called ``ignore_ext``.
 It behaves identically otherwise.
@@ -312,7 +334,7 @@ transport layer, e.g. HTTP, S3, etc. The old function accepted these directly:
 
     >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
     >>> session = boto3.Session(profile_name='smart_open')
-    >>> smart_open(url, 'r', session=session).read(32)
+    >>> smart_open.smart_open(url, 'r', session=session).read(32)
     'first line\nsecond line\nthird lin'
 
 The new function accepts a ``transport_params`` keyword argument. It's a dict.
@@ -342,7 +364,7 @@ Before:
 
 .. code-block:: python
 
     >>> url = 's3://smart-open-py37-benchmark-results/test.txt'
-    >>> smart_open(url, 'r', profile_name='smart_open').read(32)
+    >>> smart_open.smart_open(url, 'r', profile_name='smart_open').read(32)
     'first line\nsecond line\nthird lin'
 
 After:

From 76eecd4b5282cf256166d291e15821907ba52b32 Mon Sep 17 00:00:00 2001
From: Michael Penkov
Date: Mon, 1 Jul 2019 16:26:36 +0900
Subject: [PATCH 2/6] Update README.rst

---
 README.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.rst b/README.rst
index 3a49d54a..1f31f31b 100644
--- a/README.rst
+++ b/README.rst
@@ -149,7 +149,7 @@ This is useful when you already have an open file, and would like to transparent
 
 In this case, ``smart_open`` relied on the ``.name`` attribute of our file object to determine which decompressor to use.
 If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
-Furthermore, that value has to end with **known** file extension (see the ``register_compressor`` function).
+Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
 Otherwise, the transparent decompression will **not occur**.
 
 Why?
From 04b439d596abcb5b2be9c0e12277661c9914762f Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?=
Date: Mon, 1 Jul 2019 10:58:04 +0200
Subject: [PATCH 3/6] Reorganize README

- add headings + use consistent heading levels
- shuffle sections around for a more natural flow
---
 README.rst | 112 +++++++++++++++++++++++++++++++----------------------
 1 file changed, 66 insertions(+), 46 deletions(-)

diff --git a/README.rst b/README.rst
index 1f31f31b..5cce7f57 100644
--- a/README.rst
+++ b/README.rst
@@ -12,10 +12,24 @@ smart_open — utils for streaming large files in Python
 What?
 =====
 
-``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to S3, HDFS, WebHDFS, HTTP, or local storage. It supports transparent, on-the-fly (de-)compression for a variety of different formats.
+``smart_open`` is a Python 2 & Python 3 library for **efficient streaming of very large files** from/to storages such as S3, HDFS, WebHDFS, HTTP, HTTPS, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats.
 
 ``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top.
 
+
+Why?
+====
+
+Working with large remote files, for example using Amazon's `boto `_ and `boto3 `_ Python library is a pain.
+Its ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files (loaded in RAM, no streaming).
+There are nasty hidden gotchas when using ``boto``'s multipart upload functionality that is needed for large files, and a lot of boilerplate.
+
+``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean unified Pythonic API**. The result is less code for you to write and fewer bugs to make.
+
+
+How?
+=====
+
 ``smart_open`` is well-tested, well-documented, and has a simple, Pythonic API:
 
 
@@ -61,7 +75,7 @@ What?
     ...     break
     '\n'
 
-Other examples of URLs that ``smart_open`` accepts::
+Other examples of URIs that ``smart_open`` accepts::
 
     s3://my_bucket/my_key
     s3://my_key:my_secret@my_bucket/my_key
@@ -80,6 +94,13 @@ Other examples of URLs that ``smart_open`` accepts::
 
 .. _doctools_after_examples:
 
+
+Documentation
+=============
+
+Built-in help
+-------------
+
 For detailed API info, see the online help:
 
 .. code-block:: python
 
     help('smart_open')
 
 or click `here `__ to view the help in your browser.
 
-More examples:
+More examples
+-------------
 
 .. code-block:: python
 
     >>> import os, boto3
@@ -134,47 +156,6 @@ More examples:
     with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout:
         fout.write(b'here we stand')
 
-The ``open`` function also accepts file-like objects.
-This is useful when you already have an open file, and would like to transparently decompress it.
-
-.. code-block:: python
-
-    >>> import io
-    >>> filepath = 'smart_open/tests/test_data/1984.txt.gz'
-    >>> with io.open(filepath, 'rb') as open_file:
-    ...     fin.name = filepath
-    ...     with open(open_file, 'rb') as fin:
-    ...         print(repr(fin.readline()))
-    b'It was a bright cold day in April, and the clocks were striking thirteen.\n'
-
-In this case, ``smart_open`` relied on the ``.name`` attribute of our file object to determine which decompressor to use.
-If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
-Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
-Otherwise, the transparent decompression will **not occur**.
-
-Why?
-----
-
-Working with large S3 files using Amazon's default Python library, `boto `_ and `boto3 `_, is a pain.
-Its ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files (loaded in RAM, no streaming).
-There are nasty hidden gotchas when using ``boto``'s multipart upload functionality that is needed for large files, and a lot of boilerplate.
-
-``smart_open`` shields you from that. It builds on boto3 but offers a cleaner, Pythonic API. The result is less code for you to write and fewer bugs to make.
-
-Installation
-------------
-::
-
-    pip install smart_open
-
-Or, if you prefer to install from the `source tar.gz `_::
-
-    python setup.py test  # run unit tests
-    python setup.py install
-
-To run the unit tests (optional), you'll also need to install `mock `_ , `moto `_ and `responses `_ (``pip install mock moto responses``).
-The tests are also run automatically with `Travis CI `_ on every commit push & pull request.
-
 Supported Compression Formats
 -----------------------------
 
@@ -203,6 +184,7 @@ For 2.7, use `backports.lzma`_.
 
 .. _backports.lzma: https://pypi.org/project/backports.lzma/
 
+
 Transport-specific Options
 --------------------------
 
@@ -278,6 +260,44 @@ Since going over all (or select) keys in an S3 bucket is a very common operation
 
     annual/monthly_rain/2012.monthly_rain.nc 13
 
+File-like Binary Streams
+------------------------
+
+The ``open`` function also accepts file-like objects.
+This is useful when you already have an open file, and would like to wrap it with transparent decompression:
+
+
+.. code-block:: python
+
+    >>> import io
+    >>> filepath = 'smart_open/tests/test_data/1984.txt.gz'
+    >>> with io.open(filepath, 'rb') as open_file:
+    ...     fin.name = filepath
+    ...     with open(open_file, 'rb') as fin:
+    ...         print(repr(fin.readline()))
+    b'It was a bright cold day in April, and the clocks were striking thirteen.\n'
+
+In this case, ``smart_open`` relied on the ``.name`` attribute of our file object to determine which decompressor to use.
+If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
+Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
+Otherwise, the transparent decompression will not occur.
+
+
+Installation
+============
+::
+
+    pip install smart_open
+
+Or, if you prefer to install from the `source tar.gz `_::
+
+    python setup.py test  # run unit tests
+    python setup.py install
+
+To run the unit tests (optional), you'll also need to install `mock `_ , `moto `_ and `responses `_ (``pip install mock moto responses``).
+The tests are also run automatically with `Travis CI `_ on every commit push & pull request.
+
+
 Migrating to the new ``open`` function
 --------------------------------------
 
@@ -357,7 +377,7 @@ Removed parameters:
 
 - ``profile_name``
 
 **The profile_name parameter has been removed.**
-Pass an entire boto3.Session object instead.
+Pass an entire ``boto3.Session`` object instead.
 
 Before:
 
@@ -383,7 +403,7 @@ If you pass an invalid parameter name, the ``smart_open.open`` function will war
 Keep an eye on your logs for WARNING messages from ``smart_open``.
 
 Comments, bug reports
----------------------
+=====================
 
 ``smart_open`` lives on `Github `_.
 You can file issues or pull requests there.
 Suggestions, pull requests and improvements welcome!
 

From 7633cf21b4740c92998d1b97be6e637b87fa2ff6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?=
Date: Mon, 1 Jul 2019 11:08:00 +0200
Subject: [PATCH 4/6] Clarify file-object inputs

---
 README.rst | 30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/README.rst b/README.rst
index 5cce7f57..74fa7fe2 100644
--- a/README.rst
+++ b/README.rst
@@ -20,8 +20,8 @@ What?
 Why?
 ====
 
-Working with large remote files, for example using Amazon's `boto `_ and `boto3 `_ Python library is a pain.
-Its ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files (loaded in RAM, no streaming).
+Working with large remote files, for example using Amazon's `boto `_ and `boto3 `_ Python library, is a pain.
+``boto``'s ``key.set_contents_from_string()`` and ``key.get_contents_as_string()`` methods only work for small files, because they're loaded fully into RAM, no streaming.
 There are nasty hidden gotchas when using ``boto``'s multipart upload functionality that is needed for large files, and a lot of boilerplate.
 
 ``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean unified Pythonic API**. The result is less code for you to write and fewer bugs to make.
@@ -30,7 +30,7 @@ There are nasty hidden gotchas when using ``boto``'s multipart upload functional
 How?
 =====
 
-``smart_open`` is well-tested, well-documented, and has a simple, Pythonic API:
+``smart_open`` is well-tested, well-documented, and has a simple Pythonic API:
 
 .. _doctools_before_examples:
 
@@ -269,15 +269,23 @@ This is useful when you already have an open file, and would like to wrap it wit
 
 .. code-block:: python
 
-    >>> import io
-    >>> filepath = 'smart_open/tests/test_data/1984.txt.gz'
-    >>> with io.open(filepath, 'rb') as open_file:
-    ...     fin.name = filepath
-    ...     with open(open_file, 'rb') as fin:
-    ...         print(repr(fin.readline()))
-    b'It was a bright cold day in April, and the clocks were striking thirteen.\n'
+    >>> import io, gzip
+    >>>
+    >>> # Prepare some gzipped binary data in memory, as an example.
+    >>> # Note that any binary file will do; we're using BytesIO here for simplicity.
+    >>> buf = io.BytesIO()
+    >>> with gzip.GzipFile(fileobj=buf, mode='w') as fout:
+    ...     fout.write(b'this is a bytestring')
+    >>> buf.seek(0)
+    >>>
+    >>> # Use case starts here.
+    >>> buf.name = 'file.gz'  # add a .name attribute so smart_open knows what compressor to use
+    >>> import smart_open
+    >>> smart_open.open(buf, 'rb').read()  # will gzip-decompress transparently!
+    b'this is a bytestring'
+
-In this case, ``smart_open`` relied on the ``.name`` attribute of our file object to determine which decompressor to use.
+In this case, ``smart_open`` relied on the ``.name`` attribute of our file-like ``buf`` object to determine which decompressor to use.
 If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
 Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
 Otherwise, the transparent decompression will not occur.
From 3158d4ecc96254c0335bfa6a5cbf2265c50f1de0 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?=
Date: Mon, 1 Jul 2019 11:16:42 +0200
Subject: [PATCH 5/6] Add link to Python's "binary file" glossary

---
 README.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/README.rst b/README.rst
index 74fa7fe2..2aac4381 100644
--- a/README.rst
+++ b/README.rst
@@ -264,7 +264,7 @@ File-like Binary Streams
 ------------------------
 
 The ``open`` function also accepts file-like objects.
-This is useful when you already have an open file, and would like to wrap it with transparent decompression:
+This is useful when you already have a `binary file `_ open, and would like to wrap it with transparent decompression:
 
 
 .. code-block:: python
@@ -281,7 +281,7 @@ This is useful when you already have a `binary file `_
     >>> # Use case starts here.
     >>> buf.name = 'file.gz'  # add a .name attribute so smart_open knows what compressor to use
     >>> import smart_open
-    >>> smart_open.open(buf, 'rb').read() # will gzip-decompress transparently!
+    >>> smart_open.open(buf, 'rb').read()  # will gzip-decompress transparently!
     b'this is a bytestring'
 

From dcabb06b5f2e76dd36a22e72125addae9ebb6a70 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Radim=20=C5=98eh=C5=AF=C5=99ek?=
Date: Mon, 1 Jul 2019 12:01:07 +0200
Subject: [PATCH 6/6] add link to Python's "binary I/O" documentation

---
 README.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.rst b/README.rst
index 2aac4381..7d382afc 100644
--- a/README.rst
+++ b/README.rst
@@ -285,7 +285,7 @@ This is useful when you already have a `binary file `_
 
-In this case, ``smart_open`` relied on the ``.name`` attribute of our file-like ``buf`` object to determine which decompressor to use.
+In this case, ``smart_open`` relied on the ``.name`` attribute of our `binary I/O stream `_ ``buf`` object to determine which decompressor to use.
 If your file object doesn't have one, set the ``.name`` attribute to an appropriate value.
 Furthermore, that value has to end with a **known** file extension (see the ``register_compressor`` function).
 Otherwise, the transparent decompression will not occur.
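
The patches above point readers at ``register_compressor`` without showing it in use. A minimal sketch only, not part of the patch series itself: it assumes the installed ``smart_open`` exposes ``register_compressor`` at the package level and invokes the handler as ``callback(file_obj, mode)``, and the ``.xz`` handler plus the ``example.txt.xz`` path are purely illustrative (check ``help('smart_open')`` for the exact signature in your version).

.. code-block:: python

    >>> import lzma
    >>> import smart_open
    >>>
    >>> def _handle_xz(file_obj, mode):
    ...     # Wrap the already-open binary stream in an LZMA (de)compressor.
    ...     return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ)
    >>>
    >>> # Any URI or ``.name`` ending in '.xz' now gets the transparent treatment.
    >>> smart_open.register_compressor('.xz', _handle_xz)
    >>> with smart_open.open('example.txt.xz', 'rt') as fin:  # hypothetical local file
    ...     print(fin.readline())

The extension string is matched against the end of the file name, which is why the patches insist that a file-like object's ``.name`` end with a known suffix for transparent decompression to kick in.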