Add documentation on how to use ijson with library methods

open-contracting · Feb 28, 2020 · commit 9ba38b8
1 parent da0767c

Showing 3 changed files with 42 additions and 2 deletions.
diff --git a/docs/api/cli.rst b/docs/api/cli.rst
@@ -0,0 +1,6 @@
+Command-line utilities
+======================
+
+.. automodule:: ocdskit.cli.commands.base
+   :members:
+   :undoc-members:
diff --git a/docs/contributing.rst b/docs/contributing.rst
@@ -14,9 +14,9 @@ Adding a command
 Streaming
 ---------
 
-A naive program buffers all inputs into memory before writing any outputs. Since OCDS files can be very large, all OCDS commands read inputs and write outputs progressively or one-at-a-time (that is, they "stream"), as much as possible. Streaming writes outputs faster and requires less memory than buffering. All OCDS commands:
+All OCDS commands:
 
--  stream input, using ``ijson`` to iteratively parse the JSON inputs with a read buffer of 64 kB
+-  stream input, using `ijson <https://pypi.org/project/ijson/>`__ to iteratively parse the JSON inputs with a read buffer of 64 kB
 -  stream output, using `json.JSONEncoder.iterencode() <https://docs.python.org/3/library/json.html#json.JSONEncoder.iterencode>`__ with a `default <https://docs.python.org/3/library/json.html#json.JSONEncoder.default>`__ method that postpones the evaluation of iterators
 -  postpone the evaluation of inputs by using iterators instead of lists (for example, ``package-releases`` sets the package's ``releases`` to an iterator), using the `itertools <https://docs.python.org/2/library/itertools.html>`__ module
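
The output side of the list above can be illustrated with a minimal sketch. It is not OCDS Kit's actual implementation: the names ``IteratorAsList`` and ``StreamingEncoder`` are hypothetical, and the approach assumes CPython's pure-Python encoder, which ``iterencode()`` uses when called directly.

```python
import itertools
import json


class IteratorAsList(list):
    """Hypothetical helper: a list subclass wrapping an iterator, so that the
    json module encodes it as an array without buffering every item."""

    def __init__(self, iterator):
        self._iterator = iter(iterator)
        # Pull one item so the list is truthy when the iterator is non-empty;
        # the encoder writes "[]" for a falsy list without iterating it.
        self._head = list(itertools.islice(self._iterator, 1))
        super().__init__(self._head)

    def __iter__(self):
        # Replay the item already pulled, then continue lazily.
        return itertools.chain(self._head, self._iterator)


class StreamingEncoder(json.JSONEncoder):
    """Hypothetical encoder whose default() defers evaluation of iterators."""

    def default(self, o):
        # Assumption: any otherwise unserializable iterable becomes an array.
        try:
            iterator = iter(o)
        except TypeError:
            return super().default(o)
        return IteratorAsList(iterator)


# A package whose "releases" value is a generator, not a list.
package = {'releases': ({'ocid': f'ocds-{i}'} for i in range(3))}

# iterencode() yields the output in chunks as the generator is consumed.
output = ''.join(StreamingEncoder().iterencode(package))
```

In real use, each chunk would be written to the output stream as it is produced, rather than joined into one string.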
 

diff --git a/docs/library.rst b/docs/library.rst
@@ -1,8 +1,41 @@
 Python library
 ==============
 
+Working with streams
+--------------------
+
+A naive program buffers all inputs into memory before writing any outputs. Since OCDS files can be very large, the :doc:`command-line interface<cli>` reads inputs and writes outputs progressively or one-at-a-time (that is, it "streams"), as much as possible. Streaming writes outputs faster and requires less memory than buffering.
+
+Output
+~~~~~~
+
 Several library methods return dictionaries with generators as values, which can't be serialized using the ``json`` module without extra work. Use the :func:`ocdskit.util.json_dumps`, :func:`ocdskit.util.json_dump` and :func:`ocdskit.util.iterencode` methods instead.
 
+Input
+~~~~~
+
+The command-line interface uses `ijson <https://pypi.org/project/ijson/>`__ to iteratively parse the JSON inputs with a read buffer of 64 kB. To do the same in your code:
+
+.. code-block:: python
+
+   import ijson
+
+   with open(filename, 'rb') as f:  # ijson expects a file opened in binary mode
+       for item in ijson.items(f, ''):
+           pass  # do stuff with item
+
+If you are parsing `concatenated JSON <https://en.wikipedia.org/wiki/JSON_streaming#Concatenated_JSON>`__, use :code:`multiple_values=True`:
+
+.. code-block:: python
+
+   for item in ijson.items(f, '', multiple_values=True):
+
+If you are working with files that :ref:`embed OCDS data<embedded-data>`, set the ``prefix`` argument (:code:`''` above) as described in `ijson's documentation <https://github.com/ICRAR/ijson#prefix>`__. For example:
+
+.. code-block:: python
+
+   for item in ijson.items(f, 'results.item'):
+
 .. toctree::
    :maxdepth: 2
 
@@ -11,4 +44,5 @@ Several library methods return dictionaries with generators as values, which can
    api/mapping_sheet
    api/schema
    api/util
+   api/cli
    api/exceptions