Add documentation on how to use ijson with library methods
jpmckinney committed Feb 28, 2020
1 parent da0767c commit 9ba38b8
Showing 3 changed files with 42 additions and 2 deletions.
6 changes: 6 additions & 0 deletions docs/api/cli.rst
@@ -0,0 +1,6 @@
Command-line utilities
======================

.. automodule:: ocdskit.cli.commands.base
   :members:
   :undoc-members:
4 changes: 2 additions & 2 deletions docs/contributing.rst
@@ -14,9 +14,9 @@ Adding a command
Streaming
---------

-A naive program buffers all inputs into memory before writing any outputs. Since OCDS files can be very large, all OCDS commands read inputs and write outputs progressively or one-at-a-time (that is, they "stream"), as much as possible. Streaming writes outputs faster and requires less memory than buffering. All OCDS commands:
+All OCDS commands:

-- stream input, using ``ijson`` to iteratively parse the JSON inputs with a read buffer of 64 kB
+- stream input, using `ijson <https://pypi.org/project/ijson/>`__ to iteratively parse the JSON inputs with a read buffer of 64 kB
- stream output, using `json.JSONEncoder.iterencode() <https://docs.python.org/3/library/json.html#json.JSONEncoder.iterencode>`__ with a `default <https://docs.python.org/3/library/json.html#json.JSONEncoder.default>`__ method that postpones the evaluation of iterators
- postpone the evaluation of inputs by using iterators instead of lists (for example, ``package-releases`` sets the package's ``releases`` to an iterator), using the `itertools <https://docs.python.org/3/library/itertools.html>`__ module
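
The output-streaming approach in the list above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not necessarily how OCDS Kit implements it: ``StreamedList``, ``default`` and ``stream_json`` are hypothetical names, and the trick relies on ``iterencode()`` using the pure-Python encoder, which iterates over ``list`` subclasses lazily.

```python
import itertools
import json


class StreamedList(list):
    """Hypothetical list subclass wrapping an iterator so the encoder reads it lazily."""

    def __init__(self, iterable):
        super().__init__()
        iterator = iter(iterable)
        try:
            # Peek at one item so that an empty iterator still encodes as "[]".
            self._head = [next(iterator)]
            # The encoder checks truthiness before iterating, so store the rest.
            self.append(iterator)
        except StopIteration:
            self._head = []

    def __iter__(self):
        # self[:1] is [] for an empty input, or [iterator] otherwise.
        return itertools.chain(self._head, *self[:1])


def default(obj):
    # Wrap iterators instead of failing, which postpones their evaluation.
    if hasattr(obj, '__next__'):
        return StreamedList(obj)
    raise TypeError(f'{type(obj).__name__} is not JSON serializable')


def stream_json(data):
    # iterencode() yields string chunks rather than building one large string.
    return json.JSONEncoder(default=default).iterencode(data)


consumed = []


def releases():
    # Hypothetical generator that records when each item is actually produced.
    for i in range(3):
        consumed.append(i)
        yield {'id': i}


chunks = stream_json({'releases': releases()})
first_chunk = next(chunks)  # the opening brace; releases() has not started yet
consumed_after_first = list(consumed)
rest = ''.join(chunks)
```

Each chunk can be written to the output as soon as it is produced, so the full ``releases`` array never has to be held in memory at once.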

34 changes: 34 additions & 0 deletions docs/library.rst
@@ -1,8 +1,41 @@
Python library
==============

Working with streams
--------------------

A naive program buffers all inputs into memory before writing any outputs. Since OCDS files can be very large, the :doc:`command-line interface<cli>` reads inputs and writes outputs progressively or one-at-a-time (that is, it "streams"), as much as possible. Streaming writes outputs faster and requires less memory than buffering.

Output
~~~~~~

Several library methods return dictionaries with generators as values, which can't be serialized using the ``json`` module without extra work. Use the :func:`ocdskit.util.json_dumps`, :func:`ocdskit.util.json_dump` and :func:`ocdskit.util.iterencode` methods instead.
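
To see why the plain ``json`` module needs extra work here, consider this small sketch. The ``releases`` generator and the ``package`` dictionary are hypothetical stand-ins for what a library method might return.

```python
import json


def releases():
    # Hypothetical generator standing in for a library method's return value.
    yield {'ocid': 'ocds-213czf-1'}
    yield {'ocid': 'ocds-213czf-2'}


package = {'releases': releases()}

try:
    json.dumps(package)
    error = None
except TypeError as e:
    # The standard encoder does not know how to serialize a generator.
    error = str(e)

# Converting to a list works, but buffers every release in memory first.
buffered = json.dumps({'releases': list(releases())})
```

Converting generators to lists defeats the purpose of streaming, which is the case the helper functions named above are meant to cover.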

Input
~~~~~

The command-line interface uses `ijson <https://pypi.org/project/ijson/>`__ to iteratively parse the JSON inputs with a read buffer of 64 kB. To do the same in your code:

.. code-block:: python

   import ijson

   # Open in binary mode: ijson reads bytes, and newer releases warn on text-mode files.
   with open(filename, 'rb') as f:
       for item in ijson.items(f, ''):
           print(item)  # do stuff

If you are parsing `concatenated JSON <https://en.wikipedia.org/wiki/JSON_streaming#Concatenated_JSON>`__, use :code:`multiple_values=True`:

.. code-block:: python

   for item in ijson.items(f, '', multiple_values=True):
       print(item)  # do stuff

If you are working with files that :ref:`embed OCDS data<embedded-data>`, set the ``prefix`` argument (:code:`''` above) as described in `ijson's documentation <https://github.com/ICRAR/ijson#prefix>`__. For example:

.. code-block:: python

   for item in ijson.items(f, 'results.item'):
       print(item)  # do stuff

.. toctree::
   :maxdepth: 2

@@ -11,4 +44,5 @@ Several library methods return dictionaries with generators as values, which can
   api/mapping_sheet
   api/schema
   api/util
   api/cli
   api/exceptions
