diff --git a/docs/api/cli.rst b/docs/api/cli.rst
new file mode 100644
index 00000000..0fdf9334
--- /dev/null
+++ b/docs/api/cli.rst
@@ -0,0 +1,6 @@
+Command-line utilities
+======================
+
+.. automodule:: ocdskit.cli.commands.base
+   :members:
+   :undoc-members:
diff --git a/docs/contributing.rst b/docs/contributing.rst
index a46c12ac..5c862834 100644
--- a/docs/contributing.rst
+++ b/docs/contributing.rst
@@ -14,9 +14,9 @@ Adding a command
 Streaming
 ---------
 
-A naive program buffers all inputs into memory before writing any outputs. Since OCDS files can be very large, all OCDS commands read inputs and write outputs progressively or one-at-a-time (that is, they "stream"), as much as possible. Streaming writes outputs faster and requires less memory than buffering. All OCDS commands:
+All OCDS commands:
 
-- stream input, using ``ijson`` to iteratively parse the JSON inputs with a read buffer of 64 kB
+- stream input, using `ijson `__ to iteratively parse the JSON inputs with a read buffer of 64 kB
 - stream output, using `json.JSONEncoder.iterencode() `__ with a `default `__ method that postpones the evaluation of iterators
 - postpone the evaluation of inputs by using iterators instead of lists (for example, ``package-releases`` sets the package's ``releases`` to an iterator), using the `itertools `__ module
 
diff --git a/docs/library.rst b/docs/library.rst
index 8b464dd6..8aa09ae3 100644
--- a/docs/library.rst
+++ b/docs/library.rst
@@ -1,8 +1,41 @@
 Python library
 ==============
 
+Working with streams
+--------------------
+
+A naive program buffers all inputs into memory before writing any outputs. Since OCDS files can be very large, the :doc:`command-line interface` reads inputs and writes outputs progressively or one-at-a-time (that is, it "streams"), as much as possible. Streaming writes outputs faster and requires less memory than buffering.
+
+Output
+~~~~~~
+
 Several library methods return dictionaries with generators as values, which can't be serialized using the ``json`` module without extra work. Use the :func:`ocdskit.util.json_dumps`, :func:`ocdskit.util.json_dump` and :func:`ocdskit.util.iterencode` methods instead.
 
+Input
+~~~~~
+
+The command-line interface uses `ijson `__ to iteratively parse the JSON inputs with a read buffer of 64 kB. To do the same in your code:
+
+.. code-block:: python
+
+   import ijson
+
+   with open(filename, 'rb') as f:
+       for item in ijson.items(f, ''):
+           pass  # do stuff
+
+If you are parsing `concatenated JSON `__, use :code:`multiple_values=True`:
+
+.. code-block:: python
+
+   for item in ijson.items(f, '', multiple_values=True):
+
+If you are working with files that :ref:`embed OCDS data`, set the ``prefix`` argument (:code:`''` above) as described in `ijson's documentation `__. For example:
+
+.. code-block:: python
+
+   for item in ijson.items(f, 'results.item'):
+
 .. toctree::
    :maxdepth: 2
 
@@ -11,4 +44,5 @@ Several library methods return dictionaries with generators as values, which can
    api/mapping_sheet
    api/schema
    api/util
+   api/cli
    api/exceptions
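
The "Output" subsection in this diff says that dictionaries with generators as values can't be serialized with the ``json`` module without extra work. The following is a minimal, standard-library-only sketch of what that extra work might look like. It is not the implementation behind ``ocdskit.util``, which this diff does not show. The names ``default``, ``iter_chunks`` and ``package`` are hypothetical, and this simplified ``default`` hook turns each iterator into a list at the moment the encoder reaches it, so it defers evaluation but does not stream the items of a single generator.

```python
import json
from collections.abc import Iterator


def default(obj):
    """Fallback hook for objects the json module cannot serialize natively."""
    if isinstance(obj, Iterator):
        # Simplification: the iterator is consumed only when the encoder
        # reaches it, but its items are then held in a list.
        return list(obj)
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


def iter_chunks(data):
    """Yield the JSON text for ``data`` in pieces rather than as one string."""
    return json.JSONEncoder(default=default).iterencode(data)


# Hypothetical package whose "releases" value is a generator, not a list.
package = {"releases": ({"id": i} for i in range(3))}

# In a real program each chunk could be written to a file or stdout as it
# arrives; here the chunks are joined so the result is easy to inspect.
output = "".join(iter_chunks(package))
```

Passing the same kind of dictionary to ``json.dumps()`` without a ``default`` hook raises ``TypeError``, which is the problem the paragraph describes.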