
docs: Update documentation to Kingfisher Process v2, closes #782
jpmckinney committed Apr 9, 2024
1 parent 33a2e58 commit 2a199a5
Showing 5 changed files with 75 additions and 25 deletions.
51 changes: 38 additions & 13 deletions docs/kingfisher_process.rst
@@ -3,26 +3,51 @@
Integrate with Kingfisher Process
=================================

Besides storing the scraped data on disk, you can also send them to an instance of `Kingfisher Process <https://kingfisher-process.readthedocs.io/>`_ for processing.
.. seealso::

Version 1
---------
   - :ref:`increment`, about the ``keep_collection_open`` spider argument
   - :class:`~kingfisher_scrapy.base_spiders.base_spider.BaseSpider`, about the ``ocds_version`` class attribute

You need to deploy an instance of Kingfisher Process, including its `web app <https://kingfisher-process.readthedocs.io/en/latest/web.html#web-app>`__. Then, set the following either as environment variables or as Scrapy settings in ``kingfisher_scrapy.settings.py``:
Kingfisher Collect has optional integration with `Kingfisher Process <https://kingfisher-process.readthedocs.io/>`__, through the :class:`~kingfisher_scrapy.extensions.kingfisher_process_api2.KingfisherProcessAPI2` extension.

``KINGFISHER_API_URI``
   The URL from which Kingfisher Process' `web app <https://kingfisher-process.readthedocs.io/en/latest/web.html#web-app>`_ is served. Do not include a trailing slash.
``KINGFISHER_API_KEY``
   One of the API keys in Kingfisher Process' `API_KEYS <https://kingfisher-process.readthedocs.io/en/latest/config.html#web-api>`__ setting.
After deploying and starting an instance of Kingfisher Process, set the following either as environment variables or as Scrapy settings in ``kingfisher_scrapy.settings.py``:

To run a spider:
``KINGFISHER_API2_URL``
   The URL of Kingfisher Process' web API, for example: ``http://user:pass@localhost:8000``
``RABBIT_URL``
   The URL of the RabbitMQ message broker, for example: ``amqp://user:pass@localhost:5672``
``RABBIT_EXCHANGE_NAME``
   The name of the exchange in RabbitMQ, for example: ``kingfisher_process_development``
``RABBIT_ROUTING_KEY``
   The routing key for messages sent to RabbitMQ, equal to the exchange name with an ``_api`` suffix, for example: ``kingfisher_process_development_api``
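
For example, to start a crawl with these set as environment variables, reusing the example values above (a sketch; substitute real credentials, hostnames and names):

.. code-block:: bash

   env KINGFISHER_API2_URL='http://user:pass@localhost:8000' \
       RABBIT_URL='amqp://user:pass@localhost:5672' \
       RABBIT_EXCHANGE_NAME='kingfisher_process_development' \
       RABBIT_ROUTING_KEY='kingfisher_process_development_api' \
       scrapy crawl spider_name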

.. code-block:: bash

   env KINGFISHER_API_URI='http://127.0.0.1:5000' KINGFISHER_API_KEY=1234 scrapy crawl spider_name
Add a note to the collection
----------------------------

To add a note to the collection in Kingfisher Process:
Add a note to the ``collection_note`` table in Kingfisher Process. For example, to track provenance:

.. code-block:: bash

   scrapy crawl spider_name -a note='Started by NAME.'
Select which processing steps to run
------------------------------------

Kingfisher Process stores OCDS data, and upgrades it if the spider sets a class attribute of ``ocds_version = '1.0'`` (see the sketch after this list). It can also perform the optional steps below.

Run structural checks and create compiled releases
   .. code-block:: bash

      scrapy crawl spider_name -a steps=check,compile

Run structural checks only
   .. code-block:: bash

      scrapy crawl spider_name -a steps=check

Create compiled releases only
   .. code-block:: bash

      scrapy crawl spider_name -a steps=compile

Do neither
   .. code-block:: bash

      scrapy crawl spider_name -a steps=
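
To illustrate the upgrade trigger mentioned above, here is a minimal sketch of a spider that declares its data as OCDS 1.0. The spider is hypothetical; only the ``BaseSpider`` class and the ``ocds_version`` attribute come from this repository:

.. code-block:: python

   from kingfisher_scrapy.base_spiders.base_spider import BaseSpider

   class ExampleSpider(BaseSpider):
       # A hypothetical spider whose source publishes OCDS 1.0 data.
       name = 'example'
       ocds_version = '1.0'  # data is OCDS 1.0, so Kingfisher Process upgrades it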
2 changes: 1 addition & 1 deletion docs/local.rst
@@ -131,7 +131,7 @@ And so on. However, as you learned in :ref:`how-it-works`, each crawl writes dat
scrapy crawl spider_name -a from_date=2020-10-15 -a until_date=2020-10-31 -a crawl_time=2020-10-14T12:34:56
If you are integrating with :doc:`Kingfisher Process<kingfisher_process>`, remember to set the ``keep_collection_open`` spider argument, in order to not close the collection when the crawl is finished:
If you are integrating with :doc:`Kingfisher Process<kingfisher_process>`, remember to set the ``keep_collection_open`` spider argument to ``'true'``, in order to not close the collection when the crawl is finished:

.. code-block:: bash

   scrapy crawl spider_name -a keep_collection_open='true'
21 changes: 14 additions & 7 deletions docs/logs.rst
@@ -103,26 +103,33 @@ CRITICAL: Unhandled error in Deferred:
ERROR: Spider error processing <GET https:…> (referer: None)
   An exception was raised in the spider's code. (See the ``spider_exceptions/…`` statistics below.)

   .. attention:: Action needed.
   .. attention:: Action needed

ERROR: Error processing {…}
   An exception was raised in an item pipeline, like ``jsonschema.exceptions.ValidationError``.

   .. attention:: Action needed.
   .. attention:: Action needed

ERROR: Error caught on signal handler: …
   An exception was raised in an extension.

   .. attention:: Action needed.
   .. attention:: Action needed

ERROR: Error downloading <GET https:…>
   An exception was raised by the downloader, typically after failed retries by the `RetryMiddleware <https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.retry>`__ downloader middleware. (See the ``downloader/exception_type_count/…`` statistics below.)
WARNING: Failed to post [https:…]. File API status code: 500
   Issued by the :class:`~kingfisher_scrapy.extensions.KingfisherProcessAPI` extension.
ERROR: Failed to create collection: HTTP {code} ({text}) {{headers}}
   Issued by the :class:`~kingfisher_scrapy.extensions.kingfisher_process_api2.KingfisherProcessAPI2` extension.

   .. admonition:: Potential action
   .. attention:: Action needed

      Run the ``./manage.py load`` command in Kingfisher Process, once the crawl is finished.

ERROR: Failed to close collection: HTTP {code} ({text}) {{headers}}
   Issued by the :class:`~kingfisher_scrapy.extensions.kingfisher_process_api2.KingfisherProcessAPI2` extension.

   .. attention:: Action needed

      If you need the collection in Kingfisher Process to be complete, re-run the spider.
      Run the ``./manage.py closecollection`` command in Kingfisher Process.

WARNING: Dropped: Duplicate File: '….json'
   Issued by the :class:`~kingfisher_scrapy.pipelines.Validate` pipeline.
23 changes: 21 additions & 2 deletions kingfisher_scrapy/extensions/kingfisher_process_api2.py
@@ -27,8 +27,27 @@ def reset(self):

class KingfisherProcessAPI2:
    """
    If the ``KINGFISHER_API2_URL`` environment variable or configuration setting is set,
    then messages are sent to a Kingfisher Process API for the ``item_scraped`` and ``spider_closed`` signals.
    If the ``KINGFISHER_API2_URL``, ``RABBIT_URL``, ``RABBIT_EXCHANGE_NAME`` and ``RABBIT_ROUTING_KEY`` environment
    variables or configuration settings are set, then OCDS data is stored in Kingfisher Process, incrementally.

    When the spider is opened, a collection is created in Kingfisher Process via its web API. The API also receives the
    ``note`` and ``steps`` spider arguments (if set) and the spider's ``ocds_version`` class attribute.

    When an item is scraped, a message is published to the exchange for Kingfisher Process in RabbitMQ, with the path
    to the file written by the :class:`~kingfisher_scrapy.extensions.files_store.FilesStore` extension.

    When the spider is closed, the collection is closed in Kingfisher Process via its web API, unless the
    ``keep_collection_open`` spider argument was set to ``'true'``. The API also receives the crawl statistics and the
    reason why the spider was closed.

    .. note::

       If the ``DATABASE_URL`` environment variable or configuration setting is set, this extension is disabled
       and the :class:`~kingfisher_scrapy.extensions.database_store.DatabaseStore` extension is enabled.

    .. note::

       This extension ignores items generated by the :ref:`pluck` command.
    """

    def __init__(self, url, stats, rabbit_url, rabbit_exchange_name, rabbit_routing_key):
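
For orientation, the signal wiring that the docstring describes can be pictured with a minimal sketch. This is not the project's actual implementation: only Scrapy's ``from_crawler``, ``signals`` and ``NotConfigured`` APIs are real, and the class body is illustrative.

.. code-block:: python

   from scrapy import signals
   from scrapy.exceptions import NotConfigured


   class KingfisherProcessAPI2Sketch:
       def __init__(self, url):
           self.url = url

       @classmethod
       def from_crawler(cls, crawler):
           url = crawler.settings['KINGFISHER_API2_URL']
           if not url:
               # Scrapy disables an extension whose from_crawler raises NotConfigured.
               raise NotConfigured('KINGFISHER_API2_URL is not set.')

           extension = cls(url)
           # Create the collection when the spider opens, publish a message per
           # scraped item, and close the collection when the spider closes.
           crawler.signals.connect(extension.spider_opened, signal=signals.spider_opened)
           crawler.signals.connect(extension.item_scraped, signal=signals.item_scraped)
           crawler.signals.connect(extension.spider_closed, signal=signals.spider_closed)
           return extension

       def spider_opened(self, spider):
           # Create the collection via the web API, forwarding the note and
           # steps spider arguments and the spider's ocds_version.
           ...

       def item_scraped(self, item, spider):
           # Publish a message to the RabbitMQ exchange with the path to the
           # file written by the FilesStore extension.
           ...

       def spider_closed(self, spider, reason):
           # Close the collection via the web API, unless keep_collection_open
           # was set to 'true'.
           ...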
3 changes: 1 addition & 2 deletions kingfisher_scrapy/settings.py
@@ -102,8 +102,7 @@
# To send exceptions and log records to Sentry.
SENTRY_DSN = os.getenv('SENTRY_DSN')

# To send items to Kingfisher Process (version 2). If the API has basic authentication, add the username and password
# to the URL, like http://user:pass@localhost:8000
# To send items to Kingfisher Process (version 2).
KINGFISHER_API2_URL = os.getenv('KINGFISHER_API2_URL')
RABBIT_URL = os.getenv('RABBIT_URL')
RABBIT_EXCHANGE_NAME = os.getenv('RABBIT_EXCHANGE_NAME')
