
docs: Update documentation to Kingfisher Process v2, closes #782
jpmckinney committed Apr 9, 2024
1 parent 33a2e58 commit 2a199a5
Showing 5 changed files with 75 additions and 25 deletions.
51 changes: 38 additions & 13 deletions docs/kingfisher_process.rst
@@ -3,26 +3,51 @@
Integrate with Kingfisher Process
=================================

Besides storing the scraped data on disk, you can also send them to an instance of `Kingfisher Process <https://kingfisher-process.readthedocs.io/>`_ for processing.
.. seealso::

Version 1
---------
   - :ref:`increment`, about the ``keep_collection_open`` spider argument
   - :class:`~kingfisher_scrapy.base_spiders.base_spider.BaseSpider`, about the ``ocds_version`` class attribute

You need to deploy an instance of Kingfisher Process, including its `web app <https://kingfisher-process.readthedocs.io/en/latest/web.html#web-app>`__. Then, set the following either as environment variables or as Scrapy settings in ``kingfisher_scrapy.settings.py``:
Kingfisher Collect has optional integration with `Kingfisher Process <https://kingfisher-process.readthedocs.io/>`__, through the :class:`~kingfisher_scrapy.extensions.kingfisher_process_api2.KingfisherProcessAPI2` extension.

``KINGFISHER_API_URI``
   The URL from which Kingfisher Process' `web app <https://kingfisher-process.readthedocs.io/en/latest/web.html#web-app>`_ is served. Do not include a trailing slash.
``KINGFISHER_API_KEY``
   One of the API keys in Kingfisher Process' `API_KEYS <https://kingfisher-process.readthedocs.io/en/latest/config.html#web-api>`__ setting.
After deploying and starting an instance of Kingfisher Process, set the following either as environment variables or as Scrapy settings in ``kingfisher_scrapy.settings.py``:

To run a spider:
``KINGFISHER_API2_URL``
   The URL of Kingfisher Process' web API, for example: ``http://user:pass@localhost:8000``
``RABBIT_URL``
   The URL of the RabbitMQ message broker, for example: ``amqp://user:pass@localhost:5672``
``RABBIT_EXCHANGE_NAME``
   The name of the exchange in RabbitMQ, for example: ``kingfisher_process_development``
``RABBIT_ROUTING_KEY``
   The routing key for messages sent to RabbitMQ, equal to the exchange name with an ``_api`` suffix, for example: ``kingfisher_process_development_api``
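
For example, to start a crawl with these set as environment variables, reusing the example values above (a sketch; substitute real credentials, hostnames and names):

.. code-block:: bash

   env KINGFISHER_API2_URL='http://user:pass@localhost:8000' \
       RABBIT_URL='amqp://user:pass@localhost:5672' \
       RABBIT_EXCHANGE_NAME='kingfisher_process_development' \
       RABBIT_ROUTING_KEY='kingfisher_process_development_api' \
       scrapy crawl spider_name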

.. code-block:: bash

   env KINGFISHER_API_URI='http://127.0.0.1:5000' KINGFISHER_API_KEY=1234 scrapy crawl spider_name
Add a note to the collection
----------------------------

To add a note to the collection in Kingfisher Process:
Add a note to the ``collection_note`` table in Kingfisher Process. For example, to track provenance:

.. code-block:: bash

   scrapy crawl spider_name -a note='Started by NAME.'
Select which processing steps to run
------------------------------------

Kingfisher Process stores OCDS data, and upgrades it if the spider sets a class attribute of ``ocds_version = '1.0'`` (see the sketch after this list). It can also perform the optional steps below.

Run structural checks and create compiled releases
   .. code-block:: bash

      scrapy crawl spider_name -a steps=check,compile

Run structural checks only
   .. code-block:: bash

      scrapy crawl spider_name -a steps=check

Create compiled releases only
   .. code-block:: bash

      scrapy crawl spider_name -a steps=compile

Do neither
   .. code-block:: bash

      scrapy crawl spider_name -a steps=
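
To illustrate the upgrade trigger mentioned above, here is a minimal sketch of a spider that declares its data as OCDS 1.0. The spider is hypothetical; only the ``BaseSpider`` class and the ``ocds_version`` attribute come from this repository:

.. code-block:: python

   from kingfisher_scrapy.base_spiders.base_spider import BaseSpider

   class ExampleSpider(BaseSpider):
       # A hypothetical spider whose source publishes OCDS 1.0 data.
       name = 'example'
       ocds_version = '1.0'  # data is OCDS 1.0, so Kingfisher Process upgrades it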
2 changes: 1 addition & 1 deletion docs/local.rst
@@ -131,7 +131,7 @@ And so on. However, as you learned in :ref:`how-it-works`, each crawl writes dat
scrapy crawl spider_name -a from_date=2020-10-15 -a until_date=2020-10-31 -a crawl_time=2020-10-14T12:34:56
If you are integrating with :doc:`Kingfisher Process<kingfisher_process>`, remember to set the ``keep_collection_open`` spider argument, in order to not close the collection when the crawl is finished:
If you are integrating with :doc:`Kingfisher Process<kingfisher_process>`, remember to set the ``keep_collection_open`` spider argument to ``'true'``, in order to not close the collection when the crawl is finished:

.. code-block:: bash

   scrapy crawl spider_name -a keep_collection_open='true'
21 changes: 14 additions & 7 deletions docs/logs.rst
@@ -103,26 +103,33 @@ CRITICAL: Unhandled error in Deferred:
ERROR: Spider error processing <GET https:…> (referer: None)
   An exception was raised in the spider's code. (See the ``spider_exceptions/…`` statistics below.)

   .. attention:: Action needed.
   .. attention:: Action needed

ERROR: Error processing {…}
   An exception was raised in an item pipeline, like ``jsonschema.exceptions.ValidationError``.

   .. attention:: Action needed.
   .. attention:: Action needed

ERROR: Error caught on signal handler: …
   An exception was raised in an extension.

   .. attention:: Action needed.
   .. attention:: Action needed

ERROR: Error downloading <GET https:…>
   An exception was raised by the downloader, typically after failed retries by the `RetryMiddleware <https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#module-scrapy.downloadermiddlewares.retry>`__ downloader middleware. (See the ``downloader/exception_type_count/…`` statistics below.)
WARNING: Failed to post [https:…]. File API status code: 500
   Issued by the :class:`~kingfisher_scrapy.extensions.KingfisherProcessAPI` extension.
ERROR: Failed to create collection: HTTP {code} ({text}) {{headers}}
   Issued by the :class:`~kingfisher_scrapy.extensions.kingfisher_process_api2.KingfisherProcessAPI2` extension.

   .. admonition:: Potential action
   .. attention:: Action needed

      Run the ``./manage.py load`` command in Kingfisher Process, once the crawl is finished.

ERROR: Failed to close collection: HTTP {code} ({text}) {{headers}}
   Issued by the :class:`~kingfisher_scrapy.extensions.kingfisher_process_api2.KingfisherProcessAPI2` extension.

   .. attention:: Action needed

      If you need the collection in Kingfisher Process to be complete, re-run the spider.
      Run the ``./manage.py closecollection`` command in Kingfisher Process.

WARNING: Dropped: Duplicate File: '….json'
   Issued by the :class:`~kingfisher_scrapy.pipelines.Validate` pipeline.
23 changes: 21 additions & 2 deletions kingfisher_scrapy/extensions/kingfisher_process_api2.py
@@ -27,8 +27,27 @@ def reset(self):

class KingfisherProcessAPI2:
    """
    If the ``KINGFISHER_API2_URL`` environment variable or configuration setting is set,
    then messages are sent to a Kingfisher Process API for the ``item_scraped`` and ``spider_closed`` signals.
    If the ``KINGFISHER_API2_URL``, ``RABBIT_URL``, ``RABBIT_EXCHANGE_NAME`` and ``RABBIT_ROUTING_KEY`` environment
    variables or configuration settings are set, then OCDS data is stored in Kingfisher Process, incrementally.

    When the spider is opened, a collection is created in Kingfisher Process via its web API. The API also receives the
    ``note`` and ``steps`` spider arguments (if set) and the spider's ``ocds_version`` class attribute.

    When an item is scraped, a message is published to the exchange for Kingfisher Process in RabbitMQ, with the path
    to the file written by the :class:`~kingfisher_scrapy.extensions.files_store.FilesStore` extension.

    When the spider is closed, the collection is closed in Kingfisher Process via its web API, unless the
    ``keep_collection_open`` spider argument was set to ``'true'``. The API also receives the crawl statistics and the
    reason why the spider was closed.

    .. note::

       If the ``DATABASE_URL`` environment variable or configuration setting is set, this extension is disabled
       and the :class:`~kingfisher_scrapy.extensions.database_store.DatabaseStore` extension is enabled.

    .. note::

       This extension ignores items generated by the :ref:`pluck` command.
    """

    def __init__(self, url, stats, rabbit_url, rabbit_exchange_name, rabbit_routing_key):
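
For orientation, the signal wiring that the docstring describes can be pictured with a minimal sketch. This is not the project's actual implementation: only Scrapy's ``from_crawler``, ``signals`` and ``NotConfigured`` APIs are real, and the class body is illustrative.

.. code-block:: python

   from scrapy import signals
   from scrapy.exceptions import NotConfigured


   class KingfisherProcessAPI2Sketch:
       def __init__(self, url):
           self.url = url

       @classmethod
       def from_crawler(cls, crawler):
           url = crawler.settings['KINGFISHER_API2_URL']
           if not url:
               # Scrapy disables an extension whose from_crawler raises NotConfigured.
               raise NotConfigured('KINGFISHER_API2_URL is not set.')

           extension = cls(url)
           # Create the collection when the spider opens, publish a message per
           # scraped item, and close the collection when the spider closes.
           crawler.signals.connect(extension.spider_opened, signal=signals.spider_opened)
           crawler.signals.connect(extension.item_scraped, signal=signals.item_scraped)
           crawler.signals.connect(extension.spider_closed, signal=signals.spider_closed)
           return extension

       def spider_opened(self, spider):
           # Create the collection via the web API, forwarding the note and
           # steps spider arguments and the spider's ocds_version.
           ...

       def item_scraped(self, item, spider):
           # Publish a message to the RabbitMQ exchange with the path to the
           # file written by the FilesStore extension.
           ...

       def spider_closed(self, spider, reason):
           # Close the collection via the web API, unless keep_collection_open
           # was set to 'true'.
           ...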
3 changes: 1 addition & 2 deletions kingfisher_scrapy/settings.py
@@ -102,8 +102,7 @@
# To send exceptions and log records to Sentry.
SENTRY_DSN = os.getenv('SENTRY_DSN')

# To send items to Kingfisher Process (version 2). If the API has basic authentication, add the username and password
# to the URL, like http://user:pass@localhost:8000
# To send items to Kingfisher Process (version 2).
KINGFISHER_API2_URL = os.getenv('KINGFISHER_API2_URL')
RABBIT_URL = os.getenv('RABBIT_URL')
RABBIT_EXCHANGE_NAME = os.getenv('RABBIT_EXCHANGE_NAME')
