Contributing
============

There are two main types of contributions: spiders and features.

Write a spider
--------------

Learn the data source's access methods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Read its API documentation or bulk download documentation. Navigate the API in your browser or with curl. Inspect its responses to determine where the OCDS data is located, and whether responses include information like pagination links, total pages or total results.
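
For example, a short Python session can show how responses are paginated. This is only a sketch: the endpoint, parameters and keys below are invented, and ``requests`` is used here only for exploration, not by the spiders themselves.

.. code-block:: python

   import requests

   # Invented endpoint and parameter names; substitute the data source's real API.
   response = requests.get('https://example.com/api/releases', params={'page': 1})
   data = response.json()

   # Look for pagination hints: a next-page link, total pages or total results.
   print(sorted(data))
   print(data.get('links', {}).get('next'), data.get('total_pages'), data.get('total_results'))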

Choose a base class
~~~~~~~~~~~~~~~~~~~

Access methods for OCDS data are very similar. Spiders therefore share a lot of logic by inheriting from one of the :doc:`base_spider` classes.

Write the spider
~~~~~~~~~~~~~~~~

After choosing a base class, read its documentation, as well as its parent class' documentation. It's also helpful to read existing spiders that inherit from the same class. A few other pointers:

- Write different callback methods for different response types (see the sketch after this list). Writing a single callback with many if-else branches to handle different response types is very hard to reason about.
- The default ``parse`` callback method should be for "leaf" responses: that is, responses that cause no further requests to be yielded, besides pagination requests.
- Have a look at the :mod:`~kingfisher_scrapy.util` module, which contains useful functions, notably :func:`~kingfisher_scrapy.util.handle_http_error`.
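
The sketch below illustrates the first two pointers. The spider name, URL and response keys are invented, and it assumes ``SimpleSpider`` is importable from ``kingfisher_scrapy.base_spider`` and provides the default ``parse`` callback; treat it as an illustration rather than a working spider.

.. code-block:: python

   import scrapy

   from kingfisher_scrapy.base_spider import SimpleSpider  # assumed import path
   from kingfisher_scrapy.util import handle_http_error


   class ExampleSpider(SimpleSpider):
       """
       Hypothetical spider with one callback per response type.
       """
       name = 'example'
       data_type = 'release_package'

       def start_requests(self):
           # An invented listing endpoint that returns URLs of release packages.
           yield scrapy.Request('https://example.com/api/packages', callback=self.parse_list)

       @handle_http_error
       def parse_list(self, response):
           # The "list" response yields further requests; the resulting "leaf"
           # responses are handled by the default `parse` callback.
           for url in response.json()['packages']:
               yield scrapy.Request(url)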

After writing the spider, add a docstring for :ref:`spider metadata<spider-metadata>`.

Since many class attributes control a spider's behavior, please put the class attributes in this order, including the comments with class names:

.. code-block:: python

   class NewSpider(ParentSpider):
       """
       The typical docstring.
       """
       name = 'new_spider'
       # Any other class attributes from Scrapy, including `download_delay`, `download_timeout`, `user_agent`, `custom_settings`

       # BaseSpider
       ocds_version = '1.0'
       date_format = 'datetime'
       default_from_date = '2000-01-01T00:00:00'
       default_until_date = '2010-01-01T00:00:00'
       date_required = True
       unflatten = True
       unflatten_args = {}
       line_delimited = True
       root_path = 'item'
       root_path_max_length = 1
       skip_pluck = 'A reason'

       # SimpleSpider
       data_type = 'release_package'
       encoding = 'iso-8859-1'

       # CompressedFileSpider
       resize_package = True
       file_name_must_contain = '-'

       # LinksSpider
       next_page_formatter = staticmethod(parameters('page'))
       next_pointer = '/next_page/uri'

       # PeriodicSpider
       pattern = 'https://example.com/{}'
       start_requests_callback = 'parse_list'

       # IndexSpider
       total_pages_pointer = '/data/last_page'
       count_pointer = '/meta/count'
       limit = 1000
       use_page = True
       formatter = staticmethod(parameters('pageNumber'))
       param_page = 'pageNumber'
       param_limit = 'customLimit'
       param_offset = 'customOffset'
       additional_params = {'pageSize': 1000}
       base_url = 'https://example.com/elsewhere'
       yield_list_results = False

Test the spider
~~~~~~~~~~~~~~~

1. Run the spider:

   .. code-block:: bash

      scrapy crawl spider_name

   It can be helpful to write the log to a file:

   .. code-block:: bash

      scrapy crawl spider_name --logfile=debug.log

2. :doc:`Check the log for errors and warnings<../logs>`
3. Check whether the data is as expected, in format and number (see the sketch after this list)
4. :doc:`Integrate it with Kingfisher Process<../kingfisher_process>` and check for errors and warnings in its logs
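
One way to check format and number is to load the downloaded files and count their releases. This is only a sketch: it assumes the crawl wrote JSON release packages under the default ``data`` directory (the ``FILES_STORE`` setting); adjust the path and keys for your spider.

.. code-block:: python

   import json
   from pathlib import Path

   # Assumes the default FILES_STORE ('data'); adjust for your configuration.
   for path in sorted(Path('data').rglob('*.json')):
       package = json.loads(path.read_text())
       # For release packages, report how many releases each file contains.
       print(path, len(package.get('releases', [])))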

Scrapy offers some debugging features that we haven't used yet; see the Scrapy documentation.

Commit the spider
~~~~~~~~~~~~~~~~~

1. Update ``docs/spiders.rst`` with the :ref:`updatedocs` command:

   .. code-block:: bash

      scrapy updatedocs

2. Check the metadata of all spiders, with the :ref:`checkall` command:

   .. code-block:: bash

      scrapy checkall --loglevel=WARNING

After reviewing the output, you can commit your changes to a branch and make a pull request.

Write a feature
---------------

Learn Scrapy
~~~~~~~~~~~~

Read the Scrapy documentation. In particular, learn the data flow and architecture. When working on a specific feature, read the relevant documentation. For example:

- The :doc:`../cli` follows the guidance for running multiple spiders in the same process.
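
For reference, the sketch below shows Scrapy's documented pattern for running multiple spiders in the same process. The spider names are placeholders, and the :doc:`../cli` is not necessarily implemented this way.

.. code-block:: python

   from scrapy.crawler import CrawlerProcess
   from scrapy.utils.project import get_project_settings

   # Placeholder spider names; any spider registered in the project can be passed by name.
   process = CrawlerProcess(get_project_settings())
   process.crawl('spider_one')
   process.crawl('spider_two')
   process.start()  # blocks until all crawls finish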

Use Scrapy
~~~~~~~~~~

The Scrapy framework is very flexible. To maintain a good separation of concerns:

- When setting a custom ``Request.meta`` key, check that the attribute name isn't already in use by Scrapy (see the example after this list).
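
As an illustration, the request below carries a made-up ``data_source`` key in ``Request.meta``. Before adding a key of your own, check Scrapy's documented ``Request.meta`` keys (such as ``download_timeout`` or ``proxy``) to avoid a collision.

.. code-block:: python

   import scrapy

   # `data_source` is a made-up key for illustration; it does not collide with
   # Scrapy's documented Request.meta keys (such as `download_timeout` or `proxy`).
   request = scrapy.Request(
       'https://example.com/api/packages.json',
       meta={'data_source': 'example'},
   )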

Update requirements
-------------------

Update the requirements files as documented in the OCP Software Development Handbook.

Then, re-calculate the checksum for the ``requirements.txt`` file. The checksum is used by deployments to determine whether to update dependencies:

.. code-block:: bash

   shasum -a 256 requirements.txt > requirements.txt.sha256

API reference
-------------

.. toctree::

   base_spider.rst
   extensions.rst
   util.rst
   exceptions.rst
   middlewares.rst