Add KingfisherTransformMiddleware and update affected spiders #572

Merged · 63 commits · Feb 9, 2021

Changes from 2 commits

Commits
ab77692
Add KingfisherTransformMiddleware and update affected spiders
yolile Dec 2, 2020
c047506
Merge branch 'master' of github.com:open-contracting/kingfisher-colle…
yolile Dec 2, 2020
36b1935
Apply suggestions from code review
yolile Dec 3, 2020
5972690
Changes from code review
yolile Dec 3, 2020
d210d45
Merge branch '329-data-types' of github.com:open-contracting/kingfish…
yolile Dec 3, 2020
437cf93
Update CompressedFileSpider tests
yolile Dec 3, 2020
9a3d67a
Reorder if-else condition
yolile Dec 3, 2020
3453b87
Change root_path to '' by default and update transform middleware's c…
yolile Dec 3, 2020
fedba70
Add *_package_list correct handle and sample skip
yolile Dec 3, 2020
3d056df
Add transform middleware tests
yolile Dec 3, 2020
ca1191f
Add root_path documentation
yolile Dec 3, 2020
a0beebc
Update compressed file tests
yolile Dec 3, 2020
b5b67c2
Set item data type in digiwhist items
yolile Dec 4, 2020
999b6ff
Fix bolivia data_type
yolile Dec 4, 2020
6204f65
Update middleware docs and json array keys
yolile Dec 4, 2020
5b793fa
Limit the number of releases inside a package according to sample size
yolile Dec 4, 2020
2d7679c
docs: Clarify some docstrings and comments
jpmckinney Dec 11, 2020
4e5fa27
docs: Clarify some comments
jpmckinney Dec 11, 2020
d994500
Merge branch 'master' of github.com:open-contracting/kingfisher-colle…
yolile Dec 18, 2020
a815b00
Merge branch 'master' of github.com:open-contracting/kingfisher-colle…
yolile Dec 22, 2020
32341d9
Merge branch 'master' of github.com:open-contracting/kingfisher-colle…
yolile Dec 23, 2020
fd62583
Add separate KingfisherTransformCompressedMiddleware
yolile Dec 29, 2020
bb03bf6
KingfisherTransformMiddleware refactor
yolile Dec 29, 2020
fe7ca72
Remove spider.format_file and use compressed_file_format instead
yolile Dec 29, 2020
55b52b7
Update KingfisherTransform*Middleware tests
yolile Dec 29, 2020
2b90f2d
Merge branch '329-data-types' of github.com:open-contracting/kingfish…
yolile Dec 29, 2020
62eed8d
Merge branch 'master' of github.com:open-contracting/kingfisher-colle…
yolile Dec 29, 2020
ad5c538
isort
jpmckinney Jan 4, 2021
61b28a9
Separate KingfisherTransform Middleware in one per step
yolile Jan 7, 2021
9a8bd18
Merge branch '329-data-types' of github.com:open-contracting/kingfish…
yolile Jan 7, 2021
2819b5a
Merge branch 'master' of github.com:open-contracting/kingfisher-colle…
yolile Jan 7, 2021
d960f69
isort
yolile Jan 7, 2021
1e4483f
Eliminate post_to_api, since we no longer write files without posting…
jpmckinney Jan 30, 2021
4bb3ca4
base_spider: Use clearer variable names than data and data_to_ret
jpmckinney Jan 30, 2021
4949af7
Merge master into 329-data-types
jpmckinney Jan 30, 2021
f5c6241
settings: Document spider middleware priority order
jpmckinney Jan 30, 2021
2e04bc4
Add UnknownArchiveFormat exception. Get tests passing after merge.
jpmckinney Jan 30, 2021
b0922f3
Rename UnknownArchiveFormat to UnknownArchiveFormatError
jpmckinney Jan 30, 2021
95b7e15
tests: Add assertions to test_compressed_file_spider.py
jpmckinney Jan 30, 2021
269a09e
Fix typo in b0922f35b6557d972c4eb6bc4db2659a651a7180
jpmckinney Jan 31, 2021
e8b5a7d
tests: Parametrize a test
jpmckinney Jan 31, 2021
efca5aa
Change `compressed_file_format = 'json_lines'` to `line_delimited = T…
jpmckinney Jan 31, 2021
d0dd96d
base_spider: Replace data_pointer with root_path, closes #573
jpmckinney Jan 31, 2021
7be3e8b
flake8, isort
jpmckinney Jan 31, 2021
a24ef5a
tests: Rename test method since original method was renamed
jpmckinney Jan 31, 2021
b2beb4d
Fix KingfisherTransformRootPathMiddleware to not yield duplicate items
jpmckinney Jan 31, 2021
b05d0e8
georgia_opendata: Remove compressed_file_format, since the ZIP contai…
jpmckinney Jan 31, 2021
d659d62
tests: Rename ExpectedError to TestError to avoid pytest warning
jpmckinney Jan 31, 2021
a3692d0
middlewares: Abbreviate class names
jpmckinney Jan 31, 2021
4d8daeb
base_spider: Rename compressed_file_format string to resize_package b…
jpmckinney Jan 31, 2021
c35c105
tests: Split tests to be more atomic
jpmckinney Jan 31, 2021
21ba2aa
base_spider: Add a note that resize_package isn't compatible with lin…
jpmckinney Jan 31, 2021
61157e6
middlewares: Fix comment (ResizePackageMiddleware doesn't support rec…
jpmckinney Jan 31, 2021
2189b5e
base_spider: Merge PeriodicSpider.get_default_until_date into BaseSpi…
jpmckinney Jan 31, 2021
0e84813
spiders: Add comments so that it is easier to reconcile which class a…
jpmckinney Jan 31, 2021
9399603
honduras_portal_base: Remove unused next_pointer
jpmckinney Jan 31, 2021
d84df8f
spiders, docs: Document and apply order for class attributes
jpmckinney Jan 31, 2021
c00a78f
autopep8
jpmckinney Jan 31, 2021
e9e1a51
extensions: KingfisherFilesStore: Put the number after the file name
jpmckinney Jan 31, 2021
a263be1
crawlall: Remove exception for CompressedFileSpider, closes #471
jpmckinney Jan 31, 2021
79d18cf
middlewares: Fix variable name to align with comment
jpmckinney Feb 9, 2021
6f40453
tests: Removed unnecessary response_mock
jpmckinney Feb 9, 2021
66af007
flake8
jpmckinney Feb 9, 2021
73 changes: 15 additions & 58 deletions kingfisher_scrapy/base_spider.py
@@ -6,7 +6,6 @@
from math import ceil
from zipfile import ZipFile

import ijson
import scrapy
from jsonpointer import resolve_pointer
from rarfile import RarFile
@@ -34,13 +33,15 @@ class BaseSpider(scrapy.Spider):
``from_date`` defaults to the ``default_from_date`` class attribute, and ``until_date`` defaults to the
``get_default_until_date()`` return value (which is the current time, by default).
"""
MAX_RELEASES_PER_PACKAGE = 100
VALID_DATE_FORMATS = {'date': '%Y-%m-%d', 'datetime': '%Y-%m-%dT%H:%M:%S'}

ocds_version = '1.1'
date_format = 'date'
date_required = False
unflatten = False
root_path = None
# override this if the file is in json_lines format
file_format = None
Member: Do we need both file_format and compressed_file_format, or can we collapse them?

Member Author: I added that just for digiwhist_base, which currently doesn't extend CompressedFileSpider.

Member: Okay, we can maybe look at combining them in a follow-up PR.

Member: The follow-up issue is here: #574

Member Author: @jpmckinney I ended up removing it now.

def __init__(self, sample=None, note=None, from_date=None, until_date=None, crawl_time=None,
keep_collection_open=None, package_pointer=None, release_pointer=None, truncate=None, *args,
@@ -242,46 +243,6 @@ def build_file_error_from_response(self, response, **kwargs):
item.update(kwargs)
return item

def _get_package_metadata(self, f, skip_key):
"""
Returns the package metadata from a file object.

:param f: a file object
:param str skip_key: the key to skip
:returns: the package metadata
:rtype: dict
"""
package = {}
for item in util.items(ijson.parse(f), '', skip_key=skip_key):
package.update(item)
return package

def parse_json_lines(self, f, *, file_name='data.json', url=None, data_type=None, encoding='utf-8'):
for number, line in enumerate(f, 1):
if self.sample and number > self.sample:
break
if isinstance(line, bytes):
line = line.decode(encoding=encoding)
yield self.build_file_item(number=number, file_name=file_name, url=url, data=line, data_type=data_type,
encoding=encoding)

def parse_json_array(self, f_package, f_list, *, file_name='data.json', url=None, data_type=None, encoding='utf-8',
array_field_name='releases'):
if self.sample:
size = self.sample
else:
size = self.MAX_RELEASES_PER_PACKAGE

package = self._get_package_metadata(f_package, array_field_name)

for number, items in enumerate(util.grouper(ijson.items(f_list, f'{array_field_name}.item'), size), 1):
package[array_field_name] = filter(None, items)
data = json.dumps(package, default=util.default)
yield self.build_file_item(number=number, file_name=file_name, url=url, data=data, data_type=data_type,
encoding=encoding)
if self.sample:
break

@classmethod
def get_default_until_date(cls, spider):
"""
@@ -338,11 +299,11 @@ class CompressedFileSpider(BaseSpider):

``json_lines``
Yields each line of each compressed file.
The archive file is saved to disk. The compressed files are *not* saved to disk.
Each compressed file is saved to disk. The archive file is *not* saved to disk.
``release_package``
Re-packages the releases in the compressed files in groups of
:const:`~kingfisher_scrapy.base_spider.BaseSpider.MAX_RELEASES_PER_PACKAGE`, and yields the packages.
The archive file is saved to disk. The compressed files are *not* saved to disk.
Each compressed file is saved to disk. The archive file is *not* saved to disk.
``None``
Yields each compressed file.
Each compressed file is saved to disk. The archive file is *not* saved to disk.
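
For illustration, a minimal sketch of a subclass using these modes follows (a hypothetical spider, not part of this PR; the file_name meta key is an assumption based on other spiders in this repository):

import scrapy

from kingfisher_scrapy.base_spider import CompressedFileSpider


class ExampleZip(CompressedFileSpider):
    # Hypothetical spider, for illustration only.
    name = 'example_zip'
    data_type = 'release_package'
    compressed_file_format = 'release_package'  # or 'json_lines', or None
    archive_format = 'zip'  # the default; 'rar' is also supported

    def start_requests(self):
        # Assumption: requests carry a file_name in meta, as elsewhere in this repo.
        yield scrapy.Request('https://example.com/all.zip', meta={'file_name': 'all.zip'})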
@@ -363,15 +324,12 @@ def start_requests(self):
"""

encoding = 'utf-8'
skip_pluck = 'Archive files are not supported'
compressed_file_format = None
archive_format = 'zip'
file_name_must_contain = ''

@handle_http_error
def parse(self, response):
if self.compressed_file_format:
yield self.build_file_from_response(response, data_type=self.archive_format, post_to_api=False)
if self.archive_format == 'zip':
cls = ZipFile
else:
@@ -390,20 +348,19 @@ def parse(self, response):
basename += '.json'

data = archive_file.open(filename)
package = None
# for compressed_file_format == 'release_package' we need to read the file twice: once to extract the
# package metadata and then to extract the releases themselves
if self.compressed_file_format == 'release_package':
package = archive_file.open(filename)

kwargs = {
yield File({
'file_name': basename,
'url': response.request.url,
'data': {'data': data, 'package': package},
'data_type': self.data_type,
'encoding': self.encoding,
}
if self.compressed_file_format == 'json_lines':
yield from self.parse_json_lines(data, **kwargs)
elif self.compressed_file_format == 'release_package':
package = archive_file.open(filename)
yield from self.parse_json_array(package, data, **kwargs)
else:
yield self.build_file(data=data.read(), **kwargs)
'url': response.request.url,
'encoding': self.encoding
})


class LinksSpider(SimpleSpider):
7 changes: 5 additions & 2 deletions kingfisher_scrapy/extensions.py
@@ -69,14 +69,17 @@ def item_scraped(self, item, spider):

Returns a dict with the metadata.
"""
if not isinstance(item, File):
if not isinstance(item, (File, FileItem)):
return

# The crawl's relative directory, in the format `<spider_name>[_sample]/<YYMMDD_HHMMSS>`.
name = spider.name
if spider.sample:
name += '_sample'
path = os.path.join(name, spider.get_start_time('%Y%m%d_%H%M%S'), item['file_name'])
file_name = item['file_name']
if isinstance(item, FileItem):
file_name = f"{item['number']}-{item['file_name']}"
path = os.path.join(name, spider.get_start_time('%Y%m%d_%H%M%S'), file_name)

self._write_file(path, item['data'])

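To make the new naming concrete, a short sketch with hypothetical values of the path a FileItem is now written to by item_scraped above:

import os

name = 'colombia_bulk'            # spider.name, plus '_sample' when sampling
start_time = '20201202_120000'    # spider.get_start_time('%Y%m%d_%H%M%S')
file_name = '3-data.json'         # f"{item['number']}-{item['file_name']}" for a FileItem
print(os.path.join(name, start_time, file_name))  # colombia_bulk/20201202_120000/3-data.json
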
6 changes: 0 additions & 6 deletions kingfisher_scrapy/item_schema/item.json
@@ -32,14 +32,8 @@
"enum": [
"record",
"release",
"release_list",
"record_package",
"release_package",
"record_package_list",
"release_package_list",
"record_package_list_in_results",
"release_package_list_in_results",
"release_in_Release",
"zip",
"rar",
"tar.gz"
5 changes: 1 addition & 4 deletions kingfisher_scrapy/items.py
@@ -22,11 +22,8 @@ class File(KingfisherItem):
files_store = scrapy.Field()


class FileItem(KingfisherItem):
class FileItem(File):
number = scrapy.Field()
data = scrapy.Field()
data_type = scrapy.Field()
encoding = scrapy.Field()


class FileError(KingfisherItem):
86 changes: 85 additions & 1 deletion kingfisher_scrapy/middlewares.py
@@ -1,9 +1,14 @@
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import json
from datetime import datetime

import ijson
import scrapy

from kingfisher_scrapy import util
from kingfisher_scrapy.base_spider import CompressedFileSpider
from kingfisher_scrapy.items import File


class ParaguayAuthMiddleware:
"""
@@ -74,3 +79,82 @@ def process_request(request, spider):
if 'token_request' in request.meta and request.meta['token_request']:
return
request.headers['Authorization'] = spider.access_token


class KingfisherTransformMiddleware:
"""
Middleware that checks for File items that come from CompressedFileSpider or contain non-well-packaged OCDS
data, and transforms them into a release or record package
"""
MAX_RELEASES_PER_PACKAGE = 100

def process_spider_output(self, response, result, spider):
for item in result:

if isinstance(item, File) and (spider.root_path is not None or isinstance(spider, CompressedFileSpider)):
kwargs = {
'file_name': item['file_name'],
'url': response.request.url,
'data_type': spider.data_type,
'encoding': item['encoding'],
}
data = item['data']
package = item['data']
compressed_file = False
if isinstance(spider, CompressedFileSpider):
data = item['data']['data']
package = item['data']['package']
compressed_file = True
# if it is a compressed file and the file doesn't need any transformations
if compressed_file and spider.compressed_file_format is None:
yield spider.build_file(data=data.read(), **kwargs)
# if it is a compressed file or a regular file in json_lines format
elif spider.file_format or (compressed_file and spider.compressed_file_format == 'json_lines'):
yield from self._parse_json_lines(spider, data, **kwargs)
# otherwise it must be a release or record package, or a list of them
else:
yield from self._parse_json_array(spider, package, data, **kwargs)
else:
yield item

def _parse_json_array(self, spider, package_data, list_data, *, file_name='data.json', url=None, data_type=None,
encoding='utf-8'):

list_type = 'releases'
if 'record' in data_type:
list_type = 'records'

package = self._get_package_metadata(package_data, list_type, data_type, spider.root_path)
# we change the data_type into a valid one: release_package or record_package
data_type = data_type if 'package' in data_type else f'{data_type}_package'

# we yield a release or record package with a maximum of self.MAX_RELEASES_PER_PACKAGE releases or records
for number, items in enumerate(util.grouper(ijson.items(list_data, spider.root_path),
self.MAX_RELEASES_PER_PACKAGE), 1):
package[list_type] = filter(None, items)
data = json.dumps(package, default=util.default)
yield spider.build_file_item(number=number, file_name=file_name, url=url, data=data,
data_type=data_type, encoding=encoding)

def _parse_json_lines(self, spider, data, *, file_name='data.json', url=None, data_type=None, encoding='utf-8'):
for number, line in enumerate(data, 1):
if isinstance(line, bytes):
line = line.decode(encoding=encoding)
yield from self._parse_json_array(spider, line, line, file_name=file_name, url=url, data_type=data_type,
encoding=encoding)
return

def _get_package_metadata(self, data, skip_key, data_type, root_path):
"""
Returns the package metadata from a file object.

:param data: a data object
:param str skip_key: the key to skip
:returns: the package metadata
:rtype: dict
"""
package = {}
if 'package' in data_type:
for item in util.items(ijson.parse(data), root_path, skip_key=skip_key):
package.update(item)
return package
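
To make the repackaging step concrete, here is a self-contained sketch (standard library only, hypothetical data) of the grouping that _parse_json_array performs via util.grouper:

import itertools
import json

MAX_RELEASES_PER_PACKAGE = 100  # same constant as the middleware above


def grouper(iterable, n):
    # Fixed-size groups, padding the last group with None (like util.grouper).
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args)


def repackage(package_metadata, releases):
    # Yield one package per group of at most MAX_RELEASES_PER_PACKAGE releases.
    for number, items in enumerate(grouper(releases, MAX_RELEASES_PER_PACKAGE), 1):
        package = dict(package_metadata)
        package['releases'] = [item for item in items if item is not None]
        yield number, json.dumps(package)


# Hypothetical usage: 250 releases become packages of 100, 100 and 50.
metadata = {'uri': 'https://example.com', 'version': '1.1'}
releases = [{'ocid': f'ocds-213czf-{i}'} for i in range(250)]
for number, data in repackage(metadata, releases):
    print(number, len(json.loads(data)['releases']))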
3 changes: 3 additions & 0 deletions kingfisher_scrapy/settings.py
@@ -57,6 +57,9 @@
#SPIDER_MIDDLEWARES = {
# 'kingfisher_scrapy.middlewares.MyCustomSpiderMiddleware': 543,
#}
SPIDER_MIDDLEWARES = {
'kingfisher_scrapy.middlewares.KingfisherTransformMiddleware': 543
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
1 change: 1 addition & 0 deletions kingfisher_scrapy/spiders/afghanistan_records.py
@@ -16,6 +16,7 @@ class AfghanistanRecords(SimpleSpider):
name = 'afghanistan_records'
data_type = 'record'
skip_pluck = 'Already covered (see code for details)' # afghanistan_releases
root_path = ''

download_delay = 1

1 change: 1 addition & 0 deletions kingfisher_scrapy/spiders/afghanistan_releases.py
@@ -15,6 +15,7 @@ class AfghanistanReleases(SimpleSpider):
"""
name = 'afghanistan_releases'
data_type = 'release'
root_path = ''

download_delay = 1.5

1 change: 1 addition & 0 deletions kingfisher_scrapy/spiders/argentina_buenos_aires.py
@@ -19,6 +19,7 @@ class ArgentinaBuenosAires(CompressedFileSpider):
name = 'argentina_buenos_aires'
data_type = 'release_package'
compressed_file_format = 'release_package'
root_path = ''

# the data list service takes too long to be downloaded, so we increase the download timeout
download_timeout = 1000
3 changes: 2 additions & 1 deletion kingfisher_scrapy/spiders/argentina_vialidad.py
@@ -11,7 +11,8 @@ class ArgentinaVialidad(SimpleSpider):
https://datosabiertos.vialidad.gob.ar/ui/index.html#!/datos_abiertos
"""
name = 'argentina_vialidad'
data_type = 'release_package_list'
data_type = 'release_package'
root_path = 'item'

def start_requests(self):
url = 'https://datosabiertos.vialidad.gob.ar/api/ocds/package/all'
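The root_path = 'item' change above is what lets data_type become 'release_package': as the middleware code suggests, root_path is the ijson prefix at which the OCDS data is found, and 'item' matches each element of a top-level JSON array. A minimal sketch with a hypothetical response body:

from io import BytesIO

import ijson

# Hypothetical API response: a top-level array of release packages.
body = BytesIO(b'[{"uri": "a", "releases": []}, {"uri": "b", "releases": []}]')

# The prefix 'item' selects each element of the top-level array.
for package in ijson.items(body, 'item'):
    print(package['uri'])  # prints a, then b
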
3 changes: 2 additions & 1 deletion kingfisher_scrapy/spiders/colombia_bulk.py
@@ -13,9 +13,10 @@ class ColombiaBulk(CompressedFileSpider):
"""

name = 'colombia_bulk'
data_type = 'release_in_Release'
data_type = 'release'
encoding = 'iso-8859-1'
compressed_file_format = 'json_lines'
root_path = 'Release'

download_timeout = 99999
custom_settings = {
6 changes: 3 additions & 3 deletions kingfisher_scrapy/spiders/digiwhist_base.py
@@ -14,8 +14,9 @@ class DigiwhistBase(BaseSpider):
Bulk download documentation
https://opentender.eu/download
"""
skip_pluck = 'JSON Lines is not supported'
data_type = 'release_package'
root_path = ''
file_format = 'json_lines'

def start_requests(self):
# See scrapy.spiders.Spider.start_requests
@@ -24,9 +25,8 @@

@handle_http_error
def parse(self, response):
yield self.build_file_from_response(response, data_type='tar.gz', post_to_api=False)

# Load a line at the time, pass it to API
with tarfile.open(fileobj=BytesIO(response.body), mode="r:gz") as tar:
with tar.extractfile(tar.getnames()[0]) as readfp:
yield from self.parse_json_lines(readfp, url=self.start_urls[0], data_type=self.data_type)
yield self.build_file_from_response(data=readfp, response=response, file_name='data.json')
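
For reference, a self-contained sketch of the streaming pattern used in parse() above, with a hypothetical in-memory .tar.gz (standard library only):

import tarfile
from io import BytesIO

# Build a hypothetical tar.gz in memory with one JSON Lines member.
buf = BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    content = b'{"ocid": "ocds-213czf-1"}\n{"ocid": "ocds-213czf-2"}\n'
    info = tarfile.TarInfo('data.jsonl')
    info.size = len(content)
    tar.addfile(info, BytesIO(content))
buf.seek(0)

# Read it back the way parse() does: the first member as a file object.
with tarfile.open(fileobj=buf, mode='r:gz') as tar:
    with tar.extractfile(tar.getnames()[0]) as readfp:
        for line in readfp:  # one JSON text per line
            print(line.decode('utf-8').rstrip())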
1 change: 1 addition & 0 deletions kingfisher_scrapy/spiders/dominican_republic.py
@@ -15,6 +15,7 @@ class DominicanRepublic(CompressedFileSpider):
data_type = 'release_package'
compressed_file_format = 'release_package'
archive_format = 'rar'
root_path = ''

def start_requests(self):
yield scrapy.Request(
1 change: 1 addition & 0 deletions kingfisher_scrapy/spiders/indonesia_bandung.py
@@ -31,6 +31,7 @@ class IndonesiaBandung(BaseSpider):
"""
name = 'indonesia_bandung'
data_type = 'release'
root_path = ''

def start_requests(self):
pattern = 'https://birms.bandung.go.id/api/packages/year/{}'
3 changes: 2 additions & 1 deletion kingfisher_scrapy/spiders/kenya_makueni.py
@@ -14,7 +14,8 @@ class KenyaMakueni(IndexSpider):
https://opencontracting.makueni.go.ke/swagger-ui.html#/ocds-controller
"""
name = 'kenya_makueni'
data_type = 'release_package_list'
data_type = 'release_package'
root_path = 'item'
limit = 10
additional_params = {'pageSize': limit}
yield_list_results = False
3 changes: 2 additions & 1 deletion kingfisher_scrapy/spiders/kosovo.py
@@ -17,9 +17,10 @@ class Kosovo(SimpleSpider):
https://ocdskrpp-test.rks-gov.net/Help
"""
name = 'kosovo'
data_type = 'release_list'
data_type = 'release'
date_format = 'datetime'
default_from_date = '2000-01-01T00:00:00'
root_path = 'item'

def start_requests(self):
stages = ['Award', 'Tender', 'Bid']