Add KingfisherTransformMiddleware and update affected spiders #572
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,14 @@ | ||
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html | ||
|
||
import json | ||
from datetime import datetime | ||
|
||
import ijson | ||
import scrapy | ||
|
||
from kingfisher_scrapy import util | ||
from kingfisher_scrapy.base_spider import CompressedFileSpider | ||
from kingfisher_scrapy.items import File | ||
|
||
|
||
class ParaguayAuthMiddleware: | ||
""" | ||
|
@@ -74,3 +79,102 @@ def process_request(request, spider): | |
if 'token_request' in request.meta and request.meta['token_request']: | ||
return | ||
request.headers['Authorization'] = spider.access_token | ||
|
||
|
||
class KingfisherTransformMiddleware: | ||
""" | ||
Middleware that corrects the packaging of OCDS data (whether the OCDS data is embedded, line-delimited JSON, etc.). | ||
""" | ||
MAX_RELEASES_PER_PACKAGE = 100 | ||
|
||
def process_spider_output(self, response, result, spider): | ||
for item in result: | ||
|
||
if not(isinstance(item, File) and (item['data_type'] not in ('release_package', 'record_package') or | ||
isinstance(spider, CompressedFileSpider) or spider.file_format or | ||
spider.root_path)): | ||
yield item | ||
continue | ||
kwargs = { | ||
'file_name': item['file_name'], | ||
'url': item['url'], | ||
'data_type': item['data_type'], | ||
'encoding': item['encoding'], | ||
} | ||
|
||
if isinstance(spider, CompressedFileSpider): | ||
data = item['data']['data'] | ||
package = item['data']['package'] | ||
compressed_file = True | ||
else: | ||
data = item['data'] | ||
package = item['data'] | ||
compressed_file = False | ||
# if it is a compressed file and the file doesn't need any transformation | ||
if compressed_file and spider.compressed_file_format is None: | ||
item['data'] = data.read() | ||
yield item | ||
# if it is a compressed file or a regular file in json_lines format | ||
elif spider.file_format or (compressed_file and spider.compressed_file_format == 'json_lines'): | ||
yield from self._parse_json_lines(spider, data, **kwargs) | ||
# otherwise it must be a release or record package, or a list of them | ||
else: | ||
yield from self._parse_json_array(spider, package, data, **kwargs) | ||
I think this comment isn't quite right. One or more of the following is true: As such, I think the method name should also be changed (or maybe we should have different methods). For example, is afghanistan_records/releases processed correctly? https://ocds.ageops.net/api/record/5ed2a62c4192f32c8c74a4e5 returns a single record. We can wrap it in a simple package, and then return it as a file, instead of as a file item. This can be done in a separate, simpler method. The
Actually, maybe we should have separate, simpler middlewares (similar idea to the above)?
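To make the "wrap it in a simple package" suggestion concrete, here is a minimal sketch of what a separate, simpler method could look like. The helper name `wrap_in_package` is hypothetical and not part of this PR, and the sketch deliberately leaves out any package metadata, so the result is only the bare list wrapper rather than a complete OCDS package.

```python
def wrap_in_package(item, data_type):
    """Wrap a single release or record in a bare package (hypothetical helper).

    Returns the wrapped data and the corrected data_type.
    """
    if data_type not in ('release', 'record'):
        raise ValueError(f'unexpected data_type: {data_type}')
    # 'release' -> 'releases', 'record' -> 'records'
    list_type = f'{data_type}s'
    # No package metadata is added here; a real package would need more fields.
    return {list_type: [item]}, f'{data_type}_package'
```

For a response like the single record mentioned above, `wrap_in_package(record, 'record')` would return the record inside a `records` list together with the data type `record_package`, with no need for `ijson` or batching.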
||
|
||
def _parse_json_array(self, spider, package_data, list_data, *, file_name='data.json', url=None, data_type=None, | ||
encoding='utf-8'): | ||
|
||
if 'record' in data_type: | ||
list_type = 'records' | ||
else: | ||
list_type = 'releases' | ||
|
||
package = self._get_package_metadata(package_data, list_type, data_type, spider.root_path) | ||
# we change the data_type into a valid one: release_package or record_package | ||
if data_type in ('release', 'record'): | ||
data_type = f'{data_type}_package' | ||
key = spider.root_path | ||
# if the array is a list of packages then we point to the releases or records items | ||
else: | ||
key = '.'.join(list(filter(None, [spider.root_path, list_type, 'item']))) | ||
|
||
if spider.sample: | ||
size = spider.sample | ||
else: | ||
size = self.MAX_RELEASES_PER_PACKAGE | ||
|
||
# we yield a release or record package with a maximum of self.MAX_RELEASES_PER_PACKAGE releases or records | ||
for number, items in enumerate(util.grouper(ijson.items(list_data, key), | ||
size), 1): | ||
# to avoid reading the rest of a large file, as the rest of the items will be dropped | ||
if spider.sample and number > spider.sample: | ||
return | ||
package[list_type] = filter(None, items) | ||
data = json.dumps(package, default=util.default) | ||
yield spider.build_file_item(number=number, file_name=file_name, url=url, data=data, | ||
data_type=data_type, encoding=encoding) | ||
|
||
def _parse_json_lines(self, spider, data, *, file_name='data.json', url=None, data_type=None, encoding='utf-8'): | ||
for number, line in enumerate(data, 1): | ||
# to avoid reading the rest of a large file, as the rest of the items will be dropped | ||
if spider.sample and number > spider.sample: | ||
return | ||
If there are many items in
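As an illustration of the early exit in `_parse_json_lines`, the following standalone sketch reads line-delimited JSON and stops after `sample` lines without consuming the rest of the stream. The function name `iter_json_lines` is an assumption made for this example; it is not the middleware's method and it parses each line with `json.loads` rather than `ijson`.

```python
import io
import json
from itertools import islice


def iter_json_lines(data, sample=None, encoding='utf-8'):
    """Yield (number, parsed_line) pairs, reading at most `sample` lines if set."""
    # islice stops pulling from the stream once `sample` lines have been read.
    lines = islice(data, sample) if sample else data
    for number, line in enumerate(lines, 1):
        if isinstance(line, bytes):
            line = line.decode(encoding)
        if line.strip():  # skip blank lines
            yield number, json.loads(line)


# Usage: only the first two of three lines are read.
stream = io.BytesIO(b'{"a": 1}\n{"a": 2}\n{"a": 3}\n')
sampled = list(iter_json_lines(stream, sample=2))
```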
||
if isinstance(line, bytes): | ||
line = line.decode(encoding=encoding) | ||
yield from self._parse_json_array(spider, line, line, file_name=file_name, url=url, data_type=data_type, | ||
encoding=encoding) | ||
|
||
def _get_package_metadata(self, data, skip_key, data_type, root_path): | ||
""" | ||
Returns the package metadata from a file object. | ||
|
||
:param data: a data object | ||
:param str skip_key: the key to skip | ||
:returns: the package metadata | ||
:rtype: dict | ||
""" | ||
package = {} | ||
if 'package' in data_type: | ||
for item in util.items(ijson.parse(data), root_path, skip_key=skip_key): | ||
package.update(item) | ||
return package |
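The batching in `_parse_json_array` can be sketched in isolation as follows. This assumes that `util.grouper` behaves like the `grouper` recipe from the `itertools` documentation, padding the last group with `None` (which would explain the `filter(None, items)` call in the diff); that is an assumption, not something the diff confirms. `build_packages` is a hypothetical name, and unlike the diff it copies the metadata for each package and drops only `None` padding rather than every falsy item.

```python
import json
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect items into fixed-length groups, padding the last one (itertools recipe)."""
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def build_packages(metadata, items, list_type='releases', size=100):
    """Yield (number, json_text) pairs, each holding at most `size` items."""
    for number, group in enumerate(grouper(items, size), 1):
        package = dict(metadata)  # copy, so each package is independent
        # Remove the None padding added to the final group.
        package[list_type] = [item for item in group if item is not None]
        yield number, json.dumps(package)


# Usage: five items in groups of two produce three packages.
packages = list(build_packages({'version': '1.1'}, ({'id': i} for i in range(5)), size=2))
```

Because `items` can be a generator, only one group is held in memory at a time, which is the point of combining the grouping with a streaming parser.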
Do we need both file_format and compressed_file_format or can we collapse them?
I added that just for digiwhist_base, which currently doesn't extend from CompressedFileSpider.
Okay, we can maybe look at combining them in a follow-up PR.
Follow-up issue is here: #574
@jpmckinney I ended up removing it now