-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add KingfisherTransformMiddleware and update affected spiders #572
Conversation
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
…ct into 329-data-types Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Some comments.
The data_pointer
logic is doing something similar to root_path
. Should we switch it to root_path
?
kingfisher_scrapy/base_spider.py
Outdated
VALID_DATE_FORMATS = {'date': '%Y-%m-%d', 'datetime': '%Y-%m-%dT%H:%M:%S'} | ||
|
||
ocds_version = '1.1' | ||
date_format = 'date' | ||
date_required = False | ||
unflatten = False | ||
root_path = None | ||
# override this if the file is in json_line format | ||
file_format = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need both file_format
and compressed_file_format
or can we collapse them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added that just for digiwhist_base
that currently doesn't extend from CompressedFileSpider
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, we can maybe look at combining them in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Follow-up issue is here: #574
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jpmckinney I ended up removing it now
Co-authored-by: James McKinney <26463+jpmckinney@users.noreply.github.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
…er-collect into 329-data-types
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
…ondition Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reviewed my earlier comments. I'll read the diff in full later.
kingfisher_scrapy/middlewares.py
Outdated
if spider.sample and number > spider.sample: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are many items in result
in process_spider_output
, then I guess we'll still do extra processing, but it will at least be limited.
I've read the diff now. Looks good! Just those few comments left. |
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One comment to refactor to make code easier to analyze.
kingfisher_scrapy/middlewares.py
Outdated
# otherwise is must be a release or record package or a list of them | ||
else: | ||
yield from self._parse_json_array(spider, package, data, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment isn't quite right. One or more of the following is true: data_type
is release or record, or compressed_file_format
is release_package, or root_path
is non-empty.
As such, I think the method name should also be changed (or maybe we should have different methods).
For example, is afghanistan_records/releases processed correctly? https://ocds.ageops.net/api/record/5ed2a62c4192f32c8c74a4e5 returns a single record. We can wrap it in a simple package, and then return it as a file, instead of as a file item. This can be done in a separate, simpler method.
The _parse_json_array
method is very hard to analyze, because it also gets called from _parse_json_lines. I think we should have multiple methods for specific stages. For example, maybe:
- items = _split_data(...): if it's json lines, yields each line, otherwise yields data back
- items = _apply_root_path(items, ...): if root_path is set, calls ijson.items and yields each value, otherwise yields each item back
- items = _add_package(items, ...): if data_type is release or record, adds a simple package to each item, otherwise yields each item back
- _resize_packages: does the self.MAX_RELEASES_PER_PACKAGE logic if needed (I think only absolutely necessary when compressed_file_format = 'release_package')
- finally, yields each item as a File (if the items generator contains one item) or a FileItem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, maybe we should have separate, simpler middlewares (similar idea to the above)?
…der.get_default_until_date
…ttributes are for which parent classes
@jpmckinney thank you for your changes and improvements! |
@yolile More specifically, I was wondering about: b2beb4d#diff-1071f4e84a30adb8a2b145f878256523e64f9aa931b5a41dceec4320f9d3420fL75-R78 Originally, the |
@jpmckinney I think that was redundant and it is ok to remove it, as actually we don't use the response in the middleware, just |
@yolile Thanks! I've removed it, so I'll merge this once CI is done. 🎉 |
@jpmckinney awesome, I'm so happy right now, look at all these closed issues, it's beautiful! |
Same! |
closes #329, closes #449, closes #450, closes #471, closes #573, closes #574
(this will also solve getting only samples for compressed files and also being able to pluck them)
@jpmckinney this is a draft, and before updating the documentation and tests I would like to read your opinion.