Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add KingfisherTransformMiddleware and update affected spiders #572

Merged
merged 63 commits into from
Feb 9, 2021

Conversation

yolile
Copy link
Member

@yolile yolile commented Dec 2, 2020

closes #329, closes #449, closes #450, closes #471, closes #573, closes #574

(this will also solve getting only samples for compressed files and also being able to pluck them)

@jpmckinney this is a draft, and before updating the documentation and tests I would like to read your opinion.

Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
@yolile yolile marked this pull request as draft December 2, 2020 17:53
…ct into 329-data-types

Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Copy link
Member

@jpmckinney jpmckinney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Some comments.

The data_pointer logic is doing something similar to root_path. Should we switch it to root_path?

kingfisher_scrapy/base_spider.py Outdated Show resolved Hide resolved
kingfisher_scrapy/base_spider.py Outdated Show resolved Hide resolved
kingfisher_scrapy/extensions.py Outdated Show resolved Hide resolved
kingfisher_scrapy/items.py Outdated Show resolved Hide resolved
kingfisher_scrapy/middlewares.py Outdated Show resolved Hide resolved
kingfisher_scrapy/middlewares.py Outdated Show resolved Hide resolved
kingfisher_scrapy/middlewares.py Outdated Show resolved Hide resolved
kingfisher_scrapy/middlewares.py Outdated Show resolved Hide resolved
kingfisher_scrapy/spiders/afghanistan_records.py Outdated Show resolved Hide resolved
VALID_DATE_FORMATS = {'date': '%Y-%m-%d', 'datetime': '%Y-%m-%dT%H:%M:%S'}

ocds_version = '1.1'
date_format = 'date'
date_required = False
unflatten = False
root_path = None
# override this if the file is in json_line format
file_format = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need both file_format and compressed_file_format or can we collapse them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added that just for digiwhist_base that currently doesn't extend from CompressedFileSpider

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, we can maybe look at combining them in a follow-up PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up issue is here: #574

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jpmckinney I ended up removing it now

yolile and others added 10 commits December 3, 2020 10:53
Co-authored-by: James McKinney <26463+jpmckinney@users.noreply.github.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
…ondition

Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
@yolile yolile marked this pull request as ready for review December 3, 2020 21:00
Copy link
Member

@jpmckinney jpmckinney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed my earlier comments. I'll read the diff in full later.

kingfisher_scrapy/middlewares.py Outdated Show resolved Hide resolved
Comment on lines 155 to 156
if spider.sample and number > spider.sample:
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are many items in result in process_spider_output, then I guess we'll still do extra processing, but it will at least be limited.

kingfisher_scrapy/middlewares.py Outdated Show resolved Hide resolved
@jpmckinney
Copy link
Member

I've read the diff now. Looks good! Just those few comments left.

Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
Copy link
Member

@jpmckinney jpmckinney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment to refactor to make code easier to analyze.

Comment on lines 120 to 122
# otherwise is must be a release or record package or a list of them
else:
yield from self._parse_json_array(spider, package, data, **kwargs)
Copy link
Member

@jpmckinney jpmckinney Dec 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment isn't quite right. One or more of the following is true: data_type is release or record, or compressed_file_format is release_package, or root_path is non-empty.

As such, I think the method name should also be changed (or maybe we should have different methods).

For example, is afghanistan_records/releases processed correctly? https://ocds.ageops.net/api/record/5ed2a62c4192f32c8c74a4e5 returns a single record. We can wrap it in a simple package, and then return it as a file, instead of as a file item. This can be done in a separate, simpler method.

The _parse_json_array method is very hard to analyze, because it also gets called from _parse_json_lines. I think we should have multiple methods for specific stages. For example, maybe:

  • items = _split_data(...): if it's json lines, yields each line, otherwise yields data back
  • items = _apply_root_path(items, ...): if root_path is set, calls ijson.items and yields each value, otherwise yields each item back
  • items = _add_package(items, ...): if data_type is release or record, adds a simple package to each item, otherwise yields each item back
  • _resize_packages: does the self.MAX_RELEASES_PER_PACKAGE logic if needed (I think only absolutely necessary when compressed_file_format = 'release_package')
  • finally, yields each item as a File (if the items generator contains one item) or a FileItem

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, maybe we should have separate, simpler middlewares (similar idea to the above)?

@jpmckinney jpmckinney mentioned this pull request Jan 31, 2021
10 tasks
jpmckinney
jpmckinney previously approved these changes Jan 31, 2021
jpmckinney
jpmckinney previously approved these changes Jan 31, 2021
@yolile
Copy link
Member Author

yolile commented Jan 31, 2021

@jpmckinney thank you for your changes and improvements!

@jpmckinney
Copy link
Member

jpmckinney commented Feb 9, 2021

Note that I had to change test_data_types - please review

@yolile Can you check the logic in test_data_types that was changed in this commit? b2beb4d

@jpmckinney
Copy link
Member

@yolile More specifically, I was wondering about: b2beb4d#diff-1071f4e84a30adb8a2b145f878256523e64f9aa931b5a41dceec4320f9d3420fL75-R78

Originally, the response_mock's data key was changed, and then submitted along with the original item to add_package_middleware. In the new code, the response_mock is not changed, and the item returned by the root_path_middleware middleware is submitted to add_package_middleware. I just wanted to check that this was correct. I wasn't sure why the response_mock was changed in the old code.

@yolile
Copy link
Member Author

yolile commented Feb 9, 2021

Originally, the response_mock's data key was changed, and then submitted along with the original item to add_package_middleware. In the new code, the response_mock is not changed, and the item returned by the root_path_middleware middleware is submitted to add_package_middleware. I just wanted to check that this was correct. I wasn't sure why the response_mock was changed in the old code.

@jpmckinney I think that was redundant and it is ok to remove it, as actually we don't use the response in the middleware, just result

@jpmckinney
Copy link
Member

@yolile Thanks! I've removed it, so I'll merge this once CI is done. 🎉

@jpmckinney jpmckinney merged commit 20674ca into master Feb 9, 2021
@jpmckinney jpmckinney deleted the 329-data-types branch February 9, 2021 17:18
@yolile
Copy link
Member Author

yolile commented Feb 9, 2021

@jpmckinney awesome, I'm so happy right now, look at all these closed issues, it's beautiful!

@jpmckinney
Copy link
Member

Same! ☺️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment