
Add KingfisherStoreFiles extension to avoid storing files in spiders #381

Merged
merged 18 commits into master from 277-store-files-extension on May 26, 2020

Conversation

yolile
Member

@yolile commented on May 19, 2020

closes #277

Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
@yolile requested a review from jpmckinney on May 19, 2020 at 14:48
@jpmckinney added this to In progress [6 max] in CDS 2020-05/2021-02 on May 19, 2020
@jpmckinney removed this from In progress [6 max] in CDS 2020-05/2021-02 on May 19, 2020
Member

@jpmckinney left a comment


I think we should also move get_local_file_path_including_filestore, get_local_file_path_excluding_filestore and _get_crawl_path into the extension (and make the necessary changes elsewhere). Basically, no spider should use the os module or interact with the filesystem.

Right now, the new extension sets a directory attribute, but it doesn't use it. When we move these methods, we can replace self.crawler.settings['FILES_STORE'] with that attribute.

The new extension might need to add the path to the item, so that the KingfisherAPI extension doesn't need to re-calculate the path using these methods.
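A minimal sketch of the suggested design (class, field, and path names here are illustrative assumptions, not the PR's actual code): the extension holds the `FILES_STORE` directory, listens for `item_scraped`, writes the data itself, and records the resulting path on the item. Scrapy is only imported inside `from_crawler` so the rest of the sketch runs stand-alone.

```python
import os


class StoreFilesExtension:
    """Illustrative sketch: write each scraped item's data under FILES_STORE,
    so spiders never use the os module or touch the filesystem."""

    def __init__(self, directory):
        self.directory = directory

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy is imported lazily so the sketch below runs without it.
        from scrapy import signals
        from scrapy.exceptions import NotConfigured

        directory = crawler.settings['FILES_STORE']
        if not directory:
            raise NotConfigured('FILES_STORE is not set')
        extension = cls(directory)
        crawler.signals.connect(extension.item_scraped, signal=signals.item_scraped)
        return extension

    def item_scraped(self, item, spider):
        # Build the path from the extension's own directory attribute,
        # instead of each spider reading crawler.settings['FILES_STORE'].
        # (The layout <directory>/<spider name>/<file name> is an assumption.)
        path = os.path.join(self.directory, spider.name, item['file_name'])
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as f:
            f.write(item['data'])
        # Record the path on the item, so a downstream extension (like the
        # KingfisherAPI extension mentioned above) needn't re-calculate it.
        item['path'] = path
```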

kingfisher_scrapy/base_spider.py (outdated; resolved)
@@ -100,63 +100,40 @@ def get_local_file_path_excluding_filestore(self, filename):

def save_response_to_disk(self, response, filename, data_type=None, encoding='utf-8'):
Member

Let's have a follow-up PR (to keep this PR short) to rename save_response_to_disk to build_item_from_response, and save_data_to_disk to build_item.

kingfisher_scrapy/base_spider.py (several review threads; resolved)
kingfisher_scrapy/extensions.py (several review threads; resolved)
@jpmckinney
Member

jpmckinney commented May 19, 2020

@yolile To address the issue you identified, can you add to settings.py:

LOG_LEVEL = os.getenv('SCRAPY_LOG_LEVEL', 'INFO')

with a comment that the items' OCDS data is logged at the DEBUG level, which we don't want to see in the log file.

https://docs.scrapy.org/en/latest/topics/settings.html#log-level

@jpmckinney
Member

jpmckinney commented May 19, 2020

Actually, other DEBUG messages are useful (seeing which requests were retried, etc.). Let's instead add a custom log formatter, so that it excludes the OCDS data: https://docs.scrapy.org/en/latest/topics/logging.html#custom-log-formats

https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-LOG_FORMATTER
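As a sketch of that suggestion (not necessarily the formatter the PR ended up with), a log formatter can override `scraped()` to keep the DEBUG line but omit the item body. In a real project the class would subclass `scrapy.logformatter.LogFormatter` and be enabled via the `LOG_FORMATTER` setting; it is kept dependency-free here:

```python
import logging


class NoItemLogFormatter:
    """Illustrative: log 'Scraped from <response>' at the DEBUG level, but
    omit the item itself, so OCDS payloads don't end up in the log file.
    In a real project, subclass scrapy.logformatter.LogFormatter and set
    LOG_FORMATTER to this class's import path in settings.py."""

    def scraped(self, item, response, spider):
        # Scrapy formats log entries as {'level', 'msg', 'args'}; the default
        # 'Scraped from %(src)s' message also appends the item, dropped here.
        return {
            'level': logging.DEBUG,
            'msg': 'Scraped from %(src)s',
            'args': {'src': response},
        }
```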

Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
@yolile
Member Author

yolile commented May 21, 2020

@jpmckinney I applied all your suggestions (I changed the georgia and digiwhist_base spiders so that get_local_file_path_including_filestore and the related methods could move into the store extension; FYI, the georgia spider wasn't working anyway).
I also added a post_to_api field to the item, so that zip and tar files are not posted to kingfisher-process.
Also, I don't store the item if it has the number attribute, as that attribute is used when a big JSON file is sent to kingfisher-process, and the intended behavior is to store only the original data.
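The two rules described above can be pictured as a pair of helpers (the helper names and item keys are assumptions for illustration, not the PR's code):

```python
def should_post_to_api(item):
    # zip and tar file items would set post_to_api=False, so only the
    # unpacked data reaches kingfisher-process; the default is to post.
    return item.get('post_to_api', True)


def should_store(item):
    # Items carrying a 'number' are pieces of a big JSON file sent to
    # kingfisher-process; only the original data is stored on disk.
    return 'number' not in item
```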

@yolile requested a review from jpmckinney on May 21, 2020 at 15:32
@jpmckinney
Member

I did some refactoring and other cleanup. Looks good!

@jpmckinney merged commit c7a195f into master on May 26, 2020
@jpmckinney deleted the 277-store-files-extension branch on May 26, 2020 at 20:48
Development

Successfully merging this pull request may close these issues.

Use item_scraped signal in extension, instead of writing files in spider