Add KingfisherStoreFiles extension to avoid storing files in spiders #381
Signed-off-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
I think we should also move get_local_file_path_including_filestore, get_local_file_path_excluding_filestore and _get_crawl_path into the extension (and make the necessary changes elsewhere). Basically, no spider should use the os module or interact with the filesystem.
Right now, the new extension sets a directory attribute, but it doesn't use it. When we move these methods, we can replace self.crawler.settings['FILES_STORE'] with that attribute.
The new extension might need to add the path to the item, so that the KingfisherAPI extension doesn't need to re-calculate the path using these methods.
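To make the suggestion concrete, here is a minimal sketch of an extension that writes the file itself and records the path on the item. It is not the code in this PR: the class name StoreFiles, the item keys file_name, data and path, and the "spider name / file name" layout are all assumptions for illustration. Only FILES_STORE and the directory attribute come from the comments above.

```python
import os


class StoreFiles:
    """Hypothetical extension: writes item data to disk so spiders never touch the filesystem."""

    def __init__(self, directory):
        # Replaces lookups of self.crawler.settings['FILES_STORE'] in spiders.
        self.directory = directory

    @classmethod
    def from_crawler(cls, crawler):
        # Imported here so that the rest of the sketch runs without Scrapy installed.
        from scrapy import signals

        extension = cls(crawler.settings['FILES_STORE'])
        crawler.signals.connect(extension.item_scraped, signal=signals.item_scraped)
        return extension

    def item_scraped(self, item, spider):
        # Assumed layout: <FILES_STORE>/<spider name>/<file name>.
        relative_path = os.path.join(spider.name, item['file_name'])
        path = os.path.join(self.directory, relative_path)

        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'wb') as f:
            f.write(item['data'])

        # Record the path on the item, so that a later extension
        # does not need to re-calculate it.
        item['path'] = relative_path
```

With something like this in place, a spider only yields items, and an extension that runs afterwards can read item['path'] instead of calling the path helpers.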
kingfisher_scrapy/base_spider.py (outdated)
@@ -100,63 +100,40 @@ def get_local_file_path_excluding_filestore(self, filename):

def save_response_to_disk(self, response, filename, data_type=None, encoding='utf-8'):
Let's have a follow-up PR (to keep this PR short) to rename save_response_to_disk to build_item_from_response, and save_data_to_disk to build_item.
@yolile To address the issue you identified, can you add LOG_LEVEL = os.getenv('SCRAPY_LOG_LEVEL', 'INFO'), with a comment that the items' OCDS data is logged at the DEBUG level, which we don't want to see in the log file? https://docs.scrapy.org/en/latest/topics/settings.html#log-level
Actually, other DEBUG messages are useful (seeing which requests were retried, etc.). Let's instead add a custom log formatter, so that it excludes the OCDS data: https://docs.scrapy.org/en/latest/topics/logging.html#custom-log-formats https://docs.scrapy.org/en/latest/topics/settings.html#std-setting-LOG_FORMATTER
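A rough sketch of such a formatter, for discussion. It assumes the OCDS data sits under a data key on the item (a hypothetical name) and overrides the scraped method of Scrapy's LogFormatter. The fallback class exists only so the sketch runs where Scrapy is not installed.

```python
import logging

try:
    from scrapy.logformatter import LogFormatter
except ImportError:
    class LogFormatter:
        """Stand-in with the same return shape, used only if Scrapy is missing."""

        def scraped(self, item, response, spider):
            return {
                'level': logging.DEBUG,
                'msg': 'Scraped from %(src)s\n%(item)s',
                'args': {'src': response, 'item': item},
            }


class OmitDataLogFormatter(LogFormatter):
    """Hypothetical formatter: logs scraped items without their bulky data."""

    def scraped(self, item, response, spider):
        # Copy first, so that the item passed on to other components keeps its data.
        item = dict(item)
        item.pop('data', None)  # 'data' is an assumed key name
        return super().scraped(item, response, spider)
```

It would be enabled by pointing the LOG_FORMATTER setting at the class's import path. Other DEBUG messages, such as retried requests, are unaffected.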
@jpmckinney I added all your suggestions (changing georgia and digiwhist_base spiders to be able to move …
dee8180 to 6f1aceb
…xcluding_filestore, _get_crawl_path. Tests: Test changes to item by store extension. Rename methods and variables for clarity. Improve Windows compatibility.
I did some refactoring and other cleanup. Looks good!
closes #277