-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PeriodicalSpider #467
Add PeriodicalSpider #467
Conversation
|
||
When the ``sample`` option is used, the latest year or month of data is retrieved. | ||
""" | ||
VALID_DATE_FORMATS = {'year': '%Y', 'year-month': '%Y-%m'} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we remove 'year-month' from base_spider
VALID_DATE_FORMATS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure. I think it should be possible to use the same format for filtering in other spiders, although I think we don't have a case, don't we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think that all of the spiders which uses 'year-month' extends from PeriodicalSpider
now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, no problem, I'll remove the option.
Add Yohanna's suggestions Co-authored-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
VALID_DATE_FORMATS = {'year': '%Y', 'year-month': '%Y-%m'} | ||
|
||
def __init__(self, *args, **kwargs): | ||
self.date_format_key = self.date_format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@romifz is this needed? cant we use self.date_format
directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case yes, because internally self.date_format
is replaced by the date format it represents in VALID_DATE_FORMATS
e.g. "year" gets replaced by "%Y". I prefer to have the original name in a separate variable.
kingfisher_scrapy/base_spider.py
Outdated
|
||
def start_requests(self): | ||
|
||
start = self.start if not self.exists('from_date') else self.from_date |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@romifz maybe we can use default_from_date
and default_until_date
instead of creating start
and end
and dont reimplement the logic here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can
kingfisher_scrapy/base_spider.py
Outdated
self.start_requests_callback = self.parse | ||
|
||
if not isinstance(self.start, DateClass): | ||
self.start = datetime.strptime(str(self.start), self.date_format) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be always a str instead of a integer, and use from_date
and until_date
directly instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
kingfisher_scrapy/base_spider.py
Outdated
if isinstance(spider.from_date, str): | ||
# convert to date format, if needed | ||
spider.from_date = datetime.strptime(spider.from_date, spider.date_format) | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we remove the else
statement and always convert the date to datetime? to avoid duplicating spider.from_date = datetime.strptime(spider.from_date, spider.date_format)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if-else block is used to know whether from_date
was passed as a parameter or not. If we eliminate the else
a bad configuration in default_from_date
by a subclass may raise a SpiderArgumentError
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we remove the else then? (following the same logic that there are not going to be bad configurations in hardcoded default dates)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@romifz not sure if you have seen this!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'll check if we can remove the if-else block.
kingfisher_scrapy/base_spider.py
Outdated
|
||
if not spider.from_date: | ||
spider.from_date = spider.default_from_date | ||
if isinstance(spider.from_date, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are already doing this in the base class, arent we?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but the conversion in the base class happens only if the user specifies from_date
or until_date
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can do something like
kingfisher-collect/kingfisher_scrapy/spiders/paraguay_dncp_base.py
Lines 39 to 43 in 726b8d0
def from_crawler(cls, crawler, from_date=None, *args, **kwargs): | |
if not from_date: | |
from_date = cls.default_from_date | |
spider = super().from_crawler(crawler, from_date=from_date, *args, **kwargs) |
to avoid the validation/conversion duplication
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would have the same issue as the previous comment, a bad configuration in the spider subclass may rise a SpiderArgumentError
. Since Scrapy does not provide a way to know if a property is a parameter set by the user (at least I couldn't find a way), I think it is best to never set from_date
and until_date
in a superclass like PeriodicalSpider
.
I'm thinking that the best solution would be to have a class property to indicate the base class that we need a time interval to be set, like require_time_period
or something similar. Does this make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that as default_*_date
variables are hardcoded variables we shouldn't have bad configurations here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair.
default_from_date = '2019-01' | ||
date_format = 'year-month' | ||
default_until_date = date.today() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the base class already does this, doesn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we can remove this.
closes #407 |
Ref #407