Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add PeriodicalSpider #467

Merged
merged 14 commits into from
Sep 21, 2020
Merged

Add PeriodicalSpider #467

merged 14 commits into from
Sep 21, 2020

Conversation

romifz
Copy link
Contributor

@romifz romifz commented Jul 27, 2020

Ref #407

@romifz romifz marked this pull request as ready for review July 27, 2020 04:30
@romifz romifz marked this pull request as draft July 27, 2020 04:30
@romifz romifz marked this pull request as ready for review July 27, 2020 19:33
@romifz romifz requested a review from yolile July 27, 2020 19:33

When the ``sample`` option is used, the latest year or month of data is retrieved.
"""
VALID_DATE_FORMATS = {'year': '%Y', 'year-month': '%Y-%m'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we remove 'year-month' from base_spider VALID_DATE_FORMATS ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I think it should be possible to use the same format for filtering in other spiders, although I think we don't have a case, don't we?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I think that all of the spiders which uses 'year-month' extends from PeriodicalSpider now

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, no problem, I'll remove the option.

kingfisher_scrapy/base_spider.py Outdated Show resolved Hide resolved
kingfisher_scrapy/base_spider.py Outdated Show resolved Hide resolved
Add Yohanna's suggestions

Co-authored-by: Yohanna Lisnichuk <yohanitalisnichuk@gmail.com>
VALID_DATE_FORMATS = {'year': '%Y', 'year-month': '%Y-%m'}

def __init__(self, *args, **kwargs):
self.date_format_key = self.date_format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romifz is this needed? cant we use self.date_format directly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case yes, because internally self.date_format is replaced by the date format it represents in VALID_DATE_FORMATS e.g. "year" gets replaced by "%Y". I prefer to have the original name in a separate variable.


def start_requests(self):

start = self.start if not self.exists('from_date') else self.from_date
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romifz maybe we can use default_from_date and default_until_date instead of creating start and end and dont reimplement the logic here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can

self.start_requests_callback = self.parse

if not isinstance(self.start, DateClass):
self.start = datetime.strptime(str(self.start), self.date_format)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be always a str instead of a integer, and use from_date and until_date directly instead

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@romifz romifz requested a review from yolile August 28, 2020 06:42
if isinstance(spider.from_date, str):
# convert to date format, if needed
spider.from_date = datetime.strptime(spider.from_date, spider.date_format)
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we remove the else statement and always convert the date to datetime? to avoid duplicating spider.from_date = datetime.strptime(spider.from_date, spider.date_format)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This if-else block is used to know whether from_date was passed as a parameter or not. If we eliminate the else a bad configuration in default_from_date by a subclass may raise a SpiderArgumentError.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we remove the else then? (following the same logic that there are not going to be bad configurations in hardcoded default dates)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romifz not sure if you have seen this!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'll check if we can remove the if-else block.


if not spider.from_date:
spider.from_date = spider.default_from_date
if isinstance(spider.from_date, str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are already doing this in the base class, arent we?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but the conversion in the base class happens only if the user specifies from_date or until_date.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do something like

def from_crawler(cls, crawler, from_date=None, *args, **kwargs):
if not from_date:
from_date = cls.default_from_date
spider = super().from_crawler(crawler, from_date=from_date, *args, **kwargs)

to avoid the validation/conversion duplication

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would have the same issue as the previous comment, a bad configuration in the spider subclass may rise a SpiderArgumentError. Since Scrapy does not provide a way to know if a property is a parameter set by the user (at least I couldn't find a way), I think it is best to never set from_date and until_date in a superclass like PeriodicalSpider.

I'm thinking that the best solution would be to have a class property to indicate the base class that we need a time interval to be set, like require_time_period or something similar. Does this make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that as default_*_date variables are hardcoded variables we shouldn't have bad configurations here...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair.

default_from_date = '2019-01'
date_format = 'year-month'
default_until_date = date.today()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the base class already does this, doesn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we can remove this.

@romifz romifz requested a review from yolile August 31, 2020 15:44
@romifz romifz requested review from yolile and removed request for yolile September 17, 2020 18:54
@yolile yolile merged commit a87a3a4 into master Sep 21, 2020
@yolile yolile deleted the 407-periodic-spider branch September 21, 2020 13:07
@yolile
Copy link
Member

yolile commented Sep 21, 2020

closes #407

@yolile yolile mentioned this pull request Sep 21, 2020
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants