Add spider for Maringa/Pr #83
)

def parse_year(self, response):
    # print(response.body)
Remove debug code.
TERRITORY_ID = '4115200'
name = 'pr_maringa'
allowed_domains = ['maringa.pr.gov.br']
starting_year = 2015
Unused variable; it can be removed.
a20fadd to 9b64462
Done.
I had just started studying. Congratulations on the initiative. Onward, Maringá.
gazette_id = row.css('td:nth-child(1) a::attr(href)').re_first('.*/[oO]{2}[mM] (.*)')
gazette_date = row.css('td:nth-child(2) font > font::text').extract_first()
yield Gazette(
    date=parse(f'{gazette_date}', languages=['pt']).date(),
You don't need an f-string here; passing gazette_date directly should be enough. Or is it needed for some reason?
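To illustrate the point about the f-string, here is a minimal sketch. The sample date string is an assumption, not a value taken from the Maringá site. It shows that wrapping a `str` in an f-string returns an equal string, and that the wrapper only changes behaviour when `extract_first()` finds nothing and returns `None`.

```python
# Assumed sample value; the real text comes from the table cell on the site.
gazette_date = '05/01/2015'

# For a str, the f-string wrapper produces an equal string, so it adds nothing.
same = f'{gazette_date}' == gazette_date

# extract_first() returns None when the selector matches nothing.
# The f-string then turns that None into the literal text 'None',
# which would be handed to the date parser instead of failing early.
missing = None
wrapped = f'{missing}'

print(same)     # True
print(wrapped)  # None (as a 4-character string)
```

Passing `gazette_date` directly keeps a missing cell visible as `None` rather than masking it as text.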
    file_urls=[f'http://venus.maringa.pr.gov.br/arquivos/orgao_oficial/arquivos/oom%20{gazette_id}'],
    is_extra_edition=any(extra_char in gazette_id for extra_char in ['A', 'B', 'C', 'D']),
    territory_id=self.TERRITORY_ID,
    power='executive_legislature',
yield Gazette(
    date=parse(f'{gazette_date}', languages=['pt']).date(),
    file_urls=[f'http://venus.maringa.pr.gov.br/arquivos/orgao_oficial/arquivos/oom%20{gazette_id}'],
    is_extra_edition=any(extra_char in gazette_id for extra_char in ['A', 'B', 'C', 'D']),
The extra-edition identification is correct based on the 2009 gazette list. The current approach works, but I would prefer to check for any letter rather than only these four, in case there are ever more than four extra editions in a day. What do you think?
Something like: is_extra_edition=any(caracter.isalpha() for caracter in gazette_id),
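A small sketch comparing the two checks, under stated assumptions: the sample IDs are made up, and the helper names `is_extra_fixed` and `is_extra_any_letter` are hypothetical, not part of the spider. It also shows one caveat of the `isalpha()` variant: it only works if `gazette_id` carries no file extension.

```python
# Hypothetical helpers; names are for illustration only.
EXTRA_LETTERS = ['A', 'B', 'C', 'D']


def is_extra_fixed(gazette_id):
    # Check from the diff: only the four listed letters count.
    return any(extra_char in gazette_id for extra_char in EXTRA_LETTERS)


def is_extra_any_letter(gazette_id):
    # Suggested check: any alphabetic character marks an extra edition.
    return any(caracter.isalpha() for caracter in gazette_id)


# Assumed sample IDs, not taken from the real gazette list.
for sample in ['2345', '2345A', '2345E', '2345.pdf']:
    print(sample, is_extra_fixed(sample), is_extra_any_letter(sample))
```

With these samples, a fifth extra edition such as `2345E` is caught only by the `isalpha()` version, while an ID that still ends in `.pdf` would be flagged as extra by it, so the extension has to be stripped before the check.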
gazette_id = row.css('td:nth-child(1) a::attr(href)').re_first('.*/[oO]{2}[mM] (.*)\.pdf')
gazette_date = row.css('td:nth-child(2) font > font::text').extract_first()
yield Gazette(
    date=parse(gazette_date).date(),
Perhaps I expressed myself badly here, or the `, languages=['pt']` argument was removed by accident. My review was only about the f-string; `, languages=['pt']` should be kept. Sorry for the trouble.
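The hunk in this thread also changes the `re_first` pattern to end in `\.pdf`. The sketch below shows the effect of that change with plain `re`, which only approximates what `re_first` does. The href is a hypothetical example and `first_group` is a helper made up for illustration.

```python
import re

# Patterns copied from the two versions of the diff, written as raw strings.
OLD_PATTERN = r'.*/[oO]{2}[mM] (.*)'
NEW_PATTERN = r'.*/[oO]{2}[mM] (.*)\.pdf'


def first_group(pattern, text):
    # Hypothetical stand-in for re_first: first capture group or None.
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Assumed href shape; the real links on the site may differ.
href = 'arquivos/orgao_oficial/arquivos/OOM 2345A.pdf'

old_id = first_group(OLD_PATTERN, href)  # keeps the extension
new_id = first_group(NEW_PATTERN, href)  # extension stripped

print(old_id)
print(new_id)
```

Under this assumed href, the newer pattern yields an ID without the extension, which is what a letter-based extra-edition check needs.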
Need more modifications? I can help!
No description provided.