Araraquara-SP spider #603

Open
lcsvillela wants to merge 3 commits into okfn-brasil:main from lcsvillela:spider-sp-araraquara

Conversation

@lcsvillela

Creates the spider for Araraquara/SP municipality.

@lcsvillela changed the title from "Araraquara-SP" to "Araraquara-SP spider" on Aug 16, 2022
Collaborator

@rennerocha left a comment

@lcsvillela thanks for your PR. It is quite good. I added a few issues and suggestions in your spider code. Please let me know if you have any questions.

import datetime
from urllib.parse import urlencode

import dateparser

issue: `dateparser` is imported but unused (F401). Remove this import.

name = "sp_araraquara"
allowed_domains = ["diariooficialcmararaquara.sp.gov.br"]
start_date = datetime.date(2021, 3, 4) # First gazette available
end_date = datetime.datetime.today()

suggestion: `end_date` already defaults to today in the `BaseGazetteSpider` definition, so you don't need to set it in your spider.
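As a rough illustration of why the attribute is redundant, here is a sketch with a hypothetical stand-in base class. `FakeBaseGazetteSpider` is not the project's `BaseGazetteSpider`; it only assumes, as the comment above says, that the base class already sets `end_date` to today.

```python
import datetime


class FakeBaseGazetteSpider:
    # Hypothetical stand-in: assumes the real base class defaults end_date to today.
    end_date = datetime.date.today()


class SpAraraquaraSketch(FakeBaseGazetteSpider):
    name = "sp_araraquara"
    start_date = datetime.date(2021, 3, 4)  # First gazette available
    # No end_date here: attribute lookup falls back to the base class.


spider = SpAraraquaraSketch()
print(spider.start_date, spider.end_date)
```

The subclass only declares what differs from the base class, which keeps the spider shorter and avoids the two definitions drifting apart.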

date = datetime.datetime.strptime(date, "%d/%m/%Y").date()
url = card.css(".row ::attr(href)").get()
url = self.base_url + url
if card.css(".event-edicao p ::text").get() == "Edição Única":

nitpick: This if statement could be replaced by:

    extra_edition = card.css(".event-edicao p ::text").get() == "Edição Extra"

This is just a personal preference anyway.
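A minimal sketch of the idea, with plain strings standing in for the value returned by `.get()`. The function names are hypothetical, and the if/else body is an assumption about the shape of the original code, since the diff only shows the condition. Note that the condition in the diff compares against "Edição Única" while the suggested expression compares against "Edição Extra"; the sketch follows the suggested form.

```python
from typing import Optional


def is_extra_edition_verbose(edition_label: Optional[str]) -> bool:
    # Assumed shape of the if statement: only its condition appears in the diff.
    if edition_label == "Edição Extra":
        flag = True
    else:
        flag = False
    return flag


def is_extra_edition(edition_label: Optional[str]) -> bool:
    # The comparison already yields a bool, so it can be used directly.
    # A missing label (None, which .get() returns when nothing matches)
    # compares unequal and therefore gives False.
    return edition_label == "Edição Extra"


for label in ("Edição Extra", "Edição Única", None):
    print(label, is_extra_edition(label))
```

Both functions return the same value for every input; the second simply drops the branch.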

for gazette in gazettes:
    card = gazette.css(".event-card")

    edition_number = card.css(".event-data h4 ::text").re_first(r"[0-9]+")

praise: Good use of regexes.


def parse_gazette(self, response):

    gazettes = response.css(".event-card.animated.flipInX")

suggestion: Every element with the class `event-card` is a gazette, so you can simplify this to `gazettes = response.css(".event-card")`.

gazettes = response.css(".event-card.animated.flipInX")

for gazette in gazettes:
    card = gazette.css(".event-card")

suggestion: If you replace the definition of `gazettes`, you won't need this `card` variable.

card = gazette.css(".event-card")

edition_number = card.css(".event-data h4 ::text").re_first(r"[0-9]+")
date = card.re_first(r"[0-9]+/[0-9]+/[0-9]+")

nitpick: `[0-9]` can be replaced by `\d` in a regex.
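To illustrate the nitpick, here is a small sketch using the standard `re` module on a made-up card string. The sample text and the names `DATE_RE` and `extract_date` are assumptions, not taken from the spider. One caveat: for `str` patterns in Python 3, `\d` also matches non-ASCII decimal digits unless `re.ASCII` is passed, so the two forms are strictly interchangeable only with that flag or on ASCII input.

```python
import datetime
import re
from typing import Optional

# \d is shorthand for a decimal digit; re.ASCII restricts it to [0-9].
DATE_RE = re.compile(r"\d+/\d+/\d+", re.ASCII)


def extract_date(card_text: str) -> Optional[datetime.date]:
    """Return the first dd/mm/yyyy date found in card_text, or None."""
    match = DATE_RE.search(card_text)
    if match is None:
        return None
    return datetime.datetime.strptime(match.group(), "%d/%m/%Y").date()


sample = "Edição 123 - 04/03/2021"  # hypothetical card text
print(extract_date(sample))
```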

@trevineju linked an issue on Oct 11, 2022 that may be closed by this pull request
@trevineju self-assigned this on Aug 1, 2023