Add spider for Teresina/PI #53

Closed
wants to merge 7 commits into from

alfakini
Contributor

Hey people,

In this PR we add the spider to collect data from Teresina/PI.

@@ -0,0 +1,53 @@
import dateparser
Collaborator

This file should not be part of this PR. It is unrelated to the Teresina/PI spider, which is the only thing this PR is meant to add.

scraped_at=datetime.utcnow(),
)

next_page_path = response.css(self.NEXT_PAGE_CSS).extract_first()
Collaborator

Use the response's urljoin method (https://doc.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Response.urljoin) to handle relative links:

for next_page_url in response.css(self.NEXT_PAGE_CSS).extract():
    yield Request(response.urljoin(next_page_url))
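
For reference, here is a minimal sketch of how relative links get resolved against the page they were found on. It uses the standard library's urllib.parse.urljoin as a stand-in, on the assumption that response.urljoin behaves the same way for these simple cases. The page URL and the links are hypothetical and are not taken from the Teresina site.

```python
from urllib.parse import urljoin

# Hypothetical page URL and hrefs, for illustration only.
page_url = "https://example.com/gazettes/list.php"
hrefs = [
    "list.php?page=2",                # relative to the current directory
    "/gazettes/list.php?page=3",      # absolute path on the same host
    "https://example.com/other.php",  # already absolute, left untouched
]


def resolve_links(base_url, links):
    """Resolve each link against base_url, the way a response would."""
    return [urljoin(base_url, link) for link in links]


resolved = resolve_links(page_url, hrefs)
for url in resolved:
    print(url)
```

Because every kind of link ends up as a full URL, the spider does not need to build URLs by hand with a base constant.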


allowed_domains = ['www.dom.teresina.pi.gov.br']
name = 'pi_teresina'
start_urls = [f'{GAZETTE_URL}/lista_diario.php']
Collaborator

Since you will be using urljoin to follow the pagination (see my other comment), I don't see the need for the GAZETTE_URL variable. Just include the full URL in the start_urls list.

Contributor Author

Done @rennerocha! 👍

@giovanisleite
Contributor

Hey @alfakini, remember to update cities.md.

@alfakini
Contributor Author

@rennerocha @giovanisleite Done with the refactoring and suggestion 👍

@jvanz
Collaborator

jvanz commented Dec 6, 2019

@alfakini, please mark the change requests as resolved. @rennerocha, does it look good to you?

@jvanz added this to the Capital cities milestone on Jun 21, 2020
@jvanz linked an issue on Jun 21, 2020 that may be closed by this pull request
@jvanz removed this from the Capital cities milestone on Jun 21, 2020
@jvanz changed the base branch from master to main on August 12, 2020 01:47
@giovanisleite
Contributor

This PR can be closed (Cc. @jvanz)

@rennerocha closed this on Oct 20, 2020
Successfully merging this pull request may close these issues.

Teresina spider