Setting to make scrapy ignore/follow robots.txt #421

Closed


stijn-uva
Contributor

Fixes #376.

As far as I could see this does all that's necessary to be able to configure ROBOTSTXT_OBEY via config.json, et cetera. It's set to TRUE by default, which is the implicit default for Scrapy.

This doesn't add a way to toggle this via the web interface, but to be honest I'm not sure how to best go about that. For our own purposes being able to toggle it via a config file is sufficient.

Let me know if I missed any spots where the configuration is processed!
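
For illustration, the kind of wiring described above can be sketched as follows. This is a minimal sketch, not the code from this PR: the `crawler` section, the `robotstxt_obey` key and the `build_scrapy_settings` helper are hypothetical names chosen for the example; only `ROBOTSTXT_OBEY` is the actual Scrapy setting name. The fallback value is left as an explicit argument.

```python
import json


def build_scrapy_settings(config_text, default_obey):
    """Map a JSON config string to a dict of Scrapy settings.

    The "crawler" / "robotstxt_obey" keys are hypothetical and only
    illustrate the idea; ROBOTSTXT_OBEY is the real Scrapy setting.
    """
    config = json.loads(config_text)
    crawler = config.get("crawler", {})
    # Fall back to the caller-supplied default when the key is absent.
    obey = crawler.get("robotstxt_obey", default_obey)
    return {"ROBOTSTXT_OBEY": bool(obey)}
```

The resulting dict could then be merged into whatever settings the crawler is started with.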

@boogheta
Member

boogheta commented Oct 1, 2021

Hey Stijn,

Thanks for getting back to this!

As far as I understand from the docs (https://docs.scrapy.org/en/latest/topics/settings.html#robotstxt-obey), the default value is False if not set, so in order to preserve the current behaviour for historical corpora, I would leave it set to False by default.

Also, I think it would be best if this could be set locally per corpus in addition to globally per instance, but as you guessed this requires quite a few more changes, including some in the web interface. I actually already added some similar settings recently along with the web archives crawling part, so I guess I should be able to adapt it relatively quickly, but if that's OK, I'll wait until I've done this before merging the PR.
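
A per-corpus setting with an instance-wide fallback could look roughly like the sketch below. The names (`resolve_robots_obey`, `corpus_options`, `instance_config`, the `robotstxt_obey` key) are assumptions made for illustration and are not taken from the Hyphe code base; the final fallback is False, matching the default suggested above.

```python
def resolve_robots_obey(corpus_options, instance_config):
    """Return the effective robots.txt policy for one corpus.

    Hypothetical helper: a value set on the corpus wins, otherwise the
    instance-wide value applies, otherwise robots.txt is ignored (False).
    """
    if "robotstxt_obey" in corpus_options:
        return bool(corpus_options["robotstxt_obey"])
    return bool(instance_config.get("robotstxt_obey", False))


# Example: the instance obeys robots.txt, but one corpus opts out.
instance = {"robotstxt_obey": True}
print(resolve_robots_obey({"robotstxt_obey": False}, instance))
print(resolve_robots_obey({}, instance))
```

Checking for the key's presence, rather than its truthiness, lets a corpus explicitly set False even when the instance default is True.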

@boogheta
Member

boogheta commented Oct 1, 2021

FYI, I started completing this in this branch: https://github.com/medialab/hyphe/compare/digitalmethodsinitiative-robots-txt?expand=1

There is still a little bit of work left to make this setting available from the Settings in the web interface; I will try to work on it soon.

@stijn-uva
Contributor Author

stijn-uva commented Oct 1, 2021

Yeah, I definitely see how a per-corpus setting would be more useful - happy to wait for that to be done the proper way. In the meantime I know where to find the setting now so we can set it as needed for our own instances.

As for the default - I swear it was true when I checked, but maybe I was looking at an old version then (or I just wasn't paying attention)! Either is fine for me, since with this addition it can be changed anyway.

@boogheta
Member

I'm closing the PR as I've merged it, with extra commits, into master (see https://github.com/medialab/hyphe/commits/master).
I will publish a release including it soon!

@boogheta boogheta closed this Oct 22, 2021
Successfully merging this pull request may close these issues.

Respect robots.txt