Some feature questions #49

Closed
koolma opened this issue Jun 17, 2017 · 3 comments

koolma commented Jun 17, 2017

Hi,
I have some questions regarding the features of this crawler, which are not covered by the documentation.

  1. Does php-spider support JavaScript (content and URLs generated via JavaScript)?
  2. Does php-spider follow robots.txt files?
  3. Is php-spider able to leverage a sitemap?
  4. Is it possible to crawl sites that require authentication?
@danvuquoc

@koolma, I'm not a contributor or maintainer of this project but I have used this spider.

  1. It doesn't support JavaScript; this isn't a headless browser. It simply uses Guzzle HTTP 6 to fetch pages. Typically you then use a CSS/XPath discovery class to find new URIs on each fetched page, which gives the spider more pages to crawl.
  2. It doesn't follow robots.txt files, but you could easily write a filter that implements the PreFetchFilterInterface and uses tomverran/robots to check the path of the URI the spider is about to fetch against the robots.txt rules. I've done this, and I've also handled robots headers and robots meta tags with a filter implementing the PostFetchFilterInterface.
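
To illustrate the robots.txt check from point 2, here is a rough, language-neutral sketch in Python rather than PHP. It is not php-spider's API: the class name `RobotsPreFetchFilter`, its `match` method, the sample rules and the `example-spider` user agent are all hypothetical, and it assumes a pre-fetch filter returns true for URIs that should be skipped. The standard library's `urllib.robotparser` stands in for tomverran/robots.

```python
from urllib.robotparser import RobotFileParser


class RobotsPreFetchFilter:
    """Hypothetical pre-fetch filter: flags URIs that robots.txt disallows."""

    def __init__(self, robots_txt: str, user_agent: str = "*"):
        self.user_agent = user_agent
        self.parser = RobotFileParser()
        # Parse rules from text that was already downloaded,
        # so constructing the filter performs no network access.
        self.parser.parse(robots_txt.splitlines())

    def match(self, uri: str) -> bool:
        """Return True when the URI is disallowed and should be skipped."""
        return not self.parser.can_fetch(self.user_agent, uri)


# Sample rules (an assumption for this sketch, not taken from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots_filter = RobotsPreFetchFilter(ROBOTS_TXT, user_agent="example-spider")

candidates = [
    "https://example.com/index.html",
    "https://example.com/private/report.html",
]

# Keep only the URIs the filter does not match, i.e. the ones allowed by robots.txt.
to_crawl = [uri for uri in candidates if not robots_filter.match(uri)]
```

In a real crawl you would fetch and cache one robots.txt per host before applying the filter. The sketch skips that step and works on a single in-memory rule set.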

I'm not sure about 3 and 4. Hope this helps.

@solverat

  1. No, not without additional work.
  2. Yes, just implement some middleware.

@mvdbos
Owner

mvdbos commented Jun 16, 2019

Good answers by @danvuquoc and @solverat. Closing this ancient issue. :-)

@mvdbos mvdbos closed this as completed Jun 16, 2019

4 participants