Some feature questions #49

koolma · 2017-06-17T11:54:03Z

Hi,
I have some questions regarding the features of this crawler, which are not covered by the documentation.

danvuquoc · 2017-08-18T14:55:06Z

@koolma, I'm not a contributor or maintainer of this project but I have used this spider.

1. It doesn't support JavaScript; this isn't a headless browser. It simply uses Guzzle HTTP 6 to fetch pages. Typically you then use a CSS/XPath discovery class to find new URIs, which lead to more pages to crawl.
2. It doesn't follow robots.txt files, but you could easily write a filter that implements PreFetchFilterInterface and uses tomverran/robots to match the path of the URI about to be fetched against the robots.txt rules. I've done the same for robots headers and robots meta tags by implementing a PostFetchFilterInterface.
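
To make the pre-fetch idea concrete, here is a rough, language-agnostic sketch of the logic in Python. It is not the spider's API and does not use tomverran/robots: the class name `RobotsTxtPreFetchFilter`, its `match` method, and the convention that `True` means "skip this URI" are all assumptions for illustration. Only the standard library's `urllib.robotparser` is used.

```python
from urllib.robotparser import RobotFileParser


class RobotsTxtPreFetchFilter:
    """Hypothetical filter: decides, before fetching, whether a URI is disallowed."""

    def __init__(self, robots_txt: str, user_agent: str = "*"):
        self.user_agent = user_agent
        self.parser = RobotFileParser()
        # parse() expects the robots.txt body as an iterable of lines.
        self.parser.parse(robots_txt.splitlines())

    def match(self, uri: str) -> bool:
        """Return True when the URI should be skipped (assumed convention)."""
        return not self.parser.can_fetch(self.user_agent, uri)


# Usage: build the filter once per host, then ask it about each candidate URI.
robots_body = "User-agent: *\nDisallow: /private/\n"
robots_filter = RobotsTxtPreFetchFilter(robots_body)
skip_private = robots_filter.match("https://example.com/private/page.html")
skip_public = robots_filter.match("https://example.com/index.html")
```

In a real crawler the robots.txt body would be downloaded and cached per host before the first page request; the string above is only sample data.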

I'm not sure about 3 and 4, hope this helps.

solverat · 2017-08-18T19:04:49Z

mvdbos · 2019-06-16T00:12:46Z

Good answers by @danvuquoc and @solverat. Closing this ancient issue. :-)

mvdbos closed this as completed Jun 16, 2019
