@koolma, I'm not a contributor or maintainer of this project but I have used this spider.
It doesn't support JavaScript; this isn't a headless browser. It simply uses Guzzle 6 to fetch pages. Typically you then use a CSS/XPath discovery class to find new URIs, and so more pages to crawl.
It doesn't follow robots.txt files, but you could easily write a filter that implements the PreFetchFilterInterface and uses tomverran/robots to match the path of the URI about to be fetched by the spider against the robots.txt rules. I've done the same for robots headers and robots meta tags by implementing a PostFetchFilterInterface.
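To give a rough idea of what such a pre-fetch check does, here is a minimal, language-agnostic sketch. It is written in Python with the standard-library robots.txt parser rather than tomverran/robots, and the class name, the `match` method and the example URLs are assumptions made for illustration; this is not the spider's actual interface. The convention assumed here is that `match` returns true when the URI should be skipped.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsTxtPreFetchFilter:
    """Hypothetical pre-fetch filter (illustrative name, not a real library class).

    match() returns True when the URI is disallowed and should be skipped.
    """

    def __init__(self, robots_txt: str, user_agent: str = "*"):
        self.user_agent = user_agent
        self.parser = RobotFileParser()
        # Parse robots.txt content that has already been downloaded.
        self.parser.parse(robots_txt.splitlines())

    def match(self, uri: str) -> bool:
        # Only the path is compared against the robots.txt rules.
        path = urlsplit(uri).path or "/"
        return not self.parser.can_fetch(self.user_agent, path)


# Example robots.txt content, made up for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

robots_filter = RobotsTxtPreFetchFilter(ROBOTS_TXT, user_agent="my-spider")
for uri in ("https://example.com/public/index.html",
            "https://example.com/private/page.html"):
    print(uri, "skip" if robots_filter.match(uri) else "fetch")
```

In a real crawl you would download each host's robots.txt once and cache the parsed rules, so the check costs no extra request per URI.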
Hi,
I have some questions regarding the features of this crawler, which are not covered by the documentation.