Automatically detect and parse sitemap.xml #19
Going to do this before invoking spidr... using a lil gem: https://github.com/benbalter/sitemap-parser
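For anyone following along, here is a rough sketch of that approach: pull the URLs out of the sitemap with the sitemap-parser gem before handing anything to Spidr. The sitemap URL is a placeholder, and the `SitemapParser.new` / `#to_a` calls are as I understand that gem's API, so double-check against its README:

```ruby
require 'sitemap-parser'

# Fetch and parse the sitemap up front; recurse: true also follows
# sitemap index files down to the individual sitemaps.
sitemap = SitemapParser.new('https://example.com/sitemap.xml', recurse: true)

# Collect the listed page URLs to seed the crawl with afterwards.
seed_urls = sitemap.to_a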
I'd prefer the ...
Cool, gonna try that way.
Ah, sorry... that's implicit in the subject 'Automatically detect xml...'. Any other way besides robots.txt?
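For context on the robots.txt route: the sitemaps protocol lets a site advertise its sitemaps via Sitemap: lines in robots.txt. A minimal hand-rolled sketch of reading those lines (hypothetical host, bare-bones error handling):

```ruby
require 'net/http'
require 'uri'

# Collect any Sitemap: directives advertised in a site's robots.txt.
def sitemaps_from_robots(base_url)
  robots = Net::HTTP.get(URI.join(base_url, '/robots.txt'))
  robots.scan(/^\s*Sitemap:\s*(\S+)/i).flatten
rescue StandardError
  []
end

sitemaps_from_robots('https://example.com')
# e.g. => ["https://example.com/sitemap.xml"]
```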
Was going to say TIL! Always thought... I see three possible implementations for the sitemap option.
If this sitemap feature is in the source already, I did not see it; if it is there, could you point me to it? If not, is it still being developed, and is there an update on progress? If not, is there a workaround? I am parsing a sitemap currently, but I cannot seem to find a way to feed my results into the crawler. My need has come about because newer websites render client-side with JS and do not use traditional anchor-tag structures for links, so they are not crawlable. The only way I can get some data is to have the crawler seed in the URLs from the sitemap; otherwise crawling stops on the homepage.
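One possible workaround, until built-in sitemap support exists, is to seed the agent's queue with the URLs pulled from the sitemap and then run it. This is only a sketch: the host and URLs are placeholders, and it assumes Spidr::Agent#enqueue and #run behave as in current spidr releases:

```ruby
require 'spidr'

# URLs gathered from your own sitemap parsing (placeholders here).
seed_urls = [
  'https://example.com/products/1',
  'https://example.com/products/2'
]

agent = Spidr::Agent.new(hosts: ['example.com'])
agent.every_page { |page| puts page.url }

# Seed the queue so JS-only pages with no crawlable links still get visited.
seed_urls.each { |url| agent.enqueue(url) }
agent.run
```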
Sitemaps also have index files that in turn define locations for other sitemaps.
As for the sitemap option:

```ruby
Spidr.site(url, sitemap: :robots)  # check /robots.txt
Spidr.site(url, sitemap: true)     # check default locations, maybe /robots.txt first?
Spidr.site(url, sitemap: '/some-non-default-location.xml')
```

Q: Only queue all the URLs found in the sitemap, or keep crawling the site?

Sitemap protocol: https://www.sitemaps.org/protocol.html
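On the index-file point, per the protocol linked above: a sitemap index wraps `<sitemap><loc>` entries that point at further sitemaps, while an ordinary sitemap wraps `<url><loc>` entries, so detection has to recurse. A rough Nokogiri sketch (hypothetical URL, no error handling):

```ruby
require 'net/http'
require 'uri'
require 'nokogiri'

# Return page URLs from a sitemap, recursing through sitemap index files.
def sitemap_page_urls(sitemap_url)
  doc = Nokogiri::XML(Net::HTTP.get(URI(sitemap_url)))
  doc.remove_namespaces!

  # A sitemap index lists other sitemaps under <sitemapindex>/<sitemap>/<loc>.
  nested = doc.xpath('//sitemapindex/sitemap/loc').map(&:text)
  return nested.flat_map { |loc| sitemap_page_urls(loc) } unless nested.empty?

  # A regular sitemap lists pages under <urlset>/<url>/<loc>.
  doc.xpath('//urlset/url/loc').map(&:text)
end

sitemap_page_urls('https://example.com/sitemap.xml')
```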
Automatically detecting and parsing `/sitemap.xml` might be a good way to cut down on spidering depth.