Automatically detect and parse sitemap.xml #19

Open
postmodern opened this issue Oct 16, 2010 · 7 comments
@postmodern (Owner)

Automatically detecting and parsing /sitemap.xml might be a good way to cut down on spidering depth.

@ghost assigned postmodern May 8, 2012

@nofxx commented May 23, 2016

Going to do this before invoking spidr... using a little gem: https://github.com/benbalter/sitemap-parser
If it looks good, I'll open a pull request for something like sitemap: true, to look similar to robots: true.
Sound good? Thank you for spidr!

@postmodern (Owner, Author)

I'd prefer that /sitemap.xml be requested by the agent, and that we own the parsing logic. The XML schema seems simple enough; we could probably parse it with a single XPath?
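For reference, a minimal sketch of what that single XPath could look like, assuming Nokogiri is available (not necessarily what spidr would ship with):

```ruby
require 'nokogiri'

# Parse a downloaded sitemap.xml; the namespace is the one defined at sitemaps.org.
xml = Nokogiri::XML(File.read('sitemap.xml'))

# Every <loc> element (under <url>, or <sitemap> in an index file) holds a URL;
# one namespaced XPath pulls them all out.
urls = xml.xpath('//sm:loc/text()',
                 'sm' => 'http://www.sitemaps.org/schemas/sitemap/0.9').map(&:to_s)
```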

@nofxx commented May 23, 2016

Cool, I'll try it that way.
Also, we need robots support for that: the filename 'sitemap.xml' isn't standard.
I've seen sitemaps under different names, and the name is given by the Sitemap: key in robots.txt.
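For illustration, a small sketch of pulling the Sitemap: entries out of robots.txt, using only the Ruby standard library (the host is a placeholder):

```ruby
require 'net/http'
require 'uri'

# Fetch robots.txt and collect every "Sitemap:" directive; the spec allows
# more than one, each giving an absolute sitemap URL.
robots = Net::HTTP.get(URI('https://example.com/robots.txt'))

sitemap_urls = robots.each_line
                     .grep(/\ASitemap:/i)
                     .map { |line| line.split(':', 2).last.strip }
```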

@nofxx commented May 24, 2016

Ah, sorry... that's implicit in the subject, 'Automatically detect ... sitemap.xml'. Is there any other way to discover it besides robots.txt?
Also, maybe the sitemap: option could accept true, false, or a URL such as http://url.to/sitemap.xml.

@postmodern (Owner, Author)

Was going to say TIL! I always thought /sitemap.xml was a de facto standard and not configurable.

I see three possible implementations for the sitemap: option:

  1. Implicitly enable robots: if sitemap: is enabled.
  2. Allow mixing robots: with sitemap:. If robots: is not specified, fall back to /sitemap.xml. This would have to be documented (a rough sketch of this follows below).
  3. Add another option to indicate that you wish to infer the sitemap location from /robots.txt.
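A hypothetical sketch of how option 2 might resolve the sitemap URL; resolve_sitemap_url and robots_sitemap are made-up names, not existing spidr API:

```ruby
require 'uri'

# Resolve the sitemap URL from an explicit location, a robots.txt-advertised
# location, or the documented /sitemap.xml fallback.
def resolve_sitemap_url(base_url, sitemap:, robots_sitemap: nil)
  case sitemap
  when String
    URI.join(base_url, sitemap)                       # explicit path or URL
  when true
    URI.join(base_url, robots_sitemap || '/sitemap.xml')  # fallback
  end
end

resolve_sitemap_url('https://example.com/', sitemap: true)
# => #<URI::HTTPS https://example.com/sitemap.xml>
```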

@kcalmes commented Dec 26, 2017

If this sitemap feature is in the source already, I did not see it. If it is there, could you point me to it? If not, is it still being developed, and is there an update on progress? If not, is there a workaround? I am currently parsing a sitemap myself, but I cannot find a way to feed the results into the crawler.

My need arose because newer websites render client-side with JavaScript and do not use traditional anchor tags for links, so they are not crawlable. The only way I can get some data is to have the crawler seed in the URLs from the sitemap; otherwise crawling stops on the homepage.
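As a possible workaround, something along these lines might work, assuming Spidr::Agent#enqueue behaves as expected; the sitemap parsing step is hand-rolled here and the URLs are placeholders:

```ruby
require 'spidr'
require 'nokogiri'
require 'open-uri'

# Parse the sitemap ourselves, then feed every URL into the crawl queue.
sitemap   = Nokogiri::XML(URI.open('https://example.com/sitemap.xml'))
seed_urls = sitemap.xpath('//xmlns:loc/text()').map(&:to_s)

Spidr.site('https://example.com/') do |agent|
  # Assumes Agent#enqueue adds a URL to the pending queue before crawling starts.
  seed_urls.each { |url| agent.enqueue(url) }
end
```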

@buren (Contributor) commented Aug 20, 2018

Sitemaps also have index files that in turn define the locations of other sitemaps.
They can be gzipped, and other common locations are (a probing sketch follows this list):

sitemap_index.xml.gz
sitemap-index.xml.gz
sitemap_index.xml
sitemap-index.xml
sitemap.xml.gz
sitemap.xml
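A small sketch of probing those locations with the standard library only (the host is a placeholder), gunzipping when the path ends in .gz:

```ruby
require 'net/http'
require 'uri'
require 'zlib'
require 'stringio'

CANDIDATES = %w[
  /sitemap_index.xml.gz /sitemap-index.xml.gz
  /sitemap_index.xml    /sitemap-index.xml
  /sitemap.xml.gz       /sitemap.xml
]

# Try each candidate path in turn and return the first sitemap body found,
# decompressing it if the path indicates a gzipped file.
def find_sitemap(host)
  CANDIDATES.each do |path|
    res = Net::HTTP.get_response(URI("https://#{host}#{path}"))
    next unless res.is_a?(Net::HTTPSuccess)

    body = res.body
    body = Zlib::GzipReader.new(StringIO.new(body)).read if path.end_with?('.gz')
    return body
  end
  nil
end
```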

As for the sitemap: option:

Spidr.site(url, sitemap: :robots) # check /robots.txt
Spidr.site(url, sitemap: true) # check default locations, maybe /robots.txt first?
Spidr.site(url, sitemap: '/some-non-default-location.xml')

Q: Should we only queue the URLs found in the sitemap, or keep crawling the site as well?
Q: What if robots.txt is 404, errors, or doesn't define a sitemap location?
Q: What if the sitemap location is 404 or errors?

Sitemap protocol: https://www.sitemaps.org/protocol.html
