Automatically detect and parse sitemap.xml #19

Open
postmodern opened this issue Oct 16, 2010 · 7 comments
@postmodern (Owner)

Automatically detecting and parsing /sitemap.xml might be a good way to cut down on spidering depth.

@ghost assigned postmodern May 8, 2012

@nofxx commented May 23, 2016

Going to do this before invoking spidr... using a little gem: https://github.com/benbalter/sitemap-parser
If it looks good, I'll open a pull request for something like sitemap: true, to look similar to robots: true.
Sound good? Thank you for spidr!

@postmodern (Owner, Author)

I'd prefer that /sitemap.xml be requested by the agent, and that we own the parsing logic. The XML schema seems simple enough; we could probably parse it with a single XPath?
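For reference, a minimal sketch of what that single XPath could look like, assuming Nokogiri is available (not necessarily what spidr would ship with):

```ruby
require 'nokogiri'

# Parse a downloaded sitemap.xml; the namespace is the one defined at sitemaps.org.
xml = Nokogiri::XML(File.read('sitemap.xml'))

# Every <loc> element (under <url>, or <sitemap> in an index file) holds a URL;
# one namespaced XPath pulls them all out.
urls = xml.xpath('//sm:loc/text()',
                 'sm' => 'http://www.sitemaps.org/schemas/sitemap/0.9').map(&:to_s)
```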

@nofxx commented May 23, 2016

Cool, I'll try it that way.
Also, we need robots support for that: the filename 'sitemap.xml' isn't standard.
I've seen sitemaps under different names, and the name is given by the Sitemap: key in robots.txt.
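For illustration, a small sketch of pulling the Sitemap: entries out of robots.txt, using only the Ruby standard library (the host is a placeholder):

```ruby
require 'net/http'
require 'uri'

# Fetch robots.txt and collect every "Sitemap:" directive; the spec allows
# more than one, each giving an absolute sitemap URL.
robots = Net::HTTP.get(URI('https://example.com/robots.txt'))

sitemap_urls = robots.each_line
                     .grep(/\ASitemap:/i)
                     .map { |line| line.split(':', 2).last.strip }
```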

@nofxx commented May 24, 2016

Ah, sorry... that's implicit in the subject, 'Automatically detect ... sitemap.xml'. Is there any other way to discover it besides robots.txt?
Also, maybe the sitemap: option could accept true, false, or a URL such as http://url.to/sitemap.xml.

@postmodern (Owner, Author)

Was going to say TIL! I always thought /sitemap.xml was a de facto standard and not configurable.

I see three possible implementations for the sitemap: option:

  1. Implicitly enable robots: if sitemap: is enabled.
  2. Allow mixing robots: with sitemap:. If robots: is not specified, fall back to /sitemap.xml. This would have to be documented (a rough sketch of this follows below).
  3. Add another option to indicate that you wish to infer the sitemap location from /robots.txt.
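A hypothetical sketch of how option 2 might resolve the sitemap URL; resolve_sitemap_url and robots_sitemap are made-up names, not existing spidr API:

```ruby
require 'uri'

# Resolve the sitemap URL from an explicit location, a robots.txt-advertised
# location, or the documented /sitemap.xml fallback.
def resolve_sitemap_url(base_url, sitemap:, robots_sitemap: nil)
  case sitemap
  when String
    URI.join(base_url, sitemap)                       # explicit path or URL
  when true
    URI.join(base_url, robots_sitemap || '/sitemap.xml')  # fallback
  end
end

resolve_sitemap_url('https://example.com/', sitemap: true)
# => #<URI::HTTPS https://example.com/sitemap.xml>
```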

@kcalmes commented Dec 26, 2017

If this sitemap feature is in the source already, I did not see it. If it is there, could you point me to it? If not, is it still being developed, and is there an update on progress? If not, is there a workaround? I am currently parsing a sitemap myself, but I cannot find a way to feed the results into the crawler.

My need arose because newer websites render client-side with JavaScript and do not use traditional anchor tags for links, so they are not crawlable. The only way I can get some data is to have the crawler seed in the URLs from the sitemap; otherwise crawling stops on the homepage.
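As a possible workaround, something along these lines might work, assuming Spidr::Agent#enqueue behaves as expected; the sitemap parsing step is hand-rolled here and the URLs are placeholders:

```ruby
require 'spidr'
require 'nokogiri'
require 'open-uri'

# Parse the sitemap ourselves, then feed every URL into the crawl queue.
sitemap   = Nokogiri::XML(URI.open('https://example.com/sitemap.xml'))
seed_urls = sitemap.xpath('//xmlns:loc/text()').map(&:to_s)

Spidr.site('https://example.com/') do |agent|
  # Assumes Agent#enqueue adds a URL to the pending queue before crawling starts.
  seed_urls.each { |url| agent.enqueue(url) }
end
```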

@buren (Contributor) commented Aug 20, 2018

Sitemaps also have index files that in turn define the locations of other sitemaps.
They can be gzipped, and other common locations are (a probing sketch follows this list):

sitemap_index.xml.gz
sitemap-index.xml.gz
sitemap_index.xml
sitemap-index.xml
sitemap.xml.gz
sitemap.xml
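A small sketch of probing those locations with the standard library only (the host is a placeholder), gunzipping when the path ends in .gz:

```ruby
require 'net/http'
require 'uri'
require 'zlib'
require 'stringio'

CANDIDATES = %w[
  /sitemap_index.xml.gz /sitemap-index.xml.gz
  /sitemap_index.xml    /sitemap-index.xml
  /sitemap.xml.gz       /sitemap.xml
]

# Try each candidate path in turn and return the first sitemap body found,
# decompressing it if the path indicates a gzipped file.
def find_sitemap(host)
  CANDIDATES.each do |path|
    res = Net::HTTP.get_response(URI("https://#{host}#{path}"))
    next unless res.is_a?(Net::HTTPSuccess)

    body = res.body
    body = Zlib::GzipReader.new(StringIO.new(body)).read if path.end_with?('.gz')
    return body
  end
  nil
end
```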

As for the sitemap: option:

Spidr.site(url, sitemap: :robots) # check /robots.txt
Spidr.site(url, sitemap: true) # check default locations, maybe /robots.txt first?
Spidr.site(url, sitemap: '/some-non-default-location.xml')

Q: Should we only queue the URLs found in the sitemap, or keep crawling the site as well?
Q: What if robots.txt is 404, errors, or doesn't define a sitemap location?
Q: What if the sitemap location is 404 or errors?

Sitemap protocol: https://www.sitemaps.org/protocol.html
