# How To Easily Find All Of The Sitemap.xml Files In Python

-----------------------------------------------------------------

To analyse websites effectively, knowing <strong>how to download all of the [sitemap.xml files](https://en.wikipedia.org/wiki/Sitemaps)</strong> for a particular website is an incredibly useful skill.

Fortunately, there are [Python packages](https://pypi.org/project/ultimate-sitemap-parser/) that let us easily download all of a website's sitemap.xml files!

--------------------------------------------------------

<strong> NB: If you're using a standard Python environment, run the commands below without the ! symbol. This guide uses !pip install because it was written in a Jupyter notebook. </strong>

In [3]:
!pip install ultimate-sitemap-parser
!pip install requests



In [5]:
from usp.tree import sitemap_tree_for_homepage
import requests

--------------------------------------------------------

## Download all of the Sitemap.xml files based upon the URL of the homepage:

In [6]:
tree = sitemap_tree_for_homepage('https://understandingdata.com/')
print(tree)

2020-12-10 09:05:10,683 INFO usp.fetch_parse [35788/MainThread]: Fetching level 0 sitemap from https://understandingdata.com/robots.txt...
2020-12-10 09:05:10,684 INFO usp.helpers [35788/MainThread]: Fetching URL https://understandingdata.com/robots.txt...
2020-12-10 09:05:10,820 INFO usp.fetch_parse [35788/MainThread]: Parsing sitemap from URL https://understandingdata.com/robots.txt...
2020-12-10 09:05:10,822 INFO usp.fetch_parse [35788/MainThread]: Fetching level 0 sitemap from https://understandingdata.com/sitemap.xml...
2020-12-10 09:05:10,822 INFO usp.helpers [35788/MainThread]: Fetching URL https://understandingdata.com/sitemap.xml...
2020-12-10 09:05:11,277 INFO usp.fetch_parse [35788/MainThread]: Parsing sitemap from URL https://understandingdata.com/sitemap.xml...
2020-12-10 09:05:11,286 INFO usp.fetch_parse [35788/MainThread]: Fetching level 0 sitemap from https://understandingdata.com/sitemap-index.xml...
2020-12-10 09:05:11,287 INFO usp.helpers [35788/MainThread]: Fetching

IndexWebsiteSitemap(url=https://understandingdata.com/, sub_sitemaps=[IndexRobotsTxtSitemap(url=https://understandingdata.com/robots.txt, sub_sitemaps=[PagesXMLSitemap(url=https://understandingdata.com/sitemap.xml, pages=[SitemapPage(url=https://understandingdata.com/, priority=0.5, last_modified=2020-07-31 14:47:00+00:00, change_frequency=None, news_story=None), SitemapPage(url=https://understandingdata.com/privacy-policy/, priority=0.5, last_modified=2018-08-17 08:43:00+00:00, change_frequency=None, news_story=None), SitemapPage(url=https://understandingdata.com/contact/, priority=0.5, last_modified=2019-09-15 21:41:00+00:00, change_frequency=None, news_story=None), SitemapPage(url=https://understandingdata.com/blog/, priority=0.5, last_modified=2019-10-19 09:53:00+00:00, change_frequency=None, news_story=None), SitemapPage(url=https://understandingdata.com/data-engineering-services/, priority=0.5, last_modified=2020-05-25 20:44:00+00:00, change_frequency=None, news_story=None), Site

After running the function below, the package has checked robots.txt and a series of common sitemap locations (the "brute force" approach visible in the log output) and the resulting sitemap tree is saved to a variable called tree:
~~~
tree = sitemap_tree_for_homepage('https://homepageurl.com')
~~~

----

sitemap_tree_for_homepage() returns a tree of AbstractSitemap subclass objects that represent the sitemap hierarchy found on a given website.
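
As a rough sketch of how that hierarchy could be walked to list the sitemap files themselves (rather than the pages), the helper below recurses through any object that exposes a `url` attribute and, for index sitemaps, a `sub_sitemaps` list. These are the attribute names visible in the printed tree. The function name `iter_sitemap_urls` is hypothetical and not part of ultimate-sitemap-parser, and the stand-in objects exist only so the example runs without network access.

```python
from types import SimpleNamespace


def iter_sitemap_urls(sitemap):
    """Yield the URL of every sitemap in the tree, depth first.

    Hypothetical helper (not part of ultimate-sitemap-parser). It assumes each
    node has a `url` attribute and that index nodes also have `sub_sitemaps`.
    """
    yield sitemap.url
    # Page-level sitemaps may have no sub_sitemaps, so default to an empty list.
    for child in getattr(sitemap, "sub_sitemaps", []):
        yield from iter_sitemap_urls(child)


# Stand-in tree mirroring the structure printed earlier (no network needed).
demo_tree = SimpleNamespace(
    url="https://understandingdata.com/",
    sub_sitemaps=[
        SimpleNamespace(
            url="https://understandingdata.com/robots.txt",
            sub_sitemaps=[
                SimpleNamespace(url="https://understandingdata.com/sitemap.xml"),
            ],
        ),
    ],
)

sitemap_urls = list(iter_sitemap_urls(demo_tree))
print(sitemap_urls)
```

With a real tree you would call `list(iter_sitemap_urls(tree))` instead. Keeping only the URLs that end in `.xml` would drop the homepage and robots.txt entries.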

To find all of the pages we can simply do this:

In [10]:
# all_pages() returns an Iterator
for page in tree.all_pages():
    print(page)

SitemapPage(url=https://understandingdata.com/, priority=0.5, last_modified=2020-07-31 14:47:00+00:00, change_frequency=None, news_story=None)
SitemapPage(url=https://understandingdata.com/privacy-policy/, priority=0.5, last_modified=2018-08-17 08:43:00+00:00, change_frequency=None, news_story=None)
SitemapPage(url=https://understandingdata.com/contact/, priority=0.5, last_modified=2019-09-15 21:41:00+00:00, change_frequency=None, news_story=None)
SitemapPage(url=https://understandingdata.com/blog/, priority=0.5, last_modified=2019-10-19 09:53:00+00:00, change_frequency=None, news_story=None)
SitemapPage(url=https://understandingdata.com/data-engineering-services/, priority=0.5, last_modified=2020-05-25 20:44:00+00:00, change_frequency=None, news_story=None)
SitemapPage(url=https://understandingdata.com/data-science-and-analytics-services/, priority=0.5, last_modified=2020-05-28 15:06:00+00:00, change_frequency=None, news_story=None)
SitemapPage(url=https://understandingdata.com/digita
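
Each SitemapPage carries the fields shown in the output above (`url`, `priority`, `last_modified` and so on). As an illustrative sketch, the snippet below sorts pages by `last_modified` to surface the most recently updated ones. The helper name `most_recent_pages` is hypothetical, and the stand-in objects reuse values from the output so the example runs without fetching anything.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def most_recent_pages(pages, limit=5):
    """Return up to `limit` pages, newest `last_modified` first.

    Hypothetical helper; pages without a last_modified value are skipped.
    """
    dated = [page for page in pages if page.last_modified is not None]
    return sorted(dated, key=lambda page: page.last_modified, reverse=True)[:limit]


# Stand-in pages with values copied from the output above.
demo_pages = [
    SimpleNamespace(
        url="https://understandingdata.com/",
        last_modified=datetime(2020, 7, 31, 14, 47, tzinfo=timezone.utc),
    ),
    SimpleNamespace(
        url="https://understandingdata.com/privacy-policy/",
        last_modified=datetime(2018, 8, 17, 8, 43, tzinfo=timezone.utc),
    ),
    SimpleNamespace(
        url="https://understandingdata.com/contact/",
        last_modified=datetime(2019, 9, 15, 21, 41, tzinfo=timezone.utc),
    ),
]

for page in most_recent_pages(demo_pages, limit=2):
    print(page.last_modified.date(), page.url)
```

With a real tree, `most_recent_pages(tree.all_pages())` should work the same way, assuming every page's `last_modified` is either a timezone-aware datetime or None.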

You can also save all of the URLs to a new variable via a list comprehension:

In [17]:
urls = [page.url for page in tree.all_pages()]

In [18]:
print(len(urls))

120


In [19]:
print(urls[0:2])

['https://understandingdata.com/', 'https://understandingdata.com/privacy-policy/']
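
Sitemaps can list the same URL more than once (for example, when a page appears in several sub-sitemaps), so it may be worth de-duplicating the list before analysing it. The sketch below uses only the standard library to remove duplicates while keeping the original order, then counts URLs by their first path segment. The names `unique_urls` and `count_by_section` are hypothetical, and the sample list is made up from URLs shown earlier.

```python
from collections import Counter
from urllib.parse import urlparse


def unique_urls(urls):
    """Drop duplicate URLs while preserving their original order."""
    return list(dict.fromkeys(urls))


def count_by_section(urls):
    """Count URLs by first path segment; the homepage is counted under '/'."""
    counts = Counter()
    for url in urls:
        segments = [part for part in urlparse(url).path.split("/") if part]
        counts[segments[0] if segments else "/"] += 1
    return counts


sample_urls = [
    "https://understandingdata.com/",
    "https://understandingdata.com/privacy-policy/",
    "https://understandingdata.com/blog/",
    "https://understandingdata.com/blog/",  # duplicate on purpose
]

deduped = unique_urls(sample_urls)
print(len(deduped))
print(count_by_section(deduped))
```

Swapping `sample_urls` for the `urls` list built above would give a quick overview of which sections of the site hold the most pages.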


------------------------------------------------------------------------------------------

## Conclusion

Now you'll hopefully be able to easily find all of a website's sitemap.xml files, and the web pages they list, in just a few lines of Python code!