# Sitemap

`SitemapLoader` extends `WebBaseLoader`. It loads a sitemap from a given URL, then scrapes and loads every page listed in the sitemap, returning each page as a Document. This is helpful when you are provided with a sitemap that contains all the pages you wish to use as Documents.

The scraping is done concurrently. Concurrent requests are limited, defaulting to 2 per second. You can raise this limit if you control the server being scraped or are not concerned about the load you place on it. Note that while a higher limit speeds up scraping, it may cause the server to block you. Be careful!
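To make the idea of capped concurrency concrete, here is a minimal standard-library sketch. It is not the loader's actual implementation: `fetch_all`, `fake_fetch`, and the `example.com` URLs are hypothetical names used only for illustration, and the fake fetch stands in for a real HTTP request so the code runs offline.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrent=2):
    """Run fetch(url) for every URL with at most max_concurrent in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded_fetch(url):
        # Wait for a free slot before starting the request.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


# Track how many fake requests overlap, to check the cap is respected.
stats = {"in_flight": 0, "peak": 0}


async def fake_fetch(url):
    # Hypothetical stand-in for an HTTP request.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return f"<html>{url}</html>"


urls = [f"https://example.com/page-{i}" for i in range(6)]
pages = asyncio.run(fetch_all(urls, fake_fetch, max_concurrent=2))
print(len(pages), "pages fetched; peak concurrency:", stats["peak"])
```

Raising `max_concurrent` in this sketch finishes sooner but sends more simultaneous requests to the server, which is the trade-off described above.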

## Basic Example

Let's run through a basic example of how to use the `SitemapLoader` on the [Semrush Features Sitemap](https://www.semrush.com/features/sitemap/).

### Library Installation

Before starting, let's make sure we have installed the libraries needed to run our code examples.

In [None]:
%pip install --upgrade --quiet nest_asyncio langchain_community

### Asyncio Bug Fix

The code block below should always be run to fix a bug with asyncio and Jupyter.

In [3]:
import nest_asyncio

nest_asyncio.apply()

In [1]:
from langchain_community.document_loaders.sitemap import SitemapLoader



In [34]:
sitemap_loader = SitemapLoader(web_path="https://www.semrush.com/features/sitemap/")

docs = sitemap_loader.load()

Fetching pages: 100%|###########################| 53/53 [00:05<00:00,  9.38it/s]


Let's examine the first document we loaded.

In [35]:
docs[0].metadata

{'source': 'https://www.semrush.com/features/',
 'loc': 'https://www.semrush.com/features/',
 'changefreq': 'daily'}

In [37]:
print(docs[0].page_content[:250].replace('\n',''))

            Features | Semrush            Skip to content    Your browser is out of date. The site might not be displayed correctly. Please update your


Great! That looks like the first page on the sitemap, and we are receiving the proper metadata and page content in a parsed format. Let's now look at some variations we can make to our basic example.

## More Examples

### Adding a Parsing Function

In the basic example, the loader returns all of the text on each page, which in most cases is not what we want. To address this, we can pass the `parsing_function` parameter, which lets us control how the returned HTML is parsed. In the example below we define a parser that removes all `title` elements and returns the text content of all the other elements.

In [None]:
%pip install beautifulsoup4

In [43]:
from bs4 import BeautifulSoup

def remove_nav_and_header_elements(content: BeautifulSoup) -> str:
    # Find all 'title' elements in the BeautifulSoup object
    title_elements = content.find_all("title")

    # Remove each 'title' element from the BeautifulSoup object
    for element in title_elements:
        element.decompose()

    # Return the text of everything that remains
    return str(content.get_text())

Let's add our custom parsing function to the `SitemapLoader` object.

In [44]:
sitemap_loader = SitemapLoader(
    "https://www.semrush.com/features/sitemap/",
    parsing_function=remove_nav_and_header_elements,
)
docs = sitemap_loader.load()

Fetching pages: 100%|###########################| 53/53 [00:05<00:00,  9.11it/s]


In [45]:
print(docs[0].page_content[:250].replace('\n',''))

        Skip to content    Your browser is out of date. The site might not be displayed correctly. Please update your browser.    


As we can see, the title element containing the string "Features | Semrush" has been removed, and our loader only returns the values of the other elements.

### Filtering sitemap URLs

Sitemaps can be massive files, with thousands of URLs.  Often you don't need every single one of them.  You can filter the URLs by passing a list of strings or regex patterns to the `filter_urls` parameter.  Only URLs that match one of the patterns will be loaded. In this case, let's find URLs that contain the string "ppc".

In [47]:
sitemap_loader = SitemapLoader(
    web_path="https://www.semrush.com/features/sitemap/",
    filter_urls=[".*ppc.*"],
)
docs = sitemap_loader.load()

Fetching pages: 100%|#############################| 3/3 [00:00<00:00,  7.16it/s]


As we can see, we only pulled 3 documents instead of 53 - let's take a look at the metadata of the first document to ensure it actually pulled the URLs we wanted.

In [48]:
docs[0].metadata

{'source': 'https://www.semrush.com/features/ppc-keyword-research-tools/',
 'loc': 'https://www.semrush.com/features/ppc-keyword-research-tools/',
 'changefreq': 'daily'}

As we can see, this URL does indeed contain the string "ppc", which is exactly what we expected.
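The pattern above is written as `.*ppc.*` rather than plain `ppc`. The following sketch shows why that matters, under the assumption that each pattern is matched from the start of the URL, as Python's `re.match` does. The helper `filter_sitemap_urls` and the `example.com` URLs are hypothetical and are not part of the loader's API.

```python
import re


def filter_sitemap_urls(urls, patterns):
    """Keep URLs that match at least one pattern from the start of the string."""
    return [url for url in urls if any(re.match(p, url) for p in patterns)]


# Hypothetical URLs standing in for the entries of a sitemap.
urls = [
    "https://example.com/features/",
    "https://example.com/features/ppc-keyword-research-tools/",
    "https://example.com/features/seo-audit/",
]

# The leading '.*' lets the pattern skip over the scheme and host.
kept = filter_sitemap_urls(urls, [".*ppc.*"])
print(kept)

# A bare 'ppc' is anchored at the start of the URL, so nothing matches.
print(filter_sitemap_urls(urls, ["ppc"]))
```

If substring matching is what you want, wrapping the term in `.*` on both sides, as in the example above, is the safer form.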

### Local Sitemap

The sitemap loader can also be used to load local files, as shown in the code example below.

In [59]:
sitemap_loader = SitemapLoader(web_path="example_data/sitemap.xml", is_local=True)

docs = sitemap_loader.load()

Fetching pages: 100%|#############################| 3/3 [00:00<00:00, 16.53it/s]
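For reference, a local sitemap is an XML file in the sitemaps.org format. The sketch below writes a minimal one to a temporary directory and reads its entries back with the standard library, which is a quick way to check a file before handing it to the loader. The file contents, the `example.com` URLs, and the helper `read_sitemap_entries` are assumptions for illustration; this is not how the loader itself parses the file.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# A minimal sitemap with two hypothetical pages.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://example.com/about/</loc>
    <changefreq>monthly</changefreq>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def read_sitemap_entries(path):
    """Return one dict per <url> element, keyed by child tag name."""
    root = ET.parse(str(path)).getroot()
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for child in url:
            # Strip the '{namespace}' prefix ElementTree adds to tag names.
            tag = child.tag.split("}", 1)[-1]
            entry[tag] = (child.text or "").strip()
        entries.append(entry)
    return entries


with tempfile.TemporaryDirectory() as tmp:
    sitemap_path = Path(tmp) / "sitemap.xml"
    sitemap_path.write_text(SITEMAP_XML, encoding="utf-8")
    entries = read_sitemap_entries(sitemap_path)

print(entries)
```

The `loc` and `changefreq` fields here correspond to the metadata keys of the same names shown in the earlier examples.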


## More Topics

There are a variety of other changes you can make to the functionality of the base `SitemapLoader`. For example, you can change the `requests_per_second` parameter to raise the limit on concurrent requests, and use `requests_kwargs` to pass keyword arguments when sending requests. To learn about all the possible modifications, read the API reference.