# Recursive URL Loader

We may want to load all URLs under a root directory.

For example, let's look at the [Python 3.9 documentation](https://docs.python.org/3.9/).

This has many interesting child pages that we may want to read in bulk.

Of course, the `WebBaseLoader` can load a list of pages. 

The challenge, however, is traversing the tree of child pages and assembling that list.
 
We do this using the `RecursiveUrlLoader`.

This also gives us the flexibility to exclude some children (e.g., the `api` directory, which has more than 800 child pages).
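To make the traversal problem concrete, here is a minimal sketch of a depth-limited, breadth-first walk over a small in-memory link map. It is not the `RecursiveUrlLoader` implementation: the `SITE` dictionary, the `collect_urls` helper, and the convention that the root counts as depth 1 are all assumptions made for illustration.

```python
from collections import deque
from urllib.parse import urljoin

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "https://example.com/docs/": ["tutorial/", "api/", "https://other.example.org/"],
    "https://example.com/docs/tutorial/": ["intro.html"],
    "https://example.com/docs/api/": ["a.html", "b.html"],
    "https://example.com/docs/tutorial/intro.html": [],
}


def collect_urls(root, max_depth=2, exclude_dirs=(), prevent_outside=True):
    """Walk breadth-first from `root` and return the visited URLs in order.

    Assumption of this sketch: the root is depth 1, and pages that sit at
    `max_depth` are collected but their links are not followed.
    """
    order = [root]
    seen = {root}
    queue = deque([(root, 1)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # deep enough: do not follow this page's links
        for href in SITE.get(url, []):
            child = urljoin(url, href)  # resolve relative links
            if child in seen:
                continue
            if prevent_outside and not child.startswith(root):
                continue  # stay under the root URL
            if any(child.startswith(prefix) for prefix in exclude_dirs):
                continue  # skip excluded directories
            seen.add(child)
            order.append(child)
            queue.append((child, depth + 1))
    return order


root = "https://example.com/docs/"
urls = collect_urls(root, max_depth=3, exclude_dirs=[root + "api/"])
```

In this toy site, `urls` holds the root, the tutorial directory, and the tutorial's intro page, while the `api` subtree and the external link are skipped.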

# Parameters
- url: str, the target url to crawl.
- exclude_dirs: Optional[str], webpage directories to exclude.
- use_async: bool = False, whether to use async requests. Async requests are usually faster for large crawls, but they disable lazy loading (the lazy-loading function still works, but it is no longer lazy).
- extractor: Callable[[str], str] = lambda x: x, a function to extract the text of the document from the webpage, by default it returns the page as it is. It is recommended to use tools like goose3 and beautifulsoup to extract the text.
- max_depth: int = 2, the maximum depth to crawl.
- timeout: int = 10, the timeout for each request, in seconds.
- prevent_outside: bool = True, whether to prevent crawling outside the root url.
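The `extractor` parameter accepts any `Callable[[str], str]`. As a rough illustration, the sketch below builds one from the standard library's `html.parser` alone; the `html_to_text` name and the choice to skip `<script>` and `<style>` contents are assumptions of this example, and dedicated tools such as goose3 or beautifulsoup will generally handle real pages better.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def html_to_text(html: str) -> str:
    """Hypothetical extractor: raw HTML in, newline-joined text out."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var x=1;</script></body></html>"
)
text = html_to_text(sample)
```

A function like this could then be passed as `extractor=html_to_text` when constructing the loader.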

In [16]:
from langchain.document_loaders.recursive_url_loader import RecursiveUrlLoader



Let's try a simple example.

In [25]:
from bs4 import BeautifulSoup as Soup

url = "https://docs.python.org/3.9/"
loader = RecursiveUrlLoader(url=url, max_depth=2, extractor=lambda x: Soup(x, "html.parser").text)
docs = loader.load()



In [31]:
docs[-1].page_content

'\n\n\n\n\nExtending and Embedding the Python Interpreter — Python 3.9.17 documentation\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTable of Contents\n\nExtending and Embedding the Python Interpreter\nRecommended third party tools\nCreating extensions without third party tools\nEmbedding the CPython runtime in a larger application\n\n\n\nPrevious topic\nSecurity Considerations\nNext topic\n1. Extending Python with C or C++\n\nThis Page\n\nReport a Bug\n\nShow Source\n        \n\n\n\n\n\n\n\nNavigation\n\n\nindex\n\nmodules |\n\nnext |\n\nprevious |\n\nPython »\n\n\n\n\n\n\n\n3.9.17 Documentation »\n    \n\n\n\n\n\n\n\n\n\n                     |\n                \n\n\n\n\n\n\n\nExtending and Embedding the Python InterpreterÂ¶\nThis document describes how to write modules in C or C++ to extend the Python\ninterpreter with new modules.  Those modules can not only define new functions\nbut also new object types and their methods.  The document als

In [32]:
docs[-1].metadata

{'source': 'https://docs.python.org/3.9/extending/index.html',
 'title': 'Extending and Embedding the Python Interpreter — Python 3.9.17 documentation',
 'language': None}

However, since it's hard to filter perfectly, you may still see some irrelevant documents in the results. For example:

In [33]:
docs[0]

Document(page_content='\n\n\n\n\n\n\n\n\n\n\n\n\n\n', metadata={'source': 'https://docs.python.org/3.9/_static/py.svg'})

This document contains no useful text because it was fetched from an SVG file. Filter the docs manually to remove such files.
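One possible way to do that filtering is sketched below. It relies only on each document exposing `page_content` and a `metadata["source"]` entry, as in the output above; the `drop_asset_docs` helper, the suffix list, and the extra rule that drops whitespace-only pages are assumptions of this example, and `SimpleNamespace` objects stand in for real documents so the snippet runs on its own.

```python
from types import SimpleNamespace
from urllib.parse import urlparse

# Assumed list of static-asset suffixes; adjust it for the site being crawled.
ASSET_SUFFIXES = (".svg", ".png", ".css", ".js")


def drop_asset_docs(docs, suffixes=ASSET_SUFFIXES):
    """Return only the docs that look like real pages."""
    kept = []
    for doc in docs:
        path = urlparse(doc.metadata.get("source", "")).path.lower()
        if path.endswith(suffixes):
            continue  # static asset such as an SVG image
        if not doc.page_content.strip():
            continue  # nothing but whitespace was extracted
        kept.append(doc)
    return kept


# Stand-in documents with the same two attributes as the loader's output.
sample_docs = [
    SimpleNamespace(
        page_content="\n\n\n",
        metadata={"source": "https://docs.python.org/3.9/_static/py.svg"},
    ),
    SimpleNamespace(
        page_content="Extending and Embedding the Python Interpreter",
        metadata={"source": "https://docs.python.org/3.9/extending/index.html"},
    ),
]
clean_docs = drop_asset_docs(sample_docs)
```

With the loader's real output, the same call would be `docs = drop_asset_docs(docs)`.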