# AsyncRecursiveURLLoader

This notebook demonstrates how to use the AsyncRecursiveURLLoader to load a document from a URL.

This loader fetches the document at the given URL, then finds every accessible URL in that webpage. It then fetches each of those URLs and repeats the process until all URLs have been fetched or the depth limit has been reached.
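The crawl described above can be sketched as a depth-limited breadth-first traversal. This is only an illustration, not the loader's actual implementation: the `LINKS` table is a made-up in-memory site that stands in for real HTTP requests, `crawl` is a hypothetical helper, and the depth convention used here (the start page is depth 0, and `max_depth` counts link hops) is an assumption that may differ from how the loader counts depth.

```python
from collections import deque

# Hypothetical in-memory "site": each page maps to the links found on it.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
    "/b/1": [],
    "/a/1/x": [],
}


def crawl(start: str, max_depth: int) -> list[str]:
    """Return the pages visited, breadth-first, following at most max_depth link hops."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # depth limit reached: do not follow links from this page
        for link in LINKS.get(page, []):
            if link not in seen:  # never fetch the same URL twice
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


pages = crawl("/", max_depth=2)
```

With this toy site, `/a/1/x` is three hops from the start page, so it is not visited when `max_depth` is 2.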

Please note that this loader is asynchronous, so lazy loading won't work as expected. If you want to use lazy loading, you should use the RecursiveURLLoader instead. This loader is also slower than loaders that fetch a single page, because the number of requests it has to make can grow exponentially with the depth limit.
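As a rough illustration of why an asynchronous loader tends to hand back its results in one batch rather than page by page, here is a minimal sketch built on `asyncio.gather`. The names `fetch` and `fetch_all` are hypothetical, the network call is simulated with `asyncio.sleep`, and the loader's real internals may well differ.

```python
import asyncio


async def fetch(url: str) -> str:
    # Stand-in for a network request; a real loader would use aiohttp here.
    await asyncio.sleep(0)
    return f"<html>{url}</html>"


async def fetch_all(urls: list[str]) -> list[str]:
    # gather() runs the requests concurrently and returns only once all of
    # them have finished, so the caller receives the whole batch at once.
    return await asyncio.gather(*[fetch(u) for u in urls])


results = asyncio.run(fetch_all(["https://example.com/a", "https://example.com/b"]))
```

`asyncio.gather` returns the results in the same order as the inputs, regardless of which request completes first.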

# Installation

To use this loader, it's necessary to install the `aiohttp` package. `asyncio` is part of the Python standard library and does not need to be installed separately:

```bash
pip install aiohttp
```

It is also highly recommended to use `beautifulsoup4` or `goose3` to extract the text from the HTML document. If you would rather write a custom converter, you can use any other package or write your own.
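For the "write your own" route, a converter only needs to map an HTML string to a text string. The following is a minimal sketch that uses nothing but the standard library's `html.parser`; the names `_TextExtractor` and `html_to_text` are made up for this example, and the output will be much cruder than what `beautifulsoup4` or `goose3` produce.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(raw: str) -> str:
    """Convert a raw HTML string to plain text (string in, string out)."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)
```

A function with this shape (string in, string out) is the kind of callable the `raw_webpage_to_text_converter` parameter described below expects.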

# Examples

AsyncRecursiveURLLoader takes the following parameters:
- `url`, the URL string to load the document from.
- `exclude_dirs`, optional, a list of directories to exclude from the recursive search.
- `raw_webpage_to_text_converter`, optional, a function that takes a raw webpage and returns the text from it. It should be a function from string to string.
- `max_depth`, optional, the maximum depth to search for URLs. If not specified, it defaults to 2.
- `prevent_outside`, optional, when enabled, pages outside of the given URL won't be crawled.
- `timeout`, optional, an `aiohttp.ClientTimeout` object provided to the aiohttp session.
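To make the effect of `exclude_dirs` and `prevent_outside` more concrete, here is a small sketch of the kind of check a crawler might apply to each discovered link. The helper `should_crawl` is hypothetical and is not part of the loader. Treating `exclude_dirs` entries as URL prefixes, and treating "outside" as "does not start with the base URL", are both assumptions made for this illustration.

```python
from urllib.parse import urljoin


def should_crawl(link, base_url, exclude_dirs=(), prevent_outside=True):
    """Decide whether a discovered link should be fetched (illustrative only)."""
    # Resolve relative links against the page they were found on.
    absolute = urljoin(base_url, link)
    # Assumption: excluded directories are given as URL prefixes.
    if any(absolute.startswith(prefix) for prefix in exclude_dirs):
        return False
    # Assumption: "outside" means the URL does not start with the base URL.
    if prevent_outside and not absolute.startswith(base_url):
        return False
    return True


base = "https://example.com/docs/"
inside = should_crawl("guide/intro", base)
outside = should_crawl("/blog/post", base)
```

Here `inside` is `True` because the link resolves to a URL under `base`, while `outside` is `False` because `/blog/post` resolves to a URL outside of it.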

```python
from langchain.document_loaders.async_recursive_url_loader import AsyncRecursiveUrlLoader
import goose3
```

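```python
def converter(raw: str) -> str:
    # Extract the main article text from the raw HTML using goose3.
    extractor = goose3.Goose()
    article = extractor.extract(raw_html=raw)
    return article.cleaned_text


url = "https://python.langchain.com/docs/get_started"
# Crawl from the start URL with a depth limit of 2, converting each page with goose3.
loader = AsyncRecursiveUrlLoader(url, max_depth=2, raw_webpage_to_text_converter=converter)
loader.load()
```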