# Recursive URL Loader

We may want to load all URLs under a root directory.

For example, let's look at the [LangChain JS documentation](https://js.langchain.com/docs/).

This has many interesting child pages that we may want to read in bulk.

Of course, the `WebBaseLoader` can load a list of pages. 

The challenge, however, is traversing the tree of child pages and assembling that list in the first place.
 
We do this using the `RecursiveUrlLoader`.

This also gives us the flexibility to exclude some children (e.g., the `api` directory, which has more than 800 child pages).
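
To make the traversal and exclusion ideas concrete, here is a minimal sketch using only the Python standard library. It is not how `RecursiveUrlLoader` is implemented; it only illustrates the general technique of following links depth-first while staying under a root and skipping excluded prefixes. The names `LinkParser`, `crawl`, `fetch`, and `SITE`, as well as the `example.com` URLs, are hypothetical, and the in-memory `SITE` dictionary stands in for real HTTP requests.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(root, fetch, exclude_dirs=()):
    """Return every URL under `root` reachable by links, depth-first.

    `fetch` is a hypothetical callable mapping a URL to its HTML,
    or to None if the page does not exist.
    """
    visited = []
    seen = set()

    def visit(url):
        if url in seen:
            return
        seen.add(url)
        html = fetch(url)
        if html is None:
            return
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            child, _ = urldefrag(urljoin(url, href))
            if not child.startswith(root):
                continue  # stay under the root directory
            if any(child.startswith(d) for d in exclude_dirs):
                continue  # skip excluded subtrees
            visit(child)

    visit(root)
    return visited


# A tiny in-memory "site" (assumption: stands in for real pages).
SITE = {
    "https://example.com/docs/": '<a href="a">A</a> <a href="api/">API</a>',
    "https://example.com/docs/a": '<a href="/docs/">home</a> <a href="b#top">B</a>',
    "https://example.com/docs/b": '<a href="https://other.example/">external</a>',
    "https://example.com/docs/api/": '<a href="x">X</a>',
    "https://example.com/docs/api/x": "",
}

pages = crawl(
    "https://example.com/docs/",
    SITE.get,
    exclude_dirs=["https://example.com/docs/api/"],
)
print(pages)
```

In this sketch, exclusion is a plain prefix match on the resolved URL, so excluding a directory also skips everything beneath it. The `seen` set prevents infinite loops when pages link back to each other.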

In [2]:
from langchain.document_loaders.recursive_url_loader import RecursiveUrlLoader

Let's try a simple example.

In [3]:
url = 'https://js.langchain.com/docs/modules/memory/examples/'
loader = RecursiveUrlLoader(url=url)
docs = loader.load()

In [4]:
len(docs)

12

In [5]:
docs[0].page_content[:50]

'\n\n\n\n\nDynamoDB-Backed Chat Memory | 🦜️🔗 Lan'

In [6]:
docs[0].metadata

{'source': 'https://js.langchain.com/docs/modules/memory/examples/dynamodb',
 'title': 'DynamoDB-Backed Chat Memory | 🦜️🔗 Langchain',
 'description': 'For longer-term persistence across chat sessions, you can swap out the default in-memory chatHistory that backs chat memory classes like BufferMemory for a DynamoDB instance.',
 'language': 'en'}

Now, let's try a more extensive example: the `docs` root directory.

We will skip everything under `api`.

In [7]:
url = 'https://js.langchain.com/docs/'
exclude_dirs = ['https://js.langchain.com/docs/api/']
loader = RecursiveUrlLoader(url=url, exclude_dirs=exclude_dirs)
docs = loader.load()

In [8]:
len(docs)

176

In [9]:
docs[0].page_content[:50]

'\n\n\n\n\nHacker News | 🦜️🔗 Langchain\n\n\n\n\n\nSkip'

In [10]:
docs[0].metadata

{'source': 'https://js.langchain.com/docs/modules/indexes/document_loaders/examples/web_loaders/hn',
 'title': 'Hacker News | 🦜️🔗 Langchain',
 'description': 'This example goes over how to load data from the hacker news website, using Cheerio. One document will be created for each page.',
 'language': 'en'}
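
With this many documents, it can help to see how they are distributed across the site. The following is a minimal sketch that groups `source` URLs by their first path segment under the root. The helper `count_by_section` and the `sample_sources` list are hypothetical; the sample reuses two `source` values shown above plus the root URL, so it runs without the loader.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_section(sources, root_path="/docs/"):
    """Count URLs by their first path segment below `root_path`."""
    counts = Counter()
    for source in sources:
        path = urlparse(source).path
        # Strip the root prefix so the first remaining segment is the section.
        rest = path[len(root_path):] if path.startswith(root_path) else path
        section = rest.strip("/").split("/")[0] or "(root)"
        counts[section] += 1
    return counts


# Sample `source` values (assumption: stand-ins for the loaded documents).
sample_sources = [
    "https://js.langchain.com/docs/modules/memory/examples/dynamodb",
    "https://js.langchain.com/docs/modules/indexes/document_loaders/examples/web_loaders/hn",
    "https://js.langchain.com/docs/",
]

section_counts = count_by_section(sample_sources)
print(section_counts)
```

Assuming each loaded document exposes its URL as `metadata['source']`, as in the output above, the same helper could be applied to real results with `count_by_section(d.metadata['source'] for d in docs)`. An `api` key in the result would indicate that the exclusion did not take effect.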