---
sidebar_label: RecursiveUrlLoader
sidebar_class_name: node-only
---

# RecursiveUrlLoader

```{=mdx}

:::tip Compatibility

Only available on Node.js.

:::

```

This notebook provides a quick overview for getting started with [RecursiveUrlLoader](/docs/integrations/document_loaders/). For detailed documentation of all RecursiveUrlLoader features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_web_recursive_url.RecursiveUrlLoader.html).

## Overview
### Integration details

| Class | Package | Local | Serializable | PY support |
| :--- | :--- | :---: | :---: |  :---: |
| [RecursiveUrlLoader](https://api.js.langchain.com/classes/langchain_community_document_loaders_web_recursive_url.RecursiveUrlLoader.html) | [@langchain/community](https://api.js.langchain.com/modules/langchain_community_document_loaders_web_recursive_url.html) | ✅ | beta | ❌ |

### Loader features

| Source | Web Loader | Node Envs Only |
| :---: | :---: | :---: |
| RecursiveUrlLoader | ✅ | ✅ |

When loading content from a website, we may want to load all URLs on a page.

For example, let's look at the [LangChain.js introduction](/docs/introduction) docs.

This has many interesting child pages that we may want to load, split, and later retrieve in bulk.

The challenge is traversing the tree of child pages and assembling a list!

We do this using the `RecursiveUrlLoader`.

This also gives us the flexibility to exclude some children, customize the extractor, and more.

## Setup

To access the `RecursiveUrlLoader` document loader, you'll need to install the `@langchain/community` integration package and the [`jsdom`](https://www.npmjs.com/package/jsdom) package.

### Credentials

If you want to get automated tracing of your model calls you can also set your [LangSmith](https://docs.smith.langchain.com/) API key by uncommenting below:

```bash
# export LANGSMITH_TRACING="true"
# export LANGSMITH_API_KEY="your-api-key"
```

### Installation

The LangChain RecursiveUrlLoader integration lives in the `@langchain/community` package:

```{=mdx}
import IntegrationInstallTooltip from "@mdx_components/integration_install_tooltip.mdx";
import Npm2Yarn from "@theme/Npm2Yarn";

<IntegrationInstallTooltip></IntegrationInstallTooltip>

<Npm2Yarn>
  @langchain/community @langchain/core jsdom
</Npm2Yarn>

We also suggest adding a package like [`html-to-text`](https://www.npmjs.com/package/html-to-text) or
[`@mozilla/readability`](https://www.npmjs.com/package/@mozilla/readability) for extracting the raw text from the page.

<Npm2Yarn>
  html-to-text
</Npm2Yarn>

```

## Instantiation

Now we can instantiate our loader object and load documents:

```typescript
import { RecursiveUrlLoader } from "@langchain/community/document_loaders/web/recursive_url";
import { compile } from "html-to-text";

const compiledConvert = compile({ wordwrap: 130 }); // returns (text: string) => string;

const loader = new RecursiveUrlLoader("https://langchain.com/", {
  extractor: compiledConvert,
  maxDepth: 1,
  excludeDirs: ["/docs/api/"],
});
```
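
The `extractor` option accepts any function of the shape `(text: string) => string`, so a library such as `html-to-text` is a convenience rather than a requirement. As a minimal sketch, assuming a crude regex-based cleanup is good enough for the pages being crawled, the hypothetical `stripTagsExtractor` below (an illustration, not part of LangChain) drops `<script>` and `<style>` bodies, removes the remaining tags, and collapses whitespace:

```typescript
// Hypothetical, dependency-free extractor (not part of LangChain).
// It has the `(text: string) => string` shape that the `extractor` option expects.
const stripTagsExtractor = (html: string): string =>
  html
    // Remove <script> and <style> elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1>/gi, " ")
    // Replace every remaining tag with a space.
    .replace(/<[^>]+>/g, " ")
    // Collapse runs of whitespace and trim the ends.
    .replace(/\s+/g, " ")
    .trim();

const sampleHtml =
  "<html><head><style>p{color:red}</style></head>" +
  "<body><h1>Title</h1><p>Hello <b>world</b></p><script>alert(1)</script></body></html>";

const extractedText = stripTagsExtractor(sampleHtml);
console.log(extractedText);
```

A function like this could be passed as `extractor` in place of `compiledConvert`. Regex-based stripping does not decode HTML entities or handle malformed markup, so a dedicated parser is usually the safer choice for real pages.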

## Load

```typescript
const docs = await loader.load();
docs[0];
```

```text
{
  pageContent: '\n' +
    '/\n' +
    'Products\n' +
    '\n' +
    'LangChain [/langchain]LangSmith [/langsmith]LangGraph [/langgraph]\n' +
    'Methods\n' +
    '\n' +
    'Retrieval [/retrieval]Agents [/agents]Evaluation [/evaluation]\n' +
    'Resources\n' +
    '\n' +
    'Blog [https://blog.langchain.dev/]Case Studies [/case-studies]Use Case Inspiration [/use-cases]Experts [/experts]Changelog\n' +
    '[https://changelog.langchain.com/]\n' +
    'Docs\n' +
    '\n' +
    'LangChain Docs [https://python.langchain.com/v0.2/docs/introduction/]LangSmith Docs [https://docs.smith.langchain.com/]\n' +
    'Company\n' +
    '\n' +
    'About [/about]Careers [/careers]\n' +
    'Pricing [/pricing]\n' +
    'Get a demo [/contact-sales]\n' +
    'Sign up [https://smith.langchain.com/]\n' +
    '\n' +
    '\n' +
    '\n' +
    '\n' +
    'LangChain’s suite of products supports developers along each step of the LLM application lifecycle.\n' +
    '\n' +
    '\n' +
    'APPLICATIONS THAT CAN
```

```typescript
console.log(docs[0].metadata);
```

```text
{
  source: 'https://langchain.com/',
  title: 'LangChain',
  description: 'LangChain’s suite of products supports developers along each step of their development journey.',
  language: 'en'
}
```


## Options

```typescript
interface Options {
  excludeDirs?: string[]; // webpage directories to exclude.
  extractor?: (text: string) => string; // a function that extracts the document text from the raw page. By default, it returns the page as it is; a tool like html-to-text is recommended.
  maxDepth?: number; // the maximum depth to crawl. By default, it is set to 2. To crawl a whole website, set it to a sufficiently large number.
  timeout?: number; // the timeout for each request, in milliseconds. By default, it is set to 10000 (10 seconds).
  preventOutside?: boolean; // whether to prevent crawling outside the root url. By default, it is set to true.
  callerOptions?: AsyncCallerConstructorParams; // options passed to the AsyncCaller, for example the maximum concurrency (default is 64).
}
```
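
As a rough illustration of how these options combine, the sketch below builds an options object for a deeper but more conservative crawl. The `CrawlOptions` interface is a local, hypothetical mirror of the fields listed above so that the snippet type-checks on its own; in real code the object would be passed as the second argument to `new RecursiveUrlLoader(url, options)`. The specific values are assumptions chosen for the example, not recommendations.

```typescript
// Hypothetical local mirror of the documented options (illustrative only).
interface CrawlOptions {
  excludeDirs?: string[];
  extractor?: (text: string) => string;
  maxDepth?: number;
  timeout?: number;
  preventOutside?: boolean;
  callerOptions?: { maxConcurrency?: number };
}

const crawlOptions: CrawlOptions = {
  excludeDirs: ["/docs/api/"], // skip this directory while crawling
  maxDepth: 3, // one level deeper than the documented default of 2
  timeout: 15000, // 15 seconds, on the same scale as the default of 10000 (10 seconds)
  preventOutside: true, // stay under the root URL
  callerOptions: { maxConcurrency: 8 }, // fewer parallel requests than the default of 64
};

console.log(crawlOptions);
```

Lowering `maxConcurrency` trades crawl speed for a lighter load on the target site, which can matter when `maxDepth` is raised.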

Filtering is never perfect, so the loaded documents may still include some irrelevant pages. If needed, you can filter the returned documents yourself. Most of the time, the returned results are good enough.
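
One way to post-filter the returned documents is by their `metadata.source` URL. The sketch below is a minimal illustration: `LoadedDoc` is a simplified stand-in for the document shape shown in the Load section, and `filterBySourcePrefix` and the sample URLs are hypothetical names and data introduced for the example.

```typescript
// Simplified stand-in for the loaded document shape (pageContent + metadata.source).
interface LoadedDoc {
  pageContent: string;
  metadata: { source: string };
}

// Hypothetical helper: keep only documents whose source URL starts with `prefix`.
function filterBySourcePrefix<T extends LoadedDoc>(docs: T[], prefix: string): T[] {
  return docs.filter((doc) => doc.metadata.source.startsWith(prefix));
}

// Made-up sample data standing in for the result of `loader.load()`.
const sampleDocs: LoadedDoc[] = [
  { pageContent: "Home", metadata: { source: "https://langchain.com/" } },
  { pageContent: "LangSmith", metadata: { source: "https://langchain.com/langsmith" } },
  { pageContent: "Careers", metadata: { source: "https://langchain.com/careers" } },
];

const keptDocs = filterBySourcePrefix(sampleDocs, "https://langchain.com/langsmith");
console.log(keptDocs.map((doc) => doc.metadata.source));
```

The same pattern extends to other criteria, such as dropping documents whose `pageContent` is empty after extraction.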

## API reference

For detailed documentation of all RecursiveUrlLoader features and configurations head to the [API reference](https://api.js.langchain.com/classes/langchain_community_document_loaders_web_recursive_url.RecursiveUrlLoader.html).