# How to link Documents on hyperlinks in HTML

The `HtmlLinkExtractor` creates an outgoing link for each hyperlink found in an HTML document, plus an incoming link for the provided URL where the document is hosted.
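
To make the two link directions concrete, here is a minimal standard-library sketch of the idea. It is not the `HtmlLinkExtractor` implementation: `SketchLink`, `AnchorCollector`, and `sketch_links` are hypothetical names used only for illustration, and the filtering rules (dropping fragments, skipping self-links, keeping only http(s) targets) are assumptions of this sketch.

```python
from html.parser import HTMLParser
from typing import NamedTuple
from urllib.parse import urldefrag, urljoin


class SketchLink(NamedTuple):
    """Hypothetical stand-in for a link; not the LangChain `Link` class."""

    kind: str
    direction: str
    tag: str


class AnchorCollector(HTMLParser):
    """Collects the raw href value of every <a> element."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def sketch_links(html: str, page_url: str) -> set[SketchLink]:
    """Return one incoming link for the page plus one outgoing link per target."""
    collector = AnchorCollector()
    collector.feed(html)

    # The page itself can be linked to, so it gets an incoming link tagged with its URL.
    links = {SketchLink("hyperlink", "in", page_url)}
    for href in collector.hrefs:
        # Resolve relative hrefs against the page URL and drop "#fragment" parts.
        target, _fragment = urldefrag(urljoin(page_url, href))
        if target.startswith(("http://", "https://")) and target != page_url:
            links.add(SketchLink("hyperlink", "out", target))
    return links


page = (
    '<a href="/docs/b/">B</a> '
    '<a href="#top">Top</a> '
    '<a href="https://example.org/x#y">X</a>'
)
for link in sorted(sketch_links(page, "https://example.com/docs/a/")):
    print(link)
```

Two documents are connected when the tag of an outgoing link on one matches the tag of an incoming link on the other.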

## Preliminaries

Install the following dependencies:

In [None]:
%pip install -q langchain_community beautifulsoup4

## Usage

We'll scrape two HTML pages, one of which contains a hyperlink to the other, and use the `HtmlLinkExtractor` to create the links in the documents.

### Basic usage

In [79]:
import requests
from langchain_community.graph_vectorstores.extractors import (
    HtmlInput,
    HtmlLinkExtractor,
)
from langchain_core.documents import Document
from langchain_core.graph_vectorstores.links import add_links

html_extractor = HtmlLinkExtractor()

documents = []
for url in [
    "https://python.langchain.com/v0.2/docs/integrations/providers/astradb/",
    "https://docs.datastax.com/en/astra/home/astra.html",
]:
    response = requests.get(url)
    # Use the decoded text; str() on the raw bytes would yield a "b'...'" repr.
    doc = Document(response.text, metadata={"source": url})
    links = html_extractor.extract_one(HtmlInput(doc.page_content, url))
    add_links(doc, links)
    documents.append(doc)

documents[0].metadata["links"][:10]

[Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/spreedly/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/nvidia/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/ray_serve/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/bageldb/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/introduction/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/gradient/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/replicate/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/cerebriumai/'),
 Link(kind='hyperlink', direction='out', tag='htt



Documents annotated with these hyperlink links can then be added to a `GraphVectorStore`:


In [None]:
from langchain_community.graph_vectorstores import CassandraGraphVectorStore

# `...` is a placeholder: pass the embeddings model of your choice.
store = CassandraGraphVectorStore.from_documents(documents=documents, embedding=...)

### Using as_document_extractor()

If your document loader returns the raw HTML and sets the `source` key in the document metadata, as `AsyncHtmlLoader` does, you can simplify the code with `as_document_extractor()`, which returns an extractor that takes a `Document` directly as input.

In [81]:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.graph_vectorstores.extractors import HtmlLinkExtractor
from langchain_core.graph_vectorstores.links import add_links

loader = AsyncHtmlLoader(
    [
        "https://python.langchain.com/v0.2/docs/integrations/providers/astradb/",
        "https://docs.datastax.com/en/astra/home/astra.html",
    ]
)
documents = loader.load()
html_extractor = HtmlLinkExtractor().as_document_extractor()

for i, document in enumerate(documents):
    document.id = f"doc-{i}"
    links = html_extractor.extract_one(document)
    add_links(document, links)

documents[0].metadata["links"][:10]

Fetching pages: 100%|##########| 2/2 [00:00<00:00, 10.85it/s]


[Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/spreedly/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/nvidia/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/ray_serve/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/bageldb/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/introduction/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/gradient/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/replicate/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/cerebriumai/'),
 Link(kind='hyperlink', direction='out', tag='htt
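
Conceptually, `as_document_extractor()` adapts an extractor that expects HTML plus a URL into one that accepts a document. The sketch below illustrates that adapter pattern in plain Python. `SketchDocument`, `extract_from_html`, and `as_document_sketch` are hypothetical names, and reading the URL from `metadata["source"]` mirrors the description above rather than the library's actual code.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a document with content and metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def extract_from_html(html: str, base_url: str) -> set[tuple[str, str]]:
    """Toy extractor taking raw HTML and the URL it was fetched from."""
    links = {("in", base_url)}
    # Only absolute http(s) hrefs are picked up; fragments are cut off.
    for href in re.findall(r'href="(https?://[^"#]+)', html):
        links.add(("out", href))
    return links


def as_document_sketch(
    extractor: Callable[[str, str], set[tuple[str, str]]],
) -> Callable[[SketchDocument], set[tuple[str, str]]]:
    """Wrap an (html, url) extractor so that it accepts a document instead."""

    def extract(document: SketchDocument) -> set[tuple[str, str]]:
        # The adapter only works if the loader recorded where the page came from.
        source = document.metadata.get("source")
        if source is None:
            raise ValueError("document metadata has no 'source' key")
        return extractor(document.page_content, source)

    return extract


doc = SketchDocument(
    '<a href="https://example.org/next">next</a>',
    metadata={"source": "https://example.com/start"},
)
document_extractor = as_document_sketch(extract_from_html)
print(sorted(document_extractor(doc)))
```

The explicit error for a missing `source` key is a design choice of this sketch: without the page URL there is no tag for the incoming link.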

### Using LinkExtractorTransformer

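`LinkExtractorTransformer` can further simplify the extraction: it applies one or more extractors to every document and adds the resulting links to each document's metadata.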

In [64]:
from langchain_community.document_loaders import AsyncHtmlLoader
from langchain_community.graph_vectorstores.extractors import (
    HtmlLinkExtractor,
    LinkExtractorTransformer,
)
from langchain_core.graph_vectorstores.links import add_links

loader = AsyncHtmlLoader(
    [
        "https://python.langchain.com/v0.2/docs/integrations/providers/astradb/",
        "https://docs.datastax.com/en/astra/home/astra.html",
    ]
)

documents = loader.load()
transformer = LinkExtractorTransformer([HtmlLinkExtractor().as_document_extractor()])
documents = transformer.transform_documents(documents)

documents[0].metadata["links"][:10]

Fetching pages: 100%|##########| 2/2 [00:00<00:00, 10.06it/s]


[Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/spreedly/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/nvidia/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/ray_serve/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/llamacpp/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/brave_search/'),
 Link(kind='hyperlink', direction='out', tag='https://smith.langchain.com'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/chat/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/integrations/providers/bageldb/'),
 Link(kind='hyperlink', direction='out', tag='https://python.langchain.com/v0.2/docs/int
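
The transformer pattern, running a list of extractors over each document and merging the results into `metadata["links"]`, can be sketched as follows. This is an illustration under stated assumptions, not the `LinkExtractorTransformer` source: documents are modeled as plain dicts, links as `(direction, tag)` tuples, and `transform_documents_sketch`, `incoming_link`, and `url_word_links` are hypothetical names. Unlike a real transformer might, this sketch returns new dicts instead of modifying its inputs.

```python
from typing import Callable, Iterable

# An extractor maps a document (a plain dict here) to a set of (direction, tag) pairs.
Extractor = Callable[[dict], set]


def transform_documents_sketch(
    documents: Iterable[dict], extractors: list[Extractor]
) -> list[dict]:
    """Run every extractor on every document and merge results into metadata["links"]."""
    transformed = []
    for document in documents:
        # Start from any links the document already carries.
        links = set(document["metadata"].get("links", ()))
        for extractor in extractors:
            links |= extractor(document)
        # Build new dicts so the input documents are left unmodified.
        metadata = {**document["metadata"], "links": sorted(links)}
        transformed.append({**document, "metadata": metadata})
    return transformed


def incoming_link(document: dict) -> set:
    """Toy extractor: the document is reachable under its source URL."""
    return {("in", document["metadata"]["source"])}


def url_word_links(document: dict) -> set:
    """Toy extractor: every whitespace-separated https:// word is an outgoing link."""
    words = document["page_content"].split()
    return {("out", word) for word in words if word.startswith("https://")}


docs = [
    {
        "page_content": "see https://example.org/b",
        "metadata": {"source": "https://example.org/a"},
    }
]
result = transform_documents_sketch(docs, [incoming_link, url_word_links])
print(result[0]["metadata"]["links"])
```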

We can check that there is a link from the first document to the second:

In [82]:
for doc_to in documents:
    for link_to in doc_to.metadata["links"]:
        # Only incoming links can be the target end of a connection.
        if link_to.direction != "in":
            continue
        for doc_from in documents:
            for link_from in doc_from.metadata["links"]:
                # An outgoing link with the same tag points at doc_to.
                if link_from.direction == "out" and link_to.tag == link_from.tag:
                    print(
                        f"Found link from {doc_from.metadata['source']} "
                        f"to {doc_to.metadata['source']}."
                    )

Found link from https://python.langchain.com/v0.2/docs/integrations/providers/astradb/ to https://docs.datastax.com/en/astra/home/astra.html.
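
The nested loops above compare every link with every other link, which is fine for two documents. For larger collections, an index keyed by tag avoids the repeated scans. The following is a minimal sketch that uses plain `(direction, tag)` tuples instead of `Link` objects; `find_edges` and the sample data are hypothetical.

```python
from collections import defaultdict


def find_edges(docs: dict[str, list[tuple[str, str]]]) -> list[tuple[str, str]]:
    """Return sorted (source, target) pairs where an out-link tag matches an in-link tag.

    `docs` maps a document's source URL to a list of (direction, tag) pairs.
    """
    # Index every incoming tag by the documents that declare it.
    targets_by_tag = defaultdict(set)
    for source, links in docs.items():
        for direction, tag in links:
            if direction == "in":
                targets_by_tag[tag].add(source)

    # Look up each outgoing tag in the index instead of rescanning all documents.
    edges = set()
    for source, links in docs.items():
        for direction, tag in links:
            if direction == "out":
                for target in targets_by_tag.get(tag, ()):
                    edges.add((source, target))
    return sorted(edges)


sample = {
    "https://example.com/a": [
        ("in", "https://example.com/a"),
        ("out", "https://example.com/b"),
    ],
    "https://example.com/b": [
        ("in", "https://example.com/b"),
        ("out", "https://example.org/other"),
    ],
}
for source, target in find_edges(sample):
    print(f"Found link from {source} to {target}.")
```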
