# URL HREF URL Retriever

The notebook showcases how to use the UrlUrlsDocSearchChain.

In a nutshell, it loads web pages and retrieves the href URLs that are most relevant to a search query.

The core idea is simple: given a specific query, we should be able to return only the documents relevant to that query, and only the parts of those documents that are relevant.
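To make the idea concrete, here is a minimal sketch of ranking links by relevance to a query. It is not how `UrlUrlsDocSearchChain` is implemented: the chain uses an embeddings model, whereas this example swaps in a toy bag-of-words vector so that it runs without any dependencies. The helper names (`bag_of_words`, `cosine_similarity`, `top_k_links`) are hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase alphanumeric token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_links(links: list[tuple[str, str]], query: str, k: int = 3) -> list[tuple[str, str]]:
    """Return the k (link_text, href) pairs most similar to the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        links,
        key=lambda link: cosine_similarity(bag_of_words(f"{link[0]} {link[1]}"), query_vec),
        reverse=True,
    )
    return ranked[:k]


# (link_text, href) pairs as they might be scraped from a page.
links = [
    ("Five revelations from Nasa's public UFO meeting",
     "https://www.bbc.com/news/world-us-canada-65729356"),
    ("Privacy Policy", "https://www.bbc.co.uk/usingthebbc/privacy/"),
]

best = top_k_links(links, "privacy", k=1)
print(best)
```

A real embeddings model would also match links that are semantically related to the query without sharing its exact words, which this word-count stand-in cannot do.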

In [1]:
# Don't forget to set the OpenAI API key.
import os
os.environ["OPENAI_API_KEY"] = ""
os.environ["TOKENIZERS_PARALLELISM"] = "true"

## Web Page HREF Search

Below is an example of a search for href URLs related to the term "privacy":

In [2]:
import nest_asyncio


from slangchain.chains.url_urls_doc_search.base import UrlUrlsDocSearchChain
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# Patch asyncio to allow nested event loops, so async code can run inside a notebook.
nest_asyncio.apply()

# Chunk size and overlap used when splitting the URL content.
chunk_size = 100
chunk_overlap = 50
# Number of most relevant results to return.
k = 3

embeddings = HuggingFaceEmbeddings()

url_urls_search = UrlUrlsDocSearchChain(
  embeddings=embeddings,
  chunk_size=chunk_size,
  chunk_overlap=chunk_overlap,
  k=k,
  browser_headless_flag=True
)
content = url_urls_search.run("https://www.bbc.com/news|privacy")
print(content)

NOTE: Redirects are currently not supported in Windows or MacOs.


[Document(page_content='"link_text": Privacy Policy, "href": https://www.bbc.co.uk/usingthebbc/privacy/', metadata={'source': 'https://www.bbc.com/news'}), Document(page_content='know-their-donors', metadata={'source': 'https://www.bbc.com/news'}), Document(page_content='"link_text": Five revelations from Nasa\'s public UFO meeting, "href": https://www.bbc.com/news/world-us-canada-65729356', metadata={'source': 'https://www.bbc.com/news'})]
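The output above mixes complete `"link_text": ..., "href": ...` entries with a bare fragment (`know-their-donors`), which is likely a side effect of the small chunk size. As a rough sketch, the hrefs could be pulled out of the returned `page_content` strings with a regular expression. The pattern and the `parse_link` helper are assumptions based only on the output format shown above; they are not part of the chain's API. Plain strings are used here in place of `Document` objects so that the example is self-contained.

```python
import re
from typing import Optional

# Assumed format, inferred from the output above:
#   "link_text": <text>, "href": <url>
LINK_PATTERN = re.compile(r'"link_text":\s*(?P<text>.*?),\s*"href":\s*(?P<href>\S+)')


def parse_link(page_content: str) -> Optional[dict]:
    """Return {'link_text', 'href'} for a complete entry, or None for a fragment."""
    match = LINK_PATTERN.search(page_content)
    if match is None:
        return None
    return {"link_text": match.group("text"), "href": match.group("href")}


# The page_content values from the output above.
page_contents = [
    '"link_text": Privacy Policy, "href": https://www.bbc.co.uk/usingthebbc/privacy/',
    "know-their-donors",
    '"link_text": Five revelations from Nasa\'s public UFO meeting, '
    '"href": https://www.bbc.com/news/world-us-canada-65729356',
]

parsed_links = [link for link in map(parse_link, page_contents) if link is not None]
for link in parsed_links:
    print(link["link_text"], "->", link["href"])
```

With real results, the same helper could be applied to `doc.page_content` for each returned document, assuming the chain keeps producing entries in this format.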
