### Read Data From Site

In [52]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# only keep relevant data from the page
# by filtering out all tags except for p, h1, h2, h3, h4
bs4_strainer = bs4.SoupStrainer(['p', 'h1', 'h2', 'h3', 'h4'])
loader = WebBaseLoader(
    web_paths=("https://www.anthropic.com/news/contextual-retrieval",),
    bs_kwargs={"parse_only": bs4_strainer},
    bs_get_text_kwargs={"separator": "|"}
)
docs = loader.load()

document = docs[0].page_content
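
To make the filtering step concrete, here is a minimal standard-library sketch of the same idea: keep only the text inside `p` and `h1` to `h4` tags and join the pieces with `"|"`. It is an approximation for illustration, not how `WebBaseLoader` or `bs4.SoupStrainer` work internally. The names `KeepTagsParser`, `extract_text`, and `sample_html` are hypothetical, and unlike `get_text` the sketch strips whitespace and assumes well-formed, properly closed tags.

```python
from html.parser import HTMLParser

# Tags whose text we want to keep (same list as the SoupStrainer above).
KEEP_TAGS = {"p", "h1", "h2", "h3", "h4"}


class KeepTagsParser(HTMLParser):
    """Hypothetical helper: collect text found inside any of KEEP_TAGS."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of kept tags currently open
        self.pieces = []  # one entry per non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag in KEEP_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in KEEP_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Ignore text outside kept tags and whitespace-only nodes.
        if self.depth > 0 and data.strip():
            self.pieces.append(data.strip())


def extract_text(html, separator="|"):
    """Return the kept text nodes joined by `separator`."""
    parser = KeepTagsParser()
    parser.feed(html)
    parser.close()
    return separator.join(parser.pieces)


# Made-up HTML for illustration only.
sample_html = (
    "<html><body><nav>Home</nav>"
    "<h1>Title</h1><p>First paragraph.</p>"
    "<div>Sidebar</div><p>Second paragraph.</p>"
    "</body></html>"
)
result = extract_text(sample_html)
print(result)
```

The navigation and sidebar text are dropped, and the remaining heading and paragraphs come out as one `"|"`-delimited string, which is the shape the splitter in the next section relies on.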

In [53]:
print(document)

Introducing Contextual Retrieval|For an AI model to be useful in specific contexts, it often needs access to background knowledge. For example, customer support chatbots need knowledge about the specific business they're being used for, and legal analyst bots need to know about a vast array of past cases.|Developers typically enhance an AI model's knowledge using Retrieval-Augmented Generation (RAG). RAG is a method that retrieves relevant information from a knowledge base and appends it to the user's prompt, significantly enhancing the model's response. The problem is that traditional RAG solutions remove context when encoding information, which often results in the system failing to retrieve the relevant information from the knowledge base.|In this post, we outline a method that dramatically improves the retrieval step in RAG. The method is called “Contextual Retrieval” and uses two sub-techniques: Contextual Embeddings and Contextual BM25. This method can reduce the number of failed

### Split Document into Chunks

In [54]:
from langchain_text_splitters import CharacterTextSplitter

# split on "|" because that separator was inserted between elements when the page was loaded
text_splitter = CharacterTextSplitter(
    separator="|",
    chunk_size=600,
    chunk_overlap=0,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([document])
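
The chunks printed below suggest that pieces are merged back together with the separator as long as the result stays within `chunk_size`, which is why the short title ends up in the same chunk as the first paragraph. The following is a simplified pure-Python sketch of that greedy merge, assuming zero overlap. It is not LangChain's implementation, and `split_and_merge` and `sample` are hypothetical names used only for illustration.

```python
def split_and_merge(text, separator="|", chunk_size=600):
    """Split on `separator`, then greedily merge pieces up to `chunk_size`.

    Simplified sketch: no overlap, no whitespace stripping. A single piece
    longer than `chunk_size` is kept whole as its own chunk.
    """
    pieces = [p for p in text.split(separator) if p]
    chunks, current, current_len = [], [], 0
    for piece in pieces:
        # Cost of adding this piece, including a joining separator if needed.
        extra = len(piece) + (len(separator) if current else 0)
        if current and current_len + extra > chunk_size:
            # The piece does not fit: close the current chunk, start a new one.
            chunks.append(separator.join(current))
            current, current_len = [], 0
            extra = len(piece)
        current.append(piece)
        current_len += extra
    if current:
        chunks.append(separator.join(current))
    return chunks


# Made-up input: a short heading followed by three "paragraphs".
sample = "Heading|" + "a" * 30 + "|" + "b" * 30 + "|" + "c" * 80
chunks = split_and_merge(sample, chunk_size=50)
for chunk in chunks:
    print(len(chunk), chunk[:20])
```

With `chunk_size=50`, the heading is merged with the first paragraph, the second paragraph becomes its own chunk, and the oversized third paragraph is emitted unsplit. Splitting only at the separator keeps each paragraph intact, at the cost of chunks that can exceed the size limit.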

In [55]:
# preview the first five chunks (fewer if the document is short)
for text in texts[:5]:
    print(text.page_content)
    print("-" * 80)

Introducing Contextual Retrieval|For an AI model to be useful in specific contexts, it often needs access to background knowledge. For example, customer support chatbots need knowledge about the specific business they're being used for, and legal analyst bots need to know about a vast array of past cases.
--------------------------------------------------------------------------------
Developers typically enhance an AI model's knowledge using Retrieval-Augmented Generation (RAG). RAG is a method that retrieves relevant information from a knowledge base and appends it to the user's prompt, significantly enhancing the model's response. The problem is that traditional RAG solutions remove context when encoding information, which often results in the system failing to retrieve the relevant information from the knowledge base.
--------------------------------------------------------------------------------
In this post, we outline a method that dramatically improves the retrieval step in RA