**Exercise 5: Retrieval Augmented Generation**


We'll build a RAG system using the LangChain framework 


In [5]:
# Installations
!pip install -q langchain
!pip install -q langchain-community
!pip install -q langchain-chroma
!pip install -q langchain-huggingface
!pip install -q bs4
!pip install -q rank_bm25
!pip install -q huggingface_hub
!pip install -q langsmith

**Part 1: Simple RAG pipeline with LangChain using Sparse Retrieval**

Step-by-Step Instructions:
 - Introduction to RAG:
  - Retrieval-Augmented Generation (RAG) enhances the capabilities of a language model by retrieving relevant information from external sources, such as documents or databases.
  - It has two key components:
    - Indexing: Ingests and organizes data for efficient retrieval.
    - Retrieval & Generation: Matches a query with relevant documents and feeds them to the language model to generate a response.

- Data Loading with WebBaseLoader:
  - Before we start with the indexing we need some data to query. For this we use LangChains "WebBaseLoader" which allows us to fetch data from webpages directly by giving a path. For this exercise we use the content of the Data Science and its Applications Research Groups webpage. There are many different loaders available if you want to try out different sources: https://python.langchain.com/docs/integrations/document_loaders/



In [6]:
#Loading the data with LangChains WebBaseLoader
import bs4
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.documents import Document

# Load and chunk contents of the blog
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-title", "post-content", "post-header"))),
)

docs = loader.load()

USER_AGENT environment variable not set, consider setting it to identify your requests.


In [7]:
print(docs[0])

page_content='

      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory

Short-t

In [8]:
print("First 500 characters of the loaded content:",docs[0].page_content[:500])

First 500 characters of the loaded content: 

      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In


**Using a Retriever to Answer Questions**
- Now that we have fetched and preprocessed our data, we can use it to answer some questions. To do this, we need a Retriever—a tool that helps us search through the data and find the most relevant documents.

**What is a Retriever?**
- A retriever is a system that identifies and retrieves documents that are most relevant to a given query. These documents can then be used by an LLM or another system to answer questions or perform further analysis.

- Retrievers are an important component of Retrieval-Augmented Generation (RAG) pipelines and can be implemented using various types of search and indexing systems, such as:
  - Vectorstores (e.g., embeddings-based search)
  - Graph databases (e.g., knowledge graphs)
  - Relational databases (e.g., SQL-based search)

**How Does the LangChain Retriever Work?**
- LangChain provides a retriever interface that is simple and intuitive:
  - Input: The retriever takes a query as input, which is a text string (e.g., a question you want to answer).
  - Output: It returns a list of documents, formatted as standardized LangChain Document objects. These documents contain the content and metadata retrieved from the dataset.

**Getting Started with a Simple Sparse Retriever**
- To understand the basics of retrieval, we'll start with a sparse retriever that uses the BM25 algorithm.

- What is BM25?
  - BM25 (Okapi BM25) is a ranking algorithm commonly used in search engines. It calculates the relevance of documents to a query based on:
    - The frequency of query terms in the document.
    - The importance of those terms (e.g., frequent terms like "the" are weighted lower).
    - The length of the document (shorter documents are typically favored).
    - BM25 is a lexical search algorithm, meaning it matches words in the query to words in the document.

<br>

---
<br>

**Steps to Implement BM25 in LangChain**

- Step 1: Define Your Query
Decide on the question you want to answer. For example:

In [9]:
query = "What are the components of agent systems?"

- Step 2: Initialize the Retriever
  - Use LangChain’s BM25Retriever to create a retriever based on the preprocessed documents.

In [10]:
from langchain.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(docs)

- Step 3: Retrieve Relevant Documents
  - Invoke the retriever with your query and specify how many top results (k) you want.

In [11]:
results = bm25_retriever.invoke(query)

- Step 4: Examine the Results
  - Print out the retrieved documents to see the most relevant ones.

In [12]:
print("BM25 Retrieved Results:")
for i, doc in enumerate(results):
    print(f"Result {i+1}:\n{doc.page_content}\n")

print(len(results))

BM25 Retrieved Results:
Result 1:


      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final result

**Analyzing Retrieved Text for LLM Input**
- After retrieving the relevant text, we now have data that an LLM (Language Model) can use to answer our question. However, there are some challenges we need to address:
  - Too Much Information: The retrieved content often contains extra details that are irrelevant to the question. This can confuse the LLM and reduce the accuracy of the generated answer.
  - Context Size Limits: LLMs have a limited context window (i.e., the maximum number of tokens they can process at once). If the input is too large, we need to trim or split it into smaller chunks.
- To address these issues, we need to analyze the length of the retrieved documents and optimize their size for the LLM.  


<br>

---

<br>

- Step 1: Calculate Average Document Size
  - Let’s calculate the average word count and character count of the documents we retrieved. This will help us understand if the documents are too large to be used directly.

  - Why Does This Matter?
    - If the average word count or character count is too high, we risk exceeding the LLM’s context window.
    - Large documents may also dilute the relevance of the answer, as irrelevant details are included.


In [13]:
def word_count(texts):
  total_words = 0
  total_characters = 0

  for doc in texts:
      content = doc.page_content
      word_count = len(content.split())  # Count words
      char_count = len(content)         # Count characters
      total_words += word_count
      total_characters += char_count

  num_docs = len(texts)
  avg_words = total_words / num_docs if num_docs > 0 else 0
  avg_characters = total_characters / num_docs if num_docs > 0 else 0

  print(f"Average words per document: {avg_words}")
  print(f"Average characters per document: {avg_characters}")

word_count(docs)

Average words per document: 6477.0
Average characters per document: 43047.0


- Step 2: Splitting Large Documents into Chunks
  - To make the documents manageable for both retrieval and LLM input, we need to split large documents into smaller chunks. This ensures:
    - The chunks fit within the LLM's context window.
    - Each chunk focuses on specific content, improving relevance during retrieval.
  - We can use LangChain’s RecursiveCharacterTextSplitter for this task.

**How the Text Splitter Works:**
- Chunk Size: Defines the maximum size of each chunk (in characters or tokens).
- Chunk Overlap: Ensures there’s some overlap between consecutive chunks to preserve context.

In [14]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

- Step 3: Recalculate Average Document Size
  - After splitting, we can check how the average size of the documents has changed.

In [15]:
word_count(splits)

Average words per document: 108.11111111111111
Average characters per document: 718.4444444444445


- Step 4: Retrieve Text from Split Documents
  - Now that we’ve split the documents into smaller chunks, we’ll use the BM25 retriever again to find the most relevant chunks for the query.

**Why Use the Split Documents?**
- Improved Relevance: Smaller chunks focus on specific content, making it easier to retrieve exactly what we need.
- Better Fit for LLM: The LLM can process smaller chunks without hitting its context limits, ensuring we get accurate answers.

In [16]:
bm25_retriever_split = BM25Retriever.from_documents(splits)

results_split = bm25_retriever_split.invoke(query)

In [17]:
print("BM25 Split Texts Retrieved Results:")
for i, doc in enumerate(results_split):
    print(f"Result {i+1}:\n{doc.page_content}\n")

BM25 Split Texts Retrieved Results:
Result 1:
},
  {
    "role": "user",
    "content": "We are writing {{a Super Mario game in python. MVC components split in separate files. Keyboard control.}}\n"
  },
  {
    "role": "assistant",
    "content": "Summary of areas that need clarification:\n1. Specifics of the Super Mario game (e.g. level design, characters, gameplay mechanics)\n2. Details about the MVC components (e.g. which components are in each file)\n3. Keyboard control implementation (e.g. which keys to use, how to handle input)\n\nClarifying question:\nCan you provide more details about the Super Mario game, such as level design, characters, and gameplay mechanics?"
  },
  {
    "role": "user",

Result 2:
Reliability of natural language interface: Current agent system relies on natural language as an interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable, as LLMs may make formatting errors and occasional

- Step 5: Use an LLM to Answer the Question
  - With the most relevant chunks retrieved, we can now use a Language Model (LLM) to generate a response to the query. For this, we’ll use the Llama-3.1-8B model hosted on HuggingFace.

**Setting Up the LLM:**
- HuggingFace Hub API: Ensure you have your HuggingFace Hub API token ready.
- Model Configuration: Specify the model repository (repo_id) and additional parameters such as temperature for response generation.

In [18]:
import getpass
import os

if "HF_TOKEN" not in os.environ:
    os.environ["HF_TOKEN"] = getpass.getpass("Enter your Huggingfacehub api token: ")

In [19]:
from langchain_huggingface import HuggingFaceEndpoint

repo_id = "meta-llama/Llama-3.1-8B"

llm = HuggingFaceEndpoint(
    repo_id=repo_id,
    temperature=1.0,
)

- Step 6: Create a RAG Chain
  - A RAG chain combines the retriever and the LLM into a single pipeline. It takes the user query, retrieves relevant documents, and passes them to the LLM for generating the final answer.

**How the RAG Chain Works:**
- Input: A query from the user.
- Retrieval: Finds the most relevant chunks using the retriever.
- Formatting: Combines the retrieved chunks into a coherent context for the LLM.
Answer Generation: The LLM generates a response based on the provided context.

In [20]:
from langchain import hub
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Format the Retrieved Chunks
def format_docs(splits):
    return "\n\n".join(doc.page_content for doc in docs)

# Load a Pre-Built RAG Prompt
prompt = hub.pull("rlm/rag-prompt")

# Build the RAG Chain
rag_chain = (
    {"context": bm25_retriever_split | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Run the Query Through the RAG Chain
result_simple = rag_chain.invoke(query)
print(result_simple)



 

      LLM Powered Autonomous Agents


Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng



To compare to the output of the LLM without RAG:

In [21]:
result_plain_llm = llm.invoke(query)
print(result_plain_llm)

 This paper adopts an instrumentation-oriented perspective to investigate this question. The instrumentalist in natural sciences holds that all concepts of nature are instruments for predicting events, that is, all theories and all observations are means to these predictions. For example, the scientific discoveries of the 20th century do not challenge the Aristotelian view of continuous space and time. The only challenge is how to most predictively represent them. In this paper we exploit the idea that conceptualizations of agent systems are tools for achieving predictively successful interactions with other agents and the environment. That is, they are instruments for achieving the goal of achieving pre-set goals. This paper presents a formalization of agent systems using the instrumentalist position. Section 2 starts with an examination of the sense in which a conceptualization of agent systems can be instrumentally successful. Section 3 outlines the temporal logic formalization for 



---



---



**Part 2: Using Embeddings and Chroma for Enhanced Retrieval**

**Why Use Embeddings?**
- Lexical retrieval methods like BM25 rely on exact word matches. This works well for simple queries but struggles with:
  - Synonyms (e.g., “AI” vs. “artificial intelligence”)
  - Conceptual Matches (e.g., “data visualization” vs. “charts”)
- Embeddings solve this by representing text as vectors in a high-dimensional space. Similar meanings result in vectors that are closer together, enabling semantic search.

**What is Chroma?**
- Chroma is a vector database designed for storing and querying embeddings. It allows us to:
  - Efficiently index large amounts of data.
  - Perform similarity searches using embeddings.
  - Combine retrieval with LLMs for downstream tasks.


- Step 1: Set Up Chroma and an Embeddings Model
  - Install Necessary Libraries

In [22]:
!pip install -q langchain-chroma>=0.1.2

zsh:1: 0.1.2 not found


- Step 2: Choose an Embeddings Model
    - We’ll use all-mpnet-base-v2 from the sentence-transformers library.

In [25]:
from langchain_huggingface import HuggingFaceEmbeddings

# Load the embeddings model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md: 0.00B [00:00, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/350 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

tokenizer.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

- Step 3: Store the Embeddings in Chroma
  - Initialize the Chroma Database: Create a Chroma collection to store the embeddings and their associated documents.

In [26]:
from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="EngGenAI",
    embedding_function=embeddings,
)

In [27]:
# Add documents and their embeddings to Chroma (might take some time)
vector_store.add_documents(documents=splits)

['b1fb11d2-9ab3-48d3-b7f8-9be1d7b6b4b2',
 '92762e40-b4e1-4d20-a29e-985cbb515ade',
 '5b3a4a86-8f12-4f51-ad9b-499c68b6315d',
 '9fc77545-d403-425c-a258-3003a65245cc',
 '3a45a609-1a9b-4999-8ff1-9be1ecdac0dd',
 '03a3e91d-29aa-4ce6-a511-c00db3a7e450',
 '8bedcd33-4971-4789-90f6-86d08db7ce9b',
 '786f961c-7eb8-4955-9593-c546508ed568',
 'f1e45a4a-fe29-4399-9cb9-e13ed733a1cd',
 'b7e62888-af34-418f-94d7-12ad2ad7df82',
 '64a0e2ca-e3e4-4734-857f-049a01bbd946',
 'fcf6f3d9-a488-4da1-a414-d7706c6266ec',
 'ed1b1472-8595-4b69-bc49-a82af5da1f68',
 '27a8798c-612d-4639-bc7e-1905fffefe7d',
 'f4064d16-8c25-452a-bf16-ae2782cd3118',
 '6bc3f10d-8348-4911-b2ff-e5786183b08a',
 'd57c4994-8c0a-4392-9749-842c2e8fddd0',
 'e580decd-a584-433e-896a-1c568f0089a3',
 'a3e41380-2db4-4263-b31e-a3d0dc53c5e9',
 'a1639986-bf6a-4648-9357-abb1f6a938f6',
 '1ab87f0a-1699-48d6-9e78-b6dbc225d6c3',
 'fa7243ba-181c-4d50-9cff-012c4f2903c8',
 'e11647ba-2a1a-44be-8188-95e1f803cc82',
 'dff332bf-95ea-4146-832f-0bbac15fad2f',
 '626ef922-7b20-

- Step 4: Perform a Semantic Search
  - Generate an embedding for the query using the same model.
  - Perform a similarity search in Chroma to find the top matching documents.

In [28]:
results_chroma = vector_store.similarity_search(
    query,
    k=2,
)

# Print the results
for res in results_chroma:
    print(f"* {res.page_content}")

* LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:

Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory
* Prompt LM with 100 most re

- Step 5: Use the vector store as a retriever
  - You can also transform the vector store into a retriever for easier usage in your chains. For more information on the different search types and kwargs you can pass, please visit the API reference (https://python.langchain.com/api_reference/chroma/vectorstores/langchain_chroma.vectorstores.Chroma.html#langchain_chroma.vectorstores.Chroma.as_retriever).

In [29]:
retriever_chroma = vector_store.as_retriever(
    search_type="mmr", search_kwargs={"k": 1}
)

- Step 6: Answer the Query Using an LLM
  - We now pass the retrieved documents to the same LLM we used in the previous exercise.
  - we define a new chain using the retriever for the chroma database

In [31]:
rag_chain_chroma = (
    {"context": retriever_chroma | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

result_chroma = rag_chain_chroma.invoke(query)

print(result_chroma)

 The components of agent systems include planning, memory, and tool use. Planning involves tasks decomposition, reasoning, and self-reflection. Memory deals with short-term memory, long-term memory, and external vector stores. Tool use refers to the ability of the agent to use external tools and APIs to perform various tasks.





---



---



**Part 3: Using a MultiQueryRetriever and Compression**

In this exercise, you will learn how to use a MultiQueryRetriever and Contextual Compression Retriever to improve document retrieval and generation in a Retrieval-Augmented Generation (RAG) system. The goal is to show the impact of generating multiple queries from different perspectives and compressing retrieved documents for more efficient generation.

**Methods**
- MultiQueryRetriever: This retrieves documents based on multiple perspectives of the same query. It helps mitigate retrieval issues that can arise from slight changes in query wording, or from embeddings that may not perfectly capture the meaning of the query.

- Contextual Compression Retriever: After retrieving a large set of documents, the compression step reduces the volume of information, selecting the most relevant parts to present to the model, improving efficiency and focus on the most pertinent content.

- Step 1: MultiQueryRetriever
  - The MultiQueryRetriever is initialized with the vector store and the language model.
  - The model generates multiple variations of the query and retrieves documents for each variation.

In [32]:
from langchain.retrievers.multi_query import MultiQueryRetriever

retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vector_store.as_retriever(), llm=llm
)

- Step 2: Enable logging for the queries to track MultiQueryRetriever operations

In [33]:
# Set logging for the queries
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

In [34]:
unique_docs = retriever_from_llm.invoke(query)
len(unique_docs)

INFO:langchain.retrievers.multi_query:Generated queries: ['(Yurko, L. & McClatchey, R., 2004)', '```']


8

- Step 3: Use the retrieved documents in a RAG chain to generate an answer

In [35]:
rag_chain_llm = (
    {"context": retriever_from_llm | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

result_multiquery = rag_chain_llm.invoke(query)
print(result_multiquery)

INFO:langchain.retrievers.multi_query:Generated queries: ['## And here are 3 different ways of rephrasing the question to retrieve different documents from a vector database.', '    by a neural network: "Please generate 3 distinct sentences that can be used to retrieve relevant documents from a vector database, given the user question \'What are the components of agent systems?\'"', '    Important: Perform as many of these 3 transformations as possible, because the more variations you give, the more potential documents your user can retrieve, and the less likely it is that they will overlook something important.', '1. Generate a positive variant. For example, replacing the negative sentiment in the original question "I hate her" with a positive one "I love her".', '2. Generate a negative variant. For example, replacing the positive sentiment in the original question "I love her" with a negative one "I hate her".', '3. Generate a neutral variant. For example, replacing the sentiment in 

 This can be considered a simple design of a tree-based system using a large language model (LLM) as its core. This system has three components:

The planning component: It decomposes the complex task into smaller, manageable subtasks. The decomposition process allows the system to plan ahead and execute the tasks step-by-step.

The memory component: This component stores and retrieves information. It can be used to remember past actions, communicate with other agents, and even search for solutions on the internet.

The tool use component: This component enables the agent to use external tools and resources, such as APIs or pre-trained models, to supplement its capabilities. The agent can learn how to use these tools through task-specific instructions.

In conclusion, this simple design uses an LLM as the system's brain, which is augmented with planning, memory, and tool-use components to allow it to perform complex tasks autonomously.

In addition, LLMs have the capability to learn an

- Step 4: Compression Retriever:
  - After retrieving documents using MultiQueryRetriever, the ContextualCompressionRetriever applies a compression technique.
  - The compression reduces the retrieved document set to a smaller, more focused set of relevant information. This makes the subsequent generation step more efficient and targeted.
  - The LLMChainExtractor is used to compress the documents by extracting the most relevant information.

In [36]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever_from_llm
)
compressed_docs = compression_retriever.invoke(query)

INFO:langchain.retrievers.multi_query:Generated queries: ['We would like to know more about Agent Systems!', '    The specification of the goal: What are the components of agent systems? The team wants to know more about this field of study.', '    A hypothetical candidate answer from the database: You can find more information about Agent Systems on this website: https://www.scholarpedia.org/article/Agent_system.', '    A related concept: "agent systems" might be related to other concepts such as "multiagent systems," "agents," and "artificial intelligence." By understanding how these concepts are related, you might be able to generate more relevant questions for your user.', '    A comparative perspective: Maybe you could use contrasting questions to help the user better understand the difference between "agent systems" and other similar concepts.', '    A different perspective: The user might be looking for a specific component of an agent system, such as a particular algorithm or f

HfHubHTTPError: 402 Client Error: Payment Required for url: https://router.huggingface.co/featherless-ai/v1/completions (Request ID: Root=1-68c93dda-472afa542880925e5a20bee3;ac219dbd-f09c-49e2-b06b-4a32e84d29f2)

You have exceeded your monthly included credits for Inference Providers. Subscribe to PRO to get 20x more monthly included credits.

- Step 5: Use the compressed documents in a RAG chain to generate an answer

In [None]:
rag_chain_compressed = (
    {"context": compression_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

result_compressed = rag_chain_compressed.invoke(query)
print(result_compressed)



---



---



**Part 4: Implementing an Ensemble Retriever using BM25 and Chroma**

We have a lot of retrievers introduced but it might be useful to combine some of them for a better retrieval.

It is very useful in the real-world to have a combination of sparse and dense retrieval to leverage both techniques, taking advantage of the strengths of sparse and dense retrieval.

Please visit the langchain documentation for the ensemble retriever (https://python.langchain.com/docs/how_to/ensemble_retriever/) and implement it for the
bm25_retriever from part 1 and the chroma_retriever from part 2. Use it in a chain to query the LLM.

In [41]:
from langchain.retrievers import EnsembleRetriever

# initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever_split, retriever_chroma], weights=[0.3, 0.7]
)

results = ensemble_retriever.invoke(query)
results
#len(results)

[Document(id='b1fb11d2-9ab3-48d3-b7f8-9be1d7b6b4b2', metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/'}, page_content='LLM Powered Autonomous Agents\n    \nDate: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng\n\n\nBuilding agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.\nAgent System Overview#\nIn a LLM-powered autonomous agent system, LLM functions as the agent’s brain, complemented by several key components:\n\nPlanning\n\nSubgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.\nReflection and refinement: The agent can do self-criticism and self-reflection over

In [38]:
rag_chain_ensemble = (
    {"context": ensemble_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

result_ensemble = rag_chain_ensemble.invoke(query)

print(result_chroma)

HfHubHTTPError: 402 Client Error: Payment Required for url: https://router.huggingface.co/featherless-ai/v1/completions (Request ID: Root=1-68c96a4c-48ec0dcf6d341ddf4379db4d;0c8d1f36-a62f-4bc7-80f9-0edb74ef65e5)

You have exceeded your monthly included credits for Inference Providers. Subscribe to PRO to get 20x more monthly included credits.