In [None]:
!pip install -qU langchain langchain-community langchain-chroma langchain-openai langchain-text-splitters beautifulsoup4

# Build a Retrieval Augmented Generation (RAG) App

One of the most powerful applications enabled by LLMs is sophisticated question-answering (Q&A) chatbots, which can answer questions about specific source information. These applications use a technique known as **Retrieval Augmented Generation (RAG)**.

## What is RAG?

RAG is a technique for augmenting LLM knowledge with additional data.

If we want to build AI applications that can reason about private data or data introduced after a model's training cutoff date, we need to augment the knowledge of the model with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as RAG.

## Concepts

A typical RAG application has two main components:
* **Indexing**: a pipeline for ingesting data from a source and indexing it. *This usually happens offline.*
* **Retrieval and generation**: the actual RAG chain, which takes the user query at run time, retrieves the relevant data from the index, and then passes it to the model.

### Indexing

1. **Load**: We need to load our data. This is done with Document Loaders.
2. **Split**: Text splitters break large `Documents` into smaller chunks. This is useful both for indexing data and for passing it into a model, since large chunks are harder to search over and may not fit in a model's finite context window.
3. **Store**: We need somewhere to store and index our splits, so that they can later be searched over. This is done using a `VectorStore` and `Embeddings` model.

### Retrieval and generation

4. **Retrieve**: Given a user input, relevant splits are retrieved from storage using a `Retriever`.
5. **Generate**: A `ChatModel`/`LLM` produces an answer using a prompt that includes the question and the retrieved data.
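
To make the five steps concrete before introducing any LangChain components, here is a toy sketch in plain Python. Everything in it is an illustrative assumption: a hard-coded string stands in for the loader, sentence splitting stands in for a text splitter, word overlap stands in for embedding similarity, and the "generation" step only builds the prompt that would be sent to a model.

```python
# Toy sketch of Load -> Split -> Store -> Retrieve -> Generate.
# All function names here are hypothetical; none of this is LangChain code.

def load() -> str:
    # 1. Load: a hard-coded string stands in for a Document Loader.
    return (
        "Task decomposition breaks a big task into smaller steps. "
        "Memory lets an agent recall past information. "
        "Tool use lets an agent call external APIs."
    )

def split(text: str) -> list:
    # 2. Split: one chunk per sentence (real splitters target a chunk size).
    return [s.strip() + "." for s in text.split(".") if s.strip()]

def tokenize(text: str) -> set:
    # Lowercased words with surrounding punctuation removed.
    return {w.strip(".,?!").lower() for w in text.split()}

def store(chunks: list) -> list:
    # 3. Store: index each chunk by its set of words (a stand-in for embeddings).
    return [(tokenize(chunk), chunk) for chunk in chunks]

def retrieve(index: list, question: str, k: int = 1) -> list:
    # 4. Retrieve: rank chunks by how many words they share with the question.
    q_words = tokenize(question)
    ranked = sorted(index, key=lambda item: len(q_words & item[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def generate(question: str, context_chunks: list) -> str:
    # 5. Generate: build the prompt; a real app would send this to a model.
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

toy_question = "What is task decomposition?"
toy_index = store(split(load()))
toy_hits = retrieve(toy_index, toy_question, k=1)
toy_prompt = generate(toy_question, toy_hits)
print(toy_prompt)
```

The rest of this guide replaces each of these stand-ins with a real component.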

## Setup

In [None]:
import os

langchain_api_key = 'your_langchain_api_key_here'  # Replace with your actual LangChain API key
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = langchain_api_key

openai_api_key = 'your_openai_api_key_here'  # Replace with your actual OpenAI API key
os.environ['OPENAI_API_KEY'] = openai_api_key

## Preview

In this section, we will build an app that answers questions about the content of a website. The specific page we will use is the LLM Powered Autonomous Agents blog post by Lilian Weng, so the app will let us ask questions about the contents of that post.

In [None]:
import bs4
from langchain import hub
from langchain_chroma import Chroma
from langchain_community.document_loaders import WebBaseLoader
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter


# Create the chat model
llm = ChatOpenAI(model='gpt-3.5-turbo')

# Load, chunk, and index the contents of the blog
loader = WebBaseLoader(
    web_paths=('https://lilianweng.github.io/posts/2023-06-23-agent/',),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=('post-content', 'post-title', 'post-header')
        )
    ),
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
vector_store = Chroma.from_documents(documents=splits,
                                     embedding=OpenAIEmbeddings())


# Retrieve and generate using the relevant snippets of the blog
retriever = vector_store.as_retriever()
prompt = hub.pull('rlm/rag-prompt')


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {'context': retriever | format_docs, 'question': RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke('What is Task Decomposition?')

In [None]:
# clean up
vector_store.delete_collection()

## Detailed walkthrough

### 1. Indexing: Load

First we need to load the blog post contents. We can use `DocumentLoaders` for this. In this case, we will use the `WebBaseLoader`, which uses `urllib` to load HTML from web URLs and `BeautifulSoup` to parse it to text.

We can customize the HTML -> text parsing by passing in parameters to the `BeautifulSoup` parser via `bs_kwargs`. In this case only HTML tags with class "post-content", "post-title", or "post-header" are relevant, and we will remove all others.

In [12]:
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Only keep post title, header, and content from the full HTML
bs4_strainer = bs4.SoupStrainer(
    class_=('post-title', 'post-header', 'post-content')
)

loader = WebBaseLoader(
    web_paths=('https://lilianweng.github.io/posts/2023-06-23-agent/',),
    bs_kwargs={'parse_only': bs4_strainer},
)
docs = loader.load()

len(docs[0].page_content)

43131

In [13]:
print(docs[0].page_content[:500])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In


In [8]:
type(docs), type(docs[0])

(list, langchain_core.documents.base.Document)
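
To illustrate what "only keep elements with certain classes" means, here is a rough standard-library sketch of the same idea. It is not how `BeautifulSoup` or `SoupStrainer` is implemented; the `ClassTextExtractor` class, the `VOID_TAGS` set, and the sample HTML are all hypothetical and exist only for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

class ClassTextExtractor(HTMLParser):
    """Collect text only from elements (and their children) with a kept class."""

    def __init__(self, keep_classes):
        super().__init__()
        self.keep_classes = set(keep_classes)
        self.depth = 0   # > 0 while we are inside a kept element
        self.parts = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        # Enter a kept element, or go one level deeper inside one.
        if self.depth > 0 or classes & self.keep_classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())

sample_html = """
<html><body>
<nav class="menu">Home | About</nav>
<h1 class="post-title">LLM Agents</h1>
<div class="post-content"><p>Agents plan<br>and act.</p></div>
<footer class="site-footer">Copyright</footer>
</body></html>
"""

extractor = ClassTextExtractor(["post-title", "post-header", "post-content"])
extractor.feed(sample_html)
print(extractor.parts)
```

The navigation and footer text are dropped, which is the effect we want from the strainer: less boilerplate in the text that gets chunked and embedded.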

### 2. Indexing: Split

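The loaded document is over 43k characters long. This is too long to fit in the context window of many models, and even models that can fit the full post may struggle to find information in very long inputs.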

To handle this we have to split the `Document` into chunks for embedding and vector storage, which should help us retrieve only the most relevant bits of the blog post at run time.

In this case, we will split our documents into chunks of 1,000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it.

We use the `RecursiveCharacterTextSplitter`, which will recursively split the document using common separators like new lines until each chunk has the appropriate size.

We set `add_start_index=True` so that the character index at which each split `Document` starts within the original `Document` is preserved in the metadata attribute `start_index`.

In [16]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    add_start_index=True,
)

all_splits = text_splitter.split_documents(docs)

len(all_splits)

66

In [17]:
# Check the length of the first chunk
len(all_splits[0].page_content)

969

In [18]:
# Check the metadata of one of the splits
all_splits[10].metadata

{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
 'start_index': 7056}
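
To build intuition for `chunk_size`, `chunk_overlap`, and `start_index`, here is a simplified fixed-width splitter in plain Python. It is only a sketch: the hypothetical `split_with_overlap` function cuts at exact character offsets, whereas `RecursiveCharacterTextSplitter` prefers to cut at separators, which is why its chunks are usually a bit shorter than `chunk_size`.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into fixed-width windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "start_index": start,
            "page_content": text[start:start + chunk_size],
        })
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

toy_chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
for chunk in toy_chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the tail of one chunk is repeated at the head of the next.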

### 3. Indexing: Store

Now we need to index our 66 text chunks so that we can search over them at runtime.

We need to embed the contents of each document split and insert these embeddings into a vector database (or vector store). When we want to search over our splits, we take a text search query, embed it, and perform some sort of "similarity" search to identify the stored splits with the most similar embeddings to our query embedding.

A common and simple similarity measure is cosine similarity: the cosine of the angle between two embedding vectors.
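
As a minimal sketch of the idea, the following plain-Python code computes cosine similarity and uses it to rank a few stored vectors against a query vector. The two-dimensional vectors and their labels are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a vector store typically uses an optimized index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vector, stored_vectors, k):
    """Return the labels of the k stored vectors most similar to the query."""
    scored = sorted(
        ((cosine_similarity(query_vector, vec), label)
         for label, vec in stored_vectors.items()),
        reverse=True,
    )
    return [label for _, label in scored[:k]]

# Hypothetical 2-D "embeddings" for three chunks.
toy_vectors = {
    "planning": [0.9, 0.1],
    "memory": [0.1, 0.9],
    "tools": [0.7, 0.7],
}

toy_matches = top_k([1.0, 0.0], toy_vectors, k=2)
print(toy_matches)
```

Because cosine similarity depends only on direction, scaling a vector does not change its score.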

We can embed and store all of our document splits in a single command using the `Chroma` vector store and `OpenAIEmbeddings` model.

In [19]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

vector_store = Chroma.from_documents(
    documents=all_splits,
    embedding=OpenAIEmbeddings(),
)

This completes the **Indexing** portion of the pipeline. We have a query-able vector store containing the chunked contents of our blog post. Given a user question, we should ideally be able to return the snippets of the blog post that answer the question.

### 4. Retrieval and Generation: Retrieve

We want to create a simple application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and returns an answer.

First we need to define our logic for searching over documents. LangChain defines a `Retriever` interface which wraps an index that can return relevant `Documents` given a string query.

The most common type of `Retriever` is the `VectorStoreRetriever`, which uses the similarity search capabilities of a vector store to facilitate retrieval. Any `VectorStore` can easily be turned into a `Retriever` with `VectorStore.as_retriever()`:

In [20]:
retriever = vector_store.as_retriever(
    search_type='similarity',
    search_kwargs={"k": 3},
)

retrieved_docs = retriever.invoke('What are the approaches to Task Decomposition?')

len(retrieved_docs)

3

In [21]:
print(retrieved_docs[0].page_content)

Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.


### 5. Retrieval and Generation: Generate

Finally, we put it all together into a chain that takes a question, retrieves relevant documents, constructs a prompt, passes that to a model, and parses the output.

We will use the gpt-3.5-turbo OpenAI chat model as an example.

In [22]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model='gpt-3.5-turbo')

We will use a prompt for RAG that is checked into the LangChain prompt hub.

In [23]:
from langchain import hub

prompt = hub.pull('rlm/rag-prompt')

example_messages = prompt.invoke(
    {'context': 'filler context',
     'question': 'filler question'},
).to_messages()

example_messages

[HumanMessage(content="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: filler question \nContext: filler context \nAnswer:", additional_kwargs={}, response_metadata={})]

In [24]:
print(example_messages[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:


We will use the LCEL Runnable protocol to define the chain, which allows us to:
* pipe together components and functions in a transparent way
* automatically trace our chain in LangSmith
* get streaming, async, and batched calling out of the box.

In [25]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {'context': retriever | format_docs, 'question': RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

In [26]:
for chunk in rag_chain.stream('What is Task Decomposition?'):
    print(chunk, end='', flush=True)

Task Decomposition is a technique where complex tasks are broken down into smaller and simpler steps for better understanding and execution. It involves transforming big tasks into multiple manageable tasks to enhance model performance. Task decomposition can be achieved through simple prompting, task-specific instructions, or human inputs.

Here is what happens inside the LCEL chain:
* Each of these components (`retriever`, `prompt`, `llm`, and `StrOutputParser`) is an instance of `Runnable`, which means that they all implement the same methods -- such as sync and async `.invoke`, `.stream`, or `.batch` -- which makes them easier to connect together. They can be connected into a `RunnableSequence` via the `|` operator.
* LangChain will automatically cast certain objects to runnables when met with the `|` operator. Here, `format_docs` is cast to a `RunnableLambda`, and the dict with `"context"` and `"question"` is cast to a `RunnableParallel`.

As we have seen above, the input to `prompt` is expected to be a dict with keys `"context"` and `"question"`. So the first element of this chain builds runnables that will calculate both of these from the input question:
* `retriever | format_docs` passes the question through the retriever, generating `Document` objects, and then to `format_docs` to generate strings;
* `RunnablePassthrough()` passes through the input question unchanged.

If we construct
```python
chain = (
    {'context': retriever | format_docs, 'question': RunnablePassthrough()}
    | prompt
)
```
then `chain.invoke(question)` would build a formatted prompt, ready for inference. Note that when developing with LCEL, it can be practical to test with sub-chains like this.

Finally, the last steps of the chain are `llm`, which runs the inference, and `StrOutputParser()`, which just plucks the string content out of the LLM's output message.
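
To demystify the `|` operator and the automatic casting described above, here is a toy re-creation in plain Python. It is only a sketch of the pattern, not LangChain's implementation: the `Step` class and `coerce` function are hypothetical names, and the retriever and prompt are simple stand-in functions.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function, supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # `self | other`: run self first, then feed its output to other.
        other = coerce(other)
        return Step(lambda value: other.invoke(self.invoke(value)))

    def __ror__(self, other):
        # `other | self` where other is a plain dict or function.
        return coerce(other) | self


def coerce(obj):
    """Turn dicts and plain functions into Step objects."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        # Like a parallel step: every value receives the same input.
        steps = {key: coerce(val) for key, val in obj.items()}
        return Step(lambda value: {k: s.invoke(value) for k, s in steps.items()})
    if callable(obj):
        return Step(obj)
    raise TypeError(f"cannot use {type(obj).__name__} as a step")


# Stand-ins for the retriever, passthrough, and prompt.
fake_retriever = Step(lambda query: ["chunk about " + query, "another chunk"])
passthrough = Step(lambda value: value)

def join_chunks(chunks):
    return "\n\n".join(chunks)

def fill_prompt(inputs):
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}"

toy_chain = (
    {"context": fake_retriever | join_chunks, "question": passthrough}
    | Step(fill_prompt)
)
toy_result = toy_chain.invoke("task decomposition")
print(toy_result)
```

The dict on the left of `|` is converted into a step that fans the question out to both branches, which mirrors how the `context` and `question` keys are both computed from the single input string.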

#### Built-in chains

If preferred, LangChain includes convenience functions that implement the above LCEL. There are two useful functions:
* `create_stuff_documents_chain` specifies how retrieved context is fed into a prompt and LLM. We will "stuff" the contents into the prompt -- i.e., we will include all retrieved context without any summarization or other processing. It largely implements our above `rag_chain`, with input keys `context` and `input` -- it generates an answer using retrieved context and query.
* `create_retrieval_chain` adds the retrieval step and propagates the retrieved context through the chain, providing it alongside the final answer. It has input key `input`, and includes `input`, `context`, and `answer` in its output.

In [28]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ('system', system_prompt),
        ('human', "{input}"),
    ]
)

question_answer_chain = create_stuff_documents_chain(
    llm=llm,
    prompt=prompt,
)
rag_chain = create_retrieval_chain(
    retriever,
    question_answer_chain,
)

response = rag_chain.invoke(
    {'input': 'What is Task Decomposition?'}
)
print(response['answer'])

Task decomposition is a technique used to break down complex tasks into smaller and simpler steps, making it easier for agents or models to tackle them. Methods like Chain of Thought (CoT) and Tree of Thoughts help in transforming big tasks into manageable components for better understanding and execution. Task decomposition can be achieved through various means, including simple prompting by language models, task-specific instructions, or human inputs.
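
The data flow of these two helpers, as described above, can be sketched in plain Python. This is an assumption-laden illustration rather than the library's code: `make_stuff_chain`, `make_retrieval_chain`, `fake_retrieve`, and `fake_llm` are hypothetical names, and the context items are plain strings instead of `Document` objects.

```python
def make_stuff_chain(llm, template):
    """'Stuff' all context into one prompt and pass it to the model."""
    def run(inputs):
        stuffed = "\n\n".join(inputs["context"])
        return llm(template.format(context=stuffed, input=inputs["input"]))
    return run

def make_retrieval_chain(retrieve, answer_chain):
    """Retrieve context for the input, then return input, context, and answer."""
    def run(inputs):
        context = retrieve(inputs["input"])
        answer = answer_chain({"input": inputs["input"], "context": context})
        return {"input": inputs["input"], "context": context, "answer": answer}
    return run

# Stand-ins for a real retriever and a real chat model.
def fake_retrieve(query):
    return ["Chunk A about planning.", "Chunk B about memory."]

def fake_llm(prompt_text):
    return "ANSWER<" + prompt_text + ">"

toy_template = "Answer using this context:\n{context}\n\nQuestion: {input}"
toy_rag = make_retrieval_chain(
    fake_retrieve,
    make_stuff_chain(fake_llm, toy_template),
)
toy_response = toy_rag({"input": "What is planning?"})
print(sorted(toy_response))
```

Because the retrieved context is returned alongside the answer, the caller can show sources without running retrieval a second time.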


#### Returning sources

Often in Q&A applications it is important to show users the sources that were used to generate the answer.

LangChain's built-in `create_retrieval_chain` will propagate retrieved source documents through the output in the `"context"` key:

In [29]:
response['answer']

'Task decomposition is a technique used to break down complex tasks into smaller and simpler steps, making it easier for agents or models to tackle them. Methods like Chain of Thought (CoT) and Tree of Thoughts help in transforming big tasks into manageable components for better understanding and execution. Task decomposition can be achieved through various means, including simple prompting by language models, task-specific instructions, or human inputs.'

In [30]:
for document in response['context']:
    print(document)
    print('--------')

page_content='Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition#
Chain of thought (CoT; Wei et al. 2022) has become a standard prompting technique for enhancing model performance on complex tasks. The model is instructed to “think step by step” to utilize more test-time computation to decompose hard tasks into smaller and simpler steps. CoT transforms big tasks into multiple manageable tasks and shed lights into an interpretation of the model’s thinking process.' metadata={'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/', 'start_index': 1585}
--------
page_content='Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can

#### Customizing the prompt

As shown above, we can load prompts (e.g., `rlm/rag-prompt`) from the prompt hub. The prompt can also be easily customized:

In [31]:
from langchain_core.prompts import PromptTemplate

template = """
Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}

Helpful Answer:
"""

custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain = (
    {'context': retriever | format_docs, 'question': RunnablePassthrough()}
    | custom_rag_prompt
    | llm
    | StrOutputParser()
)


rag_chain.invoke('What is Task Decomposition?')

'Task decomposition is a technique used to break down complex tasks into smaller and simpler steps, allowing for better planning and execution by autonomous agents. It can be achieved through methods like Chain of Thought and Tree of Thoughts, which help in transforming big tasks into manageable subgoals. Task decomposition can be facilitated by LLM with simple prompting, task-specific instructions, or human inputs. Thanks for asking!'