# Retrieval-augmented generation (RAG)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/docs/use_cases/question_answering/qa.ipynb)

## Overview

Suppose you have some unstructured text data (in the form of a PDF, an article, Notion pages, etc.) and want to ask questions related to the contents of that data. LLMs are a great tool for this, and LangChain comes with a number of components and chains specifically designed to help with this use case.

In this guide we'll:
- go over a typical architecture for a question-answering application over documents
- build a QA app over the [LLM Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) blog post by Lilian Weng
- walk through the relevant LangChain components
- see how [LangSmith](/docs/langsmith/) can help trace our RAG app
- touch on evaluating a RAG application
- explore more advanced RAG applications and see how to serve them with [LangServe](/docs/langserve) by looking at some [LangChain Templates](/docs/templates/)

**Note**
Retrieval-augmented generation is a very general technique that can be used for more than just QA on more than just unstructured data. It refers to getting data at runtime ("retrieval") which has not been memorized by the model, and passing that in as part of the prompt to the model ("-augmented generation"). This allows you to perform tasks that require information not baked into the model. Two RAG use cases which we cover elsewhere are:
- [QA over structured data](/docs/use_cases/qa_structured/sql) (e.g., SQL)
- [QA over code](/docs/use_cases/question_answering/code_understanding) (e.g., Python)

![intro.png](/img/qa_intro.png)

## Architecture

To build a RAG application, you'll generally need to set up two pipelines/chains:
- **Indexing**: a pipeline for ingesting data from a source and indexing it.
- **Retrieval and generation**: the actual RAG chain, which takes the user query at run time and retrieves the relevant data from the index, then passes that to the model.

The simplest and most common full sequence from raw data to answer looks like:
#### Indexing time
1. **Load**: First we need to load our data. We'll use [DocumentLoaders](/docs/modules/data_connection/document_loaders/) for this.
2. **Split**: [Text splitters](/docs/modules/data_connection/document_transformers/) break large `Documents` into smaller chunks. This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won't fit in a model's finite context window.
3. **Store**: We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a [VectorStore](/docs/modules/data_connection/vectorstores/) and [Embeddings](/docs/modules/data_connection/text_embedding/) model.
#### Run time
4. **Retrieve**: Given a user input, relevant splits are retrieved from storage using a [Retriever](/docs/modules/data_connection/retrievers/).
5. **Generate**: A [ChatModel](/docs/modules/model_io/chat_models) / [LLM](/docs/modules/model_io/llms/) produces an answer using a prompt that includes the question and the retrieved data.
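
The five steps above can be sketched end to end without any external services. The toy below is not LangChain code: it assumes a hard-coded string as the data source, a bag-of-words count as a stand-in "embedding", and it stops at building the prompt rather than calling a model. All function names are illustrative.

```python
import math
import re
from collections import Counter


def load() -> str:
    # 1. Load: a hard-coded string stands in for a real document loader.
    return (
        "Task decomposition breaks a big task into smaller steps. "
        "Memory lets an agent recall past information. "
        "Tool use lets an agent call external APIs."
    )


def split(text: str) -> list[str]:
    # 2. Split: one chunk per sentence (real splitters work against a size limit).
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def embed(text: str) -> Counter:
    # Assumed toy "embedding": lowercase word counts, not a learned model.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def store(chunks: list[str]) -> list[tuple[str, Counter]]:
    # 3. Store: keep each chunk next to its vector.
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 1) -> list[str]:
    # 4. Retrieve: rank stored chunks by similarity to the question.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    # 5. Generate: this prompt is what would be sent to the model.
    return "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}\nAnswer:"


index = store(split(load()))
question = "What is task decomposition?"
top_chunks = retrieve(index, question, k=1)
prompt_text = build_prompt(question, top_chunks)
print(prompt_text)
```

The indexing functions (`load`, `split`, `store`) run once, while `retrieve` and `build_prompt` run for every question, which mirrors the indexing-time / run-time division.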

![flow.jpeg](/img/qa_flow.jpeg)

## Setup

### Dependencies

We'll use an OpenAI chat model and embeddings and a Chroma vector store in this walkthrough, but everything shown here works with any [ChatModel](/docs/integrations/chat/) or [LLM](/docs/integrations/llms/), [Embeddings](/docs/integrations/text_embedding/), and [VectorStore](/docs/integrations/vectorstores/) or [Retrievers](/docs/integrations/retrievers). 

First set environment variables and install packages:

In [None]:
!pip install -U langchain openai chromadb langchainhub bs4

# Set env var OPENAI_API_KEY or load from a .env file
# import dotenv

# dotenv.load_dotenv()

### LangSmith

Many of the applications you build with LangChain will contain multiple steps with multiple LLM calls. As these applications get more and more complex, it becomes crucial to be able to inspect what exactly is going on inside your chain or agent. The best way to do this is with [LangSmith](https://smith.langchain.com).

LangSmith is optional, but it is helpful. If you do want to use LangSmith, after you sign up at the link above, make sure to set your environment variables to start logging traces:

```bash
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_API_KEY="..."
```
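
When working in a notebook, it can be more convenient to set the same two variables from Python. This is a minimal sketch: `enable_langsmith_tracing` is a hypothetical helper name, and the `"..."` key is the same placeholder as in the shell snippet, to be replaced with a real key.

```python
import os


def enable_langsmith_tracing(api_key: str) -> None:
    """Set the two variables from the shell snippet for the current process only."""
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = api_key


# Placeholder value, as in the shell example; substitute your own key.
enable_langsmith_tracing("...")
```

Variables set this way are visible only to the current Python process and anything it launches, so the call belongs before the first chain is run.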

## Quickstart

Suppose we want to build a QA app over the [LLM Powered Autonomous Agents](https://lilianweng.github.io/posts/2023-06-23-agent/) blog post by Lilian Weng. We can create a simple pipeline for this in ~20 lines of code:

In [1]:
import bs4
from langchain import hub
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

In [4]:
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header")))
)
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt 
    | llm
    | StrOutputParser()
)

In [5]:
rag_chain.invoke("What is Task Decomposition?")

'Task decomposition is the process of breaking down a task into smaller subgoals or steps. It can be done using simple prompting, task-specific instructions, or with human inputs. Task decomposition helps in planning and organizing complex tasks.'

:::note LangSmith trace

[Here](https://smith.langchain.com/public/2270a675-74de-47ac-b111-b232d8340a64/r) is the LangSmith trace for this chain.
:::

## Detailed walkthrough

Let's go through the above code step-by-step to really understand what's going on.

## Step 1. Load

We need to first load the blog post contents. We can use `DocumentLoader`s for this, which are objects that load in data from a source as `Documents`.  A `Document` is an object with `page_content` (str) and `metadata` (dict) attributes. 

In this case we'll use the `WebBaseLoader`, which uses `urllib` and `BeautifulSoup` to load and parse the passed-in web URLs, returning one `Document` per URL. We can customize the HTML-to-text parsing by passing parameters to the `BeautifulSoup` parser via `bs_kwargs`. In this case only HTML tags with class "post-content", "post-title", or "post-header" are relevant, so we'll remove all others.
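
To see what "keep only tags with certain classes" means in practice, here is a small sketch that uses the standard library's `html.parser` instead of `BeautifulSoup`. It is a stand-in for illustration, not how `WebBaseLoader` or `SoupStrainer` is implemented; the `ClassFilter` name and the sample HTML are made up, and the sketch assumes well-formed, properly nested HTML.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassFilter(HTMLParser):
    """Collect text that sits inside any element carrying one of `keep_classes`."""

    def __init__(self, keep_classes):
        super().__init__()
        self.keep = set(keep_classes)
        self.depth = 0   # > 0 while we are inside a kept element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Enter a kept element, or go one level deeper inside one.
        if self.depth > 0 or self.keep.intersection(classes):
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


sample_html = (
    '<html><body><nav class="menu">Home</nav>'
    '<h1 class="post-title">Title</h1>'
    '<div class="post-content"><p>Body <b>text</b></p></div>'
    "<footer>foot</footer></body></html>"
)
parser = ClassFilter(["post-content", "post-title", "post-header"])
parser.feed(sample_html)
kept_text = "".join(parser.parts)
print(kept_text)
```

The navigation and footer text is dropped because it never appears inside an element with one of the listed classes.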

In [7]:
import bs4
from langchain.document_loaders import WebBaseLoader

loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(parse_only=bs4.SoupStrainer(class_=("post-content", "post-title", "post-header")))
)
docs = loader.load()

In [8]:
len(docs[0].page_content)

42824

In [9]:
print(docs[0].page_content[:500])



      LLM Powered Autonomous Agents
    
Date: June 23, 2023  |  Estimated Reading Time: 31 min  |  Author: Lilian Weng


Building agents with LLM (large language model) as its core controller is a cool concept. Several proof-of-concepts demos, such as AutoGPT, GPT-Engineer and BabyAGI, serve as inspiring examples. The potentiality of LLM extends beyond generating well-written copies, stories, essays and programs; it can be framed as a powerful general problem solver.
Agent System Overview#
In


### Go deeper
- See further documentation on loaders [here](/docs/modules/data_connection/document_loaders/).
- Find the relevant document loader integration for your use case among the more than 160 available [here](/docs/integrations/document_loaders).


## Step 2. Split

Our loaded document is over 42k characters long. This is too long to fit in the context window of many models. And even for those models that could fit the full post in their context window, empirically models struggle to find the relevant context in very long prompts. 

So we'll split the `Document` into chunks for embedding and vector storage. This should help us retrieve only the most relevant bits of the blog post at run time.

In this case we'll split our documents into chunks of 1000 characters with 200 characters of overlap between chunks. The overlap helps mitigate the possibility of separating a statement from important context related to it. We use the `RecursiveCharacterTextSplitter`, which will (recursively) split the document using common separators (like new lines) until each chunk is the appropriate size.
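
The effect of `chunk_size`, `chunk_overlap` and the recorded start index can be seen with a much simpler splitter. The sketch below is a plain sliding window over characters; it is not the recursive, separator-aware algorithm of `RecursiveCharacterTextSplitter`, and `split_with_overlap` is an illustrative name.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Cut `text` into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append({
            "page_content": text[start:start + chunk_size],
            "start_index": start,  # position of the chunk in the original text
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk["start_index"], chunk["page_content"])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last four characters of one chunk reappear at the start of the next. A fixed window can cut through the middle of a sentence, which is the problem that splitting on separators is meant to reduce.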

In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
all_splits = text_splitter.split_documents(docs)

In [11]:
len(all_splits)

66

In [12]:
len(all_splits[0].page_content)

969

In [13]:
all_splits[10].metadata

{'source': 'https://lilianweng.github.io/posts/2023-06-23-agent/',
 'start_index': 7056}

### Go deeper

- `DocumentSplitters` are just one type of the more generic `DocumentTransformers`.
- See further documentation on transformers [here](/docs/modules/data_connection/document_transformers/) and integrations [here](/docs/integrations/document_transformers/).
- `Context-aware splitters` keep the location ("context") of each split in the original `Document`:
    - [Markdown files](/docs/use_cases/question_answering/document-context-aware-QA)
    - [Code (py or js)](/docs/integrations/document_loaders/source_code)
    - [Scientific papers](/docs/integrations/document_loaders/grobid)

## Step 3. Store

Now that we've got 66 text chunks in memory, we need to store and index them so that we can search them later in our RAG app. The most common way to do this is to embed the contents of each document split and upload those embeddings to a vector store. 

Then, when we want to search over our splits, we take the search query, embed it as well, and perform some sort of "similarity" search to identify the stored splits with the most similar embeddings to our query embedding. The simplest similarity measure is cosine similarity: we measure the cosine of the angle between the query embedding and each stored embedding (which are just very high dimensional vectors).
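
As a concrete illustration of cosine similarity, the sketch below scores a query vector against three stored vectors. The three-dimensional vectors and the `split_a`/`split_b`/`split_c` names are made up for the example; real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query_vec = [1.0, 0.0, 1.0]
stored = {
    "split_a": [1.0, 0.0, 1.0],  # same direction as the query
    "split_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "split_c": [1.0, 1.0, 0.0],  # partially aligned
}
scores = {name: cosine_similarity(query_vec, vec) for name, vec in stored.items()}
best_match = max(scores, key=scores.get)
print(scores, best_match)
```

Because the score depends only on the angle between vectors, scaling a vector by a positive constant does not change its ranking.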

We can embed and store all of our document splits in a single command using the `Chroma` vector store and `OpenAIEmbeddings` model.

In [23]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings())

### Go deeper
- Browse the more than 40 vector store integrations [here](/docs/integrations/vectorstores/) and see further documentation on the interface [here](/docs/modules/data_connection/vectorstores/).
- Browse the more than 30 text embedding integrations [here](/docs/integrations/text_embedding/) and see further documentation on the interface [here](/docs/modules/data_connection/text_embedding).

This completes the **Indexing** portion of the pipeline. At this point we have a queryable vector store containing the chunked contents of our blog post. Given a user question, we should ideally be able to return the snippets of the blog post that answer the question:

![lc.png](/img/qa_data_load.png)

## Step 4. Retrieve

Now let's write the actual application logic. We want to create a simple application that lets the user ask a question, searches for documents relevant to that question, passes the retrieved documents and initial question to a model, and finally returns an answer.

LangChain defines a `Retriever` interface which wraps an index that can return relevant documents given a string query. All retrievers implement a common method `get_relevant_documents()` (and its asynchronous variant `aget_relevant_documents()`).
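
The idea of "anything that maps a string query to relevant documents" can be shown with a toy class. The sketch below reuses the method name `get_relevant_documents()` from the text, but it does not subclass any LangChain class; `ToyDocument` and `KeywordOverlapRetriever` are hypothetical names, and ranking by shared words is an assumption chosen to keep the example dependency-free.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    # Same two attributes the guide describes for a Document.
    page_content: str
    metadata: dict = field(default_factory=dict)


class KeywordOverlapRetriever:
    """Rank documents by how many distinct words they share with the query."""

    def __init__(self, documents, k: int = 2):
        self.documents = list(documents)
        self.k = k

    @staticmethod
    def _tokens(text: str) -> set:
        return set(re.findall(r"[a-z]+", text.lower()))

    def get_relevant_documents(self, query: str) -> list:
        query_tokens = self._tokens(query)
        scored = [
            (len(query_tokens & self._tokens(doc.page_content)), doc)
            for doc in self.documents
        ]
        scored = [item for item in scored if item[0] > 0]  # drop non-matches
        scored.sort(key=lambda item: item[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]


toy_docs = [
    ToyDocument("Task decomposition splits a task into steps.", {"id": 0}),
    ToyDocument("Agents can use external tools.", {"id": 1}),
    ToyDocument("Chain of thought helps with task planning.", {"id": 2}),
]
toy_retriever = KeywordOverlapRetriever(toy_docs, k=2)
hits = toy_retriever.get_relevant_documents("How does task decomposition work?")
print([doc.metadata["id"] for doc in hits])
```

Callers only depend on the query-in, documents-out contract, so the ranking logic behind it can be swapped for a vector search without changing the calling code.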

The most common type of `Retriever` is the `VectorStoreRetriever`, which uses the similarity search capabilities of a vector store to facilitate retrieval. Any `VectorStore` can easily be turned into a `Retriever`:

In [24]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

In [25]:
retrieved_docs = retriever.get_relevant_documents("What are the approaches to Task Decomposition?")

In [26]:
len(retrieved_docs)

6

In [29]:
print(retrieved_docs[0].page_content)

Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and generates multiple thoughts per step, creating a tree structure. The search process can be BFS (breadth-first search) or DFS (depth-first search) with each state evaluated by a classifier (via a prompt) or majority vote.
Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs.


### Go deeper
Vector stores are commonly used for retrieval, but there are plenty of other ways to do retrieval. 
- LangChain has many [built-in retrieval techniques](/docs/modules/data_connection/retrievers/) and [Retriever integrations](/docs/integrations/retrievers/).

These include:
- `MultiQueryRetriever` [generates variants of the input question](/docs/modules/data_connection/retrievers/MultiQueryRetriever) to improve retrieval hit rate.
- `MultiVectorRetriever` (diagram below) instead generates [variants of the embeddings](/docs/modules/data_connection/retrievers/multi_vector), also in order to improve retrieval hit rate.
- `Max marginal relevance` selects for [relevance and diversity](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf) among the retrieved documents to avoid passing in duplicate context.
- Documents can be filtered during vector store retrieval using [`metadata` filters](/docs/use_cases/question_answering/document-context-aware-QA).

![mv.png](/img/multi_vector.png)
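
To make the max marginal relevance idea from the list above concrete, here is a minimal greedy sketch. It assumes cosine similarity as the similarity measure and a weight `lambda_mult` that trades relevance against redundancy; the function name, the default weight and the two-dimensional vectors are illustrative and are not taken from LangChain's implementation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def max_marginal_relevance(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k document indices, balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


query_vec = [1.0, 1.0]
doc_vecs = [
    [1.0, 0.22],  # 0: relevant, but nearly identical to document 1
    [1.0, 0.25],  # 1: most relevant
    [0.2, 1.0],   # 2: slightly less relevant, but covers a different direction
]
mmr_order = max_marginal_relevance(query_vec, doc_vecs, k=2)
plain_top2 = sorted(
    range(len(doc_vecs)), key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True
)[:2]
print("MMR:", mmr_order, "plain similarity:", plain_top2)
```

Plain similarity ranking returns the two near-duplicates (documents 1 and 0), while the MMR selection keeps document 1 and replaces its near-duplicate with document 2.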

## Step 5. Generate

Let's put it all together into a chain that takes a question, retrieves relevant documents, constructs a prompt, passes that to a model, and parses the output.

We'll use the gpt-3.5-turbo OpenAI chat model, but any LangChain `LLM` or `ChatModel` could be substituted in.

In [30]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

We'll use a prompt for RAG that is checked into the LangChain prompt hub ([here](https://smith.langchain.com/hub/rlm/rag-prompt)).

In [31]:
from langchain import hub

prompt = hub.pull("rlm/rag-prompt")

In [36]:
print(prompt.invoke({"context": "filler context", "question": "filler question"}).to_string())

Human: You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:


We'll use the [LCEL Runnable](https://python.langchain.com/docs/expression_language/) protocol to define the chain, allowing us to 
- pipe together components in a transparent way
- automatically trace our chain in LangSmith
- get streaming, async, and batching out of the box

In [39]:
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt 
    | llm
    | StrOutputParser()
)

In [40]:
for chunk in rag_chain.stream("What is Task Decomposition?"):
    print(chunk, end="", flush=True)

Task decomposition is the process of breaking down a complex task into smaller and simpler steps. It can be done using techniques like Chain of Thought (CoT) or Tree of Thoughts, which involve transforming big tasks into multiple manageable tasks. Task decomposition can also be achieved through simple prompting, task-specific instructions, or human inputs.
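
To build intuition for how data flows through the piped chain above, the toy below imitates the `|` composition with a tiny class. It is not LangChain's implementation: `Step`, `parallel` and the lambda stand-ins for the retriever, prompt, model and parser are all hypothetical, and the "model" just returns a canned string.

```python
class Step:
    """Hypothetical stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # a | b produces a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


def parallel(**steps):
    """Run every step on the same input and collect the results in a dict."""
    return Step(lambda value: {name: step.invoke(value) for name, step in steps.items()})


passthrough = Step(lambda value: value)
toy_retriever = Step(lambda question: "Task decomposition breaks a task into steps.")
toy_prompt = Step(
    lambda d: f"Context: {d['context']}\nQuestion: {d['question']}\nAnswer:"
)
toy_llm = Step(lambda text: f"  [model reply to a {len(text)}-character prompt]  ")
toy_parser = Step(lambda message: message.strip())

prompt_stage = parallel(context=toy_retriever, question=passthrough) | toy_prompt
toy_chain = prompt_stage | toy_llm | toy_parser

print(prompt_stage.invoke("What is Task Decomposition?"))
print(toy_chain.invoke("What is Task Decomposition?"))
```

The first stage plays the role of the dict at the head of the real chain: the same question goes to the retriever (filling `context`) and passes through unchanged (filling `question`), and each later stage consumes the previous stage's output.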

### Go deeper

#### Choosing LLMs
- Browse the > 90 LLM and chat model integrations [here](/docs/modules/integrations/chat).
- See further documentation on LLMs and chat models [here](/docs/modules/model_io).
- See a guide on RAG with local LLMs [here](/docs/modules/use_cases/question_answering/local_retrieval_qa).

#### Customizing the prompt

As shown above, we can load prompts (e.g., [this RAG prompt](https://smith.langchain.com/hub/rlm/rag-prompt)) from the prompt hub. The prompt can also be easily customized:

In [41]:
from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum and keep the answer as concise as possible. 
Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
rag_prompt_custom = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt_custom 
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is Task Decomposition?")

'Task decomposition is the process of breaking down a complex task into smaller and simpler steps. It can be done using techniques like Chain of Thought (CoT) or Tree of Thoughts, which prompt the model to think step by step and explore multiple reasoning possibilities at each step. Thanks for asking!'

We can use [LangSmith](https://smith.langchain.com/public/129cac54-44d5-453a-9807-3bd4835e5f96/r) to see the trace.
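
One detail a template like this leaves open is how several retrieved splits become the single `{context}` string. The sketch below assumes we join each split's `page_content` with blank lines and fill the template with plain `str.format`; `format_docs`, the shortened `toy_template` and the `SimpleNamespace` stand-ins for documents are illustrative only.

```python
from types import SimpleNamespace

# Shortened version of the custom template, with the same two placeholders.
toy_template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""


def format_docs(docs) -> str:
    """Join the text of each retrieved split, separated by a blank line."""
    return "\n\n".join(doc.page_content for doc in docs)


# Stand-ins for retrieved documents: only the page_content attribute is needed here.
toy_retrieved = [
    SimpleNamespace(page_content="Task decomposition can be done by simple prompting."),
    SimpleNamespace(page_content="Tree of Thoughts explores multiple reasoning paths."),
]

filled = toy_template.format(
    context=format_docs(toy_retrieved),
    question="What is Task Decomposition?",
)
print(filled)
```

Printing the filled prompt like this is a quick way to check what text would reach the model before spending a model call.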

### Adding history


### Adding sources


## Evaluation

## Relevant LangChain Templates

There are many different techniques and integrations that can be used to build a RAG application.

### Serving with LangServe

## Next steps