# RAG with LCEL
* How to build RAG quickly and concisely.

## Setup

## Create your .env file
* The GitHub repository includes a file named .env.example.
* Rename that file to .env and add your secret API keys to it. Remember to include:
* OPENAI_API_KEY=your_openai_api_key
* LANGCHAIN_TRACING_V2=true
* LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
* LANGCHAIN_API_KEY=your_langchain_api_key
* LANGCHAIN_PROJECT=your_project_name

We will call our LangSmith project **007-rag-with-lcel**.
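Before running the notebook, it can help to confirm that every setting listed above is present. The following is a minimal sketch, not part of the original notebook: `missing_keys` is a hypothetical helper, and the key names are copied from the list above.

```python
import os

# Variable names taken from the .env list above.
REQUIRED_KEYS = [
    "OPENAI_API_KEY",
    "LANGCHAIN_TRACING_V2",
    "LANGCHAIN_ENDPOINT",
    "LANGCHAIN_API_KEY",
    "LANGCHAIN_PROJECT",
]

def missing_keys(env, required=REQUIRED_KEYS):
    """Return the required names that are absent or empty in env (hypothetical helper)."""
    return [name for name in required if not env.get(name)]

# Check the current process environment (run this after load_dotenv).
absent = missing_keys(os.environ)
print("Missing settings:", absent or "none")
```

An empty result means all five settings were found. Any name that is printed still needs to be added to the .env file.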

## Connect to the .env file located in the same directory as this notebook

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())
openai_api_key = os.environ["OPENAI_API_KEY"]

## Vector databases (a.k.a. vector stores): storing and searching embeddings
* See the documentation page [here](https://python.langchain.com/v0.1/docs/modules/data_connection/vectorstores/).
* See the list of vector stores [here](https://python.langchain.com/v0.1/docs/integrations/vectorstores/).
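To make "storing and searching embeddings" concrete, here is a toy sketch of a similarity search in plain Python. It is an illustration only: the three-dimensional vectors are made up, `similarity_search` here is a hypothetical stand-in rather than the library method, and real vector stores use optimized indexes and may rank by a different distance metric. The default `k=4` is chosen only to mirror the four results returned later in this notebook.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_search(query_vector, store, k=4):
    """store is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Made-up 3-dimensional "embeddings"; real ones have far more dimensions.
toy_store = [
    ("voting rights", [0.9, 0.1, 0.0]),
    ("supreme court", [0.1, 0.9, 0.0]),
    ("economy", [0.0, 0.1, 0.9]),
]

print(similarity_search([0.8, 0.2, 0.0], toy_store, k=2))
# → ['voting rights', 'supreme court']
```

In the notebook, the embedding model produces the vectors and the vector store handles the ranking, so none of this has to be written by hand.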

In [77]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma

# Load the document, split it into chunks, embed each chunk and load it into the vector store.
loaded_document = TextLoader('./data/state_of_the_union.txt').load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

chunks_of_text = text_splitter.split_documents(loaded_document)

vector_db = Chroma.from_documents(chunks_of_text, OpenAIEmbeddings())

In [79]:
question = "What did the president say about the John Lewis Voting Rights Act?"

response = vector_db.similarity_search(question)

print(response[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.


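## Retrievers: returning documents for a question
* A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store.
* A retriever does not need to be able to store documents, only to return (or retrieve) them.
* A vector store can serve as the backbone of a retriever, but there are other types of retrievers as well.
* See the documentation page [here](https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/).
* See the list of third-party retrievers [here](https://python.langchain.com/v0.1/docs/integrations/retrievers/).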
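The point that a retriever only has to return documents, not store them, can be illustrated with a small sketch. `KeywordRetriever` and `fetch_from_elsewhere` are hypothetical names invented for this example and are not LangChain classes. The sketch returns plain strings, whereas LangChain retrievers return `Document` objects.

```python
class KeywordRetriever:
    """Hypothetical retriever that ranks documents by word overlap with the query.

    It holds no documents of its own; it only knows how to fetch them
    through the callable it is given.
    """

    def __init__(self, fetch_documents, k=2):
        self.fetch_documents = fetch_documents  # any callable returning a list of strings
        self.k = k

    def invoke(self, query):
        query_words = set(query.lower().split())
        docs = self.fetch_documents()
        scored = [(len(query_words & set(doc.lower().split())), doc) for doc in docs]
        # Keep only documents that share at least one word, best match first.
        matches = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
        return [doc for _, doc in matches[: self.k]]

def fetch_from_elsewhere():
    # Stand-in for a search API, a database, or a file system.
    return [
        "The Senate should pass the voting rights act",
        "Justice Breyer is retiring from the Supreme Court",
        "Inflation is a top economic priority",
    ]

retriever_sketch = KeywordRetriever(fetch_from_elsewhere, k=1)
print(retriever_sketch.invoke("who is retiring from the court"))
# → ['Justice Breyer is retiring from the Supreme Court']
```

Nothing here involves embeddings or a vector store, which is why the retriever interface is described as the more general of the two.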

#### Vector store as a retriever

In [81]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("./data/state_of_the_union.txt")

In [83]:
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

loaded_document = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

chunks_of_text = text_splitter.split_documents(loaded_document)

embeddings = OpenAIEmbeddings()

vector_db = FAISS.from_documents(chunks_of_text, embeddings)

In [84]:
retriever = vector_db.as_retriever()

#### Simple use without LCEL

In [93]:
response = retriever.invoke("what did he say about ketanji brown jackson?")

In [94]:
len(response)

4

In [95]:
response[0]

Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': './data/state_of_the_union.txt'})

#### Specifying the top k

In [96]:
retriever = vector_db.as_retriever(search_kwargs={"k": 1})

In [97]:
response = retriever.invoke("what did he say about ketanji brown jackson?")

In [98]:
len(response)

1

In [99]:
response

[Document(page_content='Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. \n\nTonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. \n\nOne of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. \n\nAnd I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.', metadata={'source': './data/state_of_the_union.txt'})]

#### Simple use with LCEL, plus input and output formatting

In [89]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """Answer the question based only on the following context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

model = ChatOpenAI()

def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

response = chain.invoke("what did he say about ketanji brown jackson?")

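Here is a simple explanation of each part of the chain:

1. **Retrieve relevant documents**: First, the retriever fetches the documents most relevant to the question you asked. The same question string is also passed through unchanged by `RunnablePassthrough()` so that the prompt can use it.

2. **Format the documents**: Once the relevant documents are retrieved, the `format_docs` function extracts the content of each document and joins the pieces with two newlines. The result is a single block of text that serves as the context for answering the question.

3. **Prepare the prompt**: The template builds the model input by placing the formatted context first and the question after it. This way the model knows exactly what the context is and what it needs to answer.

4. **Generate the answer**: The structured prompt is then sent to the AI model, in this case ChatOpenAI. The model reads the combined context and question and generates an answer based on the text provided.

5. **Extract the final answer**: Finally, `StrOutputParser` extracts the text content from the model's response, so the output is a plain string containing just the answer rather than a message object with extra data.

In short, this code automates fetching the relevant information, preparing it, and asking the AI for a clear answer based on it. It is like building a mini question-answering system that gives the AI the context it needs to respond accurately.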
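The five steps above can be sketched as ordinary function calls to show how data flows from one stage to the next. This is an illustration under stated assumptions, not how LCEL is implemented: `fake_retriever`, `format_texts`, `build_prompt`, `fake_model`, `parse_output`, and `run_chain` are hypothetical stand-ins, and no API call is made.

```python
def fake_retriever(question):
    # Stand-in for the retriever step; returns raw document texts.
    return ["Doc A text.", "Doc B text."]

def format_texts(texts):
    # Same joining rule as format_docs, but applied to plain strings.
    return "\n\n".join(texts)

def build_prompt(inputs):
    # Fills the same template the notebook uses.
    return (
        "Answer the question based only on the following context:\n\n"
        f"{inputs['context']}\n\n"
        f"Question: {inputs['question']}\n"
    )

def fake_model(prompt_text):
    # Stand-in for the chat model; returns a message-like dict.
    return {"content": f"(answer based on {len(prompt_text)} characters of prompt)"}

def parse_output(message):
    # Keeps only the text of the message.
    return message["content"]

def run_chain(question):
    # Steps 1 and 2: build the context and forward the question unchanged.
    inputs = {"context": format_texts(fake_retriever(question)), "question": question}
    # Steps 3 to 5: prompt, model, parser.
    return parse_output(fake_model(build_prompt(inputs)))

print(build_prompt({"context": format_texts(fake_retriever("q")), "question": "q"}))
print(run_chain("what did he say about ketanji brown jackson?"))
```

In the LCEL version, the `|` operator performs this chaining, and the dictionary at the head of the chain plays the role of `inputs`.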

In [90]:
response

"He said that Ketanji Brown Jackson is one of the nation's top legal minds who will continue Justice Breyer's legacy of excellence."