# Build a RAG application that converts natural-language questions into structured queries | 🦜️🔗 LangChain

[https://python.langchain.com/v0.2/docs/tutorials/query_analysis/](https://python.langchain.com/v0.2/docs/tutorials/query_analysis/)


## Setup

### Install dependencies

In [None]:
%%bash
pip install -qU langchain langchain-community langchain-openai youtube-transcript-api pytube langchain-chroma langchainhub


### Set environment variables

In [None]:
import os
from google.colab import userdata
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_BASE"] = userdata.get('OPENAI_API_BASE')
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = userdata.get('LANGCHAIN_API_KEY')

### Load documents

We can use `YoutubeLoader` to load the transcripts of a few LangChain videos on YouTube:

In [None]:
from langchain_community.document_loaders import YoutubeLoader
import json

urls = [
    "https://www.youtube.com/watch?v=HAn9vnJy6S4",
    "https://www.youtube.com/watch?v=dA1cHGACXCo",
    "https://www.youtube.com/watch?v=ZcEMLz27sL4",
    "https://www.youtube.com/watch?v=hvAPnpSfSGo",
    "https://www.youtube.com/watch?v=EhlPDL4QrWY",
    "https://www.youtube.com/watch?v=mmBo8nlu2j0",
    "https://www.youtube.com/watch?v=rQdibOsL1ps",
    "https://www.youtube.com/watch?v=28lC4fqukoc",
    "https://www.youtube.com/watch?v=es-9MgxB-uc",
    "https://www.youtube.com/watch?v=wLRHwKuKvOE",
    "https://www.youtube.com/watch?v=ObIltMaRJvY",
    "https://www.youtube.com/watch?v=DjuXACWYkkU",
    "https://www.youtube.com/watch?v=o7C9ld6Ln-M",
]

add_docs = []
for url in urls:
    add_docs.extend(YoutubeLoader.from_youtube_url(url, add_video_info=True).load())

# Convert documents to a serializable format (e.g., list of dictionaries)
serializable_docs = [doc.dict() for doc in add_docs]

# Save to a JSON file
with open("langchain_transcripts.json", "w") as f:
    json.dump(serializable_docs, f)


Load the JSON file:

In [None]:
import json
from langchain_core.documents import Document

# Load from JSON file
with open("langchain_transcripts.json", "r") as f:
    serializable_docs = json.load(f)

# Convert back to Document objects
docs = [Document(**doc) for doc in serializable_docs]



**API Reference:** [YoutubeLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.youtube.YoutubeLoader.html)
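
Saving the transcripts to JSON and reloading them avoids downloading them again on every run. The same round trip can be sketched with only the standard library. This is an illustrative sketch: `SimpleDoc` is a hypothetical stand-in for `Document`, and the temporary file path is chosen only for this example.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Dict


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


original = [
    SimpleDoc("hello today I want to talk about open gpts", {"title": "OpenGPTs"})
]

# Serialize each document to a plain dict and write the list as JSON.
path = os.path.join(tempfile.mkdtemp(), "transcripts.json")
with open(path, "w") as f:
    json.dump([asdict(d) for d in original], f)

# Read the dicts back and rebuild the document objects.
with open(path, "r") as f:
    restored = [SimpleDoc(**d) for d in json.load(f)]

print(restored == original)
```

The round trip is lossless here because both fields hold JSON-compatible values (strings, numbers, dicts).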

Here is the beginning of the first video's transcript:

In [None]:
# Now you can use the `docs` list as needed
print(docs[0].page_content[:500])

hello today I want to talk about open gpts open gpts is a project that we built here at linkchain uh that replicates the GPT store in a few ways so it creates uh end user-facing friendly interface to create different Bots and these Bots can have access to different tools and they can uh be given files to retrieve things over and basically it's a way to create a variety of bots and expose the configuration of these Bots to end users it's all open source um it can be used with open AI it can be us


In [None]:
docs[0].metadata

{'source': 'HAn9vnJy6S4',
 'title': 'OpenGPTs',
 'description': 'Unknown',
 'view_count': 8932,
 'thumbnail_url': 'https://i.ytimg.com/vi/HAn9vnJy6S4/hq720.jpg',
 'publish_date': '2024-01-31 00:00:00',
 'length': 1530,
 'author': 'LangChain'}

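The output above is the metadata associated with each video. Each document also has a title, view count, publish date, and length.

Next, add a metadata field for the year each video was published: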

In [None]:
import datetime

# Add metadata: the year the video was published (stored as an int)
for doc in docs:
    doc.metadata["publish_year"] = datetime.datetime.strptime(
        doc.metadata["publish_date"], "%Y-%m-%d %H:%M:%S"
    ).year

In [None]:
docs[0].metadata

{'source': 'HAn9vnJy6S4',
 'title': 'OpenGPTs',
 'description': 'Unknown',
 'view_count': 8932,
 'thumbnail_url': 'https://i.ytimg.com/vi/HAn9vnJy6S4/hq720.jpg',
 'publish_date': '2024-01-31 00:00:00',
 'length': 1530,
 'author': 'LangChain',
 'publish_year': 2024}
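
As a quick sanity check, the year extraction can be run on the `publish_date` string shown in the metadata above. This is a minimal standalone sketch; `extract_publish_year` is a hypothetical helper name, not part of LangChain.

```python
import datetime


def extract_publish_year(publish_date: str) -> int:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string and return the year as an int."""
    parsed = datetime.datetime.strptime(publish_date, "%Y-%m-%d %H:%M:%S")
    return parsed.year


# Value copied from the metadata output above.
sample_metadata = {"publish_date": "2024-01-31 00:00:00"}
sample_metadata["publish_year"] = extract_publish_year(sample_metadata["publish_date"])
print(sample_metadata["publish_year"])  # 2024
```

Keeping the year as an `int` (rather than a string) matches the `int` type of the `publish_year` field used for filtering later in this tutorial.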


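### Indexing documents

Whenever we perform retrieval, we need an index of documents to query. We'll use a vector store to index our documents, chunking them first so that the retrieved results are more concise and precise: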

In [None]:
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
chunked_docs = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    chunked_docs,
    embeddings,
)

**API Reference:** [OpenAIEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain_openai.embeddings.base.OpenAIEmbeddings.html) | [RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html)
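
To see why chunking does not lose the `publish_year` field, here is a simplified, standard-library sketch of fixed-size chunking. It is not the algorithm `RecursiveCharacterTextSplitter` uses (which prefers to split on separators such as paragraph and line breaks); `SimpleDoc` and `split_fixed` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def split_fixed(doc: SimpleDoc, chunk_size: int) -> List[SimpleDoc]:
    """Cut the text into consecutive pieces of at most chunk_size characters.

    Each chunk gets its own copy of the parent's metadata, so attributes
    such as publish_year stay available for filtering at the chunk level.
    """
    return [
        SimpleDoc(doc.page_content[i : i + chunk_size], dict(doc.metadata))
        for i in range(0, len(doc.page_content), chunk_size)
    ]


doc = SimpleDoc("a" * 4500, {"title": "OpenGPTs", "publish_year": 2024})
chunks = split_fixed(doc, chunk_size=2000)
print([len(c.page_content) for c in chunks])  # [2000, 2000, 500]
print(chunks[-1].metadata["publish_year"])    # 2024
```

Because every chunk carries the video's metadata, a filter on `publish_year` can be applied to individual chunks in the vector store.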

## Retrieval

We can run a similarity search directly on the user's question to find the chunks most relevant to it:

In [None]:
search_results = vectorstore.similarity_search("how do I build a RAG agent")
print(search_results[0].metadata["title"])
print(search_results[0].page_content[:500])

OpenGPTs
hardcoded that it will always do a retrieval step here the assistant decides whether to do a retrieval step or not sometimes this is good sometimes this is bad sometimes it you don't need to do a retrieval step when I said hi it didn't need to call it tool um but other times you know the the llm might mess up and not realize that it needs to do a retrieval step and so the rag bot will always do a retrieval step so it's more focused there because this is also a simpler architecture so it's always


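This works pretty well! Our first result is quite relevant to the question.

Now let's try searching for results from a specific time period by asking directly for videos on RAG published in 2023: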

In [None]:
search_results = vectorstore.similarity_search("videos on RAG published in 2023")
print(search_results[0].metadata["title"])
print(search_results[0].metadata["publish_date"])
print(search_results[0].page_content[:500])

OpenGPTs
2024-01-31 00:00:00
hardcoded that it will always do a retrieval step here the assistant decides whether to do a retrieval step or not sometimes this is good sometimes this is bad sometimes it you don't need to do a retrieval step when I said hi it didn't need to call it tool um but other times you know the the llm might mess up and not realize that it needs to do a retrieval step and so the rag bot will always do a retrieval step so it's more focused there because this is also a simpler architecture so it's always




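The first result is from 2024, even though we asked for videos from 2023. Because we are searching only against document contents, the results cannot be filtered on any document attribute. Let's see how to fix that.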
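
The limitation can be reproduced with a toy, standard-library ranking function. This sketch scores chunks by word overlap with the query instead of real embeddings, and the two chunks are made-up examples; it only illustrates that a content-only score never looks at `metadata`.

```python
from typing import Dict, List, Tuple

# Made-up chunks for illustration: (text, metadata).
CHUNKS: List[Tuple[str, Dict[str, int]]] = [
    ("the rag bot will always do a retrieval step", {"publish_year": 2024}),
    ("multi modal llms can read images and slides", {"publish_year": 2023}),
]


def overlap_score(query: str, text: str) -> int:
    """Count distinct query words that also appear in the text (toy similarity)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def content_only_search(query: str) -> Tuple[str, Dict[str, int]]:
    """Return the best-scoring chunk; metadata is never consulted."""
    return max(CHUNKS, key=lambda chunk: overlap_score(query, chunk[0]))


text, metadata = content_only_search("videos on RAG published in 2023")
print(metadata["publish_year"])  # 2024: "2023" in the query did not act as a filter
```

The word "2023" in the query is treated as just another piece of text to match, so the 2024 chunk that mentions RAG still wins.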


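## Query analysis

We can use query analysis to improve retrieval results. This involves defining a query schema that contains a date filter and using a function-calling model to convert the user's question into a structured query.

### Query schema

In this case, we'll have an explicit attribute for filtering on the publication date.

We can define a `Search` class that holds the parameters for querying the vector database: a `query` field of type `str` containing the search text, and an optional `publish_year` field of type `int`.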

In [None]:
from typing import Optional
from langchain_core.pydantic_v1 import BaseModel, Field

class Search(BaseModel):
    """Search over a database of tutorial videos about a software library."""

    query: str = Field(
        ...,
        description="Similarity search query applied to video transcripts.",
    )
    publish_year: Optional[int] = Field(None, description="Year video was published")

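### Query generation

To convert user questions into structured queries, we can use OpenAI's tool-calling API. Specifically, we use the [ChatModel.with_structured_output()](https://python.langchain.com/v0.2/docs/how_to/structured_output/) method, which handles passing the schema to the model and parsing the output.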

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

system = """You are an expert at converting user questions into database queries. \
You have access to a database of tutorial videos about a software library for building LLM-powered applications. \
Given a question, return a list of database queries optimized to retrieve the most relevant results.

If there are acronyms or words you are not familiar with, do not try to rephrase them."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm = llm.with_structured_output(Search)
query_analyzer = {"question": RunnablePassthrough()} | prompt | structured_llm

**API Reference:** [ChatPromptTemplate](https://api.python.langchain.com/en/latest/prompts/langchain_core.prompts.chat.ChatPromptTemplate.html) | [RunnablePassthrough](https://api.python.langchain.com/en/latest/runnables/langchain_core.runnables.passthrough.RunnablePassthrough.html) | [ChatOpenAI](https://api.python.langchain.com/en/latest/chat_models/langchain_openai.chat_models.base.ChatOpenAI.html)

Let's see which queries the analyzer generates for the questions we searched earlier:

In [None]:
query_analyzer.invoke("how do I build a RAG agent")

Search(query='build RAG agent', publish_year=None)

In [None]:
query_analyzer.invoke("videos on RAG published in 2023")

Search(query='RAG', publish_year=2023)
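
Conceptually, structured output means the model emits JSON arguments that match the schema, and those arguments are then validated into a `Search` object. The sketch below mimics only that last step with the standard library. The JSON strings are hand-written examples rather than real model output, and `SearchSketch` and `parse_search` are hypothetical names, not LangChain APIs.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SearchSketch:
    """Standard-library stand-in for the pydantic Search model."""
    query: str
    publish_year: Optional[int] = None


def parse_search(arguments_json: str) -> SearchSketch:
    """Turn JSON tool-call arguments into a SearchSketch, with basic checks."""
    args = json.loads(arguments_json)
    if not isinstance(args.get("query"), str):
        raise ValueError("'query' is required and must be a string")
    year = args.get("publish_year")
    if year is not None and not isinstance(year, int):
        raise ValueError("'publish_year' must be an integer when present")
    return SearchSketch(query=args["query"], publish_year=year)


print(parse_search('{"query": "RAG", "publish_year": 2023}'))
print(parse_search('{"query": "build RAG agent"}'))
```

When the question mentions no year, `publish_year` stays `None`, which is the signal the retrieval step can use to skip filtering.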


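## Retrieval with query analysis

Now let's try using the generated queries to actually perform retrieval.

Note: in this example, `with_structured_output(Search)` in effect specifies `tool_choice="Search"`. This forces the LLM to call one, and only one, tool, so we always get exactly one optimized query to look up. That is not always the case; see the other guides for how to handle situations where no optimized query, or several, are returned.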

In [None]:
from typing import List

from langchain_core.documents import Document

**API Reference:** [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html)

In [None]:
def retrieval(search: Search) -> List[Document]:
    if search.publish_year is not None:
        # This is syntax specific to Chroma,
        # the vector database we are using.
        _filter = {"publish_year": {"$eq": search.publish_year}}
    else:
        _filter = None
    return vectorstore.similarity_search(search.query, filter=_filter)

In [None]:
retrieval_chain = query_analyzer | retrieval

Now we can ask a question like the one that failed earlier and see that it returns only results from that year!

In [None]:
results = retrieval_chain.invoke("RAG tutorial published in 2023")

In [None]:
[(doc.metadata["title"], doc.metadata["publish_date"]) for doc in results]

[('Getting Started with Multi-Modal LLMs', '2023-12-20 00:00:00'),
 ('LangServe and LangChain Templates Webinar', '2023-11-02 00:00:00'),
 ('Getting Started with Multi-Modal LLMs', '2023-12-20 00:00:00'),
 ('Building a Research Assistant from Scratch', '2023-11-16 00:00:00')]
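
To make the effect of the metadata filter concrete, here is a toy matcher for the `{"publish_year": {"$eq": ...}}` shape used in `retrieval`. It is a standard-library sketch that supports only `$eq`, not Chroma's actual implementation; `build_filter` and `matches` are hypothetical helpers, and the sample metadata reuses titles and years from the outputs above.

```python
from typing import Any, Dict, List, Optional

Filter = Optional[Dict[str, Dict[str, Any]]]


def build_filter(publish_year: Optional[int]) -> Filter:
    """Mirror the branch in retrieval(): no year means no filter."""
    if publish_year is None:
        return None
    return {"publish_year": {"$eq": publish_year}}


def matches(metadata: Dict[str, Any], flt: Filter) -> bool:
    """Check metadata against a filter made only of {"field": {"$eq": value}}."""
    if flt is None:
        return True
    return all(metadata.get(name) == cond["$eq"] for name, cond in flt.items())


metadatas: List[Dict[str, Any]] = [
    {"title": "OpenGPTs", "publish_year": 2024},
    {"title": "Getting Started with Multi-Modal LLMs", "publish_year": 2023},
    {"title": "Building a Research Assistant from Scratch", "publish_year": 2023},
]

kept = [m["title"] for m in metadatas if matches(m, build_filter(2023))]
print(kept)
```

Only chunks that pass the metadata check are candidates for the similarity ranking, which is why the 2024 video no longer appears in the results.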

In [None]:
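from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Pull a ready-made RAG prompt from the LangChain Hub.
# Note: this rebinds `prompt`; `query_analyzer` above keeps its own prompt.
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    # Join the retrieved chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)


# Retrieve filtered context for the question, then let the LLM answer from it.
rag_chain = (
    {"context": retrieval_chain | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
rag_chain.invoke("How to implement Retrieval-Augmented Generation (RAG) in 2023")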

'To implement Retrieval-Augmented Generation (RAG) in 2023, you can start by creating a set of question-answer pairs from a slide deck using CSV format and loading them into a platform like Lang Smith for evaluation. Utilize multimodal embeddings, such as Open CLIP, to map images from the slides into a common embedding space for text and images. Set up a retriever using the Lang chain benchmarks code with Open AI embeddings for retrieval and generation of answers to the questions from the slide deck data set.'