### Building a Query Analysis System

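This page walks through using query analysis in a basic end-to-end example.

We will create a simple search engine, show a failure mode that occurs when the raw user question is passed straight to that search, and then show how query analysis can help address it.

There are many different query analysis techniques, and this end-to-end example does not cover all of them.

In this example, we will do retrieval over LangChain YouTube videos. (I originally wanted to use Bilibili, but the loader's support for it turned out to be problematic.)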

In [None]:
# Install dependencies
# %pip install -qU langchain langchain-community langchain-openai youtube-transcript-api pytube langchain-chroma

In [1]:
# Set environment variables
# We will use OpenAI in this example

import getpass
import os

os.environ["OPENAI_API_KEY"] = getpass.getpass()

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", base_url="https://api.gpts.vin/v1")


#### Load documents

We can use `YoutubeLoader` to load transcripts of a few LangChain videos:

In [2]:
from langchain_community.document_loaders import YoutubeLoader

urls = [
    "https://www.youtube.com/watch?v=HAn9vnJy6S4",
    "https://www.youtube.com/watch?v=dA1cHGACXCo",
    "https://www.youtube.com/watch?v=ZcEMLz27sL4",
    "https://www.youtube.com/watch?v=hvAPnpSfSGo",
    "https://www.youtube.com/watch?v=EhlPDL4QrWY",
    "https://www.youtube.com/watch?v=mmBo8nlu2j0",
    "https://www.youtube.com/watch?v=rQdibOsL1ps",
    "https://www.youtube.com/watch?v=28lC4fqukoc",
    "https://www.youtube.com/watch?v=es-9MgxB-uc",
    "https://www.youtube.com/watch?v=wLRHwKuKvOE",
    "https://www.youtube.com/watch?v=ObIltMaRJvY",
    "https://www.youtube.com/watch?v=DjuXACWYkkU",
    "https://www.youtube.com/watch?v=o7C9ld6Ln-M",
]
docs = []
for url in urls:
    docs.extend(YoutubeLoader.from_youtube_url(url, add_video_info=True).load())

In [3]:
import datetime

# Add some additional metadata: what year the video was published
for doc in docs:
    doc.metadata["publish_year"] = datetime.datetime.strptime(
        doc.metadata["publish_date"], "%Y-%m-%d %H:%M:%S"
    ).year

In [4]:
# Here are the titles of the videos we loaded:
[doc.metadata["title"] for doc in docs]

['OpenGPTs',
 'Building a web RAG chatbot: using LangChain, Exa (prev. Metaphor), LangSmith, and Hosted Langserve',
 'Streaming Events: Introducing a new `stream_events` method',
 'LangGraph: Multi-Agent Workflows',
 'Build and Deploy a RAG app with Pinecone Serverless',
 'Auto-Prompt Builder (with Hosted LangServe)',
 'Build a Full Stack RAG App With TypeScript',
 'Getting Started with Multi-Modal LLMs',
 'SQL Research Assistant',
 'Skeleton-of-Thought: Building a New Template from Scratch',
 'Benchmarking RAG over LangChain Docs',
 'Building a Research Assistant from Scratch',
 'LangServe and LangChain Templates Webinar']

Here is the metadata associated with each video. We can see that each document also has a title, view count, publication date, and length:

In [5]:
docs[0].metadata

{'source': 'HAn9vnJy6S4',
 'title': 'OpenGPTs',
 'description': 'Unknown',
 'view_count': 8717,
 'thumbnail_url': 'https://i.ytimg.com/vi/HAn9vnJy6S4/hq720.jpg',
 'publish_date': '2024-01-31 00:00:00',
 'length': 1530,
 'author': 'LangChain',
 'publish_year': 2024}

Here is a sample from a document's contents:

In [6]:
docs[0].page_content[:500]

"hello today I want to talk about open gpts open gpts is a project that we built here at linkchain uh that replicates the GPT store in a few ways so it creates uh end user-facing friendly interface to create different Bots and these Bots can have access to different tools and they can uh be given files to retrieve things over and basically it's a way to create a variety of bots and expose the configuration of these Bots to end users it's all open source um it can be used with open AI it can be us"

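### Indexing documents
Whenever we perform retrieval, we need to create an index of documents that we can query.

We will use a vector store to index our documents, and we will chunk them first to make our retrievals more concise and precise: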

In [16]:
from langchain_chroma import Chroma
from langchain_community.embeddings import DashScopeEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

os.environ["DASHSCOPE_API_KEY"] = getpass.getpass()  # the embedding model is Alibaba's (DashScope)

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
chunked_docs = text_splitter.split_documents(docs)
embeddings = DashScopeEmbeddings()
vectorstore = Chroma.from_documents(
    chunked_docs,
    embeddings,
)

#### Retrieval without query analysis

We can run a similarity search on the user question directly to find chunks relevant to it:


In [8]:
search_results = vectorstore.similarity_search("我该怎么去构建一个RAG代理？")
print(search_results[0].metadata["title"])
print(search_results[0].page_content[:500])

LangServe and LangChain Templates Webinar
to have kind of like a common interface for everything and so uh what that means is that we probably won't have specialized endpoints at the moment for different things um because they'll all we'll handle them in like a generic way but I do think there's a good chance that we start having like a collection of templates that are themselves retrievers or something like that um so I think like we have some Advanced rag methods in there that I think actually some of them probably are just like retri


This works fairly well! Our first result is quite relevant to the question.

What if we want to search for results from a specific time period?

In [9]:
search_results = vectorstore.similarity_search("2023 年发布的 RAG 视频")
print(search_results[0].metadata["title"])
print(search_results[0].metadata["publish_date"])
print(search_results[0].page_content[:500])

Build and Deploy a RAG app with Pinecone Serverless
2024-01-16 00:00:00
hi this is Lance from the Lang chain team and today we're going to be building and deploying a rag app using pine con serval list from scratch so we're going to kind of walk through all the code required to do this and I'll use these slides as kind of a guide to kind of lay the the ground work um so first what is rag so under capoy has this pretty nice visualization that shows LMS as a kernel of a new kind of operating system and of course one of the core components of our operating system is th


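Our first result is from 2024 (even though we asked for videos from 2023), and it is not very relevant to the input.

Since we are only searching against document contents, there is no way to filter the results on any document attribute.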
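
To make this failure mode concrete, here is a minimal, self-contained sketch. It does not use Chroma or real embeddings: the `score` function is a hypothetical word-overlap stand-in for embedding similarity, and the two toy documents are invented for illustration. It suggests why a year mentioned only in the query text does not act as a filter, while an explicit metadata filter does.

```python
# Toy illustration only: word overlap stands in for embedding similarity.
toy_docs = [
    {"text": "building a rag app with a vector store", "publish_year": 2024},
    {"text": "intro to multi modal models", "publish_year": 2023},
]


def score(query: str, text: str) -> int:
    """Count the words shared by the query and the text (hypothetical stand-in)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def search(query, docs, publish_year=None):
    """Rank docs by content score, optionally filtering on metadata first."""
    if publish_year is not None:
        docs = [d for d in docs if d["publish_year"] == publish_year]
    return sorted(docs, key=lambda d: score(query, d["text"]), reverse=True)


# Content-only search: "2023" in the query matches nothing in the transcripts,
# so the 2024 document still ranks first.
content_only = search("rag videos published in 2023", toy_docs)

# The same kind of search with an explicit metadata filter only sees 2023 documents.
filtered = search("rag videos", toy_docs, publish_year=2023)
```

The year has to live in a structured field, not in the free-text query, before the search can act on it.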

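This is only one failure mode that can arise. Now let's look at how a basic form of query analysis can fix it!

### Query analysis
We can use query analysis to improve the retrieval results. This involves defining a query schema that contains a date filter, and using a function-calling model to convert a user question into a structured query.

#### Query schema
In this case, we add an explicit, optional `publish_year` attribute to the schema so that results can be filtered by publication year.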

In [10]:
from typing import Optional

from langchain_core.pydantic_v1 import BaseModel, Field


class Search(BaseModel):
    """搜索有关软件库的教程视频数据库."""

    query: str = Field(
        ...,
        description="应用于视频文字记录的相似性搜索查询.",
    )
    publish_year: Optional[int] = Field(None, description="视频发布年份")

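#### Query generation
To convert user questions into structured queries, we will use OpenAI's tool-calling API.

Specifically, we will use the `ChatModel.with_structured_output()` method, which handles passing the schema to the model and parsing the output.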

In [11]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = getpass.getpass()

system = """您是将用户问题转换为数据库查询的专家。
您可以访问有关用于构建 LLM 支持的应用程序的软件库的教程视频数据库。
给定一个问题，返回一个数据库查询列表，这些查询经过优化，可以检索最相关的结果。

如果有您不熟悉的首字母缩略词或单词，请不要尝试改写它们。"""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-3.5-turbo", base_url="https://api.gpts.vin/v1", temperature=0)
structured_llm = llm.with_structured_output(Search)
query_analyzer = {"question": RunnablePassthrough()} | prompt | structured_llm

Let's see what queries our analyzer generates for the questions we searched earlier:

In [17]:
query_analyzer.invoke("我该怎么去构建一个RAG代理")

Search(query='RAG代理构建指南', publish_year=None)

In [18]:
query_analyzer.invoke("videos on RAG published in 2023")

Search(query='RAG', publish_year=2023)

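### Retrieval with query analysis

Our query analysis looks pretty good; now let's try using the generated queries to actually perform retrieval.

Note: in our example, we specified `tool_choice="Search"`. This forces the LLM to call one, and only one, tool, which means we will always have exactly one optimized query to look up.

Keep in mind that this is not always the case; see other guides for how to handle situations where no optimized query, or more than one, is returned.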
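
As a rough sketch of what handling those other situations could look like, the function below accepts `None`, a single query, or a list of queries and merges the results. The `ToySearch` dataclass, the `fake_search` function, and the `retrieve_all` helper are hypothetical stand-ins invented for this illustration; they are not part of LangChain.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Union


@dataclass
class ToySearch:
    """Hypothetical stand-in for the structured query schema."""

    query: str
    publish_year: Optional[int] = None


def retrieve_all(
    analysis: Union[None, ToySearch, List[ToySearch]],
    search_fn: Callable[[ToySearch], List[str]],
) -> List[str]:
    """Run zero, one, or many structured queries and merge the results."""
    if analysis is None:
        return []  # the analyzer produced no query
    queries = analysis if isinstance(analysis, list) else [analysis]
    merged: List[str] = []
    for q in queries:
        for hit in search_fn(q):
            if hit not in merged:  # keep the first occurrence, preserve order
                merged.append(hit)
    return merged


def fake_search(q: ToySearch) -> List[str]:
    """Pretend retriever: returns labels instead of real documents."""
    return [f"{q.query}:{q.publish_year}", "shared-hit"]


no_results = retrieve_all(None, fake_search)
many_results = retrieve_all(
    [ToySearch("RAG", 2023), ToySearch("agents")], fake_search
)
```

Normalizing every case to a list keeps the retrieval step itself unchanged, whichever shape the analyzer returns.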

In [19]:
from typing import List

from langchain_core.documents import Document

In [20]:
def retrieval(search: Search) -> List[Document]:
    if search.publish_year is not None:
        # This is syntax specific to Chroma,
        # the vector database we are using.
        _filter = {"publish_year": {"$eq": search.publish_year}}
    else:
        _filter = None
    return vectorstore.similarity_search(search.query, filter=_filter)
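
The filter above matches a single year. If a year range were needed instead, the same idea might extend as in the sketch below. The `build_year_filter` helper and its `min_year`/`max_year` parameters are hypothetical, and the sketch assumes Chroma's metadata filter operators `$gte`, `$lte`, and `$and`; it only builds the filter dictionary and does not query a vector store.

```python
from typing import Any, Dict, Optional


def build_year_filter(
    min_year: Optional[int] = None, max_year: Optional[int] = None
) -> Optional[Dict[str, Any]]:
    """Build a Chroma-style metadata filter for an optional year range."""
    conditions = []
    if min_year is not None:
        conditions.append({"publish_year": {"$gte": min_year}})
    if max_year is not None:
        conditions.append({"publish_year": {"$lte": max_year}})
    if not conditions:
        return None  # no filter: search all documents
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no "$and" wrapper
    return {"$and": conditions}


no_filter = build_year_filter()
lower_bound = build_year_filter(min_year=2023)
year_range = build_year_filter(min_year=2023, max_year=2024)
```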

In [21]:
retrieval_chain = query_analyzer | retrieval

We can now run this chain on the problematic input from before and see that it returns only results from that year!

In [22]:
results = retrieval_chain.invoke("2023 年发布的 RAG 教程")

In [23]:
[(doc.metadata["title"], doc.metadata["publish_date"]) for doc in results]

[('Getting Started with Multi-Modal LLMs', '2023-12-20 00:00:00'),
 ('Getting Started with Multi-Modal LLMs', '2023-12-20 00:00:00'),
 ('Getting Started with Multi-Modal LLMs', '2023-12-20 00:00:00'),
 ('Getting Started with Multi-Modal LLMs', '2023-12-20 00:00:00')]
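
All four results above are chunks of the same video, because the index stores transcript chunks, not whole videos. If one entry per video is preferred, a small post-processing step can collapse them. This is a minimal sketch that uses plain tuples in place of `Document` objects; `unique_videos` is a hypothetical helper.

```python
from typing import List, Tuple


def unique_videos(hits: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Drop repeated (title, publish_date) pairs, keeping first-seen order."""
    return list(dict.fromkeys(hits))


hits = [("Getting Started with Multi-Modal LLMs", "2023-12-20 00:00:00")] * 4
videos = unique_videos(hits)
```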