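# Query Transformations for Improved Retrieval in RAG Systems

## Overview

This code implements three query transformation techniques that enhance the retrieval step of Retrieval-Augmented Generation (RAG) systems:

1. Query rewriting
2. Step-back prompting
3. Sub-query decomposition

Each technique modifies or expands the original query to improve the relevance and comprehensiveness of the retrieved information.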

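## Motivation

RAG systems often struggle to retrieve the most relevant information, especially when a query is complex or ambiguous. These query transformation techniques address the problem by reformulating the query so that it better matches relevant documents or retrieves more comprehensive information.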

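## Key Components

1. Query rewriting: reformulates the query to be more specific and detailed.
2. Step-back prompting: generates a broader query to better retrieve background information.
3. Sub-query decomposition: breaks a complex query into simpler sub-queries.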

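## Method Details

### 1. Query Rewriting

- **Purpose**: Make the query more specific and detailed, increasing the likelihood of retrieving relevant information.
- **Implementation**:
  - Uses the GPT-4o model (`gpt-4o`) with a custom prompt template.
  - Takes the original query and reformulates it to be more specific and detailed.

### 2. Step-back Prompting

- **Purpose**: Generate a broader, more general query that helps retrieve relevant background information.
- **Implementation**:
  - Uses the GPT-4o model (`gpt-4o`) with a custom prompt template.
  - Takes the original query and generates a more general "step-back" query.

### 3. Sub-query Decomposition

- **Purpose**: Break a complex query into simpler sub-queries for more comprehensive retrieval.
- **Implementation**:
  - Uses the GPT-4o model (`gpt-4o`) with a custom prompt template.
  - Decomposes the original query into 2-4 simpler sub-queries.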

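## Benefits of These Approaches

1. **Improved relevance**: Query rewriting helps retrieve more specific and relevant information.
2. **Better context**: Step-back prompting retrieves broader context and background information.
3. **Comprehensive results**: Sub-query decomposition retrieves information covering the different aspects of a complex query.
4. **Flexibility**: Each technique can be used on its own or combined with the others, depending on the use case.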
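
One way to combine the techniques is to collect the original query together with every transformed variant into a single list of retrieval queries. The sketch below is only illustrative: `build_retrieval_queries` is a hypothetical helper, not part of the original code, and it assumes each transform is a callable that returns either one query string or a list of query strings.

```python
from typing import Callable, Iterable, List, Union


def build_retrieval_queries(
    original_query: str,
    transforms: Iterable[Callable[[str], Union[str, List[str]]]],
) -> List[str]:
    """Return the original query plus every transformed variant, without duplicates.

    Hypothetical helper: it only assumes that each transform maps a query
    string to either a single string or a list of strings.
    """
    queries = [original_query]
    for transform in transforms:
        result = transform(original_query)
        # A transform may return one query (str) or several (list of str).
        variants = [result] if isinstance(result, str) else list(result)
        for variant in variants:
            variant = variant.strip()
            # Skip empty strings and anything already collected.
            if variant and variant not in queries:
                queries.append(variant)
    return queries
```

With the functions defined later in this notebook, a call could look like `build_retrieval_queries(query, [rewrite_query, generate_step_back_query, decompose_query])`. Keeping the original query first means a downstream retriever still sees the user's own wording.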

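## Implementation Details

- All techniques use OpenAI's GPT-4o model (`gpt-4o`, with `temperature=0`) for query transformation.
- Custom prompt templates guide the model toward the appropriate transformation.
- The code provides a separate function for each transformation technique, which makes it easy to integrate into an existing RAG system.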
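
To illustrate what integration might look like, the sketch below runs a retriever once per transformed query and merges the results. It is a minimal sketch under stated assumptions: `retrieve_for_queries` and its `retrieve` parameter are hypothetical names, and the retriever is assumed to be any callable that maps a query string to a list of text chunks. The notebook itself does not define a retriever.

```python
from typing import Callable, List, Sequence


def retrieve_for_queries(
    queries: Sequence[str],
    retrieve: Callable[[str], List[str]],
    max_results: int = 10,
) -> List[str]:
    """Run the retriever once per query and merge the results in order.

    `retrieve` is an assumed interface (query -> list of text chunks);
    substitute whatever retriever your RAG system already exposes.
    """
    seen = set()
    merged: List[str] = []
    for query in queries:
        for chunk in retrieve(query):
            # Keep the first occurrence of each chunk and drop repeats.
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged[:max_results]
```

Deduplicating matters here because the transformed queries overlap in meaning, so the same chunk is likely to be returned more than once.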

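## Example Use Case

The code demonstrates each technique on the example query:
"What are the impacts of climate change on the environment?"

- **Query rewriting** expands it to cover specific aspects such as biodiversity, sea levels, weather patterns, and ecosystems.
- **Step-back prompting** generalizes it to "What are the general effects of climate change?"
- **Sub-query decomposition** breaks it into questions about biodiversity and ecosystems, the oceans, weather patterns, and agriculture.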

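## Conclusion

These query transformation techniques offer practical ways to strengthen the retrieval step of a RAG system. By reformulating the query in different ways, they can improve the relevance, context, and comprehensiveness of the retrieved information. They are especially useful in domains where queries tend to be complex or multi-faceted, such as scientific research, legal analysis, or broad fact-finding tasks.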


### Import libraries and set environment variables

In [1]:
from langchain_openai import ChatOpenAI
from langchain.prompts import PromptTemplate

import os
from dotenv import load_dotenv

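# Load environment variables from a .env file
load_dotenv()

# Fail early with a clear message if the OpenAI API key is missing
# (assigning None to os.environ would raise a TypeError instead)
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("OPENAI_API_KEY is not set; add it to the environment or a .env file")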

### 1 - Query rewriting: reformulating queries to improve retrieval

In [2]:
re_write_llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)

# Create a prompt template for query rewriting
query_rewrite_template = """You are an AI assistant tasked with reformulating user queries to improve retrieval in a RAG system. 
Given the original query, rewrite it to be more specific, detailed, and likely to retrieve relevant information.

Original query: {original_query}

Rewritten query:"""

query_rewrite_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=query_rewrite_template
)

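# Compose the prompt and the model into a runnable chain for query rewriting
query_rewriter = query_rewrite_prompt | re_write_llm

def rewrite_query(original_query):
    """
    Rewrite the original query to improve retrieval.
    
    Args:
    original_query (str): The original user query
    
    Returns:
    str: The rewritten query
    """
    # Pass the template variable explicitly by name
    response = query_rewriter.invoke({"original_query": original_query})
    return response.content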

### Example

In [3]:
# Example query over the "Understanding Climate Change" dataset
original_query = "What are the impacts of climate change on the environment?"
rewritten_query = rewrite_query(original_query)
print("Original query:", original_query)
print("\nRewritten query:", rewritten_query)

Original query: What are the impacts of climate change on the environment?

Rewritten query: How does climate change affect various aspects of the environment, such as biodiversity, sea levels, weather patterns, and ecosystems?


### 2 - Step-back prompting: generating broader queries for better context retrieval

In [4]:
step_back_llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)


# Create a prompt template for step-back prompting
step_back_template = """You are an AI assistant tasked with generating broader, more general queries to improve context retrieval in a RAG system.
Given the original query, generate a step-back query that is more general and can help retrieve relevant background information.

Original query: {original_query}

Step-back query:"""

step_back_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=step_back_template
)

# Compose the prompt and the model into a runnable chain for step-back prompting
step_back_chain = step_back_prompt | step_back_llm

def generate_step_back_query(original_query):
    """
    Generate a step-back query to retrieve broader context.
    
    Args:
    original_query (str): The original user query
    
    Returns:
    str: The step-back query
    """
    response = step_back_chain.invoke({"original_query": original_query})
    return response.content

### Example

In [5]:
# Example query over the "Understanding Climate Change" dataset
original_query = "What are the impacts of climate change on the environment?"
step_back_query = generate_step_back_query(original_query)
print("Original query:", original_query)
print("\nStep-back query:", step_back_query)

Original query: What are the impacts of climate change on the environment?

Step-back query: What are the general effects of climate change?


### 3 - Sub-query decomposition: breaking complex queries into simpler sub-queries

In [6]:
sub_query_llm = ChatOpenAI(temperature=0, model_name="gpt-4o", max_tokens=4000)

# Create a prompt template for sub-query decomposition
subquery_decomposition_template = """You are an AI assistant tasked with breaking down complex queries into simpler sub-queries for a RAG system.
Given the original query, decompose it into 2-4 simpler sub-queries that, when answered together, would provide a comprehensive response to the original query.

Original query: {original_query}

example: What are the impacts of climate change on the environment?

Sub-queries:
1. What are the impacts of climate change on biodiversity?
2. How does climate change affect the oceans?
3. What are the effects of climate change on agriculture?
4. What are the impacts of climate change on human health?"""


subquery_decomposition_prompt = PromptTemplate(
    input_variables=["original_query"],
    template=subquery_decomposition_template
)

# Compose the prompt and the model into a runnable chain for sub-query decomposition
subquery_decomposer_chain = subquery_decomposition_prompt | sub_query_llm

def decompose_query(original_query: str):
    """
    Decompose the original query into simpler sub-queries.
    
    Args:
    original_query (str): The original complex query
    
    Returns:
    List[str]: A list of simpler sub-queries
    """
    response = subquery_decomposer_chain.invoke({"original_query": original_query}).content
    sub_queries = [q.strip() for q in response.split('\n') if q.strip() and not q.strip().startswith('Sub-queries:')]
    return sub_queries

### Example

In [7]:
# Example query over the "Understanding Climate Change" dataset
original_query = "What are the impacts of climate change on the environment?"
sub_queries = decompose_query(original_query)
print("\nSub-queries:")
for sub_query in sub_queries:
    print(sub_query)


Sub-queries:
Original query: What are the impacts of climate change on the environment?
1. How does climate change affect biodiversity and ecosystems?
2. What are the impacts of climate change on ocean temperatures and sea levels?
3. How does climate change influence weather patterns and extreme weather events?
4. What are the effects of climate change on agriculture and food production?
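
Note that the output above includes an echoed `Original query:` line, because `decompose_query` keeps every non-empty line except the `Sub-queries:` header. A stricter parser could keep only the numbered items. The sketch below shows one way to do that; `parse_numbered_sub_queries` is a hypothetical helper, not part of the original notebook, and it assumes the model numbers each sub-query as `1.` or `1)`. Unlike the original function, it also strips the numbering.

```python
import re
from typing import List

# Matches lines such as "1. question" or "2) question" and captures the question text.
_NUMBERED_ITEM = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")


def parse_numbered_sub_queries(response_text: str) -> List[str]:
    """Extract only the numbered sub-queries from a model response.

    Hypothetical helper: any line that is not a numbered item (headers,
    an echoed original query, blank lines) is ignored.
    """
    sub_queries = []
    for line in response_text.splitlines():
        match = _NUMBERED_ITEM.match(line)
        if match:
            sub_queries.append(match.group(1))
    return sub_queries
```

Such a helper could replace the list comprehension inside `decompose_query` if downstream code expects clean question strings without list numbers.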
