# Integrating Azure AI Search with NVIDIA NIM and LlamaIndex

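In this notebook, we demonstrate how to build a powerful retrieval-augmented generation (RAG) pipeline with NVIDIA's AI models and LlamaIndex. We use NVIDIA's large language models and embeddings, integrate them with Azure AI Search as the vector store, and use RAG to improve search quality and efficiency.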

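## Benefits
- **Scalability**: Combine NVIDIA's large language models with Azure AI Search for scalable, efficient retrieval.
- **Cost efficiency**: Optimize search and retrieval costs with efficient vector storage and hybrid search techniques.
- **High performance**: Pair powerful language models with vector search for faster, more accurate responses.
- **High quality**: Ground the language model's responses in relevant retrieved documents to improve answer quality.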

## Prerequisites
- 🐍 Python 3.9 or later
- 🔗 [Azure AI Search service](https://learn.microsoft.com/azure/search/)
- 🔗 An NVIDIA API key for accessing NVIDIA's language models and embeddings through NVIDIA NIM microservices

## Features Covered
- ✅ NVIDIA LLM integration (we use [Phi-3.5-MOE](https://build.nvidia.com/microsoft/phi-3_5-moe))
- ✅ NVIDIA embeddings (we use [nv-embedqa-e5-v5](https://build.nvidia.com/nvidia/nv-embedqa-e5-v5))
- ✅ Azure AI Search advanced retrieval modes
- ✅ Document indexing with LlamaIndex
- ✅ RAG with Azure AI Search, LlamaIndex, and NVIDIA LLMs

Let's get started!


In [None]:
!pip install azure-search-documents==11.5.1
!pip install --upgrade llama-index
!pip install --upgrade llama-index-core
!pip install --upgrade llama-index-readers-file
!pip install --upgrade llama-index-llms-nvidia
!pip install --upgrade llama-index-embeddings-nvidia
!pip install --upgrade llama-index-postprocessor-nvidia-rerank
!pip install --upgrade llama-index-vector-stores-azureaisearch
!pip install python-dotenv

## Installation and Requirements
Create a Python environment with Python version >3.10.
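A quick way to confirm your interpreter meets this requirement is a version check like the one below. It is a minimal sketch that reads the requirement as Python 3.10 or later; `MIN_PYTHON` and `python_is_supported` are illustrative names, so adjust the threshold if your environment needs differ.

```python
import sys

# Minimum interpreter version, assumed here to mean "3.10 or later"
MIN_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON) -> bool:
    """Return True if the (major, minor) part of version_info meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if not python_is_supported():
    print(f"Python {sys.version.split()[0]} found; please use {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or later.")
```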

## Getting Started!


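To get started with NVIDIA AI Foundation models, you need an `NVIDIA_API_KEY`:
1) Create a free [NVIDIA](https://build.nvidia.com/explore/discover) account.
2) Click the model of your choice.
3) Under the input options, select the Python tab, click **Get API Key**, and then click **Generate Key**.
4) Copy and save the generated key as `NVIDIA_API_KEY`. From there, you should have access to the endpoints.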


In [3]:
import getpass
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

if not os.environ.get("NVIDIA_API_KEY", "").startswith("nvapi-"):
    nvidia_api_key = getpass.getpass("Enter your NVIDIA API key: ")
    assert nvidia_api_key.startswith("nvapi-"), f"{nvidia_api_key[:5]}... is not a valid key"
    os.environ["NVIDIA_API_KEY"] = nvidia_api_key


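## RAG Example with an LLM and Embeddings
### 1) Initialize the LLM
`llama-index-llms-nvidia`, also known as NVIDIA's LLM connector, lets you connect to and generate content from compatible models available in the NVIDIA API catalog. For a list of chat completion models, see: https://build.nvidia.com/search?term=Text-to-Text

Here we use **microsoft/phi-3.5-moe-instruct** (Phi-3.5-MoE).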


In [75]:
from llama_index.core import Settings
from llama_index.llms.nvidia import NVIDIA

# Here we are using the microsoft/phi-3.5-moe-instruct model from the API Catalog
Settings.llm = NVIDIA(model="microsoft/phi-3.5-moe-instruct", api_key=os.getenv("NVIDIA_API_KEY"))

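### 2) Initialize the Embeddings
`llama-index-embeddings-nvidia`, also known as NVIDIA's embedding connector, lets you connect to and generate embeddings from compatible models available in the NVIDIA API catalog. We chose `nvidia/nv-embedqa-e5-v5` as the embedding model. For a list of text embedding models, see: https://build.nvidia.com/nim?filters=usecase%3Ausecase_text_to_embedding%2Cusecase%3Ausecase_image_to_embedding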


In [6]:
from llama_index.embeddings.nvidia import NVIDIAEmbedding

Settings.embed_model = NVIDIAEmbedding(model="nvidia/nv-embedqa-e5-v5", api_key=os.getenv("NVIDIA_API_KEY"))

### 3) Create an Azure AI Search Vector Store


In [76]:
import logging
import sys
import os
import getpass
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from IPython.display import Markdown, display
from llama_index.vector_stores.azureaisearch import AzureAISearchVectorStore, IndexManagement


search_service_api_key = os.getenv('AZURE_SEARCH_ADMIN_KEY') or getpass.getpass('Enter your Azure Search API key: ')
search_service_endpoint = os.getenv('AZURE_SEARCH_SERVICE_ENDPOINT') or getpass.getpass('Enter your Azure Search service endpoint: ')
search_service_api_version = "2024-07-01"
credential = AzureKeyCredential(search_service_api_key)

# Index name to use
index_name = "llamaindex-nvidia-azureaisearch-demo"

# Use index client to demonstrate creating an index
index_client = SearchIndexClient(
    endpoint=search_service_endpoint,
    credential=credential,
)

# Use search client to demonstrate using existing index
search_client = SearchClient(
    endpoint=search_service_endpoint,
    index_name=index_name,
    credential=credential,
)

In [None]:
vector_store = AzureAISearchVectorStore(
    search_or_index_client=index_client,
    index_name=index_name,
    index_management=IndexManagement.CREATE_IF_NOT_EXISTS,
    id_field_key="id",
    chunk_field_key="chunk",
    embedding_field_key="embedding",
    embedding_dimensionality=1024, # dimensionality for nv-embedqa-e5-v5 model
    metadata_string_field_key="metadata",
    doc_id_field_key="doc_id",
    language_analyzer="en.lucene",
    vector_algorithm_type="exhaustiveKnn",
    # compression_type="binary" # Option to use "scalar" or "binary". NOTE: compression is only supported for HNSW
)
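Setting `vector_algorithm_type="exhaustiveKnn"` asks the service to compare the query vector against every indexed vector rather than traverse an approximate HNSW graph. The sketch below illustrates that brute-force idea in plain Python. It assumes cosine similarity as the metric (the metric actually used depends on the vector profile the store creates), and the toy 3-dimensional vectors and chunk ids are hypothetical stand-ins for real 1024-dimensional embeddings.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exhaustive_knn(query: List[float], vectors: Dict[str, List[float]], k: int = 3) -> List[Tuple[str, float]]:
    """Score the query against every stored vector and return the top-k (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors standing in for nv-embedqa-e5-v5 embeddings
store = {
    "chunk-a": [1.0, 0.0, 0.0],
    "chunk-b": [0.7, 0.7, 0.0],
    "chunk-c": [0.0, 0.0, 1.0],
}
print(exhaustive_knn([1.0, 0.1, 0.0], store, k=2))
```

Exhaustive search returns exact nearest neighbors, but its cost grows linearly with the number of indexed vectors, which is why HNSW is usually preferred for large indexes.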

In [20]:
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.text_splitter import TokenTextSplitter

# Configure the text splitter (nv-embedqa-e5-v5 accepts at most 512 tokens per input)
text_splitter = TokenTextSplitter(separator=" ", chunk_size=500, chunk_overlap=10)

# Load documents
documents = SimpleDirectoryReader(
    input_files=["data/txt/state_of_the_union.txt"]
).load_data()
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create index with text splitter
index = VectorStoreIndex.from_documents(
    documents,
    transformations=[text_splitter],
    storage_context=storage_context,
)
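The splitter above cuts each document into windows of at most 500 tokens with a 10-token overlap, leaving headroom under the embedding model's 512-token input limit. The following sketch shows the sliding-window idea only: it counts whitespace-separated words, whereas `TokenTextSplitter` counts tokens with a real tokenizer, so actual chunk boundaries will differ. `split_into_token_windows` is a hypothetical helper, not part of LlamaIndex.

```python
from typing import List


def split_into_token_windows(text: str, chunk_size: int = 500, chunk_overlap: int = 10) -> List[str]:
    """Split text into windows of at most chunk_size whitespace tokens,
    where consecutive windows share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once this window already reaches the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Usage: 25 words, windows of 10 with an overlap of 2
sample = " ".join(f"w{i}" for i in range(25))
chunks = split_into_token_windows(sample, chunk_size=10, chunk_overlap=2)
print(len(chunks))
```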

### 4) Create a Query Engine to Ask Questions About Your Data

The following query uses pure vector search in Azure AI Search and grounds the response with our LLM (Phi-3.5-MoE).


In [69]:
query_engine = index.as_query_engine()
response = query_engine.query("Who did the speaker mention as being present in the chamber?")
display(Markdown(f"{response}"))

 The speaker mentioned the Ukrainian Ambassador to the United States, along with other members of Congress, the Cabinet, and various officials such as the Vice President, the First Lady, and the Second Gentleman, as being present in the chamber.
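Under the hood, a RAG query engine retrieves the top matching chunks, packs them into a prompt as context, and asks the LLM to answer from that context. The sketch below shows that prompt-assembly step in isolation; `QA_TEMPLATE` and `build_rag_prompt` are hypothetical, and the template is only loosely modeled on, not identical to, LlamaIndex's default question-answering prompt.

```python
from typing import List

# Hypothetical template; LlamaIndex's real default QA prompt differs in wording
QA_TEMPLATE = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Using only the context above, answer the question.\n"
    "Question: {question}\n"
    "Answer: "
)


def build_rag_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Join the retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(retrieved_chunks)
    return QA_TEMPLATE.format(context=context, question=question)


prompt = build_rag_prompt(
    "Who did the speaker mention as being present in the chamber?",
    [
        "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.",
        "Members of Congress and the Cabinet.",
    ],
)
print(prompt)
```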

The following query uses hybrid search in Azure AI Search.


In [70]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.vector_stores.types import VectorStoreQueryMode
from IPython.display import Markdown, display
from llama_index.core.schema import MetadataMode

# Initialize hybrid retriever and query engine
hybrid_retriever = index.as_retriever(vector_store_query_mode=VectorStoreQueryMode.HYBRID)
hybrid_query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)

# Query execution
query = "What were the exact economic consequences mentioned in relation to Russia's stock market?"
response = hybrid_query_engine.query(query)

# Display the response
display(Markdown(f"{response}"))
print("\n")

# Print the source nodes
print("Source Nodes:")
for node in response.source_nodes:
    print(node.get_content(metadata_mode=MetadataMode.LLM))

 The Russian stock market experienced a significant drop, losing 40% of its value. Additionally, trading had to be suspended due to the ongoing situation.



Source Nodes:
file_path: data\txt\state_of_the_union.txt

building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. 

I spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression.  

We countered Russia’s lies with truth.   

And now that he has acted the free world is holding him accountable. 

Along with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland. 

We are inflicting pain on Russia and supporting the people of Ukraine. Putin is now isolated from the world more than ever. 

Together with our allies –we are right now enforcing powerful economic sanctions. 

We are cutting off Russia’s largest banks from the international financial system.  



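#### Hybrid Search Analysis
The LLM's answer accurately captures the key economic consequence for Russia's stock market mentioned in the source text: the Russian stock market dropped sharply, losing 40% of its value, and trading was suspended because of the ongoing situation. The answer is consistent with the source information, indicating that the LLM correctly identified and summarized the stock market effects of Russia's actions and the sanctions imposed on it.

#### Source Node Commentary
The source nodes describe the economic consequences Russia faces from international sanctions. The text highlights that the Russian stock market lost 40% of its value and that trading was suspended. It also mentions other economic effects, such as the fall of the ruble and the broader isolation of Russia's economy. The LLM's answer distills the key points from these nodes, focusing on the stock market effects the query asked about.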
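In hybrid mode, Azure AI Search runs a keyword (BM25) query and a vector query in parallel and merges the two ranked lists with Reciprocal Rank Fusion (RRF): each document scores the sum of `1 / (k + rank)` over the lists it appears in. The sketch below implements RRF with the commonly used constant `k = 60`; the chunk ids and both rankings are made up for illustration, and the service's internal scoring details may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[Tuple[str, float]]:
    """Fuse ranked lists of document ids: each document gets sum(1 / (k + rank))
    over the lists it appears in, with ranks starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from the two retrieval legs
keyword_ranking = ["chunk-sanctions", "chunk-ruble", "chunk-nato"]         # keyword (BM25) order
vector_ranking = ["chunk-ruble", "chunk-stock-market", "chunk-sanctions"]  # vector similarity order

for doc_id, score in reciprocal_rank_fusion([keyword_ranking, vector_ranking]):
    print(f"{doc_id}: {score:.5f}")
```

Documents that rank reasonably well in both lists (here `chunk-ruble`) rise above documents that appear in only one, which is what lets hybrid search combine exact keyword matches with semantic similarity.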


Now let's look at a query for which hybrid search cannot provide a reliable answer:


In [71]:
# Query execution
query = "What was the precise date when Russia invaded Ukraine?"
response = hybrid_query_engine.query(query)

# Display the response
display(Markdown(f"{response}"))
print("\n")

# Print the source nodes
print("Source Nodes:")
for node in response.source_nodes:
    print(node.get_content(metadata_mode=MetadataMode.LLM))


 The provided context does not specify the exact date of Russia's invasion of Ukraine. However, it does mention that the events discussed are happening in the current era and that the actions taken are in response to Putin's aggression. For the precise date, one would need to refer to external sources or historical records.



Source Nodes:
file_path: data\txt\state_of_the_union.txt

our forces are not engaged and will not engage in conflict with Russian forces in Ukraine.  

Our forces are not going to Europe to fight in Ukraine, but to defend our NATO Allies – in the event that Putin decides to keep moving west.  

For that purpose we’ve mobilized American ground forces, air squadrons, and ship deployments to protect NATO countries including Poland, Romania, Latvia, Lithuania, and Estonia. 

As I have made crystal clear the United States and our Allies will defend every inch of territory of NATO countries with the full force of our collective power.  

And we remain clear-eyed. The Ukrainians are fighting back with pure courage. But the next few days weeks, months, will be hard on them.  

Putin has unleashed violence and chaos.  But while he may make gains on the battlefield – he will pay a continuing high price over the long run. 

And a proud Ukrainian people, who have known 30 years  of independence,

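### Hybrid Search: LLM Response Analysis
In the hybrid search example, the LLM responds that the provided context does not specify the exact date of Russia's invasion of Ukraine. This shows that the LLM is drawing on the source documents while acknowledging that they lack this specific detail.

The response correctly notes that the context mentions events related to Russia's aggression without giving the invasion date. The LLM understands the information it was given and recognizes the gap in it; by pointing the user to external sources or historical records for the exact date, it stays cautious when the information is incomplete.

### Source Node Analysis
The source node in the hybrid search example is an excerpt of the speech about the U.S. response to Russia's actions in Ukraine. It covers the broader geopolitical implications and the measures the United States and its allies are taking in response to the invasion, but it does not mention the invasion date. This is consistent with the LLM's response, which correctly notes that the context lacks the exact date.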


In [72]:
# Initialize the semantic hybrid (reranking) retriever and query engine
semantic_reranker_retriever = index.as_retriever(vector_store_query_mode=VectorStoreQueryMode.SEMANTIC_HYBRID)
semantic_reranker_query_engine = RetrieverQueryEngine(retriever=semantic_reranker_retriever)

# Query execution
query = "What was the precise date when Russia invaded Ukraine?"
response = semantic_reranker_query_engine.query(query)

# Display the response
display(Markdown(f"{response}"))
print("\n")

# Print the source nodes
print("Source Nodes:")
for node in response.source_nodes:
    print(node.get_content(metadata_mode=MetadataMode.LLM))


 The provided context does not specify the exact date of Russia's invasion of Ukraine. However, it mentions that the event occurred six days before the speech was given. To determine the precise date, one would need to know the date of the speech.



Source Nodes:
file_path: data\txt\state_of_the_union.txt

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution. 

And with an unwavering resolve that freedom will always triumph over tyranny. 

Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. 

He thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. 

He met the Ukrainian people. 

From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world. 

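### Hybrid Mode with Reranking: LLM Response Analysis
In the hybrid-with-reranking example, the LLM adds context by noting that the event happened six days before the speech. This shows that the LLM can relate the invasion to the time of the speech, although it would still need the exact date of the speech to give a precise answer.

This response makes better use of contextual clues to give a more informative answer. It highlights the benefit of reranking: more relevant passages (here, the speech's opening, which contains "Six days ago") are placed in front of the LLM, bringing the answer closer to the requested detail (the invasion date).

### Source Node Analysis
The source node in this example refers to the timing of Russia's invasion, stating that it happened six days before the speech. Although the exact date is still not given, this temporal context lets the LLM give a more nuanced response. Including this detail shows how reranking improves the LLM's ability to extract and infer information from the provided context, producing a more accurate and informative answer.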
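Semantic hybrid mode adds a second stage: the hybrid results are rescored by Azure AI Search's semantic ranker, a language-understanding model, and reordered before they reach the LLM. The sketch below shows only the shape of such a two-stage rerank. `term_overlap_score` is a deliberately crude, hypothetical stand-in for the semantic ranker, and the candidate passages are invented examples.

```python
from typing import Callable, List, Tuple


def rerank(query: str, candidates: List[str], score_fn: Callable[[str, str], float],
           top_n: int = 3) -> List[Tuple[str, float]]:
    """Second-stage reranking: rescore first-stage candidates and keep the best top_n."""
    rescored = [(text, score_fn(query, text)) for text in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


def term_overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a semantic ranker: fraction of query terms found in the text."""
    query_terms = {t.strip("?.,").lower() for t in query.split()}
    text_terms = {t.strip("?.,").lower() for t in text.split()}
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


# Hypothetical first-stage candidates, in their original (hybrid) order
candidates = [
    "NATO allies were reassured by the deployment.",
    "Russia invaded Ukraine six days before the speech.",
    "Sanctions cut Russia off from global markets.",
]
for text, score in rerank("When did Russia invade Ukraine?", candidates, term_overlap_score, top_n=2):
    print(f"{score:.2f}  {text}")
```

A real semantic ranker scores meaning rather than shared words, but the pipeline shape is the same: a cheap first stage gathers candidates, and a more expensive scorer decides which ones the LLM sees first.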


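**Note:**  
In this notebook, we used NVIDIA NIM microservices from the NVIDIA API catalog.  
The APIs used above include `NVIDIA (llms)`, `NVIDIAEmbedding`, and [Azure AI Search semantic hybrid retrieval (built-in reranking)](https://learn.microsoft.com/azure/search/semantic-search-overview).  
Note that the NVIDIA connectors above also support self-hosted microservices.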
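One way to switch between the hosted API catalog and a self-hosted NIM without editing code is to build the connector's keyword arguments from configuration. The helper below is a hypothetical sketch: `nvidia_connector_kwargs` and the `NIM_BASE_URL` environment variable in the usage comment are assumptions, while `model`, `base_url`, and `api_key` are the parameters used elsewhere in this notebook.

```python
import os
from typing import Dict, Optional


def nvidia_connector_kwargs(model: str, base_url: Optional[str] = None) -> Dict[str, str]:
    """Build keyword arguments for an NVIDIA connector: target a self-hosted NIM when
    base_url is given, otherwise the hosted API catalog authenticated with NVIDIA_API_KEY."""
    if base_url:
        return {"model": model, "base_url": base_url}
    return {"model": model, "api_key": os.environ.get("NVIDIA_API_KEY", "")}


# Usage (NIM_BASE_URL is a hypothetical environment variable):
# from llama_index.llms.nvidia import NVIDIA
# llm = NVIDIA(**nvidia_connector_kwargs("microsoft/phi-3.5-moe-instruct", os.getenv("NIM_BASE_URL")))
```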

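**Example:**  
```python
from llama_index.llms.nvidia import NVIDIA

# Point the connector at a self-hosted NIM instead of the hosted API catalog
llm = NVIDIA(model="meta/llama3-8b-instruct", base_url="http://your-nim-host-address:8000/v1")
```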



---

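**Disclaimer**:  
This document was translated using the AI translation service [Co-op Translator](https://github.com/Azure/co-op-translator). While we strive for accuracy, please be aware that automated translations may contain errors or inaccuracies. The original-language document should be considered the authoritative source. For critical information, professional human translation is recommended. We are not liable for any misunderstandings or misinterpretations arising from the use of this translation.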
