### Using an HNSW index


~~Write the data to a CSV file, then import it into the new database.~~

Write the data to JSON files instead: writing CSV files runs into minor problems caused by the delimiter.
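The notebook does not say exactly what went wrong with the delimiter. The sketch below is one plausible illustration, using a made-up record: text containing commas breaks a naive comma join, and even a correctly quoted CSV returns the vector as a string, whereas JSON keeps it as a list of floats.

```python
import csv
import io
import json

# Hypothetical record: the text contains commas and the vector is a list of floats.
record = {"text": "A casual, moderately priced restaurant", "dense": [0.1, -0.2, 0.3]}

# Naive CSV: joining on "," produces more columns than there are fields.
naive_line = ",".join(str(value) for value in record.values())
naive_columns = naive_line.split(",")  # 5 columns for 2 fields

# The csv module quotes the commas correctly, but the vector comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_row = next(csv.DictReader(buffer))  # csv_row["dense"] is the str "[0.1, -0.2, 0.3]"

# JSON round-trips the record unchanged, vector included.
json_row = json.loads(json.dumps(record))
```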

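Before importing, the new database needs one entity with matching fields already inserted, otherwise it cannot be found in Attu. Delete that entity right before the import.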

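When exporting, make sure the `dense` field holds the raw vector values rather than their NumPy representation; fetching the data with `client.search` takes care of this.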
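If some rows do end up holding array-like values, a small conversion step before `json.dump` can normalise them. This is a minimal sketch, not part of the original notebook: `to_jsonable` is a hypothetical helper, and it relies only on the fact that NumPy arrays and NumPy scalars expose a `tolist()` method.

```python
import json


def to_jsonable(value):
    """Recursively convert array-like values into plain Python objects."""
    if hasattr(value, "tolist"):  # NumPy arrays and NumPy scalars expose tolist()
        return value.tolist()
    if isinstance(value, dict):
        return {key: to_jsonable(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_jsonable(item) for item in value]
    return value
```

Applied as `json.dump([to_jsonable(row) for row in rows], f)`, it leaves rows that are already plain lists and strings unchanged.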

In [None]:
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
import os

load_dotenv()

embed_model = OpenAIEmbeddings(
    api_key=os.getenv("BL_API_KEY"),
    base_url=os.getenv("BL_BASE_URL"),
    model="text-embedding-v3",
    dimensions=1024,
    check_embedding_ctx_length=False
)

In [None]:
# Create the dense-only and the hybrid (dense + sparse) vector stores
from langchain_milvus import Milvus, BM25BuiltInFunction

# Dense-only
dense_vs_hnsw = Milvus(
    collection_name='dense_hotpotqa500_hnsw',
    embedding_function=embed_model,
    vector_field='dense',
    index_params={
        "metric_type": "L2",
        "index_type": "HNSW",
        "params": {
            "M": 64,
            "efConstruction": 400
        }
    }
)

# Hybrid (dense + sparse)
dense_index_param = {
    "metric_type": "L2",
    "index_type": "HNSW",
    "params": {
        "M": 64,
        "efConstruction": 400
    }
}

sparse_index_param = {
    "metric_type": "BM25",
    "index_type": "AUTOINDEX",
    "params": {}
}

hybrid_vs_hnsw = Milvus(
    collection_name='hybrid_hotpotqa500_hnsw',
    embedding_function=embed_model,
    builtin_function=BM25BuiltInFunction(output_field_names='sparse'),
    vector_field=['dense', 'sparse'],
    index_params=[dense_index_param, sparse_index_param]  # same order as vector_field
)

In [None]:
# Load hybrid_hotpotqa_500 and use search_iterator to export all of its data to JSON files
# The CSV file format matches what Attu exports

hybrid_vs = Milvus(
    collection_name='hybrid_hotpotqa_500',
    embedding_function=embed_model,
    builtin_function=BM25BuiltInFunction(output_field_names='sparse'),
    vector_field=['dense', 'sparse']
)

# Any question works here
query_embedding = embed_model.embed_query('What is the name of the restaurant?')

In [None]:
from typing import List, Dict
from pymilvus.client.search_result import Hit
from pymilvus.orm.iterator import SearchPage

iterator = hybrid_vs.client.search_iterator(
    collection_name='hybrid_hotpotqa_500',
    data=[query_embedding],
    anns_field='dense',
    batch_size=1000,
    output_fields=['*']
)

# Use search_iterator to fetch every entity in the collection
total = 0
res: List[Dict] = []
while True:
    result: SearchPage = iterator.next()
    if not result:
        iterator.close()
        break
    
    for hit in result:  # hit type: Hit
        hit_dict = {
            'dataset': hit['dataset'],
            'source': hit['source'],
            'page': hit['page'],
            'start_index': hit['start_index'],
            'question': hit['question'],
            'title': hit['title'],
            'dense': hit['dense'],
            'hotpotqa_id': hit['hotpotqa_id'],
            'text': hit['text'],
            'pk': hit['pk']
        }
        res.append(hit_dict)
    
    total += len(result)
    print(total)

print(f'Total results: {total}')

1000
2000
3000
4000
5000
6000
7000
8000
8293
Total results: 8293
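The export loop above can be wrapped in a reusable helper. The following is a sketch under assumptions rather than the notebook's own code: `drain_iterator` and `EXPORT_FIELDS` are hypothetical names, and the iterator is only assumed to offer the `next()` and `close()` methods used above, with hits that support `hit[name]` lookups.

```python
from typing import Any, Dict, Iterable, List

# Field names copied from the export loop above.
EXPORT_FIELDS = [
    'dataset', 'source', 'page', 'start_index', 'question',
    'title', 'dense', 'hotpotqa_id', 'text', 'pk',
]


def drain_iterator(iterator: Any, fields: Iterable[str]) -> List[Dict[str, Any]]:
    """Read pages until an empty one is returned, keeping only the given fields."""
    fields = list(fields)
    rows: List[Dict[str, Any]] = []
    try:
        while True:
            page = iterator.next()
            if not page:  # an empty page signals the end of the data
                break
            for hit in page:
                rows.append({name: hit[name] for name in fields})
    finally:
        iterator.close()  # close the iterator even if a page fails to parse
    return rows
```

With the iterator created above, `rows = drain_iterator(iterator, EXPORT_FIELDS)` would replace the explicit `while` loop.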


In [None]:
# Dump res to JSON files, which can later be imported into the database with Attu
import json

output_filename1 = r'../../hybrid_hotpotqa_500_1.json'
with open(output_filename1, 'w', encoding='utf-8') as f:
    json.dump(res[:4000], f, ensure_ascii=False, indent=4)

output_filename2 = r'../../hybrid_hotpotqa_500_2.json'
with open(output_filename2, 'w', encoding='utf-8') as f:
    json.dump(res[4000:8000], f, ensure_ascii=False, indent=4)

output_filename3 = r'../../hybrid_hotpotqa_500_3.json'
with open(output_filename3, 'w', encoding='utf-8') as f:
    json.dump(res[8000:], f, ensure_ascii=False, indent=4)
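The three hard-coded slices only cover this particular dataset size. A generic version could look like the sketch below. `write_json_chunks` is a hypothetical helper, and the default chunk size of 4000 simply mirrors the slices above (the notebook does not state why the export is split).

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Sequence


def write_json_chunks(
    rows: Sequence[Dict[str, Any]],
    prefix: str,
    chunk_size: int = 4000,
) -> List[Path]:
    """Write rows to <prefix>_1.json, <prefix>_2.json, ... and return the paths."""
    paths: List[Path] = []
    for part, start in enumerate(range(0, len(rows), chunk_size), start=1):
        path = Path(f"{prefix}_{part}.json")
        with path.open("w", encoding="utf-8") as f:
            json.dump(list(rows[start:start + chunk_size]), f, ensure_ascii=False, indent=4)
        paths.append(path)
    return paths
```

Calling `write_json_chunks(res, '../../hybrid_hotpotqa_500')` would produce the same three file names as the cell above.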

In [None]:
# First insert a single entity into dense_hotpotqa500_hnsw and hybrid_hotpotqa500_hnsw,
# otherwise they cannot be found in Attu; delete it again once it has been inserted
hybrid_vs.similarity_search(query='What is the name of the restaurant?', k=1)

[Document(metadata={'title': 'List of casual dining restaurant chains', 'start_index': 0, 'pk': '5b8f0b10-c321-4c35-9f41-ad6bce42e9bb', 'page': 0, 'hotpotqa_id': '5ae4932855429970de88d9b8', 'dataset': 'HotpotQA', 'source': 'HotpotQA_List of casual dining restaurant chains_5ae4932855429970de88d9b8', 'question': "Which  American chain of bakery-café fast casual restaurants sponsored Bill Steers Men's 4-Miler"}, page_content='This is a list of casual dining restaurant chains around the world, arranged in alphabetical order. A casual dining restaurant is a restaurant that serves moderately priced food in a casual atmosphere. Except for buffet-style restaurants and, more recently, fast casual restaurants, casual dining restaurants usually provide table service.')]

In [None]:
from langchain_core.documents import Document

insert_enity = [
    Document(
        page_content='This is a list of casual dining restaurant chains around the world, arranged in alphabetical order. A casual dining restaurant is a restaurant that serves moderately priced food in a casual atmosphere. Except for buffet-style restaurants and, more recently, fast casual restaurants, casual dining restaurants usually provide table service.',
        metadata={'title': 'List of casual dining restaurant chains', 'start_index': 0, 'page': 0, 'hotpotqa_id': '5ae4932855429970de88d9b8', 'dataset': 'HotpotQA', 'source': 'HotpotQA_List of casual dining restaurant chains_5ae4932855429970de88d9b8', 'question': "Which  American chain of bakery-café fast casual restaurants sponsored Bill Steers Men's 4-Miler"}
    )
]

In [None]:
from uuid import uuid4

dense_vs_hnsw.add_documents(documents=insert_enity, ids=[str(uuid4())])
hybrid_vs_hnsw.add_documents(documents=insert_enity, ids=[str(uuid4())])

['7e608905-ab91-43de-9b81-e91a30a74559']

Afterwards:

1. Delete the entity that was just inserted from `dense_hotpotqa500_hnsw` and `hybrid_hotpotqa500_hnsw`

2. In Attu, import the JSON files into `dense_hotpotqa500_hnsw` and `hybrid_hotpotqa500_hnsw`
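Step 1 can also be done from code instead of Attu. The sketch below assumes that the `client` attribute used throughout this notebook is a pymilvus `MilvusClient`, whose `delete` method accepts `collection_name` and `ids`; `remove_placeholder` is a hypothetical helper name.

```python
from typing import Any, List


def remove_placeholder(client: Any, collection_name: str, ids: List[str]) -> Any:
    """Delete the placeholder entities by primary key and return the client's response."""
    return client.delete(collection_name=collection_name, ids=ids)


# Usage sketch (needs a running Milvus instance, so it is left commented out):
# placeholder_ids = dense_vs_hnsw.add_documents(documents=insert_enity, ids=[str(uuid4())])
# remove_placeholder(dense_vs_hnsw.client, 'dense_hotpotqa500_hnsw', placeholder_ids)
```

Keeping the IDs returned by `add_documents` avoids having to look up the placeholder's primary key afterwards.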

In [None]:
dense_vs_hnsw.client.describe_index(
    collection_name='dense_hotpotqa500_hnsw',
    index_name='dense'
)

{'metric_type': 'L2',
 'index_type': 'HNSW',
 'params': {'M': 64, 'efConstruction': 400},
 'field_name': 'dense',
 'index_name': 'dense',
 'total_rows': 8293,
 'indexed_rows': 8293,
 'pending_index_rows': 0,
 'state': 'Finished'}

In [None]:
hybrid_vs_hnsw.client.describe_index(
    collection_name='hybrid_hotpotqa500_hnsw',
    index_name='dense'
)

{'metric_type': 'L2',
 'index_type': 'HNSW',
 'params': {'M': 64, 'efConstruction': 400},
 'field_name': 'dense',
 'index_name': 'dense',
 'total_rows': 8293,
 'indexed_rows': 8293,
 'pending_index_rows': 0,
 'state': 'Finished'}

In [None]:
# Define a function that checks whether two retrieval results are identical
from typing import Tuple, List
from langchain_core.documents import Document

def is_same_result(
        dense_res: List[Tuple[Document, float]],
        hybrid_res: List[Tuple[Document, float]]
) -> bool:
    """
    Check whether dense retrieval on the two vector stores returns the same results.

    Args:
        dense_res (List[Tuple[Document, float]]): results from the dense-only store
        hybrid_res (List[Tuple[Document, float]]): results from the hybrid store

    Returns:
        bool: whether the two results are identical
    """
    # Check that both results have the same length
    if len(dense_res) != len(hybrid_res):
        return False

    # Check that the contents match, position by position
    for d_res, h_res in zip(dense_res, hybrid_res):
        if (
            d_res[1] != h_res[1]  # compare scores
            or d_res[0].page_content != h_res[0].page_content  # compare document text
        ):
            return False

    return True
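`is_same_result` compares the float scores with `!=`, so any difference in the last bits counts as a mismatch. A tolerance-based variant is sketched below as an alternative, not as the notebook's method. `is_close_result` and its default tolerances are assumptions chosen for illustration.

```python
import math
from typing import Any, Sequence, Tuple


def is_close_result(
    dense_res: Sequence[Tuple[Any, float]],
    hybrid_res: Sequence[Tuple[Any, float]],
    rel_tol: float = 1e-6,
    abs_tol: float = 1e-9,
) -> bool:
    """Like is_same_result, but scores only need to agree within a tolerance."""
    if len(dense_res) != len(hybrid_res):
        return False
    for (d_doc, d_score), (h_doc, h_score) in zip(dense_res, hybrid_res):
        if not math.isclose(d_score, h_score, rel_tol=rel_tol, abs_tol=abs_tol):
            return False
        if d_doc.page_content != h_doc.page_content:
            return False
    return True
```

Running both checks on the same queries would show whether mismatches come from tiny numeric differences or from different documents being returned.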

In [None]:
# Load the questions and their embeddings
import json
from loguru import logger as log
import pprint

all_questions = []
question_path = r'../../evaluation/hotpotqa/hybrid_test_data_w_embedding_500.json'
with open(question_path, 'r', encoding='utf-8') as f:
    all_questions = json.load(f)

log.info(f'Loaded {len(all_questions)} questions')
pprint.pprint(all_questions[0])

# Query parameters
top_k = 10
search_params = {
    'metric_type': 'L2',
    'params': {'ef': 21}
}

2025-07-28 14:43:05.128 | INFO     | __main__:<module>:11 - Loaded 500 questions


{'_id': '5a8ba761554299240d9c2066',
 'answer': 'nearly 80 years',
 'context': [['Bus Stop (TV series)',
              ['Bus Stop is a 26-episode American drama which aired on ABC '
               'from October 1, 1961, until March 25, 1962, starring Marilyn '
               'Maxwell as Grace Sherwood, the owner of a bus station and '
               'diner in the fictitious town of Sunrise in the Colorado '
               'Rockies.',
               ' The program was adapted from William Inge\'s play, "Bus '
               'Stop", and Inge was a script consultant for the series, which '
               'followed the lives of travelers passing through the bus '
               'station and the diner.',
               " Maxwell's co-stars were Richard Anderson as District Attorney "
               'Glenn Wagner, Rhodes Reason as Sheriff Will Mayberry, Joan '
               'Freeman as waitress Elma Gahrigner, Bernard Kates as Ralph the '
               'coroner, and Buddy Ebsen as Virge Bles

In [None]:
# Run client.search against both vector stores
from tqdm import tqdm

diff_res = []
same_res_num = 0

for question in tqdm(all_questions):
    raw_dense_res = dense_vs_hnsw.client.search(
        collection_name='dense_hotpotqa500_hnsw',
        data=[question['embedding']],
        anns_field='dense',  # in the dense_hotpotqa_500 collection, the dense embedding field is 'vector'
        limit=top_k,
        search_params=search_params,
        output_fields=["*"]
    )
    dense_res = dense_vs_hnsw._parse_documents_from_search_results(raw_dense_res)


    raw_hybrid_res = hybrid_vs_hnsw.client.search(
        collection_name='hybrid_hotpotqa500_hnsw',
        data=[question['embedding']],
        anns_field='dense',
        limit=top_k,
        search_params=search_params,
        output_fields=["*"]
    )
    hybrid_res = hybrid_vs_hnsw._parse_documents_from_search_results(raw_hybrid_res)

    # Compare the retrieval results
    if is_same_result(dense_res, hybrid_res):
        same_res_num += 1
    else:
        tmp = {
            'question': question['question'],
            'dense_res': dense_res,
            'hybrid_res': hybrid_res
        }
        diff_res.append(tmp)

        # log.info(f'{question["question"]} retrieval results differ')
        tqdm.write(f'{question["question"]} retrieval results differ')
# Report the results
log.info(f'Questions with identical retrieval results: {same_res_num} / {len(all_questions)}')
log.info(f'Questions with differing retrieval results: {len(diff_res)}')

  2%|▏         | 9/500 [00:00<00:06, 80.52it/s]

How long did the career span of the actor who starred with Mickey Rooney and Marilyn Maxwell in Off Limits? retrieval results differ


  6%|▌         | 28/500 [00:00<00:05, 82.12it/s]

Spider9 was founded in 2011 by the head of which subsidiary of Wanxiang Group? retrieval results differ
ICI House is now named after the company that provides what type of item? retrieval results differ


 13%|█▎        | 65/500 [00:00<00:05, 83.06it/s]

When is the football club which Stan Spinks played for founded retrieval results differ
Which German project recorded a song that featured vocals by a duo from Silverdale, England? retrieval results differ
Erica Packer was the second wife of what Australian businessman? retrieval results differ


 21%|██        | 104/500 [00:01<00:04, 91.07it/s]

Which of the people featured on Wall of Fame is the daughter of Bernie Ecclestone? retrieval results differ


 25%|██▌       | 125/500 [00:01<00:04, 86.94it/s]

In what county did Michael Ola attend high school? retrieval results differ


 43%|████▎     | 213/500 [00:02<00:03, 92.37it/s]

What was Randy Shughart's rank when he died? retrieval results differ


 51%|█████     | 253/500 [00:02<00:02, 91.25it/s]

What production company is owned by the director of "She Hate Me"? retrieval results differ


 65%|██████▍   | 323/500 [00:03<00:01, 91.45it/s]

What was the formal name of the building that housed the scene that formed the band Hjertestop? retrieval results differ


 68%|██████▊   | 342/500 [00:03<00:01, 85.24it/s]

Scout Tufankjian and Daron Malakian are both what? retrieval results differ


 74%|███████▍  | 371/500 [00:04<00:01, 86.35it/s]

Which writer of French descent actually lived in France, Maurice Level or John Dufresne? retrieval results differ
By what name is the King that Gothard Wilhelm Butler was captain of the guard for known in Poland? retrieval results differ


 82%|████████▏ | 408/500 [00:04<00:01, 82.20it/s]

What did Karan Kapoor's maternal grandfather deliver? retrieval results differ


 85%|████████▌ | 426/500 [00:04<00:00, 79.84it/s]

Do The Importance of Being Icelandic and The Five Obstructions belong to different film genres ? retrieval results differ


 91%|█████████ | 455/500 [00:05<00:00, 84.42it/s]

What album succeeded Kendrick Lamar's album that had the song Money Trees in it? retrieval results differ


100%|██████████| 500/500 [00:05<00:00, 86.81it/s]
2025-07-28 14:45:13.309 | INFO     | __main__:<module>:43 - Questions with identical retrieval results: 483 / 500
2025-07-28 14:45:13.309 | INFO     | __main__:<module>:44 - Questions with differing retrieval results: 17


In [None]:
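# For the questions whose results differ, count the position (top-1 to top-10)
# at which the two result lists first diverge
from collections import defaultdict

res = defaultdict(int)  # key: 0-based position of the first differing score, as a string
less_3 = 0  # number of questions that already diverge within the top 3

for diff in diff_res:
    d_res = diff['dense_res']
    h_res = diff['hybrid_res']

    for i, (d_r, h_r) in enumerate(zip(d_res, h_res)):
        if d_r[1] != h_r[1]:
            res[str(i)] += 1
            if i < 3:
                # Double quotes outside: nesting the same quote type inside an
                # f-string is a SyntaxError before Python 3.12
                log.info(f"{i}: {diff['question']}")
                less_3 += 1
            break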

In [None]:
res

defaultdict(int, {'9': 5, '7': 3, '8': 2, '4': 3, '6': 2, '5': 2})
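The keys are 0-based positions stored as strings, in insertion order, which makes the histogram hard to read. The sketch below re-keys the counts shown above by 1-based rank and sorts them; `first_diff_counts`, `by_rank`, `total_diff` and `earliest_rank` are names introduced here for illustration.

```python
# Counts copied from the output above: 0-based position -> number of questions.
first_diff_counts = {'9': 5, '7': 3, '8': 2, '4': 3, '6': 2, '5': 2}

# Re-key by 1-based rank and sort, so that rank 10 is the last of the top-10 results.
by_rank = {
    int(position) + 1: count
    for position, count in sorted(first_diff_counts.items(), key=lambda kv: int(kv[0]))
}
# by_rank == {5: 3, 6: 2, 7: 2, 8: 3, 9: 2, 10: 5}

total_diff = sum(by_rank.values())  # 17, matching the 17 differing questions
earliest_rank = min(by_rank)        # 5: no question diverges within the top 4
```

Read this way, the scores of the top 4 results agree for all 500 questions, and 5 of the 17 mismatches first appear at the tenth result.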