## Hybrid Search with MyScale

This example shows how to use MyScale's joint-query technique to improve the text search experience for users.

In [None]:
!pip3 install clickhouse-connect prettytable sentence-transformers

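Before running the experiment, we need to sign up on the [MyScale](https://myscale.com/) website and create a free cluster. The steps are as follows:
- Step 1. On [MyScale](https://myscale.com/), click `Free Sign Up` to register an account; you will be redirected to the [console](https://console.myscale.com/clusters).
- Step 2. In the console, click `New Cluster` to create a free cluster. After naming your `Cluster`, keep the default settings and click `Next` to create it.
- Step 3. In the cluster's `Actions` drop-down menu, find the `Connection Details` button and click it to view the cluster connection information.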

Fill in the connection information obtained above in the first `Code Block` below, in this order: `host`, `username`, `password`.
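Instead of hard-coding credentials in the notebook, one option is to read them from environment variables. The following is a minimal sketch under stated assumptions: the variable names `MYSCALE_HOST`, `MYSCALE_USERNAME`, `MYSCALE_PASSWORD`, and `MYSCALE_PORT`, as well as the helper `load_connection_info`, are hypothetical and not a MyScale convention.

```python
import os


def load_connection_info(env=None):
    """Read connection settings from a mapping (defaults to os.environ).

    The variable names are hypothetical; choose whatever suits your setup.
    """
    env = os.environ if env is None else env
    required = ("MYSCALE_HOST", "MYSCALE_USERNAME", "MYSCALE_PASSWORD")
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise KeyError(f"Missing connection settings: {', '.join(missing)}")
    return {
        "host": env["MYSCALE_HOST"],
        "username": env["MYSCALE_USERNAME"],
        "password": env["MYSCALE_PASSWORD"],
        # 443 matches the port used in this notebook; override if needed.
        "port": int(env.get("MYSCALE_PORT", 443)),
    }
```

The returned values can then be assigned to the `host`, `username`, `password`, and `port` variables used in the next cell.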

In [9]:
import time
import uuid
import clickhouse_connect
from clickhouse_connect.driver.client import Client
from prettytable import PrettyTable
from sentence_transformers import SentenceTransformer

# MyScale connection information.
host = "msc-c6548c32.us-east-1.aws.staging.myscale.cloud"
username = "demo"
password = "myscale_rocks"

port = 443
database = "test"
table = "wiki_abstract_50w_1"
dataset_rows = 500000
dataset_url = "https://myscale-example-datasets.s3.amazonaws.com/wiki_abstract_with_vector_50w.parquet"


# get_client function is used to get a MyScale client.
def get_client(_host: str, _port: int, _username: str, _password: str) -> Client:
    return clickhouse_connect.get_client(host=_host, port=_port, user=_username, password=_password,
                                         session_id=str(uuid.uuid4()), send_receive_timeout=30)

# Print your content in table view.
def print_results(result_rows, field_names):
    x = PrettyTable()
    x.field_names = field_names
    for row in result_rows:
        x.add_row(row)
    x.set_style(13)  # 13 is prettytable's MARKDOWN style constant
    print(x)

# Initialize MyScale client.
client = get_client(host, port, username, password)

# Use transformer all-MiniLM-L6-v2
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

To build the dataset, we modified the `Wikipedia abstract dataset` maintained by RediSearch: the [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model converts the `body` column into 384-dimensional vectors, and vector similarity is measured with `Cosine` distance.
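For intuition about the metric, the sketch below computes a cosine distance in plain Python. It assumes the common definition, 1 minus the cosine similarity; the notebook does not spell out how MyScale computes the value internally, so treat this as an illustration rather than the engine's implementation. The function name `cosine_distance` is made up for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)
```

Under this definition, vectors pointing in the same direction have distance 0 regardless of their length, and orthogonal vectors have distance 1, which is why smaller values rank first when results are ordered by distance.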

This experiment uses a [500K-row subset](https://myscale-example-datasets.s3.amazonaws.com/wiki_abstract_with_vector_50w.parquet) of that dataset.

First, we need to create a table on the cluster and import the data.

In [10]:
# Recreate a table.
client.command(f"DROP TABLE IF EXISTS {database}.{table} sync;")
client.command(f"""CREATE TABLE {database}.{table}
(
    `id` UInt64,
    `body` String,
    `title` String,
    `url` String,
    `body_vector` Array(Float32),
    CONSTRAINT check_length CHECK length(body_vector) = 384
)
ENGINE = MergeTree
ORDER BY id;""")
print(f"Tables in cluster:{client.query(f'SHOW TABLES IN {database}').result_rows}")

# Upload data from S3.
time_upload_data_begin = time.time()
try:
    print(f"Start uploading data from S3 to MyScale.")
    client.command(f"INSERT INTO {database}.{table} SELECT * FROM s3('{dataset_url}','Parquet');")
except Exception as e:
    print("Upload data from S3 to MyScale may need more time.")
    _client = get_client(host, port, username, password)
    while True:
        rows_count = _client.query(f"SELECT count(*) from {database}.{table}").result_rows[0][0]
        print(f"\rRows in Table:{rows_count}, time consume:{(time.time() - time_upload_data_begin):.2f} sec.", end='', flush=True)
        if rows_count >= dataset_rows:
            print("\nData has been uploaded completely.")
            break
        time.sleep(3)
print(f"Rows in Table:{client.query(f'SELECT count(*) from {database}.{table}').result_rows[0][0]}")


Tables in cluster:[('wiki_abstract_50w',), ('wiki_abstract_50w_1',)]
Start uploading data from S3 to MyScale.
Rows in Table:500000


Once the dataset has been imported, we need to build a vector index, which speeds up vector search.
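Index builds run in the background, so the notebook polls for the `Built` status. A polling loop of this kind can be factored into a small helper with a timeout so that it cannot wait forever. The sketch below is one possible shape; `wait_for_status` and its parameters are hypothetical names, and the status source is passed in as a callable so the helper does not depend on any particular client.

```python
import time


def wait_for_status(get_status, target="Built", timeout_s=600.0, interval_s=1.0, sleep=time.sleep):
    """Poll get_status() until it returns `target` or the timeout elapses.

    Returns the elapsed seconds on success; raises TimeoutError otherwise.
    """
    start = time.monotonic()
    while True:
        status = get_status()
        if status == target:
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            raise TimeoutError(f"status is still {status!r} after {timeout_s} s")
        # Pause between polls to avoid hammering the server.
        sleep(interval_s)
```

In this notebook, `get_status` could be a lambda that runs the `system.vector_indices` status query and returns `result_rows[0][0]`.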

In [11]:
# Create a vector index.
time_build_index_begin = time.time()
client.command(f"OPTIMIZE TABLE {database}.{table} FINAL;")
client.command(f"ALTER TABLE {database}.{table} DROP VECTOR INDEX IF EXISTS WIKI_MSTG;")
client.command(f"ALTER TABLE {database}.{table} ADD VECTOR INDEX WIKI_MSTG body_vector TYPE MSTG('metric_type=Cosine');")
while True:
    try:
        status = client.query(f"SELECT status FROM system.vector_indices WHERE database = '{database}' AND table = '{table}'").result_rows[0][0]
        print(f"\rBuilding vector index, status is {status}, time consume:{time.time() - time_build_index_begin:.2f} sec", end='.', flush=True)
        if status == 'Built':
            break
        time.sleep(1)
    except Exception as e:
        print(f"Exception happened when getting vector index build status, {e}")
        # Back off before retrying so a failing status query does not spin in a tight loop.
        time.sleep(1)
print(f"\nTotal index build time consume:{(time.time() - time_build_index_begin):.2f} sec.")


Building vector index, status is Built, time consume:67.55 sec. sec.
Total index build time consume:67.55 sec.


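Vector search can lack semantic signal when the query is a short text. For example, if we convert the text `"Islands discovered by BGLE"` into a vector and search with it, we get the results below, which are not the articles we expect.
The islands we actually want to find are those discovered by this organization: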
<iframe
	src="https://en.wikipedia.org/wiki/British_Graham_Land_expedition"
	frameborder="0"
	width="1080"
	height="500"
></iframe>

In [20]:
# Hybrid Search
terms = "Islands discovered by BGLE"
terms_embedding = model.encode([terms])[0]
extracted_terms = 'BGLE Island'
extracted_terms_pattern = [f'(?i){x}' for x in extracted_terms.split(' ')]

# Stage 1. Vector Recall
stage1 = f"""
SELECT id, title, body, distance('alpha=1') (body_vector,{list(terms_embedding)}) AS distance FROM {database}.{table}
ORDER BY distance ASC LIMIT 200"""

stage1_result = client.query(query=stage1)
print_results(stage1_result.result_rows[:5], ["ID", "Title", "Body", "vector_distance"])


|   ID   |          Title          |                  Body                  |   vector_distance   |
|:------:|:-----------------------:|:--------------------------------------:|:-------------------:|
| 11161  |    Rendezvous Islands   | | archipelago      = Discovery Islands | 0.34810376167297363 |
| 11158  |       Read Island       | | archipelago      = Discovery Islands | 0.34810376167297363 |
| 314932 |      Quadra Island      | | archipelago      = Discovery Islands | 0.34810376167297363 |
| 127502 |       Saint Kitts       |    | archipelago = Leeward Islands     |  0.3779163360595703 |
| 123268 | Geography of Montserrat |  | archipelago      = Leeward Islands  |  0.3779163360595703 |


Clearly, these results are not precise enough:
<iframe
	src="https://en.wikipedia.org/wiki/Rendezvous_Islands"
	frameborder="0"
	width="1080"
	height="200"
></iframe>
<iframe
	src="https://en.wikipedia.org/wiki/Read_Island"
	frameborder="0"
	width="1080"
	height="200"
></iframe>
<iframe
	src="https://en.wikipedia.org/wiki/Quadra_Island"
	frameborder="0"
	width="1080"
	height="200"
></iframe>


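We can use a joint query to improve search precision for short texts or single words. For example, for the short text `"BGLE Island"`, we reach our goal in two stages:
- Use vector search to retrieve `200` candidate results first.
- Use MyScale's built-in functions to implement a simplified `TF-IDF` method that reranks the candidates.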
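The reranking stage can be sketched in plain Python to make the two sort keys explicit. This is an illustration under stated assumptions, not the query MyScale executes: the function name `rerank` is hypothetical, candidates are assumed to be `(title, body)` tuples already ordered by vector distance, and terms are escaped with `re.escape`, whereas the notebook passes them to the database as raw regular expressions.

```python
import math
import re


def rerank(candidates, terms, limit=10):
    """Reorder (title, body) candidates by term matches.

    Primary key: how many distinct terms occur (case-insensitive).
    Secondary key: log(1 + total number of term occurrences).
    `terms` is expected to be a non-empty list of strings.
    """
    patterns = [re.compile(re.escape(t), re.IGNORECASE) for t in terms]
    any_term = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)

    def score(candidate):
        title, body = candidate
        text = f"{title} {body}"
        distinct = sum(1 for p in patterns if p.search(text))
        frequency = math.log(1 + len(any_term.findall(text)))
        return (distinct, frequency)

    # sorted() is stable, so ties keep their vector-recall order.
    return sorted(candidates, key=score, reverse=True)[:limit]
```

Sorting on the distinct-term count first means a document mentioning both `BGLE` and `Island` outranks one that repeats only `Island` many times; the logarithm on the second key dampens the effect of sheer repetition.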

Running the Stage 2 `Code Block` below shows that the results now match our expectations.

In [21]:

# Stage 2. Term Reranking
stage2 = f"""
SELECT tempt.id, tempt.title,tempt.body, FQ, TF_IDF FROM ({stage1}) tempt
ORDER BY length(multiMatchAllIndices(arrayStringConcat([body, title], ' '), {extracted_terms_pattern})) AS FQ DESC,
log(1 + countMatches(arrayStringConcat([title, body], ' '), '(?i)({extracted_terms.replace(' ', '|')})')) AS TF_IDF DESC limit 10
"""

time_hybrid_search_begin = time.time()
stage2_result = client.query(query=stage2)
print(f"Hybrid search time consume:{time.time() - time_hybrid_search_begin:.2f} sec.\n\n")
print_results(stage2_result.result_rows, ["ID", "Title", "Body", "MATCH_COUNT", "TF_IDF"])

Hybrid search time consume:0.27 sec.


|   ID   |          Title           |                                                                                                                                                                                                      Body                                                                                                                                                                                                     | MATCH_COUNT |       TF_IDF       |
|:------:|:------------------------:|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|:-----------:|:------------------:|
| 50978  |     