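# ClickHouse Vector Search

>[ClickHouse](https://clickhouse.com/) is an open-source database for real-time applications and analytics that is designed to be fast and resource efficient. It offers full SQL support and a wide range of functions for writing analytical queries. Recently added data structures and distance functions (such as `L2Distance`), together with [approximate nearest neighbor search indexes](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/annindexes), allow ClickHouse to serve as a high-performance, scalable vector database that stores and searches vectors with SQL.

This notebook shows how to use the functionality related to `ClickHouse` vector search.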

## Setting up the environment

Set up a local ClickHouse server with Docker (optional)

In [None]:
! docker run -d -p 8123:8123 -p 9000:9000 --name langchain-clickhouse-server --ulimit nofile=262144:262144 clickhouse/clickhouse-server:23.4.2.11

Install the ClickHouse client driver

In [None]:
!pip install clickhouse-connect

We want to use OpenAIEmbeddings, so we need an OpenAI API key.

In [1]:
import os
import getpass

# .get() avoids a KeyError when the variable is not set at all.
if not os.environ.get('OPENAI_API_KEY'):
    os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')

In [2]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Clickhouse, ClickhouseSettings

In [3]:
from langchain.document_loaders import TextLoader
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
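
To make the splitting step above more concrete, here is a simplified sketch of separator-based chunking: the text is cut at a separator and the pieces are greedily packed into chunks of up to `chunk_size` characters. This is not LangChain's implementation. The helper name `split_into_chunks` is hypothetical, and the blank-line separator is an assumption.

```python
from typing import List


def split_into_chunks(text: str, chunk_size: int = 1000, separator: str = "\n\n") -> List[str]:
    """Greedily pack separator-delimited pieces into chunks of at most `chunk_size` characters.

    A single piece longer than `chunk_size` is kept whole, so a chunk can exceed the limit.
    """
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece.strip():
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # The next piece does not fit: close the current chunk and start a new one.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
chunks = split_into_chunks(sample, chunk_size=10)
print(chunks)
```

With `chunk_overlap=0`, as in the cell above, consecutive chunks share no text, which is the case this sketch covers.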

In [4]:
for d in docs:
    d.metadata = {'some': 'metadata'}
settings = ClickhouseSettings(table="clickhouse_vector_search_example")
docsearch = Clickhouse.from_documents(docs, embeddings, config=settings)

query = "What did the president say about Ketanji Brown Jackson"
docs = docsearch.similarity_search(query)

Inserting data...: 100%|██████████| 42/42 [00:00<00:00, 2801.49it/s]


In [5]:
print(docs[0].page_content)

Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections. 

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service. 

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. 

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
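
Conceptually, `similarity_search` returns the stored chunks whose embeddings lie closest to the query embedding. The following is a minimal brute-force sketch of that idea, assuming Euclidean distance (the quantity ClickHouse's `L2Distance` computes) and no index. The names `l2_distance` and `top_k` and the toy vectors are illustrative only; they are not part of LangChain or ClickHouse.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: Sequence[float], vectors: List[Sequence[float]], k: int = 4) -> List[Tuple[int, float]]:
    """Return (index, distance) pairs for the k vectors closest to `query`."""
    scored = [(i, l2_distance(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 2-dimensional "embeddings"; the table in this notebook stores 1536-dimensional ones.
toy_vectors = [[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]]
nearest = top_k([0.0, 0.0], toy_vectors, k=2)
print(nearest)
```

A full scan like this costs time proportional to the number of stored vectors, which is the cost an approximate nearest neighbor index is meant to reduce.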


## Get connection info and data schema

In [6]:
print(str(docsearch))

default.clickhouse_vector_search_example @ localhost:8123

username: None

Table Schema:
---------------------------------------------------
|id                      |Nullable(String)        |
|document                |Nullable(String)        |
|embedding               |Array(Float32)          |
|metadata                |Object('json')          |
|uuid                    |UUID                    |
---------------------------------------------------



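### ClickHouse table schema

> By default, the ClickHouse table is created automatically if it does not exist. Advanced users can pre-create the table with optimized settings. For a distributed ClickHouse cluster with sharding, the table engine should be configured as `Distributed`. The DDL used to create the table is available as `docsearch.schema`.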

In [8]:
print(f"Clickhouse Table DDL:\n\n{docsearch.schema}")

Clickhouse Table DDL:

CREATE TABLE IF NOT EXISTS default.clickhouse_vector_search_example(
    id Nullable(String),
    document Nullable(String),
    embedding Array(Float32),
    metadata JSON,
    uuid UUID DEFAULT generateUUIDv4(),
    CONSTRAINT cons_vec_len CHECK length(embedding) = 1536,
    INDEX vec_idx embedding TYPE annoy(100,'L2Distance') GRANULARITY 1000
) ENGINE = MergeTree ORDER BY uuid SETTINGS index_granularity = 8192
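
The `cons_vec_len` constraint in the DDL above rejects rows whose embedding does not have exactly 1536 elements. A client-side check is one way to catch a mismatched embedding model before an insert fails on the server. The helper `check_embedding` below is hypothetical and shown only as a sketch; it is not something LangChain provides.

```python
from typing import Sequence

EXPECTED_DIM = 1536  # taken from the CHECK constraint in the DDL above


def check_embedding(embedding: Sequence[float], expected_dim: int = EXPECTED_DIM) -> None:
    """Raise ValueError if the vector would violate `length(embedding) = expected_dim`."""
    if len(embedding) != expected_dim:
        raise ValueError(
            f"embedding has {len(embedding)} dimensions, expected {expected_dim}"
        )


check_embedding([0.0] * 1536)  # correct length: passes silently

rejected = None
try:
    check_embedding([0.0] * 3)  # wrong length: raises
except ValueError as err:
    rejected = str(err)
print(rejected)
```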


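## Filtering

You have direct access to the ClickHouse SQL `WHERE` statement, so you can write a `WHERE` clause in standard SQL.

**NOTE**: Be aware of SQL injection. This interface must not be called directly by end users.

If you customized `column_map` in your settings, you can search with a filter like this: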

In [9]:
from langchain.vectorstores import Clickhouse, ClickhouseSettings
from langchain.document_loaders import TextLoader

loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()

for i, d in enumerate(docs):
    d.metadata = {'doc_id': i}

docsearch = Clickhouse.from_documents(docs, embeddings)

Inserting data...: 100%|██████████| 42/42 [00:00<00:00, 6939.56it/s]


In [10]:
meta = docsearch.metadata_column
output = docsearch.similarity_search_with_relevance_scores('What did the president say about Ketanji Brown Jackson?', 
                                                           k=4, where_str=f"{meta}.doc_id<10")
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + '...')

0.6779101415357189 {'doc_id': 0} Madam Speaker, Madam...
0.6997970363474885 {'doc_id': 8} And so many families...
0.7044504914336727 {'doc_id': 1} Groups of citizens b...
0.7053558702165094 {'doc_id': 6} And I’m taking robus...
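
Because `where_str` is passed through as raw SQL, any user-supplied value should be validated before it is interpolated. The sketch below builds the same `doc_id` filter from checked parts only. The helper `doc_id_filter` is hypothetical, and it assumes the metadata column is a plain identifier and `doc_id` is an integer, as in the example above.

```python
import re

# Plain SQL identifier: letters, digits and underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def doc_id_filter(metadata_column: str, max_doc_id: object) -> str:
    """Build a `<metadata>.doc_id<N` clause from validated parts only."""
    if not _IDENTIFIER.fullmatch(metadata_column):
        raise ValueError(f"unsafe column name: {metadata_column!r}")
    # int() raises ValueError for anything that is not a plain integer,
    # for example "10; DROP TABLE x".
    return f"{metadata_column}.doc_id<{int(max_doc_id)}"


where_str = doc_id_filter("metadata", "10")
print(where_str)

injection_blocked = False
try:
    doc_id_filter("metadata", "10; DROP TABLE x")
except ValueError:
    injection_blocked = True
```

The resulting string can then be passed as `where_str=...` in place of the hand-written f-string. This only covers the integer comparison shown here; string-valued filters would need proper escaping or parameter binding.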


## Deleting your data

In [11]:
docsearch.drop()