# A Homemade, Enhanced Vector Engine

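Milvue still does not support setting an expiration time on an index. For me this is an important feature: memory on a local machine is limited, and nobody wants the machine to freeze partway through a session. At the same time, I envy Milvue's convenient embedding generation. So what to do? The only option is to build my own: a Redis vector indexing tool with embedding generation built in.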

In [1]:
# !pip install --upgrade sentence-transformers

In [2]:
import os
import numpy as np
import pandas as pd
from swifter import swifter

from sentence_transformers import SentenceTransformer

import util

In [3]:
MODEL_PATH = './all-MiniLM-L6-v2'
DEFAULT_MODEL = 'sentence-transformers/all-MiniLM-L6-v2'

## 1. Pretrained model `all-MiniLM-L6-v2`

Hugging Face: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)

In [4]:
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
os.environ['SENTENCE_TRANSFORMERS_HOME'] = MODEL_PATH

# Load the model (downloaded through the mirror above and cached under MODEL_PATH)
model = SentenceTransformer(DEFAULT_MODEL)

df = pd.DataFrame({'text_column': ['This is an example sentence', 'Another sentence', 'Third sentence']})
df['embedding'] = df['text_column'].swifter.apply(lambda e: model.encode(str(e)))
df


| | text_column | embedding |
|---|---|---|
| 0 | This is an example sentence | [0.067656875, 0.06349591, 0.0487131, 0.0793049...] |
| 1 | Another sentence | [0.059439808, 0.052465204, 0.015320345, 0.0983...] |
| 2 | Third sentence | [0.04853102, 0.043467816, 0.018068328, 0.07685...] |
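
The cell above encodes one string per `apply` call. `SentenceTransformer.encode` also accepts a list of sentences and returns one vector per sentence, so a whole column can be encoded in a single batched call. The sketch below shows that idea using only the standard library. `encode_unique` is a hypothetical helper, not part of this notebook or of sentence-transformers, and `encode_batch` stands for any callable that maps a list of strings to a sequence of vectors.

```python
from typing import Callable, Dict, List, Sequence


def encode_unique(
    texts: Sequence[str],
    encode_batch: Callable[[List[str]], Sequence[Sequence[float]]],
) -> List[List[float]]:
    """Encode texts with one batched call, computing each distinct text only once.

    `encode_batch` is an assumption: any callable that takes a list of strings
    and returns one vector per string, in the same order.
    """
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(texts))
    vectors = encode_batch(unique)
    lookup: Dict[str, List[float]] = {
        text: list(vector) for text, vector in zip(unique, vectors)
    }
    # Re-expand to the original order, including repeated texts.
    return [lookup[text] for text in texts]
```

With the model loaded above, a call such as `encode_unique(df['text_column'].astype(str).tolist(), model.encode)` should yield one vector per row. For three short sentences the difference is negligible; batching mainly pays off on larger columns.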


## 2. The homemade vector engine

It combines Redis vector storage with embedding generation from `all-MiniLM-L6-v2`.

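Features:

- Supports setting an expiration time on the index
- Supports retrieving the embedding for a given doc
- When the doc does not yet exist in Redis, the engine computes its embedding, stores it in Redis, and then returns it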
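
The implementation of `util.VectorEngine` is not shown in this notebook, so here is a minimal sketch of the get-or-compute pattern the features describe, under stated assumptions. `TTLStore` is a hypothetical in-memory stand-in for Redis, `CachedEmbedder` and `toy_embed` are illustrative names, and the key scheme (a SHA-256 hash of the doc) is an assumption rather than what `util` actually does.

```python
import hashlib
import json
import time
from typing import Callable, Dict, List, Optional, Tuple


class TTLStore:
    """Hypothetical in-memory stand-in for a Redis client with key expiry."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}

    def set(self, key: str, value: str, ttl: float) -> None:
        # Store the value together with its absolute expiry time.
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily on read.
            del self._data[key]
            return None
        return value


class CachedEmbedder:
    """Return a cached embedding if present; otherwise compute, store, return."""

    def __init__(self, store: TTLStore,
                 embed: Callable[[str], List[float]],
                 expire_time: float) -> None:
        self.store = store
        self.embed = embed
        self.expire_time = expire_time
        self.misses = 0  # number of times the embedding had to be computed

    def get_and_set(self, doc: str) -> List[float]:
        # Assumed key scheme: hash the doc so keys stay short and uniform.
        key = "emb:" + hashlib.sha256(doc.encode("utf-8")).hexdigest()
        cached = self.store.get(key)
        if cached is not None:
            return json.loads(cached)
        self.misses += 1
        vector = [float(x) for x in self.embed(doc)]
        self.store.set(key, json.dumps(vector), self.expire_time)
        return vector


def toy_embed(doc: str) -> List[float]:
    """Placeholder for a real model; returns a tiny fake vector."""
    return [len(doc) / 10.0, doc.count(" ") / 10.0]
```

With redis-py, the same expiry could be expressed by passing `ex=` to `set`, which tells Redis to delete the key after that many seconds. Serializing the vector as JSON would also explain why a cached embedding comes back as a plain `list` of Python floats rather than a NumPy array.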

In [5]:
# Instantiate the VectorEngine class
ve = util.VectorEngine(redis_client=util.RedisHandler(),
                       expire_time=10)

# Test connectivity to Redis
ve.ping()

True

In [6]:
# Compute the embedding for a doc
# If it already exists in Redis, fetch it directly; otherwise compute it, store it in Redis, then return it
e = ve.get_and_set(doc='This is an example sentence')
len(e), type(e)

(384, list)

In [7]:
# Compute the embedding for each row of the DataFrame's text column
df = pd.DataFrame({'text_column': ['This is an example sentence', 'Another sentence', 'Third sentence']})
df['embedding'] = df['text_column'].swifter.apply(lambda e: np.array(ve.get_and_set(str(e))))
df


| | text_column | embedding |
|---|---|---|
| 0 | This is an example sentence | [0.06765687465667725, 0.06349591165781021, 0.0...] |
| 1 | Another sentence | [0.05943980813026428, 0.052465204149484634, 0....] |
| 2 | Third sentence | [0.04853101819753647, 0.04346781596541405, 0.0...] |
