## Weaviate Query Tests
This notebook demonstrates several query methods in Weaviate, covering different retrieval tasks on the 1M dataset.


#### Connect to the local service
> Note: start the local Weaviate service before running this cell.

In [8]:
import weaviate

client = weaviate.connect_to_local()

print(client.is_ready())

True


#### Connect to the OpenVid_1M collection

In [9]:
jeopardy = client.collections.get("OpenVid_1M")

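## Test tasks
1. Recall by description and camera motion

2. Recall by description keywords related to windows and food

3. Recall strategies that use a scalar measure in vector space

4. Other recall methods


For more retrieval methods, see [search](https://weaviate.io/developers/weaviate/search).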


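### Vector recall strategy

> Recall by description and camera motion.
> 
> [Vector similarity search](https://weaviate.io/developers/weaviate/search/similarity): covers the nearXXX searches, which find the objects whose vector representations are most similar to the query.
> 
> The [nearText](https://weaviate.io/developers/weaviate/api/graphql/search-operators#neartext) operator finds data objects based on their vector similarity to a natural-language query.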
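To make "most similar vector representation" concrete, here is a minimal sketch of a nearest-neighbour lookup by cosine distance. The three-dimensional vectors and their labels are made up for illustration; a real collection stores high-dimensional embeddings produced by the configured vectorizer, and Weaviate uses an approximate index rather than this brute-force scan.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity; 0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query_vec, vectors, limit=2):
    """Return the `limit` (name, distance) pairs closest to query_vec."""
    scored = [(name, cosine_distance(query_vec, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:limit]


# Hypothetical toy embeddings, not taken from OpenVid_1M.
toy_vectors = {
    "ice cream by a window": [0.9, 0.8, 0.1],
    "football stadium": [0.1, 0.2, 0.9],
    "dessert on a table": [0.8, 0.6, 0.2],
}

top = nearest([1.0, 0.8, 0.0], toy_vectors, limit=2)
for name, distance in top:
    print(f"{distance:.4f}  {name}")
```

The `limit` argument plays the same role as `limit` in `near_text`: only the closest few objects are returned.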

In [10]:
from weaviate.classes.query import MetadataQuery


vector_names = ["Caption", "CameraMotion"]
response = jeopardy.query.near_text(
    query="A Ice cream appears in front of the window, and the view outside the window is very peaceful.",  
    target_vector=vector_names,  # Specify the target vector for named vector collections 
    limit=2,
    return_metadata=MetadataQuery(score=True, explain_score=True),
)



for o in response.objects:
    print(o.properties['caption'])
    print(o.properties['cameraMotion'])
    print(o.metadata.score, o.metadata.explain_score)

The video features a woman with blonde hair, wearing a red jacket, smiling and gesturing with her hands. She is seated in a coffee shop, surrounded by other patrons and a counter with various items. The setting is casual and inviting, with a warm and cozy atmosphere. The woman appears to be engaged in a conversation or sharing a story, as she smiles and makes expressive hand gestures. The video captures a moment of connection and enjoyment in a public space.
tilt_up
0.0 
The video features a man with a beard and dreadlocks, wearing a black leather vest and a black wristband. He is holding a microphone and appears to be speaking or singing. The man is gesturing with his right hand, pointing upwards towards the ceiling. The background is dark, with a blurred image that suggests an indoor setting, possibly a stage or a concert venue. The style of the video is a close-up shot of the man, focusing on his facial expressions and gestures, with the microphone and his hand as the main objects i

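### BM25 keyword recall for text related to windows and food

> The [bm25](https://weaviate.io/developers/weaviate/search/bm25) operator performs a keyword (sparse vector) search and scores the results with the BM25F ranking function.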
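For intuition about what a BM25 score measures, the following is a minimal sketch of plain single-field BM25 over a few toy captions. It is not Weaviate's BM25F implementation: the whitespace tokenizer, the toy documents, and the `k1`/`b` values are assumptions chosen for illustration. The term frequency and document length used here appear to be the same kind of inputs reported as `frequency` and `propLength` in `explain_score`.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with plain single-field BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized if term in other)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            freq = tf[term]
            # Length normalization: long documents need more matches.
            norm = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical captions, for illustration only.
docs = [
    "a man eats food near a window",
    "a car drives down a road",
    "food on a plate with more food",
]
scores = bm25_scores("window food", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.3f}  {doc}")
```

A document containing both query terms outranks one that only repeats a single term, and a document with neither term scores zero.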


In [11]:
from weaviate.classes.query import MetadataQuery

response = jeopardy.query.bm25(
    query="window food",  
    limit=2,
    query_properties=["Caption"],
    return_metadata=MetadataQuery(score=True, explain_score=True),
)



for o in response.objects:
    print(o.properties['caption'])
    print(o.properties['cameraMotion'])
    print(o.metadata.score, o.metadata.explain_score)

In the video, a man is seen in a kitchen, holding a large white bowl filled with a colorful assortment of food. He is wearing a white hat and a green shirt, and he is holding a spoon in his hand. The man is smiling and appears to be enjoying the food. The kitchen has white cabinets and a window in the background. The man is standing in front of the window, which lets in natural light. The food in the bowl includes various vegetables and meat, and it looks delicious. The man seems to be preparing to eat the food, as he is holding the spoon over the bowl. The overall atmosphere of the video is warm and inviting, with the man appearing to be in a good mood.
static
3.759178638458252 , BM25F_window_frequency:2, BM25F_window_propLength:61, BM25F_food_frequency:4, BM25F_food_propLength:61
In the video, a man is seen enjoying a meal at a table. He is wearing a red shirt and is seated in front of a white bowl filled with food. He is using a spoon to eat the food. The table is made of wood and i

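### Vector recall with an added reranking step

First select a range of similar items, then, within that space, sort them by a scalar score based on word meaning.

This approach suits tasks where the target is fixed, results should be idempotent, and the data is static.

> Note: the Rerank method requires the [reranker module](https://weaviate.io/developers/weaviate/concepts/reranking) to be configured on the collection beforehand. A reranker module reorders the search result set according to different criteria or a different (e.g. more expensive) algorithm.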
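The two-stage idea can be sketched in plain Python: a first stage picks the top `limit` candidates by similarity, and a second stage reorders only that shortlist. The candidate data and the word-count "reranker" below are hypothetical stand-ins for a real reranker model. The sketch assumes the reranker only sees what the first stage returns, so with `limit=1` there is nothing left to reorder.

```python
def retrieve_then_rerank(candidates, first_stage_score, rerank_score, limit):
    """Keep the top `limit` by the first-stage score, then reorder them by the reranker."""
    shortlist = sorted(candidates, key=first_stage_score, reverse=True)[:limit]
    return sorted(shortlist, key=rerank_score, reverse=True)


# Hypothetical candidates with made-up similarity values.
candidates = [
    {"caption": "aerial view of a stadium", "similarity": 0.9},
    {"caption": "food on a plate by a window", "similarity": 0.8},
    {"caption": "street food market with food stalls", "similarity": 0.7},
]


def by_similarity(candidate):
    return candidate["similarity"]


def by_food_mentions(candidate):
    # Toy stand-in for a reranker: count occurrences of the word "food".
    return candidate["caption"].split().count("food")


only_one = retrieve_then_rerank(candidates, by_similarity, by_food_mentions, limit=1)
all_three = retrieve_then_rerank(candidates, by_similarity, by_food_mentions, limit=3)
print([c["caption"] for c in only_one])
print([c["caption"] for c in all_three])
```

A larger `limit` gives the second stage more candidates to choose from, at the cost of more reranker calls.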

In [12]:
from weaviate.classes.query import Rerank, MetadataQuery

vector_names = ["Caption", "CameraMotion"]
response = jeopardy.query.near_text(
    query="A Ice cream appears in front of the window, and the view outside the window is very peaceful.",  
    target_vector=vector_names,  # Specify the target vector for named vector collections 
    limit=1,
    rerank=Rerank(
        prop="caption",
        query="food"
    ),
    return_metadata=MetadataQuery(score=True, explain_score=True, distance=True),
)



for o in response.objects:
    print(o.properties['caption'])
    print(o.properties['cameraMotion'])
    print(o.metadata)

The video is an aerial view of a large sports complex. The complex features a football field with a track around it, a large parking lot, and several buildings. The football field is green with white lines marking the field. The track is black with white lines marking the lanes. The parking lot is filled with cars and trucks. The buildings are large and appear to be made of brick. The complex is surrounded by trees and a river. The sky is cloudy and the lighting is overcast. The style of the video is realistic and it captures the details of the complex and its surroundings.
Undetermined
MetadataReturn(creation_time=None, last_update_time=None, distance=1.0403149127960205, certainty=None, score=0.0, explain_score='', is_consistent=None, rerank_score=-10.14773178100586)


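#### Finding an accurate scalar for recall

The results do carry the meaning we wanted: content that combines a room, a window, the outdoors, and food.

This recall still has flaws, though, mainly because we still cannot measure how much weight each of these elements carries in our task.

When the scalar in the space cannot be measured (because the results cluster together), this recall strategy still does not work.
Reranking has a further problem: when there are few data samples and they must be ordered by a scalar that marks spatial distance, an accurate scalar space should lie within a single cluster as far as possible.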

### Approach 1

Restrict the search scope: filter by keywords to narrow the query results.


In [13]:
from weaviate.classes.query import MetadataQuery
from weaviate.classes.query import Filter

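# Find captions that describe a close-up of food.
# Note: this response is overwritten by the next query before anything is printed.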
response = jeopardy.query.bm25(
    query="Food",  
    limit=1,
    filters=Filter.by_property("caption").equal("a close-up view"),
    query_properties=["Caption"],
    return_metadata=MetadataQuery(score=True, explain_score=True),
)


# Find captions that describe food with avocado slices.
response = jeopardy.query.bm25(
    query="Food",  
    limit=1,
    filters=Filter.by_property("caption").equal("avocado slices"),
    query_properties=["Caption"],
    return_metadata=MetadataQuery(score=True, explain_score=True),
)


for o in response.objects:
    print(o.uuid)
    print(o.properties['caption'])
    print(o.properties['cameraMotion'])
    print(o.metadata)

a5528dee-c3c1-452a-a554-737e52f05ff6
The video shows a close-up of a plate of food, which includes a serving of rice, beans, and avocado slices. The food is presented on a green plate with a striped pattern. The style of the video is simple and straightforward, focusing on the food without any additional context or background. The camera angle is slightly elevated, providing a clear view of the food on the plate. The lighting is bright, highlighting the colors and textures of the food. The video does not contain any text or additional elements, and the focus is solely on the plate of food.
Undetermined
MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=2.2844784259796143, explain_score=', BM25F_food_frequency:6, BM25F_food_propLength:53', is_consistent=None, rerank_score=None)


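> Keyword recall does narrow the content considerably, but if the indexed text does not contain the relevant keywords, or a keyword has several interpretations, this method cannot complete the task.
> 
> The `BM25F_food_frequency` and `BM25F_food_propLength` values come from the BM25 scoring algorithm. This text contains many descriptive terms about "window" that mean different things in different domains; increasing BM25's `b` parameter can adjust for this. See https://www.cnblogs.com/novwind/p/15177871.html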
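To see what the `b` parameter does, the sketch below isolates the term-frequency part of the standard BM25 formula and evaluates it for a short and a long caption with the same term frequency. The lengths, frequency, and `k1` value are made-up illustration values, and this is the textbook formula rather than Weaviate's exact BM25F code.

```python
def tf_component(freq, doc_len, avg_len, k1=1.2, b=0.75):
    """Term-frequency part of BM25 for one term in one document."""
    return freq * (k1 + 1) / (freq + k1 * (1 - b + b * doc_len / avg_len))


# Same term frequency, but one caption is half and one is three times the average length.
for b in (0.0, 0.75, 1.0):
    short_doc = tf_component(freq=2, doc_len=50, avg_len=100, b=b)
    long_doc = tf_component(freq=2, doc_len=300, avg_len=100, b=b)
    print(f"b={b}: short={short_doc:.3f} long={long_doc:.3f}")
```

With `b=0` document length is ignored, so both captions score the same; as `b` grows toward 1, the long caption is penalized more heavily.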

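### Approach 2

Using [semantic analysis](https://weaviate.io/developers/weaviate/api/graphql/search-operators#example-ii) can make retrieval more precise; semantic path retrieval is demonstrated in a separate chapter.

Here we offset the target vector, changing the coordinates of the search term in vector space. The controllable operations include changing a term's distance in that space, moving it away from or closer to other concepts.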
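The idea of shifting a query vector toward or away from a concept can be illustrated with simple linear interpolation. This is only a geometric sketch under assumed two-dimensional concept vectors; it is not the formula Weaviate applies for `moveTo` and `moveAwayFrom`, and the function names are hypothetical.

```python
import math


def move_toward(query_vec, target_vec, force):
    """Shift query_vec toward target_vec; force=0 leaves it unchanged, force=1 lands on the target."""
    return [q + force * (t - q) for q, t in zip(query_vec, target_vec)]


def move_away(query_vec, target_vec, force):
    """Shift query_vec in the direction opposite to target_vec."""
    return [q - force * (t - q) for q, t in zip(query_vec, target_vec)]


# Hypothetical 2-D concept vectors, for illustration only.
food = [1.0, 0.0]
apples = [0.8, 0.6]
finance = [0.0, 1.0]

closer = move_toward(food, apples, force=0.85)
shifted = move_away(closer, finance, force=0.45)

print("distance to apples:", math.dist(food, apples), "->", math.dist(closer, apples))
print("distance to finance:", math.dist(closer, finance), "->", math.dist(shifted, finance))
```

The `force` values mirror the ones used in the GraphQL query in this notebook, but their effect here is only qualitative.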



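#### Semantic Path
Only available with the text2vec-contextionary module.

Note: a semantic path can only be built when the nearText: {} operator is set as the explore term, because the explore term represents the start of the path and each search result represents the end of the path. Since nearText: {} queries are currently only possible in GraphQL, semanticPath is not available in the REST API.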

In [14]:
# Semantic path is not yet supported by the V4 client. Please use a raw GraphQL query instead.
response = client.graphql_raw_query(
  """
  {
    Get {
      OpenVidContext(
        nearText:{
          concepts: ["food"], 
          distance: 0.23, 
          moveAwayFrom: {
            concepts: ["finance"],
            force: 0.45
          },
          moveTo: {
            concepts: ["apples", "food"],
            force: 0.85
          }
        }
      ) {
        caption
        _additional {
          semanticPath {
            path {
              concept
              distanceToNext
              distanceToPrevious
              distanceToQuery
              distanceToResult
            }
          }
        }
      }
    }
  }
  """
) 

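### Using hybrid search

Hybrid search with a measurable weighting

Hybrid search results are weighted between the keyword component and the vector component. Changing the relative weight through the `alpha` value mixes the two in different proportions.
 
- The closer `alpha` is to 1, the more the result leans on vector search; `alpha = 1` is a pure vector search.
- The closer `alpha` is to 0, the more the result leans on keyword search; `alpha = 0` is a pure keyword search.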
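As a rough illustration of how an `alpha` weight can blend the two result sets, the sketch below scales each score list to [0, 1] and takes a weighted sum. It is a toy under stated assumptions (the function names, document IDs, and scores are made up), not Weaviate's internal fusion code.

```python
def min_max(scores):
    """Scale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_fuse(keyword_scores, vector_scores, alpha):
    """Weighted sum: alpha weights the vector side, (1 - alpha) the keyword side."""
    kw, vec = min_max(keyword_scores), min_max(vector_scores)
    docs = set(kw) | set(vec)
    fused = {d: alpha * vec.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical scores from a keyword search and a vector search.
keyword_scores = {"ice_cream": 14.2, "kitchen": 3.1}
vector_scores = {"stadium": 0.9, "ice_cream": 0.5, "kitchen": 0.1}

keyword_heavy = hybrid_fuse(keyword_scores, vector_scores, alpha=0.1)
vector_only = hybrid_fuse(keyword_scores, vector_scores, alpha=1.0)
print(keyword_heavy[0])
print(vector_only[0])
```

With `alpha=0.1` the best keyword hit dominates the ranking, while `alpha=1.0` ranks purely by the vector scores.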


In [15]:
from weaviate.classes.query import MetadataQuery

vector_names = ["Caption"]
response = jeopardy.query.hybrid(
    query="A Ice cream appears in front of the window, and the view outside the window is very peaceful.",  
    target_vector=vector_names,  # Specify the target vector for named vector collections
    limit=1,
    alpha=0.1,
    query_properties=["caption"],  
    return_metadata=MetadataQuery(score=True, explain_score=True, distance=True),
)



for o in response.objects:
    print(o.properties['caption'])
    print(o.properties['cameraMotion'])
    print(o.metadata)

The video captures a delightful scene of two scoops of vanilla ice cream on a white plate, placed on a counter. The ice cream is generously drizzled with a rich, golden caramel sauce, adding a sweet and sticky element to the dish. The plate is positioned in front of a window, which offers a glimpse of a lively park scene outside. People can be seen walking and enjoying the day, adding a sense of life and movement to the otherwise still image. The overall style of the video is simple yet appealing, focusing on the ice cream and the window view, while the background activity provides a sense of depth and context to the scene.
static
MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=0.8999999761581421, explain_score='\nHybrid (Result Set keyword,bm25) Document 12be133a-4a53-4cb6-bb4b-413980b5813b: original score 14.165811, normalized score: 0.9', is_consistent=None, rerank_score=None)


### Connect to the OpenVidLateContext_jina_v2_zh collection

In [21]:

import transformers
from transformers import AutoModel
from transformers import AutoTokenizer
late_collection = client.collections.get("OpenVidLateContext_jina_v2_zh")

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('/mnt/ceph/develop/jiawei/model_checkpoint/jina-embeddings-v2-base-zh', trust_remote_code=True)
model = AutoModel.from_pretrained('/mnt/ceph/develop/jiawei/model_checkpoint/jina-embeddings-v2-base-zh', trust_remote_code=True).to(device='cuda:0') 

### Example Query
Test the recall results for a specific food.

In [55]:
 
# Query text: "Is there any strawberry-flavored ice cream?"
query_embedding = model.encode("有没有有草莓味的冰激凌")

results = late_collection.query.near_vector(
    near_vector=query_embedding.tolist(),
    limit=25,
)

videos = []

uuids = []
for o in results.objects:
    videos.append(o.properties['video'])
    uuids.append(o.uuid)
    print(o.properties['caption']) 
    
    print(o.metadata)

The video captures a delightful dessert scene. A crepe, dusted with powdered sugar, is the main focus. It's generously drizzled with chocolate sauce, adding a rich, sweet touch to the dish. A scoop of vanilla ice cream is placed on top of the crepe, its creamy texture contrasting with the crispy crepe. The ice cream is also drizzled with chocolate sauce, creating a visually appealing pattern. The entire dessert is presented on a plate, which is placed on a table. The background is blurred, drawing attention to the dessert in the foreground. The overall style of the video is simple yet elegant, focusing on the dessert without any distractions.
MetadataReturn(creation_time=None, last_update_time=None, distance=None, certainty=None, score=None, explain_score=None, is_consistent=None, rerank_score=None)
The video shows a person holding a double-flavored ice cream cone. The ice cream is a mix of yellow and pink, suggesting two different flavors. The cone is a classic waffle cone, which is a

In [56]:
 
set_1 = set(videos)

In [57]:

videos2 = ['6sqG1ObKfP4_29_0to221.mp4',
 'BZhj-4k2hHI_28_0to238.mp4',
 '6sqG1ObKfP4_29_0to221.mp4',
 '3njnhFP-ul4_99_0to123.mp4',
 'HneZ2wDm9rQ_16_51to201.mp4',
 'AUZHZ3PH6vM_1_0to102.mp4',
 '6JTizSIQSxg_1_162to294.mp4',
 '0J_bFot0a0c_15_0to535.mp4',
 'BZhj-4k2hHI_28_0to238.mp4',
 '3njnhFP-ul4_99_0to123.mp4',
 'H6MW6SGP5tg_27_0to109.mp4',
 '3Znh1WmcgzQ_0_0to125.mp4',
 '0J_bFot0a0c_15_0to535.mp4',
 'FkZd5XPDI2w_10_546to817.mp4',
 'Dcdh6ThJ4hA_0_0to136.mp4',
 'FkZd5XPDI2w_10_546to817.mp4',
 'GuOEbUnGFAo_10_0to847.mp4',
 'BZhj-4k2hHI_28_0to238.mp4',
 'Dcdh6ThJ4hA_0_0to136.mp4',
 '3njnhFP-ul4_99_0to123.mp4',
 'H6MW6SGP5tg_27_0to109.mp4',
 'FkZd5XPDI2w_10_546to817.mp4',
 '3Znh1WmcgzQ_0_0to125.mp4',
 'AUZHZ3PH6vM_1_0to102.mp4',
 '4tVq3Qvtiz4_55_163to283.mp4',
 'GuOEbUnGFAo_10_0to847.mp4',
 'GFqdkkxSQ0Q_1_0to105.mp4',
 '4ou92zGkVyQ_19_405to520.mp4',
 'Hh3F-GCH7ac_3_18to218.mp4',
 '8XhDnG694q4_14_0to208.mp4',
 '8WwRkvET5jQ_46_0to114.mp4',
 'J2Qra6J6t_c_7_0to126.mp4',
 'Hh3F-GCH7ac_3_18to218.mp4',
 'H6MW6SGP5tg_27_0to109.mp4',
 'AUZHZ3PH6vM_1_0to102.mp4',
 'Hh3F-GCH7ac_3_18to218.mp4',
 'D7UR9Kbj3Z0_0_0to163.mp4',
 '72B-21bz1TQ_0_0to461.mp4',
 'BZhj-4k2hHI_28_0to238.mp4',
 'Dcdh6ThJ4hA_0_0to136.mp4',
 '3njnhFP-ul4_99_0to123.mp4',
 'H6MW6SGP5tg_27_0to109.mp4',
 'FkZd5XPDI2w_10_546to817.mp4',
 '3Znh1WmcgzQ_0_0to125.mp4',
 'AUZHZ3PH6vM_1_0to102.mp4',
 '4tVq3Qvtiz4_55_163to283.mp4',
 'GuOEbUnGFAo_10_0to847.mp4',
 'GFqdkkxSQ0Q_1_0to105.mp4',
 '4ou92zGkVyQ_19_405to520.mp4',
 'Hh3F-GCH7ac_3_18to218.mp4',
 '8XhDnG694q4_14_0to208.mp4',
 '8WwRkvET5jQ_46_0to114.mp4']


set_2 = set(videos2)

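### Observation: the data is not well distributed
According to the statistics, when concepts with very high out-degree and in-degree exist, embeddings created from the same data do not fit the task well. There are a few possible reasons:

1. The two models, jina-embeddings-v2-base-zh and sentence-transformers-multi-qa-MiniLM-L6-cos-v1, target different tasks.

2. The dataset is not large enough.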


In [63]:

# Compute the union and the intersection
union_set = set_1.union(set_2)

intersection_set = set_1.intersection(set_2)
# Print the results
print("Set A:", set_1)
print("Set B:", set_2)
print("Union of A and B:", union_set)
print("Intersection of A and B:", intersection_set)

Set A: {'1RWayBcihHo_41_0to131.mp4', '7fGZ8KT6HDc_30_0to530.mp4', '3njnhFP-ul4_102_0to183.mp4', '0y6s0Po7tIs_0_46to155.mp4', '0rFgYEGkiqw_24_0to134.mp4', '827Vm-y6SdY_22_0to241.mp4', '2NG51dxVSJU_20_26to162.mp4', '4xcahdAZZqc_3_101to204.mp4', '-rnQ40pX-zM_49_25to142.mp4', '2hpsJDjoses_2_0to105.mp4', '1iKK2t-HlS8_44_0to141.mp4', '55q-hYYMxEo_47_0to283.mp4', '-t3UyHPaG3k_2_0to101.mp4'}
Set B: {'4ou92zGkVyQ_19_405to520.mp4', 'D7UR9Kbj3Z0_0_0to163.mp4', '4tVq3Qvtiz4_55_163to283.mp4', '8XhDnG694q4_14_0to208.mp4', '6JTizSIQSxg_1_162to294.mp4', 'J2Qra6J6t_c_7_0to126.mp4', '72B-21bz1TQ_0_0to461.mp4', 'AUZHZ3PH6vM_1_0to102.mp4', 'H6MW6SGP5tg_27_0to109.mp4', 'Hh3F-GCH7ac_3_18to218.mp4', 'Dcdh6ThJ4hA_0_0to136.mp4', '0J_bFot0a0c_15_0to535.mp4', 'HneZ2wDm9rQ_16_51to201.mp4', 'FkZd5XPDI2w_10_546to817.mp4', 'GFqdkkxSQ0Q_1_0to105.mp4', 'GuOEbUnGFAo_10_0to847.mp4', '8WwRkvET5jQ_46_0to114.mp4', 'BZhj-4k2hHI_28_0to238.mp4', '3njnhFP-ul4_99_0to123.mp4', '6sqG1ObKfP4_29_0to221.mp4', '3Znh1WmcgzQ_0_0to125.mp4
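One way to put a number on how much two result lists agree is the Jaccard index, the size of the intersection divided by the size of the union. The helper below is a small sketch with a hypothetical name and made-up file names; it is not part of the original notebook.

```python
def overlap_report(results_a, results_b):
    """Compare two result lists as sets and report their overlap."""
    set_a, set_b = set(results_a), set(results_b)
    union = set_a | set_b
    shared = set_a & set_b
    jaccard = len(shared) / len(union) if union else 0.0
    return {"shared": shared, "union_size": len(union), "jaccard": jaccard}


# Hypothetical file names; duplicates collapse when converted to sets.
report = overlap_report(
    ["v1.mp4", "v2.mp4", "v2.mp4"],
    ["v2.mp4", "v3.mp4"],
)
print(report)
```

A Jaccard value of 0 means the two models returned completely different videos, and 1 means they returned the same set.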

### Check the intersection data

In [82]:
jeopardy = client.collections.get("OpenVid_1M")
response = jeopardy.query.fetch_objects(
    filters=Filter.by_property("video").contains_any(list(union_set)), 
) 

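# Collect the unique captions.
# Note: as recorded, this loop iterates over `results` (the near_vector hits from
# the earlier cell), not over `response` from the fetch above. Use
# `response.objects` instead to work with the fetched objects.
documents = []
uuids = []
for o in results.objects:
    if o.properties['caption'] not in documents:
        documents.append(o.properties['caption'])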

In [75]:
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "/mnt/ceph/develop/jiawei/model_checkpoint/jina-reranker-v2-base-multilingual",
    automodel_args={"torch_dtype": "auto"},
    trust_remote_code=True,
)


In [83]:


# Example query and documents
query = "有没有有草莓味的冰激凌"  # "Is there any strawberry-flavored ice cream?"
 
# construct sentence pairs
sentence_pairs = [[query, doc] for doc in documents]

scores = model.predict(sentence_pairs, convert_to_tensor=True).tolist()
 
scores

[0.10498046875,
 0.083984375,
 0.0673828125,
 0.23828125,
 0.040771484375,
 0.103515625,
 0.061767578125,
 0.046142578125,
 0.054931640625,
 0.040771484375,
 0.052734375,
 0.040771484375,
 0.0888671875]
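The raw scores above are listed in document order. A small sketch of turning them into a ranking by sorting document indices on their score is shown below; the score values are copied from the output above, and the variable names are illustrative.

```python
# Cross-encoder scores copied from the output above, in document order.
scores = [
    0.10498046875, 0.083984375, 0.0673828125, 0.23828125, 0.040771484375,
    0.103515625, 0.061767578125, 0.046142578125, 0.054931640625,
    0.040771484375, 0.052734375, 0.040771484375, 0.0888671875,
]

# Pair each score with its document index and sort from highest to lowest.
ranking = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

for doc_index, score in ranking[:3]:
    print(f"ID: {doc_index}, Score: {score:.4f}")
```

The index of the best-scoring document can then be used to look up its caption in `documents`.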

In [84]:

rankings = model.rank(query, documents, return_documents=True, convert_to_tensor=True)
print(f"Query: {query}")
for ranking in rankings:
    print(f"ID: {ranking['corpus_id']}, Score: {ranking['score']:.4f}, Text: {ranking['text']}")


Query: 有没有有草莓味的冰激凌
ID: 3, Score: 0.2383, Text: The video is a delightful culinary journey featuring a jar of strawberry ice cream. The first frame shows the jar, filled with creamy white ice cream speckled with red strawberry pieces, placed on a wooden table. The second frame reveals the jar being opened, with the lid removed, revealing the enticing ice cream inside. The third frame captures the moment when a spoon is inserted into the jar, scooping out a generous serving of the delicious ice cream. The entire scene is set against a backdrop of a cozy kitchen, with a potted plant adding a touch of greenery to the scene. The video is a visual treat, capturing the simple joy of indulging in a sweet treat.
ID: 0, Score: 0.1050, Text: The video captures a delightful dessert scene. A crepe, dusted with powdered sugar, is the main focus. It's generously drizzled with chocolate sauce, adding a rich, sweet touch to the dish. A scoop of vanilla ice cream is placed on top of the crepe, its cream