## **About AOAI Embeddings**

### **What is a vector database**

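A vector database is a database built to store, manage, and search embedding vectors. In recent years, as AI has become increasingly effective at natural language processing, image recognition, and other use cases, the use of embeddings to encode unstructured data (text, audio, video, and so on) as vectors that machine learning models can consume has grown explosively. Vector databases have emerged as an effective way for enterprises to deliver and scale these use cases.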

### **Why use a vector database**

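Vector databases let enterprises take many of the embedding use cases we share in this repository (for example question answering, chatbots, and recommendation services) and run them in a secure, scalable environment. Many of our customers have solved small-scale problems with embeddings, but performance and security concerns keep them from going to production. We see vector databases as a key component in solving that problem. In this guide we walk through the basics of embedding text data, storing it in a vector database, and using it for semantic search.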
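To make "semantic search" concrete before bringing in a real database, here is a minimal sketch of what a vector search does conceptually: score every stored vector against the query with cosine similarity and keep the best matches. The three-dimensional `toy_vectors` and the helper names are invented for illustration; real embeddings have far more dimensions, and a vector database uses an index rather than this brute-force scan.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_search(
    query: List[float], vectors: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the top_k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical toy "embeddings", not taken from the dataset used below
toy_vectors = {
    "Art": [0.9, 0.1, 0.0],
    "Air": [0.0, 0.2, 0.9],
    "Architecture": [0.7, 0.6, 0.1],
}

results = brute_force_search([1.0, 0.2, 0.0], toy_vectors)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

The scan is linear in the number of stored vectors, which is one reason a dedicated vector database becomes attractive as the collection grows.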

In [1]:
!pip install qdrant-client

!pip install wget



In [2]:
from openai import AzureOpenAI

from typing import List, Iterator
import pandas as pd
import numpy as np
import os
import wget
from ast import literal_eval

In [3]:
article_df = pd.read_csv('./data/vector_database_wikipedia_articles_embedded.csv')

In [4]:
article_df.head()

| | id | url | title | text | title_vector | content_vector | vector_id |
|---|---|---|---|---|---|---|---|
| 0 | 1 | https://simple.wikipedia.org/wiki/April | April | April is the fourth month of the year in the J... | [0.001009464613161981, -0.020700545981526375, ... | [-0.011253940872848034, -0.013491976074874401,... | 0 |
| 1 | 2 | https://simple.wikipedia.org/wiki/August | August | August (Aug.) is the eighth month of the year ... | [0.0009286514250561595, 0.000820168002974242, ... | [0.0003609954728744924, 0.007262262050062418, ... | 1 |
| 2 | 6 | https://simple.wikipedia.org/wiki/Art | Art | Art is a creative activity that expresses imag... | [0.003393713850528002, 0.0061537534929811954, ... | [-0.004959689453244209, 0.015772193670272827, ... | 2 |
| 3 | 8 | https://simple.wikipedia.org/wiki/A | A | A or a is the first letter of the English alph... | [0.0153952119871974, -0.013759135268628597, 0.... | [0.024894846603274345, -0.022186409682035446, ... | 3 |
| 4 | 9 | https://simple.wikipedia.org/wiki/Air | Air | Air refers to the Earth's atmosphere. Air is a... | [0.02224554680287838, -0.02044147066771984, -0... | [0.021524671465158463, 0.018522677943110466, -... | 4 |


In [5]:
# Read vectors from strings back into a list
article_df['title_vector'] = article_df.title_vector.apply(literal_eval)
article_df['content_vector'] = article_df.content_vector.apply(literal_eval)

# Set vector_id to be a string
article_df['vector_id'] = article_df['vector_id'].apply(str)
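The cell above relies on `literal_eval` because the CSV stores each vector as a string such as `"[0.001, -0.02, ...]"`. The sketch below shows that conversion on a single made-up value, and why `literal_eval` is a safer choice than `eval` for this job. The `parse_vector` helper is a hypothetical name introduced here, not part of the notebook.

```python
from ast import literal_eval
from typing import List


def parse_vector(cell: str) -> List[float]:
    """Turn a stringified list such as "[0.1, -0.2]" back into a list of floats."""
    value = literal_eval(cell)
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return [float(x) for x in value]


# A made-up cell value in the same shape as the CSV columns
vector = parse_vector("[0.001, -0.02, 0.5]")
print(vector, type(vector).__name__)

# literal_eval only accepts Python literals, so arbitrary expressions are rejected
try:
    parse_vector("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print("expression rejected:", rejected)
```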

In [6]:
article_df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              25000 non-null  int64 
 1   url             25000 non-null  object
 2   title           25000 non-null  object
 3   text            25000 non-null  object
 4   title_vector    25000 non-null  object
 5   content_vector  25000 non-null  object
 6   vector_id       25000 non-null  object
dtypes: int64(1), object(6)
memory usage: 1.3+ MB


In [7]:
import qdrant_client

In [8]:
qdrant = qdrant_client.QdrantClient(host='localhost',port=6333)

In [9]:
qdrant.get_collections()

CollectionsResponse(collections=[CollectionDescription(name='Articles')])

In [10]:
from qdrant_client.http import models as rest

In [11]:
vector_size = len(article_df['content_vector'][0])
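The cell above takes the dimension from the first row only. Since the collection is configured with a single fixed size per named vector, it can be worth confirming that every vector really has that length before uploading. A minimal sketch, assuming the vectors are plain Python lists; `common_dimension` is a hypothetical helper, not a pandas or Qdrant function.

```python
from typing import Iterable, List


def common_dimension(vectors: Iterable[List[float]]) -> int:
    """Return the shared length of all vectors, or raise if they differ."""
    sizes = {len(v) for v in vectors}
    if len(sizes) != 1:
        raise ValueError(f"expected exactly one vector size, found {sorted(sizes)}")
    return sizes.pop()


# Made-up vectors for illustration
print(common_dimension([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]))

try:
    common_dimension([[0.1, 0.2], [0.3]])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print("mismatch detected:", mismatch_detected)
```

In the notebook this could be applied as `common_dimension(article_df['content_vector'])` and likewise for `title_vector`, since iterating a pandas Series yields its values.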

In [12]:
qdrant.recreate_collection(
    collection_name='Articles',
    vectors_config={
        'title': rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
        'content': rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    }
)

True

In [18]:
qdrant.upsert(
    collection_name='Articles',
    points=[
        rest.PointStruct(
            id=k,
            vector={
                'title': v['title_vector'],
                'content': v['content_vector'],
            },
            payload=v.to_dict(),
        )
        for k, v in article_df.iloc[:100].iterrows()
    ],
)

UpdateResult(operation_id=1, status=<UpdateStatus.COMPLETED: 'completed'>)

In [19]:
qdrant.count(collection_name='Articles')

CountResult(count=100)
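The upsert above sends only the first 100 rows in a single request. To load more of the 25,000 rows, one common approach is to split the points into fixed-size batches and upsert each batch in turn. The sketch below shows only the batching logic on plain Python data; the `batched` helper and the batch size of 100 are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Stand-in for a list of points; here just integers
batch_sizes = [len(batch) for batch in batched(list(range(250)), 100)]
print(batch_sizes)
```

Each yielded batch could then be passed as the `points` argument of `qdrant.upsert`, in the same form as the cell above.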

In [22]:
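def query_qdrant(query, collection_name, vector_name='title', top_k=5):
    """Embed `query` with Azure OpenAI and return the top_k nearest points from Qdrant."""

    # Read the key from the environment instead of hard-coding a secret in the notebook.
    # AZURE_OPENAI_API_KEY is the variable name the openai SDK itself looks for.
    client = AzureOpenAI(
        api_key=os.environ['AZURE_OPENAI_API_KEY'],
        api_version='2023-12-01-preview',
        azure_endpoint='https://kinfey-aoai.openai.azure.com/',
    )

    # On Azure, `model` is the name of your embedding deployment
    embedded_query = client.embeddings.create(
        input=query,
        model='EmbeddingModel',
    )

    # Search against the named vector ('title' or 'content') chosen by the caller
    query_results = qdrant.search(
        collection_name=collection_name,
        query_vector=(vector_name, embedded_query.data[0].embedding),
        limit=top_k,
    )

    return query_results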

In [23]:
query_results = query_qdrant('modern art in Europe', 'Articles')
for i, article in enumerate(query_results):
    print(f'{i + 1}. {article.payload["title"]} (Score: {round(article.score, 3)})')

1. Art (Score: 0.842)
2. Architecture (Score: 0.815)
3. Belgium (Score: 0.808)
4. Austria (Score: 0.802)
5. Archaeology (Score: 0.796)


In [24]:
query_results = query_qdrant('Famous battles in Scottish history', 'Articles', 'content')
for i, article in enumerate(query_results):
    print(f'{i + 1}. {article.payload["title"]} (Score: {round(article.score, 3)})')

1. Belgium (Score: 0.739)
2. Alan Turing (Score: 0.733)
3. China (Score: 0.722)
4. Australia (Score: 0.722)
5. Colchester (Score: 0.72)
