# Azure Cognitive Search as a vector database for OpenAI embeddings

This notebook provides step-by-step instructions for using Azure Cognitive Search as a vector database with OpenAI embeddings. Azure Cognitive Search (formerly known as "Azure Search") is a cloud search service that gives developers infrastructure, APIs, and tools for building a rich search experience over private, heterogeneous content in web, mobile, and enterprise applications.

## Prerequisites
To follow this exercise, you need the following:
- [Azure Cognitive Search Service](https://learn.microsoft.com/azure/search/)
- [OpenAI Key](https://platform.openai.com/account/api-keys) or [Azure OpenAI credentials](https://learn.microsoft.com/azure/cognitive-services/openai/)

In [None]:
! pip install wget
! pip install --index-url=https://pkgs.dev.azure.com/azure-sdk/public/_packaging/azure-sdk-for-python/pypi/simple/ azure-search-documents==11.4.0a20230509004
! pip install azure-identity


## Import required libraries

In [2]:
import openai
import os
import json
import wget
import pandas as pd
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient  
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.models import Vector  
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    VectorSearch,
    VectorSearchAlgorithmConfiguration,
)


## Configure OpenAI settings

In [4]:
openai.api_type = "azure"
openai.api_base = "YOUR-AZURE-OPENAI-ENDPOINT"
openai.api_version = "2023-05-15"
openai.api_key = "YOUR-AZURE-OPENAI-KEY"
model: str = "text-embedding-ada-002"
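
Hardcoding keys in a notebook makes them easy to leak. The following is a minimal sketch of reading the same settings from environment variables instead. The helper `load_openai_settings` and the variable names `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_VERSION`, and `AZURE_OPENAI_API_KEY` are assumptions chosen for illustration, not names required by the `openai` package.

```python
import os


def load_openai_settings(env=os.environ):
    """Collect Azure OpenAI settings from environment variables.

    The variable names are hypothetical; the fallbacks are the
    placeholder values used in the cell above.
    """
    return {
        "api_type": "azure",
        "api_base": env.get("AZURE_OPENAI_ENDPOINT", "YOUR-AZURE-OPENAI-ENDPOINT"),
        "api_version": env.get("AZURE_OPENAI_API_VERSION", "2023-05-15"),
        "api_key": env.get("AZURE_OPENAI_API_KEY", "YOUR-AZURE-OPENAI-KEY"),
    }


settings = load_openai_settings()
```

Each value in `settings` could then be assigned to the matching `openai.api_*` attribute in place of the literal strings.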

## Configure Azure Cognitive Search Vector Store settings

In [5]:
search_service_endpoint: str = "YOUR_AZURE_SEARCH_ENDPOINT"
search_service_api_key: str = "YOUR_AZURE_SEARCH_ADMIN_KEY"
index_name: str = "azure-cognitive-search-vector-demo"
credential = AzureKeyCredential(search_service_api_key)


## Load data

In [6]:
embeddings_url = "https://cdn.openai.com/API/examples/data/vector_database_wikipedia_articles_embedded.zip"

# The file is ~700 MB so this will take some time
wget.download(embeddings_url)

'vector_database_wikipedia_articles_embedded.zip'

In [7]:
import zipfile

with zipfile.ZipFile("vector_database_wikipedia_articles_embedded.zip","r") as zip_ref:
    zip_ref.extractall("../../data")

In [8]:
from ast import literal_eval

article_df = pd.read_csv('../../data/vector_database_wikipedia_articles_embedded.csv')
# Read vectors from strings back into a list
article_df["title_vector"] = article_df.title_vector.apply(literal_eval)
article_df["content_vector"] = article_df.content_vector.apply(literal_eval)
article_df['vector_id'] = article_df['vector_id'].apply(str)
article_df.head()

Unnamed: 0,id,url,title,text,title_vector,content_vector,vector_id
0,1,https://simple.wikipedia.org/wiki/April,April,April is the fourth month of the year in the J...,"[0.001009464613161981, -0.020700545981526375, ...","[-0.011253940872848034, -0.013491976074874401,...",0
1,2,https://simple.wikipedia.org/wiki/August,August,August (Aug.) is the eighth month of the year ...,"[0.0009286514250561595, 0.000820168002974242, ...","[0.0003609954728744924, 0.007262262050062418, ...",1
2,6,https://simple.wikipedia.org/wiki/Art,Art,Art is a creative activity that expresses imag...,"[0.003393713850528002, 0.0061537534929811954, ...","[-0.004959689453244209, 0.015772193670272827, ...",2
3,8,https://simple.wikipedia.org/wiki/A,A,A or a is the first letter of the English alph...,"[0.0153952119871974, -0.013759135268628597, 0....","[0.024894846603274345, -0.022186409682035446, ...",3
4,9,https://simple.wikipedia.org/wiki/Air,Air,Air refers to the Earth's atmosphere. Air is a...,"[0.02224554680287838, -0.02044147066771984, -0...","[0.021524671465158463, 0.018522677943110466, -...",4
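
The CSV stores each embedding as a string, which is why the vector columns are parsed before use. As a small illustration, the sketch below parses one such string with both `ast.literal_eval` and `json.loads`. The sample values are made up and much shorter than a real 1536-dimensional embedding.

```python
import json
from ast import literal_eval

# Hypothetical, shortened stand-in for one cell of the title_vector column
raw_vector = "[0.0010, -0.0207, 0.0153]"

# Both parsers turn the string into a Python list of floats
parsed_with_ast = literal_eval(raw_vector)
parsed_with_json = json.loads(raw_vector)

print(parsed_with_ast == parsed_with_json)
print(type(parsed_with_ast[0]).__name__)
```

For plain lists of numbers like this one, the two parsers give the same result, so either can be used to restore the vectors.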


In [40]:
import pandas as pd
import json

# Read the CSV file into a DataFrame
article_df = pd.read_csv('../../data/vector_database_wikipedia_articles_embedded.csv')

# Convert the 'id' and 'vector_id' columns to string
article_df['id'] = article_df['id'].astype(str)
article_df['vector_id'] = article_df['vector_id'].astype(str)

# Convert the 'title_vector' and 'content_vector' columns to arrays of floats
article_df['title_vector'] = article_df['title_vector'].apply(json.loads)
article_df['content_vector'] = article_df['content_vector'].apply(json.loads)

# Convert the DataFrame to the desired JSON format
output = []
for _, row in article_df.iterrows():
    item = {
        'id': row['id'],
        'vector_id': row['vector_id'],
        'url': row['url'],
        'title': row['title'],
        'text': row['text'],
        'content_vector': row['content_vector'],
        'title_vector': row['title_vector']
    }
    output.append(item)

# Save the JSON to a file
with open('output.json', 'w') as file:
    json.dump(output, file)


## Create an index

In [14]:
# Create a search index
index_client = SearchIndexClient(
    endpoint=search_service_endpoint, credential=credential)
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String),
    SimpleField(name="vector_id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="url", type=SearchFieldDataType.String, searchable=True, retrievable=True),
    SearchableField(name="title", type=SearchFieldDataType.String, searchable=True, retrievable=True),
    SearchableField(name="text", type=SearchFieldDataType.String, searchable=True, retrievable=True),
    SearchField(name="title_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, dimensions=1536, vector_search_configuration="my-vector-config"),
    SearchField(name="content_vector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, dimensions=1536, vector_search_configuration="my-vector-config"),
]

vector_search = VectorSearch(
    algorithm_configurations=[
        VectorSearchAlgorithmConfiguration(
            name="my-vector-config",
            kind="hnsw",
            hnsw_parameters={
                "m": 4,
                "efConstruction": 400,
                "efSearch": 500,
                "metric": "cosine"
            }
        )
    ]
)

# Create the search index with the vector search configuration
index = SearchIndex(name=index_name, fields=fields,
                    vector_search=vector_search)
result = index_client.create_or_update_index(index)
print(f'{result.name} created')


azure-cognitive-search-vector-demo created
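
The index configuration above sets `"metric": "cosine"`. To show what that metric measures, here is a minimal pure-Python sketch of cosine similarity between two vectors. The function `cosine_similarity` is written for this illustration only and is not part of the Azure SDK. The service may also transform the raw similarity before reporting it as `@search.score`.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction gives 1.0, orthogonal vectors give 0.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))
```

Because cosine similarity ignores vector length, only the direction of an embedding affects how close two documents are under this metric.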


## Insert text and embeddings into vector store

In [15]:
import json

# Read the JSON file into a list
with open('output.json', 'r') as file:
    documents = json.load(file)
search_client = SearchClient(endpoint=search_service_endpoint, index_name=index_name, credential=credential)
# Define the batch size
batch_size = 250

# Split the documents into batches
batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]

# Upload each batch of documents
for batch in batches:
    result = search_client.upload_documents(batch)
    print(f"Uploaded {len(batch)} documents")

print(f"Uploaded {len(documents)} documents in total")

Uploaded 25000 documents in total
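
The upload cell builds every batch in memory before sending any of them. A possible alternative, sketched below, yields one batch at a time from a generator. The helper `iter_batches` is a hypothetical name introduced here.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Seven items in batches of three: two full batches and one partial batch
for batch in iter_batches(list(range(7)), 3):
    print(batch)
```

In the cell above, `for batch in iter_batches(documents, batch_size):` could stand in for the precomputed `batches` list without changing what is uploaded.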


## Perform a vector similarity search

In [9]:
# Function to generate query embedding
def generate_embeddings(text):
    response = openai.Embedding.create(
        input=text, engine=model)
    embeddings = response['data'][0]['embedding']
    return embeddings

# Pure Vector Search
query = "modern art in Europe"  
  
search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key))  
  
results = search_client.search(  
    search_text="",  
    vector=Vector(value=generate_embeddings(query), k=3, fields="content_vector"),  
    select=["title", "text", "url"] 
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"URL: {result['url']}\n")  

Title: Documenta
Score: 0.8599451
URL: https://simple.wikipedia.org/wiki/Documenta

Title: Museum of Modern Art
Score: 0.85260946
URL: https://simple.wikipedia.org/wiki/Museum%20of%20Modern%20Art

Title: Expressionism
Score: 0.85235393
URL: https://simple.wikipedia.org/wiki/Expressionism



## Perform a Hybrid Search

In [56]:
# Hybrid Search
query = "Famous battles in Scottish history"  
  
search_client = SearchClient(search_service_endpoint, index_name, AzureKeyCredential(search_service_api_key))  
  
results = search_client.search(  
    search_text=query,  
    vector=Vector(value=generate_embeddings(query), k=3, fields="content_vector"),  
    select=["title", "text", "url"],
    top=3
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"URL: {result['url']}\n")  

Title: Wars of Scottish Independence
Score: 0.03131881356239319
URL: https://simple.wikipedia.org/wiki/Wars%20of%20Scottish%20Independence

Title: Battle of Bannockburn
Score: 0.02187500149011612
URL: https://simple.wikipedia.org/wiki/Battle%20of%20Bannockburn

Title: Scottish
Score: 0.01666666753590107
URL: https://simple.wikipedia.org/wiki/Scottish
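
The hybrid scores above are on a much smaller scale than the pure vector scores. Microsoft's documentation describes hybrid ranking in Azure Cognitive Search as Reciprocal Rank Fusion (RRF), which merges the text and vector result lists by rank rather than by raw score. The sketch below shows the general RRF technique only. The constant `k=60`, the 1-based ranks, and the document IDs are assumptions for illustration and are not meant to reproduce the service's exact scores.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list of (id, score).

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. The value of k is an assumed constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from a keyword query and a vector query
text_ranking = ["a", "b", "c"]
vector_ranking = ["a", "c", "d"]

for doc_id, score in reciprocal_rank_fusion([text_ranking, vector_ranking]):
    print(doc_id, round(score, 5))
```

Documents that rank well in both lists accumulate the highest fused score, which is why a result found by both the keyword and the vector query tends to come first.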

