# Azure AI Search Query Patterns
Code samples for various search methods in Azure AI Search, using Azure OpenAI and the Azure AI Search Python client library. Forked from [https://github.com/Azure/azure-search-vector-samples].

## Prerequisites
Configure a Python virtual environment (Python 3.10 or later) in Visual Studio Code:
 1. Open the Command Palette (Ctrl+Shift+P).
 1. Search for Python: Create Environment.
 1. Select Venv or Conda and choose where to create the new environment.
 1. Select the Python interpreter version (3.10 or later).
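
Once the environment is active, a short check along these lines can confirm that the interpreter meets the 3.10 requirement. This is only a sketch; the helper name `meets_minimum_version` is hypothetical and not part of the repository.

```python
import sys

MIN_VERSION = (3, 10)  # minimum version stated in the prerequisites


def meets_minimum_version(version_info=None, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    version_info = sys.version_info if version_info is None else version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_minimum_version() else "too old"
    print(f"Python {current}: {status}")
```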

To set up from the command line instead, create and activate a virtual environment, then install the required packages:

```bash
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows
venv\Scripts\activate

# On macOS/Linux
source venv/bin/activate

pip install -r requirements.txt
```

## Set up your environment
Clone the repository to your local machine.

```bash
git clone https://github.com/hyogrin/Azure_OpenAI_samples.git
```

Create an .env file based on the .env-sample file. Copy the new .env file to the folder containing your notebook and update the variables.
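
Before running the notebook, it can help to verify that every setting is filled in. The sketch below assumes the variable names that the notebook code in this README reads with `os.getenv`; the helper `missing_env_vars` is hypothetical, and your `.env-sample` may define additional variables.

```python
import os

# Variable names read by the notebook code in this README.
REQUIRED_VARS = [
    "AZURE_SEARCH_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_CHAT_API_VERSION",
    "AZURE_OPENAI_CHAT_DEPLOYMENT_NAME",
    "AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME",
]


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not (env.get(name) or "").strip()]


if __name__ == "__main__":
    # Call load_dotenv() first if the values live in a .env file.
    missing = missing_env_vars()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```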

## Import required libraries and environment variables

In [3]:
import os
import json
from openai import AzureOpenAI
import sys
import pandas as pd
import tiktoken
import re
from dotenv import load_dotenv

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.models import VectorizableTextQuery
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SemanticConfiguration,
    SemanticSearch,
    SemanticPrioritizedFields,
    SemanticField,
    SearchIndex
)

from tenacity import retry, wait_random_exponential, stop_after_attempt

In [4]:
# Configure environment variables
load_dotenv()
index_name = "quickdemo"
MAX_RETRIES = 3
search_endpoint = os.getenv("AZURE_SEARCH_ENDPOINT")
key = os.getenv("AZURE_SEARCH_ADMIN_KEY")


client = AzureOpenAI(
    azure_endpoint = os.getenv("AZURE_OPENAI_ENDPOINT"),        # Azure OpenAI base URL
    api_key        = os.getenv("AZURE_OPENAI_API_KEY"),
    api_version    = os.getenv("AZURE_OPENAI_CHAT_API_VERSION"),
    max_retries    = MAX_RETRIES
)

deployment_name = os.getenv("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
deployment_embedding_name = os.getenv("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME")

credential = AzureKeyCredential(key)

## Create embeddings
- Prepare the data for Azure AI Search by reading the source documents, creating Azure OpenAI embeddings, and exporting the result in a format that can be uploaded to an index.

In [8]:
!ls data/text-sample.json -lh

-rw-rw-r-- 1 azureuser azureuser 74K Mar  5 03:22 data/text-sample.json


In [9]:
# Read the sample documents from data/text-sample.json into a DataFrame
df_input_data = pd.read_json(os.path.join(os.getcwd(), 'data/text-sample.json'))

In [10]:
#pd.options.mode.chained_assignment = None #https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters

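# Normalize whitespace and stray punctuation in the input text `s`.
# `sep_token` is kept for signature compatibility but is not used.
def normalize_text(s, sep_token = " \n "):
    # collapse every run of whitespace (including newlines) into one space
    s = re.sub(r'\s+',  ' ', s).strip()
    # remove stray ". ," sequences (the dot is escaped to match a literal period)
    s = re.sub(r"\. ,","",s)
    # collapse doubled periods
    s = s.replace("..",".")
    s = s.replace(". .",".")
    s = s.replace("\n", "")
    s = s.strip()

    return s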

df_input_data['content']= df_input_data["content"].apply(lambda x : normalize_text(x))

- The Azure OpenAI embedding API limits the length of each input, so we count tokens and keep only documents with fewer than 8,192 tokens.

In [11]:
tokenizer = tiktoken.get_encoding("cl100k_base")
df_input_data['n_tokens'] = df_input_data["content"].apply(lambda x: len(tokenizer.encode(x)))
df_input_data = df_input_data[df_input_data.n_tokens<8192]
len(df_input_data)
df_input_data

|  | id | title | content | category | n_tokens |
|---|---|---|---|---|---|
| 0 | 1 | Azure App Service | Azure App Service is a fully managed platform ... | Web | 93 |
| 1 | 2 | Azure Functions | Azure Functions is a serverless compute servic... | Compute | 85 |
| 2 | 3 | Azure Cognitive Services | Azure Cognitive Services are a set of AI servi... | AI + Machine Learning | 91 |
| 3 | 4 | Azure Storage | Azure Storage is a scalable, durable, and high... | Storage | 95 |
| 4 | 5 | Azure SQL Database | Azure SQL Database is a fully managed relation... | Databases | 93 |
| ... | ... | ... | ... | ... | ... |
| 103 | 104 | Azure Site Recovery | Azure Site Recovery is a fully managed, disast... | Management + Governance | 103 |
| 104 | 105 | Azure Web PubSub | Azure Web PubSub is a fully managed, real-time... | Web | 100 |
| 105 | 106 | Azure Data Factory | Azure Data Factory is a cloud-based data integ... | Analytics | 108 |
| 106 | 107 | Azure Data Bricks | Azure Data Bricks is a fully managed, Apache S... | Analytics | 110 |


- After the check, drop the n_tokens column, which is no longer needed. The data is saved for insertion into Azure AI Search once the embeddings have been added.

In [None]:
df_input_data = df_input_data.drop('n_tokens', axis=1)

In [27]:
# Generate embeddings for the title and content fields; also used for query embeddings.
# `model` is the name of your Azure OpenAI embedding deployment.
# Retries up to 6 times with randomized exponential backoff (1 to 20 seconds).
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def get_embedding(text, model=deployment_embedding_name):
    return client.embeddings.create(input=[text], model=model).data[0].embedding

In [None]:
# model should be set to the deployment name you chose when you deployed the text-embedding-ada-002 (Version 2) model
df_input_data['titleVector'] = df_input_data["title"].apply(lambda x: get_embedding(x, model=deployment_embedding_name))
df_input_data['contentVector'] = df_input_data["content"].apply(lambda x: get_embedding(x, model=deployment_embedding_name))
df_input_data
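
The two `apply` calls above send one request per row. The embeddings endpoint also accepts a list of inputs, so batching is one way to reduce round trips. The sketch below shows only the batching logic: `embed_batch` is a hypothetical callable you would supply, `fake_embed_batch` is a stand-in so the example runs offline, and the batch size of 16 is an arbitrary illustrative value.

```python
from typing import Callable, List, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split `items` into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 16,  # arbitrary illustrative value
) -> List[List[float]]:
    """Embed `texts` batch by batch, preserving the input order."""
    vectors: List[List[float]] = []
    for batch in chunked(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedding count does not match batch size")
        vectors.extend(result)
    return vectors


# Stand-in embedder for demonstration only: one number per text.
def fake_embed_batch(batch: List[str]) -> List[List[float]]:
    return [[float(len(text))] for text in batch]


demo_vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(demo_vectors)  # [[1.0], [2.0], [3.0]]
```

With the real client, `embed_batch` could wrap `client.embeddings.create(input=batch, model=deployment_embedding_name)` and return `[item.embedding for item in response.data]`.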

In [29]:
df_input_data.to_csv(os.path.join(os.getcwd(),'data/embedding_input_data.csv'), index=False)
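
One caveat when saving vectors to CSV: the format has no list type, so each vector is written as the string form of a Python list and must be parsed when the file is read back. The sketch below illustrates the round trip with the standard library and an in-memory buffer; the record values are made up.

```python
import csv
import io
import json

# A tiny made-up record shaped like one row of the DataFrame.
row = {"id": "1", "title": "Azure App Service", "titleVector": [0.1, -0.2, 0.3]}

# Writing: the list is stored as its string form, e.g. "[0.1, -0.2, 0.3]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow({**row, "titleVector": str(row["titleVector"])})

# Reading: every CSV cell comes back as a string, so parse vectors explicitly.
buffer.seek(0)
restored = []
for record in csv.DictReader(buffer):
    record["titleVector"] = json.loads(record["titleVector"])
    restored.append(record)

print(type(restored[0]["titleVector"]).__name__, restored[0]["titleVector"])
```

With pandas, the same parsing can be done at load time by passing `converters={"titleVector": json.loads, "contentVector": json.loads}` to `pd.read_csv`.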

## Create your search index
Create your search index schema and vector search configuration:

In [None]:
# Create a search index
index_client = SearchIndexClient(endpoint=search_endpoint, credential=credential)
fields = [
    SearchField(name="id", key=True, type=SearchFieldDataType.String, sortable=True, filterable=True, facetable=False),  
    SearchField(name="title", type=SearchFieldDataType.String, searchable=True),
    SearchField(name="content", type=SearchFieldDataType.String, searchable=True),
    SearchField(name="content_ko", type=SearchFieldDataType.String, searchable=True, analyzer_name="ko.lucene"),
    SearchField(name="category", type=SearchFieldDataType.String, filterable=True),
    SearchField(
        name="titleVector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="myHnswProfile",  # Ensure vector_search_profile is set
    ),
    SearchField(
        name="contentVector",
        type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
        searchable=True,
        vector_search_dimensions=1536,
        vector_search_profile_name="myHnswProfile",  # Ensure vector_search_profile is set
    ),
]
# Configure the vector search configuration  
vector_search = VectorSearch(  
    algorithms=[  
        HnswAlgorithmConfiguration(name="myHnsw"),
    ],  
    profiles=[  
        VectorSearchProfile(  
            name="myHnswProfile",  
            algorithm_configuration_name="myHnsw",  
            vectorizer_name="myOpenAI",  
        )
    ],  
    vectorizers=[  
        AzureOpenAIVectorizer(  
            vectorizer_name="myOpenAI",  
            kind="azureOpenAI",  
            azure_open_ai_parameters=AzureOpenAIVectorizerParameters(  
                resource_uri=os.getenv("AZURE_OPENAI_ENDPOINT"),  
                deployment_id=deployment_embedding_name,
                model_name=deployment_embedding_name,
                api_key=os.getenv("AZURE_OPENAI_API_KEY"),
            ),
        ),  
    ],  
)  


  
semantic_config = SemanticConfiguration(  
    name="my-semantic-config",  
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="title"),
        content_fields=[SemanticField(field_name="content")]  
    ),  
)

# Create the semantic search with the configuration  
semantic_search = SemanticSearch(configurations=[semantic_config])  
  
# Create the search index
index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search, semantic_search=semantic_search)  
result = index_client.create_or_update_index(index)  
print(f"{result.name} created")  

## Insert text and embeddings into vector store
Add texts and metadata from the JSON data to the vector store:

In [None]:
# Convert the DataFrame rows to dictionaries and upload them to the index
documents = df_input_data.to_dict(orient='records')
for doc in documents:
    doc['id'] = str(doc['id'])
search_client = SearchClient(
    endpoint=search_endpoint, index_name=index_name, credential=credential
)
result = search_client.upload_documents(documents)

print(f"Uploaded {len(documents)} documents")
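
The index schema declares `vector_search_dimensions=1536` for both vector fields, so every uploaded vector needs exactly that length. A pre-upload check along the following lines can surface problems early. It is a sketch only: the helper `find_invalid_documents` is hypothetical and the sample documents are made up.

```python
EXPECTED_DIMENSIONS = 1536  # matches vector_search_dimensions in the index schema
VECTOR_FIELDS = ("titleVector", "contentVector")


def find_invalid_documents(documents, vector_fields=VECTOR_FIELDS,
                           expected=EXPECTED_DIMENSIONS):
    """Return (id, field, actual_length) for every vector with the wrong length."""
    problems = []
    for doc in documents:
        for field in vector_fields:
            vector = doc.get(field)
            # A missing or non-list value is reported with length None.
            actual = len(vector) if isinstance(vector, (list, tuple)) else None
            if actual != expected:
                problems.append((doc.get("id"), field, actual))
    return problems


# Made-up documents for demonstration.
sample = [
    {"id": "1", "titleVector": [0.0] * 1536, "contentVector": [0.0] * 1536},
    {"id": "2", "titleVector": [0.0] * 3, "contentVector": [0.0] * 1536},
    {"id": "3", "titleVector": [0.0] * 1536},  # contentVector missing
]
print(find_invalid_documents(sample))
# [('2', 'titleVector', 3), ('3', 'contentVector', None)]
```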

## Perform a vector similarity search

In [None]:
from azure.search.documents.models import VectorizedQuery

# Pure Vector Search
query = "I want to monitor my applications"


search_client = SearchClient(search_endpoint, index_name, credential=credential)
embedding = get_embedding (query, model = deployment_embedding_name)

vector_query = VectorizedQuery(vector=embedding, k_nearest_neighbors=3, fields="contentVector")


results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["title", "content", "category"],
)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}\n")
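
The same four-line printing loop recurs in most of the cells that follow. If you adapt these samples, a small helper could keep the formatting in one place. The function name `format_hit` and the example values are hypothetical; search results behave like dictionaries, so a plain dict is used here.

```python
def format_hit(result, score_key="@search.score"):
    """Render one search result as the four-line block printed in this README."""
    return (
        f"Title: {result['title']}\n"
        f"Score: {result[score_key]}\n"
        f"Content: {result['content']}\n"
        f"Category: {result['category']}\n"
    )


# Made-up result for demonstration; real results come from search_client.search().
example = {
    "title": "Example Service",
    "content": "Placeholder content.",
    "category": "Example Category",
    "@search.score": 0.5,
}
print(format_hit(example))
```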

- The VectorizableTextQuery class lets Azure AI Search embed the query text itself, using the vectorizer configured on the index, so you can find similar documents without generating the query embedding in a separate step.
- The same approach works for queries in other languages.

In [None]:
# Pure Vector Search, multilingual (the Korean query means "I want to monitor my application")
query = "Application을 모니터링하고 싶어"

search_client = SearchClient(search_endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=3, fields="contentVector")

results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["title", "content", "category"],
)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}\n")

- Setting exhaustive=True bypasses the HNSW index and performs an exhaustive KNN search over all vectors in the field. It can be combined with filters, and its results serve as a ground-truth reference when tuning approximate search.

In [None]:
# Exhaustive KNN Vector Search
query = "tools for software development"  
  
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=3, fields="contentVector", exhaustive=True)
  
results = search_client.search(  
    search_text=None,  
    vector_queries= [vector_query],
    select=["title", "content", "category"],
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  
    print(f"Category: {result['category']}\n")  
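
One way to use the exhaustive results as ground truth is to run the same query with and without `exhaustive=True` and measure how many of the exact neighbors the approximate search also returned. The sketch below computes that recall from two lists of document keys. The keys are made up, and the helper `recall_at_k` is hypothetical.

```python
def recall_at_k(approximate, exact):
    """Fraction of the exhaustive (ground-truth) hits that the ANN query also returned."""
    exact_set = set(exact)
    if not exact_set:
        raise ValueError("exact result list is empty")
    return len(exact_set & set(approximate)) / len(exact_set)


# Made-up document keys; in practice collect them from the two result sets,
# e.g. [r["id"] for r in results] with "id" added to `select`.
hnsw_hits = ["12", "7", "40"]
exhaustive_hits = ["12", "40", "3"]
print(f"recall@3 = {recall_at_k(hnsw_hits, exhaustive_hits):.2f}")  # recall@3 = 0.67
```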

## Cross-Field Vector Search

- In addition to contentVector, you can include titleVector in the same query to perform a Cross-Field Vector Search across both vector fields.

In [None]:
# Cross-Field Vector Search
query = "I want to monitor my applications"

search_client = SearchClient(search_endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=3, fields="titleVector, contentVector")

results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    select=["title", "content", "category"],
)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}\n")

## Perform a Multi-Vector Search

- A Multi-Vector Search sends more than one vector query in a single request: for example, one query text against two vector fields, or two query texts against the same vector field.
- For example, the additional queries can be generated through query rewriting.
- Each vector query can carry a weight to adjust its influence on the merged results.

In [None]:
# Multi-Vector Search
query1 = "software development for ds"
# query2 (Korean): "tools for experimenting with data, reviewing statistics, and building models"
query2 = "데이터를 가지고 실험을 하거나 통계를 보고 모델을 제작할 때 사용할 수 있는 도구"

search_client = SearchClient(search_endpoint, index_name, credential=credential)
vector_query1 = VectorizableTextQuery(text=query1, k_nearest_neighbors=3, fields="titleVector, contentVector", weight=2)
vector_query2 = VectorizableTextQuery(text=query2, k_nearest_neighbors=3, fields="titleVector, contentVector", weight=0.5)



results = search_client.search(
    search_text=None,
    vector_queries=[vector_query1, vector_query2],
    select=["title", "content", "category"],
)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}\n")
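
Azure AI Search merges the ranked lists from multiple queries with Reciprocal Rank Fusion (RRF). To build intuition for how weights shift the merged order, here is a minimal sketch of a weighted RRF. The constant `k=60` and the treatment of each weight as a plain multiplier on that query's contribution are assumptions for illustration; the service's actual scoring may differ in detail. The rankings are made up.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, k=60):
    """Fuse ranked lists of document keys; returns (key, score) pairs, best first."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, key in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for the documents it contains.
            scores[key] += weight / (k + rank)
    # Sort by descending score, breaking ties by key for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up rankings for two queries, weighted 2 and 0.5 as in the cell above.
query1_hits = ["A", "B", "C"]
query2_hits = ["C", "D", "A"]
for key, score in weighted_rrf([query1_hits, query2_hits], [2, 0.5]):
    print(f"{key}: {score:.5f}")
```

In this example the heavily weighted first query dominates: "B", which appears only in the first list, still outranks "D", which appears only in the second.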

## Perform a Pure Vector Search with a filter

- You can add filters to your Vector Search to limit your results, especially if your search is broad and you have a lot of data. 

In [None]:
from azure.search.documents.models import VectorFilterMode

# Pure Vector Search with Filter
query = "tools for software development"

search_client = SearchClient(search_endpoint, index_name, credential=credential)
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=3, fields="contentVector")

results = search_client.search(
    search_text=None,
    vector_queries=[vector_query],
    vector_filter_mode=VectorFilterMode.PRE_FILTER,
    filter="category eq 'Developer Tools'",
    select=["title", "content", "category"]
)

for result in results:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}\n")
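
Filter expressions use OData syntax, in which string literals are wrapped in single quotes and an embedded single quote is escaped by doubling it. When the filter value comes from user input, a helper like the following can build the expression safely. This is a sketch; the function names are hypothetical and the second category value is made up.

```python
def odata_string_literal(value: str) -> str:
    """Quote a string for an OData filter, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def eq_filter(field: str, value: str) -> str:
    """Build a `field eq 'value'` filter expression."""
    return f"{field} eq {odata_string_literal(value)}"


print(eq_filter("category", "Developer Tools"))  # category eq 'Developer Tools'
print(eq_filter("category", "Dev's Tools"))      # category eq 'Dev''s Tools'
```

The result can be passed as the `filter` argument of `search_client.search`, as in the cell above.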

## Perform a Hybrid Search

- A Hybrid Search combines a keyword (full-text) query, passed as search_text, with a vector query in the same request.

In [None]:
# Hybrid Search
query = "integration"

search_client = SearchClient(search_endpoint, index_name, AzureKeyCredential(key))
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=3, fields="contentVector")

vector_query_result = search_client.search(
    search_text=None, vector_queries=[vector_query], select=["title", "content", "category"], top=3)

hybrid_query_result = search_client.search(
    search_text=query, vector_queries=[vector_query], select=["title", "content", "category"], top=3)

print("----------------------------vector search---------------------------------")

for result in vector_query_result:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}\n")

print("----------------------------hybrid search---------------------------------")

for result in hybrid_query_result:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}\n")    

- You can adjust the results by weighting the vector query.

In [None]:
# Hybrid Search with a weighted vector query
query = "integration"

search_client = SearchClient(search_endpoint, index_name, AzureKeyCredential(key))
vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=3, fields="contentVector", weight=0.2)

vector_query_result = search_client.search(
    search_text=None, vector_queries=[vector_query], select=["title", "content", "category"], top=3)

hybrid_query_result = search_client.search(
    search_text=query, vector_queries=[vector_query], select=["title", "content", "category"], top=3)

print("----------------------------vector search---------------------------------")

for result in vector_query_result:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}\n")

print("----------------------------hybrid search---------------------------------")

for result in hybrid_query_result:
    print(f"Title: {result['title']}")
    print(f"Score: {result['@search.score']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}\n")    

## Perform a Hybrid Search with Semantic Reranking

- You can apply semantic reranking to the results of a hybrid search. The semantic ranker rescores the top results and can also return extractive captions and answers.

In [None]:
from azure.search.documents.models import QueryType, QueryCaptionType, QueryAnswerType

# Semantic Hybrid Search
query = "what is azure search?"

vector_query = VectorizableTextQuery(text=query, k_nearest_neighbors=3, fields="contentVector", exhaustive=True)

results = search_client.search(  
    search_text=query,  
    vector_queries=[vector_query],
    select=["title", "content", "category"],
    query_type=QueryType.SEMANTIC,
    semantic_configuration_name='my-semantic-config',
    query_caption=QueryCaptionType.EXTRACTIVE,
    query_answer=QueryAnswerType.EXTRACTIVE,
    top=3
)

semantic_answers = results.get_answers()
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"Title: {result['title']}")
    print(f"Reranker Score: {result['@search.reranker_score']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}")

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")
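
Reranker scores are documented as ranging from 0 to 4, with higher values indicating greater relevance, so one possible follow-up is to discard weakly matching results before showing them or passing them to a chat model. The sketch below assumes results behave like dictionaries keyed by `@search.reranker_score`, as in the cell above. The threshold of 2.0, the helper name `keep_confident`, and the sample results are illustrative assumptions.

```python
def keep_confident(results, min_score=2.0, key="@search.reranker_score"):
    """Keep results whose reranker score is present and at least `min_score`."""
    return [r for r in results if r.get(key) is not None and r[key] >= min_score]


# Made-up results for demonstration.
sample_results = [
    {"title": "Doc A", "@search.reranker_score": 3.1},
    {"title": "Doc B", "@search.reranker_score": 1.4},
    {"title": "Doc C", "@search.reranker_score": None},  # semantic ranking not applied
]
print([r["title"] for r in keep_confident(sample_results)])  # ['Doc A']
```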
