# Azure Cognitive Search Vector Search Code Sample with Azure OpenAI
This notebook demonstrates how to use Azure Cognitive Search vector search with Azure OpenAI embeddings through the Azure Python SDK.
## Prerequisites
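To run the code, install the following packages. Vector search support requires a pre-release version of the Azure SDK: `pip install azure-search-documents --pre`. The code also uses the pre-1.0 `openai` interface (`openai.Embedding.create`), so the install below pins `openai` below 1.0.

In [None]:
! pip install azure-search-documents --pre
! pip install "openai<1"
! pip install python-dotenv tenacity
! pip install langchain tiktoken pandas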

## Import required libraries and environment variables

In [2]:
# Import required libraries  
import os  
import json  
import openai  
from dotenv import load_dotenv  
from tenacity import retry, wait_random_exponential, stop_after_attempt  
from azure.core.credentials import AzureKeyCredential  
from azure.search.documents import SearchClient  
from azure.search.documents.indexes import SearchIndexClient  
from azure.search.documents.models import Vector  
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchField,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
    SemanticConfiguration,
    PrioritizedFields,
    SemanticField,
    SemanticSettings,
    VectorSearch,
    HnswVectorSearchAlgorithmConfiguration,
)
  
# Configure environment variables  
load_dotenv()  
service_endpoint = os.getenv("AZURE_SEARCH_SERVICE_ENDPOINT") 
index_name = os.getenv("AZURE_SEARCH_INDEX_NAME") 
key = os.getenv("AZURE_SEARCH_ADMIN_KEY") 
openai.api_type = "azure"  
openai.api_key = os.getenv("AZURE_OPENAI_API_KEY")  
openai.api_base = os.getenv("AZURE_OPENAI_ENDPOINT")  
openai.api_version = os.getenv("AZURE_OPENAI_API_VERSION")  
credential = AzureKeyCredential(key)
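
When one of the environment variables above is not set, `os.getenv` returns `None`, and the resulting error tends to surface only later, inside a client call. A minimal sketch of an up-front check follows; the helper name `missing_settings` is hypothetical and not part of any SDK, and the variable names are the ones read in the cell above.

```python
import os

# Variable names taken from the configuration cell above.
REQUIRED_VARS = [
    "AZURE_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_INDEX_NAME",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
]


def missing_settings(env=None):
    """Return the required variable names that are unset or empty (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings: " + ", ".join(missing))
```

Running this right after `load_dotenv()` would list every missing setting at once instead of failing on the first one.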

## Create embeddings - with json file
Read your data, generate OpenAI embeddings, and export the result in a format that can be uploaded to your Azure Cognitive Search index:

In [2]:
# Generate Document Embeddings using OpenAI Ada 002

# Read the text-sample.json
with open('../data/text-sample.json', 'r', encoding='utf-8') as file:
    input_data = json.load(file)

# Function to generate embeddings for title and content fields, also used for query embeddings
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def generate_embeddings(text):
    response = openai.Embedding.create(
        input=text, engine="text-embedding-ada-002")
    embeddings = response['data'][0]['embedding']
    return embeddings


# Generate embeddings for title and content fields
for item in input_data:
    title = item['title']
    content = item['content']
    title_embeddings = generate_embeddings(title)
    content_embeddings = generate_embeddings(content)
    item['titleVector'] = title_embeddings
    item['contentVector'] = content_embeddings

# Output embeddings to docVectors.json file
with open("../output/docVectors.json", "w") as f:
    json.dump(input_data, f)
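
The index schema in this notebook declares both vector fields with 1536 dimensions, so a document whose vector is missing or has a different length is likely to be rejected at upload time. The following is a small sketch of a pre-upload check, not part of the original notebook; `find_bad_vectors` is a hypothetical helper, and the sample documents use 3-dimensional placeholder vectors rather than real embeddings.

```python
EXPECTED_DIM = 1536  # dimension declared for both vector fields in the index schema


def find_bad_vectors(documents, fields=("titleVector", "contentVector"), dim=EXPECTED_DIM):
    """Return (document position, field name) pairs whose vector is missing or has the wrong length."""
    problems = []
    for position, doc in enumerate(documents):
        for field in fields:
            vector = doc.get(field)
            if not isinstance(vector, list) or len(vector) != dim:
                problems.append((position, field))
    return problems


# Tiny illustrative documents with 3-dimensional vectors (not real embeddings).
sample = [
    {"titleVector": [0.1, 0.2, 0.3], "contentVector": [0.0, 0.0, 0.0]},
    {"titleVector": [0.1, 0.2], "contentVector": [0.0, 0.0, 0.0]},
]
print(find_bad_vectors(sample, dim=3))  # [(1, 'titleVector')]
```

Calling `find_bad_vectors(input_data)` before writing `docVectors.json` would flag any item whose embedding call returned something unexpected.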

## Create dataframe with embeddings from txt files in a directory
Read your txt files into a dataframe that can be used to load the index:

In [8]:
import uuid
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken
import pandas as pd

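# Function to generate embeddings for title and content fields, also used for query embeddings
# (same retry policy as the earlier definition)
@retry(wait=wait_random_exponential(min=1, max=20), stop=stop_after_attempt(6))
def generate_embeddings(text):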
    response = openai.Embedding.create(
        input=text, engine="text-embedding-ada-002")
    embeddings = response['data'][0]['embedding']
    return embeddings

# Generate a random UUID string to use as the document key
def get_a_uuid():
    return str(uuid.uuid4())

# Load the tokenizer used to measure text length in tokens
tokenizer = tiktoken.get_encoding("cl100k_base")

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(text, disallowed_special=())
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
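    # chunk_size is measured in tokens (see tiktoken_len). Keep it below the embedding model's
    # input limit (8,191 tokens for text-embedding-ada-002) and consider the context window of
    # the chat model that will read the chunks; 8k is reasonable for the 16k GPT models.
    chunk_size=8000,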
    chunk_overlap=200,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""],
)

# Function to return the number of tokens in a string
def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    token_integers = encoding.encode(string)
    num_tokens = len(token_integers)

    return num_tokens

# Open every .txt file in the directory and split its contents into chunks with the text splitter.
# Splitting is needed because the embedding model limits the number of input tokens.
# Each chunk becomes one row: id, title, content, content_tokens, category, titleVector, contentVector.

directory = "../data/txt/challenger_customer/"
txt = []

for filename in os.listdir(directory):
    if filename.endswith(".txt"):
        title = filename[:-4]  # remove the .txt extension from the filename and use this as the title
        title_vector = generate_embeddings(title)  # embed the title once per file
        # assumes the text files are UTF-8 encoded
        with open(os.path.join(directory, filename), "r", encoding="utf-8") as f:
            text = f.read()
        for doc in text_splitter.create_documents([text]):
            txt.append({
                "id": get_a_uuid(),  # generate a random uuid for the document
                "title": title,
                "content": doc.page_content,
                "content_tokens": num_tokens_from_string(doc.page_content, "cl100k_base"),
                "category": "Challenger Customer",
                "titleVector": title_vector,
                "contentVector": generate_embeddings(doc.page_content),
            })

df = pd.DataFrame(txt)
df

Unnamed: 0,id,title,content,content_tokens,category,titleVector,contentVector
0,b077ee1e-70a1-48fb-877d-c56b6c665fa8,Building Commercial Insight The Power of Menta...,Chapter 4. Building commercial insight. In wor...,7226,Challenger Customer,"[-0.011274265125393867, 0.005391454789787531, ...","[-0.005733810365200043, 0.025400370359420776, ..."
1,dd801ad1-b682-486b-92d2-5eb181c9c6e1,Commercial Insight in Action A Case Study of X...,Chapter 5. Commercial insight in action. Case...,5097,Challenger Customer,"[-0.0017058451194316149, 0.0035176645033061504...","[0.0021748587023466825, 0.017426013946533203, ..."
2,78f594eb-489f-4df2-906f-1962d2b8d084,Focusing on Sales Strategies,Chapter One. The dark side of customer consens...,8000,Challenger Customer,"[-0.04257185384631157, -0.006318849511444569, ...","[-0.013553818687796593, -0.006251673214137554,..."
3,6dde5e98-6a33-4fc0-81a5-c5d37516c3b3,Focusing on Sales Strategies,"for customers New World Buying dysfunction, le...",5239,Challenger Customer,"[-0.04257185384631157, -0.006318849511444569, ...","[-0.005606968887150288, -0.00190691277384758, ..."
4,882c038b-f9a1-461c-8976-fd656adf1a29,Making Collective Learning Happen,Chapter 9. Making collective learning happen. ...,7730,Challenger Customer,"[-0.028264524415135384, 0.0051315901800990105,...","[-0.013146437704563141, -0.0011667633662000299..."
5,474727be-59e9-4b02-9c69-df301e889778,Shifting to a Challenger Commercial Model - Im...,Chapter 10. Shifting to a Challenger commercia...,8000,Challenger Customer,"[0.0013302754377946258, -0.012963037937879562,...","[-0.028557902202010155, -0.0024416313972324133..."
6,fc089783-37d9-4eb2-8585-9eeca9e4d180,Shifting to a Challenger Commercial Model - Im...,social selling happening in exactly this way. ...,3431,Challenger Customer,"[0.0013302754377946258, -0.012963037937879562,...","[-0.03337668627500534, -0.008122767321765423, ..."
7,c6c9e6f5-9c8c-43aa-b95a-8cdbb60a0655,Summarizing Sales Methodology Books The Mobilizer,"Chapter 2. The mobilizer. In many ways, the q...",7999,Challenger Customer,"[-0.029752077534794807, -0.007925529964268208,...","[-0.02085736021399498, -0.0066637154668569565,..."
8,54bd5ab4-5baa-428d-a140-a21525c8a5ea,Summarizing Sales Methodology Books The Mobilizer,learned is you're mobilizer may not fully know...,692,Challenger Customer,"[-0.029752077534794807, -0.007925529964268208,...","[-0.028304407373070717, -0.009351048618555069,..."
9,70107de5-45e5-412b-86b1-6d3f4c9b20bf,Teaching Mobilizers Where They Learn,Chapter 6. Teaching mobilizers where they lear...,6046,Challenger Customer,"[-0.02905644103884697, -0.003611951367929578, ...","[-0.014422357082366943, -0.003622485091909766,..."
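
The rows above show files split into several chunks, each capped at 8,000 tokens. To make the idea of fixed-size chunks with overlap concrete, here is a deliberately simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter breaks on the configured separators and merges the pieces); it only illustrates a sliding window over a token list. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens with the previous window."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("chunk_size must be positive and overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Integers stand in for token ids.
print(chunk_with_overlap(list(range(10)), chunk_size=4, overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

The overlap means the last token of one chunk is repeated at the start of the next, which helps keep sentences that straddle a boundary retrievable from either chunk.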


## Create your search index
Create your search index schema and vector search configuration:

In [9]:
# Create a search index
index_client = SearchIndexClient(
    endpoint=service_endpoint, credential=credential)
fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True),
    SearchableField(name="title", type=SearchFieldDataType.String),
    SearchableField(name="content", type=SearchFieldDataType.String),
    SearchableField(name="category", type=SearchFieldDataType.String,
                    filterable=True),
    SearchField(name="titleVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=1536, vector_search_configuration="my-vector-config"),
    SearchField(name="contentVector", type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True, vector_search_dimensions=1536, vector_search_configuration="my-vector-config"),
]

vector_search = VectorSearch(
    algorithm_configurations=[
        HnswVectorSearchAlgorithmConfiguration(
            name="my-vector-config",
            kind="hnsw",
            parameters={
                "m": 4,
                "efConstruction": 400,
                "efSearch": 500,
                "metric": "cosine"
            }
        )
    ]
)

semantic_config = SemanticConfiguration(
    name="my-semantic-config",
    prioritized_fields=PrioritizedFields(
        title_field=SemanticField(field_name="title"),
        prioritized_keywords_fields=[SemanticField(field_name="category")],
        prioritized_content_fields=[SemanticField(field_name="content")]
    )
)

# Create the semantic settings with the configuration
semantic_settings = SemanticSettings(configurations=[semantic_config])

# Create the search index with the semantic settings
index = SearchIndex(name=index_name, fields=fields,
                    vector_search=vector_search, semantic_settings=semantic_settings)
result = index_client.create_or_update_index(index)
print(f' {result.name} created')


 shadow-vector-search created
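
Each vector field above is a `Collection(Edm.Single)` with 1536 dimensions, so every document carries two arrays of 32-bit floats. A rough back-of-the-envelope sketch of the raw vector payload follows. It ignores the HNSW graph and any other index overhead, so treat it as a lower bound rather than a measurement; `raw_vector_bytes` is a hypothetical helper and the document count is illustrative.

```python
BYTES_PER_SINGLE = 4  # a Single is a 32-bit float


def raw_vector_bytes(num_docs, dimensions=1536, vector_fields=2):
    """Raw size in bytes of the vector values alone, excluding any index overhead."""
    return num_docs * dimensions * vector_fields * BYTES_PER_SINGLE


# Illustrative count of 1,000 documents with titleVector and contentVector.
print(raw_vector_bytes(1000))  # 12288000
```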


## Insert text and embeddings into vector store from json file
Add texts and metadata from the JSON data to the vector store:

In [5]:
# Upload some documents to the index
with open('../output/docVectors.json', 'r') as file:  
    documents = json.load(file)  
search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)
result = search_client.upload_documents(documents)  
print(f"Uploaded {len(documents)} documents") 

Uploaded 108 documents


## Insert text and embeddings into vector store from data frame
Create a generator called `sections` that yields one upload-ready document per dataframe row:

In [10]:
# Yield one document per dataframe row for upload to the index

def create_sections(df):
    for index, row in df.iterrows():
        yield {
            "id": row["id"],
            "title": row["title"],
            "content": row["content"],
            "category": row["category"],
            "titleVector": row["titleVector"],
            "contentVector": row["contentVector"],
            "@search.action": "upload",
        }
        
sections = create_sections(df)

## Load the sections list into the index
Loop through `sections` and upload each item to the index as a document, in batches of up to 1,000:

In [11]:
def index_sections(sections):
    print(
        f"Indexing sections into search index '{index_name}'"
    )

    search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=credential)

    i = 0
    batch = []
    for s in sections:
        batch.append(s)
        i += 1
        if i % 1000 == 0:
            results = search_client.upload_documents(documents=batch)
            succeeded = sum([1 for r in results if r.succeeded])
            print(f"\tIndexed {len(results)} sections, {succeeded} succeeded")
            batch = []

    if len(batch) > 0:
        results = search_client.upload_documents(documents=batch)
        succeeded = sum([1 for r in results if r.succeeded])
        print(f"\tIndexed {len(results)} sections, {succeeded} succeeded")
        
index_sections(sections)

Indexing sections into search index 'shadow-vector-search'
	Indexed 16 sections, 16 succeeded
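
The batching logic in `index_sections` can also be written as a small reusable generator. This is an alternative sketch, not the notebook's code; `batched` here is a locally defined helper built on `itertools.islice`, and it works with one-shot iterables such as the `sections` generator.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable, including generators."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


print(list(batched(range(5), 2)))  # [[0, 1], [2, 3], [4]]
```

With this helper the upload loop reduces to `for batch in batched(sections, 1000): search_client.upload_documents(documents=batch)`, and the final partial batch needs no special case.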


## Perform a vector similarity search

In [12]:
# Pure Vector Search
query = "challenger sales model"  
  
search_client = SearchClient(service_endpoint, index_name, credential=credential)
vector = Vector(value=generate_embeddings(query), k=3, fields="contentVector")
  
results = search_client.search(  
    search_text=None,  
    vectors= [vector],
    select=["title", "content", "category"],
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  
    print(f"Category: {result['category']}\n")  


Title: Shifting to a Challenger Commercial Model - Implications and Implementation Lessons
Score: 0.8590963
Content: Chapter 10. Shifting to a Challenger commercial model - implications and implementation lessons.  To win today, you need a challenger inside the customer organization. That is the central premise of this entire book. It turns out the far bigger story isn't about suppliers struggle to sell solutions, It's the customers struggle to buy them. Arming mobilizers with world class commercial insights and supporting their efforts to rally consensus for that insight and inherently then for your solution, requires a new go to market strategy. The implications of such a significant shift in commercial strategy are numerous. Sales and marketing must work together, bound by a new common language of disruption. Seller skills, something we covered in depth in the Challenger sale, must be reconsidered. Marketing content must be atomized and lead customers to a clear narrative that discl
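
The `Score` printed for a pure vector query is not the cosine similarity itself. As an assumption based on Azure's vector ranking documentation, with the `cosine` metric the score is `1 / (1 + distance)`, where `distance = 1 - cosine_similarity`. The sketch below converts between the two under that assumption; both function names are hypothetical.

```python
def cosine_to_score(cosine_similarity):
    """Assumed mapping: score = 1 / (1 + (1 - cosine_similarity))."""
    return 1.0 / (2.0 - cosine_similarity)


def score_to_cosine(score):
    """Inverse of the assumed mapping above."""
    return 2.0 - 1.0 / score


# Score of the top result in the pure vector search above.
print(round(score_to_cosine(0.8590963), 3))
```

Under this mapping, scores fall between roughly 0.333 (opposite vectors) and 1.0 (identical direction).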

In [13]:
# Pure Vector Search multi-lingual (e.g. 'challenger sales model' in French)
query = "Modèle de ventes Challenger"  
  
search_client = SearchClient(service_endpoint, index_name, credential=credential)
vector = Vector(value=generate_embeddings(query), k=3, fields="contentVector")  
  
results = search_client.search(  
    search_text=None,  
    vectors=[vector],
    select=["title", "content", "category"],
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  
    print(f"Category: {result['category']}\n")  


Title: Shifting to a Challenger Commercial Model - Implications and Implementation Lessons
Score: 0.80100083
Content: Chapter 10. Shifting to a Challenger commercial model - implications and implementation lessons.  To win today, you need a challenger inside the customer organization. That is the central premise of this entire book. It turns out the far bigger story isn't about suppliers struggle to sell solutions, It's the customers struggle to buy them. Arming mobilizers with world class commercial insights and supporting their efforts to rally consensus for that insight and inherently then for your solution, requires a new go to market strategy. The implications of such a significant shift in commercial strategy are numerous. Sales and marketing must work together, bound by a new common language of disruption. Seller skills, something we covered in depth in the Challenger sale, must be reconsidered. Marketing content must be atomized and lead customers to a clear narrative that disc

## Perform a Cross-Field Vector Search

In [14]:
# Cross-Field Vector Search
query = "challenger sales model"  
  
search_client = SearchClient(service_endpoint, index_name, credential=credential)  
vector = Vector(value=generate_embeddings(query), k=3, fields="titleVector, contentVector")  
  
results = search_client.search(  
    search_text=None,  
    vectors=[vector],
    select=["title", "content", "category"],
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  
    print(f"Category: {result['category']}\n")  


Title: Shifting to a Challenger Commercial Model - Implications and Implementation Lessons
Score: 0.03306011110544205
Content: Chapter 10. Shifting to a Challenger commercial model - implications and implementation lessons.  To win today, you need a challenger inside the customer organization. That is the central premise of this entire book. It turns out the far bigger story isn't about suppliers struggle to sell solutions, It's the customers struggle to buy them. Arming mobilizers with world class commercial insights and supporting their efforts to rally consensus for that insight and inherently then for your solution, requires a new go to market strategy. The implications of such a significant shift in commercial strategy are numerous. Sales and marketing must work together, bound by a new common language of disruption. Seller skills, something we covered in depth in the Challenger sale, must be reconsidered. Marketing content must be atomized and lead customers to a clear narrative 

## Perform a Multi-Vector Search

In [15]:
# Multi-Vector Search
query = "challenger sales model"  
  
search_client = SearchClient(service_endpoint, index_name, credential=credential)  
vector1 = Vector(value=generate_embeddings(query), k=3, fields="titleVector")  
vector2 = Vector(value=generate_embeddings(query), k=3, fields="contentVector")  
  
results = search_client.search(  
    search_text=None,  
    vectors=[vector1, vector2],
    select=["title", "content", "category"],
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  
    print(f"Category: {result['category']}\n")  


Title: Shifting to a Challenger Commercial Model - Implications and Implementation Lessons
Score: 0.03306011110544205
Content: Chapter 10. Shifting to a Challenger commercial model - implications and implementation lessons.  To win today, you need a challenger inside the customer organization. That is the central premise of this entire book. It turns out the far bigger story isn't about suppliers struggle to sell solutions, It's the customers struggle to buy them. Arming mobilizers with world class commercial insights and supporting their efforts to rally consensus for that insight and inherently then for your solution, requires a new go to market strategy. The implications of such a significant shift in commercial strategy are numerous. Sales and marketing must work together, bound by a new common language of disruption. Seller skills, something we covered in depth in the Challenger sale, must be reconsidered. Marketing content must be atomized and lead customers to a clear narrative 
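
The score here (about 0.0331) is much smaller than the pure vector scores because results from several queries are merged by rank rather than by raw similarity. Azure Cognitive Search documents Reciprocal Rank Fusion (RRF) for this step. The sketch below is a generic RRF implementation, not the service's code: the constant `k=60` and the 0-based rank are assumptions, chosen because they reproduce the score shown above (`1/60 + 1/61`). The document ids are hypothetical.

```python
def rrf_scores(rankings, k=60):
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion (0-based ranks, assumed k)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Hypothetical rankings from the titleVector and contentVector queries.
title_ranking = ["doc-a", "doc-b", "doc-c"]
content_ranking = ["doc-b", "doc-a", "doc-d"]
fused = rrf_scores([title_ranking, content_ranking])
print(fused["doc-a"])
```

A document ranked first by one query and second by the other gets `1/60 + 1/61`, which is approximately 0.03306; a document ranked first by both would get `2/60`, approximately 0.03333.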

## Perform a Pure Vector Search with a filter

In [16]:
# Pure Vector Search with Filter
query = "tools for software development"  
  
search_client = SearchClient(service_endpoint, index_name, credential=credential)  
vector = Vector(value=generate_embeddings(query), k=3, fields="contentVector")  

results = search_client.search(  
    search_text=None,  
    vectors=[vector],
    filter="category eq 'Challenger Customer'",
    select=["title", "content", "category"]
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  
    print(f"Category: {result['category']}\n")  


Title: Making Collective Learning Happen
Score: 0.7986147
Category: Challenger Customer

Title: Shifting to a Challenger Commercial Model - Implications and Implementation Lessons
Score: 0.79488087
Content: social selling happening in exactly this way. No one has any patience for posts like. Just wanted to let you know our new XP Nine 100 is out next month pre-order today. We've all been on the receiving end of this kind of message on LinkedIn. That's not what this channel is for. Rather, it's about actively engaging in a productive, interesting conversation. Ideally, conversation that teaches. It fits right into the spark introduced Confront model from Chapter 6. Social watering holes are the ideal place to spark a target audience into exploring your ideas by sharing surprising data, insights and provocative viewpoints. Because commercial insights are by definition not about you as a supplier, but about customers, they are much less likely to be rejected on the grounds of being commer
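
The `filter` argument is an OData expression, and OData string literals are wrapped in single quotes, with any embedded single quote escaped by doubling it. When the filter value comes from user input, building the expression by hand is error-prone. A minimal sketch of a helper follows; the names `odata_string` and `eq_filter` are hypothetical.

```python
def odata_string(value):
    """Quote a Python string as an OData string literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def eq_filter(field, value):
    """Build a simple `field eq 'value'` filter expression."""
    return f"{field} eq {odata_string(value)}"


print(eq_filter("category", "Challenger Customer"))  # category eq 'Challenger Customer'
```

The result can be passed directly as `filter=eq_filter("category", "Challenger Customer")` in the search call above.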

## Perform a Hybrid Search

In [17]:
# Hybrid Search
query = "challenger sales model"  
  
search_client = SearchClient(service_endpoint, index_name, AzureKeyCredential(key))  
vector = Vector(value=generate_embeddings(query), k=3, fields="contentVector")  

results = search_client.search(  
    search_text=query,  
    vectors=[vector],
    select=["title", "content", "category"],
    top=3
)  
  
for result in results:  
    print(f"Title: {result['title']}")  
    print(f"Score: {result['@search.score']}")  
    print(f"Content: {result['content']}")  
    print(f"Category: {result['category']}\n")  


Title: Shifting to a Challenger Commercial Model - Implications and Implementation Lessons
Score: 0.03333333507180214
Content: Chapter 10. Shifting to a Challenger commercial model - implications and implementation lessons.  To win today, you need a challenger inside the customer organization. That is the central premise of this entire book. It turns out the far bigger story isn't about suppliers struggle to sell solutions, It's the customers struggle to buy them. Arming mobilizers with world class commercial insights and supporting their efforts to rally consensus for that insight and inherently then for your solution, requires a new go to market strategy. The implications of such a significant shift in commercial strategy are numerous. Sales and marketing must work together, bound by a new common language of disruption. Seller skills, something we covered in depth in the Challenger sale, must be reconsidered. Marketing content must be atomized and lead customers to a clear narrative 

## Perform a Semantic Hybrid Search

In [19]:
# Semantic Hybrid Search
query = "what is the challenger sales model?"

search_client = SearchClient(service_endpoint, index_name, AzureKeyCredential(key))
vector = Vector(value=generate_embeddings(query), k=3, fields="contentVector")  

results = search_client.search(  
    search_text=query,  
    vectors=[vector],
    select=["title", "content", "category"],
    query_type="semantic",
    query_language="en-us",
    semantic_configuration_name="my-semantic-config",
    query_caption="extractive",
    query_answer="extractive",
    top=3
)

semantic_answers = results.get_answers()
for answer in semantic_answers:
    if answer.highlights:
        print(f"Semantic Answer: {answer.highlights}")
    else:
        print(f"Semantic Answer: {answer.text}")
    print(f"Semantic Answer Score: {answer.score}\n")

for result in results:
    print(f"Title: {result['title']}")
    print(f"Content: {result['content']}")
    print(f"Category: {result['category']}")

    captions = result["@search.captions"]
    if captions:
        caption = captions[0]
        if caption.highlights:
            print(f"Caption: {caption.highlights}\n")
        else:
            print(f"Caption: {caption.text}\n")


Title: Shifting to a Challenger Commercial Model - Implications and Implementation Lessons
Content: Chapter 10. Shifting to a Challenger commercial model - implications and implementation lessons.  To win today, you need a challenger inside the customer organization. That is the central premise of this entire book. It turns out the far bigger story isn't about suppliers struggle to sell solutions, It's the customers struggle to buy them. Arming mobilizers with world class commercial insights and supporting their efforts to rally consensus for that insight and inherently then for your solution, requires a new go to market strategy. The implications of such a significant shift in commercial strategy are numerous. Sales and marketing must work together, bound by a new common language of disruption. Seller skills, something we covered in depth in the Challenger sale, must be reconsidered. Marketing content must be atomized and lead customers to a clear narrative that discloses what they've